10 minutes to Mars DataFrame

This is a short introduction to Mars DataFrame, adapted from 10 minutes to pandas.

Unless stated otherwise, the examples assume the following imports:

In [1]: import mars.tensor as mt

In [2]: import mars.dataframe as md

Object creation

Create a Series by passing a list of values, letting Mars DataFrame create a default integer index:

In [3]: s = md.Series([1, 3, 5, mt.nan, 6, 8])

In [4]: s.execute()
Out[4]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Create a DataFrame by passing a Mars tensor, with a datetime index and labeled columns:

In [5]: dates = md.date_range('20130101', periods=6)

In [6]: dates.execute()
Out[6]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [7]: df = md.DataFrame(mt.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [8]: df.execute()
Out[8]: 
                   A         B         C         D
2013-01-01  0.572153 -0.881939  0.004131 -0.854504
2013-01-02  0.445309 -0.346412  0.801229 -0.058830
2013-01-03  0.889143 -1.011239  0.167834 -1.136507
2013-01-04  1.391901  0.286807  1.100873 -1.599418
2013-01-05 -2.934178  0.274940 -1.153236 -0.226924
2013-01-06  0.381820 -0.095590  0.198525  0.139728

Create a DataFrame by passing a dict whose values can be converted to series-like objects:

In [9]: df2 = md.DataFrame({'A': 1.,
   ...:                     'B': md.Timestamp('20130102'),
   ...:                     'C': md.Series(1, index=list(range(4)), dtype='float32'),
   ...:                     'D': mt.array([3] * 4, dtype='int32'),
   ...:                     'E': 'foo'})
   ...: 

In [10]: df2.execute()
Out[10]: 
     A          B    C  D    E
0  1.0 2013-01-02  1.0  3  foo
1  1.0 2013-01-02  1.0  3  foo
2  1.0 2013-01-02  1.0  3  foo
3  1.0 2013-01-02  1.0  3  foo

The columns of the resulting DataFrame have different dtypes:

In [11]: df2.dtypes
Out[11]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
dtype: object

Viewing data

Here is how to view the top and bottom rows of a DataFrame:

In [12]: df.head().execute()
Out[12]: 
                   A         B         C         D
2013-01-01  0.572153 -0.881939  0.004131 -0.854504
2013-01-02  0.445309 -0.346412  0.801229 -0.058830
2013-01-03  0.889143 -1.011239  0.167834 -1.136507
2013-01-04  1.391901  0.286807  1.100873 -1.599418
2013-01-05 -2.934178  0.274940 -1.153236 -0.226924

In [13]: df.tail(3).execute()
Out[13]: 
                   A         B         C         D
2013-01-04  1.391901  0.286807  1.100873 -1.599418
2013-01-05 -2.934178  0.274940 -1.153236 -0.226924
2013-01-06  0.381820 -0.095590  0.198525  0.139728

Display the index and the columns:

In [14]: df.index.execute()
Out[14]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [15]: df.columns.execute()
Out[15]: Index(['A', 'B', 'C', 'D'], dtype='object')

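DataFrame.to_tensor() converts the data in a DataFrame to a Mars tensor. Note that this can be an expensive operation when the columns of the DataFrame have different dtypes, which comes down to a fundamental difference between a DataFrame and a tensor: a tensor has one dtype for the entire tensor, while a DataFrame has one dtype per column. When you call DataFrame.to_tensor(), Mars DataFrame finds a dtype that can hold every value in the DataFrame. This may end up being object, which requires casting every value to a Python object.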

For df above, all values are floating-point numbers, so DataFrame.to_tensor() is fast and does not need to copy data.

In [16]: df.to_tensor().execute()
Out[16]: 
array([[ 0.57215282, -0.88193916,  0.00413138, -0.85450354],
       [ 0.44530857, -0.3464121 ,  0.80122904, -0.05883041],
       [ 0.8891429 , -1.01123939,  0.16783413, -1.136507  ],
       [ 1.39190145,  0.28680691,  1.10087264, -1.59941828],
       [-2.93417751,  0.27494016, -1.15323625, -0.22692437],
       [ 0.38182028, -0.09559009,  0.1985253 ,  0.13972849]])

For df2, whose columns have several different dtypes, DataFrame.to_tensor() is relatively expensive.

In [17]: df2.to_tensor().execute()
Out[17]: 
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo']],
      dtype=object)

Note

DataFrame.to_tensor() does not include the index or column labels in the output.

describe() shows a quick statistical summary of your data:

In [18]: df.describe().execute()
Out[18]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.124358 -0.295572  0.186559 -0.622743
std    1.543763  0.559148  0.780079  0.682289
min   -2.934178 -1.011239 -1.153236 -1.599418
25%    0.397692 -0.748057  0.045057 -1.066006
50%    0.508731 -0.221001  0.183180 -0.540714
75%    0.809895  0.182308  0.650553 -0.100854
max    1.391901  0.286807  1.100873  0.139728

Sort by an axis:

In [19]: df.sort_index(axis=1, ascending=False).execute()
Out[19]: 
                   D         C         B         A
2013-01-01 -0.854504  0.004131 -0.881939  0.572153
2013-01-02 -0.058830  0.801229 -0.346412  0.445309
2013-01-03 -1.136507  0.167834 -1.011239  0.889143
2013-01-04 -1.599418  1.100873  0.286807  1.391901
2013-01-05 -0.226924 -1.153236  0.274940 -2.934178
2013-01-06  0.139728  0.198525 -0.095590  0.381820

Sort by values:

In [20]: df.sort_values(by='B').execute()
Out[20]: 
                   A         B         C         D
2013-01-03  0.889143 -1.011239  0.167834 -1.136507
2013-01-01  0.572153 -0.881939  0.004131 -0.854504
2013-01-02  0.445309 -0.346412  0.801229 -0.058830
2013-01-06  0.381820 -0.095590  0.198525  0.139728
2013-01-05 -2.934178  0.274940 -1.153236 -0.226924
2013-01-04  1.391901  0.286807  1.100873 -1.599418

Selecting data

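Note

While standard Python / NumPy expressions for selecting and setting data are intuitive and convenient for interactive work, for production code we recommend the optimized data access methods .at, .iat, .loc and .iloc.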

Getting data

Selecting a single column yields a Series. This is equivalent to df.A:

In [21]: df['A'].execute()
Out[21]: 
2013-01-01    0.572153
2013-01-02    0.445309
2013-01-03    0.889143
2013-01-04    1.391901
2013-01-05   -2.934178
2013-01-06    0.381820
Freq: D, Name: A, dtype: float64

Selecting with a slice inside [] slices the rows:

In [22]: df[0:3].execute()
Out[22]: 
                   A         B         C         D
2013-01-01  0.572153 -0.881939  0.004131 -0.854504
2013-01-02  0.445309 -0.346412  0.801229 -0.058830
2013-01-03  0.889143 -1.011239  0.167834 -1.136507

In [23]: df['20130102':'20130104'].execute()
Out[23]: 
                   A         B         C         D
2013-01-02  0.445309 -0.346412  0.801229 -0.058830
2013-01-03  0.889143 -1.011239  0.167834 -1.136507
2013-01-04  1.391901  0.286807  1.100873 -1.599418

Selection by label

Get a single row by its label:

In [24]: df.loc['20130101'].execute()
Out[24]: 
A    0.572153
B   -0.881939
C    0.004131
D   -0.854504
Name: 2013-01-01 00:00:00, dtype: float64

Select by label along one axis:

In [25]: df.loc[:, ['A', 'B']].execute()
Out[25]: 
                   A         B
2013-01-01  0.572153 -0.881939
2013-01-02  0.445309 -0.346412
2013-01-03  0.889143 -1.011239
2013-01-04  1.391901  0.286807
2013-01-05 -2.934178  0.274940
2013-01-06  0.381820 -0.095590

Select by label along both axes. All data carrying these labels are selected, so both endpoints of a label slice are included:

In [26]: df.loc['20130102':'20130104', ['A', 'B']].execute()
Out[26]: 
                   A         B
2013-01-02  0.445309 -0.346412
2013-01-03  0.889143 -1.011239
2013-01-04  1.391901  0.286807

Reduce the dimensionality of the returned object by passing a single label on one axis:

In [27]: df.loc['20130102', ['A', 'B']].execute()
Out[27]: 
A    0.445309
B   -0.346412
Name: 2013-01-02 00:00:00, dtype: float64

Get a scalar value:

In [28]: df.loc['20130101', 'A'].execute()
Out[28]: 0.5721528161827565

Get fast access to a scalar (equivalent to the previous method):

In [29]: df.at['20130101', 'A'].execute()
Out[29]: 0.5721528161827565

Selection by position

Select by the position of the passed integer:

In [30]: df.iloc[3].execute()
Out[30]: 
A    1.391901
B    0.286807
C    1.100873
D   -1.599418
Name: 2013-01-04 00:00:00, dtype: float64

Select by integer slices, which behave like NumPy / Python slices:

In [31]: df.iloc[3:5, 0:2].execute()
Out[31]: 
                   A         B
2013-01-04  1.391901  0.286807
2013-01-05 -2.934178  0.274940
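
The two slicing styles shown so far differ in how they treat the end of the range: the label slice df['20130102':'20130104'] returned three rows, while the positional slice df.iloc[3:5] returned two. The following plain-Python sketch emulates both conventions on a list of dates. It does not use Mars; the variable names are illustrative only.

```python
from datetime import date, timedelta

# Six consecutive days, like the index used in the examples above.
labels = [date(2013, 1, 1) + timedelta(days=i) for i in range(6)]

# Positional slice: half-open, the stop position is excluded.
by_position = labels[1:3]

# Label slice (emulated): closed, both endpoints are included.
start, stop = date(2013, 1, 2), date(2013, 1, 4)
by_label = [d for d in labels if start <= d <= stop]

print(len(by_position), len(by_label))  # 2 3
```

Keeping this difference in mind helps avoid off-by-one errors when switching between .loc and .iloc.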

Select by lists of integer positions, similar to the NumPy / Python style:

In [32]: df.iloc[[1, 2, 4], [0, 2]].execute()
Out[32]: 
                   A         C
2013-01-02  0.445309  0.801229
2013-01-03  0.889143  0.167834
2013-01-05 -2.934178 -1.153236

Slice rows explicitly:

In [33]: df.iloc[1:3, :].execute()
Out[33]: 
                   A         B         C         D
2013-01-02  0.445309 -0.346412  0.801229 -0.058830
2013-01-03  0.889143 -1.011239  0.167834 -1.136507

Slice columns explicitly:

In [34]: df.iloc[:, 1:3].execute()
Out[34]: 
                   B         C
2013-01-01 -0.881939  0.004131
2013-01-02 -0.346412  0.801229
2013-01-03 -1.011239  0.167834
2013-01-04  0.286807  1.100873
2013-01-05  0.274940 -1.153236
2013-01-06 -0.095590  0.198525

Get a value explicitly by position:

In [35]: df.iloc[1, 1].execute()
Out[35]: -0.34641209874727774

Get fast access to a scalar (equivalent to the previous method):

In [36]: df.iat[1, 1].execute()
Out[36]: -0.34641209874727774

Boolean indexing

Use the values of a single column to select rows:

In [37]: df[df['A'] > 0].execute()
Out[37]: 
                   A         B         C         D
2013-01-01  0.572153 -0.881939  0.004131 -0.854504
2013-01-02  0.445309 -0.346412  0.801229 -0.058830
2013-01-03  0.889143 -1.011239  0.167834 -1.136507
2013-01-04  1.391901  0.286807  1.100873 -1.599418
2013-01-06  0.381820 -0.095590  0.198525  0.139728

Select the values of a DataFrame that meet a boolean condition:

In [38]: df[df > 0].execute()
Out[38]: 
                   A         B         C         D
2013-01-01  0.572153       NaN  0.004131       NaN
2013-01-02  0.445309       NaN  0.801229       NaN
2013-01-03  0.889143       NaN  0.167834       NaN
2013-01-04  1.391901  0.286807  1.100873       NaN
2013-01-05       NaN  0.274940       NaN       NaN
2013-01-06  0.381820       NaN  0.198525  0.139728
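
The two forms of boolean indexing above behave differently: a boolean column drops whole rows, whereas a boolean frame keeps the shape and fills the failing entries with NaN. A minimal plain-Python sketch of that difference, using a small hypothetical table rather than Mars objects:

```python
import math

# Hypothetical 2x2 table stored as a list of rows: [A, B].
table = [[0.5, -0.9], [-2.9, 0.3]]

# Like df[df['A'] > 0]: rows whose first column fails the test are dropped.
filtered = [row for row in table if row[0] > 0]

# Like df[df > 0]: the shape is kept, failing entries become NaN.
masked = [[v if v > 0 else math.nan for v in row] for row in table]

print(filtered)  # [[0.5, -0.9]]
print(masked)    # [[0.5, nan], [nan, 0.3]]
```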

Operations

Stats

Operations in general exclude missing data.

Compute a descriptive statistic:

In [39]: df.mean().execute()
Out[39]: 
A    0.124358
B   -0.295572
C    0.186559
D   -0.622743
dtype: float64

The same operation on the other axis:

In [40]: df.mean(1).execute()
Out[40]: 
2013-01-01   -0.290040
2013-01-02    0.210324
2013-01-03   -0.272692
2013-01-04    0.295041
2013-01-05   -1.009849
2013-01-06    0.156121
Freq: D, dtype: float64

Operating on objects of different dimensionality requires alignment. In addition, Mars DataFrame automatically broadcasts along the specified axis.

In [41]: s = md.Series([1, 3, 5, mt.nan, 6, 8], index=dates).shift(2)

In [42]: s.execute()
Out[42]: 
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [43]: df.sub(s, axis='index').execute()
Out[43]: 
                   A         B         C         D
2013-01-01       NaN       NaN       NaN       NaN
2013-01-02       NaN       NaN       NaN       NaN
2013-01-03 -0.110857 -2.011239 -0.832166 -2.136507
2013-01-04 -1.608099 -2.713193 -1.899127 -4.599418
2013-01-05 -7.934178 -4.725060 -6.153236 -5.226924
2013-01-06       NaN       NaN       NaN       NaN
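
The result above can be read as "for each row label, subtract the matching entry of s from every column", with NaN wherever s has no usable value. The sketch below is a simplified plain-Python model of that alignment, assuming a frame stored as nested dicts. The helper sub_by_index is hypothetical and is not part of Mars.

```python
import math

# Hypothetical two-column frame, stored as {column: {row_label: value}}.
frame = {
    "A": {"r1": 1.0, "r2": 2.0, "r3": 3.0},
    "B": {"r1": 10.0, "r2": 20.0, "r3": 30.0},
}
# A series indexed by row label; "r1" holds NaN and "r3" is absent.
series = {"r1": math.nan, "r2": 0.5}

def sub_by_index(frame, series):
    """Subtract series[label] from every column, matching on row labels."""
    return {
        col: {label: value - series.get(label, math.nan)
              for label, value in column.items()}
        for col, column in frame.items()
    }

result = sub_by_index(frame, series)
print(result["A"]["r2"], result["B"]["r2"])  # 1.5 19.5
```

Rows "r1" and "r3" come out as NaN in every column, mirroring the all-NaN rows in the output above.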

Apply

Apply a function to the data:

In [44]: df.apply(lambda x: x.max() - x.min()).execute()
Out[44]: 
A    4.326079
B    1.298046
C    2.254109
D    1.739147
dtype: float64

String methods

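As the example below shows, Series provides a set of string processing methods through its str attribute, which make it easy to operate on each element. Note that pattern matching in str generally uses regular expressions by default (and in some cases always uses them). See Vectorized String Methods for more.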

In [45]: s = md.Series(['A', 'B', 'C', 'Aaba', 'Baca', mt.nan, 'CABA', 'dog', 'cat'])

In [46]: s.str.lower().execute()
Out[46]: 
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
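
To illustrate the remark about regular expressions, here is a plain-Python sketch of an element-wise pattern match that passes missing values through unchanged, which is how the str accessor treats NaN in the output above. The helper str_contains is hypothetical; it only mimics the behavior and makes no claim about which str methods Mars implements.

```python
import math
import re

values = ["A", "Aaba", "Baca", math.nan, "dog"]

def str_contains(items, pattern):
    """Element-wise regex search that leaves non-string (missing) items as-is."""
    regex = re.compile(pattern)
    return [
        item if not isinstance(item, str) else bool(regex.search(item))
        for item in items
    ]

# The pattern is interpreted as a regular expression, not a literal string.
matches = str_contains(values, r"^[AB]a")
print(matches)  # [False, True, True, nan, False]
```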

Merge

Concat

Mars DataFrame provides various facilities for easily combining Series and DataFrame objects, using set logic on the indexes and relational algebra functionality for join / merge-type operations.

Concatenate DataFrame objects with concat():

In [47]: df = md.DataFrame(mt.random.randn(10, 4))

In [48]: df.execute()
Out[48]: 
          0         1         2         3
0 -0.010468 -0.565373 -1.256138  0.892066
1 -0.746989  0.362044  0.093031 -0.534680
2 -0.245727  0.467933 -0.277734  0.828227
3  0.683176 -0.910527 -0.696149  0.019743
4  0.579134  0.523877 -1.229499 -1.235385
5 -0.057204  0.986720 -0.284292 -1.320958
6  0.316745 -0.970189 -1.254495 -0.216132
7 -1.779230 -0.070959 -0.274425  1.839446
8  0.071652  0.771412 -0.127523  0.473681
9 -0.166045  0.174637  0.759082 -0.787874

# break it into pieces
In [49]: pieces = [df[:3], df[3:7], df[7:]]

In [50]: md.concat(pieces).execute()
Out[50]: 
          0         1         2         3
0 -0.010468 -0.565373 -1.256138  0.892066
1 -0.746989  0.362044  0.093031 -0.534680
2 -0.245727  0.467933 -0.277734  0.828227
3  0.683176 -0.910527 -0.696149  0.019743
4  0.579134  0.523877 -1.229499 -1.235385
5 -0.057204  0.986720 -0.284292 -1.320958
6  0.316745 -0.970189 -1.254495 -0.216132
7 -1.779230 -0.070959 -0.274425  1.839446
8  0.071652  0.771412 -0.127523  0.473681
9 -0.166045  0.174637  0.759082 -0.787874

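Note

Adding a column to a DataFrame is relatively fast. Adding a row, however, requires a copy and may be expensive. We recommend passing pre-built lists of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.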

Join

SQL-style merges. See the Database style joining section for more information.

In [51]: left = md.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [52]: right = md.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [53]: left.execute()
Out[53]: 
   key  lval
0  foo     1
1  foo     2

In [54]: right.execute()
Out[54]: 
   key  rval
0  foo     4
1  foo     5

In [55]: md.merge(left, right, on='key').execute()
Out[55]: 
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
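
The merge above yields four rows because every left row with key 'foo' is paired with every right row with the same key. A minimal plain-Python sketch of such an inner join, with rows modeled as dicts; the inner_join helper is hypothetical and only illustrates the pairing logic:

```python
left = [{"key": "foo", "lval": 1}, {"key": "foo", "lval": 2}]
right = [{"key": "foo", "rval": 4}, {"key": "foo", "rval": 5}]

def inner_join(left_rows, right_rows, on):
    """Pair every left row with every right row that has the same key."""
    joined = []
    for l_row in left_rows:
        for r_row in right_rows:
            if l_row[on] == r_row[on]:
                joined.append({**l_row, **r_row})
    return joined

result = inner_join(left, right, on="key")
for row in result:
    print(row)
```

With two matching rows on each side, the result has 2 x 2 = 4 rows, in the same order as the Mars output above. When the keys are unique on both sides, each key produces exactly one row.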

Another example:

In [56]: left = md.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})

In [57]: right = md.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [58]: left.execute()
Out[58]: 
   key  lval
0  foo     1
1  bar     2

In [59]: right.execute()
Out[59]: 
   key  rval
0  foo     4
1  bar     5

In [60]: md.merge(left, right, on='key').execute()
Out[60]: 
   key  lval  rval
0  foo     1     4
1  bar     2     5

Grouping

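By "group by" we mean a process involving one or more of the following steps:

Splitting: divide the data into groups based on some criteria

Applying: apply a function to each group independently

Combining: combine the results into a single data structure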
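
The three steps can be sketched in plain Python as follows. This is only a conceptual model with hypothetical sample data, not how Mars implements groupby:

```python
from collections import defaultdict

# Hypothetical sample rows: (group key, value).
rows = [
    ("foo", 1.0),
    ("bar", 2.0),
    ("foo", 3.0),
    ("bar", 4.0),
    ("foo", 5.0),
]

# Split: collect the values that share the same key.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# Apply: reduce each group independently (here, a sum).
# Combine: gather the per-group results into one mapping, sorted by key.
totals = {key: sum(values) for key, values in sorted(groups.items())}

print(totals)  # {'bar': 6.0, 'foo': 9.0}
```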

In [61]: df = md.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ....:                          'foo', 'bar', 'foo', 'foo'],
   ....:                    'B': ['one', 'one', 'two', 'three',
   ....:                          'two', 'two', 'one', 'three'],
   ....:                    'C': mt.random.randn(8),
   ....:                    'D': mt.random.randn(8)})
   ....: 

In [62]: df.execute()
Out[62]: 
     A      B         C         D
0  foo    one -1.095408  0.852857
1  bar    one  0.258248  0.376903
2  foo    two  0.683997  1.279104
3  bar  three -2.550251 -0.034523
4  foo    two  0.212246 -0.337863
5  bar    two -1.092934 -1.614972
6  foo    one -0.514371 -0.100971
7  foo  three  0.861557  0.566483

Group, then apply the sum() function to the resulting groups:

In [63]: df.groupby('A').sum().execute()
Out[63]: 
            C         D
A                      
bar -3.384938 -1.272591
foo  0.148021  2.259611

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function:

In [64]: df.groupby(['A', 'B']).sum().execute()
Out[64]: 
                  C         D
A   B                        
foo one   -1.609778  0.751886
    two    0.896243  0.941241
    three  0.861557  0.566483
bar one    0.258248  0.376903
    two   -1.092934 -1.614972
    three -2.550251 -0.034523

Plotting

We use the standard convention for referencing the matplotlib API:

In [65]: import matplotlib.pyplot as plt

In [66]: plt.close('all')

In [67]: ts = md.Series(mt.random.randn(1000),
   ....:                index=md.date_range('1/1/2000', periods=1000))
   ....: 

In [68]: ts = ts.cumsum()

In [69]: ts.plot()
Out[69]: <AxesSubplot:>

On a DataFrame, the plot() method is a convenient way to plot all of the columns with their labels:

In [70]: df = md.DataFrame(mt.random.randn(1000, 4), index=ts.index,
   ....:                   columns=['A', 'B', 'C', 'D'])
   ....: 

In [71]: df = df.cumsum()

In [72]: plt.figure()
Out[72]: <Figure size 640x480 with 0 Axes>

In [73]: df.plot()
Out[73]: <AxesSubplot:>

In [74]: plt.legend(loc='best')
Out[74]: <matplotlib.legend.Legend at 0x7f1841402e10>

Getting data in/out

CSV

Write to a CSV file:

In [75]: df.to_csv('foo.csv').execute()
Out[75]: 
Empty DataFrame
Columns: []
Index: []

Read from a CSV file:

In [76]: md.read_csv('foo.csv').execute()
Out[76]: 
     Unnamed: 0         A          B          C          D
0    2000-01-01 -0.166641   0.702534  -0.072652  -0.575882
1    2000-01-02  2.266297  -0.228048  -1.826486  -0.771543
2    2000-01-03  1.676180   0.545600   0.610733  -2.646980
3    2000-01-04  0.334493  -0.323038   0.610592  -3.605813
4    2000-01-05  0.787006  -0.350260   0.180054  -3.689806
..          ...       ...        ...        ...        ...
995  2002-09-22  3.689279 -34.981194 -22.221220  45.174546
996  2002-09-23  5.132589 -34.184650 -21.642247  45.591730
997  2002-09-24  4.305187 -35.015174 -21.031496  45.332951
998  2002-09-25  5.497941 -34.453720 -21.600697  45.123470
999  2002-09-26  6.122513 -33.825127 -22.280876  44.581487

[1000 rows x 5 columns]
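
The "Unnamed: 0" column in the output above suggests that the index was written as the first CSV column with an empty header cell, as pandas does, and was then read back as ordinary data. The plain-Python sketch below reproduces that file layout with the standard csv module and treats the first field as the row label. The sample values are made up for illustration.

```python
import csv
import io

# Write a tiny table the way an index-bearing frame is serialized:
# the index column comes first and its header cell is empty.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["", "A", "B"])
writer.writerow(["2000-01-01", "-0.17", "0.70"])
writer.writerow(["2000-01-02", "2.27", "-0.23"])

# Read it back, using the first field of each record as the row label.
buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)
indexed = {row[0]: dict(zip(header[1:], row[1:])) for row in reader}

print(header)                      # ['', 'A', 'B']
print(indexed["2000-01-02"]["A"])  # 2.27
```

In pandas, passing index_col=0 to read_csv restores the first column as the index. Whether md.read_csv accepts the same argument is an assumption here and should be checked against the Mars API reference.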
