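The original article is on the official site, and the link always points to the latest version. For anyone who is not a native English speaker, there is no way it can be read in 10 minutes. Plenty of translations already exist online, so here I only reorganize the article's structure and memorize by rote the points it covers.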
import numpy as np
import pandas as pd
s = pd.Series([1,3,5,np.nan,6,8])
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df.head()
df.tail(5)
df.index
df.columns
df.values
df.describe()
df.T
df.sort_index(axis=1, ascending=False)
df.sort_values(by='B')
df['A']
df[['A','B']]
df[0:3]
df['20130102':'20130104']
The strings '20130102':'20130104' are converted to DatetimeIndex values automatically.
df.loc[dates[0]]
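A minimal sketch of the date-string slicing, assuming a small frame filled with made-up `np.arange` values so the result is easy to check. It suggests that a label slice includes both endpoints, while a positional slice excludes the stop position.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Deterministic, made-up values (not the random data used in the notes).
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Date strings are parsed to timestamps; both endpoints are included.
by_label = df["20130102":"20130104"]

# Positional slicing excludes the stop position, as in plain Python.
by_position = df[1:4]

print(len(by_label))                  # 3
print(by_label.equals(by_position))   # True
```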
df.loc[:,['A','B']]
df.loc['20130102':'20130104',['A','B']]
df.loc[dates[0],'A']
df.at[dates[0],'A']
df.iloc[3]
df.iloc[3:5,0:2]
df.iloc[[1,2,4],[0,2]]
df.iloc[1:3,:]
df.iloc[:,1:3]
df.iloc[1,1]
df.iat[1,1]
df[df.A > 0] # filter rows
df[df > 0] # entries that fail the condition become NaN
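The difference between the two boolean forms can be sketched on a tiny frame (the values are made up for illustration): a boolean Series drops rows, while a boolean DataFrame keeps the shape and leaves NaN where the test fails.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-1.0, 5.0, -6.0]})

# A boolean Series selects whole rows: only rows where A > 0 remain.
rows = df[df.A > 0]

# A boolean DataFrame keeps the shape; failing entries become NaN.
masked = df[df > 0]

print(rows.shape)                       # (2, 2)
print(masked.shape)                     # (3, 2)
print(int(masked.isna().sum().sum()))   # 3
```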
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2[df2['E'].isin(['two','four'])]
df['F'] = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
df.at[dates[0],'A'] = 0
df.iat[0,1] = 0
df.loc[:,'D'] = np.array([5] * len(df))
df[df > 0]=-df # replace every value greater than 0 with its negative
df.dropna(how='any')
df.fillna(value=5)
pd.isnull(df)
df.mean()
df.mean(axis=1) # axis=1: one mean per row
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
df.sub(s, axis='index')
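A small sketch of what `shift(2)` and `sub(..., axis='index')` do together, assuming a frame of constant 10s (made up here) so the arithmetic is visible. The shifted Series is aligned with the row labels and subtracted from every column, and NaN propagates.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Made-up frame: every entry is 10.0.
df = pd.DataFrame(np.full((6, 2), 10.0), index=dates, columns=["A", "B"])

# shift(2) moves values down two rows: [NaN, NaN, 1, 3, 5, NaN]
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

# Subtract s from each column, matching on the row index.
result = df.sub(s, axis="index")

print(result["A"].tolist())  # [nan, nan, 9.0, 7.0, 5.0, nan]
```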
By default, apply works along axis=0 and passes each whole column to the function; a different axis can also be set.
df.apply(np.cumsum)
df.apply(lambda x: x.max()-x.min())
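To make the axis behaviour concrete, here is a minimal sketch on a made-up 3x2 frame: the same lambda is applied once per column (the default) and once per row (`axis=1`).

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 9], "B": [10, 20, 50]})

# Default axis=0: the function receives each column as a Series.
col_range = df.apply(lambda x: x.max() - x.min())

# axis=1: the function receives each row as a Series.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)

print(col_range.tolist())  # [8, 40]
print(row_range.tolist())  # [9, 16, 41]
```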
String methods apply to a Series only (not to a DataFrame):
s.str.lower()
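A short sketch of the `.str` accessor, using made-up strings: missing values pass through untouched, and for a DataFrame the accessor is reached by first selecting a column.

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "Bb", np.nan, "CCC"])

# Element-wise lower-casing; the missing value stays missing.
lowered = s.str.lower()

# A DataFrame has no .str accessor, so pick a column first.
frame = pd.DataFrame({"x": ["A", "Bb"]})
frame_lowered = frame["x"].str.lower()

print(lowered.isna().tolist())   # [False, False, True, False]
print(frame_lowered.tolist())    # ['a', 'bb']
```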
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
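As a quick check of the idea, the sketch below splits a made-up 10-row frame into three pieces and glues them back together with `pd.concat`, which is a top-level pandas function rather than a DataFrame method.

```python
import numpy as np
import pandas as pd

# Made-up 10-row frame so that all three slices are non-empty.
df = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("ABCD"))

pieces = [df[:3], df[3:7], df[7:]]

# Concatenate along the rows (axis=0 is the default).
rebuilt = pd.concat(pieces)

print(rebuilt.shape)        # (10, 4)
print(rebuilt.equals(df))   # True
```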
pd.merge(left, right, on='key')
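The notes never define `left` and `right`, so here is a minimal sketch with two hypothetical frames that share a `key` column. The default is an inner join on the named column.

```python
import pandas as pd

# Hypothetical inputs; the column names lval/rval are illustrative.
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

# SQL-style join on the shared column.
merged = pd.merge(left, right, on="key")

print(list(merged.columns))      # ['key', 'lval', 'rval']
print(merged["rval"].tolist())   # [4, 5]
```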
s = df.iloc[3]
pd.concat([df, s.to_frame().T], ignore_index=True) # DataFrame.append was removed in pandas 2.0
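Appending a single row can be sketched with `pd.concat` on a small made-up frame: the row Series is turned into a one-row DataFrame first, then concatenated. This is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))

# Take the last row as a Series.
s = df.iloc[3]

# to_frame().T turns the Series into a one-row DataFrame.
appended = pd.concat([df, s.to_frame().T], ignore_index=True)

print(appended.shape)              # (5, 3)
print(appended.iloc[4].tolist())   # [9, 10, 11]
```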
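groupby() returns a DataFrameGroupBy object rather than a DataFrame; a further step is needed to get the values you want. The process involves:
- Splitting: divide the data into groups according to some criteria
- Applying: apply a function to the data of each group independently
- Combining: merge the per-group results into a single data structure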
df.groupby('A').sum()
df.groupby(['A','B']).sum()
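The split / apply / combine steps can be sketched on a tiny made-up frame. Column C is selected before summing so that only the numeric column is aggregated.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "one"],
    "C": [1, 2, 3, 4],
})

# Split: nothing is computed yet, this is only a grouping object.
grouped = df.groupby("A")
print(type(grouped).__name__)  # DataFrameGroupBy

# Apply + combine: sum C within each group, collect into a Series.
totals = grouped["C"].sum()

# Grouping by two columns gives a result with a MultiIndex.
by_two = df.groupby(["A", "B"])["C"].sum()

print(totals.to_dict())  # {'bar': 6, 'foo': 4}
```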
In [95]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
   ....:                      'foo', 'foo', 'qux', 'qux'],
   ....:                     ['one', 'two', 'one', 'two',
   ....:                      'one', 'two', 'one', 'two']]))
   ....:
In [96]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [97]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [98]: df2 = df[:4]
In [99]: df2
Out[99]:
                     A         B
first second
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230
In [100]: stacked = df2.stack()
In [101]: stacked
Out[101]:
first  second
bar    one     A    0.029399
               B   -0.542108
       two     A    0.282696
               B   -0.087302
baz    one     A   -1.575170
               B    1.771208
       two     A    0.816482
               B    1.100230
dtype: float64
In [102]: stacked.unstack()
Out[102]:
                     A         B
first second
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230
In [103]: stacked.unstack(1)
Out[103]:
second        one       two
first
bar   A  0.029399  0.282696
      B -0.542108 -0.087302
baz   A -1.575170  0.816482
      B  1.771208  1.100230
In [104]: stacked.unstack(0)
Out[104]:
first          bar       baz
second
one    A  0.029399 -1.575170
       B -0.542108  1.771208
two    A  0.282696  0.816482
       B -0.087302  1.100230
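The transcript above can be condensed into a small sketch with deterministic, made-up values: `stack()` moves the column labels into an extra index level, `unstack()` reverses that for the last level, and passing a level name picks which level becomes the columns.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")],
    names=["first", "second"],
)
# Made-up values 0.0 .. 7.0 instead of random numbers.
df2 = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=index, columns=["A", "B"])

# Columns A/B become a third, innermost index level.
stacked = df2.stack()

# unstack() with no argument pivots the last level back into columns.
restored = stacked.unstack()

# Unstacking by name turns the 'second' level into columns instead.
by_second = stacked.unstack("second")

print(stacked.index.nlevels)     # 3
print(list(by_second.columns))   # ['one', 'two']
```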
In [105]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
   .....:                    'B' : ['A', 'B', 'C'] * 4,
   .....:                    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                    'D' : np.random.randn(12),
   .....:                    'E' : np.random.randn(12)})
   .....:
In [106]: df
Out[106]:
        A  B    C         D         E
0     one  A  foo  1.418757 -0.179666
1     one  B  foo -1.879024  1.291836
2     two  C  foo  0.536826 -0.009614
3   three  A  bar  1.006160  0.392149
4     one  B  bar -0.029716  0.264599
5     one  C  bar -1.146178 -0.057409
6     two  A  foo  0.100900 -1.425638
7   three  B  foo -1.035018  1.024098
8     one  C  foo  0.314665 -0.106062
9     one  A  bar -0.773723  1.824375
10    two  B  bar -1.170653  0.595974
11  three  C  bar  0.648740  1.167115
In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[107]:
C             bar       foo
A     B
one   A -0.773723  1.418757
      B -0.029716 -1.879024
      C -1.146178  0.314665
three A  1.006160       NaN
      B       NaN -1.035018
      C  0.648740       NaN
two   A       NaN  0.100900
      B -1.170653       NaN
      C       NaN  0.536826
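A minimal sketch of `pivot_table` on made-up data, meant to show two things: duplicate (A, B, C) combinations are aggregated (the default aggregation is the mean), and combinations that never occur show up as NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "one", "two"],
    "B": ["x", "x", "y", "x"],
    "C": ["foo", "foo", "bar", "bar"],
    "D": [1.0, 3.0, 5.0, 7.0],
})

# Rows are indexed by (A, B); the distinct values of C become columns.
table = pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])

# (one, x, foo) occurs twice, so the cell holds the mean of 1.0 and 3.0.
print(table.loc[("one", "x"), "foo"])             # 2.0

# (one, x, bar) never occurs, so the cell is NaN.
print(pd.isna(table.loc[("one", "x"), "bar"]))    # True
```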
Time series: see the Time Series section of the official docs.
In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01 25083
Freq: 5T, dtype: int64
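To see what resampling does to the bins, here is a sketch that uses a series of ones (made up for illustration) instead of random integers, so each bin's sum is simply the number of seconds that fall into it.

```python
import numpy as np
import pandas as pd

# 100 one-second samples starting at midnight, each with value 1.
rng = pd.date_range("2012-01-01", periods=100, freq="s")
ts = pd.Series(np.ones(100, dtype="int64"), index=rng)

# One-minute bins: the first holds 60 samples, the second the remaining 40.
per_minute = ts.resample("1min").sum()

# Five-minute bins: all 100 samples fall into a single bin.
per_five = ts.resample("5min").sum()

print(per_minute.tolist())  # [60, 40]
print(per_five.tolist())    # [100]
```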
Categoricals: see the categorical introduction in the official docs.
Plotting: see the Plotting docs.