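A DataFrame represents a tabular, spreadsheet-like data structure that holds an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index. Compared with other DataFrame-like structures you may have used before (such as R's data.frame), row-oriented and column-oriented operations in a DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional arrays rather than as a list, a dict, or some other collection of one-dimensional arrays.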
DataFrame([data, index, columns, dtype, copy])
# Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
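The "dict of Series sharing one index" view can be made concrete with a small sketch. The two Series and their values below are made up for illustration and do not come from the original text.

```python
import pandas as pd

# Two Series whose indexes overlap but are not identical (illustrative data).
population = pd.Series({'Ohio': 11.8, 'Nevada': 3.1, 'Utah': 3.3})
area = pd.Series({'Ohio': 44.8, 'Nevada': 110.6, 'Texas': 268.6})

# Building a DataFrame from a dict of Series aligns them on the union of
# their indexes; a label missing from one Series becomes NaN in that column.
frame = pd.DataFrame({'population': population, 'area': area})
print(frame)

# Each column is itself a Series that shares the frame's row index.
print(type(frame['area']))
print(frame['area'].index.equals(frame.index))
```

Because every column shares the same row index, selecting a column returns a Series whose labels always line up with the rows of the frame.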
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, np.nan],  # np.nan marks a missing value (NA)
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
pd.DataFrame(data,
             # index=['a', 'b', 'c', 'd', 'e']
             # index=range(5)
             )  # an integer index is generated by default; each dict key becomes a column and its list supplies that column's values
Output:
    state    year  pop
0    Ohio  2000.0  1.5
1    Ohio  2001.0  1.7
2    Ohio  2002.0  3.6
3  Nevada  2001.0  2.4
4  Nevada     NaN  2.9
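The output above shows the years as 2000.0 rather than 2000: a single np.nan forces the whole column to float64. The sketch below checks this and also shows the `columns` and `index` arguments of the constructor. Converting to the nullable `Int64` dtype is one possible way to get integers back; it is an addition here, not something the original text uses.

```python
import numpy as np
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, np.nan],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# `columns` fixes the column order, `index` replaces the default 0..n-1 labels.
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                     index=['a', 'b', 'c', 'd', 'e'])
print(frame.dtypes)

# Count the missing entries in the 'year' column.
print(frame['year'].isna().sum())

# Assumption: pandas' nullable integer dtype is acceptable for this column.
year_int = frame['year'].astype('Int64')
print(year_int)
```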
# A dict of dicts (two levels of nesting)
d = {'a': {'tp': 26, 'fp': 112},
     'b': {'tp': 26, 'fp': 91},
     'c': {'tp': 23, 'fp': 74}}
# orient='index': the outer keys become the row labels
df_index = pd.DataFrame.from_dict(d, orient='index')
df_index
Output:
   tp   fp
a  26  112
b  26   91
c  23   74
# orient='columns' (the default): the outer keys become the column labels
df_columns = pd.DataFrame.from_dict(d, orient='columns')
df_columns
Output:
      a   b   c
fp  112  91  74
tp   26  26  23
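The two orientations are transposes of each other. A minimal sketch to verify this, using the same nested dict; the names `by_row` and `by_col` are illustrative. The order of the inner keys in the rows can differ between pandas versions, so the sketch reorders explicitly before comparing.

```python
import pandas as pd

d = {'a': {'tp': 26, 'fp': 112},
     'b': {'tp': 26, 'fp': 91},
     'c': {'tp': 23, 'fp': 74}}

by_row = pd.DataFrame.from_dict(d, orient='index')    # outer keys -> rows
by_col = pd.DataFrame.from_dict(d, orient='columns')  # outer keys -> columns

# Transpose the row-oriented frame and put its labels in the same order
# as the column-oriented frame before comparing the two.
transposed = by_row.T.loc[by_col.index, by_col.columns]
print(transposed.equals(by_col))

# The plain constructor treats the outer keys as columns as well.
print(pd.DataFrame(d).columns.tolist())
```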
# Build a DataFrame from a 2-D NumPy array with explicit row and column labels
data = pd.DataFrame(np.arange(10, 26).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
Output:
          one  two  three  four
Ohio       10   11     12    13
Colorado   14   15     16    17
Utah       18   19     20    21
New York   22   23     24    25
Create a DataFrame of random numbers indexed by date:
np.random.seed(10)
dates = pd.date_range('20190101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
                   A         B         C         D
2019-01-01  1.331587  0.715279 -1.545400 -0.008384
2019-01-02  0.621336 -0.720086  0.265512  0.108549
2019-01-03  0.004291 -0.174600  0.433026  1.203037
2019-01-04 -0.965066  1.028274  0.228630  0.445138
2019-01-05 -1.136602  0.135137  1.484537 -1.079805
2019-01-06 -1.977728 -1.743372  0.266070  2.384967
df.index
Output:
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')
df.index.name = 'time'  # give the row index a name
df.columns
Output:
Index(['A', 'B', 'C', 'D'], dtype='object')
df.columns.name = 'alphabet'  # give the column index a name
df.tail(3).values  # the underlying data as a 2-D NumPy array (last three rows only)
array([[-0.96506567,  1.02827408,  0.22863013,  0.44513761],
       [-1.13660221,  0.13513688,  1.484537  , -1.07980489],
       [-1.97772828, -1.7433723 ,  0.26607016,  2.38496733]])
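The three attributes above (index, columns, values) can also be reached through method calls. The sketch below rebuilds the same random frame and uses `rename_axis` and `to_numpy()`. Both are existing pandas methods, but using them here is a suggestion rather than part of the original walkthrough.

```python
import numpy as np
import pandas as pd

np.random.seed(10)
dates = pd.date_range('20190101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# rename_axis names both axes in one call and returns a new DataFrame,
# leaving df itself untouched.
named = df.rename_axis(index='time', columns='alphabet')
print(named.index.name, named.columns.name)

# to_numpy() returns the data as a 2-D NumPy array, like .values does.
arr = df.to_numpy()
print(arr.shape)
print(np.array_equal(arr, df.values))
```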
df.head(3)  # show the first three rows
                   A         B         C         D
2019-01-01  1.331587  0.715279 -1.545400 -0.008384
2019-01-02  0.621336 -0.720086  0.265512  0.108549
2019-01-03  0.004291 -0.174600  0.433026  1.203037
df.tail(3)  # show the last three rows
                   A         B         C         D
2019-01-04 -0.965066  1.028274  0.228630  0.445138
2019-01-05 -1.136602  0.135137  1.484537 -1.079805
2019-01-06 -1.977728 -1.743372  0.266070  2.384967
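Two details of `head` and `tail` are easy to miss: the row count defaults to 5, and a negative count means "everything except that many rows". A small sketch on a throwaway ten-row frame (the frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})

# With no argument, head() and tail() return five rows.
print(len(df.head()), len(df.tail()))

# A negative n drops rows from the opposite end instead:
# head(-2) is everything except the last two rows,
# tail(-2) is everything except the first two rows.
print(df.head(-2).index.tolist())
print(df.tail(-2).index.tolist())
```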
df = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two',
                         'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
df
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
# set_index turns one or more columns of a DataFrame into the row index
df2 = df.set_index(['c', 'd'])
df2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
# drop=False keeps the indexed columns in the frame as ordinary columns too
df.set_index(['c', 'd'], drop=False)
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
# reset_index does the opposite of set_index: the levels of the hierarchical index are moved back into columns
df2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
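The round trip between `set_index` and `reset_index` can be checked directly. This is a minimal sketch on the same seven-row frame; the names `indexed` and `restored` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})

# Move columns 'c' and 'd' into a two-level row index.
indexed = df.set_index(['c', 'd'])
print(list(indexed.index.names))

# With a hierarchical index, a row is addressed by a tuple of labels.
print(indexed.loc[('two', 1)])

# reset_index moves the levels back into columns (placed first, so the
# column order is c, d, a, b) and restores the default integer index.
restored = indexed.reset_index()
print(restored.columns.tolist())
print(restored[df.columns.tolist()].equals(df))

# drop=True discards the index levels instead of turning them into columns.
print(indexed.reset_index(drop=True).columns.tolist())
```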
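# Here df is again the 6x4 random DataFrame indexed by date
df.round(2)  # round every numeric column to two decimal places
               A     B     C     D
2019-01-01  1.33  0.72 -1.55 -0.01
2019-01-02  0.62 -0.72  0.27  0.11
2019-01-03  0.00 -0.17  0.43  1.20
2019-01-04 -0.97  1.03  0.23  0.45
2019-01-05 -1.14  0.14  1.48 -1.08
2019-01-06 -1.98 -1.74  0.27  2.38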
df.round({'A': 1, 'C': 2})  # a dict sets the number of decimals per column; unlisted columns are left unchanged
              A         B     C         D
2019-01-01  1.3  0.715279 -1.55 -0.008384
2019-01-02  0.6 -0.720086  0.27  0.108549
2019-01-03  0.0 -0.174600  0.43  1.203037
2019-01-04 -1.0  1.028274  0.23  0.445138
2019-01-05 -1.1  0.135137  1.48 -1.079805
2019-01-06 -2.0 -1.743372  0.27  2.384967
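Both calling styles of `round` can be tried on a tiny frame whose values are easy to check by hand. The frame below is illustrative; it also includes a string column to show that non-numeric columns pass through untouched.

```python
import pandas as pd

prices = pd.DataFrame({'A': [1.2345, -2.7182],
                       'B': [0.1234, 9.8765],
                       'label': ['x', 'y']})

# A single integer applies the same number of decimals to every numeric column.
print(prices.round(2))

# A dict rounds only the listed columns; 'B' and 'label' are left as they are.
partly = prices.round({'A': 1})
print(partly)

# round returns a new DataFrame; the original values are not modified.
print(prices['A'].tolist())
```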
# Quick summary statistics for the numeric columns
# (computed here on the last three rows of df, hence count = 3)
df.tail(3).describe()
alphabet         A         B         C         D
count     3.000000  3.000000  3.000000  3.000000
mean     -1.359799 -0.193320  0.659746  0.583433
std       0.541972  1.414715  0.714535  1.736521
min      -1.977728 -1.743372  0.228630 -1.079805
25%      -1.557165 -0.804118  0.247350 -0.317334
50%      -1.136602  0.135137  0.266070  0.445138
75%      -1.050834  0.581705  0.875304  1.415052
max      -0.965066  1.028274  1.484537  2.384967
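`describe` has a couple of options that the call above leaves at their defaults. The sketch below, on a small made-up frame, shows that non-numeric columns are skipped by default, that `percentiles` selects which quantiles are reported, and that `include='all'` brings the other columns back in.

```python
import pandas as pd

scores = pd.DataFrame({'math': [60, 70, 80, 90],
                       'name': ['a', 'b', 'c', 'd']})

# By default only numeric columns are summarised.
summary = scores.describe()
print(summary.columns.tolist())
print(summary.loc[['count', 'mean', 'min', 'max'], 'math'])

# percentiles chooses which quantiles appear as rows of the summary.
custom = scores.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())

# include='all' also summarises non-numeric columns; statistics that do
# not apply to a column are filled with NaN.
everything = scores.describe(include='all')
print(everything.columns.tolist())
```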
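# df after further modifications that are not shown here: A and B are 0 in the first row, column D is set to 5, and a column F has been added
df
                   A         B         C  D    F
2019-01-01  0.000000  0.000000 -1.545400  5  NaN
2019-01-02  0.621336 -0.720086  0.265512  5  1.0
2019-01-03  0.004291 -0.174600  0.433026  5  2.0
2019-01-04 -0.965066  1.028274  0.228630  5  3.0
2019-01-05 -1.136602  0.135137  1.484537  5  4.0
2019-01-06 -1.977728 -1.743372  0.266070  5  5.0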
df.apply(np.cumsum, axis=0, result_type=None)  # axis=0 applies the function to each column: a running total down the rows
                   A         B         C   D     F
2019-01-01  0.000000  0.000000 -1.545400   5   NaN
2019-01-02  0.621336 -0.720086 -1.279889  10   1.0
2019-01-03  0.625627 -0.894686 -0.846863  15   3.0
2019-01-04 -0.339438  0.133588 -0.618232  20   6.0
2019-01-05 -1.476040  0.268725  0.866305  25  10.0
2019-01-06 -3.453769 -1.474647  1.132375  30  15.0
df.apply(lambda x: x.max() - x.min())  # range (max minus min) of each column
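The `axis` argument decides what the function passed to `apply` receives: a column (the default, `axis=0`) or a row (`axis=1`). The sketch below uses a small frame with hand-checkable values and compares the results with the built-in vectorised methods, which are usually the simpler choice when they exist.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 4.0],
                   'B': [10.0, 30.0, 20.0]})

# axis=0 (default): the lambda receives each column as a Series.
col_range = df.apply(lambda x: x.max() - x.min())
print(col_range.to_dict())

# axis=1: the lambda receives each row as a Series.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)
print(row_range.tolist())

# The same results without apply, using built-in methods.
print(np.allclose(col_range, df.max() - df.min()))
print(np.allclose(df.apply(np.cumsum), df.cumsum()))
```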
df.rename(columns={'A': 'key2'}, inplace=False)  # returns a renamed copy; with inplace=False, df itself is not changed
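A short sketch of what `inplace=False` (the default) means in practice, plus the other forms `rename` accepts. The small frame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# rename returns a new DataFrame; the original column labels stay as they were.
renamed = df.rename(columns={'A': 'key2'})
print(renamed.columns.tolist())
print(df.columns.tolist())

# Row labels can be renamed as well, and a function may be passed instead
# of a dict, in which case it is applied to every label.
relabelled = df.rename(index={0: 'first'}, columns=str.lower)
print(relabelled.index.tolist(), relabelled.columns.tolist())
```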
# axis=0 (the default) sorts the rows by their index labels; ascending defaults to True, so ascending=False sorts in descending order
df.sort_index(axis=0, ascending=False)
# df.sort_index(axis=0, ascending=True)
                   A         B         C         D
2019-01-06 -1.977728 -1.743372  0.266070  2.384967
2019-01-05 -1.136602  0.135137  1.484537 -1.079805
2019-01-04 -0.965066  1.028274  0.228630  0.445138
2019-01-03  0.004291 -0.174600  0.433026  1.203037
2019-01-02  0.621336 -0.720086  0.265512  0.108549
2019-01-01  1.331587  0.715279 -1.545400 -0.008384
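The same method sorts columns when `axis=1` is passed. A minimal sketch on a deliberately unsorted frame (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'b': [1, 2, 3], 'a': [6, 5, 4]}, index=[2, 0, 1])

# axis=0 (default): reorder the rows by their index labels.
print(df.sort_index().index.tolist())
print(df.sort_index(ascending=False).index.tolist())

# axis=1: reorder the columns by their labels instead.
print(df.sort_index(axis=1).columns.tolist())

# sort_index returns a new DataFrame; the original order is kept in df.
print(df.index.tolist())
```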