I. Introduction:
A DataFrame is a labeled, size-mutable, two-dimensional, heterogeneous table. It is made up of several Series (called columns in a DataFrame), and all of those Series share one set of labels.
DataFrame unifies two or more Series into a single data structure. Each Series then represents a named column of
the DataFrame, and instead of each column having its own index, the DataFrame provides a single index and the
data in all columns is aligned to the master index of the DataFrame.
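II. Usage
1. Creation:
Create a DataFrame object: pd.DataFrame([<data>,index=(0,1...n-1),columns=(0,1...n-1),dtype=<dtype>,copy=False])
#Parameters: others are the same as for the pd.Series class
data: the data to store; an ndarray/Iterable/dict/DataFrame, None by default
#A dict may contain Series objects/arrays/scalars/list-likes/str/set as values; the keys must be hashable
#If <data> is a dict whose values are all scalars, <index> must be passed, and the length of <index> determines the number of rows
columns: the column names; an array-like/dict/pandas.Index object
#If <data> is a dict, its keys stand in for columns (when columns is not given) rather than for index, unlike Series
#Although the comments in the source code do not put it this way???
#If <columns> is a dict, its keys are used as the column names
#Examples: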
>>> pd.DataFrame(np.zeros([2,2]))
0 1
0 0.0 0.0
1 0.0 0.0
>>> pd.DataFrame()
Empty DataFrame
Columns: []
Index: []
>>> pd.DataFrame({"a":1,"b":2,"c":3},index=[1,2,3])
#All values are scalars, so <index> must be passed here
a b c
1 1 2 3
2 1 2 3
3 1 2 3
>>> pd.DataFrame({"a":"q","b":"w","c":"e"},index=[1,2,3])
#All values are scalars
a b c
1 q w e
2 q w e
3 q w e
>>> pd.DataFrame({"a":[5,6],"b":[3,4],"c":[1,2]})
a b c
0 5 3 1
1 6 4 2
>>> pd.DataFrame([np.zeros(7),np.zeros(7)])
0 1 2 3 4 5 6
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
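3. Operations
(1) Indexing and slicing:
See Python.Third-party modules.Data analysis.Pandas module.Indexing and slicing, part I.2
(2) Filtering by condition:
Select rows by condition: <df>[<cond>]
#Parameters:
cond: the filter condition
#Common conditions: >,<,==,<=,>=,!=,between(),isnull(),&,|,~,.str.contains()
#Example: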
>>> df=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],index=["A","B","C","D"],columns=["e","f","g"])
>>> df[df.e>1]
e f g
B 4 5 6
C 7 8 9
D 10 11 12
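The list of common conditions can be tried out on a small frame. A minimal sketch with made-up data; the variable names on the left are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"e": [1, 4, 7, 10], "f": [2.0, None, 8.0, 11.0], "g": ["ab", "bc", "cd", "de"]},
    index=["A", "B", "C", "D"],
)

# Combine conditions with & / | / ~ ; each condition needs its own parentheses.
both = df[(df.e > 1) & (df.e < 10)]        # rows B, C
either = df[(df.e == 1) | (df.e == 10)]    # rows A, D
negated = df[~(df.e > 1)]                  # row A

# Method-based conditions.
in_range = df[df.e.between(4, 7)]          # inclusive on both ends: rows B, C
missing = df[df.f.isnull()]                # row B
has_c = df[df.g.str.contains("c")]         # rows B, C
```

Python's own `and`/`or`/`not` do not work element-wise on a Series, which is why the operators &, | and ~ are used.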
(3) Element-wise equality of 2 DataFrame objects:
Check whether corresponding values of 2 DataFrame objects are equal: <df1>==<df2>
>>> df_a=pd.DataFrame([[2,3],[4,1]])
>>> df_b=pd.DataFrame([[1,3],[4,2]])
>>> df_a==df_b
0 1
0 False True
1 True False
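The boolean frame produced by == can be reduced to a single answer. A minimal sketch; `.equals()` is not mentioned in these notes and is shown only as one possible whole-object comparison.

```python
import pandas as pd

df_a = pd.DataFrame([[2, 3], [4, 1]])
df_b = pd.DataFrame([[1, 3], [4, 2]])

mask = df_a == df_b                    # element-wise boolean DataFrame
n_equal = int(mask.sum().sum())        # number of matching cells
all_equal = bool(mask.all().all())     # True only if every cell matches
same = df_a.equals(df_a.copy())        # whole-object comparison, returns one bool
```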
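(4) Row and column operations:
View a column: <df>.<column>
#Note: ① a row cannot be viewed in a similar way ② this is actually an attribute of <df>
#Parameters:
column: the column name
#Example: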
>>> df=pd.DataFrame([[1,2,3],[4,5,6]],columns=["a","b","c"],index=["A","B"])
>>> df.a
A 1
B 4
Name: a, dtype: int64
>>> df.A
Traceback (most recent call last):
File "" , line 1, in <module>
File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py", line 5136, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'A'
#################################################################################################
Add a column: assign directly by indexing
or assign through .loc
Add a row: assign through .loc
#Example:
>>> df=pd.DataFrame([[1,2,3],[5,6,7],[9,0,1]])
>>> df[3]=100
>>> df.loc[:,4]=1000
>>> df
0 1 2 3 4
0 1 2 3 100 1000
1 5 6 7 100 1000
2 9 0 1 100 1000
>>> df[3][3]=0#chained indexing like this has no effect on df
>>> df
0 1 2 3 4
0 1 2 3 100 1000
1 5 6 7 100 1000
2 9 0 1 100 1000
>>> df.loc[1.1,:]=9.9
>>> df
0 1 2 3 4
0.0 1.0 2.0 3.0 100.0 1000.0
1.0 5.0 6.0 7.0 100.0 1000.0
2.0 9.0 0.0 1.0 100.0 1000.0
1.1 9.9 9.9 9.9 9.9 9.9
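The transcript shows that the chained form df[3][3]=0 leaves df unchanged. A minimal sketch of the single-step .loc form, which writes to df directly; the row label 3 and the fill value -1 are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [5, 6, 7], [9, 0, 1]])
df[3] = 100              # new column by plain indexing
df.loc[:, 4] = 1000      # new column through .loc
df.loc[3, :] = -1        # new row through .loc (label 3 did not exist before)

# One .loc call addresses row and column together, so the write lands in df itself.
df.loc[3, 3] = 0
```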
#################################################################################################
Delete a column: use del
Delete a row: use .drop()
#Example: for deleting rows, see the <df>.drop() entry under related functions
>>> df=pd.DataFrame([[1,2],[5,6]])
>>> del df[0]
>>> df
1
0 2
1 6
#################################################################################################
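Reverse the row order (flip each column top to bottom): use .loc[::-1,:]
Reverse the column order (flip each row left to right): use .loc[:,::-1]
#Examples: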
>>> df=pd.DataFrame([[1,2],[5,6]])
>>> df.loc[:,::-1]
1 0
0 2 1
1 6 5
>>> df.loc[::-1,:]
0 1
1 5 6
0 1 2
>>> df.loc[::-1,::-1]
1 0
1 6 5
0 2 1
#################################################################################################
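Swap the values of 2 columns: use .loc
Swap the values of 2 rows: use .loc
#Examples:
>>> df=pd.DataFrame([[1,2],[5,6]])
>>> df.loc[:,[0,1]]=df.loc[:,[1,0]].values#.values is required; without it the assignment aligns on labels and changes nothing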
>>> df
0 1
0 2 1
1 6 5
>>> df.loc[[0,1],:]=df.loc[[1,0],:].values#.values is required; without it the assignment aligns on labels and changes nothing
>>> df
0 1
0 6 5
1 2 1
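Why the raw values are needed for the swap: assigning a DataFrame through .loc aligns on labels before writing, while a plain array is written by position. A minimal sketch that compares both; `.to_numpy()` is used here in place of `.values` and returns the same array.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [5, 6]])

# Right-hand side is a DataFrame: columns are matched by label, so nothing moves.
aligned = df.copy()
aligned.loc[:, [0, 1]] = aligned.loc[:, [1, 0]]

# Right-hand side is a plain array: values are written by position, so the columns swap.
swapped = df.copy()
swapped.loc[:, [0, 1]] = swapped.loc[:, [1, 0]].to_numpy()
```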
#################################################################################################
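Rename a column: use .rename()
#Example: see the <df>.rename() entry under related functions
III. Related functions
A DataFrame can also call many of the Series functions, usually with an extra axis parameter that selects the axis to compute along
NumPy ufuncs (element-wise array functions) can also be applied to a DataFrame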
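Both statements can be seen on a small frame. A minimal sketch with made-up data: np.sqrt stands in for any ufunc and sum for any Series-style reduction.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 4.0], [9.0, 16.0]], index=["a", "b"], columns=["A", "B"])

roots = np.sqrt(df)         # ufunc applied element-wise; row and column labels are kept
col_sum = df.sum(axis=0)    # reduce down each column: one value per column
row_sum = df.sum(axis=1)    # reduce across each row: one value per row
```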
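1. Basic operations
(1) Query:
Get the first n rows of a DataFrame object: <df>.head(<n>)
Get the last n rows of a DataFrame object: <df>.tail(<n>)
#Both return a DataFrame object
#Parameters:
n: the number of rows to get
#Examples: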
>>> df=pd.DataFrame([[1,2],[3,4],[5,6]])
>>> df.head(2)
0 1
0 1 2
1 3 4
>>> df.tail(1)
0 1
2 5 6
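(2) Modify:
Change the data type: <df>.astype(<dtype>[,copy=True,errors="raise"])
#Parameters: others are the same as for Series.astype()
dtype: the target data type; a str/dict
#dict format is {<column>:"<dtype>",...}, where <column> is a column name and <dtype> is the data type for that column (a str)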
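The astype entry has no example, so here is a minimal sketch of both forms of <dtype>; the column names and data are made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": ["5", "6"]})

# str form: one dtype applied to every column of the frame it is called on.
all_float = df[["a", "b"]].astype("float64")

# dict form: column name -> dtype; columns not listed keep their dtype.
per_column = df.astype({"a": "float64", "c": "int64"})
```

astype returns a new object; df itself keeps its original dtypes.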
#################################################################################################
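Rename rows/columns: <df>.rename(mapper=None,index=None,columns=None,axis='index',copy=True,inplace=False,level=None,errors="ignore")
#Parameters:
mapper: (together with axis) how to rename; a dict-like/function
#dict format is {<old>:<new>,...}, where <old> is the old column/row name and <new> is the new column/row name
axis: the dimension to rename along; 0 or 'index'/1 or 'columns'
#mapper and axis are used together
index,columns: how to rename the row labels and the column names respectively; a dict-like/function
#Once index/columns is given, axis/mapper can no longer be given
#Examples: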
>>> df=pd.DataFrame({"A":[1,2,3,4],"B":[5,6,7,8],"C":[1,1,1,1]},index=['q','w','e','r'])
>>> df.rename(mapper={'A':'a'},axis=1)
a B C
q 1 5 1
w 2 6 1
e 3 7 1
r 4 8 1
>>> df.rename(index={'q':'Q'})
A B C
Q 1 5 1
w 2 6 1
e 3 7 1
r 4 8 1
#################################################################################################
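Use the given column(s) as the row index: <df>.set_index(<keys>[,drop=True,append=False,inplace=False,verify_integrity=False])
#Parameters:
keys: the column(s); a scalar/array-like/list of scalars/list of arrays
drop: whether to delete the columns used as row labels; bool
append: False replaces the existing labels; True appends to the existing labels
#Examples: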
>>> df=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],index=["a","b","c"],columns=["A","B","C"])
>>> df
A B C
a 1 2 3
b 4 5 6
c 7 8 9
>>> df.set_index(["A"])
B C
A
1 2 3
4 5 6
7 8 9
>>> df.set_index(["A","B"])
C
A B
1 2 3
4 5 6
7 8 9
>>> df.set_index([["A","B","C"]])
A B C
A 1 2 3
B 4 5 6
C 7 8 9
>>> df.set_index([["A","B","C"],["A","A","B"]])
A B C
A A 1 2 3
B A 4 5 6
C B 7 8 9
>>> df.set_index(["A"],append=True)
B C
A
a 1 2 3
b 4 5 6
c 7 8 9
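The drop parameter is described above but not exercised in the transcript. A minimal sketch of its effect, using a shortened version of the same frame.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=["a", "b"], columns=["A", "B", "C"])

moved = df.set_index("A")              # column A becomes the index and leaves the columns
kept = df.set_index("A", drop=False)   # column A becomes the index and also stays a column
row = moved.loc[4]                     # rows can now be looked up by the old A values
```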
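(3) Delete:
Delete the given rows/columns: <df>.drop([labels=None,axis=0,index=None,columns=None,inplace=False,errors="raise"])
#Parameters: others are the same as for Series.drop()
columns: the column labels of the elements to delete
#At least 1 of labels/index/columns should be given
axis: which axis to operate on; 0 for rows, 1 for columns (see the examples)
#Only takes effect together with labels (the positional argument)
#Examples: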
>>> df=pd.DataFrame([[1,2],[3,4],[5,6]])
>>> df.drop(columns=0,index=0)
1
1 4
2 6
>>> df.drop(0)
0 1
1 3 4
2 5 6
>>> df.drop(0,0)
0 1
1 3 4
2 5 6
>>> df.drop(0,axis=0)
0 1
1 3 4
2 5 6
>>> df.drop(0,axis=1)
1
0 2
1 4
2 6
>>> df.drop(index=0,columns=1,axis=1)
0
1 3
2 5
>>> df.drop(index=0,columns=1)
0
1 3
2 5
>>> df.drop(0,0,axis=1)
Traceback (most recent call last):
File "" , line 1, in <module>
TypeError: drop() got multiple values for argument 'axis'
(4) Add:
Add new columns: <df>.assign([**kwargs])
#Examples:
>>> df=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],index=["k1","k2","k3"],columns=["v1","v2","v3"])
>>> df.assign(v4=np.random.randint(0,10,size=3))
v1 v2 v3 v4
k1 1 2 3 0
k2 4 5 6 3
k3 7 8 9 3
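>>> df.assign(v2=lambda x:x*2)#in the pandas version used here this behaved like lambda x:x.v1*2; x*2 is a whole DataFrame, so newer versions may raise ValueError instead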
v1 v2 v3
k1 1 2 3
k2 4 8 6
k3 7 14 9
>>> df.assign(v2=lambda x:x.v3*2)
v1 v2 v3
k1 1 6 3
k2 4 12 6
k3 7 18 9
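Several columns can be added in one assign call, and a callable can refer to a column created earlier in the same call. A minimal sketch; v4 and v5 are made-up column names.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=["k1", "k2"], columns=["v1", "v2", "v3"])

# Each callable receives the DataFrame built so far, so v5 can use the new v4.
out = df.assign(v4=lambda x: x.v1 + x.v2, v5=lambda x: x.v4 * 10)
```

assign returns a new DataFrame; df keeps its original three columns.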
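2. Reshaping and merging
(1) Rotating:
Stack columns into rows: <df>.stack([level=-1,dropna=True])
#Parameters:
level: the column level(s) to stack; an int/str/list of int/list of str
dropna: whether to drop rows of the result whose value is NaN; bool
#Examples:
>>> df=pd.DataFrame([[1,2],[3,np.nan]],index=["a","b"],columns=["A","B"])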
>>> df.stack()
a A 1.0
B 2.0
b A 3.0
dtype: float64
>>> df.stack(dropna=False)
a A 1.0
B 2.0
b A 3.0
B NaN
dtype: float64
>>> df=pd.DataFrame([[1.0, 2.0],[3.0, 4.0]],index=['cat','dog'],columns=[["w","h"],["kg","m"]])
>>> df.stack()
h w
cat kg NaN 1.0
m 2.0 NaN
dog kg NaN 3.0
m 4.0 NaN
>>> df.stack(0)
kg m
cat h NaN 2.0
w 1.0 NaN
dog h NaN 4.0
w 3.0 NaN
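For a frame with single-level columns, the stacked result is a Series whose index has two levels (row label, column label). A minimal sketch; `.unstack()` is not covered in these notes and is shown only as the assumed inverse operation.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=["a", "b"], columns=["A", "B"])

s = df.stack()            # Series indexed by (row label, column label)
cell = s.loc[("a", "B")]  # one value is addressed with a tuple of both labels
back = s.unstack()        # moves the inner index level back into the columns
```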
(2) Merging:
Merge column-wise: <df>.join(<other>[,on=None,how="left",lsuffix="",rsuffix="",sort=False])
#Parameters:
other: the other data set; a DataFrame/Series/list of DataFrame
sort: whether to sort the result by the join key; bool
#Example:
>>> df1=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],index=["A","B","C","D"],columns=["a","b","c"])
>>> df2=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
>>> df1.join(df2)
a b c 0 1 2
A 1 2 3 NaN NaN NaN
B 4 5 6 NaN NaN NaN
C 7 8 9 NaN NaN NaN
D 10 11 12 NaN NaN NaN
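In the transcript the two indexes share no labels, so every joined cell is NaN. A minimal sketch with partly matching labels, which also shows how and the suffix parameters; the data is made up.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=["A", "B", "C"])
right = pd.DataFrame({"a": [10, 30], "b": [0.1, 0.3]}, index=["A", "C"])

# Both frames have a column "a", so suffixes are needed; how="left" keeps every row of left.
left_join = left.join(right, lsuffix="_l", rsuffix="_r")

# how="inner" keeps only the labels present in both indexes.
inner_join = left.join(right, how="inner", lsuffix="_l", rsuffix="_r")
```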
3. Statistics
(1) Basics:
Get basic summary statistics: <df>.describe()
#Examples:
>>> df=pd.DataFrame([[1,2,3,"d1","e1"],[6,7,8,"d2","e2"]],columns=["a","b","c","d","e"])
>>> df.describe()
a b c
count 2.000000 2.000000 2.000000
mean 3.500000 4.500000 5.500000
std 3.535534 3.535534 3.535534
min 1.000000 2.000000 3.000000
25% 2.250000 3.250000 4.250000
50% 3.500000 4.500000 5.500000
75% 4.750000 5.750000 6.750000
max 6.000000 7.000000 8.000000
#By default, non-numeric columns are described only when the DataFrame contains no numeric columns:
>>> df[["d","e"]].describe()
d e
count 2 2
unique 2 2
top d2 e1
freq 1 1
>>> df=pd.DataFrame([["a1","b1","c1","d1","e1"],["a2","b2","c2","d2","e2"]],columns=["a","b","c","d","e"])
>>> df.describe()
a b c d e
count 2 2 2 2 2
unique 2 2 2 2 2
top a1 b2 c2 d2 e1
freq 1 1 1 1 1
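(2) Correlation:
Compute correlation coefficients: <df>.corr()
#Computes the correlation coefficient between every pair of columns in <df>
#Example: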
>>> df=pd.DataFrame([[1,2,3],[4,5,6],[5,4,3]])
>>> df.corr()
0 1 2
0 1.000000 0.838628 0.277350
1 0.838628 1.000000 0.755929
2 0.277350 0.755929 1.000000
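(3) Extremes:
Cumulative maximum (the maximum of the first 1..n values): <df>.cummax([axis=0])
Cumulative minimum (the minimum of the first 1..n values): <df>.cummin([axis=0])
#Parameters:
axis: the axis to compute along; an int/None
#Examples: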
>>> df=pd.DataFrame([[3,2,5,1,9],[1,3,4,6,2],[5,-2,11,0,4]])
>>> df.cummax()
0 1 2 3 4
0 3 2 5 1 9
1 3 3 5 6 9
2 5 3 11 6 9
>>> df.cummax(1)
0 1 2 3 4
0 3 3 5 5 9
1 1 3 4 6 6
2 5 5 11 11 11
>>> df.cummin()
0 1 2 3 4
0 3 2 5 1 9
1 1 2 4 1 2
2 1 -2 4 0 2
>>> df.cummin(1)
0 1 2 3 4
0 3 2 2 1 1
1 1 1 1 1 1
2 5 -2 -2 -2 -2
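(4) Cross-tabulation:
Pivot: <df>.pivot(columns=None[,index=None,values=None])
#long data → wide data; no aggregation; equivalent to pd.pivot(<df>,columns=None[,index=None,values=None])
#Parameters:
index,columns: the columns whose values become the row/column labels; a column label/column label list
values: the column(s) that supply the data; a column label/column label list
#Examples: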
>>> df=pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]],index=["a","b","c","d"],columns=["A","B","C","D"])
>>> df.pivot(columns="A")
B C D
A 1 5 9 13 1 5 9 13 1 5 9 13
a 2.0 NaN NaN NaN 3.0 NaN NaN NaN 4.0 NaN NaN NaN
b NaN 6.0 NaN NaN NaN 7.0 NaN NaN NaN 8.0 NaN NaN
c NaN NaN 10.0 NaN NaN NaN 11.0 NaN NaN NaN 12.0 NaN
d NaN NaN NaN 14.0 NaN NaN NaN 15.0 NaN NaN NaN 16.0
>>> df.pivot(columns="B",values="B")
B 2 6 10 14
a 2.0 NaN NaN NaN
b NaN 6.0 NaN NaN
c NaN NaN 10.0 NaN
d NaN NaN NaN 14.0
>>> df.pivot(columns="B",values="B",index="C")
B 2 6 10 14
C
3 2.0 NaN NaN NaN
7 NaN 6.0 NaN NaN
11 NaN NaN 10.0 NaN
15 NaN NaN NaN 14.0
######################################################################################################################
Pivot with aggregation: <df>.pivot_table([values=None,index=None,columns=None,aggfunc="mean",fill_value=None,margins=False,dropna=True,margins_name="All",observed=False])
#long data → wide data; supports aggregation; see Python.Third-party modules.Data analysis.Pandas module.Introduction, IO, functions, part III.2.(3)
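The difference from pivot() shows up when an index/columns pair occurs more than once. A minimal sketch with made-up sales data; "sum" is just one possible aggfunc.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["X", "X", "Y", "Y", "X"],
    "year": [2020, 2021, 2020, 2021, 2020],
    "sales": [1, 2, 3, 4, 5],
})

# ("X", 2020) appears twice: pivot() cannot reshape this, pivot_table() aggregates the duplicates.
table = df.pivot_table(values="sales", index="city", columns="year", aggfunc="sum")
```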
######################################################################################################################
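Undo a pivot (unpivot): <df>.melt([id_vars=None,value_vars=None,var_name="variable",value_name="value",col_level=None,ignore_index=True])
#wide data → long data; equivalent to pd.melt(<df>[,id_vars=None,value_vars=None,var_name=None,value_name="value",col_level=None,ignore_index=True])
#Parameters:
id_vars: the columns to keep as identifiers (not unpivoted); a column label/column label list, None by default
value_vars: the columns to unpivot; a column label/column label list, all columns not in id_vars by default
var_name,value_name: the names of the resulting variable column and value column respectively; str
col_level: the column level to unpivot; a str/int
#Examples: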
>>> df=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],index=["a","b","c"],columns=["A","B","C"])
>>> df.melt()
variable value
0 A 1
1 A 4
2 A 7
3 B 2
4 B 5
5 B 8
6 C 3
7 C 6
8 C 9
>>> df.melt(id_vars="A")
A variable value
0 1 B 2
1 4 B 5
2 7 B 8
3 1 C 3
4 4 C 6
5 7 C 9
>>> df.melt(value_vars=["A","B"],var_name="new",value_name="val")
new val
0 A 1
1 A 4
2 A 7
3 B 2
4 B 5
5 B 8
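The examples above lose the row labels a, b, c because ignore_index defaults to True. A minimal sketch that keeps them and then pivots back, assuming a pandas version that supports the ignore_index parameter listed in the signature.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=["a", "b"], columns=["A", "B"])

# ignore_index=False keeps the original row labels (each one repeated per melted column).
long = df.melt(ignore_index=False)

# With the labels kept, pivot() can rebuild the wide layout from the long one.
wide = long.pivot(columns="variable", values="value")
```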
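4. Handling missing values:
Drop missing values: <df>.dropna([axis=0,how="any",thresh=None,subset=None,inplace=False])
#Parameters:
axis: whether to drop rows or columns; 0/'index' (drop rows) or 1/'columns' (drop columns)
how: the drop condition; "all" (drop only if every value is missing)/"any" (drop if any value is missing)
inplace: whether to modify <df> in place; bool
#Examples:
>>> df=pd.DataFrame([[1,32,4,np.nan],[4,3,4,2],[0,1,2,3]])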
>>> df.dropna()
0 1 2 3
1 4 3 4 2.0
2 0 1 2 3.0
>>> df.dropna(axis=1)
0 1 2
0 1 32 4
1 4 3 4
2 0 1 2
>>> df.dropna(how="all")
0 1 2 3
0 1 32 4 NaN
1 4 3 4 2.0
2 0 1 2 3.0
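The signature also lists thresh and subset, which the notes do not describe. A minimal sketch of both on made-up data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, np.nan], [4, 3, np.nan], [0, 1, 2]],
    columns=["a", "b", "c"],
)

at_least_two = df.dropna(thresh=2)    # keep rows with at least 2 non-missing values
only_c = df.dropna(subset=["c"])      # only column "c" decides whether a row is dropped
```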
######################################################################################################################
Fill missing values: <df>.fillna([value=None,method=None,axis=None,inplace=False,limit=None,downcast=None])
#Parameters:
value: the value used for filling
#Example (continued from above):
>>> df.fillna(value=99)
0 1 2 3
0 1 32 4 99.0
1 4 3 4 2.0
2 0 1 2 3.0
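Besides a single scalar, value can be a dict that maps column names to fill values. A minimal sketch with made-up data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})

# dict form: column name -> fill value for that column.
filled = df.fillna({"a": 0.0, "b": -1.0})

# inplace defaults to False, so df still contains its missing values.
n_missing_before = int(df.isnull().sum().sum())
```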
######################################################################################################################
Test for missing values: <df>.isnull()
#Example (continued from above):
>>> df.isnull()
0 1 2 3
0 False False False True
1 False False False False
2 False False False False
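5. Other:
Apply a function along an axis (each column by default): <df>.apply(<func>,axis=0,raw=False,result_type=None,args=(),**kwds)
#Parameters:
func: the function to apply; a function
#May be a named function or a lambda; it receives one column (or row) of <df> as its first argument
axis: 0/"index" applies <func> to each column
1/"columns" applies <func> to each row
raw: True passes each row/column of <df> to <func> as an ndarray object
False passes each row/column of <df> to <func> as a Series object
result_type: "expand" turns list-like results into columns
"reduce" returns a Series object where possible
"broadcast" broadcasts the return value to a DataFrame object with the same shape as <df>
None decides from the type of the value <func> returns (list-like→Series,Series→DataFrame(same shape as <df>))
#According to the pandas documentation, these only act when axis=1/"columns"
args,kwds: extra positional/keyword arguments passed to <func>; a tuple and a dict respectively
#Examples: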
>>> df=pd.DataFrame([[1,2],[3,4],[5,6]],index=["a","b","c"],columns=["A","B"])
>>> df
A B
a 1 2
b 3 4
c 5 6
>>> def f(df):
... return [1,2,3]
...
>>> df.apply(f,result_type="reduce")
A [1, 2, 3]
B [1, 2, 3]
dtype: object
>>> type(df.apply(f,result_type="reduce"))
<class 'pandas.core.series.Series'>
>>> df.apply(f,result_type="expand")
A B
a 1 1
b 2 2
c 3 3
>>> type(df.apply(f,result_type="expand"))
<class 'pandas.core.frame.DataFrame'>
>>> df.apply(f,result_type="broadcast")
A B
a 1 1
b 2 2
c 3 3
>>> type(df.apply(f,result_type="broadcast"))
<class 'pandas.core.frame.DataFrame'>
>>> def ff(df):
... return 101
...
>>> df.apply(ff,result_type="reduce")
A 101
B 101
dtype: int64
>>> type(df.apply(ff,result_type="reduce"))
<class 'pandas.core.series.Series'>
>>> df.apply(ff,result_type="expand")
A 101
B 101
dtype: int64
>>> type(df.apply(ff,result_type="expand"))
<class 'pandas.core.series.Series'>
>>> df.apply(ff,result_type="broadcast")
A B
a 101 101
b 101 101
c 101 101
>>> type(df.apply(ff,result_type="broadcast"))
<class 'pandas.core.frame.DataFrame'>
>>> def fff(df):
... return pd.DataFrame([[1,2],[3,4]])
...
>>> df.apply(fff,result_type="reduce")
A 0 1
0 1 2
1 3 4
B 0 1
0 1 2
1 3 4
dtype: object
>>> type(df.apply(fff,result_type="reduce"))
<class 'pandas.core.series.Series'>
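All of the examples above use the default axis=0. A minimal sketch of the row-wise case, including args/kwds and result_type="expand"; weighted_sum and its parameters are hypothetical names made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], index=["a", "b", "c"], columns=["A", "B"])

def weighted_sum(row, w_a, w_b=1):
    # row is a Series holding one row of df (because axis=1 is used below).
    return row["A"] * w_a + row["B"] * w_b

# args supplies extra positional arguments; extra keywords are passed through to the function.
totals = df.apply(weighted_sum, axis=1, args=(10,), w_b=2)

# A list-like result per row is spread over columns when result_type="expand".
pairs = df.apply(lambda row: [row["A"], row["A"] + row["B"]], axis=1, result_type="expand")
```

totals is a Series with one value per row, while pairs is a DataFrame with one column per list element.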
######################################################################################################################
Apply a function to every element: <df>.applymap(<func>)
#Parameters:
func: the function to apply; a callable object
#Example:
>>> df=pd.DataFrame([[1,2],[-3,4],[-5,6]],index=["a","b","c"],columns=["A","B"])
>>> def f(x):
... return 1 if x>0 else -1
...
>>> df.applymap(f)
A B
a 1 1
b -1 1
c -1 1
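IV. Attributes:
Attributes can be read; some of them (for example .columns and .index) can also be assigned to
Number of elements: <df>.size
#Example: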
>>> pd.DataFrame([[1,2],[3,4],[3,2]]).size
6
#################################################################################################
Get the shape of a DataFrame object: (<a>,<b>)=<df>.shape
#Parameters:
df: the DataFrame object
a,b: len(<index>) and len(<columns>) respectively
#Example:
>>> df=pd.DataFrame(np.zeros([2,3]))
>>> df.shape
(2, 3)
#################################################################################################
The column labels as an Index object (a RangeIndex by default): <df>.columns
The row labels as an Index object (a RangeIndex by default): <df>.index
#Example (continued from above):
>>> df.columns
RangeIndex(start=0, stop=3, step=1)
>>> df.index
RangeIndex(start=0, stop=2, step=1)
>>> df.columns=['q','w','e']
>>> df
q w e
0 0.0 0.0 0.0
1 0.0 0.0 0.0
#################################################################################################
All data as a two-dimensional array: <df>.values
#What is returned is actually a numpy.ndarray object
#Example (continued from above):
>>> df.values
array([[0., 0., 0.],
[0., 0., 0.]])