Python Third-Party Modules: Data Analysis: The Pandas Module: DataFrame

I. Introduction:

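  • Provides richer functionality than R's data.frame
A DataFrame is a labeled, size-mutable, two-dimensional heterogeneous table. It is made up of multiple Series (called columns in a DataFrame), and all of those Series share one set of row labels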


DataFrame unifies two or more Series into a single data structure. Each Series then represents a named column of
the DataFrame, and instead of each column having its own index, the DataFrame provides a single index and the
data in all columns is aligned to the master index of the DataFrame.
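
The alignment described above can be sketched as follows. The names height and weight and their values are made up for illustration; the point is only that two Series with different labels end up sharing one index.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data)
height = pd.Series([1.70, 1.82], index=["ann", "bob"])
weight = pd.Series([80.0, 65.0], index=["bob", "cat"])

# Each Series becomes a named column; rows are aligned on the union of the labels,
# and positions with no value in a given Series are filled with NaN
df = pd.DataFrame({"height": height, "weight": weight})
print(df)
```

Only "bob" appears in both Series, so it is the only row with both values present.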

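II. Usage
1. Creation:

Create a DataFrame object: pd.DataFrame([<data>,index=(0,1...n-1),columns=(0,1...n-1),dtype=<dtype>,copy=False])
  #Parameters: the others are the same as for the pd.Series class
    data: the data to store; an ndarray/Iterable/dict/DataFrame, default None
      #a dict may contain Series objects/arrays/scalars (including str)/list-like objects as values; its keys must be hashable
      #if data is a dict whose values are all scalars, index must be passed, and the length of index determines the number of rows
    columns: the column names; an array-like/pandas.Index object
      #if data is a dict and columns is not given, the dict keys become the column names rather than the index, unlike Series
      #(the docstring in the source does not spell this out)

#Examples: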
>>> pd.DataFrame(np.zeros([2,2]))
     0    1
0  0.0  0.0
1  0.0  0.0
>>> pd.DataFrame()
Empty DataFrame
Columns: []
Index: []
>>> pd.DataFrame({"a":1,"b":2,"c":3},index=[1,2,3])
#all values are scalars, so index must be passed
   a  b  c
1  1  2  3
2  1  2  3
3  1  2  3
>>> pd.DataFrame({"a":"q","b":"w","c":"e"},index=[1,2,3])
#all values are scalars
   a  b  c
1  q  w  e
2  q  w  e
3  q  w  e
>>> pd.DataFrame({"a":[5,6],"b":[3,4],"c":[1,2]})
   a  b  c
0  5  3  1
1  6  4  2
>>> pd.DataFrame([np.zeros(7),np.zeros(7)])
     0    1    2    3    4    5    6
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0  0.0  0.0  0.0

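2. Operations
(1) Indexing and slicing:

See part ..2 of Python.Third-party modules.Data analysis.Pandas module.Indexing and slicing

(2) Filtering by condition:

Filter rows by condition: <df>[<cond>]
  #Parameters:
    cond: the filter condition; a boolean Series/array with one entry per row
      #common building blocks: >,<,==,<=,>=,!=,.between(),.isnull(),&,|,~ (use ~ rather than not),.str.contains()

#Example: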
>>> df=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],index=["A","B","C","D"],columns=["e","f","g"])
>>> df[df.e>1]
    e   f   g
B   4   5   6
C   7   8   9
D  10  11  12
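
A minimal sketch of how the building blocks listed above combine. The frame and its column names (e, f, tag) are illustrative, not taken from the original example.

```python
import pandas as pd

df = pd.DataFrame(
    {"e": [1, 4, 7, 10], "f": [2.0, None, 8.0, 11.0], "tag": ["ab", "bc", "cd", "de"]},
    index=["A", "B", "C", "D"],
)

# & (and), | (or), ~ (not): wrap each comparison in parentheses,
# because these operators bind more tightly than the comparisons
both = df[(df.e > 1) & (df.e < 10)]        # rows B, C
either = df[(df.e == 1) | (df.e >= 10)]    # rows A, D
negated = df[~(df.e > 4)]                  # rows A, B

# Series methods that return a boolean mask
in_range = df[df.e.between(4, 7)]          # inclusive on both ends: rows B, C
missing = df[df.f.isnull()]                # row B
has_c = df[df.tag.str.contains("c")]       # rows B, C
```

Python's own and/or/not do not work on a boolean Series, which is why the element-wise operators are used.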

(3) Testing whether corresponding values of 2 DataFrame objects are equal:

Element-wise equality of 2 DataFrame objects: <df1>==<df2>
  #returns a DataFrame of bool

>>> df_a=pd.DataFrame([[2,3],[4,1]])
>>> df_b=pd.DataFrame([[1,3],[4,2]])
>>> df_a==df_b
       0      1
0  False   True
1   True  False
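
One caveat with == is worth a short sketch: NaN never compares equal to itself, so two identical frames containing NaN are not all-True under ==. DataFrame.equals() is the usual way to ask whether two frames are the same as a whole. The data below is illustrative.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame([[2.0, np.nan], [4.0, 1.0]])
df_b = df_a.copy()

elementwise = df_a == df_b                  # the NaN cell compares as False
all_equal = bool(elementwise.all().all())   # collapse to a single bool
same = df_a.equals(df_b)                    # NaNs in the same position count as equal

print(elementwise)
print(all_equal, same)
```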

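(4) Row and column operations:

Access a column: <df>.<column>
  #Note: ① rows cannot be accessed in a similar way ② the column is actually exposed as an attribute of <df>
  #Parameters:
    column: the column name

#Example: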
>>> df=pd.DataFrame([[1,2,3],[4,5,6]],columns=["a","b","c"],index=["A","B"])
>>> df.a
A    1
B    4
Name: a, dtype: int64
>>> df.A
Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Users\Euler\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py", line 5136, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'A'
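
Attribute access is convenient but has limits, which the following sketch tries to show. The column names "count" and "my col" are chosen on purpose to trigger those limits.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]], columns=["a", "count", "my col"], index=["A", "B"]
)

by_attr = df.a                   # works: "a" is a valid identifier with no name clash
by_key = df["a"]                 # equivalent, and always works
is_method = callable(df.count)   # df.count is the DataFrame.count method, not the column
count_col = df["count"]          # bracket access reaches the column
spaced = df["my col"]            # names that are not valid identifiers need brackets
row_a = df.loc["A"]              # rows are selected with .loc, not with attributes
```

Bracket access is the safer default whenever column names come from external data.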

#################################################################################################

Add a column: assign directly by index
              or assign via .loc
Add a row: assign via .loc

#Example:
>>> df=pd.DataFrame([[1,2,3],[5,6,7],[9,0,1]])
>>> df[3]=100
>>> df.loc[:,4]=1000
>>> df
   0  1  2    3     4
0  1  2  3  100  1000
1  5  6  7  100  1000
2  9  0  1  100  1000
>>> df[3][3]=0#chained indexing like this does not modify df
>>> df
   0  1  2    3     4
0  1  2  3  100  1000
1  5  6  7  100  1000
2  9  0  1  100  1000
>>> df.loc[1.1,:]=9.9
>>> df
       0    1    2      3       4
0.0  1.0  2.0  3.0  100.0  1000.0
1.0  5.0  6.0  7.0  100.0  1000.0
2.0  9.0  0.0  1.0  100.0  1000.0
1.1  9.9  9.9  9.9    9.9     9.9
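
As a sketch of the alternative to chained indexing, a single .loc call with both a row label and a column label writes into the frame itself. The values used here are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [5, 6, 7], [9, 0, 1]])

df[3] = 100                       # new column; the scalar is broadcast to every row
df.loc[:, 4] = 1000               # new column via .loc
df.loc[3, :] = [0, 0, 0, 0, 0]    # new row with label 3
df.loc[0, 3] = -1                 # single cell: one .loc call with (row, column)

print(df)
```

df[3][3]=0 first builds an intermediate Series from df[3] and then assigns into that Series, which is why the frame is left unchanged.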

#################################################################################################

Delete a column: use del
Delete a row: use .drop()

#Example: for deleting rows, see .drop() in III.1.(3)
>>> df=pd.DataFrame([[1,2],[5,6]])
>>> del df[0]
>>> df
   1
0  2
1  6

#################################################################################################

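Reverse the row order (flip every column top to bottom): use .loc[::-1,:]
Reverse the column order (flip every row left to right): use .loc[:,::-1]

#Example: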
>>> df=pd.DataFrame([[1,2],[5,6]])
>>> df.loc[:,::-1]
   1  0
0  2  1
1  6  5
>>> df.loc[::-1,:]
   0  1
1  5  6
0  1  2
>>> df.loc[::-1,::-1]
   1  0
1  6  5
0  2  1

#################################################################################################

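Swap the values of 2 columns: use .loc
Swap the values of 2 rows: use .loc

#Example:
>>> df=pd.DataFrame([[1,2],[5,6]])
>>> df.loc[:,[0,1]]=df.loc[:,[1,0]].values#.values is required; otherwise pandas realigns on the labels and nothing changes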
>>> df
   0  1
0  2  1
1  6  5
>>> df.loc[[0,1],:]=df.loc[[1,0],:].values#.values is required; otherwise pandas realigns on the labels and nothing changes
>>> df
   0  1
0  6  5
1  2  1
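
The role of .values can be seen by running both variants side by side. This is a sketch on the same 2x2 frame; to_numpy() is used here as an equivalent of .values.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [5, 6]])

aligned = df.copy()
# The right-hand side is a DataFrame, so .loc matches its column labels back
# to the target labels and every value returns to its own column
aligned.loc[:, [0, 1]] = aligned.loc[:, [1, 0]]

swapped = df.copy()
# to_numpy() drops the labels, so the assignment is purely positional
swapped.loc[:, [0, 1]] = swapped.loc[:, [1, 0]].to_numpy()

print(aligned)
print(swapped)
```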

#################################################################################################

Rename a column: use .rename()

#Example: see .rename() in III.1.(2)

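III. Related functions

A DataFrame can also call many of the Series functions, usually with an additional axis parameter that determines the axis along which to compute
NumPy ufuncs (element-wise array functions) can also be applied to a DataFrame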
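
A short sketch of both statements, using np.sqrt as the ufunc and sum() as the reduction (any other ufunc or reduction would serve equally well):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 4], [9, 16]], index=["a", "b"], columns=["A", "B"])

roots = np.sqrt(df)          # ufunc applied element-wise; the labels are preserved
col_sums = df.sum()          # axis=0 (default): one result per column
row_sums = df.sum(axis=1)    # axis=1: one result per row

print(roots)
print(col_sums)
print(row_sums)
```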

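1. Basic operations
(1) Read:

Get the first n rows of a DataFrame object: <df>.head(<n>)
Get the last n rows of a DataFrame object: <df>.tail(<n>)
  #both return a DataFrame object
  #Parameters:
    n: the number of rows to get; default 5

#Example: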
>>> df=pd.DataFrame([[1,2],[3,4],[5,6]])
>>> df.head(2)
   0  1
0  1  2
1  3  4
>>> df.tail(1)
   0  1
2  5  6

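(2) Update:

Change the data type: <df>.astype(<dtype>[,copy=True,errors="raise"])
  #Parameters: the others are the same as for Series.astype()
    dtype: the target data type; a str/data type/dict
      #the dict format is {<col>:<dtype>,...}, where <col> is a column name and <dtype> is the data type for that column (e.g. a str such as "int64")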
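
#Example (a sketch with illustrative data, since the original gives none for astype): one dtype for the whole frame versus a dict of per-column dtypes.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["3", "4"], "c": [1.5, 2.5]})

all_str = df.astype(str)                             # one dtype for every column
mixed = df.astype({"b": "int64", "c": "float32"})    # per column; "a" is left untouched

print(mixed.dtypes)
```

astype returns a new object, so df itself keeps its original dtypes.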

#################################################################################################

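Rename rows/columns: <df>.rename(mapper=None,index=None,columns=None,axis='index',copy=True,inplace=False,level=None,errors="ignore")
  #Parameters:
    mapper: (together with axis) how to rename; dict-like/function
      #the dict format is {<old>:<new>,...}, where <old> is an existing row/column name and <new> is its replacement
    axis: the dimension to rename along; 0/'index' or 1/'columns'
      #mapper and axis are used together
    index,columns: how to rename the row labels and the column names, respectively; dict-like/function
      #once index/columns is given, axis/mapper can no longer be given

#Example: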
>>> df=pd.DataFrame({"A":[1,2,3,4],"B":[5,6,7,8],"C":[1,1,1,1]},index=['q','w','e','r'])
>>> df.rename(mapper={'A':'a'},axis=1)
   a  B  C
q  1  5  1
w  2  6  1
e  3  7  1
r  4  8  1
>>> df.rename(index={'q':'Q'})
   A  B  C
Q  1  5  1
w  2  6  1
e  3  7  1
r  4  8  1

#################################################################################################

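Turn the given column(s) into the row index: <df>.set_index(<keys>[,drop=True,append=False,inplace=False,verify_integrity=False])
  #Parameters:
    keys: the column(s) to use; a column label/array-like, or a list of column labels/arrays
    drop: whether to remove the columns that are used as the new index; bool
    append: False replaces the existing index; True appends to the existing index

#Example: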
>>> df=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],index=["a","b","c"],columns=["A","B","C"])
>>> df
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9
>>> df.set_index(["A"])
   B  C
A
1  2  3
4  5  6
7  8  9
>>> df.set_index(["A","B"])
     C
A B
1 2  3
4 5  6
7 8  9
>>> df.set_index([["A","B","C"]])
   A  B  C
A  1  2  3
B  4  5  6
C  7  8  9
>>> df.set_index([["A","B","C"],["A","A","B"]])
     A  B  C
A A  1  2  3
B A  4  5  6
C B  7  8  9
>>> df.set_index(["A"],append=True)
     B  C
  A
a 1  2  3
b 4  5  6
c 7  8  9

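(3) Delete:

Remove the given rows/columns: <df>.drop([labels=None,axis=0,index=None,columns=None,inplace=False,errors="raise"])
  #Parameters: the others are the same as for Series.drop()
    columns: the names of the columns to remove
      #at least 1 of labels/index/columns must be given
    axis: the axis that labels refers to; 0 for rows, 1 for columns (see the examples)
      #only relevant when labels is passed

#Example: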
>>> df=pd.DataFrame([[1,2],[3,4],[5,6]])
>>> df.drop(columns=0,index=0)
   1
1  4
2  6
>>> df.drop(0)
   0  1
1  3  4
2  5  6
>>> df.drop(0,0)#axis passed positionally; pandas 2.0+ only accepts it as a keyword (axis=0)
   0  1
1  3  4
2  5  6
>>> df.drop(0,axis=0)
   0  1
1  3  4
2  5  6
>>> df.drop(0,axis=1)
   1
0  2
1  4
2  6
>>> df.drop(index=0,columns=1,axis=1)
   0
1  3
2  5
>>> df.drop(index=0,columns=1)
   0
1  3
2  5
>>> df.drop(0,0,axis=1)
Traceback (most recent call last):
  File "", line 1, in <module>
TypeError: drop() got multiple values for argument 'axis'
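
Two further uses of drop(), sketched with illustrative column names x and y: removing several labels at once, and errors="ignore" for labels that may not exist.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["x", "y"])

rows_gone = df.drop(index=[0, 2])                      # a list removes several rows
safe = df.drop(columns=["y", "z"], errors="ignore")    # "z" does not exist and is skipped

print(rows_gone)
print(safe)
```

With the default errors="raise", the missing label "z" would raise a KeyError instead.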

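(4) Create:

Add new columns: <df>.assign([**kwargs])
  #returns a new DataFrame; each keyword is a column name, and its value is either the column data or a function of <df>

#Example: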
>>> df=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],index=["k1","k2","k3"],columns=["v1","v2","v3"])
>>> df.assign(v4=np.random.randint(0,10,size=3))
    v1  v2  v3  v4
k1   1   2   3   0
k2   4   5   6   3
k3   7   8   9   3
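>>> df.assign(v2=lambda x:x*2)#x*2 is a whole DataFrame; on the pandas version used here this behaves like lambda x:x.v1*2, and other versions may raise instead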
    v1  v2  v3
k1   1   2   3
k2   4   8   6
k3   7  14   9
>>> df.assign(v2=lambda x:x.v3*2)
    v1  v2  v3
k1   1   6   3
k2   4  12   6
k3   7  18   9

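2. Reshaping and merging
(1) Pivoting between columns and rows:

Stack the columns into the rows: <df>.stack([level=-1,dropna=True])
  #Parameters:
    level: the column level(s) to stack; int/str/int list/str list
    dropna: whether to drop rows of the result whose value is NaN; bool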

#Example:
>>> df=pd.DataFrame([[1,2],[3,np.nan]],index=["a","b"],columns=["A","B"])
>>> df.stack()
a  A    1.0
   B    2.0
b  A    3.0
dtype: float64
>>> df.stack(dropna=False)
a  A    1.0
   B    2.0
b  A    3.0
   B    NaN
dtype: float64
>>> df=pd.DataFrame([[1.0, 2.0],[3.0, 4.0]],index=['cat','dog'],columns=[["w","h"],["kg","m"]])
>>> df.stack()
          h    w
cat kg  NaN  1.0
    m   2.0  NaN
dog kg  NaN  3.0
    m   4.0  NaN
>>> df.stack(0)
        kg    m
cat h  NaN  2.0
    w  1.0  NaN
dog h  NaN  4.0
    w  3.0  NaN

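(2) Merging:

Join column-wise: <df>.join(<other>[,on=None,how="left",lsuffix="",rsuffix="",sort=False])
  #rows are matched on the index, so labels that have no match in <other> produce NaN (see the example)
  #Parameters:
    other: the other dataset; a DataFrame/Series/list of DataFrame
    sort: whether to sort the result by the join key; bool

#Example: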
>>> df1=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],index=["A","B","C","D"],columns=["a","b","c"])
>>> df2=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
>>> df1.join(df2)
    a   b   c   0   1   2
A   1   2   3 NaN NaN NaN
B   4   5   6 NaN NaN NaN
C   7   8   9 NaN NaN NaN
D  10  11  12 NaN NaN NaN
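
For contrast with the all-NaN result above, here is a sketch where the indexes partly overlap. The frames left and right are illustrative; lsuffix/rsuffix are needed because both contain a column named "a".

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=["A", "B", "C"])
right = pd.DataFrame({"a": [10, 30], "b": [0.1, 0.3]}, index=["A", "C"])

# how="left" (default): keep every row of left; B has no match and gets NaN
left_join = left.join(right, lsuffix="_l", rsuffix="_r")

# how="inner": keep only the labels present in both frames
inner_join = left.join(right, how="inner", lsuffix="_l", rsuffix="_r")

print(left_join)
print(inner_join)
```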

3. Statistical analysis
(1) Basics:

Get basic summary statistics: <df>.describe()

#Example:
>>> df=pd.DataFrame([[1,2,3,"d1","e1"],[6,7,8,"d2","e2"]],columns=["a","b","c","d","e"])
>>> df.describe()
              a         b         c
count  2.000000  2.000000  2.000000
mean   3.500000  4.500000  5.500000
std    3.535534  3.535534  3.535534
min    1.000000  2.000000  3.000000
25%    2.250000  3.250000  4.250000
50%    3.500000  4.500000  5.500000
75%    4.750000  5.750000  6.750000
max    6.000000  7.000000  8.000000
#by default, non-numeric columns are described only when <df> contains no numeric columns:
>>> df[["d","e"]].describe()
         d   e
count    2   2
unique   2   2
top     d2  e1
freq     1   1
>>> df=pd.DataFrame([["a1","b1","c1","d1","e1"],["a2","b2","c2","d2","e2"]],columns=["a","b","c","d","e"])
>>> df.describe()
         a   b   c   d   e
count    2   2   2   2   2
unique   2   2   2   2   2
top     a1  b2  c2  d2  e1
freq     1   1   1   1   1

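(2) Correlation:

Compute correlation coefficients: <df>.corr()
  #computes the pairwise correlation coefficient (Pearson by default) between the columns of <df>

#Example: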
>>> df=pd.DataFrame([[1,2,3],[4,5,6],[5,4,3]])
>>> df.corr()
          0         1         2
0  1.000000  0.838628  0.277350
1  0.838628  1.000000  0.755929
2  0.277350  0.755929  1.000000

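(3) Extrema:

Cumulative maximum (the maximum of the first 1..n values): <df>.cummax([axis=0])
Cumulative minimum (the minimum of the first 1..n values): <df>.cummin([axis=0])
  #Parameters:
    axis: the axis to accumulate along; int/None (defaults to 0)

#Example: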
>>> df=pd.DataFrame([[3,2,5,1,9],[1,3,4,6,2],[5,-2,11,0,4]])
>>> df.cummax()
   0  1   2  3  4
0  3  2   5  1  9
1  3  3   5  6  9
2  5  3  11  6  9
>>> df.cummax(1)
   0  1   2   3   4
0  3  3   5   5   9
1  1  3   4   6   6
2  5  5  11  11  11
>>> df.cummin()
   0  1  2  3  4
0  3  2  5  1  9
1  1  2  4  1  2
2  1 -2  4  0  2
>>> df.cummin(1)
   0  1  2  3  4
0  3  2  2  1  1
1  1  1  1  1  1
2  5 -2 -2 -2 -2

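(4) Cross analysis:

Cross analysis (pivot): <df>.pivot(columns=None[,index=None,values=None])
  #long data → wide data; does not support aggregation; equivalent to pd.pivot(<df>,columns=None[,index=None,values=None])
  #Parameters:
    index,columns: the row/column labels, respectively; a column label/list of column labels
    values: the data; a column label/list of column labels

#Example: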
>>> df=pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]],index=["a","b","c","d"],columns=["A","B","C","D"])
>>> df.pivot(columns="A")
     B                     C                     D
A   1    5     9     13   1    5     9     13   1    5     9     13
a  2.0  NaN   NaN   NaN  3.0  NaN   NaN   NaN  4.0  NaN   NaN   NaN
b  NaN  6.0   NaN   NaN  NaN  7.0   NaN   NaN  NaN  8.0   NaN   NaN
c  NaN  NaN  10.0   NaN  NaN  NaN  11.0   NaN  NaN  NaN  12.0   NaN
d  NaN  NaN   NaN  14.0  NaN  NaN   NaN  15.0  NaN  NaN   NaN  16.0
>>> df.pivot(columns="B",values="B")
B   2    6     10    14
a  2.0  NaN   NaN   NaN
b  NaN  6.0   NaN   NaN
c  NaN  NaN  10.0   NaN
d  NaN  NaN   NaN  14.0
>>> df.pivot(columns="B",values="B",index="C")
B    2    6     10    14
C
3   2.0  NaN   NaN   NaN
7   NaN  6.0   NaN   NaN
11  NaN  NaN  10.0   NaN
15  NaN  NaN   NaN  14.0

######################################################################################################################

Cross analysis (pivot table): <df>.pivot_table([values=None,index=None,columns=None,aggfunc="mean",fill_value=None,margins=False,dropna=True,margins_name="All",observed=False])
  #long data → wide data; supports aggregation; see section III.2.(3) of Python.Third-party modules.Data analysis.Pandas module.Introduction, IO, functions
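
#Example (a sketch; the city/year/sales data is made up): the difference between "supports aggregation" and "does not support aggregation" shows up as soon as an index/column pair occurs more than once.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["X", "X", "Y", "Y", "X"],
    "year": [2020, 2021, 2020, 2021, 2020],
    "sales": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# ("X", 2020) occurs twice; pivot_table combines the two values with aggfunc
table = df.pivot_table(values="sales", index="city", columns="year", aggfunc="mean")
print(table)

# pivot cannot aggregate, so the same duplicate pair makes it raise
pivot_error = None
try:
    df.pivot(index="city", columns="year", values="sales")
except ValueError as exc:
    pivot_error = str(exc)
print(pivot_error)
```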

######################################################################################################################

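Undo the cross analysis (unpivot): <df>.melt([id_vars=None,value_vars=None,var_name=None,value_name="value",col_level=None,ignore_index=True])
  #wide data → long data; equivalent to pd.melt(<df>[,id_vars=None,value_vars=None,var_name=None,value_name="value",col_level=None,ignore_index=True])
  #Parameters:
    id_vars: the columns to keep as identifiers (not unpivoted); a column label/list of column labels, default None
    value_vars: the columns to unpivot; a column label/list of column labels, default all columns not in id_vars
    var_name,value_name: the names of the resulting variable column and value column, respectively; str (var_name=None yields "variable")
    col_level: the column level to unpivot; str/int

#Example: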
>>> df=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],index=["a","b","c"],columns=["A","B","C"])
>>> df.melt()
  variable  value
0        A      1
1        A      4
2        A      7
3        B      2
4        B      5
5        B      8
6        C      3
7        C      6
8        C      9
>>> df.melt(id_vars="A")
   A variable  value
0  1        B      2
1  4        B      5
2  7        B      8
3  1        C      3
4  4        C      6
5  7        C      9
>>> df.melt(value_vars=["A","B"],var_name="new",value_name="val")
  new  val
0   A    1
1   A    4
2   A    7
3   B    2
4   B    5
5   B    8

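4. Handling missing values:

Drop missing values: <df>.dropna([axis=0,how="any",thresh=None,subset=None,inplace=False])
  #Parameters:
    axis: whether to drop rows or columns; 0/'index' (drop rows) or 1/'columns' (drop columns)
    how: the drop condition; "all" (drop only if every value is missing)/"any" (drop if any value is missing)
    inplace: whether to modify <df> in place; bool

#Example:
>>> df=pd.DataFrame([[1,32,4,np.nan],[4,3,4,2],[0,1,2,3]])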
>>> df.dropna()
   0  1  2    3
1  4  3  4  2.0
2  0  1  2  3.0
>>> df.dropna(axis=1)
   0   1  2
0  1  32  4
1  4   3  4
2  0   1  2
>>> df.dropna(how="all")
   0   1  2    3
0  1  32  4  NaN
1  4   3  4  2.0
2  0   1  2  3.0
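
The thresh and subset parameters appear in the signature but not in the examples. A minimal sketch with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, np.nan], [4, 3, np.nan], [0, 1, 2]], columns=["a", "b", "c"]
)

by_subset = df.dropna(subset=["b"])   # only column "b" is checked: row 0 is dropped
by_thresh = df.dropna(thresh=3)       # keep rows with at least 3 non-missing values
complete_cols = df.dropna(axis=1)     # drop every column that contains a missing value

print(by_subset)
print(by_thresh)
print(complete_cols)
```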

######################################################################################################################

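Fill missing values: <df>.fillna([value=None,method=None,axis=None,inplace=False,limit=None,downcast=None])
  #Parameters:
    value: the value to fill with; a scalar, or a dict/Series/DataFrame that maps columns to fill values

#Example: continued from above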
>>> df.fillna(value=99)
   0   1  2     3
0  1  32  4  99.0
1  4   3  4   2.0
2  0   1  2   3.0
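
Besides a single scalar, value can be a dict with one fill value per column, and ffill() carries the last valid value forward. This is a sketch on a separate illustrative frame, not the df used above.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

per_column = gaps.fillna({"a": 0.0, "b": -1.0})   # a different fill value per column
forward = gaps.ffill()                            # propagate the last valid value downward

print(per_column)
print(forward)
```

The first row of column b stays NaN after ffill() because there is no earlier value to carry forward.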

######################################################################################################################

Test for missing values: <df>.isnull()

#Example: continued from above
>>> df.isnull()
       0      1      2      3
0  False  False  False   True
1  False  False  False  False
2  False  False  False  False

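5. Miscellaneous:

Apply a function to every column/row: <df>.apply(<func>,axis=0,raw=False,result_type=None,args=(),**kwds)
  #Parameters:
    func: the function to apply; a function
      #may be a named function or a lambda; it must be callable with a single positional argument (a column or row of <df>) once args/kwds are supplied
    axis: 0/"index" applies <func> to each column; 1/"columns" applies <func> to each row
    raw: True passes each column/row to <func> as an ndarray; False passes each column/row to <func> as a Series
    result_type: "expand" turns a list-like return value into columns
                 "reduce" returns a Series where possible
                 "broadcast" broadcasts the return value to a DataFrame with the same shape as <df>
                 None decides from the type of <func>'s return value (list-like→Series,Series→DataFrame)
                   #documented as only taking effect when axis=1/"columns"
    args,kwds: additional positional/keyword arguments to pass to <func>; a tuple and a dict, respectively

#Example: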
>>> df=pd.DataFrame([[1,2],[3,4],[5,6]],index=["a","b","c"],columns=["A","B"])
>>> df
   A  B
a  1  2
b  3  4
c  5  6
>>> def f(df):
...     return [1,2,3]
...
>>> df.apply(f,result_type="reduce")
A    [1, 2, 3]
B    [1, 2, 3]
dtype: object
>>> type(df.apply(f,result_type="reduce"))
<class 'pandas.core.series.Series'>
>>> df.apply(f,result_type="expand")
   A  B
a  1  1
b  2  2
c  3  3
>>> type(df.apply(f,result_type="expand"))
<class 'pandas.core.frame.DataFrame'>
>>> df.apply(f,result_type="broadcast")
   A  B
a  1  1
b  2  2
c  3  3
>>> type(df.apply(f,result_type="broadcast"))
<class 'pandas.core.frame.DataFrame'>
>>> def ff(df):
...     return 101
...
>>> df.apply(ff,result_type="reduce")
A    101
B    101
dtype: int64
>>> type(df.apply(ff,result_type="reduce"))
<class 'pandas.core.series.Series'>
>>> df.apply(ff,result_type="expand")
A    101
B    101
dtype: int64
>>> type(df.apply(ff,result_type="expand"))
<class 'pandas.core.series.Series'>
>>> df.apply(ff,result_type="broadcast")
     A    B
a  101  101
b  101  101
c  101  101
>>> type(df.apply(ff,result_type="broadcast"))
<class 'pandas.core.frame.DataFrame'>
>>> def fff(df):
...     return pd.DataFrame([[1,2],[3,4]])
...
>>> df.apply(fff,result_type="reduce")
A       0  1
0  1  2
1  3  4
B       0  1
0  1  2
1  3  4
dtype: object
>>> type(df.apply(fff,result_type="reduce"))
<class 'pandas.core.series.Series'>
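
The examples above all use axis=0 with constant return values. As a complement, here is a sketch with functions that actually use their input, including axis=1 and args/kwds. The helper scale and its parameters factor and offset are hypothetical names introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], index=["a", "b", "c"], columns=["A", "B"])

# axis=0 (default): the function receives one column at a time
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time
row_sum = df.apply(lambda row: row["A"] + row["B"], axis=1)

# Extra positional and keyword arguments are forwarded to the function
def scale(s, factor, offset=0):
    return s * factor + offset

scaled = df.apply(scale, args=(10,), offset=1)

# axis=1 with a list-like return value: "expand" spreads it over columns 0, 1
expanded = df.apply(
    lambda row: [row["A"], row["A"] * row["B"]], axis=1, result_type="expand"
)
```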

######################################################################################################################

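Apply a function to every element: <df>.applymap(<func>)
  #deprecated since pandas 2.1 in favor of <df>.map(<func>)
  #Parameters:
    func: the function; a callable object

#Example: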
>>> df=pd.DataFrame([[1,2],[-3,4],[-5,6]],index=["a","b","c"],columns=["A","B"])
>>> def f(x):
...     return 1 if x>0 else -1
...
>>> df.applymap(f)
   A  B
a  1  1
b -1  1
c -1  1
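
Newer pandas releases (2.1 and later) provide DataFrame.map as the replacement name for applymap. A version-tolerant sketch, assuming only that one of the two methods exists:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [-3, 4]], columns=["A", "B"])

# Prefer DataFrame.map where it exists; fall back to applymap on older versions
elementwise = df.map if hasattr(df, "map") else df.applymap
signs = elementwise(lambda x: 1 if x > 0 else -1)

print(signs)
```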

IV. Attributes:

Attributes can be read; some of them (such as index and columns) can also be assigned
Number of elements: <df>.size

#Example:
>>> pd.DataFrame([[1,2],[3,4],[3,2]]).size
6

#################################################################################################

Get the shape of a DataFrame object: (<a>,<b>)=<df>.shape
  #Parameters:
    df: the DataFrame object
    a,b: len(<df>.index) and len(<df>.columns), respectively

#Example:
>>> df=pd.DataFrame(np.zeros([2,3]))
>>> df.shape
(2, 3)

#################################################################################################

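Return the column labels as an Index object: <df>.columns
Return the row labels as an Index object: <df>.index
  #a RangeIndex when the labels are the default 0..n-1

#Example: continued from above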
>>> df.columns
RangeIndex(start=0, stop=3, step=1)
>>> df.index
RangeIndex(start=0, stop=2, step=1)
>>> df.columns=['q','w','e']
>>> df
     q    w    e
0  0.0  0.0  0.0
1  0.0  0.0  0.0

#################################################################################################

Return all the data as a two-dimensional array: <df>.values
  #the return value is a numpy.ndarray object

#Example: continued from above
>>> df.values
array([[0., 0., 0.],
       [0., 0., 0.]])
