Scientific Computing Study Series 02: Pandas


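The examples below assume that NumPy and pandas have been imported as np and pd (import numpy as np, import pandas as pd).

  • Generating date data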

In [50]: pd.date_range('20190701',periods=6)
Out[50]:
DatetimeIndex(['2019-07-01', '2019-07-02', '2019-07-03', '2019-07-04',
               '2019-07-05', '2019-07-06'],
              dtype='datetime64[ns]', freq='D')
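
As a small sketch of the other ways a range can be specified, the block below builds the same six days from two endpoints and then a weekly range. The variable names (daily, by_end, mondays) are illustrative and not part of the original session.

```python
import pandas as pd

# Six consecutive days, the same call as above written with keyword arguments.
daily = pd.date_range(start="2019-07-01", periods=6, freq="D")

# A range can also be given by its two endpoints; both are included.
by_end = pd.date_range(start="2019-07-01", end="2019-07-06")

# freq="W-MON" yields Mondays; 2019-07-01 is itself a Monday.
mondays = pd.date_range(start="2019-07-01", periods=3, freq="W-MON")

print(daily.equals(by_end))
print(list(mondays.strftime("%Y-%m-%d")))
```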
  • Series

In [55]: pd.Series([1,3,4,8,-2],index=['a','b','c','d','e'])
Out[55]:
a    1
b    3
c    4
d    8
e   -2
dtype: int64
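
A minimal sketch of how such a labelled Series can be read back, by label, by position, and with a boolean filter. The name s is an assumption made for this example.

```python
import pandas as pd

s = pd.Series([1, 3, 4, 8, -2], index=["a", "b", "c", "d", "e"])

by_label = s["c"]        # access by index label
by_position = s.iloc[0]  # access by integer position
large = s[s > 3]         # boolean filtering keeps the original labels

print(by_label, by_position)
print(large)
```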
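  • Build a DataFrame from a NumPy array or directly from data

Here dates is the DatetimeIndex returned by pd.date_range('20190701',periods=6).
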
In [49]: pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d'])
Out[49]:
                   a         b         c         d
2019-07-01 -0.943911  0.930244 -1.002432 -1.495716
2019-07-02 -0.529640  0.559569 -0.552342 -1.403447
2019-07-03  1.226341  0.277729  0.014151  0.154364
2019-07-04 -1.767719 -0.798156 -0.555459 -0.746608
2019-07-05 -0.922795  0.592672  0.295197 -0.187842
2019-07-06  1.384318  0.924977  1.320110 -0.784771
In [48]: pd.DataFrame(np.arange(15).reshape(3,5))
Out[48]:
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
  • Build a DataFrame from a dict

Each key becomes a column name; row labels are generated automatically, starting from 0.

In [59]: df = pd.DataFrame({'A':1.,'B':pd.Timestamp("20190701"),'C':pd.Series(1,index=list(range(4)),dtype='float32'),'D':np.array([3]*4,dtype='int32'),'E':pd.Categorical(["test","train","test","train"]),'F':'foo'})
Out[59]:
     A          B    C  D      E    F
0  1.0 2019-07-01  1.0  3   test  foo
1  1.0 2019-07-01  1.0  3  train  foo
2  1.0 2019-07-01  1.0  3   test  foo
3  1.0 2019-07-01  1.0  3  train  foo
  • View the data type of each column
In [61]: df.dtypes
Out[61]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
  • View the row index
In [62]: df.index
Out[62]: Int64Index([0, 1, 2, 3], dtype='int64')
  • View the column names
In [63]: df.columns
Out[63]: Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
  • View all values
In [64]: df.values
Out[64]:
array([[1.0, Timestamp('2019-07-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2019-07-01 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2019-07-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2019-07-01 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
  • View summary statistics

By default only numeric columns are summarized.

In [66]: df.describe()
Out[66]:
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
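
The following sketch illustrates the numeric-only default on a small, hypothetical frame (the values and the names numeric_summary and text_summary are invented for the example). Calling describe() on a single text column gives a different set of statistics.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "D": [3, 3, 3, 3],
    "F": ["foo", "bar", "foo", "foo"],
})

# Only the numeric columns A and D appear in the default summary.
numeric_summary = df.describe()

# For a text column, describe() reports count, unique, top and freq instead.
text_summary = df["F"].describe()

print(numeric_summary)
print(text_summary)
```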
  • Transpose: rows become columns and columns become rows
In [67]: df.T
Out[67]:
                     0                    1                    2                    3
A                    1                    1                    1                    1
B  2019-07-01 00:00:00  2019-07-01 00:00:00  2019-07-01 00:00:00  2019-07-01 00:00:00
C                    1                    1                    1                    1
D                    3                    3                    3                    3
E                 test                train                 test                train
F                  foo                  foo                  foo                  foo
  • Sorting

axis=0 sorts by the row index and axis=1 sorts by the column labels; ascending=False sorts in descending order.

In [71]: df.sort_index(axis=0,ascending=False)
Out[71]:
     A          B    C  D      E    F
3  1.0 2019-07-01  1.0  3  train  foo
2  1.0 2019-07-01  1.0  3   test  foo
1  1.0 2019-07-01  1.0  3  train  foo
0  1.0 2019-07-01  1.0  3   test  foo

Sort by the values in column E

In [74]: df.sort_values(by='E')
Out[74]:
     A          B    C  D      E    F
0  1.0 2019-07-01  1.0  3   test  foo
2  1.0 2019-07-01  1.0  3   test  foo
1  1.0 2019-07-01  1.0  3  train  foo
3  1.0 2019-07-01  1.0  3  train  foo
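
A short sketch of two variations on the calls above: sorting by several columns with a separate direction for each, and sorting the column labels with axis=1. The score column and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "E": ["test", "train", "test", "train"],
    "score": [0.7, 0.9, 0.8, 0.6],
})

# Sort by E ascending, then by score descending within each value of E.
ordered = df.sort_values(by=["E", "score"], ascending=[True, False])

# axis=1 sorts the column labels; here in descending order.
by_columns = df.sort_index(axis=1, ascending=False)

print(ordered)
print(by_columns.columns.tolist())
```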

II. Selecting data

  • Selecting a column
In [76]: dates = pd.date_range('20190701',periods=6)
In [79]: df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
Out[79]:
             A   B   C   D
2019-07-01   0   1   2   3
2019-07-02   4   5   6   7
2019-07-03   8   9  10  11
2019-07-04  12  13  14  15
2019-07-05  16  17  18  19
2019-07-06  20  21  22  23

In [81]: df.A
Out[81]:
2019-07-01     0
2019-07-02     4
2019-07-03     8
2019-07-04    12
2019-07-05    16
2019-07-06    20
Freq: D, Name: A, dtype: int32

In [82]: df['A']
Out[82]:
2019-07-01     0
2019-07-02     4
2019-07-03     8
2019-07-04    12
2019-07-05    16
2019-07-06    20
Freq: D, Name: A, dtype: int32
  • Slicing rows
In [84]: df[0:3]
Out[84]:
            A  B   C   D
2019-07-01  0  1   2   3
2019-07-02  4  5   6   7
2019-07-03  8  9  10  11

In [85]: df['2019-07-05':'2019-07-06']
Out[85]:
             A   B   C   D
2019-07-05  16  17  18  19
2019-07-06  20  21  22  23
  • Selection by label
    Single row
In [94]: df.loc['20190701']
Out[94]:
A    0
B    1
C    2
D    3
Name: 2019-07-01 00:00:00, dtype: int32

Multiple columns

In [98]: df.loc[:,['A','B']]
Out[98]:
             A   B
2019-07-01   0   1
2019-07-02   4   5
2019-07-03   8   9
2019-07-04  12  13
2019-07-05  16  17
2019-07-06  20  21
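  • Selection by position

Rows at positions 3 and 4 (the fourth and fifth rows) and columns at positions 1 and 2 (the second and third columns); positions start at 0 and the end of each slice is excluded.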

In [100]: df.iloc[3:5,1:3]
Out[100]:
             B   C
2019-07-04  13  14
2019-07-05  17  18
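
The difference between label slices and position slices is easy to trip over, so here is a minimal sketch that rebuilds the same 6x4 frame and selects the same block both ways. The names by_label and by_position are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190701", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# Label-based slice: both endpoints are included.
by_label = df.loc["2019-07-04":"2019-07-05", "B":"C"]

# Position-based slice: the end position is excluded.
by_position = df.iloc[3:5, 1:3]

print(by_label.equals(by_position))
```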
  • Filtering with comparisons (greater than, less than, equal to)

Show all rows where column A is greater than 8

In [103]: df[df.A>8]
Out[103]:
             A   B   C   D
2019-07-04  12  13  14  15
2019-07-05  16  17  18  19
2019-07-06  20  21  22  23
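
Several conditions can be combined in one filter. The sketch below assumes the same 6x4 frame as above; the thresholds and result names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190701", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# Each condition needs its own parentheses; & means "and", | means "or".
both = df[(df.A > 8) & (df.D < 23)]
either = df[(df.A < 4) | (df.A > 16)]

# isin() tests membership in a list of values.
members = df[df.B.isin([1, 21])]

print(both)
print(either)
```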

III. Modifying data

  • By label
In [20]: df.loc['20190701','A']=6

In [21]: df
Out[21]:
             A   B   C   D
2019-07-01   6   1   2   3
2019-07-02   4   5   6   7
2019-07-03   8   9  10  11
2019-07-04  12  13  14  15
2019-07-05  16  17  18  19
2019-07-06  20  21  22  23
  • By position
In [24]: df.iloc[2,2]=22

In [25]: df
Out[25]:
             A   B   C   D
2019-07-01   6   1   2   3
2019-07-02   4   5   6   7
2019-07-03   8   9  22  11
2019-07-04  12  13  14  15
2019-07-05  16  17  18  19
2019-07-06  20  21  22  23
  • By condition
In [26]: df.loc[df.A>6,'A']=0

In [27]: df
Out[27]:
            A   B   C   D
2019-07-01  6   1   2   3
2019-07-02  4   5   6   7
2019-07-03  0   9  22  11
2019-07-04  0  13  14  15
2019-07-05  0  17  18  19
2019-07-06  0  21  22  23
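
Conditional assignment is most reliable when the row mask and the target column are given in a single .loc call; chained indexing such as df.A[mask] = value may modify a temporary copy and can trigger a SettingWithCopyWarning. A minimal sketch, using a freshly built frame and arbitrary replacement values:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190701", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# Row mask and column label in one .loc call: modifies df itself.
df.loc[df["A"] > 6, "A"] = 0

# The same pattern can update several columns at once.
df.loc[df["B"] > 10, ["C", "D"]] = -1

print(df)
```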

IV. Adding columns

  • Add a column E with NaN as its default value
In [29]: df['E']=np.nan

In [30]: df
Out[30]:
            A   B   C   D   E
2019-07-01  6   1   2   3 NaN
2019-07-02  4   5   6   7 NaN
2019-07-03  0   9  22  11 NaN
2019-07-04  0  13  14  15 NaN
2019-07-05  0  17  18  19 NaN
2019-07-06  0  21  22  23 NaN
  • Add a column F with given values; the Series must carry an index so that its values align with the row labels of df
In [31]: df['F'] = pd.Series([1,2,3,4,5,6],index=pd.date_range('20190701',periods=6))

In [32]: df
Out[32]:
            A   B   C   D   E  F
2019-07-01  6   1   2   3 NaN  1
2019-07-02  4   5   6   7 NaN  2
2019-07-03  0   9  22  11 NaN  3
2019-07-04  0  13  14  15 NaN  4
2019-07-05  0  17  18  19 NaN  5
2019-07-06  0  21  22  23 NaN  6
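
The reason the index matters is that a Series is assigned by label, not by position. The sketch below, with hypothetical column names G and H, shows what happens when only some labels match and how a plain list behaves instead.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190701", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# Only three of the six dates appear in this Series' index.
partial = pd.Series([10, 20, 30], index=dates[:3])
df["G"] = partial  # rows without a matching label receive NaN

# A plain list has no labels; it is assigned by position and must match the length.
df["H"] = [1, 2, 3, 4, 5, 6]

print(df)
```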

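V. Handling missing data

  • Drop rows or columns that contain NaN

axis: 0 drops rows, 1 drops columns. how: 'any' drops a row or column if it contains any NaN, 'all' only if every value in it is NaN. The default is 'any'.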

In [51]: df
Out[51]:
             A     B     C   D
2019-07-01   0   NaN   2.0   3
2019-07-02   4   5.0   NaN   7
2019-07-03   8   9.0  10.0  11
2019-07-04  12  13.0  14.0  15
2019-07-05  16  17.0  18.0  19
2019-07-06  20  21.0  22.0  23

In [52]: df.dropna(axis=0,how='any')
Out[52]:
             A     B     C   D
2019-07-03   8   9.0  10.0  11
2019-07-04  12  13.0  14.0  15
2019-07-05  16  17.0  18.0  19
2019-07-06  20  21.0  22.0  23
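
A sketch comparing the axis and how options side by side, plus the subset parameter, on a frame with the same two NaN cells as above. The result names are illustrative only.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190701", periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)),
                  index=dates, columns=list("ABCD"))
df.iloc[0, 1] = np.nan  # first row, column B
df.iloc[1, 2] = np.nan  # second row, column C

rows_any = df.dropna(axis=0, how="any")  # drops the two rows that contain a NaN
rows_all = df.dropna(axis=0, how="all")  # no row is entirely NaN, nothing is dropped
cols_any = df.dropna(axis=1, how="any")  # drops columns B and C
subset_b = df.dropna(subset=["B"])       # only NaN in column B is considered

print(len(rows_any), len(rows_all), list(cols_any.columns), len(subset_b))
```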
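  • Fill NaN with a specified value (the output below was recorded when df contained two more NaN cells than the frame shown in In [51], in column A of the first row and column B of the second, so more cells are filled)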
In [60]: df.fillna(value=999)
Out[60]:
                A      B      C   D
2019-07-01  999.0  999.0    2.0   3
2019-07-02    4.0  999.0  999.0   7
2019-07-03    8.0    9.0   10.0  11
2019-07-04   12.0   13.0   14.0  15
2019-07-05   16.0   17.0   18.0  19
2019-07-06   20.0   21.0   22.0  23
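
Filling does not have to use one constant for the whole frame. As a hedged sketch, the block below fills each column with its own value and then shows forward filling; the choice of 0 and of the column mean is arbitrary.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190701", periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)),
                  index=dates, columns=list("ABCD"))
df.iloc[0, 1] = np.nan  # first row, column B
df.iloc[1, 2] = np.nan  # second row, column C

# A dict gives each column its own fill value.
filled = df.fillna({"B": 0, "C": df["C"].mean()})

# ffill() copies the previous row's value downward; a NaN in the first row stays NaN.
forward = df.ffill()

print(filled)
print(forward)
```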
  • Check whether any data is missing
df.isnull().values.any()
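
Beyond a single yes/no answer, it is often useful to know where the gaps are. A minimal sketch, assuming the same frame with two NaN cells; the variable names are invented for the example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190701", periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)),
                  index=dates, columns=list("ABCD"))
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

has_missing = bool(df.isnull().values.any())  # True if at least one cell is NaN
per_column = df.isnull().sum()                # number of NaN cells in each column
total = int(per_column.sum())                 # number of NaN cells overall

print(has_missing, total)
print(per_column)
```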

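VI. Exporting to and importing from files

  • Export
df.to_csv('a.csv')
  • Import

read_csv generates a new integer row index by default, so the index written by to_csv comes back as an ordinary column (named 'Unnamed: 0') unless index_col=0 is passed.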

pd.read_csv('a.csv')
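
The round trip can be sketched without touching the disk. In the block below an in-memory io.StringIO buffer stands in for 'a.csv' (an assumption made so the example leaves no file behind), and the small 3x2 frame is invented for illustration.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range("20190701", periods=3)
df = pd.DataFrame(np.arange(6).reshape((3, 2)), index=dates, columns=["A", "B"])

# Write the CSV text into the buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer)

# Default read: a fresh 0..n-1 index, and the saved index becomes a column.
buffer.seek(0)
plain = pd.read_csv(buffer)

# index_col=0 restores the first column as the index; parse_dates=True parses it as dates.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(plain.columns.tolist())
print(restored.index)
```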

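VII. Combining data

1. concat

  • Vertical or horizontal concatenation

axis defaults to 0. axis=0 stacks the frames vertically (rows under rows); axis=1 places them side by side.
join defaults to 'outer' and applies to the labels on the other axis (the columns when stacking vertically, the row index when stacking horizontally): 'outer' keeps the union of the labels, 'inner' keeps only their intersection.
ignore_index=True renumbers the labels along the concatenation axis: the row index for a vertical concat, the column labels for a horizontal one.
join_axes=[df1.index] restricts a horizontal concat to the row index of df1. This parameter was deprecated in pandas 0.25 and removed in 1.0; on newer versions, call .reindex(df1.index) on the concatenated result instead.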

In [100]: df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'],index=[1,2,3])
In [101]: df2 = pd.DataFrame(np.ones((3,4))*1,columns=['b','c','d','e'],index=[2,3,4])


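With ignore_index=True the rows are renumbered from 0 to 5; without it the original labels 1, 2, 3, 2, 3, 4 would be kept.

In [123]: pd.concat([df1,df2],ignore_index=True)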

Out[123]:
     a    b    c    d    e
0  0.0  0.0  0.0  0.0  NaN
1  0.0  0.0  0.0  0.0  NaN
2  0.0  0.0  0.0  0.0  NaN
3  NaN  1.0  1.0  1.0  1.0
4  NaN  1.0  1.0  1.0  1.0
5  NaN  1.0  1.0  1.0  1.0

In [130]: pd.concat([df1,df2],axis=1)
Out[130]:
     a    b    c    d    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
4  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0

In [131]: pd.concat([df1,df2],axis=1,ignore_index=True)
Out[131]:
     0    1    2    3    4    5    6    7
1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
4  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0

In [137]: pd.concat([df1,df2],axis=1,join_axes=[df1.index])
Out[137]:
     a    b    c    d    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
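
Two options described above are not exercised by the session: join='inner', and a way to get the join_axes result on pandas 1.0 or later. The sketch below rebuilds df1 and df2 and shows both; inner_rows and aligned are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4)), columns=list("abcd"), index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)), columns=list("bcde"), index=[2, 3, 4])

# join='inner' keeps only the column labels shared by both frames (b, c, d).
inner_rows = pd.concat([df1, df2], join="inner", ignore_index=True)

# Side-by-side concat restricted to df1's row labels, replacing join_axes.
aligned = pd.concat([df1, df2], axis=1).reindex(df1.index)

print(inner_rows)
print(aligned)
```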

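2. merge

3. Appending rows with append

DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions use pd.concat instead.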

In [143]: df1
Out[143]:
     a    b    c    d
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0

In [144]: s1 = pd.Series([1,2,3,4],index=['a','b','c','d']); s1
Out[144]:
a    1
b    2
c    3
d    4
dtype: int64

In [145]: df1.append(s1,ignore_index=True)
Out[145]:
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  1.0  2.0  3.0  4.0
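
On pandas versions where DataFrame.append is no longer available, the same result can be obtained with pd.concat. This is one possible sketch, not the only way: the Series is first turned into a one-row DataFrame so that its labels become column names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4)), columns=list("abcd"), index=[1, 2, 3])
s1 = pd.Series([1, 2, 3, 4], index=list("abcd"))

# to_frame() makes a one-column frame; .T turns it into a single row.
new_row = s1.to_frame().T

# Stack the row under df1 and renumber the row index from 0.
appended = pd.concat([df1, new_row], ignore_index=True)

print(appended)
```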
