Introduction to pandas (2)

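Handling missing values

pandas uses np.nan to represent missing values. By default, these values are excluded from computations.

reindex() lets you change, add, or delete the index on a specified axis, and returns a copy of the data.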

df1 = df.reindex(index=dates[0:4], columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
'''
                A	         B	         C	    D	 F	 E
2013-01-01	0.000000	0.000000	-0.541762	5	NaN	1.0
2013-01-02	-0.884117	-0.650741	0.217345	5	1.0	1.0
2013-01-03	0.220822	0.790527	0.692172	5	2.0	NaN
2013-01-04	1.260276	1.000297	0.809801	5	3.0	NaN
'''
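To make the "returns a copy" point concrete, here is a minimal sketch on a small hypothetical frame (the names base and reindexed are illustrative and not part of the tutorial's df). It shows that labels missing from the original come back as NaN, and that writing to the reindexed result does not touch the original.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame, not the tutorial's df
base = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=["x", "y", "z"])

# Keep "x" and "y", drop "z", add a new row label "w" and a new column "E"
reindexed = base.reindex(index=["x", "y", "w"], columns=["A", "E"])

# Cells for labels that did not exist in the original are filled with NaN
new_cells_missing = bool(reindexed.loc["w"].isna().all() and reindexed["E"].isna().all())

# Writing to the reindexed result leaves the original frame unchanged
reindexed.loc["x", "A"] = 100.0
original_unchanged = base.loc["x", "A"] == 1.0

print(reindexed)
print(new_cells_missing, original_unchanged)
```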

Drop every row that contains a missing value

df1.dropna(how='any')
'''
                 A	          B	         C	    D	 F	 E
2013-01-02	-0.884117	-0.650741	0.217345	5	1.0	1.0
'''

Fill in missing values

df1.fillna(value=5)
'''
                 A	        B	         C	    D	 F	 E
2013-01-01	0.000000	0.000000	-0.541762	5	5.0	1.0
2013-01-02	-0.884117	-0.650741	0.217345	5	1.0	1.0
2013-01-03	0.220822	0.790527	0.692172	5	2.0	5.0
2013-01-04	1.260276	1.000297	0.809801	5	3.0	5.0
'''

Get a boolean mask that marks which values are NaN

pd.isnull(df1)
'''
              A   	  B	      C       D	      F	      E
2013-01-01	False	False	False	False	True	False
2013-01-02	False	False	False	False	False	False
2013-01-03	False	False	False	False	False	True
2013-01-04	False	False	False	False	False	True
'''
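The three operations above can be tried side by side on a tiny hand-written frame. This is only a sketch under assumed data (frame and the other variable names are hypothetical); it also contrasts how='any' with how='all', which the text does not show.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in each of the first two rows
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

# how="any": drop a row if at least one value is missing
complete_rows = frame.dropna(how="any")

# how="all": drop a row only if every value is missing (none here)
not_fully_missing = frame.dropna(how="all")

# Replace every NaN with a constant
filled = frame.fillna(value=0)

# Boolean mask of missing cells; frame.isna() is equivalent to pd.isnull(frame)
mask = frame.isna()
missing_per_column = mask.sum()

print(complete_rows)
print(filled)
print(missing_per_column)
```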

Operations

Statistics

Operations generally exclude missing values.

Compute a descriptive statistic

df.mean()
'''
A    0.474038
B    0.474008
C    0.442600
D    5.000000
F    3.000000
dtype: float64
'''

The same operation along the other axis

df.mean(axis=1)
'''
2013-01-01    1.114559
2013-01-02    0.936497
2013-01-03    1.740704
2013-01-04    2.214075
2013-01-05    2.501728
2013-01-06    2.384123
Freq: D, dtype: float64
'''
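A short sketch of what "excluding missing values" means for mean(), using made-up numbers (values and frame are hypothetical names). The skipna parameter controls this behaviour; it defaults to True.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# Default: NaN is skipped, so the mean is taken over 1.0 and 3.0
mean_default = values.mean()

# With skipna=False a single NaN makes the whole result NaN
mean_strict = values.mean(skipna=False)

frame = pd.DataFrame({"x": [1.0, np.nan], "y": [3.0, 5.0]})
col_means = frame.mean()        # one value per column
row_means = frame.mean(axis=1)  # one value per row

print(mean_default, mean_strict)
print(col_means)
print(row_means)
```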

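Operations between objects of different dimensionality require alignment. In addition, pandas automatically broadcasts along the specified dimension.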

s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s
'''
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
'''

df.sub(s, axis='index')
'''
                A	         B	         C	       D	 F
2013-01-01	   NaN	        NaN	        NaN	      NaN	 NaN
2013-01-02	   NaN	        NaN	        NaN	      NaN	 NaN
2013-01-03	-0.779178	-0.209473	-0.307828	  4.0	 1.0
2013-01-04	-1.739724	-1.999703	-2.190199	  2.0	 0.0
2013-01-05	-3.320619	-3.531391	-4.639352	  0.0	-1.0
2013-01-06	   NaN	        NaN	        NaN	      NaN	 NaN
'''
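Alignment and broadcasting are easier to see with round numbers. The sketch below uses a hypothetical frame and a Series (offsets) that deliberately lacks one of the row labels: the Series is matched to rows by label, subtracted from every column, and the unmatched row becomes NaN.

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [10.0, 20.0, 30.0], "B": [1.0, 2.0, 3.0]},
    index=["r1", "r2", "r3"],
)

# No entry for "r3", so that row cannot be aligned
offsets = pd.Series([1.0, 2.0], index=["r1", "r2"])

# axis="index": match the Series labels against the row index,
# then broadcast the subtraction across all columns
result = frame.sub(offsets, axis="index")

print(result)
```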

Apply: applying functions

Use apply() to apply a function to the data

df.apply(np.cumsum)
'''
	            A	        B	         C	    D	 F
2013-01-01	0.000000	0.000000	-0.541762	5	NaN
2013-01-02	-0.884117	-0.650741	-0.324417	10	1.0
2013-01-03	-0.663295	0.139786	0.367755	15	3.0
2013-01-04	0.596982	1.140083	1.177556	20	6.0
2013-01-05	2.276362	2.608693	1.538204	25	10.0
2013-01-06	2.844229	2.844045	2.655599	30	15.0
'''
df.apply(lambda x:x.max()-x.min())
'''
A    2.563498
B    2.119350
C    1.659157
D    0.000000
F    4.000000
dtype: float64
'''
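By default apply() hands the function one column at a time; passing axis=1 hands it one row at a time instead. A small sketch with hypothetical integer data, where the results are easy to check by hand:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 40]})

# Column-wise running total (same result as frame.cumsum())
running = frame.apply(lambda col: col.cumsum())

# Column-wise range: one number per column
col_range = frame.apply(lambda col: col.max() - col.min())

# Row-wise range: one number per row
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(running)
print(col_range)
print(row_range)
```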

Frequency counts

s = pd.Series(np.random.randint(0, 7, size=10))
s
'''
0    2
1    0
2    5
3    1
4    6
5    1
6    1
7    1
8    5
9    4
dtype: int32
'''

s.value_counts()
'''
1    4
5    2
6    1
4    1
2    1
0    1
dtype: int64
'''
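Because the Series above is random, the counts change on every run. The sketch below hard-codes the ten values printed above so the result is reproducible, and also shows the normalize=True option, which returns proportions instead of counts (the variable names are illustrative).

```python
import pandas as pd

# The ten values shown in the output above, written out explicitly
fixed = pd.Series([2, 0, 5, 1, 6, 1, 1, 1, 5, 4])

counts = fixed.value_counts()                # sorted by count, descending
most_common = counts.index[0]                # the value that occurs most often
shares = fixed.value_counts(normalize=True)  # relative frequencies

print(counts)
print(most_common)
print(shares)
```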

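String methods

For Series objects, the str attribute provides a collection of string-processing methods that, as in the code below, make it easy to operate on each element of the array. Note that pattern matching in str generally uses regular expressions by default.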

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
'''
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
'''
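The remark about regular expressions matters as soon as a pattern contains a character such as ".". As a sketch (words and the pattern "a.a" are chosen purely for illustration), str.contains treats the pattern as a regex unless regex=False is passed, and na=False turns the missing entry into False rather than NaN.

```python
import numpy as np
import pandas as pd

words = pd.Series(["Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

# Regex: "a", then any single character, then "a" (case-sensitive)
regex_hits = words.str.contains("a.a", na=False)

# Literal: the three characters "a.a" must appear exactly as written
literal_hits = words.str.contains("a.a", regex=False, na=False)

# Missing values pass through str methods such as lower() as NaN
lowered = words.str.lower()

print(regex_hits.tolist())
print(literal_hits.tolist())
print(lowered.tolist())
```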

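Merging

Concat

pandas provides a range of facilities for easily combining Series and DataFrame objects (and, in old pandas versions, Panel objects) using different kinds of set logic.

Concatenate pandas objects with concat()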

df = pd.DataFrame(np.random.randn(10,4))
df
'''
	0	1	2	3
0	1.560690	2.253479	1.728586	1.224112
1	-1.237557	-1.571768	-1.687004	-0.226474
2	-0.591146	-0.054644	0.600806	0.076132
3	-0.567678	0.426496	-0.972487	0.200211
4	-2.073311	-1.566767	-0.533602	1.366468
5	2.244767	1.612232	1.934717	-0.403805
6	-2.640917	0.640549	1.257238	0.043773
7	1.545405	1.771884	-0.273687	2.441483
8	-0.440476	0.567536	2.379072	1.152354
9	-0.047853	-0.440427	-1.382389	0.647217
'''

#break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces
'''
[          0         1         2         3
 0  1.560690  2.253479  1.728586  1.224112
 1 -1.237557 -1.571768 -1.687004 -0.226474
 2 -0.591146 -0.054644  0.600806  0.076132,
           0         1         2         3
 3 -0.567678  0.426496 -0.972487  0.200211
 4 -2.073311 -1.566767 -0.533602  1.366468
 5  2.244767  1.612232  1.934717 -0.403805
 6 -2.640917  0.640549  1.257238  0.043773,
           0         1         2         3
 7  1.545405  1.771884 -0.273687  2.441483
 8 -0.440476  0.567536  2.379072  1.152354
 9 -0.047853 -0.440427 -1.382389  0.647217]
'''

pd.concat(pieces)
'''
        0	        1	        2	        3
0	1.560690	2.253479	1.728586	1.224112
1	-1.237557	-1.571768	-1.687004	-0.226474
2	-0.591146	-0.054644	0.600806	0.076132
3	-0.567678	0.426496	-0.972487	0.200211
4	-2.073311	-1.566767	-0.533602	1.366468
5	2.244767	1.612232	1.934717	-0.403805
6	-2.640917	0.640549	1.257238	0.043773
7	1.545405	1.771884	-0.273687	2.441483
8	-0.440476	0.567536	2.379072	1.152354
9	-0.047853	-0.440427	-1.382389	0.647217
'''
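To check that splitting and concatenating really is a round trip, here is a reproducible sketch with integer data instead of random numbers. The keys argument is an optional extra, not used in the tutorial, that labels each piece with an outer index level; the labels chosen here are arbitrary.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(6, 2), columns=["x", "y"])

# Break it into three pieces, then glue them back together
pieces = [frame[:2], frame[2:4], frame[4:]]
rebuilt = pd.concat(pieces)
same_as_original = rebuilt.equals(frame)

# Optional: tag each piece so its origin stays visible in the index
keyed = pd.concat(pieces, keys=["first", "second", "third"])

print(same_as_original)
print(keyed)
```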

Join

SQL-style joins with merge()

left = pd.DataFrame({'key':['foo', 'foo'], 'lval':[1,2]})
left
'''
    key	lval
0	foo	 1
1	foo	 2
'''

right = pd.DataFrame({'key':['foo', 'foo'], 'lval':[4,5]})
right
'''
	key	lval
0	foo	4
1	foo	5
'''
pd.merge(left, right, on='key')
'''
	key	lval_x	lval_y
0	foo	  1  	  4
1	foo	  1	      5
2	foo	  2	      4
3	foo	  2	      5
'''
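The four rows in the result come from every 'foo' on the left pairing with every 'foo' on the right, and the _x/_y suffixes appear because both frames have a column named lval. The sketch below, using hypothetical frames, renames the right-hand column to rval to avoid the suffixes and then contrasts the default inner join with how='outer' when the keys only partly overlap.

```python
import pandas as pd

# Duplicate keys on both sides: 2 x 2 = 4 result rows
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
many_to_many = pd.merge(left, right, on="key")

# Partly overlapping keys
left_u = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_u = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

inner = pd.merge(left_u, right_u, on="key")               # keys present in both
outer = pd.merge(left_u, right_u, on="key", how="outer")  # keys present in either

print(many_to_many)
print(inner)
print(outer)
```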

Append

Append rows to the end of a DataFrame

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df
'''
        A	         B	         C	        D
0	0.415810	-1.106857	-0.687920	2.422911
1	0.696149	-1.235975	0.201409	1.424596
2	-0.540622	0.121096	-0.861667	-0.171690
3	0.163904	1.324567	-0.768324	-0.205520
4	-1.581152	-0.079061	0.251810	-0.195755
5	1.254246	1.604556	0.766464	-1.090743
6	0.608609	1.000765	-0.407980	0.034970
7	-3.111914	2.163344	0.619885	-0.705518
'''

s = df.iloc[3]
s
'''
A    0.163904
B    1.324567
C   -0.768324
D   -0.205520
Name: 3, dtype: float64
'''

# DataFrame.append() was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df, s.to_frame().T], ignore_index=True)
'''
        A	          B	         C	        D
0	0.415810	-1.106857	-0.687920	2.422911
1	0.696149	-1.235975	0.201409	1.424596
2	-0.540622	0.121096	-0.861667	-0.171690
3	0.163904	1.324567	-0.768324	-0.205520
4	-1.581152	-0.079061	0.251810	-0.195755
5	1.254246	1.604556	0.766464	-1.090743
6	0.608609	1.000765	-0.407980	0.034970
7	-3.111914	2.163344	0.619885	-0.705518
8	0.163904	1.324567	-0.768324	-0.205520
'''
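Appending rows with pd.concat works for a single row (a Series turned into a one-row frame) as well as for several rows at once. A minimal sketch with hypothetical data; note that the original frame is never modified, a new one is returned each time.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# One row: turn the Series into a one-row DataFrame first
row = frame.iloc[1]
one_more = pd.concat([frame, row.to_frame().T], ignore_index=True)

# Several rows: concatenate another DataFrame with the same columns
new_rows = pd.DataFrame({"A": [5.0, 6.0], "B": [7.0, 8.0]})
extended = pd.concat([frame, new_rows], ignore_index=True)

print(one_more)
print(extended)
```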

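Grouping

By "group by" we mean a process that involves one or more of the following steps:

  • Splitting: divide the data into groups based on some criteria
  • Applying: run a function on each group independently
  • Combining: gather the results into a single data structure

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 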
                          'foo', 'bar', 'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three', 
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
'''
     A	  B	         C	        D
0	foo	 one	 0.190663	0.589384
1	bar	 one	 1.056331	0.035044
2	foo	 two	 0.723645	0.372672
3	bar	 three   -1.306869	0.435296
4	foo	 two     0.673661	-1.292242
5	bar	 two	 -0.823728	0.837556
6	foo	 one	 0.638573	2.453041
7	bar	 three   0.508922	0.578740
'''

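Group the data, then apply sum() to each group

# Select the numeric columns explicitly: since pandas 2.0, sum() would
# otherwise also concatenate the strings in column B
df.groupby('A')[['C', 'D']].sum()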
'''
         C	        D
 A		
bar	-0.565344	1.886637
foo	2.226542	2.122855
'''

Grouping by multiple columns forms a hierarchical index, and the function is then applied to each group

df.groupby(['A','B']).sum()
'''
              C	        D
A	 B		
bar	one	  1.056331	 0.035044
   three  -0.797947	 1.014036
    two	  -0.823728	 0.837556
foo	one	  0.829236	 3.042425
    two	  1.397306	 -0.919570
'''
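The split, apply, and combine steps can be spelled out by hand and compared with what groupby does in one call. This is only an illustrative sketch on hypothetical data, not a description of how pandas implements groupby internally.

```python
import pandas as pd

frame = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Split: one sub-Series of C per distinct value of A
parts = {key: frame.loc[frame["A"] == key, "C"] for key in frame["A"].unique()}

# Apply: run the function on each part
sums = {key: part.sum() for key, part in parts.items()}

# Combine: collect the results into one structure, sorted by key
manual = pd.Series(sums).sort_index()

# The same thing in a single groupby call
grouped = frame.groupby("A")["C"].sum()

print(manual)
print(grouped)
```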

Reshaping

Stack

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
                     
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
'''
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
'''
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
'''
                     A         B
first second                    
bar   one     0.033187 -1.575675
      two    -0.116085  0.463558
baz   one    -1.264759 -0.076262
      two     0.290001 -0.908792
foo   one     0.417612 -0.119032
      two    -0.041429  0.372586
qux   one    -1.233305 -2.380777
      two    -0.647859 -0.367107
'''
df2 = df[:4]
df2
'''
                     A         B
first second                    
bar   one     0.033187 -1.575675
      two    -0.116085  0.463558
baz   one    -1.264759 -0.076262
      two     0.290001 -0.908792
'''

The stack() method "compresses" a level in the DataFrame's columns

stacked = df2.stack()
stacked
'''
first  second   
bar    one     A    0.033187
               B   -1.575675
       two     A   -0.116085
               B    0.463558
baz    one     A   -1.264759
               B   -0.076262
       two     A    0.290001
               B   -0.908792
dtype: float64
'''

For a "stacked" DataFrame or Series (one that has a MultiIndex as its index), the inverse of stack() is unstack(), which by default unstacks the last level

stacked.unstack()
'''
                     A         B
first second                    
bar   one     0.033187 -1.575675
      two    -0.116085  0.463558
baz   one    -1.264759 -0.076262
      two     0.290001 -0.908792
'''

stacked.unstack(1)
'''
second        one       two
first                      
bar   A  0.033187 -0.116085
      B -1.575675  0.463558
baz   A -1.264759  0.290001
      B -0.076262 -0.908792
'''

stacked.unstack(0)
'''
first          bar       baz
second                      
one    A  0.033187 -1.264759
       B -1.575675 -0.076262
two    A -0.116085  0.290001
       B  0.463558 -0.908792
'''
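Since the frames above are filled with random numbers, a fixed-value version makes the stack/unstack relationship easier to verify. The sketch below uses hypothetical values 1.0 to 8.0 and refers to the level by name ('second') rather than by position, which is equivalent to unstack(1) here.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")],
    names=["first", "second"],
)
frame = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [5.0, 6.0, 7.0, 8.0]},
    index=index,
)

# Columns become the innermost index level of a Series
stacked = frame.stack()

# Unstacking the last level moves A/B back into columns
round_trip = stacked.unstack()

# Unstacking a named level moves that level into the columns instead
by_second = stacked.unstack("second")

print(stacked)
print(round_trip)
print(by_second)
```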

Pivot tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
'''
        A  B    C         D         E
0     one  A  foo  0.328414 -0.219345
1     one  B  foo -0.363916  1.015422
2     two  C  foo -0.413828 -0.346556
3   three  A  bar -0.817349 -0.561905
4     one  B  bar  0.421502  0.410673
5     one  C  bar  0.147630 -0.646454
6     two  A  foo  3.257885  1.025650
7   three  B  foo  0.664719  0.004742
8     one  C  foo  0.875158 -0.921567
9     one  A  bar -1.001131  0.867019
10    two  B  bar -0.260534 -0.202553
11  three  C  bar -0.142559  0.114470
'''

We can easily produce a pivot table from this data

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
'''
C             bar       foo
A     B                    
one   A -1.001131  0.328414
      B  0.421502 -0.363916
      C  0.147630  0.875158
three A -0.817349       NaN
      B       NaN  0.664719
      C -0.142559       NaN
two   A       NaN  3.257885
      B -0.260534       NaN
      C       NaN -0.413828
'''
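In the table above every (A, B, C) combination occurs at most once, so it is not obvious that pivot_table aggregates. The hypothetical data below repeats one combination to show that the default aggregation is the mean, that combinations with no data become NaN, and that aggfunc can select a different aggregation.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "C": ["foo", "foo", "bar", "foo"],
    "D": [1.0, 3.0, 5.0, 7.0],
})

# Default aggregation is the mean: ("one", "foo") -> (1.0 + 3.0) / 2
table = pd.pivot_table(frame, values="D", index="A", columns="C")

# A different aggregation: ("one", "foo") -> 1.0 + 3.0
totals = pd.pivot_table(frame, values="D", index="A", columns="C", aggfunc="sum")

print(table)
print(totals)
```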
