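3.1 pandas: Basic Functionality

I: Reindexing

  The reindex method rearranges a Series according to a new index. For a DataFrame it can change the row index, the column index, or both at once. It returns a new object and leaves the original unchanged.

  1) Series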

In [2]: seri = pd.Series([4.5,7.2,-5.3,3.6],index = ['d','b','a','c'])

In [3]: seri
Out[3]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [4]: seri1 = seri.reindex(['a','b','c','d','e'])

In [5]: seri1
Out[5]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN    # labels with no existing value become NaN
dtype: float64

In [6]: seri.reindex(['a','b','c','d','e'], fill_value=0)
Out[6]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0     # missing labels are filled with 0
dtype: float64

In [7]: seri
Out[7]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [8]: seri_2 = pd.Series(['blue','purple','yellow'], index=[0,2,4])

In [9]: seri_2
Out[9]:
0      blue
2    purple
4    yellow
dtype: object

# fill methods available to reindex: ffill fills forward, bfill fills backward

In [10]: seri_2.reindex(range(6),method='ffill')
Out[10]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [11]: seri_2.reindex(range(6),method='bfill')
Out[11]:
0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object
Reindexing a Series

  2) DataFrame

    Parameters of reindex: method="ffill"/"bfill"; fill_value=... [the value used in place of NaN for missing labels]; and others.

In [4]: dframe_1 = pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','b','c'],columns=['Ohio','Texas','Cal'])
In [5]: dframe_1
Out[5]:
   Ohio  Texas  Cal
a     0      1    2
b     3      4    5
c     6      7    8

In [6]: dframe_2 = dframe_1.reindex(['a','b','c','d'])

In [7]: dframe_2
Out[7]:
   Ohio  Texas  Cal
a     0      1    2
b     3      4    5
c     6      7    8
d   NaN    NaN  NaN

# note: newer pandas versions may raise here, because method= needs sorted labels on every axis being reindexed
In [16]: dframe_1.reindex(index=['a','b','c','d'],method='ffill',columns=['Ohio','Beijin','Cal'])
Out[16]:
   Ohio  Beijin  Cal
a     0     NaN    2
b     3     NaN    5
c     6     NaN    8
d     6     NaN    8

In [17]: dframe_1.reindex(index=['a','b','c','d'],fill_value='Z',columns=['Ohio','Beijin','Cal'])
Out[17]:
  Ohio Beijin Cal
a    0      Z   2
b    3      Z   5
c    6      Z   8
d    Z      Z   Z

In [8]: dframe_1.reindex(columns=['Chengdu','Beijin','Shanghai','Guangdong'])
Out[8]:
   Chengdu  Beijin  Shanghai  Guangdong
a      NaN     NaN       NaN        NaN
b      NaN     NaN       NaN        NaN
c      NaN     NaN       NaN        NaN

In [9]: dframe_1
Out[9]:
   Ohio  Texas  Cal
a     0      1    2
b     3      4    5
c     6      7    8

# change the row and column index in one call
# (the original post used dframe_1.ix[...]; ix was removed in pandas 1.0, and reindex gives the same result)
In [10]: dframe_1.reindex(index=['a','b','c','d'],columns=['Ohio','Beijing','Guangdong'])
Out[10]:
   Ohio  Beijing  Guangdong
a     0      NaN        NaN
b     3      NaN        NaN
c     6      NaN        NaN
d   NaN      NaN        NaN
Reindexing a DataFrame
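
The difference between reindex and label-based selection is easy to miss, so here is a minimal sketch of it. It assumes pandas 1.0 or later, where .loc raises a KeyError when a list contains a missing label; the variable name both is only illustrative.

```python
import numpy as np
import pandas as pd

dframe_1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                        index=['a', 'b', 'c'],
                        columns=['Ohio', 'Texas', 'Cal'])

# reindex accepts labels that do not exist yet and fills them with NaN
both = dframe_1.reindex(index=['a', 'b', 'c', 'd'],
                        columns=['Ohio', 'Beijing', 'Guangdong'])
print(both)

# reindex returns a new object; the original keeps its 3x3 shape
print(dframe_1.shape)

# .loc is for selecting existing labels; a missing label is an error
# (assumption: pandas >= 1.0, older versions only warned)
try:
    dframe_1.loc[['a', 'd'], ['Ohio']]
except KeyError as exc:
    print('KeyError:', exc)
```

Use reindex when the goal is to conform data to a new set of labels, and loc when the labels are expected to exist.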

 

II: Dropping entries from an axis

  The drop method removes entries by index label and returns a new object.

  1) Series

In [21]: seri = pd.Series(np.arange(5),index=['a','b','c','d','e'])

In [22]: seri
Out[22]:
a    0
b    1
c    2
d    3
e    4
dtype: int32

In [23]: seri.drop('b')
Out[23]:
a    0
c    2
d    3
e    4
dtype: int32

In [24]: seri.drop(['d','e'])
Out[24]:
a    0
b    1
c    2
dtype: int32
Dropping entries from a Series

  2) DataFrame

In [29]: dframe = pd.DataFrame(np.arange(16).reshape((4,4)),index=['Chen','Bei','Shang','Guang'],columns=['one','two','three','four'])

In [30]: dframe
Out[30]:
       one  two  three  four
Chen     0    1      2     3
Bei      4    5      6     7
Shang    8    9     10    11
Guang   12   13     14    15

# drop rows
In [31]: dframe.drop(['Bei','Shang'])
Out[31]:
       one  two  three  four
Chen     0    1      2     3
Guang   12   13     14    15

# drop columns
In [33]: dframe.drop(['two','three'],axis=1)
Out[33]:
       one  four
Chen     0     3
Bei      4     7
Shang    8    11
Guang   12    15

# when only one label is dropped, the list brackets [] can be omitted
Dropping entries from a DataFrame
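
The single-label form mentioned in the last comment is not shown above, so here is a small sketch of it. The columns= keyword is the equivalent spelling of axis=1 available since pandas 0.21; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['Chen', 'Bei', 'Shang', 'Guang'],
                      columns=['one', 'two', 'three', 'four'])

# a single label can be passed directly, without wrapping it in a list
no_bei = dframe.drop('Bei')            # drops one row
no_two = dframe.drop('two', axis=1)    # drops one column

# keyword form of the same column drop (pandas >= 0.21)
same = dframe.drop(columns='two')

print(no_bei)
print(no_two)
```

In every case drop returns a new object, so dframe itself still has all four rows and columns.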

 

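III: Indexing, selection, and filtering

  1) Series

    A Series can still be accessed by position, like a list, but I don't think that is a good idea. It is better to access values by index label, and labels can also be used for slicing (a label slice includes its end label).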

In [4]: seri = pd.Series(np.arange(4),index=['a','b','c','d'])

In [5]: seri
Out[5]:
a    0
b    1
c    2
d    3
dtype: int32

In [6]: seri['a']
Out[6]: 0

In [7]: seri[['b','a']]       # the result follows the requested order
Out[7]:
b    1
a    0
dtype: int32


In [18]: seri[seri<2]    # element-wise comparison used as a boolean filter
Out[18]:
a    0
b    1
dtype: int32

In [11]: seri['a':'c']     # slicing by index label; the end label is included
Out[11]:
a    0
b    1
c    2
dtype: int32

In [12]: seri['a':'c']='z'

In [13]: seri
Out[13]:
a    z
b    z
c    z
d    3
dtype: object
Selecting from a Series

  2)DataFrame

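    This is mostly a matter of selecting one or more columns. A DataFrame can be viewed as several Series that share the same index, and for plain [] indexing the keys are the column labels (the header row). So dframe[[list of labels]] selects several columns, and every label must be a column label, otherwise a KeyError is raised. Rows can be selected with a slice: dframe[:2] selects rows 0 and 1. To select along both axes at once, use ix[row selector, column selector]; ix was removed in pandas 1.0, and loc is its label-based replacement.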

In [19]: dframe = pd.DataFrame(np.arange(16).reshape((4,4)),index=['one','two','three','four'],columns=['Bei','Shang','Guang','Sheng'])

In [21]: dframe
Out[21]:
       Bei  Shang  Guang  Sheng
one      0      1      2      3
two      4      5      6      7
three    8      9     10     11
four    12     13     14     15

In [22]: dframe[['one']]         # KeyError: 'one' is a row label, not a column label
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-22-c2522043b676> in <module>()
----> 1 dframe[['one']]

In [25]: dframe[['Bei']]
Out[25]:
       Bei
one      0
two      4
three    8
four    12

In [26]: dframe[['Bei','Sheng']]
Out[26]:
       Bei  Sheng
one      0      3
two      4      7
three    8     11
four    12     15

In [27]: dframe[:2]        # select rows
Out[27]:
     Bei  Shang  Guang  Sheng
one    0      1      2      3
two    4      5      6      7

In [32]: # for label-based selection use loc (ix in old pandas, removed in 1.0): the first argument selects rows, the second selects columns

In [33]: dframe.loc[['one','two'],['Bei','Shang']]
Out[33]:
     Bei  Shang
one    0      1
two    4      5

# this shows that each column label is also an attribute of the dframe instance
In [35]: dframe.Bei
Out[35]:
one       0
two       4
three     8
four     12
Name: Bei, dtype: int32

In [36]: dframe[dframe.Bei<5]
Out[36]:
     Bei  Shang  Guang  Sheng
one    0      1      2      3
two    4      5      6      7

In [38]: dframe.loc[dframe.Bei<5,dframe.columns[:2]]
Out[38]:
     Bei  Shang
one    0      1
two    4      5

In [43]: dframe.loc[:'two',['Shang','Bei']]
Out[43]:
     Shang  Bei
one      1    0
two      5    4
Selecting from a DataFrame
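
Since ix no longer exists in current pandas, the following sketch shows the same selections with loc (by label) and iloc (by integer position). It reuses the dframe from the listing; the result names are illustrative.

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['one', 'two', 'three', 'four'],
                      columns=['Bei', 'Shang', 'Guang', 'Sheng'])

# label-based: rows 'one' and 'two', columns 'Bei' and 'Shang'
by_label = dframe.loc[['one', 'two'], ['Bei', 'Shang']]

# position-based: first two rows, first two columns
by_pos = dframe.iloc[:2, :2]

# boolean row filter combined with a column list
masked = dframe.loc[dframe['Bei'] < 5, ['Bei', 'Shang']]

# a label slice includes its end label, unlike a positional slice
label_slice = dframe.loc[:'two', 'Shang']

print(by_label)
print(label_slice)
```

All three two-dimensional selections above return the same 2x2 block, which makes it easy to compare the three styles.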

 

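IV: Arithmetic

  1) Series

    Operands are aligned on their index labels before the operation is applied. A label that appears in only one operand produces NaN in the result. The arithmetic methods (add, sub, and so on) with fill_value avoid this.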

In [4]: seri_1 = pd.Series([1,2,3,4],index = ['a','b','c','d'])

In [5]: seri_2 = pd.Series([5,6,7,8,9],index = ['a','c','e','g','f'])

In [6]: seri_1 + seri_2
Out[6]:
a     6
b   NaN
c     9
d   NaN
e   NaN
f   NaN
g   NaN
dtype: float64

In [8]: seri_1.add(seri_2)
Out[8]:
a     6
b   NaN
c     9
d   NaN
e   NaN
f   NaN
g   NaN
dtype: float64

In [7]: seri_1.add(seri_2,fill_value = 0)
Out[7]:
a    6
b    2
c    9
d    4
e    7
f    9
g    8
dtype: float64

# with fill_value, the non-overlapping labels get real values instead of NaN
# method / operator pairs: add: +; mul: *; sub: -; div: /
Series arithmetic
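
Only add is demonstrated above. As a quick sketch, the other methods take fill_value in the same way; note that the fill value should suit the operation (0 for sub, 1 for mul). The result names are illustrative.

```python
import pandas as pd

seri_1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
seri_2 = pd.Series([5, 6, 7, 8, 9], index=['a', 'c', 'e', 'g', 'f'])

# the plain operator leaves NaN wherever a label is missing on one side
plain = seri_1 - seri_2

# sub with fill_value=0 treats the missing side as 0
diff = seri_1.sub(seri_2, fill_value=0)

# mul with fill_value=1 treats the missing side as 1
prod = seri_1.mul(seri_2, fill_value=1)

print(plain)
print(diff)
print(prod)
```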

  2)DataFrame

In [10]: df_1 = pd.DataFrame(np.arange(12).reshape((3,4)),columns = list('abcd'))
In [11]: df_2 = pd.DataFrame(np.arange(20).reshape((4,5)),columns = list('abcde'))
In [12]: df_1 + df_2
Out[12]:
    a   b   c   d   e
0   0   2   4   6 NaN
1   9  11  13  15 NaN
2  18  20  22  24 NaN
3 NaN NaN NaN NaN NaN

In [13]: df_1.add(df_2)
Out[13]:
    a   b   c   d   e
0   0   2   4   6 NaN
1   9  11  13  15 NaN
2  18  20  22  24 NaN
3 NaN NaN NaN NaN NaN

In [14]: df_1.add(df_2, fill_value = 0)
Out[14]:
    a   b   c   d   e
0   0   2   4   6   4
1   9  11  13  15   9
2  18  20  22  24  14
3  15  16  17  18  19
DataFrame arithmetic

  3) Operations between a DataFrame and a Series

  These behave like broadcasting with np.array:

In [15]: arr_1 = np.arange(12).reshape((3,4))

In [16]: arr_1 - arr_1[0]
Out[16]:
array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])

In [17]: arr_1
Out[17]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
The NumPy array case
In [18]: dframe_1 = pd.DataFrame(np.arange(12).reshape((4,3)),columns=list('bde'),index = ['Chen','Bei','Shang','Sheng'])
In [19]: dframe_1
Out[19]:
       b   d   e
Chen   0   1   2
Bei    3   4   5
Shang  6   7   8
Sheng  9  10  11

In [20]: seri = dframe_1.iloc[0]      # first row (the original used .ix[0], removed in pandas 1.0)

In [21]: seri
Out[21]:
b    0
d    1
e    2
Name: Chen, dtype: int32

In [22]: dframe_1 - seri      # the Series index is matched to the columns and subtracted from every row
Out[22]:
       b  d  e
Chen   0  0  0
Bei    3  3  3
Shang  6  6  6
Sheng  9  9  9

In [23]: seri_2 = pd.Series(range(3),index=['b','e','f'])

In [24]: dframe_1 - seri_2
Out[24]:
       b   d   e   f
Chen   0 NaN   1 NaN
Bei    3 NaN   4 NaN
Shang  6 NaN   7 NaN
Sheng  9 NaN  10 NaN

In [27]: seri_3 = dframe_1['d']

In [28]: seri_3        # note: seri_3 is indexed by dframe_1's row labels, not its column labels, unlike the Series used above
Out[28]:
Chen      1
Bei       4
Shang     7
Sheng    10
Name: d, dtype: int32

In [29]: dframe_1 - seri_3
Out[29]:
       Bei  Chen  Shang  Sheng   b   d   e
Chen   NaN   NaN    NaN    NaN NaN NaN NaN
Bei    NaN   NaN    NaN    NaN NaN NaN NaN
Shang  NaN   NaN    NaN    NaN NaN NaN NaN
Sheng  NaN   NaN    NaN    NaN NaN NaN NaN
# the result columns are the union of the Series index and the DataFrame's own columns

# the axis parameter of the arithmetic methods changes the matching axis and avoids the problem above
# axis=0 matches the Series index against the row index; axis=1 matches it against the column labels
In [31]: dframe_1.sub(seri_3,axis=0)
Out[31]:
       b  d  e
Chen  -1  0  1
Bei   -1  0  1
Shang -1  0  1
Sheng -1  0  1

In [33]: dframe_1.sub(seri_3,axis=1)
Out[33]:
       Bei  Chen  Shang  Sheng   b   d   e
Chen   NaN   NaN    NaN    NaN NaN NaN NaN
Bei    NaN   NaN    NaN    NaN NaN NaN NaN
Shang  NaN   NaN    NaN    NaN NaN NaN NaN
Sheng  NaN   NaN    NaN    NaN NaN NaN NaN
Operations between a DataFrame and a Series

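    Note: the axis value can be read as describing the Series being matched. 0: a Series whose index is the DataFrame's index [vertical axis]; 1: a Series whose index is the DataFrame's columns [horizontal axis].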
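
The note on axis can be checked with a compact sketch: a row taken from the frame is matched against the columns (axis=1, the default for operators), while a column taken from the frame has to be matched against the row index (axis=0). The names row and col are illustrative.

```python
import numpy as np
import pandas as pd

dframe_1 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                        columns=list('bde'),
                        index=['Chen', 'Bei', 'Shang', 'Sheng'])

row = dframe_1.iloc[0]   # indexed by the column labels b, d, e
col = dframe_1['d']      # indexed by the row labels

# axis=1: match the Series index against the columns (same as dframe_1 - row)
by_columns = dframe_1.sub(row, axis=1)

# axis=0: match the Series index against the row index
by_rows = dframe_1.sub(col, axis=0)

print(by_columns)
print(by_rows)
```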

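V: Applying functions

  NumPy element-wise functions can be called on a DataFrame directly. The apply method runs a function over each column (the default) or over each row (axis=1).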
In [6]: dframe=pd.DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Chen','Bei','Shang','Sheng'])
In [7]: dframe
Out[7]:
              b         d         e
Chen   1.838620  1.023421  0.641420
Bei    0.920563 -2.037778 -0.853871
Shang -0.587332  0.576442  0.596269
Sheng  0.366174 -0.689582 -1.064030

In [8]: np.abs(dframe)       # absolute value, element-wise
Out[8]:
              b         d         e
Chen   1.838620  1.023421  0.641420
Bei    0.920563  2.037778  0.853871
Shang  0.587332  0.576442  0.596269
Sheng  0.366174  0.689582  1.064030

In [9]: func = lambda x: x.max() - x.min()

In [10]: dframe.apply(func)
Out[10]:
b    2.425952
d    3.061200
e    1.705449
dtype: float64

In [11]: dframe.apply(func,axis=1)
Out[11]:
Chen     1.197200
Bei      2.958341
Shang    1.183602
Sheng    1.430204
dtype: float64

In [12]: dframe.max()  # same as dframe.max(axis=0)
Out[12]:
b    1.838620
d    1.023421
e    0.641420
dtype: float64

In [15]: dframe.max(axis=1)
Out[15]:
Chen     1.838620
Bei      0.920563
Shang    0.596269
Sheng    0.366174
dtype: float64

VI: Sorting

  1) Sorting by index: sort_index([axis=0/1, ascending=False/True]). Note: axis defaults to 0 (sort by the row index) and ascending defaults to True (ascending order).

In [16]: seri = pd.Series(range(4),index=['d','a','d','c'])

In [17]: seri
Out[17]:
d    0
a    1
d    2
c    3
dtype: int64

In [18]: seri.sort_index()
Out[18]:
a    1
c    3
d    2
d    0
dtype: int64
Sorting a Series by index
In [22]: dframe
Out[22]:
              c         a         b
Chen   1.838620  1.023421  0.641420
Bei    0.920563 -2.037778 -0.853871
Shang -0.587332  0.576442  0.596269
Sheng  0.366174 -0.689582 -1.064030

In [23]: dframe.sort_index()
Out[23]:
              c         a         b
Bei    0.920563 -2.037778 -0.853871
Chen   1.838620  1.023421  0.641420
Shang -0.587332  0.576442  0.596269
Sheng  0.366174 -0.689582 -1.064030

In [24]: dframe.sort_index(axis=1)
Out[24]:
              a         b         c
Chen   1.023421  0.641420  1.838620
Bei   -2.037778 -0.853871  0.920563
Shang  0.576442  0.596269 -0.587332
Sheng -0.689582 -1.064030  0.366174
Sorting a DataFrame by index; axis chooses between sorting by the row index (0, the default) and sorting by the columns (1)
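
The ascending parameter is described above but not demonstrated. A minimal sketch with small made-up data (the values are illustrative, not from the listings above):

```python
import numpy as np
import pandas as pd

seri = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

# descending order of the index labels
desc = seri.sort_index(ascending=False)

dframe = pd.DataFrame(np.arange(6).reshape((2, 3)),
                      index=['Chen', 'Bei'],
                      columns=['c', 'a', 'b'])

# descending order of the column labels
cols_desc = dframe.sort_index(axis=1, ascending=False)

print(desc)
print(cols_desc)
```

Like the other methods in this section, sort_index returns a sorted copy and leaves seri and dframe in their original order.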

  2) Sorting by value: the sort_values method [note: the older order method is no longer recommended]

In [32]: seri =pd.Series([4,7,np.nan,-1,2,np.nan])

In [33]: seri
Out[33]:
0     4
1     7
2   NaN
3    -1
4     2
5   NaN
dtype: float64

In [34]: seri.sort_values()
Out[34]:
3    -1
4     2
0     4
1     7
2   NaN
5   NaN
dtype: float64

# NaN values are placed at the end by default
Sorting a Series by value
In [38]: dframe = pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})

In [39]: dframe
Out[39]:
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2

In [54]: dframe.sort_values('a')
Out[54]:
   a  b
0  0  4
2  0 -3
1  1  7
3  1  2

In [55]: dframe.sort_values('b')
Out[55]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7

In [57]: dframe.sort_values(['a','b'])
Out[57]:
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7

In [58]: dframe.sort_values(['b','a'])
Out[58]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7
Sorting a DataFrame by value

 

VII: Ranking

  The rank method
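
The post gives no example for rank, so here is a small sketch with made-up values. By default, tied values share the mean of the ranks they would occupy; the method parameter changes how ties are broken.

```python
import pandas as pd

seri = pd.Series([7, -5, 7, 4, 2, 0, 4])

# default method='average': ties get the mean of their ranks
avg = seri.rank()

# method='first': ties are ranked in order of appearance
first = seri.rank(method='first')

# descending ranks; method='max' gives every tie the largest rank of its group
desc = seri.rank(ascending=False, method='max')

print(avg.tolist())    # [6.5, 1.0, 6.5, 4.5, 3.0, 2.0, 4.5]
print(first.tolist())  # [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]
print(desc.tolist())   # [2.0, 7.0, 2.0, 4.0, 5.0, 6.0, 4.0]
```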

 

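VIII: Descriptive statistics

  count: number of non-NaN values; describe: summary statistics for a Series or for each DataFrame column; min, max; argmin, argmax: integer position of the minimum/maximum; idxmin, idxmax: index label of the minimum/maximum

  sum; mean: average; var: sample variance; std: sample standard deviation; kurt: kurtosis; cumsum: cumulative sum; cummin/cummax: cumulative minimum/maximum; pct_change: percentage change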

In [63]: df = pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index=['a','b','c','d'],columns=['one','two'])

In [64]: df
Out[64]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

In [66]: df.sum()
Out[66]:
one    9.25
two   -5.80
dtype: float64

# note: pandas 0.22 and later return 0.00 for row c, because the sum of an all-NaN row is 0
In [67]: df.sum(axis=1)
Out[67]:
a    1.40
b    2.60
c     NaN
d   -0.55
dtype: float64

# mean; skipna controls whether NaN values are skipped (here they are not)
In [68]: df.mean(axis=1,skipna=False)
Out[68]:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64


In [70]: df.idxmax()
Out[70]:
one    b
two    d
dtype: object

In [71]: df.cumsum()
Out[71]:
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

In [72]: df.describe()
Out[72]:
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
Some statistical computations
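
A few of the methods listed above are not exercised in the listing. The sketch below runs count, idxmin, the default NaN-skipping mean, and cummax on the same df; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# number of non-NaN values per column
counts = df.count()

# index label of the minimum of each column (NaN is skipped)
low = df.idxmin()

# skipna defaults to True, so only the all-NaN row c yields NaN
row_mean = df.mean(axis=1)

# cumulative maximum; the NaN position stays NaN
running_max = df['one'].cummax()

print(counts)
print(low)
print(row_mean)
print(running_max)
```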

 

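IX: Unique values, value counts, and membership

  The unique method; value_counts (also available as the top-level function pd.value_counts, which recent pandas releases deprecate in favor of the method); the isin method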

In [74]: seri = pd.Series(['c','a','d','a','a','b','b','c','c'])

In [75]: seri
Out[75]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [76]: seri.unique()
Out[76]: array(['c', 'a', 'd', 'b'], dtype=object)

In [77]: seri.value_counts()
Out[77]:
c    3
a    3
b    2
d    1
dtype: int64

In [78]: pd.value_counts(seri.values,sort=False)
Out[78]:
a    3
c    3
b    2
d    1
dtype: int64


In [81]: seri.isin(['b','c'])
Out[81]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
Unique values, value counts, and membership
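
The boolean result of isin is mostly useful as a filter. A short sketch of that use, together with the method form of value_counts; the names mask and subset are illustrative.

```python
import pandas as pd

seri = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# isin returns a boolean Series that can be used to filter the data
mask = seri.isin(['b', 'c'])
subset = seri[mask]

# occurrences of each distinct value, most frequent first
counts = seri.value_counts()

print(subset)
print(counts)
```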

 

X: Handling missing data

  A) Removing NaN: the dropna method

    1) Series

      Python's None is treated as missing data, the same as NumPy's NaN.

In [3]: seri = pd.Series(['aaa','bbb',np.nan,'ccc'])

In [4]: seri[0]=None

In [5]: seri
Out[5]:
0    None
1     bbb
2     NaN
3     ccc
dtype: object

In [7]: seri.isnull()
Out[7]:
0     True
1    False
2     True
3    False
dtype: bool

In [8]: seri.dropna()   # returns the non-NaN values
Out[8]:
1    bbb
3    ccc
dtype: object

In [9]: seri
Out[9]:
0    None
1     bbb
2     NaN
3     ccc
dtype: object

In [10]: seri[seri.notnull()]      # returns the non-null values
Out[10]:
1    bbb
3    ccc
dtype: object
Handling missing data in a Series

    2)DataFrame

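      For a DataFrame things are slightly more complicated: sometimes you want to drop only the rows or columns that are entirely NaN, and sometimes every row or column that contains any NaN.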

In [15]: df = pd.DataFrame([[1,6.5,3],[1,np.nan,np.nan],[np.nan,np.nan,np.nan],[np.nan,6.5,3]])

In [16]: df
Out[16]:
    0    1   2
0   1  6.5   3
1   1  NaN NaN
2 NaN  NaN NaN
3 NaN  6.5   3

In [17]: df.dropna()   # works on rows by default (axis=0) and drops any row containing NaN
Out[17]:
   0    1  2
0  1  6.5  3

In [19]: df.dropna(how='all') # drop only the rows that are entirely NaN
Out[19]:
    0    1   2
0   1  6.5   3
1   1  NaN NaN
3 NaN  6.5   3

In [21]: df.dropna(axis=1,how='all')  # apply the same rule to columns; no column is entirely NaN, so nothing is dropped
Out[21]:
    0    1   2
0   1  6.5   3
1   1  NaN NaN
2 NaN  NaN NaN
3 NaN  6.5   3

In [22]: df.dropna(axis=1)
Out[22]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
Handling missing data in a DataFrame

  

  B) Filling NaN: the fillna method

In [88]: df
Out[88]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

In [90]: df.fillna(0)
Out[90]:
    one  two
a  1.40  0.0
b  7.10 -4.5
c  0.00  0.0
d  0.75 -1.3
Filling NaN

 

XI: Hierarchical indexing

In [30]: seri = pd.Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])
In [31]: seri
Out[31]:
a  1    0.528387
   2   -0.152286
   3   -0.776540
b  1    0.025425
   2   -1.412776
   3    0.969498
c  1    0.478260
   2    0.116301
d  2    1.464144
   3    2.266069
dtype: float64

In [32]: seri['a']
Out[32]:
1    0.528387
2   -0.152286
3   -0.776540
dtype: float64

In [33]: seri.index
Out[33]:
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

In [35]: seri['a':'c']
Out[35]:
a  1    0.528387
   2   -0.152286
   3   -0.776540
b  1    0.025425
   2   -1.412776
   3    0.969498
c  1    0.478260
   2    0.116301
dtype: float64

In [45]: seri.unstack()
Out[45]:
          1         2         3
a  0.528387 -0.152286 -0.776540
b  0.025425 -1.412776  0.969498
c  0.478260  0.116301       NaN
d       NaN  1.464144  2.266069

In [46]: seri.unstack().stack()
Out[46]:
a  1    0.528387
   2   -0.152286
   3   -0.776540
b  1    0.025425
   2   -1.412776
   3    0.969498
c  1    0.478260
   2    0.116301
d  2    1.464144
   3    2.266069
dtype: float64
Hierarchical indexing on a Series; the unstack method converts it into a DataFrame
In [48]: df = pd.DataFrame(np.arange(12).reshape((4,3)),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])

In [49]: df
Out[49]:
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

In [50]: df.index
Out[50]:
MultiIndex(levels=[[u'a', u'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [51]: df.columns
Out[51]:
MultiIndex(levels=[[u'Colorado', u'Ohio'], [u'Green', u'Red']],
           labels=[[1, 1, 0], [0, 1, 0]])

In [53]: df['Ohio']
Out[53]:
     Green  Red
a 1      0    1
  2      3    4
b 1      6    7
  2      9   10

# the original used .ix here; it was removed in pandas 1.0, and .loc is the label-based replacement
In [57]: df.loc['a','Ohio']
Out[57]:
   Green  Red
1      0    1
2      3    4

In [61]: df.loc['a','Ohio'].loc[1,'Red']
Out[61]: 1
Hierarchical indexing on a DataFrame
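
The listings above select by the outer level only. As a sketch of partial indexing on both levels, the example below uses fixed values (np.arange instead of random numbers, an assumption made so the results are reproducible) with the same two-level index.

```python
import numpy as np
import pandas as pd

seri = pd.Series(np.arange(10.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

# select by the outer level: the result is indexed by the inner level
outer = seri['b']

# select by the inner level: every outer label that has an inner label 2
inner = seri.loc[:, 2]

# pivot the inner level into columns; missing combinations become NaN
wide = seri.unstack()

print(outer)
print(inner)
print(wide)
```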

 

 

 
