DataFrame Group-wise Operations and Transformations

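Contents

Introduction

apply: general split-apply-combine

Suppressing group keys

Quantile and bucket analysis

Pivot tables

Cross-tabulations: crosstab

Example: filling missing values with group-specific values

Example: random sampling and permutation

Example: group weighted average and correlation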

 


Introduction

Suppose we want to add columns to a DataFrame that hold the mean of each group. One way is to aggregate first and then merge:

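>>> import numpy as np
>>> import pandas as pd
>>> from pandas import DataFrame, Series
>>> k1_means = df.groupby('key1').mean(numeric_only=True).add_prefix('mean_')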
>>> k1_means
      mean_data1  mean_data2
key1                        
a      -0.380460   -0.332537
b      -0.314586   -0.605574
>>> pd.merge(df,k1_means,left_on='key1',right_index=True)
      data1     data2 key1 key2  mean_data1  mean_data2
0 -0.291328  0.257737    a  one   -0.380460   -0.332537
1 -1.390843 -1.081238    a  two   -0.380460   -0.332537
4  0.540790 -0.174112    a  one   -0.380460   -0.332537
2  0.574857  0.202979    b  one   -0.314586   -0.605574
3 -1.204029 -1.414127    b  two   -0.314586   -0.605574

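This time we use the transform method on GroupBy.

transform applies a function to each group and then places the results in the appropriate positions of an object shaped like the original.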

>>> people = DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'],
...                    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
>>> key = ['one', 'two', 'one', 'two', 'one']
>>> people.groupby(key).mean()
            a         b         c         d         e
one  0.684081  0.110111 -0.122685 -0.392944  0.676586
two  0.295614 -0.488849  0.111023 -0.452018 -0.593795
>>> people.groupby(key).transform(np.mean)
               a         b         c         d         e
Joe     0.684081  0.110111 -0.122685 -0.392944  0.676586
Steve   0.295614 -0.488849  0.111023 -0.452018 -0.593795
Wes     0.684081  0.110111 -0.122685 -0.392944  0.676586
Jim     0.295614 -0.488849  0.111023 -0.452018 -0.593795
Travis  0.684081  0.110111 -0.122685 -0.392944  0.676586
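
The two approaches can be compared in a small self-contained sketch. The frame below is made up for illustration (the article's df is not shown), and the column list value_cols is an assumption of this example. It suggests that aggregate-then-merge and transform produce the same per-row group means.

```python
import pandas as pd

# Small made-up frame; the values are illustrative, not the article's data.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 50.0, 60.0],
})
value_cols = ['data1', 'data2']

# Approach 1: aggregate per group, then merge the means back onto the rows.
k1_means = df.groupby('key1')[value_cols].mean().add_prefix('mean_')
merged = pd.merge(df, k1_means, left_on='key1', right_index=True).sort_index()

# Approach 2: transform returns a result aligned with the original rows.
means = df.groupby('key1')[value_cols].transform('mean').add_prefix('mean_')
with_means = df.join(means)

print(with_means)
```

With transform there is no need to re-sort or re-align the result, because it already shares the index of the input.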

Suppose you want to subtract the mean from each group. To do so, we first create a demeaning function and then pass it to transform:

>>> def demean(arr):
...     return arr - arr.mean()
... 
>>> demeaned = people.groupby(key).transform(demean)
>>> demeaned
               a         b         c         d         e
Joe    -0.779960  0.893851 -1.448675 -0.091887 -0.162785
Steve  -0.323736  0.072072  0.659981 -0.131960 -0.498387
Wes     0.305050 -1.817776  0.450697 -0.454107 -0.952844
Jim     0.323736 -0.072072 -0.659981  0.131960  0.498387
Travis  0.474909  0.923925  0.997978  0.545994  1.115629

You can check that the group means of demeaned are zero.
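
One way to run that check is sketched below. The seeded generator rng is an assumption of this example, used only to make it reproducible; the grouping key matches the one implied by the output above. Because of floating-point rounding, the means are compared to zero with a tolerance rather than exactly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
people = pd.DataFrame(rng.standard_normal((5, 5)),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

def demean(arr):
    # Subtract the group's own mean from every value in the group.
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)

# Regroup the demeaned values; each group mean should be numerically zero.
group_means = demeaned.groupby(key).mean()
print(group_means)
print(np.allclose(group_means, 0.0))
```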

apply: general split-apply-combine

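Suppose you want to select the five highest tip_pct values by group. First, write a function that selects the rows with the largest values in a given column: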

>>> def top(df,n=5,columns='tip_pct'):
...     return df.sort_values(by=columns)[-n:]
>>> top(tips,n=6)
     total_bill   tip smoker  day    time  size   tip_pct
109       14.31  4.00    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39     No  Sat  Dinner     2  0.291990
67         3.07  1.00    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Yes  Sun  Dinner     2  0.710345

The top function is called on each piece of the DataFrame, and the results are then glued together with pandas.concat.

>>> tips.groupby(['smoker','day']).apply(top)
                 total_bill   tip smoker   day    time  size   tip_pct
smoker day                                                            
No     Fri  99        12.46  1.50     No   Fri  Dinner     2  0.120385
            94        22.75  3.25     No   Fri  Dinner     2  0.142857
            91        22.49  3.50     No   Fri  Dinner     2  0.155625
            223       15.98  3.00     No   Fri   Lunch     3  0.187735
       Sat  228       13.28  2.72     No   Sat  Dinner     2  0.204819
            108       18.24  3.76     No   Sat  Dinner     2  0.206140
            110       14.00  3.00     No   Sat  Dinner     2  0.214286
            20        17.92  4.08     No   Sat  Dinner     2  0.227679
            232       11.61  3.39     No   Sat  Dinner     2  0.291990
       Sun  46        22.23  5.00     No   Sun  Dinner     2  0.224921
            17        16.29  3.71     No   Sun  Dinner     3  0.227747
            6          8.77  2.00     No   Sun  Dinner     2  0.228050
            185       20.69  5.00     No   Sun  Dinner     5  0.241663
            51        10.29  2.60     No   Sun  Dinner     2  0.252672
       Thur 81        16.66  3.40     No  Thur   Lunch     2  0.204082
            139       13.16  2.75     No  Thur   Lunch     2  0.208967
            87        18.28  4.00     No  Thur   Lunch     2  0.218818
            88        24.71  5.85     No  Thur   Lunch     2  0.236746
            149        7.51  2.00     No  Thur   Lunch     2  0.266312
Yes    Fri  226       10.09  2.00    Yes   Fri   Lunch     2  0.198216
            100       11.35  2.50    Yes   Fri  Dinner     2  0.220264
            222        8.58  1.92    Yes   Fri   Lunch     1  0.223776
            221       13.42  3.48    Yes   Fri   Lunch     2  0.259314
            93        16.32  4.30    Yes   Fri  Dinner     2  0.263480
       Sat  171       15.81  3.16    Yes   Sat  Dinner     2  0.199873
            63        18.29  3.76    Yes   Sat  Dinner     4  0.205577
            214       28.17  6.50    Yes   Sat  Dinner     3  0.230742
            109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
            67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
       Sun  174       16.82  4.00    Yes   Sun  Dinner     2  0.237812
            181       23.33  5.65    Yes   Sun  Dinner     2  0.242177
            183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
            178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
            172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
       Thur 204       20.53  4.00    Yes  Thur   Lunch     4  0.194837
            205       16.47  3.23    Yes  Thur   Lunch     3  0.196114
            191       19.81  4.19    Yes  Thur   Lunch     2  0.211509
            200       18.71  4.00    Yes  Thur   Lunch     3  0.213789
            194       16.58  4.00    Yes  Thur   Lunch     2  0.241255
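
The same pattern can be tried without the tips file. The sketch below uses a few made-up rows in place of the real data set, and it selects the value columns before calling apply, which is a choice of this example rather than something the article does.

```python
import pandas as pd

# Made-up rows standing in for the tips data set.
tips = pd.DataFrame({
    'smoker': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
    'total_bill': [10.0, 20.0, 40.0, 10.0, 20.0, 50.0],
    'tip': [1.0, 3.0, 2.0, 2.0, 5.0, 5.0],
})
tips['tip_pct'] = tips['tip'] / tips['total_bill']

def top(df, n=5, columns='tip_pct'):
    # Sort ascending and keep the last n rows, i.e. the n largest values.
    return df.sort_values(by=columns)[-n:]

# Selecting the value columns explicitly keeps the grouping column out of
# the frames handed to top.
value_cols = ['total_bill', 'tip', 'tip_pct']
result = tips.groupby('smoker')[value_cols].apply(top, n=2)
print(result)
```

Extra keyword arguments such as n=2 are forwarded by apply to the function, so the same top can be reused with different settings.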

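Suppressing group keys

Passing group_keys=False to groupby keeps the group keys out of the index of the result: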

>>> tips.groupby('smoker',group_keys=False).apply(top)
     total_bill   tip smoker   day    time  size   tip_pct
88        24.71  5.85     No  Thur   Lunch     2  0.236746
185       20.69  5.00     No   Sun  Dinner     5  0.241663
51        10.29  2.60     No   Sun  Dinner     2  0.252672
149        7.51  2.00     No  Thur   Lunch     2  0.266312
232       11.61  3.39     No   Sat  Dinner     2  0.291990
109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
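
The effect on the index is easiest to see side by side. This is a minimal sketch with a hypothetical Series (scores) and helper (two_largest), neither of which appears in the article.

```python
import pandas as pd

# Hypothetical scores with one grouping label per row.
scores = pd.Series([3, 9, 5, 7, 1, 8], index=list('abcdef'))
labels = ['x', 'x', 'x', 'y', 'y', 'y']

def two_largest(s):
    # Keep the two largest values of the group, in ascending order.
    return s.sort_values()[-2:]

# Default: the group key becomes an extra outer index level.
keyed = scores.groupby(labels).apply(two_largest)

# group_keys=False: only the original labels remain in the index.
flat = scores.groupby(labels, group_keys=False).apply(two_largest)

print(keyed)
print(flat)
```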

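Quantile and bucket analysis

pd.cut splits data1 into four equal-width bins, while pd.qcut splits it into ten equal-size quantile buckets. Either result can be passed directly to groupby: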

>>> frame = DataFrame({'data1':np.random.randn(1000),'data2':np.random.randn(1000)})
>>> factor = pd.cut(frame.data1,4)
>>> factor[:10]
0     (-1.6, -0.026]
1     (-1.6, -0.026]
2     (1.548, 3.123]
3     (-1.6, -0.026]
4    (-0.026, 1.548]
5    (-0.026, 1.548]
6     (-1.6, -0.026]
7    (-0.026, 1.548]
8    (-0.026, 1.548]
9     (-1.6, -0.026]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.181, -1.6] < (-1.6, -0.026] < (-0.026, 1.548] <
                                    (1.548, 3.123]]
>>> def get_stats(group):
...     return {'min':group.min(),'max':group.max(),'count':group.count(),'mean':group.mean()}
... 
>>> grouped = frame.data2.groupby(factor)
>>> grouped.apply(get_stats).unstack()
                 count       max      mean       min
data1                                               
(-3.181, -1.6]    47.0  1.560586  0.067778 -3.094980
(-1.6, -0.026]   431.0  2.920156 -0.031899 -2.778233
(-0.026, 1.548]  460.0  2.339734 -0.057856 -2.739892
(1.548, 3.123]    62.0  1.728365 -0.143399 -2.449822
>>> grouping = pd.qcut(frame.data1,10,labels=False)
>>> grouped = frame.data2.groupby(grouping)
>>> grouped.apply(get_stats).unstack()
       count       max      mean       min
data1                                     
0      100.0  2.248114  0.069002 -3.094980
1      100.0  1.923236 -0.237785 -2.743977
2      100.0  2.920156  0.115480 -2.778233
3      100.0  2.481512 -0.060810 -2.581747
4      100.0  2.793314  0.030760 -2.595131
5      100.0  2.337741 -0.142877 -2.332392
6      100.0  2.339734 -0.046468 -2.589412
7      100.0  2.275533 -0.008744 -2.588843
8      100.0  1.901215 -0.095933 -2.739892
9      100.0  2.229256 -0.083296 -2.449822
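
The difference between the two binning functions can be checked directly. The sketch below swaps the get_stats/unstack combination for agg with a list of function names, which is an alternative chosen for this example; the seeded generator and the explicit observed=False are likewise assumptions made here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
frame = pd.DataFrame({'data1': rng.standard_normal(1000),
                      'data2': rng.standard_normal(1000)})
stats = ['min', 'max', 'count', 'mean']

# cut: 4 equal-width intervals, so the bucket sizes differ.
width_bins = pd.cut(frame['data1'], 4)
by_width = frame['data2'].groupby(width_bins, observed=False).agg(stats)

# qcut: 10 quantile-based buckets, so each holds the same number of rows.
quantile_bins = pd.qcut(frame['data1'], 10, labels=False)
by_quantile = frame['data2'].groupby(quantile_bins).agg(stats)

print(by_width)
print(by_quantile)
```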

Pivot tables

Pivot tables are easy to build in Excel; the following shows how to build them in Python with pandas.

Put sex and smoker on the rows:

>>> tips.pivot_table(index=['sex','smoker'])
                   size       tip   tip_pct  total_bill
sex    smoker                                          
FeMale No      2.765432  3.211235  0.162538   20.298765
       Yes     2.317073  2.965366  0.156436   21.481220
Male   No      2.557143  2.738000  0.155614   17.903286
       Yes     2.480769  3.042885  0.168526   20.184808

Aggregate only tip_pct and size, additionally grouping by day:

>>> tips.pivot_table(['tip_pct','size'],index=['sex','day'],columns='smoker')
                 size             tip_pct          
smoker             No       Yes        No       Yes
sex    day                                         
FeMale Fri   2.250000  2.000000  0.151650  0.174211
       Sat   2.826087  2.333333  0.160727  0.148431
       Sun   3.000000  2.555556  0.160367  0.153037
       Thur  2.555556  2.285714  0.167865  0.163618
Male   Fri        NaN  2.125000       NaN  0.175284
       Sat   2.272727  2.583333  0.155247  0.147513
       Sun   2.866667  2.600000  0.159884  0.218042
       Thur  2.388889  2.400000  0.148948  0.164035

Setting margins=True adds All row and column subtotals:

>>> tips.pivot_table(['tip_pct','size'],index=['sex','day'],columns='smoker',margins=True)
                 size              ...      tip_pct          
smoker             No       Yes    ...          Yes       All
sex    day                         ...                       
FeMale Fri   2.250000  2.000000    ...     0.174211  0.166007
       Sat   2.826087  2.333333    ...     0.148431  0.155329
       Sun   3.000000  2.555556    ...     0.153037  0.158535
       Thur  2.555556  2.285714    ...     0.163618  0.166990
Male   Fri        NaN  2.125000    ...     0.175284  0.175284
       Sat   2.272727  2.583333    ...     0.147513  0.151212
       Sun   2.866667  2.600000    ...     0.218042  0.174423
       Thur  2.388889  2.400000    ...     0.164035  0.154336
All          2.668874  2.408602    ...     0.163196  0.160803

[9 rows x 6 columns]

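Other aggregation functions can be passed through aggfunc. Here the values are summed, and fill_value=0 replaces empty combinations: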

>>> tips.pivot_table('size',index=['time','sex','smoker'],columns='day',aggfunc='sum',fill_value=0)
day                   Fri  Sat  Sun  Thur
time   sex    smoker                     
Dinner FeMale No        6   65   81     0
              Yes      10   42   23     0
       Male   No        0   50   86     2
              Yes      10   62   26     0
Lunch  FeMale No        3    0    0    69
              Yes       4    0    0    16
       Male   No        0    0    0    41
              Yes       7    0    0    24

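Passing len as aggfunc gives the group sizes. No fill_value is set here, so the empty combination shows up as NaN: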

>>> tips.pivot_table('tip_pct',index=['sex','smoker'],columns='day',margins=True,aggfunc=len)
day             Fri   Sat   Sun  Thur    All
sex    smoker                               
FeMale No       4.0  23.0  27.0  27.0   81.0
       Yes      7.0  18.0   9.0   7.0   41.0
Male   No       NaN  22.0  30.0  18.0   70.0
       Yes      8.0  24.0  10.0  10.0   52.0
All            19.0  87.0  76.0  62.0  244.0
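
The interplay of aggfunc, fill_value and margins can be seen on a frame small enough to verify by hand. The rows below are made up for this sketch and deliberately leave one combination empty.

```python
import pandas as pd

# Made-up party sizes; note there is no ('F', 'Sun') combination.
tips = pd.DataFrame({
    'sex': ['F', 'F', 'M', 'M', 'M'],
    'day': ['Sat', 'Sat', 'Sat', 'Sun', 'Sun'],
    'size': [2, 3, 2, 4, 1],
})

# Default aggregation is the mean; the empty combination becomes NaN.
means = tips.pivot_table(values='size', index='sex', columns='day')

# Sum instead, fill the empty cell with 0, and add 'All' subtotals.
totals = tips.pivot_table(values='size', index='sex', columns='day',
                          aggfunc='sum', fill_value=0, margins=True)

print(means)
print(totals)
```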


Cross-tabulations: crosstab

>>> data = DataFrame([[1,'F','R'],[2,'M','L'],[3,'F','R'],[4,'M','R'],[5,'M','L'],[6,'M','R'],[7,'F','R'],[8,'F','L'],[9,'M','R'],[10,'F','R']],columns=['Sample','Gender','Handedness'])
>>> data
   Sample Gender Handedness
0       1      F          R
1       2      M          L
2       3      F          R
3       4      M          R
4       5      M          L
5       6      M          R
6       7      F          R
7       8      F          L
8       9      M          R
9      10      F          R

The following table counts each combination of Gender and Handedness:

>>> pd.crosstab(data.Gender,data.Handedness,margins=True)
Handedness  L  R  All
Gender               
F           1  4    5
M           2  3    5
All         3  7   10

The first two arguments to crosstab can each be a Series, an array, or a list of arrays:

>>> pd.crosstab([tips.time,tips.day],tips.smoker,margins=True)
smoker        No  Yes  All
time   day                
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244
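
A cross-tabulation is a table of group counts, so it can also be written with groupby. The sketch below rebuilds the Gender/Handedness sample from above and compares the two; the groupby-based variant is an illustration added here, not part of the original walkthrough.

```python
import pandas as pd

# The ten Gender/Handedness rows from the sample data above.
data = pd.DataFrame({
    'Gender': list('FMFMMMFFMF'),
    'Handedness': list('RLRRLRRLRR'),
})

# crosstab counts every Gender/Handedness combination.
table = pd.crosstab(data['Gender'], data['Handedness'])

# The same counts via groupby: size per pair, then pivot Handedness to columns.
by_groupby = (data.groupby(['Gender', 'Handedness'])
                  .size()
                  .unstack(fill_value=0))

print(table)
print(by_groupby)
```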

Example: filling missing values with group-specific values

Fill NA values with the mean:

>>> s = Series(np.random.randn(6))
>>> s[::2] = np.nan
>>> s
0         NaN
1   -1.430336
2         NaN
3    0.937739
4         NaN
5    0.236223
dtype: float64
>>> s.fillna(s.mean())
0   -0.085458
1   -1.430336
2   -0.085458
3    0.937739
4   -0.085458
5    0.236223
dtype: float64

Suppose you want the fill value to vary by group. Just group the data and use apply with a function that calls fillna on each chunk of data:

>>> states = ['Ohio','New York','Vermont','Florida','Oregon','Nevada','California','Idaho']
>>> group_key = ['East']*4 + ['West'] *4
>>> data = Series(np.random.randn(8),index=states)
>>> data[['Vermont','Nevada','Idaho']] = np.nan
>>> data
Ohio         -0.734886
New York      1.573174
Vermont            NaN
Florida      -1.172843
Oregon        0.988466
Nevada             NaN
California   -1.872393
Idaho              NaN
dtype: float64
>>> data.groupby(group_key).mean()
East   -0.111518
West   -0.441964
dtype: float64

We fill the NA values with the group means:

>>> fill_mean = lambda g: g.fillna(g.mean())
>>> data.groupby(group_key).apply(fill_mean)
Ohio         -0.734886
New York      1.573174
Vermont      -0.111518
Florida      -1.172843
Oregon        0.988466
Nevada       -0.441964
California   -1.872393
Idaho        -0.441964
dtype: float64

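We can also predefine the fill value for each group in code. Because each group carries a name attribute, the function can use it to look up its value: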

>>> fill_values = {'East':0.5,'West':-1}
>>> fill_func = lambda g: g.fillna(fill_values[g.name])
>>> data.groupby(group_key).apply(fill_func)
Ohio         -0.734886
New York      1.573174
Vermont       0.500000
Florida      -1.172843
Oregon        0.988466
Nevada       -1.000000
California   -1.872393
Idaho        -1.000000
dtype: float64
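
Filling with group means can also be written with transform instead of apply. This is an alternative sketched here, not the article's method. It may be worth considering because, depending on the pandas version, apply can prepend the group key to the result index, whereas transform always keeps the original index. The fixed values below replace the random draws so the result is easy to check.

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
# Fixed values instead of random draws, so the result is easy to check.
data = pd.Series([1.0, 3.0, np.nan, 2.0, 4.0, np.nan, 6.0, np.nan],
                 index=states)

# transform broadcasts each group's mean back to the original index,
# so fillna can align it row by row.
group_means = data.groupby(group_key).transform('mean')
filled = data.fillna(group_means)
print(filled)
```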

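Example: random sampling and permutation

One way to draw a random sample is to take the first K elements of np.random.permutation(N), where N is the size of the complete data set and K is the desired sample size.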

>>> suits = ['H','S','C','D']
>>> card_val = (list(range(1,11))+[10]*3)*4
>>> card_val
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]
>>> base_names = ["A"] + list(range(2,11)) + ['J','K','Q']
>>> base_names
['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'K', 'Q']
>>> carda = []
>>> for suit in suits:
...     carda.extend(str(num) + suit for num in base_names)
>>> deck = Series(card_val,index=carda)
>>> deck[:13]
AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64
>>> def draw(deck,n=5):
...     return deck.take(np.random.permutation(len(deck))[:n])
... 
>>> draw(deck)
3H      3
KS     10
QC     10
JS     10
10C    10
dtype: int64
>>> get_suit = lambda card: card[-1]
>>> deck.groupby(get_suit).apply(draw,n=2)
C  9C      9
   JC     10
D  4D      4
   AD      1
H  4H      4
   10H    10
S  3S      3
   KS     10
dtype: int64
>>> deck.groupby(get_suit,group_keys=False).apply(draw,n=2)
7C     7
4C     4
AD     1
5D     5
9H     9
4H     4
6S     6
KS    10
dtype: int64
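
The whole example can be condensed into a self-contained Python 3 sketch. The seeded generator rng is an assumption added here for reproducibility; the article itself uses the global np.random state.

```python
import numpy as np
import pandas as pd

suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = [str(name) + suit for suit in suits for name in base_names]
deck = pd.Series(card_val, index=cards)

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

def draw(deck, n=5):
    # Shuffle the positions and keep the first n of them.
    return deck.take(rng.permutation(len(deck))[:n])

# The last character of each card name is its suit; draw 2 cards per suit.
hand = deck.groupby(lambda card: card[-1], group_keys=False).apply(draw, n=2)
print(hand)
```

Since permutation never repeats a position, each draw samples without replacement within its suit.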

Example: group weighted average and correlation

>>> df = DataFrame({'category':['a','a','a','a','b','b','b','b'],'data':np.random.randn(8),'weights':np.random.rand(8)})
>>> df
  category      data   weights
0        a -1.493554  0.300840
1        a -2.008278  0.693407
2        a  1.006548  0.736280
3        a -1.226051  0.128157
4        b -0.981050  0.327538
5        b -0.487632  0.201700
6        b -1.262182  0.201121
7        b -0.205049  0.206801
>>> grouped = df.groupby('category')
>>> get_wavg = lambda g: np.average(g['data'],weights = g['weights'])
>>> grouped.apply(get_wavg)
category
a   -0.676769
b   -0.763949
dtype: float64
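
What np.average computes per group can be spelled out as sum(w * x) / sum(w). The sketch below uses small made-up numbers so the result can be verified by hand. The manual calculation is an illustration added here, and selecting the data and weights columns before apply is a choice of this example.

```python
import numpy as np
import pandas as pd

# Made-up values chosen so the weighted means are easy to compute by hand.
df = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b'],
    'data': [1.0, 2.0, 3.0, 10.0, 20.0],
    'weights': [1.0, 1.0, 2.0, 3.0, 1.0],
})

def get_wavg(g):
    return np.average(g['data'], weights=g['weights'])

wavg = df.groupby('category')[['data', 'weights']].apply(get_wavg)

# Same quantity written out: sum(w * x) / sum(w) per group.
weighted_sum = (df['data'] * df['weights']).groupby(df['category']).sum()
manual = weighted_sum / df.groupby('category')['weights'].sum()

print(wavg)
print(manual)
```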

 
