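Python for Data Analysis: Data Aggregation and Group Operations (Part 2)

1. Group-wise operations and transformations

Aggregation is only one kind of group operation. This part introduces the transform and apply methods, which can perform many other group operations.

Suppose we want to add a column to a DataFrame containing the group mean for each row. One way to do this is to aggregate first, then merge:

import pandas as pd

# df is the example DataFrame from the previous part (columns data1, data2, key1, key2)
print(df)
# numeric_only=True skips the non-numeric key2 column when averaging
k1_means = df.groupby('key1').mean(numeric_only=True).add_prefix('mean_')
print(k1_means)
print(pd.merge(df, k1_means, left_on='key1', right_index=True))

Output: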

      data1     data2 key1 key2
0  1.297237  1.069077    a  one
1  1.586896 -0.679487    a  two
2 -0.866223  0.597460    b  one
3  1.440054  0.304970    b  two
4 -0.705452 -0.104290    a  one
      mean_data1  mean_data2
key1                        
a       0.726227    0.095100
b       0.286915    0.451215
      data1     data2 key1 key2  mean_data1  mean_data2
0  1.297237  1.069077    a  one    0.726227    0.095100
1  1.586896 -0.679487    a  two    0.726227    0.095100
4 -0.705452 -0.104290    a  one    0.726227    0.095100
2 -0.866223  0.597460    b  one    0.286915    0.451215
3  1.440054  0.304970    b  two    0.286915    0.451215

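This works, but it is somewhat inflexible. You can think of the operation as transforming the two data columns with the np.mean function. Consider the people DataFrame from earlier and use the transform method on the GroupBy object:

import numpy as np

key = ['one', 'two', 'one', 'two', 'one']
print(people)
print(people.groupby(key).mean())
# transform broadcasts each group's mean back to the rows of that group
print(people.groupby(key).transform(np.mean))

Output: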

               a         b         c         d         e
Joe     0.286169 -1.354119  0.171155 -1.654205  0.034074
Steve   1.436373  0.746910 -0.010747  0.481846 -0.291208
Wes    -1.678330       NaN       NaN -0.659455  0.502947
Jim     0.097660  0.010846  0.770573  0.451625 -0.867913
Travis -0.029736 -0.743793 -0.490787  0.204776  0.275947
            a         b         c         d         e
one -0.473966 -1.048956 -0.159816 -0.702961  0.270989
two  0.767017  0.378878  0.379913  0.466736 -0.579561
               a         b         c         d         e
Joe    -0.473966 -1.048956 -0.159816 -0.702961  0.270989
Steve   0.767017  0.378878  0.379913  0.466736 -0.579561
Wes    -0.473966 -1.048956 -0.159816 -0.702961  0.270989
Jim     0.767017  0.378878  0.379913  0.466736 -0.579561
Travis -0.473966 -1.048956 -0.159816 -0.702961  0.270989

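transform applies a function to each group and then places the results in the appropriate locations. If each group produces a scalar value, that value is broadcast to every row of the group.

Suppose you want to subtract the group mean from each value. First create a demeaning function, then pass it to transform:

def demean(arr):
    # subtract the group's own mean from every value in the group
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)
print(demeaned)

Output: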

               a         b         c         d         e
Joe     0.067355 -0.120862  0.487208  0.190557 -2.555154
Steve  -0.751458 -0.616967  0.113476 -0.433799  0.949293
Wes     0.062807       NaN       NaN  0.917042  2.490994
Jim     0.751458  0.616967 -0.113476  0.433799 -0.949293
Travis -0.130163  0.120862 -0.487208 -1.107599  0.064159

You can check that the group means of demeaned are now zero (up to floating-point error):

print(demeaned.groupby(key).mean())
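
As a self-contained illustration of the two transform behaviors described above (scalar broadcast and same-shape result), here is a minimal sketch. The DataFrame `scores`, its column names, and the group labels are made up for this example, and it assumes Python 3 with a recent pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the people DataFrame from the text
scores = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                       'y': [10.0, 20.0, 30.0, 40.0, 50.0]})
key = ['one', 'two', 'one', 'two', 'one']

grouped = scores.groupby(key)

# Scalar per group: the group mean is broadcast to every row of the group
group_means = grouped.transform('mean')

# Same-shape result: each value minus its own group's mean
demeaned = grouped.transform(lambda arr: arr - arr.mean())

print(group_means)
print(demeaned)
print(demeaned.groupby(key).mean())  # all entries should be (close to) zero
```

The result of transform always has the same shape and index as the input, which is what makes it convenient for adding derived columns.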


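2. apply: general split-apply-combine

Like aggregate, transform is a specialized function with strict requirements: the function passed to it must produce either a scalar value that can be broadcast (like np.mean) or a result array of the same size as the group. apply is the most general-purpose GroupBy method: it splits the object into pieces, calls the function on each piece, and then attempts to concatenate the results.

Returning to the tipping data set from before, suppose you want to select the five largest tip_pct values by group. First, write a function that selects the rows with the largest values in a particular column:

def top(df, n=5, column='tip_pct'):
    # sort_values replaces the sort_index(by=...) spelling removed from newer pandas
    return df.sort_values(by=column)[-n:]

print(top(tips, n=6))

Output: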

     total_bill   tip     sex smoker  day    time  size   tip_pct
109       14.31  4.00  Female    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Male    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39    Male     No  Sat  Dinner     2  0.291990
67         3.07  1.00  Female    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00  Female    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Male    Yes  Sun  Dinner     2  0.710345

Now, if we group by smoker and call apply with this function, we get the following:

print(tips.groupby('smoker').apply(top))

Output:

            total_bill   tip     sex smoker   day    time  size   tip_pct
smoker                                                                   
No     88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
       185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
       51        10.29  2.60  Female     No   Sun  Dinner     2  0.252672
       149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
       232       11.61  3.39    Male     No   Sat  Dinner     2  0.291990
Yes    109       14.31  4.00  Female    Yes   Sat  Dinner     2  0.279525
       183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
       67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
       178        9.60  4.00  Female    Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345

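The top function is called on each piece of the DataFrame, and the results are glued together with pandas.concat, labeling the pieces with the group names. The result therefore has a hierarchical index whose inner level contains index values from the original DataFrame.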
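
To see the shape of that result on a small scale, here is a minimal sketch with a made-up five-row table (the values and the name `mini_tips` are assumptions for illustration, not the tips data set). Selecting the value columns before apply is a choice made here so that the sketch behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical miniature table standing in for the tips data
mini_tips = pd.DataFrame({
    'smoker': ['No', 'Yes', 'No', 'Yes', 'No'],
    'total_bill': [10.0, 20.0, 30.0, 40.0, 50.0],
    'tip_pct': [0.10, 0.25, 0.15, 0.05, 0.20],
})

def top(df, n=2, column='tip_pct'):
    # keep the n rows with the largest values in `column`
    return df.sort_values(by=column)[-n:]

result = mini_tips.groupby('smoker')[['total_bill', 'tip_pct']].apply(top)
print(result)

# outer level: group name; inner level: row labels of the original frame
print(result.index.nlevels)
```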

If the function passed to apply takes other arguments or keywords, you can pass these after the function:

print(tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill'))

Output:

                 total_bill    tip     sex smoker   day    time  size  \
smoker day                                                              
No     Fri  94        22.75   3.25  Female     No   Fri  Dinner     2   
       Sat  212       48.33   9.00    Male     No   Sat  Dinner     4   
       Sun  156       48.17   5.00    Male     No   Sun  Dinner     6   
       Thur 142       41.19   5.00    Male     No  Thur   Lunch     5   
Yes    Fri  95        40.17   4.73    Male    Yes   Fri  Dinner     4   
       Sat  170       50.81  10.00    Male    Yes   Sat  Dinner     3   
       Sun  182       45.35   3.50    Male    Yes   Sun  Dinner     3   
       Thur 197       43.11   5.00  Female    Yes  Thur   Lunch     4   

                  tip_pct  
smoker day                 
No     Fri  94   0.142857  
       Sat  212  0.186220  
       Sun  156  0.103799  
       Thur 142  0.121389  
Yes    Fri  95   0.117750  
       Sat  170  0.196812  
       Sun  182  0.077178  
       Thur 197  0.115982

Earlier I called describe on a GroupBy object:

result = tips.groupby('smoker')['tip_pct'].describe()
print(result)
print(result.unstack('smoker'))

Output:

smoker       
No      count    151.000000
        mean       0.159328
        std        0.039910
        min        0.056797
        25%        0.136906
        50%        0.155625
        75%        0.185014
        max        0.291990
Yes     count     93.000000
        mean       0.163196
        std        0.085119
        min        0.035638
        25%        0.106771
        50%        0.153846
        75%        0.195059
        max        0.710345
dtype: float64
smoker          No        Yes
count   151.000000  93.000000
mean      0.159328   0.163196
std       0.039910   0.085119
min       0.056797   0.035638
25%       0.136906   0.106771
50%       0.155625   0.153846
75%       0.185014   0.195059
max       0.291990   0.710345

Inside GroupBy, when you invoke a method like describe, it is actually just a shortcut for the following two lines:

f = lambda x: x.describe()

grouped.apply(f)


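3. Suppressing the group keys

In the preceding examples, the resulting object has a hierarchical index formed from the group keys along with the index of the original object. You can disable this by passing group_keys=False to groupby:

print(tips.groupby('smoker', group_keys=False).apply(top))

Output: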

     total_bill   tip     sex smoker   day    time  size   tip_pct
88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
51        10.29  2.60  Female     No   Sun  Dinner     2  0.252672
149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
232       11.61  3.39    Male     No   Sat  Dinner     2  0.291990
109       14.31  4.00  Female    Yes   Sat  Dinner     2  0.279525
183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
178        9.60  4.00  Female    Yes   Sun  Dinner     2  0.416667
172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345


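4. Quantile and bucket analysis

pandas has some tools, in particular cut and qcut, for slicing data into buckets using bins of your choosing or sample quantiles. Combining these functions with groupby makes it convenient to perform bucket or quantile analysis on a data set. Consider a simple random data set and an equal-length bucket categorization using cut:

frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
factor = pd.cut(frame.data1, 4)  # four bins of equal width
print(factor[:10])

Output: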

0     (0.266, 1.999]
1    (-1.467, 0.266]
2     (0.266, 1.999]
3     (0.266, 1.999]
4     (0.266, 1.999]
5     (0.266, 1.999]
6    (-1.467, 0.266]
7    (-1.467, 0.266]
8     (0.266, 1.999]
9    (-1.467, 0.266]
Name: data1, dtype: category
Categories (4, object): [(-3.207, -1.467] < (-1.467, 0.266] < (0.266, 1.999] < (1.999, 3.733]]

The Categorical object returned by cut can be passed directly to groupby. So we can compute a set of statistics for data2 like so:

def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(factor)
print(grouped.apply(get_stats).unstack())

Output:

                  count       max      mean       min
data1                                                
(-3.319, -1.784]     49  2.838210 -0.033478 -2.326415
(-1.784, -0.254]    379  2.863969 -0.069882 -2.956759
(-0.254, 1.276]     479  2.986985 -0.054764 -2.701478
(1.276, 2.806]       93  2.737609  0.009805 -2.147242

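These are equal-length buckets. To get equal-size buckets based on sample quantiles, use qcut. Passing labels=False returns just the quantile numbers:

# return quantile numbers
grouping = pd.qcut(frame.data1, 10, labels=False)
grouped = frame.data2.groupby(grouping)
print(grouped.apply(get_stats).unstack())

Output: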

       count       max      mean       min
data1                                     
0        100  2.238949  0.043572 -2.568918
1        100  2.478662 -0.090292 -2.300763
2        100  2.958132  0.023370 -2.788298
3        100  1.667885 -0.154884 -1.980243
4        100  2.675465 -0.043593 -2.086390
5        100  2.604741  0.058040 -2.093968
6        100  1.852199 -0.078393 -2.197167
7        100  2.881980 -0.001233 -2.231411
8        100  2.579919 -0.026435 -3.883737
9        100  2.128517  0.011582 -2.288749

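Note: "equal-length buckets" means the intervals have equal width, while "equal-size buckets" means each bucket holds the same number of data points.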
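
The difference can be checked directly. The following is a minimal sketch using a seeded random generator; the seed and the use of agg with a list of function names are choices made for this illustration, not taken from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
frame = pd.DataFrame({'data1': rng.standard_normal(1000),
                      'data2': rng.standard_normal(1000)})

# cut: four intervals of (nearly) equal width, unequal counts
by_width = pd.cut(frame['data1'], 4)
width_counts = by_width.value_counts(sort=False)

# qcut: ten buckets with the same number of points each
by_quantile = pd.qcut(frame['data1'], 10, labels=False)
quantile_counts = by_quantile.value_counts(sort=False)

print(width_counts)
print(quantile_counts)

# per-bucket statistics of data2, one row per quantile bucket
stats = frame['data2'].groupby(by_quantile).agg(['count', 'min', 'max', 'mean'])
print(stats)
```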


5. Filling missing values with group-specific values

In the following example, NA values are filled with the mean:

s = pd.Series(np.random.randn(6))
s[::2] = np.nan
print(s)
print(s.fillna(s.mean()))

Output:

0         NaN
1    0.480783
2         NaN
3    0.781734
4         NaN
5   -0.260310
dtype: float64
0    0.334069
1    0.480783
2    0.334069
3    0.781734
4    0.334069
5   -0.260310
dtype: float64

Suppose you need the fill value to vary by group. Just group the data and use apply with a function that calls fillna on each data chunk.

states = ['Ohio', 'New York', 'Vermont', 'Florida', 'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
print(data)
print(data.groupby(group_key).mean())

Output:

Ohio         -0.356702
New York     -0.202025
Vermont            NaN
Florida      -1.008743
Oregon        1.093500
Nevada             NaN
California    0.309398
Idaho              NaN
dtype: float64
East   -0.522490
West    0.701449
dtype: float64

We can fill the NA values using the group means:

fill_mean = lambda g: g.fillna(g.mean())
print(data.groupby(group_key).apply(fill_mean))

Output:

Ohio          1.099918
New York     -0.381588
Vermont       0.194471
Florida      -0.134917
Oregon       -0.071170
Nevada       -0.131315
California   -0.191460
Idaho        -0.131315
dtype: float64

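Alternatively, you might have predefined fill values in your code that vary by group. Because each group has a name attribute set internally, we can use that:

fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
print(data.groupby(group_key).apply(fill_func))

Output: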

Ohio         -1.032449
New York      0.652871
Vermont       0.500000
Florida       0.081261
Oregon        0.355251
Nevada       -1.000000
California    0.578408
Idaho        -1.000000
dtype: float64
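
The effect of group-wise filling is easier to verify with fixed numbers than with random draws. Here is a minimal sketch with made-up values (the numbers are assumptions for illustration); group_keys=False is passed so that the result keeps the original state labels as its index.

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4

# Hypothetical fixed values instead of random draws
data = pd.Series([1.0, 2.0, np.nan, 3.0, 4.0, np.nan, 6.0, np.nan], index=states)

grouped = data.groupby(group_key, group_keys=False)

# fill each group's NA values with that group's mean
by_mean = grouped.apply(lambda g: g.fillna(g.mean()))

# fill each group's NA values with a predefined constant, looked up by group name
fill_values = {'East': 0.5, 'West': -1}
by_constant = grouped.apply(lambda g: g.fillna(fill_values[g.name]))

print(by_mean)
print(by_constant)
```

Here the East mean is 2.0 and the West mean is 5.0, so Vermont is filled with 2.0 and both Nevada and Idaho with 5.0.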


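6. Random sampling and permutation

Suppose you want to draw a random sample from a large data set for Monte Carlo simulation or some other analysis. There are a number of ways to perform the "draws", and some are much more efficient than others. One way is to select the first K elements of np.random.permutation(N), where N is the size of the complete data set and K is the desired sample size. As an example, here is a way to construct a deck of playing cards:

# Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
# list(range(...)) is needed on Python 3, where range is not a list
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)
deck = pd.Series(card_val, index=cards)

We now have a Series of length 52 whose index contains card names and whose values are the point values used in blackjack and other games (to keep things simple, I count the ace as 1):

print(deck[:13])

Output: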

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64

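Now, draw a hand of five cards from the deck:

def draw(deck, n=5):
    # take the first n positions of a random permutation of the deck
    return deck.take(np.random.permutation(len(deck))[:n])

print(draw(deck))

Output: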

4H     4
KD    10
9C     9
7S     7
7H     7
dtype: int64

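Suppose you want two random cards from each suit. Because the suit is the last character of each card name, we can group on it and use apply:

get_suit = lambda card: card[-1]  # the last letter is the suit
print(deck.groupby(get_suit).apply(draw, n=2))

Output: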

C  JC    10
   2C     2
D  8D     8
   KD    10
H  6H     6
   9H     9
S  AS     1
   QS    10
dtype: int64

Alternatively, pass group_keys=False to leave the suit out of the result index:

print(deck.groupby(get_suit, group_keys=False).apply(draw, n=2))

Output:

QC     10
AC      1
KD     10
10D    10
10H    10
9H      9
10S    10
2S      2
dtype: int64
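
The random draws above cannot be reproduced exactly, so here is a minimal seeded sketch of the same idea. The `rng` argument and the seed value are additions made for this illustration; the draw still takes the first n entries of a random permutation.

```python
import numpy as np
import pandas as pd

suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = [str(num) + suit for suit in suits for num in base_names]
deck = pd.Series(card_val, index=cards)

rng = np.random.default_rng(42)  # hypothetical seed, only for repeatability

def draw(deck, n=5, rng=rng):
    # first n positions of a random permutation = sample without replacement
    return deck.take(rng.permutation(len(deck))[:n])

hand = draw(deck)
two_per_suit = deck.groupby(lambda card: card[-1], group_keys=False).apply(draw, n=2)

print(hand)
print(two_per_suit)
```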


7. Group weighted average and correlation

Under the split-apply-combine paradigm of groupby, operations between columns in a DataFrame or between two Series, such as a group weighted average, become routine. As an example, take this data set containing group keys, values, and some weights:

df = pd.DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'data': np.random.randn(8),
                   'weights': np.random.rand(8)})
print(df)

Output:

  category      data   weights
0        a -0.465809  0.949090
1        a -1.618565  0.368662
2        a -1.109244  0.601452
3        a  0.287113  0.168754
4        b -0.896898  0.165138
5        b  0.625344  0.537256
6        b  2.129728  0.465858
7        b -0.292746  0.484719

The group weighted average by category is then:

grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
print(grouped.apply(get_wavg))

Output:

category
a   -0.261281
b   -0.059888
dtype: float64
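
For a version whose result can be checked by hand, here is a minimal sketch with fixed made-up numbers (the values are assumptions for illustration). The two value columns are selected before apply, a choice made here so the function only sees those columns.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data so the weighted averages are easy to verify
df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'data': [1.0, 3.0, 10.0, 20.0],
                   'weights': [1.0, 3.0, 1.0, 1.0]})

def get_wavg(g):
    # sum(data * weights) / sum(weights) within one group
    return np.average(g['data'], weights=g['weights'])

wavg = df.groupby('category')[['data', 'weights']].apply(get_wavg)
print(wavg)
```

Group a gives (1*1 + 3*3) / 4 = 2.5, and group b, with equal weights, gives the plain mean 15.0.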

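As a more realistic example, consider a data set from Yahoo! Finance containing closing prices for a few stocks and the SPX index:

close_px = pd.read_csv('data/stock_px.csv', parse_dates=True, index_col=0)
print(close_px[-4:])

Output: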

              AAPL   MSFT    XOM      SPX
2011-10-11  400.29  27.00  76.27  1195.54
2011-10-12  402.19  26.96  77.16  1207.25
2011-10-13  408.43  27.18  76.37  1203.66
2011-10-14  422.00  27.27  78.11  1224.58

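Next, compute a DataFrame consisting of the yearly correlations of daily returns (computed from percent changes) with SPX:

rets = close_px.pct_change().dropna()
spx_corr = lambda x: x.corrwith(x['SPX'])
by_year = rets.groupby(lambda x: x.year)  # group rows by the year of their date
print(by_year.apply(spx_corr))

Output: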

          AAPL      MSFT       XOM  SPX
2003  0.541124  0.745174  0.661265    1
2004  0.374283  0.588531  0.557742    1
2005  0.467540  0.562374  0.631010    1
2006  0.428267  0.406126  0.518514    1
2007  0.508118  0.658770  0.786264    1
2008  0.681434  0.804626  0.828303    1
2009  0.707103  0.654902  0.797921    1
2010  0.710105  0.730118  0.839057    1
2011  0.691931  0.800996  0.859975    1

You can also compute correlations between individual columns:

# yearly correlation between Apple and Microsoft
print(by_year.apply(lambda g: g['AAPL'].corr(g['MSFT'])))

Output:

2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64
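
The same pattern can be tried without the CSV file. Below is a minimal sketch on synthetic returns; the ticker `AAA`, the date range, and the seed are made up for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # hypothetical seed
dates = pd.date_range('2010-01-01', '2011-12-31', freq='B')  # business days

# Synthetic daily returns: AAA partly follows SPX, plus independent noise
spx = rng.normal(0, 0.01, len(dates))
rets = pd.DataFrame({'AAA': 0.8 * spx + rng.normal(0, 0.01, len(dates)),
                     'SPX': spx}, index=dates)

by_year = rets.groupby(lambda d: d.year)

# one row per year, one column per ticker: correlation with SPX
corr_with_spx = by_year.apply(lambda x: x.corrwith(x['SPX']))
print(corr_with_spx)

# one value per year: correlation between two chosen columns
pair_corr = by_year.apply(lambda g: g['AAA'].corr(g['SPX']))
print(pair_corr)
```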


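8. Group-wise linear regression

In the same vein as the previous example, you can use groupby to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or a scalar value. For example, I can define the following regress function (using the statsmodels library), which runs an ordinary least squares (OLS) regression on each chunk of data:

import statsmodels.api as sm

def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars].copy()  # copy so that adding a column does not touch the original
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

To run a yearly linear regression of AAPL on SPX returns, execute:

print(by_year.apply(regress, 'AAPL', ['SPX']))

Output: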

           SPX  intercept
2003  1.195406   0.000710
2004  1.363463   0.004201
2005  1.766415   0.003246
2006  1.645496   0.000080
2007  1.198761   0.003438
2008  0.968016  -0.001110
2009  0.879103   0.002954
2010  1.052608   0.001261
2011  0.806605   0.001514
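
If statsmodels is not available, the same group-wise fit can be sketched with NumPy alone. This swaps sm.OLS for np.linalg.lstsq; the data, the column names `x` and `y`, and the helper name `regress_np` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: within each group, y is an exact linear function of x
df = pd.DataFrame({'group': ['g1'] * 4 + ['g2'] * 4,
                   'x': [0.0, 1.0, 2.0, 3.0] * 2})
df['y'] = np.where(df['group'] == 'g1', 2.0 * df['x'] + 0.5, -1.0 * df['x'] + 3.0)

def regress_np(data, yvar, xvar):
    # design matrix with the regressor and a constant column for the intercept
    X = np.column_stack([data[xvar].to_numpy(), np.ones(len(data))])
    coef, *_ = np.linalg.lstsq(X, data[yvar].to_numpy(), rcond=None)
    return pd.Series(coef, index=[xvar, 'intercept'])

params = df.groupby('group')[['x', 'y']].apply(regress_np, 'y', 'x')
print(params)
```

Because the function returns a Series with the same labels for every group, apply assembles the results into a DataFrame with one row per group.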


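9. Pivot tables

A pivot table aggregates a table of data by one or more keys, arranging the results in a rectangle with some of the group keys along the rows and some along the columns. DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function. In addition to providing a convenient interface to groupby, pivot_table can add partial totals, also known as margins.

Returning to the tipping data set, suppose I want a table of group means (the default pivot_table aggregation type) arranged by sex and smoker on the rows:

print(tips.pivot_table(index=['sex', 'smoker']))

Output: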

                   size       tip   tip_pct  total_bill
sex    smoker                                          
Female No      2.592593  2.773519  0.156921   18.105185
       Yes     2.242424  2.931515  0.182150   17.977879
Male   No      2.711340  3.113402  0.160669   19.791237
       Yes     2.500000  3.051167  0.152771   22.284500

Now suppose we want to aggregate only tip_pct and size, and additionally group by day. I put smoker in the table columns and day in the rows:

print(tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
                       columns='smoker'))

Output:

              tip_pct                size          
smoker             No       Yes        No       Yes
sex    day                                         
Female Fri   0.165296  0.209129  2.500000  2.000000
       Sat   0.147993  0.163817  2.307692  2.200000
       Sun   0.165710  0.237075  3.071429  2.500000
       Thur  0.155971  0.163073  2.480000  2.428571
Male   Fri   0.138005  0.144730  2.000000  2.125000
       Sat   0.162132  0.139067  2.656250  2.629630
       Sun   0.158291  0.173964  2.883721  2.600000
       Thur  0.165706  0.164417  2.500000  2.300000

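This table can be augmented with partial totals by passing margins=True. This adds row and column labels named All, whose values are the group statistics for all the data within a single tier. In the example below, the All values are means that do not distinguish between smokers and non-smokers (the All columns) or between the two levels of grouping on the rows (the All row):

print(tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
                       columns='smoker', margins=True))

Output: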

              tip_pct                          size                    
smoker             No       Yes       All        No       Yes       All
sex    day                                                             
Female Fri   0.165296  0.209129  0.199388  2.500000  2.000000  2.111111
       Sat   0.147993  0.163817  0.156470  2.307692  2.200000  2.250000
       Sun   0.165710  0.237075  0.181569  3.071429  2.500000  2.944444
       Thur  0.155971  0.163073  0.157525  2.480000  2.428571  2.468750
Male   Fri   0.138005  0.144730  0.143385  2.000000  2.125000  2.100000
       Sat   0.162132  0.139067  0.151577  2.656250  2.629630  2.644068
       Sun   0.158291  0.173964  0.162344  2.883721  2.600000  2.810345
       Thur  0.165706  0.164417  0.165276  2.500000  2.300000  2.433333
All          0.159328  0.163196  0.160803  2.668874  2.408602  2.569672

To use a different aggregation function, pass it to aggfunc. For example, 'count' or len gives a cross-tabulation of group sizes:

print(tips.pivot_table('tip_pct', index=['sex', 'smoker'], columns='day',
                       aggfunc=len, margins=True))

Output:

day            Fri  Sat  Sun  Thur  All
sex    smoker                          
Female No        2   13   14    25   54
       Yes       7   15    4     7   33
Male   No        2   32   43    20   97
       Yes       8   27   15    10   60
All             19   87   76    62  244

If some combinations are empty (or otherwise NA), you may wish to pass a fill_value:

print(tips.pivot_table('size', index=['time', 'sex', 'smoker'],
                       columns='day', aggfunc='sum', fill_value=0))

Output:

day                   Fri  Sat  Sun  Thur
time   sex    smoker                     
Dinner Female No        2   30   43     2
              Yes       8   33   10     0
       Male   No        4   85  124     0
              Yes      12   71   39     0
Lunch  Female No        3    0    0    60
              Yes       6    0    0    17
       Male   No        0    0    0    50
              Yes       5    0    0    23
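
The margins and fill_value options can be tried on a table small enough to check by eye. This is a minimal sketch; the frame `mini` and the column name `party_size` are made up for this illustration and are not the tips data.

```python
import pandas as pd

# Hypothetical miniature table standing in for the tips data
mini = pd.DataFrame({
    'sex':        ['F', 'F', 'M', 'M', 'M'],
    'smoker':     ['No', 'Yes', 'No', 'No', 'Yes'],
    'day':        ['Sat', 'Sun', 'Sat', 'Sat', 'Sat'],
    'party_size': [2, 3, 4, 2, 1],
})

# group sizes per (sex, smoker) and day, with All totals
counts = mini.pivot_table('party_size', index=['sex', 'smoker'], columns='day',
                          aggfunc='count', margins=True)
print(counts)

# sums per cell; combinations with no rows become 0 instead of NaN
sums = mini.pivot_table('party_size', index=['sex', 'smoker'], columns='day',
                        aggfunc='sum', fill_value=0)
print(sums)
```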



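10. Cross-tabulations: crosstab

A cross-tabulation (crosstab for short) is a special case of a pivot table that computes group frequencies.

print(pd.crosstab([tips.time, tips.day], tips.smoker, margins=True))

Output: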


smoker        No  Yes  All
time   day                
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244
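
To make the relationship with groupby concrete, here is a minimal sketch that computes the same frequencies both ways. The frame `mini` and its values are made up for this illustration.

```python
import pandas as pd

# Hypothetical miniature table
mini = pd.DataFrame({
    'time':   ['Dinner', 'Dinner', 'Lunch', 'Lunch', 'Dinner'],
    'smoker': ['No', 'Yes', 'No', 'No', 'No'],
})

# frequency table with All totals
freq = pd.crosstab(mini['time'], mini['smoker'], margins=True)
print(freq)

# the same counts computed with groupby, for comparison
by_group = mini.groupby(['time', 'smoker']).size().unstack(fill_value=0)
print(by_group)
```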


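11. Example: 2012 Federal Election Commission database

The data set includes contributor names, occupations, employers, addresses, and contribution amounts.

fec = pd.read_csv('data/P00000001-ALL.csv')
# .loc replaces the .ix indexer, which was removed from newer pandas
print(fec.loc[123456])

Output: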

cmte_id                             C00431445
cand_id                             P80003338
cand_nm                         Obama, Barack
contbr_nm                         ELLMAN, IRA
contbr_city                             TEMPE
contbr_st                                  AZ
contbr_zip                          852816719
contbr_employer      ARIZONA STATE UNIVERSITY
contbr_occupation                   PROFESSOR
contb_receipt_amt                          50
contb_receipt_dt                    01-DEC-11
receipt_desc                              NaN
memo_cd                                   NaN
memo_text                                 NaN
form_tp                                 SA17A
file_num                               772372
Name: 123456, dtype: object

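The data contains no party affiliation, so it is useful to add it. You can get the list of all unique candidates with unique (depending on the NumPy version, the strings may be printed with or without surrounding quotes):

unique_cands = fec.cand_nm.unique()
print(unique_cands)

Output: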

['Bachmann, Michelle' 'Romney, Mitt' 'Obama, Barack'
 "Roemer, Charles E. 'Buddy' III" 'Pawlenty, Timothy' 'Johnson, Gary Earl'
 'Paul, Ron' 'Santorum, Rick' 'Cain, Herman' 'Gingrich, Newt'
 'McCotter, Thaddeus G' 'Huntsman, Jon' 'Perry, Rick']

A simple way to indicate party affiliation is with a dict:

parties={'Bachmann, Michelle':'Republican',
         'Cain, Herman':'Republican',
         'Gingrich, Newt':'Republican',
         'Huntsman, Jon':'Republican',
         'Johnson, Gary Earl':'Republican',
         'McCotter, Thaddeus G':'Republican',
         'Obama, Barack':'Democrat',
         'Paul, Ron':'Republican',
         'Pawlenty, Timothy':'Republican',
         'Perry, Rick':'Republican',
         "Roemer, Charles E. 'Buddy' III":'Republican',
         'Romney, Mitt':'Republican',
         'Santorum, Rick':'Republican'}

Now, using this mapping and the map method on Series objects, you can compute an array of political parties from the candidate names:

print(fec.cand_nm[123456:123461])
print(fec.cand_nm[123456:123461].map(parties))

Output:

123456    Obama, Barack
123457    Obama, Barack
123458    Obama, Barack
123459    Obama, Barack
123460    Obama, Barack
Name: cand_nm, dtype: object
123456    Democrat
123457    Democrat
123458    Democrat
123459    Democrat
123460    Democrat
Name: cand_nm, dtype: object

# add it as a new column
fec['party'] = fec.cand_nm.map(parties)
print(fec['party'].value_counts())

Output:

Democrat      593746
Republican    407985
Name: party, dtype: int64
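
One detail worth knowing about map with a dict: names that are missing from the dict become NaN rather than raising an error. A minimal sketch, using a made-up candidate column in which the last name is deliberately absent from the mapping:

```python
import pandas as pd

parties = {'Obama, Barack': 'Democrat', 'Romney, Mitt': 'Republican'}

# Hypothetical candidate column; 'Doe, Jane' is not a key of the dict
cand_nm = pd.Series(['Obama, Barack', 'Romney, Mitt', 'Obama, Barack', 'Doe, Jane'])

party = cand_nm.map(parties)
print(party)
print(party.value_counts())   # NaN entries are not counted by default
print(party.isna().sum())     # number of rows without a mapping
```

Checking isna().sum() after such a mapping is a cheap way to catch candidates that were left out of the dict.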

There are two things to note here. First, this data includes both contributions and refunds (negative contribution amounts):

print((fec.contb_receipt_amt > 0).value_counts())

Output:

True     991475
False     10256
Name: contb_receipt_amt, dtype: int64

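To simplify the analysis, I restrict the data set to positive contributions:

fec = fec[fec.contb_receipt_amt > 0]

Second, because Barack Obama and Mitt Romney are the main two candidates, I also prepare a subset that contains only contributions to their campaigns:

fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]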


12. Donation statistics by occupation and employer

First, count the number of donations by occupation:

print(fec.contbr_occupation.value_counts()[:10])

Output:

RETIRED                                   233990
INFORMATION REQUESTED                      35107
ATTORNEY                                   34286
HOMEMAKER                                  29931
PHYSICIAN                                  23432
INFORMATION REQUESTED PER BEST EFFORTS     21138
ENGINEER                                   14334
TEACHER                                    13990
CONSULTANT                                 13273
PROFESSOR                                  12555
Name: contbr_occupation, dtype: int64

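Many of these occupations refer to the same basic job type, or there are several variants of the same thing. The following snippet cleans up a few of them by mapping one occupation to another. Note the trick of using dict.get, which lets occupations with no mapping pass through unchanged: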

occ_mapping={
    'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
    'INFORMATION REQUESTED':'NOT PROVIDED',
    'INFORMATION REQUESTED (BEST EFFORTS)':'NOT PROVIDED',
    'C.E.O':'CEO'
}
# if no mapping is provided, return x
f = lambda x: occ_mapping.get(x, x)
fec.contbr_occupation = fec.contbr_occupation.map(f)

I do the same thing for employers:

emp_mapping={
    'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
    'INFORMATION REQUESTED':'NOT PROVIDED',
    'SELF':'SELF-EMPLOYED',
    'SELF EMPLOYED':'SELF-EMPLOYED'
}
# if no mapping is provided, return x
f = lambda x: emp_mapping.get(x, x)
fec.contbr_employer = fec.contbr_employer.map(f)
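
The pass-through behavior of dict.get(x, x) is easy to see on a tiny example. This is a minimal sketch; the mapping is a trimmed-down, hypothetical version and the occupation values are made up for illustration.

```python
import pandas as pd

# Hypothetical, shortened mapping
occ_mapping = {'INFORMATION REQUESTED': 'NOT PROVIDED',
               'C.E.O.': 'CEO'}

# Hypothetical occupation values
occupation = pd.Series(['C.E.O.', 'ENGINEER', 'INFORMATION REQUESTED', 'CEO'])

# dict.get(x, x) returns the mapped value, or x itself when x is not a key
cleaned = occupation.map(lambda x: occ_mapping.get(x, x))
print(cleaned.tolist())

# passing the dict directly would turn unmapped values into NaN instead
direct = occupation.map(occ_mapping)
print(direct.tolist())
```

Note that the keys must match the raw values exactly, including punctuation, or the value passes through unchanged.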

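Now you can use pivot_table to aggregate the data by party and occupation, then filter down to the occupations that donated at least $2 million overall:

by_occupation = fec.pivot_table('contb_receipt_amt', index='contbr_occupation',
                                columns='party', aggfunc='sum')
over_2mm = by_occupation[by_occupation.sum(axis=1) > 2000000]
print(over_2mm)

Output: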

party                 Democrat       Republican
contbr_occupation                              
ATTORNEY           11141982.97   7477194.430000
C.E.O.                 1690.00   2592983.110000
CEO                 2074284.79   1640758.410000
CONSULTANT          2459912.71   2544725.450000
ENGINEER             951525.55   1818373.700000
EXECUTIVE           1355161.05   4138850.090000
HOMEMAKER           4248875.80  13634275.780000
INVESTOR             884133.00   2431768.920000
LAWYER              3160478.87    391224.320000
MANAGER              762883.22   1444532.370000
NOT PROVIDED        4866973.96  20565473.010000
OWNER               1001567.36   2408286.920000
PHYSICIAN           3735124.94   3594320.240000
PRESIDENT           1878509.95   4720923.760000
PROFESSOR           2165071.08    296702.730000
REAL ESTATE          528902.09   1625902.250000
RETIRED            25305116.38  23561244.489999
SELF-EMPLOYED        672393.40   1640252.540000

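It is easier to look at this data as a bar plot:

over_2mm.plot(kind='barh')

You might also be interested in the top donor occupations and employers for Obama and Romney. To do this, group by candidate name and use a variant of the top function from earlier: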

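def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)['contb_receipt_amt'].sum()

    # sort totals in descending order and keep the first n entries
    # (sort_values replaces Series.order, which was removed from newer pandas)
    return totals.sort_values(ascending=False)[:n]

Then aggregate by occupation and employer:

grouped = fec_mrbo.groupby('cand_nm')
print(grouped.apply(get_top_amounts, 'contbr_occupation', n=7))
print(grouped.apply(get_top_amounts, 'contbr_employer', n=10))

Output (generated with the slice written as [n:], which drops the top n entries instead of keeping them, so it lists everything after the top n; with [:n] only the n largest totals per candidate remain):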

cand_nm        contbr_occupation                     
Obama, Barack  PROFESSOR                                 2165071.08
               CEO                                       2073284.79
               PRESIDENT                                 1878509.95
               NOT EMPLOYED                              1709188.20
               EXECUTIVE                                 1355161.05
               TEACHER                                   1250969.15
               WRITER                                    1084188.88
               OWNER                                     1001567.36
               ENGINEER                                   951525.55
               INVESTOR                                   884133.00
               ARTIST                                     763125.00
               MANAGER                                    762883.22
               SELF-EMPLOYED                              672393.40
               STUDENT                                    628099.75
               REAL ESTATE                                528902.09
               CHAIRMAN                                   496547.00
               ARCHITECT                                  483859.89
               DIRECTOR                                   471741.73
               BUSINESS OWNER                             449979.30
               EDUCATOR                                   436600.89
               PSYCHOLOGIST                               427299.92
               SOFTWARE ENGINEER                          396985.65
               PARTNER                                    395759.50
               SALES                                      392886.91
               EXECUTIVE DIRECTOR                         348180.94
               MANAGING DIRECTOR                          329688.25
               SOCIAL WORKER                              326844.43
               VICE PRESIDENT                             325647.15
               ADMINISTRATOR                              323079.26
               SCIENTIST                                  319227.88
   
Romney, Mitt   NON-PROFIT VETERANS ORG. CHAIR/ANNUITA         10.00
               PARAPLANNER                                    10.00
               APPRAISAL                                      10.00
               SIGN CONTRACTOR                                10.00
               POLITICAL OPERATIVE                            10.00
               PORT MGT                                       10.00
               PRESIDENT EMERITUS                             10.00
               CONTRACTS SPECIALIST                            9.00
               TEACHER & FREE-LANCE JOURNALIST                 9.00
               FOUNDATION CONSULTANT                           6.00
               MAIL HANDLER                                    6.00
               TREASURER & DIRECTOR OF FINANCE                 6.00
               SECRETARY/BOOKKEPPER                            6.00
               ELAYNE WELLS HARMER                             6.00
               CHICKEN GRADER                                  5.00
               DIRECTOR REISCHAUER CENTER FOR EAST A           5.00
               SCOTT GREENBAUM                                 5.00
               EDUCATION ADMIN                                 5.00
               ENGINEER/RISK EXPERT                            5.00
               PLANNING AND OPERATIONS ANALYST                 5.00
               VILLA NOVA                                      5.00
               FINANCIAL INSTITUTION - CEO                     5.00
               HORTICULTURIST                                  5.00
               MD - UROLOGIST                                  5.00
               DISTRICT REPRESENTATIVE                         5.00
               INDEPENDENT PROFESSIONAL                        3.00
               REMODELER & SEMI RETIRED                        3.00
               AFFORDABLE REAL ESTATE DEVELOPER                3.00
               IFC CONTRACTING SOLUTIONS                       3.00
               3RD GENERATION FAMILY BUSINESS OWNER            3.00
Name: contb_receipt_amt, dtype: float64
cand_nm        contbr_employer                
Obama, Barack  SIDLEY AUSTIN LLP                  168254.00
               REFUSED                            149516.07
               DLA PIPER                          148235.00
               HARVARD UNIVERSITY                 131368.94
               IBM                                128490.93
               GOOGLE                             125302.88
               MICROSOFT CORPORATION              108849.00
               KAISER PERMANENTE                  104949.95
               JONES DAY                          103712.50
               STANFORD UNIVERSITY                101630.75
               COLUMBIA UNIVERSITY                 96325.12
               UNIVERSITY OF CHICAGO               88575.00
               AT&T                                88132.12
               US GOVERNMENT                       87689.00
               MORGAN & MORGAN                     87250.00
               VERIZON                             85318.30
               UNIVERSITY OF MICHIGAN              84856.33
               DISABLED                            78417.87
               UCLA                                78092.50
               ARNOLD & PORTER LLP                 76330.00
               OBAMA FOR AMERICA                   72028.89
               UNIVERSITY OF WASHINGTON            70445.86
               NORTHWESTERN UNIVERSITY             69489.05
               DEPARTMENT OF DEFENSE               67253.40
               COMCAST                             65158.00
               US ARMY                             64768.91
               FEDERAL GOVERNMENT                  64590.26
               WELLS FARGO                         62749.60
               UNIVERSITY OF CALIFORNIA            62432.00
               SKADDEN ARPS                        61904.00
   
Romney, Mitt   SMILE MANAGEMENT                        6.00
               ENERGY ALLOYS                           6.00
               MORGAN STANLEY SMITH BARNEY LLC         5.00
               PEACE FROGS INC.                        5.00
               LEGACY SCHOOL                           5.00
               APPLIANCE INSTALLATIONS INC.            5.00
               PAULA HAWKINS & ASSOCIATES              5.00
               VILLA NOVA FINANCING GROUP LLC          5.00
               PICKET FENCES INC.                      5.00
               AA FLIPPEN ASSOCIATES                   5.00
               PLUM HEALTHCARE                         5.00
               SCOTT GREENBAUM                         5.00
               GOLDIE'S SALON                          5.00
               PACIFIC BIOSCIENCES                     5.00
               LEAVITT INSURANCE AGENCY                5.00
               INTERSTELLAR HOLDINGS LLC               5.00
               CA STATE SENATE                         5.00
               R. A. RAUCH & ASSOCIATES                5.00
               RST GLOBAL LICENSING LLC                5.00
               SAIS/JOHNS HOPKINS UNIVERSITY           5.00
               EASTHAM CAPITAL                         5.00
               GREGORY GALLIVAN                        5.00
               DIRECT LENDERS LLC                      5.00
               LOUGH INVESTMENT ADVISORY LLC           4.00
               WATERWORKS INDUSRTIES                   3.00
               WILL MERRIFIELD                         3.00
               HONOLD COMMUNICTAIONS                   3.00
               INDEPENDENT PROFESSIONAL                3.00
               UPTOWN CHEAPSKATE                       3.00
               UN                                      3.00
Name: contb_receipt_amt, dtype: float64


13. Bucketing donation amounts

Use the cut function to discretize the contribution amounts into buckets (bins) by size

bins=np.array([0,1,10,100,1000,10000,100000,1000000,10000000])
labels=pd.cut(fec_mrbo.contb_receipt_amt,bins)
print(labels)

Output:

411           (10, 100]
412         (100, 1000]
413         (100, 1000]
414           (10, 100]
415           (10, 100]
416           (10, 100]
417         (100, 1000]
418           (10, 100]
419         (100, 1000]
420           (10, 100]
421           (10, 100]
422         (100, 1000]
423         (100, 1000]
424         (100, 1000]
425         (100, 1000]
426         (100, 1000]
427       (1000, 10000]
428         (100, 1000]
429         (100, 1000]
430           (10, 100]
431       (1000, 10000]
432         (100, 1000]
433         (100, 1000]
434         (100, 1000]
435         (100, 1000]
436         (100, 1000]
437           (10, 100]
438         (100, 1000]
439         (100, 1000]
440           (10, 100]
     
701356        (10, 100]
701357          (1, 10]
701358        (10, 100]
701359        (10, 100]
701360        (10, 100]
701361        (10, 100]
701362      (100, 1000]
701363        (10, 100]
701364        (10, 100]
701365        (10, 100]
701366        (10, 100]
701367        (10, 100]
701368      (100, 1000]
701369        (10, 100]
701370        (10, 100]
701371        (10, 100]
701372        (10, 100]
701373        (10, 100]
701374        (10, 100]
701375        (10, 100]
701376    (1000, 10000]
701377        (10, 100]
701378        (10, 100]
701379      (100, 1000]
701380    (1000, 10000]
701381        (10, 100]
701382      (100, 1000]
701383          (1, 10]
701384        (10, 100]
701385      (100, 1000]
Name: contb_receipt_amt, dtype: category
Categories (8, object): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]
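
As a minimal sketch of how cut assigns values to buckets, the example below uses a handful of made-up amounts (hypothetical values, not taken from the FEC data set). It assumes pandas' default right-closed intervals, so a value that sits exactly on a bin edge falls into the lower bucket.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts for illustration only (not from the FEC data)
amounts = pd.Series([0.5, 5.0, 10.0, 250.0, 2500.0])
bins = np.array([0, 1, 10, 100, 1000, 10000])

# Intervals are right-closed by default: (0, 1], (1, 10], ...
# so 10.0 lands in (1, 10], not in (10, 100]
labels = pd.cut(amounts, bins)

# Integer position of each value's bucket within the ordered categories
codes = labels.cat.codes.tolist()
print(labels)
print(codes)  # [0, 1, 1, 3, 4]
```

Values outside the outermost bin edges would come back as NaN, which is why the bins in the text span from 0 up to 10,000,000.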

Then group the data by candidate name and bucket label

grouped=fec_mrbo.groupby(['cand_nm',labels])
print(grouped.size().unstack(0))

Output:

cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                         493            77
(1, 10]                      40070          3681
(10, 100]                   372280         31853
(100, 1000]                 153991         43357
(1000, 10000]                22284         26186
(10000, 100000]                  2             1
(100000, 1000000]                3           NaN
(1000000, 10000000]              4           NaN
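
The grouping step can be tried on a tiny frame. This is a sketch under stated assumptions: the candidates "A" and "B", the column name amt, and the amounts are all made up. The observed=True and fill_value=0 arguments are additions not used in the original code. The first makes the handling of categorical group keys explicit on newer pandas versions, and the second shows zero counts instead of NaN.

```python
import pandas as pd

# Hypothetical donations for two made-up candidates
df = pd.DataFrame({
    "cand_nm": ["A", "A", "A", "B", "B"],
    "amt": [5.0, 50.0, 60.0, 500.0, 7.0],
})

# Bucket labels share the frame's index, so they can be used as a group key
labels = pd.cut(df["amt"], [0, 10, 100, 1000])

# Group by a column name and the label Series together, count the rows,
# then pivot the candidate level (level 0) into columns
counts = (
    df.groupby(["cand_nm", labels], observed=True)
      .size()
      .unstack(0, fill_value=0)
)
print(counts)
```

Each row of counts is one bucket and each column one candidate, the same layout as the table above.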

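The data show that Obama received far more small donations than Romney. You can also sum the contribution amounts and normalize within each bucket, so that each candidate's share of the total for every donation size can be plotted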

bucket_sums=grouped.contb_receipt_amt.sum().unstack(0)
print(bucket_sums)
normed_sums=bucket_sums.div(bucket_sums.sum(axis=1),axis=0)
print(normed_sums)
# Exclude the two largest buckets, since these are not donations by individuals
normed_sums[:-2].plot(kind='barh',stacked=True)

Output:

cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                      318.24         77.00
(1, 10]                  337267.62      29819.66
(10, 100]              20288981.41    1987783.76
(100, 1000]            54798531.46   22363381.69
(1000, 10000]          51753705.67   63942145.42
(10000, 100000]           59100.00      12700.00
(100000, 1000000]       1490683.08           NaN
(1000000, 10000000]     7148839.76           NaN
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                    0.805182      0.194818
(1, 10]                   0.918767      0.081233
(10, 100]                 0.910769      0.089231
(100, 1000]               0.710176      0.289824
(1000, 10000]             0.447326      0.552674
(10000, 100000]           0.823120      0.176880
(100000, 1000000]         1.000000           NaN
(1000000, 10000000]       1.000000           NaN
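
The row-wise normalization can be checked by hand on a small table. The sketch below uses made-up sums (the labels "small", "large" and "huge" and the candidates "A" and "B" are hypothetical). It also shows why a bucket with a missing value for one candidate reports 1.0 for the other: sum skips NaN by default, so the row total equals the one value that is present.

```python
import numpy as np
import pandas as pd

# Hypothetical per-bucket sums for two made-up candidates
bucket_sums = pd.DataFrame(
    {"A": [30.0, 100.0, 5.0], "B": [10.0, 300.0, np.nan]},
    index=["small", "large", "huge"],
)

# Row totals; NaN is skipped, so "huge" totals 5.0
row_totals = bucket_sums.sum(axis=1)

# axis=0 aligns row_totals with the row index, dividing each row by its own total
normed = bucket_sums.div(row_totals, axis=0)
print(normed)
```

Here "small" becomes 0.75 and 0.25, "large" becomes 0.25 and 0.75, and "huge" becomes 1.0 and NaN.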


14. Donation statistics by state

First, aggregate the data by candidate and state

grouped=fec_mrbo.groupby(['cand_nm','contbr_st'])
totals=grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
# Keep only states whose combined total exceeds 100,000
totals=totals[totals.sum(1)>100000]
print(totals[:10])

Output:

cand_nm    Obama, Barack  Romney, Mitt
contbr_st                             
AK             281840.15      86204.24
AL             543123.48     527303.51
AR             359247.28     105556.00
AZ            1506476.98    1888436.23
CA           23824984.24   11237636.60
CO            2132429.49    1506714.12
CT            2068291.26    3499475.45
DC            4373538.80    1025137.50
DE             336669.14      82712.00
FL            7318178.58    8338458.81
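
The aggregate, pivot and filter chain can be followed on a few made-up rows. In this sketch the candidates, states, amounts, and the threshold of 10 are all hypothetical, and amt stands in for contb_receipt_amt.

```python
import pandas as pd

# Hypothetical donations for illustration only
df = pd.DataFrame({
    "cand_nm": ["A", "A", "B", "B", "A"],
    "contbr_st": ["CA", "NY", "CA", "CA", "TX"],
    "amt": [100.0, 20.0, 50.0, 25.0, 5.0],
})

# Sum per (candidate, state), move candidates into columns,
# and treat missing candidate/state combinations as zero
totals = (
    df.groupby(["cand_nm", "contbr_st"])["amt"]
      .sum()
      .unstack(0)
      .fillna(0)
)

# Boolean mask on the row totals: keep states above the (made-up) threshold
totals = totals[totals.sum(axis=1) > 10]
print(totals)
```

TX is dropped because its combined total of 5.0 is below the threshold, while NY is kept with a zero for candidate B.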

Dividing each row by its total contribution amount gives each candidate's share of the total donations in each state

percent=totals.div(totals.sum(1),axis=0)
print(percent[:10])

Output:

cand_nm    Obama, Barack  Romney, Mitt
contbr_st                             
AK              0.765778      0.234222
AL              0.507390      0.492610
AR              0.772902      0.227098
AZ              0.443745      0.556255
CA              0.679498      0.320502
CO              0.585970      0.414030
CT              0.371476      0.628524
DC              0.810113      0.189887
DE              0.802776      0.197224
FL              0.467417      0.532583

