python数据聚合分组实战-----2012联邦选举委员会数据库

目录

数据来源:

根据职业和雇主统计赞助信息

对出资额分组

根据州统计赞助信息


数据来源:

https://github.com/wesm/pydata-book

>>> fec = pd.read_csv('D:\python\DataAnalysis\data\P00000001-ALL.csv',low_memory = False)
>>> fec
           cmte_id    cand_id   ...    form_tp file_num
0        C00410118  P20002978   ...      SA17A   736166
1        C00410118  P20002978   ...      SA17A   736166
2        C00410118  P20002978   ...      SA17A   749073
3        C00410118  P20002978   ...      SA17A   749073
4        C00410118  P20002978   ...      SA17A   736166
5        C00410118  P20002978   ...      SA17A   736166
6        C00410118  P20002978   ...      SA17A   736166
7        C00410118  P20002978   ...      SA17A   749073
8        C00410118  P20002978   ...      SA17A   736166
9        C00410118  P20002978   ...      SA17A   736166
10       C00410118  P20002978   ...      SA17A   736166
11       C00410118  P20002978   ...      SA17A   736166
12       C00410118  P20002978   ...      SA17A   736166
13       C00410118  P20002978   ...      SA17A   736166
14       C00410118  P20002978   ...      SA17A   749073
15       C00410118  P20002978   ...      SA17A   749073
16       C00410118  P20002978   ...      SA17A   736166
17       C00410118  P20002978   ...      SA17A   749073
18       C00410118  P20002978   ...      SA17A   736166
19       C00410118  P20002978   ...      SA17A   736166
20       C00410118  P20002978   ...      SA17A   736166
21       C00410118  P20002978   ...      SA17A   736166
22       C00410118  P20002978   ...      SA17A   736166
23       C00410118  P20002978   ...      SA17A   736166
24       C00410118  P20002978   ...      SA17A   749073
25       C00410118  P20002978   ...      SA17A   749073
26       C00410118  P20002978   ...      SA17A   749073
27       C00410118  P20002978   ...      SA17A   749073
28       C00410118  P20002978   ...      SA17A   749073
29       C00410118  P20002978   ...      SA17A   749073
            ...        ...   ...        ...      ...
1001701  C00500587  P20003281   ...      SA17A   772060
1001702  C00500587  P20003281   ...      SA17A   772060
1001703  C00500587  P20003281   ...      SA17A   772060
1001704  C00500587  P20003281   ...      SA17A   772060
1001705  C00500587  P20003281   ...      SA17A   772060
1001706  C00500587  P20003281   ...      SA17A   772060
1001707  C00500587  P20003281   ...      SA17A   772060
1001708  C00500587  P20003281   ...      SA17A   772060
1001709  C00500587  P20003281   ...      SA17A   772060
1001710  C00500587  P20003281   ...      SB28A   779144
1001711  C00500587  P20003281   ...      SA17A   751678
1001712  C00500587  P20003281   ...      SA17A   751678
1001713  C00500587  P20003281   ...      SA17A   751678
1001714  C00500587  P20003281   ...      SA17A   751678
1001715  C00500587  P20003281   ...      SA17A   751678
1001716  C00500587  P20003281   ...      SA17A   751678
1001717  C00500587  P20003281   ...      SA17A   772060
1001718  C00500587  P20003281   ...      SA17A   772060
1001719  C00500587  P20003281   ...      SA17A   772060
1001720  C00500587  P20003281   ...      SA17A   772060
1001721  C00500587  P20003281   ...      SA17A   772060
1001722  C00500587  P20003281   ...      SA17A   772060
1001723  C00500587  P20003281   ...      SA17A   772060
1001724  C00500587  P20003281   ...      SA17A   751678
1001725  C00500587  P20003281   ...      SA17A   751678
1001726  C00500587  P20003281   ...      SA17A   751678
1001727  C00500587  P20003281   ...      SA17A   751678
1001728  C00500587  P20003281   ...      SA17A   751678
1001729  C00500587  P20003281   ...      SA17A   751678
1001730  C00500587  P20003281   ...      SA17A   751678

[1001731 rows x 16 columns]

 

>>> fec.ix[123456]
cmte_id                             C00431445
cand_id                             P80003338
cand_nm                         Obama, Barack
contbr_nm                         ELLMAN, IRA
contbr_city                             TEMPE
contbr_st                                  AZ
contbr_zip                          852816719
contbr_employer      ARIZONA STATE UNIVERSITY
contbr_occupation                   PROFESSOR
contb_receipt_amt                          50
contb_receipt_dt                    01-DEC-11
receipt_desc                              NaN
memo_cd                                   NaN
memo_text                                 NaN
form_tp                                 SA17A
file_num                               772372
Name: 123456, dtype: object

所有候选人名单

>>> unique_cands = fec.cand_nm.unique()
>>> unique_cands
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
       'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
       'Huntsman, Jon', 'Perry, Rick'], dtype=object)

利用字典说明党派人关系

>>> parties = {'Bachmann, Michelle':'Republican','Cain, Herman':'Republican','Gingrich, Newt':'Republican','Huntsman, Jon':'Republican','Johnson, Gary Earl':'Republican','McCotter, Thaddeus G':'Republican','Obama, Barack':'Democrat','Paul, Ron':'Republican','Pawlenty, Timothy':'Republican','Perry, Rick':'Republican',"Roemer, Charles E. 'Buddy' III":'Republican','Romney, Mitt':'Republican','Santorum, Rick':'Republican'}
>>> fec.cand_nm[123456:123461].map(parties)
123456    Democrat
123457    Democrat
123458    Democrat
123459    Democrat
123460    Democrat
Name: cand_nm, dtype: object

 简单统计

>>> fec['party'] = fec.cand_nm.map(parties)
>>> fec['party'].value_counts()
Democrat      593746
Republican    407985
Name: party, dtype: int64

查看赞助信息

>>> (fec.contb_receipt_amt > 0).value_counts()
True     991475
False     10256
Name: contb_receipt_amt, dtype: int64

但是该数据既包括退款也包括赞助,下面简化数据。

>>> fec = fec[fec.contb_receipt_amt > 0]

下面为针对两位主要候选人的数据

>>> fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack','Romney, Mitt'])]

根据职业和雇主统计赞助信息

基于职业的赞助信息统计是另一种经常被研究的统计任务。例如律师们喜欢民主党,企业主喜欢共和党,看数据

>>> fec.contbr_occupation.value_counts()[:10]
RETIRED                                   233990
INFORMATION REQUESTED                      35107
ATTORNEY                                   34286
HOMEMAKER                                  29931
PHYSICIAN                                  23432
INFORMATION REQUESTED PER BEST EFFORTS     21138
ENGINEER                                   14334
TEACHER                                    13990
CONSULTANT                                 13273
PROFESSOR                                  12555
Name: contbr_occupation, dtype: int64

整合相同职业,并允许没有映射关系的职业通过

>>> occ_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED','INFORMATION REQUESTED':'NOT PROVIDED','INFORMATION REQUESTED (BEST EFFORTS)':'NOT PROVIDED','C.E.O':'CEO'}
>>> f = lambda x : occ_mapping.get(x,x)
>>> fec.contbr_occupation = fec.contbr_occupation.map(f)

对雇主信息进行同样处理

>>> emp_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED','INFORMATION REQUESTED':'NOT PROVIDED','SELF':'SELF-EMPLOYED','SELF EMPLOYED':'SELF-EMPLOYED',}
>>> f = lambda x: emp_mapping.get(x,x)
>>> fec.contbr_employer = fec.contbr_employer.map(f)

根据党派和职业聚合数据,然后过滤掉总出资额不足200万美元的数据:

>>> by_occupation = fec.pivot_table('contb_receipt_amt',index='contbr_occupation',columns='party',aggfunc='sum')
>>> over_2mm = by_occupation[by_occupation.sum(1) > 2000000]
>>> over_2mm
party                 Democrat    Republican
contbr_occupation                           
ATTORNEY           11141982.97  7.477194e+06
C.E.O.                 1690.00  2.592983e+06
CEO                 2074284.79  1.640758e+06
CONSULTANT          2459912.71  2.544725e+06
ENGINEER             951525.55  1.818374e+06
EXECUTIVE           1355161.05  4.138850e+06
HOMEMAKER           4248875.80  1.363428e+07
INVESTOR             884133.00  2.431769e+06
LAWYER              3160478.87  3.912243e+05
MANAGER              762883.22  1.444532e+06
NOT PROVIDED        4866973.96  2.056547e+07
OWNER               1001567.36  2.408287e+06
PHYSICIAN           3735124.94  3.594320e+06
PRESIDENT           1878509.95  4.720924e+06
PROFESSOR           2165071.08  2.967027e+05
REAL ESTATE          528902.09  1.625902e+06
RETIRED            25305116.38  2.356124e+07
SELF-EMPLOYED        672393.40  1.640253e+06

做成水平柱状图

>>> over_2mm.plot(kind='barh')

python数据聚合分组实战-----2012联邦选举委员会数据库_第1张图片

下面列出对两位候选人出资最高的职业和企业。

然后根据职业和雇主进行聚合

def get_top_amounts(group,key,n=5):
     totals = group.groupby(key)['contb_receipt_amt'].sum()
     return totals.sort_values(ascending=False)[n:]
 grouped = fec_mrbo.groupby('cand_nm')
grouped.apply(get_top_amounts,'contbr_occupation',n=7)
cand_nm        contbr_occupation                     
Obama, Barack  PROFESSOR                                 2165071.08
               CEO                                       2074284.79
               PRESIDENT                                 1878509.95
               NOT EMPLOYED                              1709188.20
               EXECUTIVE                                 1355161.05
               TEACHER                                   1250969.15
               WRITER                                    1084188.88
               OWNER                                     1001567.36
               ENGINEER                                   951525.55
               INVESTOR                                   884133.00
               ARTIST                                     763125.00
               MANAGER                                    762883.22
               SELF-EMPLOYED                              672393.40
               STUDENT                                    628099.75
               REAL ESTATE                                528902.09
               CHAIRMAN                                   496547.00
               ARCHITECT                                  483859.89
               DIRECTOR                                   471741.73
               BUSINESS OWNER                             449979.30
               EDUCATOR                                   436600.89
               PSYCHOLOGIST                               427299.92
               SOFTWARE ENGINEER                          396985.65
               PARTNER                                    395759.50
               SALES                                      392886.91
               EXECUTIVE DIRECTOR                         348180.94
               MANAGING DIRECTOR                          329688.25
               SOCIAL WORKER                              326844.43
               VICE PRESIDENT                             325647.15
               ADMINISTRATOR                              323079.26
               SCIENTIST                                  319227.88
                                                            ...    
Romney, Mitt   NON-PROFIT VETERANS ORG. CHAIR/ANNUITA         10.00
               CHRMN/CEO                                      10.00
               LEAN SIX SIGMA PRACTIONER                      10.00
               PRESIDENT EMERITUS                             10.00
               SIGN CONTRACTOR                                10.00
               JUDGE ADVOCATE ATTORNEY                        10.00
               SALES SUPPORT                                  10.00
               CONTRACTS SPECIALIST                            9.00
               TEACHER & FREE-LANCE JOURNALIST                 9.00
               FOUNDATION CONSULTANT                           6.00
               ELAYNE WELLS HARMER                             6.00
               MAIL HANDLER                                    6.00
               SECRETARY/BOOKKEPPER                            6.00
               TREASURER & DIRECTOR OF FINANCE                 6.00
               DISTRICT REPRESENTATIVE                         5.00
               HORTICULTURIST                                  5.00
               FINANCIAL INSTITUTION - CEO                     5.00
               PLANNING AND OPERATIONS ANALYST                 5.00
               MD - UROLOGIST                                  5.00
               VILLA NOVA                                      5.00
               SCOTT GREENBAUM                                 5.00
               ENGINEER/RISK EXPERT                            5.00
               EDUCATION ADMIN                                 5.00
               DIRECTOR REISCHAUER CENTER FOR EAST A           5.00
               CHICKEN GRADER                                  5.00
               IFC CONTRACTING SOLUTIONS                       3.00
               INDEPENDENT PROFESSIONAL                        3.00
               AFFORDABLE REAL ESTATE DEVELOPER                3.00
               REMODELER & SEMI RETIRED                        3.00
               3RD GENERATION FAMILY BUSINESS OWNER            3.00
Name: contb_receipt_amt, Length: 35973, dtype: float64


>>> grouped.apply(get_top_amounts,'contbr_employer',n=10)
cand_nm        contbr_employer                
Obama, Barack  DLA PIPER                          148235.00
               HARVARD UNIVERSITY                 131368.94
               IBM                                128490.93
               GOOGLE                             125302.88
               MICROSOFT CORPORATION              108849.00
               KAISER PERMANENTE                  104949.95
               JONES DAY                          103712.50
               STANFORD UNIVERSITY                101630.75
               COLUMBIA UNIVERSITY                 96325.12
               UNIVERSITY OF CHICAGO               88575.00
               AT&T                                88132.12
               US GOVERNMENT                       87689.00
               MORGAN & MORGAN                     87250.00
               VERIZON                             85318.30
               UNIVERSITY OF MICHIGAN              84856.33
               DISABLED                            78417.87
               UCLA                                78092.50
               ARNOLD & PORTER LLP                 76330.00
               OBAMA FOR AMERICA                   72028.89
               UNIVERSITY OF WASHINGTON            70445.86
               NORTHWESTERN UNIVERSITY             69489.05
               DEPARTMENT OF DEFENSE               67253.40
               COMCAST                             65158.00
               US ARMY                             64768.91
               FEDERAL GOVERNMENT                  64590.26
               WELLS FARGO                         62749.60
               UNIVERSITY OF CALIFORNIA            62432.00
               SKADDEN ARPS                        61904.00
               DEBEVOISE & PLIMPTON LLP            61268.00
               MAYER BROWN LLP                     60315.40
                                                    ...    
Romney, Mitt   U.S. POST OFFICE                        6.00
               ENERGY ALLOYS                           6.00
               VILLA NOVA FINANCING GROUP LLC          5.00
               LEAVITT INSURANCE AGENCY                5.00
               SAIS/JOHNS HOPKINS UNIVERSITY           5.00
               EASTHAM CAPITAL                         5.00
               PAULA HAWKINS & ASSOCIATES              5.00
               APPLIANCE INSTALLATIONS INC.            5.00
               DIRECT LENDERS LLC                      5.00
               PEACE FROGS INC.                        5.00
               INTERSTELLAR HOLDINGS LLC               5.00
               GOLDIE'S SALON                          5.00
               PICKET FENCES INC.                      5.00
               R. A. RAUCH & ASSOCIATES                5.00
               PLUM HEALTHCARE                         5.00
               GREGORY GALLIVAN                        5.00
               LEGACY SCHOOL                           5.00
               AA FLIPPEN ASSOCIATES                   5.00
               PACIFIC BIOSCIENCES                     5.00
               CA STATE SENATE                         5.00
               SCOTT GREENBAUM                         5.00
               RST GLOBAL LICENSING LLC                5.00
               MORGAN STANLEY SMITH BARNEY LLC         5.00
               LOUGH INVESTMENT ADVISORY LLC           4.00
               HONOLD COMMUNICTAIONS                   3.00
               WILL MERRIFIELD                         3.00
               UN                                      3.00
               UPTOWN CHEAPSKATE                       3.00
               INDEPENDENT PROFESSIONAL                3.00
               WATERWORKS INDUSRTIES                   3.00
Name: contb_receipt_amt, Length: 95888, dtype: float64

对出资额分组

>>> bins = np.array([0,1,10,100,1000,10000,100000,1000000,10000000])
>>> labels = pd.cut(fec_mrbo.contb_receipt_amt,bins)
>>> labels
411           (10, 100]
412         (100, 1000]
413         (100, 1000]
414           (10, 100]
415           (10, 100]
416           (10, 100]
417         (100, 1000]
418           (10, 100]
419         (100, 1000]
420           (10, 100]
421           (10, 100]
422         (100, 1000]
423         (100, 1000]
424         (100, 1000]
425         (100, 1000]
426         (100, 1000]
427       (1000, 10000]
428         (100, 1000]
429         (100, 1000]
430           (10, 100]
431       (1000, 10000]
432         (100, 1000]
433         (100, 1000]
434         (100, 1000]
435         (100, 1000]
436         (100, 1000]
437           (10, 100]
438         (100, 1000]
439         (100, 1000]
440           (10, 100]
              ...      
701356        (10, 100]
701357          (1, 10]
701358        (10, 100]
701359        (10, 100]
701360        (10, 100]
701361        (10, 100]
701362      (100, 1000]
701363        (10, 100]
701364        (10, 100]
701365        (10, 100]
701366        (10, 100]
701367        (10, 100]
701368      (100, 1000]
701369        (10, 100]
701370        (10, 100]
701371        (10, 100]
701372        (10, 100]
701373        (10, 100]
701374        (10, 100]
701375        (10, 100]
701376    (1000, 10000]
701377        (10, 100]
701378        (10, 100]
701379      (100, 1000]
701380    (1000, 10000]
701381        (10, 100]
701382      (100, 1000]
701383          (1, 10]
701384        (10, 100]
701385      (100, 1000]
Name: contb_receipt_amt, Length: 694282, dtype: category
Categories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] <
                                  (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]

然后根据候选人姓名以及面元标签对数据进行分组:

在小额赞助方面,奥巴马比较多

>>> grouped = fec_mrbo.groupby(['cand_nm',labels])
>>> grouped.size().unstack(0)
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                       493.0          77.0
(1, 10]                    40070.0        3681.0
(10, 100]                 372280.0       31853.0
(100, 1000]               153991.0       43357.0
(1000, 10000]              22284.0       26186.0
(10000, 100000]                2.0           1.0
(100000, 1000000]              3.0           NaN
(1000000, 10000000]            4.0           NaN

对出资额求和并在面元内规格化。

>>> bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
>>> bucket_sums
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                      318.24         77.00
(1, 10]                  337267.62      29819.66
(10, 100]              20288981.41    1987783.76
(100, 1000]            54798531.46   22363381.69
(1000, 10000]          51753705.67   63942145.42
(10000, 100000]           59100.00      12700.00
(100000, 1000000]       1490683.08           NaN
(1000000, 10000000]     7148839.76           NaN

赞助额比例

>>> normed_sums = bucket_sums.div(bucket_sums.sum(axis=1),axis=0)
>>> normed_sums
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                    0.805182      0.194818
(1, 10]                   0.918767      0.081233
(10, 100]                 0.910769      0.089231
(100, 1000]               0.710176      0.289824
(1000, 10000]             0.447326      0.552674
(10000, 100000]           0.823120      0.176880
(100000, 1000000]         1.000000           NaN
(1000000, 10000000]       1.000000           NaN
normed_sums[:-2].plot(kind='barh',stacked=True)

python数据聚合分组实战-----2012联邦选举委员会数据库_第2张图片

根据州统计赞助信息

>>> grouped = fec_mrbo.groupby(['cand_nm','contbr_st'])
>>> totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
>>> totals = totals[totals.sum(1)>100000]
>>> totals[:10]
cand_nm    Obama, Barack  Romney, Mitt
contbr_st                             
AK             281840.15      86204.24
AL             543123.48     527303.51
AR             359247.28     105556.00
AZ            1506476.98    1888436.23
CA           23824984.24   11237636.60
CO            2132429.49    1506714.12
CT            2068291.26    3499475.45
DC            4373538.80    1025137.50
DE             336669.14      82712.00
FL            7318178.58    8338458.81

对各行除以总赞助额,就会得到各候选人在各州的总赞助额比例:

>>> percent = totals.div(totals.sum(1),axis=0)
>>> percent[:10]
cand_nm    Obama, Barack  Romney, Mitt
contbr_st                             
AK              0.765778      0.234222
AL              0.507390      0.492610
AR              0.772902      0.227098
AZ              0.443745      0.556255
CA              0.679498      0.320502
CO              0.585970      0.414030
CT              0.371476      0.628524
DC              0.810113      0.189887
DE              0.802776      0.197224
FL              0.467417      0.532583

 

你可能感兴趣的:(数据清洗)