目录
数据来源:
根据职业和雇主统计赞助信息
对出资额分组
根据州统计赞助信息
https://github.com/wesm/pydata-book
>>> fec = pd.read_csv('D:\python\DataAnalysis\data\P00000001-ALL.csv',low_memory = False)
>>> fec
cmte_id cand_id ... form_tp file_num
0 C00410118 P20002978 ... SA17A 736166
1 C00410118 P20002978 ... SA17A 736166
2 C00410118 P20002978 ... SA17A 749073
3 C00410118 P20002978 ... SA17A 749073
4 C00410118 P20002978 ... SA17A 736166
5 C00410118 P20002978 ... SA17A 736166
6 C00410118 P20002978 ... SA17A 736166
7 C00410118 P20002978 ... SA17A 749073
8 C00410118 P20002978 ... SA17A 736166
9 C00410118 P20002978 ... SA17A 736166
10 C00410118 P20002978 ... SA17A 736166
11 C00410118 P20002978 ... SA17A 736166
12 C00410118 P20002978 ... SA17A 736166
13 C00410118 P20002978 ... SA17A 736166
14 C00410118 P20002978 ... SA17A 749073
15 C00410118 P20002978 ... SA17A 749073
16 C00410118 P20002978 ... SA17A 736166
17 C00410118 P20002978 ... SA17A 749073
18 C00410118 P20002978 ... SA17A 736166
19 C00410118 P20002978 ... SA17A 736166
20 C00410118 P20002978 ... SA17A 736166
21 C00410118 P20002978 ... SA17A 736166
22 C00410118 P20002978 ... SA17A 736166
23 C00410118 P20002978 ... SA17A 736166
24 C00410118 P20002978 ... SA17A 749073
25 C00410118 P20002978 ... SA17A 749073
26 C00410118 P20002978 ... SA17A 749073
27 C00410118 P20002978 ... SA17A 749073
28 C00410118 P20002978 ... SA17A 749073
29 C00410118 P20002978 ... SA17A 749073
... ... ... ... ...
1001701 C00500587 P20003281 ... SA17A 772060
1001702 C00500587 P20003281 ... SA17A 772060
1001703 C00500587 P20003281 ... SA17A 772060
1001704 C00500587 P20003281 ... SA17A 772060
1001705 C00500587 P20003281 ... SA17A 772060
1001706 C00500587 P20003281 ... SA17A 772060
1001707 C00500587 P20003281 ... SA17A 772060
1001708 C00500587 P20003281 ... SA17A 772060
1001709 C00500587 P20003281 ... SA17A 772060
1001710 C00500587 P20003281 ... SB28A 779144
1001711 C00500587 P20003281 ... SA17A 751678
1001712 C00500587 P20003281 ... SA17A 751678
1001713 C00500587 P20003281 ... SA17A 751678
1001714 C00500587 P20003281 ... SA17A 751678
1001715 C00500587 P20003281 ... SA17A 751678
1001716 C00500587 P20003281 ... SA17A 751678
1001717 C00500587 P20003281 ... SA17A 772060
1001718 C00500587 P20003281 ... SA17A 772060
1001719 C00500587 P20003281 ... SA17A 772060
1001720 C00500587 P20003281 ... SA17A 772060
1001721 C00500587 P20003281 ... SA17A 772060
1001722 C00500587 P20003281 ... SA17A 772060
1001723 C00500587 P20003281 ... SA17A 772060
1001724 C00500587 P20003281 ... SA17A 751678
1001725 C00500587 P20003281 ... SA17A 751678
1001726 C00500587 P20003281 ... SA17A 751678
1001727 C00500587 P20003281 ... SA17A 751678
1001728 C00500587 P20003281 ... SA17A 751678
1001729 C00500587 P20003281 ... SA17A 751678
1001730 C00500587 P20003281 ... SA17A 751678
[1001731 rows x 16 columns]
>>> fec.ix[123456]
cmte_id C00431445
cand_id P80003338
cand_nm Obama, Barack
contbr_nm ELLMAN, IRA
contbr_city TEMPE
contbr_st AZ
contbr_zip 852816719
contbr_employer ARIZONA STATE UNIVERSITY
contbr_occupation PROFESSOR
contb_receipt_amt 50
contb_receipt_dt 01-DEC-11
receipt_desc NaN
memo_cd NaN
memo_text NaN
form_tp SA17A
file_num 772372
Name: 123456, dtype: object
所有候选人名单
>>> unique_cands = fec.cand_nm.unique()
>>> unique_cands
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
"Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
'Huntsman, Jon', 'Perry, Rick'], dtype=object)
利用字典说明党派人关系
>>> parties = {'Bachmann, Michelle':'Republican','Cain, Herman':'Republican','Gingrich, Newt':'Republican','Huntsman, Jon':'Republican','Johnson, Gary Earl':'Republican','McCotter, Thaddeus G':'Republican','Obama, Barack':'Democrat','Paul, Ron':'Republican','Pawlenty, Timothy':'Republican','Perry, Rick':'Republican',"Roemer, Charles E. 'Buddy' III":'Republican','Romney, Mitt':'Republican','Santorum, Rick':'Republican'}
>>> fec.cand_nm[123456:123461].map(parties)
123456 Democrat
123457 Democrat
123458 Democrat
123459 Democrat
123460 Democrat
Name: cand_nm, dtype: object
简单统计
>>> fec['party'] = fec.cand_nm.map(parties)
>>> fec['party'].value_counts()
Democrat 593746
Republican 407985
Name: party, dtype: int64
查看赞助信息
>>> (fec.contb_receipt_amt > 0).value_counts()
True 991475
False 10256
Name: contb_receipt_amt, dtype: int64
但是该数据既包括退款也包括赞助,下面简化数据。
>>> fec = fec[fec.contb_receipt_amt > 0]
下面为针对两位主要候选人的数据
>>> fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack','Romney, Mitt'])]
基于职业的赞助信息统计是另一种经常被研究的统计任务。例如律师们喜欢民主党,企业主喜欢共和党,看数据
>>> fec.contbr_occupation.value_counts()[:10]
RETIRED 233990
INFORMATION REQUESTED 35107
ATTORNEY 34286
HOMEMAKER 29931
PHYSICIAN 23432
INFORMATION REQUESTED PER BEST EFFORTS 21138
ENGINEER 14334
TEACHER 13990
CONSULTANT 13273
PROFESSOR 12555
Name: contbr_occupation, dtype: int64
整合相同职业,并允许没有映射关系的职业通过
>>> occ_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED','INFORMATION REQUESTED':'NOT PROVIDED','INFORMATION REQUESTED (BEST EFFORTS)':'NOT PROVIDED','C.E.O':'CEO'}
>>> f = lambda x : occ_mapping.get(x,x)
>>> fec.contbr_occupation = fec.contbr_occupation.map(f)
对雇主信息进行同样处理
>>> emp_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED','INFORMATION REQUESTED':'NOT PROVIDED','SELF':'SELF-EMPLOYED','SELF EMPLOYED':'SELF-EMPLOYED',}
>>> f = lambda x: emp_mapping.get(x,x)
>>> fec.contbr_employer = fec.contbr_employer.map(f)
根据党派和职业聚合数据,然后过滤掉总出资额不足200万美元的数据:
>>> by_occupation = fec.pivot_table('contb_receipt_amt',index='contbr_occupation',columns='party',aggfunc='sum')
>>> over_2mm = by_occupation[by_occupation.sum(1) > 2000000]
>>> over_2mm
party Democrat Republican
contbr_occupation
ATTORNEY 11141982.97 7.477194e+06
C.E.O. 1690.00 2.592983e+06
CEO 2074284.79 1.640758e+06
CONSULTANT 2459912.71 2.544725e+06
ENGINEER 951525.55 1.818374e+06
EXECUTIVE 1355161.05 4.138850e+06
HOMEMAKER 4248875.80 1.363428e+07
INVESTOR 884133.00 2.431769e+06
LAWYER 3160478.87 3.912243e+05
MANAGER 762883.22 1.444532e+06
NOT PROVIDED 4866973.96 2.056547e+07
OWNER 1001567.36 2.408287e+06
PHYSICIAN 3735124.94 3.594320e+06
PRESIDENT 1878509.95 4.720924e+06
PROFESSOR 2165071.08 2.967027e+05
REAL ESTATE 528902.09 1.625902e+06
RETIRED 25305116.38 2.356124e+07
SELF-EMPLOYED 672393.40 1.640253e+06
做成水平柱状图
>>> over_2mm.plot(kind='barh')
下面列出对两位候选人出资最高的职业和企业。
然后根据职业和雇主进行聚合
def get_top_amounts(group,key,n=5):
totals = group.groupby(key)['contb_receipt_amt'].sum()
return totals.sort_values(ascending=False)[n:]
grouped = fec_mrbo.groupby('cand_nm')
grouped.apply(get_top_amounts,'contbr_occupation',n=7)
cand_nm contbr_occupation
Obama, Barack PROFESSOR 2165071.08
CEO 2074284.79
PRESIDENT 1878509.95
NOT EMPLOYED 1709188.20
EXECUTIVE 1355161.05
TEACHER 1250969.15
WRITER 1084188.88
OWNER 1001567.36
ENGINEER 951525.55
INVESTOR 884133.00
ARTIST 763125.00
MANAGER 762883.22
SELF-EMPLOYED 672393.40
STUDENT 628099.75
REAL ESTATE 528902.09
CHAIRMAN 496547.00
ARCHITECT 483859.89
DIRECTOR 471741.73
BUSINESS OWNER 449979.30
EDUCATOR 436600.89
PSYCHOLOGIST 427299.92
SOFTWARE ENGINEER 396985.65
PARTNER 395759.50
SALES 392886.91
EXECUTIVE DIRECTOR 348180.94
MANAGING DIRECTOR 329688.25
SOCIAL WORKER 326844.43
VICE PRESIDENT 325647.15
ADMINISTRATOR 323079.26
SCIENTIST 319227.88
...
Romney, Mitt NON-PROFIT VETERANS ORG. CHAIR/ANNUITA 10.00
CHRMN/CEO 10.00
LEAN SIX SIGMA PRACTIONER 10.00
PRESIDENT EMERITUS 10.00
SIGN CONTRACTOR 10.00
JUDGE ADVOCATE ATTORNEY 10.00
SALES SUPPORT 10.00
CONTRACTS SPECIALIST 9.00
TEACHER & FREE-LANCE JOURNALIST 9.00
FOUNDATION CONSULTANT 6.00
ELAYNE WELLS HARMER 6.00
MAIL HANDLER 6.00
SECRETARY/BOOKKEPPER 6.00
TREASURER & DIRECTOR OF FINANCE 6.00
DISTRICT REPRESENTATIVE 5.00
HORTICULTURIST 5.00
FINANCIAL INSTITUTION - CEO 5.00
PLANNING AND OPERATIONS ANALYST 5.00
MD - UROLOGIST 5.00
VILLA NOVA 5.00
SCOTT GREENBAUM 5.00
ENGINEER/RISK EXPERT 5.00
EDUCATION ADMIN 5.00
DIRECTOR REISCHAUER CENTER FOR EAST A 5.00
CHICKEN GRADER 5.00
IFC CONTRACTING SOLUTIONS 3.00
INDEPENDENT PROFESSIONAL 3.00
AFFORDABLE REAL ESTATE DEVELOPER 3.00
REMODELER & SEMI RETIRED 3.00
3RD GENERATION FAMILY BUSINESS OWNER 3.00
Name: contb_receipt_amt, Length: 35973, dtype: float64
>>> grouped.apply(get_top_amounts,'contbr_employer',n=10)
cand_nm contbr_employer
Obama, Barack DLA PIPER 148235.00
HARVARD UNIVERSITY 131368.94
IBM 128490.93
GOOGLE 125302.88
MICROSOFT CORPORATION 108849.00
KAISER PERMANENTE 104949.95
JONES DAY 103712.50
STANFORD UNIVERSITY 101630.75
COLUMBIA UNIVERSITY 96325.12
UNIVERSITY OF CHICAGO 88575.00
AT&T 88132.12
US GOVERNMENT 87689.00
MORGAN & MORGAN 87250.00
VERIZON 85318.30
UNIVERSITY OF MICHIGAN 84856.33
DISABLED 78417.87
UCLA 78092.50
ARNOLD & PORTER LLP 76330.00
OBAMA FOR AMERICA 72028.89
UNIVERSITY OF WASHINGTON 70445.86
NORTHWESTERN UNIVERSITY 69489.05
DEPARTMENT OF DEFENSE 67253.40
COMCAST 65158.00
US ARMY 64768.91
FEDERAL GOVERNMENT 64590.26
WELLS FARGO 62749.60
UNIVERSITY OF CALIFORNIA 62432.00
SKADDEN ARPS 61904.00
DEBEVOISE & PLIMPTON LLP 61268.00
MAYER BROWN LLP 60315.40
...
Romney, Mitt U.S. POST OFFICE 6.00
ENERGY ALLOYS 6.00
VILLA NOVA FINANCING GROUP LLC 5.00
LEAVITT INSURANCE AGENCY 5.00
SAIS/JOHNS HOPKINS UNIVERSITY 5.00
EASTHAM CAPITAL 5.00
PAULA HAWKINS & ASSOCIATES 5.00
APPLIANCE INSTALLATIONS INC. 5.00
DIRECT LENDERS LLC 5.00
PEACE FROGS INC. 5.00
INTERSTELLAR HOLDINGS LLC 5.00
GOLDIE'S SALON 5.00
PICKET FENCES INC. 5.00
R. A. RAUCH & ASSOCIATES 5.00
PLUM HEALTHCARE 5.00
GREGORY GALLIVAN 5.00
LEGACY SCHOOL 5.00
AA FLIPPEN ASSOCIATES 5.00
PACIFIC BIOSCIENCES 5.00
CA STATE SENATE 5.00
SCOTT GREENBAUM 5.00
RST GLOBAL LICENSING LLC 5.00
MORGAN STANLEY SMITH BARNEY LLC 5.00
LOUGH INVESTMENT ADVISORY LLC 4.00
HONOLD COMMUNICTAIONS 3.00
WILL MERRIFIELD 3.00
UN 3.00
UPTOWN CHEAPSKATE 3.00
INDEPENDENT PROFESSIONAL 3.00
WATERWORKS INDUSRTIES 3.00
Name: contb_receipt_amt, Length: 95888, dtype: float64
>>> bins = np.array([0,1,10,100,1000,10000,100000,1000000,10000000])
>>> labels = pd.cut(fec_mrbo.contb_receipt_amt,bins)
>>> labels
411 (10, 100]
412 (100, 1000]
413 (100, 1000]
414 (10, 100]
415 (10, 100]
416 (10, 100]
417 (100, 1000]
418 (10, 100]
419 (100, 1000]
420 (10, 100]
421 (10, 100]
422 (100, 1000]
423 (100, 1000]
424 (100, 1000]
425 (100, 1000]
426 (100, 1000]
427 (1000, 10000]
428 (100, 1000]
429 (100, 1000]
430 (10, 100]
431 (1000, 10000]
432 (100, 1000]
433 (100, 1000]
434 (100, 1000]
435 (100, 1000]
436 (100, 1000]
437 (10, 100]
438 (100, 1000]
439 (100, 1000]
440 (10, 100]
...
701356 (10, 100]
701357 (1, 10]
701358 (10, 100]
701359 (10, 100]
701360 (10, 100]
701361 (10, 100]
701362 (100, 1000]
701363 (10, 100]
701364 (10, 100]
701365 (10, 100]
701366 (10, 100]
701367 (10, 100]
701368 (100, 1000]
701369 (10, 100]
701370 (10, 100]
701371 (10, 100]
701372 (10, 100]
701373 (10, 100]
701374 (10, 100]
701375 (10, 100]
701376 (1000, 10000]
701377 (10, 100]
701378 (10, 100]
701379 (100, 1000]
701380 (1000, 10000]
701381 (10, 100]
701382 (100, 1000]
701383 (1, 10]
701384 (10, 100]
701385 (100, 1000]
Name: contb_receipt_amt, Length: 694282, dtype: category
Categories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] <
(10000, 100000] < (100000, 1000000] < (1000000, 10000000]]
然后根据候选人姓名以及面元标签对数据进行分组:
在小额赞助方面,奥巴马比较多
>>> grouped = fec_mrbo.groupby(['cand_nm',labels])
>>> grouped.size().unstack(0)
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 493.0 77.0
(1, 10] 40070.0 3681.0
(10, 100] 372280.0 31853.0
(100, 1000] 153991.0 43357.0
(1000, 10000] 22284.0 26186.0
(10000, 100000] 2.0 1.0
(100000, 1000000] 3.0 NaN
(1000000, 10000000] 4.0 NaN
对出资额求和并在面元内规格化。
>>> bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
>>> bucket_sums
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 318.24 77.00
(1, 10] 337267.62 29819.66
(10, 100] 20288981.41 1987783.76
(100, 1000] 54798531.46 22363381.69
(1000, 10000] 51753705.67 63942145.42
(10000, 100000] 59100.00 12700.00
(100000, 1000000] 1490683.08 NaN
(1000000, 10000000] 7148839.76 NaN
赞助额比例
>>> normed_sums = bucket_sums.div(bucket_sums.sum(axis=1),axis=0)
>>> normed_sums
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 0.805182 0.194818
(1, 10] 0.918767 0.081233
(10, 100] 0.910769 0.089231
(100, 1000] 0.710176 0.289824
(1000, 10000] 0.447326 0.552674
(10000, 100000] 0.823120 0.176880
(100000, 1000000] 1.000000 NaN
(1000000, 10000000] 1.000000 NaN
normed_sums[:-2].plot(kind='barh',stacked=True)
>>> grouped = fec_mrbo.groupby(['cand_nm','contbr_st'])
>>> totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
>>> totals = totals[totals.sum(1)>100000]
>>> totals[:10]
cand_nm Obama, Barack Romney, Mitt
contbr_st
AK 281840.15 86204.24
AL 543123.48 527303.51
AR 359247.28 105556.00
AZ 1506476.98 1888436.23
CA 23824984.24 11237636.60
CO 2132429.49 1506714.12
CT 2068291.26 3499475.45
DC 4373538.80 1025137.50
DE 336669.14 82712.00
FL 7318178.58 8338458.81
对各行除以总赞助额,就会得到各候选人在各州的总赞助额比例:
>>> percent = totals.div(totals.sum(1),axis=0)
>>> percent[:10]
cand_nm Obama, Barack Romney, Mitt
contbr_st
AK 0.765778 0.234222
AL 0.507390 0.492610
AR 0.772902 0.227098
AZ 0.443745 0.556255
CA 0.679498 0.320502
CO 0.585970 0.414030
CT 0.371476 0.628524
DC 0.810113 0.189887
DE 0.802776 0.197224
FL 0.467417 0.532583