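Table of contents
Requirements:
1. Load the data and view its basic information
2. Extract the fields listed below and discard all other columns
3. Get an overview of the new data with df.info() and check for missing values
4. Summarize the numeric columns with descriptive statistics using df.describe()
5. Handle null values. Some fields are empty, possibly because the donor forgot to fill them in or wanted to keep them confidential; fill them with NOT PROVIDE
6. Handle outliers. Delete the rows whose donation amount is <= 0
7. Add a new column, party, holding each candidate's party
8. List the distinct values in the party column
9. Count how many times each value occurs in the party column
10. Compute the total political contributions (contb_receipt_amt) received by each party
11. Compute the total political contributions (contb_receipt_amt) received by each party on each day
12. Convert the dates in the table to the 'yyyy-mm-dd' format
13. Find out whom veterans (donor occupation DISABLED VETERAN) mainly support
14. For each candidate, find the occupation and the amount of the donor who made the largest donation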
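import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#1. Load the data and view its basic information
# on_bad_lines='skip' (pandas >= 1.3) drops malformed rows; older versions used error_bad_lines=False
df=pd.read_csv('./data/usa_election.txt',on_bad_lines='skip')
print(df.head())
# df.info() prints its report itself and returns None, which is why the output ends with "None"
print(df.info())
Output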
cmte_id cand_id cand_nm ... memo_text form_tp file_num
0 C00410118 P20002978 Bachmann, Michelle ... NaN SA17A 736166
1 C00410118 P20002978 Bachmann, Michelle ... NaN SA17A 736166
2 C00410118 P20002978 Bachmann, Michelle ... NaN SA17A 749073
3 C00410118 P20002978 Bachmann, Michelle ... NaN SA17A 749073
4 C00410118 P20002978 Bachmann, Michelle ... NaN SA17A 736166
[5 rows x 16 columns]
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cmte_id 536041 non-null object
1 cand_id 536041 non-null object
2 cand_nm 536041 non-null object
3 contbr_nm 536041 non-null object
4 contbr_city 536026 non-null object
5 contbr_st 536040 non-null object
6 contbr_zip 535973 non-null object
7 contbr_employer 525088 non-null object
8 contbr_occupation 530520 non-null object
9 contb_receipt_amt 536041 non-null float64
10 contb_receipt_dt 536041 non-null object
11 receipt_desc 8479 non-null object
12 memo_cd 49718 non-null object
13 memo_text 52740 non-null object
14 form_tp 536041 non-null object
15 file_num 536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB
None
Process finished with exit code 0
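The error_bad_lines argument was deprecated in pandas 1.3 and removed in 2.0, where on_bad_lines takes its place. A minimal sketch, assuming pandas 1.3 or later and a small made-up CSV (not the election file), of how a row with too many fields is skipped:

```python
import io

import pandas as pd

# Made-up sample data: the third line has one field too many
raw = "cand_nm,contb_receipt_amt\nA,10\nB,20,EXTRA\nC,30\n"

# on_bad_lines='skip' drops rows that have more fields than the header
sample = pd.read_csv(io.StringIO(raw), on_bad_lines='skip')
print(sample)
```

Skipping is silent, so it can be worth comparing the row count with the number of lines in the file when data loss matters.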
cand_nm: candidate name
contbr_nm: donor name
contbr_st: donor's state
contbr_employer: donor's employer
contbr_occupation: donor's occupation
contb_receipt_amt: donation amount (US dollars)
contb_receipt_dt: donation date
#2. Extract the fields listed above and discard all other columns
df=df[['cand_nm','contbr_nm','contbr_st','contbr_employer','contbr_occupation','contb_receipt_amt','contb_receipt_dt']]
print(df.head())
#3. Get an overview of the new data with df.info() and check for missing values
print(df.info())
Output
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cand_nm 536041 non-null object
1 contbr_nm 536041 non-null object
2 contbr_st 536040 non-null object
3 contbr_employer 525088 non-null object
4 contbr_occupation 530520 non-null object
5 contb_receipt_amt 536041 non-null float64
6 contb_receipt_dt 536041 non-null object
dtypes: float64(1), object(6)
memory usage: 28.6+ MB
None
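With df.info(), the number of missing values has to be worked out by subtracting each non-null count from the row count (for example, 536041 - 525088 = 10953 for contbr_employer). A minimal sketch, on a small made-up frame rather than the real data, of getting the missing count per column directly:

```python
import pandas as pd

# Made-up sample with a few missing values
sample = pd.DataFrame({
    'contbr_st': ['MI', None, 'CA'],
    'contbr_employer': ['RETIRED', None, None],
    'contb_receipt_amt': [25.0, 50.0, 250.0],
})

# isnull() marks missing cells; sum() counts them per column
missing = sample.isnull().sum()
print(missing)
```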
#4. Summarize the numeric columns with descriptive statistics using df.describe()
print(df.describe())
Output
contb_receipt_amt
count 5.360410e+05
mean 3.750373e+02
std 3.564436e+03
min -3.080000e+04
25% 5.000000e+01
50% 1.000000e+02
75% 2.500000e+02
max 1.944042e+06
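#5. Handle null values. Some fields are empty, possibly because the donor forgot to fill them in or wanted to keep them confidential; fill them with NOT PROVIDE
# Fill null values with NOT PROVIDE; assigning the result back avoids modifying a slice in place
df=df.fillna(value='NOT PROVIDE')
# Check again whether any column still has null values
print(df.info())
Output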
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cand_nm 536041 non-null object
1 contbr_nm 536041 non-null object
2 contbr_st 536041 non-null object
3 contbr_employer 536041 non-null object
4 contbr_occupation 536041 non-null object
5 contb_receipt_amt 536041 non-null float64
6 contb_receipt_dt 536041 non-null object
dtypes: float64(1), object(6)
memory usage: 28.6+ MB
None
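Here only text columns contain nulls, so a blanket fillna is harmless. If a numeric column had nulls too, the same call would put the string into it. A minimal sketch, using a made-up frame and a hypothetical text_cols list, that restricts the fill to chosen columns:

```python
import pandas as pd

# Made-up sample: nulls in two text columns and in the numeric column
sample = pd.DataFrame({
    'contbr_employer': ['RETIRED', None],
    'contbr_occupation': [None, 'ENGINEER'],
    'contb_receipt_amt': [25.0, None],
})

# Hypothetical list of the text columns that should receive the placeholder
text_cols = ['contbr_employer', 'contbr_occupation']

# A dict passed to fillna applies each value only to the named column
filled = sample.fillna({col: 'NOT PROVIDE' for col in text_cols})
print(filled)
```

The numeric column keeps its NaN and its float dtype, so later arithmetic on it still works.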
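#6. Handle outliers. Delete the rows whose donation amount is <= 0
# Method 1: keep the rows that are not <= 0; .copy() avoids a SettingWithCopyWarning when columns are added later
df=df.loc[~(df['contb_receipt_amt']<=0)].copy()
print(df.info())
Method 2 (an alternative to Method 1; run only one of them)
# Boolean mask marking the rows with a non-positive amount
mask=df['contb_receipt_amt']<=0
# Collect the index labels of those rows and drop them
drop_index=df.loc[mask].index
df.drop(labels=drop_index,axis=0,inplace=True)
print(df.info())
Output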
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cand_nm 530314 non-null object
1 contbr_nm 530314 non-null object
2 contbr_st 530314 non-null object
3 contbr_employer 530314 non-null object
4 contbr_occupation 530314 non-null object
5 contb_receipt_amt 530314 non-null float64
6 contb_receipt_dt 530314 non-null object
dtypes: float64(1), object(6)
memory usage: 32.4+ MB
None
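The row count drops from 536041 to 530314, so 5727 rows were removed. A minimal sketch, on made-up amounts, of counting the rows a filter will remove before applying it:

```python
import pandas as pd

# Made-up amounts, including a negative and a zero value
sample = pd.DataFrame({'contb_receipt_amt': [250.0, -30800.0, 0.0, 50.0]})

# True marks the rows that will be dropped
mask = sample['contb_receipt_amt'] <= 0
n_removed = int(mask.sum())

# Keep the remaining rows as an independent copy
cleaned = sample.loc[~mask].copy()
print(n_removed, len(cleaned))
```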
#7. Add a new column, party, holding each candidate's party
# Mapping from each candidate to their party
parties = {
'Bachmann, Michelle': 'Republican',
'Romney, Mitt': 'Republican',
'Obama, Barack': 'Democrat',
"Roemer, Charles E. 'Buddy' III": 'Reform',
'Pawlenty, Timothy': 'Republican',
'Johnson, Gary Earl': 'Libertarian',
'Paul, Ron': 'Republican',
'Santorum, Rick': 'Republican',
'Cain, Herman': 'Republican',
'Gingrich, Newt': 'Republican',
'McCotter, Thaddeus G': 'Republican',
'Huntsman, Jon': 'Republican',
'Perry, Rick': 'Republican'
}
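# List the distinct candidates; unique() returns a NumPy array of the distinct values
num1=df['cand_nm'].unique()
print(num1)
# Count the distinct candidates; nunique() returns the number of distinct values (13 here)
num2=df['cand_nm'].nunique()
print(num2)
# Use the mapping to add the party of each candidate
df['party']=df['cand_nm'].map(parties)
print(df.head())
Output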
cand_nm contbr_nm ... contb_receipt_dt party
0 Bachmann, Michelle HARVEY, WILLIAM ... 20-JUN-11 Republican
1 Bachmann, Michelle HARVEY, WILLIAM ... 23-JUN-11 Republican
2 Bachmann, Michelle SMITH, LANIER ... 05-JUL-11 Republican
3 Bachmann, Michelle BLEVINS, DARONDA ... 01-AUG-11 Republican
4 Bachmann, Michelle WARDENBURG, HAROLD ... 20-JUN-11 Republican
[5 rows x 8 columns]
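Series.map returns NaN for any key that is missing from the dictionary, so a candidate left out of the mapping would silently get no party. A minimal sketch of checking for that, using a made-up frame, a shortened mapping named parties_sample, and a hypothetical candidate 'Doe, Jane':

```python
import pandas as pd

# Shortened mapping for illustration; 'Doe, Jane' is deliberately absent
parties_sample = {'Obama, Barack': 'Democrat', 'Paul, Ron': 'Republican'}
sample = pd.DataFrame({'cand_nm': ['Obama, Barack', 'Paul, Ron', 'Doe, Jane']})

sample['party'] = sample['cand_nm'].map(parties_sample)

# Candidates whose party is NaN were not found in the mapping
unmapped = sample.loc[sample['party'].isna(), 'cand_nm'].unique()
print(unmapped)
```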
#8. List the distinct values in the party column
print(df['party'].unique())
Output
['Republican' 'Democrat' 'Reform' 'Libertarian']
#9. Count how many times each value occurs in the party column
print(df['party'].value_counts())  # value_counts() counts how often each distinct value occurs in a Series
Output
Democrat 289999
Republican 234300
Reform 5313
Libertarian 702
Name: party, dtype: int64
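Raw counts can also be expressed as shares of all donations. A minimal sketch on a made-up Series, using the normalize option of value_counts:

```python
import pandas as pd

# Made-up party labels
party = pd.Series(['Democrat', 'Democrat', 'Republican', 'Reform'], name='party')

# normalize=True returns relative frequencies that sum to 1
shares = party.value_counts(normalize=True)
print(shares)
```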
#10. Compute the total political contributions (contb_receipt_amt) received by each party
df_sum=df.groupby(by='party')['contb_receipt_amt'].sum()
print(df_sum)
Output
Democrat 8.259441e+07
Libertarian 4.132769e+05
Reform 3.429658e+05
Republican 1.251181e+08
Name: contb_receipt_amt, dtype: float64
#11. Compute the total political contributions (contb_receipt_amt) received by each party on each day
df_sum2=df.groupby(by=['contb_receipt_dt','party'])['contb_receipt_amt'].sum()
print(df_sum2)
Output
contb_receipt_dt party
01-APR-11 Reform 50.00
Republican 12635.00
01-AUG-11 Democrat 182198.00
Libertarian 1000.00
Reform 1847.00
...
31-MAY-11 Republican 313839.80
31-OCT-11 Democrat 216971.87
Libertarian 4250.00
Reform 3205.00
Republican 751542.36
Name: contb_receipt_amt, Length: 1183, dtype: float64
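The dates are still strings at this point, so the groups above are ordered alphabetically (01-APR-11 comes before 01-AUG-11) rather than chronologically. A minimal sketch, on made-up rows, of parsing the dates first so that the daily totals come out in calendar order, with one column per party:

```python
import pandas as pd

# Made-up rows in the same 'dd-MON-yy' date format
sample = pd.DataFrame({
    'contb_receipt_dt': ['01-AUG-11', '01-APR-11', '01-AUG-11', '31-OCT-11'],
    'party': ['Democrat', 'Republican', 'Reform', 'Democrat'],
    'contb_receipt_amt': [100.0, 50.0, 25.0, 75.0],
})

# Parse the strings into real timestamps
sample['date'] = pd.to_datetime(sample['contb_receipt_dt'], format='%d-%b-%y')

# groupby sorts its keys, so the dates now appear chronologically;
# unstack turns the party level into columns, filling absent pairs with 0
daily = (sample.groupby(['date', 'party'])['contb_receipt_amt']
         .sum()
         .unstack(fill_value=0))
print(daily)
```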
#12. Convert the dates in the table to the 'yyyy-mm-dd' format
months = {"JAN":1, "FEB":2, "MAR":3, "APR":4, "MAY":5, "JUN":6,
          "JUL":7, "AUG":8, "SEP":9, "OCT":10, "NOV":11, "DEC":12}
def transform_date(d):
    # d looks like '20-JUN-11': day, month abbreviation, two-digit year
    day,month,year=d.split("-")
    month = months[month]
    # Zero-pad the month so the result is 'yyyy-mm-dd' rather than 'yyyy-m-dd'
    return '20'+year+'-'+str(month).zfill(2)+'-'+day
df['contb_receipt_dt']=df['contb_receipt_dt'].map(transform_date)
print(df.head())
Output
cand_nm contbr_nm ... contb_receipt_dt party
0 Bachmann, Michelle HARVEY, WILLIAM ... 2011-06-20 Republican
1 Bachmann, Michelle HARVEY, WILLIAM ... 2011-06-23 Republican
2 Bachmann, Michelle SMITH, LANIER ... 2011-07-05 Republican
3 Bachmann, Michelle BLEVINS, DARONDA ... 2011-08-01 Republican
4 Bachmann, Michelle WARDENBURG, HAROLD ... 2011-06-20 Republican
[5 rows x 8 columns]
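An alternative sketch that avoids the hand-written month table and always yields zero-padded months, using only the standard library. The helper name to_iso_date is hypothetical, and the %b directive assumes the default English (C) locale for month abbreviations:

```python
from datetime import datetime


def to_iso_date(d):
    # '%d-%b-%y' reads strings such as '20-JUN-11';
    # strptime matches month abbreviations case-insensitively
    return datetime.strptime(d, '%d-%b-%y').strftime('%Y-%m-%d')


print(to_iso_date('20-JUN-11'))  # 2011-06-20
```

It can be applied in the same way as transform_date, for example with Series.map.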
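#13. Find out whom veterans (donor occupation DISABLED VETERAN) mainly support
# 1. Select the rows whose donor occupation is DISABLED VETERAN
old_bing_df = df.loc[df['contbr_occupation'] == 'DISABLED VETERAN']
# 2. Group by candidate and sum the donations
who1=old_bing_df.groupby(by='cand_nm')['contb_receipt_amt'].sum()
print(who1)
Output (the more money a candidate received, the stronger the support):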
cand_nm
Cain, Herman 300.00
Obama, Barack 4205.00
Paul, Ron 2425.49
Santorum, Rick 250.00
Name: contb_receipt_amt, dtype: float64
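A total can be dominated by a single large donation, so the number of donations is a useful second measure of support. A minimal sketch, on made-up rows, that reports both and ranks the candidates by total:

```python
import pandas as pd

# Made-up donations from donors with the same occupation
sample = pd.DataFrame({
    'cand_nm': ['Obama, Barack', 'Obama, Barack', 'Paul, Ron', 'Cain, Herman'],
    'contbr_occupation': ['DISABLED VETERAN'] * 4,
    'contb_receipt_amt': [100.0, 50.0, 200.0, 30.0],
})

veterans = sample.loc[sample['contbr_occupation'] == 'DISABLED VETERAN']

# Total amount and number of donations per candidate, largest total first
support = (veterans.groupby('cand_nm')['contb_receipt_amt']
           .agg(['sum', 'count'])
           .sort_values('sum', ascending=False))
print(support)
```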
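#14. For each candidate, find the occupation and the amount of the donor who made the largest donation, using query() to look up the donor
# Largest single donation in the whole table
print(df['contb_receipt_amt'].max())
print(df.query('contb_receipt_amt == 1944042.43'))
# Largest single donation received by each candidate
max_amt = df.groupby(by='cand_nm')['contb_receipt_amt'].max()
for cand, max_money in max_amt.items():
    # Match the candidate as well as the amount, so equal amounts given to other candidates are not returned
    print(df.query('cand_nm == @cand and contb_receipt_amt == @max_money'))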
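The same lookup can be done without a loop. A minimal sketch, on made-up rows, that uses idxmax to find the row label of each candidate's largest donation and then selects those rows:

```python
import pandas as pd

# Made-up donations for two candidates
sample = pd.DataFrame({
    'cand_nm': ['Obama, Barack', 'Obama, Barack', 'Paul, Ron', 'Paul, Ron'],
    'contbr_occupation': ['TEACHER', 'ATTORNEY', 'ENGINEER', 'RETIRED'],
    'contb_receipt_amt': [100.0, 2500.0, 300.0, 50.0],
})

# idxmax returns, per candidate, the index label of the largest amount
top_idx = sample.groupby('cand_nm')['contb_receipt_amt'].idxmax()

# Select those rows and keep only the columns of interest
top = sample.loc[top_idx, ['cand_nm', 'contbr_occupation', 'contb_receipt_amt']]
print(top)
```

When several donations tie for the maximum, idxmax keeps only the first one, whereas the query-based loop returns all of them.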
Summary of the 2012 US election political contributions data analysis case study: