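Programming Practice (Pandas): Comprehensive Exercise 1

Reflection: It took me a full day to solve the three problems, and my solutions are clumsy. I plan to read through the better write-ups afterwards and keep studying and practicing. A little over half a month has passed, and I have gone from not being able to write code to finishing a comprehensive exercise on my own (even if the solutions are clumsy and probably overcomplicated in many places). I am quite happy about that and will keep working at it!

1. Diversity of Corporate Revenue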

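The diversity of a company's revenue across industries can be measured with a revenue entropy indicator, defined by analogy with information entropy: $\rm I=-\sum_{i}p(x_i)\log(p(x_i))$, where $\rm p(x_i)$ is the share of the company's total revenue in a given year that comes from industry $i$. The file company.csv lists the companies and years to compute, and company_data.csv holds the companies, the revenue amounts by type, and the revenue dates. Using the data in the second table, add a column to the first table giving each company's revenue entropy $\rm I$ for that year.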
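To make the formula concrete, here is a minimal sketch that evaluates it for a single company-year. The function name `revenue_entropy` and the revenue figures are hypothetical, and dropping zero shares (treating 0·log 0 as 0) is an assumed convention that the problem statement does not spell out.

```python
import numpy as np

def revenue_entropy(revenues):
    """Return -sum(p * log(p)) for the shares p of a revenue vector."""
    revenues = np.asarray(revenues, dtype=float)
    p = revenues / revenues.sum()   # share of each industry in the yearly total
    p = p[p > 0]                    # assumed convention: a zero share contributes 0
    return float(-(p * np.log(p)).sum())

# Hypothetical revenues for one company in one year
example = revenue_entropy([50, 30, 20])      # shares 0.5, 0.3, 0.2
uniform = revenue_entropy([25, 25, 25, 25])  # equal shares give log(4)
single = revenue_entropy([100])              # a single industry gives 0
```

The entropy is largest when revenue is spread evenly and falls to zero when all revenue comes from one industry, which is why it works as a diversity measure.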

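import numpy as np
import pandas as pd

# df1_1 and df1_2 are assumed to hold company.csv and company_data.csv (loading is not shown)
# Add a year column to df1_2
df1_2['年份'] = pd.to_datetime(df1_2['日期']).dt.year
# Total revenue for each company, revenue type, and year
s1 = df1_2.groupby(['证券代码','收入类型','年份'])['收入额'].sum()
len(s1)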
>>> 964022

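The computation shows that s1 is just the original revenue amounts: there is one record per company, revenue type, and year, so p(xi) can be computed directly.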

# Compute p(xi) directly
s2 = df1_2.groupby(['证券代码','年份'])['收入额'].transform(lambda x: x/x.sum())
# Append the share to df1_2
df1_2.insert(df1_2.shape[1], '占比', s2)
# Compute the revenue entropy I for each company and year
df1_2['对数占比'] = np.log(df1_2['占比'])
df1_2['p(xi)*log(p(xi))'] = df1_2['占比'] * df1_2['对数占比']
s3 = -df1_2.groupby(['证券代码','年份'])['p(xi)*log(p(xi))'].sum()
df1_3 = s3.reset_index()
df1_3.columns = ['证券代码', '年份', '信息熵']
df1_3.head()
证券代码 年份 信息熵
0 1 2008 2.085159
1 1 2009 1.671752
2 1 2010 2.108355
3 1 2011 3.150479
4 1 2012 2.718759
# Convert the '#'-prefixed codes in df1_1 to numbers so they match df1_3
df1_4 = pd.concat([df1_1['证券代码'], df1_1['日期'], df1_1['证券代码'].str.split('#',expand = True)], axis=1)
s4 = pd.to_numeric(df1_4.iloc[:,-1])
df1_1.insert(df1_1.shape[1], '证券代码1', s4)
# Left merge: company-years with no match in df1_3 get NaN
df1_1 = df1_1.merge(df1_3, left_on=['证券代码1','日期'],right_on=['证券代码','年份'],how='left')
df1_final = pd.DataFrame(df1_1, columns = ['证券代码_x','日期','信息熵'])
df1_final.columns = ['证券代码','日期','信息熵']
df1_final.head()
证券代码 日期 信息熵
0 #000007 2014 3.070462
1 #000403 2015 2.790585
2 #000408 2016 2.818541
3 #000408 2017 NaN
4 #000426 2015 3.084266
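The NaN in the output above comes from the left merge: a company-year that has no matching row in the entropy table keeps its row but gets no value. A minimal sketch of the key conversion and merge on made-up data follows. The frames `company` and `entropy`, the helper column `code_num`, and the entropy value 1.5 are all hypothetical; `indicator=True` is one way to list the unmatched rows.

```python
import pandas as pd

# Hypothetical stand-ins for the company table and the computed entropy table
company = pd.DataFrame({'证券代码': ['#000007', '#000408'], '日期': [2014, 2017]})
entropy = pd.DataFrame({'证券代码': [7], '年份': [2014], '信息熵': [1.5]})  # made-up value

# Strip the leading '#' and convert to a number so the keys are comparable
keys = company.assign(code_num=pd.to_numeric(company['证券代码'].str.lstrip('#')))

# Rename the right-hand keys so both sides share the same key names
merged = keys.merge(
    entropy.rename(columns={'证券代码': 'code_num', '年份': '日期'}),
    on=['code_num', '日期'], how='left', indicator=True)

result = merged[['证券代码', '日期', '信息熵']]
# Rows that exist only on the left side had no entropy to merge in
unmatched = merged.loc[merged['_merge'] == 'left_only', ['证券代码', '日期']]
```

Renaming the key columns before merging avoids the `_x`/`_y` suffixes, so no column renaming is needed afterwards.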

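2. Reshaping the Team-Learning Information Table

Reshape the team information table from the team-learning program into the form shown below, where the 是否队长 (is-captain) column is 1 for a team captain and 0 otherwise.

My approach:
Step 1: Merge each ID and nickname into a single column
Step 2: Use melt to turn the wide table into a long table
Step 3: Split the merged column back apart
Step 4: Add the 是否队长 column
Step 5: Sort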

df2_1 = pd.read_excel('data/final/组队信息汇总表(Pandas).xlsx')
# Step 1: Merge nickname and ID into one column (a clumsy approach; to be optimized later)
df2_1 = df2_1.astype(str)
df2_1['队长'] = df2_1['队长_群昵称'] + '|' +  df2_1['队长编号']
df2_1['队员1'] = df2_1['队员_群昵称'] + '|' + df2_1['队员1 编号']
df2_1['队员2'] = df2_1['队员_群昵称.1'] + '|' + df2_1['队员2 编号'] 
df2_1['队员3'] = df2_1['队员_群昵称.2'] + '|' + df2_1['队员3 编号']
df2_1['队员4'] = df2_1['队员_群昵称.3'] + '|' + df2_1['队员4 编号']
df2_1['队员5'] = df2_1['队员_群昵称.4'] + '|' + df2_1['队员5 编号']
df2_1['队员6'] = df2_1['队员_群昵称.5'] + '|' + df2_1['队员6 编号']
df2_1['队员7'] = df2_1['队员_群昵称.6'] + '|' + df2_1['队员7 编号']
df2_1['队员8'] = df2_1['队员_群昵称.7'] + '|' + df2_1['队员8 编号']
df2_1['队员9'] = df2_1['队员_群昵称.8'] + '|' + df2_1['队员9 编号']
df2_1['队员10'] = df2_1['队员_群昵称.9'] + '|' + df2_1['队员10编号']
# Create a new table with only the team name and the merged columns
df2_2 = pd.DataFrame(df2_1, columns = ['队伍名称', '队长', '队员1', '队员2', '队员3', '队员4', '队员5', '队员6', '队员7', '队员8', '队员9', '队员10'])
df2_2 = df2_2.reset_index()
# Step 2: Use melt to turn the wide table into a long table
df_melted = df2_2.melt(id_vars = ['index','队伍名称'],
                       value_vars  = ['队长', '队员1', '队员2', '队员3', '队员4','队员5', '队员6','队员7', '队员8','队员9', '队员10'],
                       var_name = 'title',
                       value_name = '昵称|编号')
# Step 3: Split the merged column back into nickname and ID
df2_3 = pd.concat([df_melted['index'],df_melted['队伍名称'],df_melted['title'], df_melted['昵称|编号'].str.split('|', expand=True)], axis=1)
df2_3.columns = ['index','队伍名称','title','昵称','编号']
# Missing members became the string 'nan' after astype(str); drop them (copy to avoid modifying a slice)
df2_4 = df2_3[~df2_3['编号'].isin(['nan'])].copy()
# Step 4: Add the 是否队长 column
s1 = df2_4.title.apply(lambda x: 1 if '队长' in x else 0)
df2_4.insert(1,'是否队长',s1)
# Step 5: Sort by team, captain first
df2_4 = df2_4.sort_values(by=['index','是否队长'], ascending=[True, False])
df2_4 = df2_4.reset_index(drop=True)
df2_final = pd.DataFrame(df2_4, columns = ['是否队长','队伍名称','昵称','编号'])
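One possible way to shorten the steps above is to skip the merge-then-split round trip and stack the (ID, nickname) column pairs directly. This is only a sketch on a made-up two-team table. It assumes the columns after 队伍名称 come in (ID, nickname) pairs with the captain first; the helper columns `team_order` and `pos` are hypothetical names.

```python
import pandas as pd

# Hypothetical miniature of the wide team table (captain plus two member slots)
wide = pd.DataFrame({
    '队伍名称': ['TeamA', 'TeamB'],
    '队长编号': [1, 4],
    '队长_群昵称': ['alice', 'dave'],
    '队员1 编号': [2, 5],
    '队员_群昵称': ['bob', 'erin'],
    '队员2 编号': [3, None],
    '队员_群昵称.1': ['carol', None],
})

# Assumption: after the team name, columns alternate ID, nickname, ID, nickname, ...
member_cols = [c for c in wide.columns if c != '队伍名称']
pairs = list(zip(member_cols[0::2], member_cols[1::2]))

parts = []
for pos, (id_col, name_col) in enumerate(pairs):
    part = wide[['队伍名称', id_col, name_col]].copy()
    part.columns = ['队伍名称', '编号', '昵称']
    part['是否队长'] = int(pos == 0)   # the first pair is the captain
    part['team_order'] = part.index    # original row order of the team
    part['pos'] = pos                  # slot order within the team
    parts.append(part)

# Stack all slots, drop the empty ones, and restore team/slot order
long = pd.concat(parts, ignore_index=True).dropna(subset=['编号'])
long = long.sort_values(['team_order', 'pos']).astype({'编号': int})
result = long.reset_index(drop=True)[['是否队长', '队伍名称', '昵称', '编号']]
```

Because the IDs are never converted to strings, missing members stay as real NaN values and can be removed with `dropna` instead of filtering on the text 'nan'.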

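3. US Presidential Election Voting

The two tables give the population of each US county and the voting results of the presidential election. Answer the following questions:

  • How many counties have a total vote count greater than half of the county's population?
  • Build a table with state as the row index and candidates as the column names, ordering the columns by each candidate's nationwide total votes from highest to lowest; each cell holds the total votes the candidate received in that state.
  • Each state contains a number of counties. Define a county's BT indicator as Biden's vote share in the county minus Trump's vote share in the county. If the median BT indicator across all counties of a state is greater than 0, the state is called a Biden State. Find all Biden States.

3.1 My approach:
Step 1: Compute the total votes for each county and build a new table df3_3
Step 2: Combine the columns of df3_3 into a ".county, state" key to match df3_1
Step 3: Count the counties whose total votes exceed half of the county population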

df3_1 = pd.read_csv('data/final/county_population.csv')
df3_2 = pd.read_csv('data/final/president_county_candidate.csv')
# Step 1: Compute the total votes for each county and build a new table df3_3
df3_3 = df3_2.groupby(['state','county'])['total_votes']
df3_3 = pd.DataFrame(df3_3.sum()).reset_index()
# Step 2: Build a ".county, state" key column from df3_3
df3_3['US County'] = '.' + df3_3['county'] + ', ' + df3_3['state']
# Match the new key column against df3_1
df3_4 = df3_3.merge(df3_1, on='US County', how='left')
# Step 3: Keep the rows where total votes exceed half of the county population
df3_5 = df3_4[(df3_4['total_votes']/df3_4['Population']>0.5)]
len(df3_5)
>>> 1434
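The three steps can be sketched on a tiny made-up dataset. The frames `votes` and `population` and all numbers below are hypothetical; only the column names and the ".county, state" key format follow the code above.

```python
import pandas as pd

# Hypothetical per-candidate vote rows and county populations
votes = pd.DataFrame({
    'state': ['Alabama'] * 2 + ['Alaska'] * 2,
    'county': ['Autauga County'] * 2 + ['District 1'] * 2,
    'candidate': ['Joe Biden', 'Donald Trump'] * 2,
    'total_votes': [300, 500, 100, 150],
})
population = pd.DataFrame({
    'US County': ['.Autauga County, Alabama', '.District 1, Alaska'],
    'Population': [1000, 1000],
})

# Step 1: sum the candidates' votes to get one total per county
county_votes = votes.groupby(['state', 'county'], as_index=False)['total_votes'].sum()
# Step 2: build the key in the same format as the population table
county_votes['US County'] = '.' + county_votes['county'] + ', ' + county_votes['state']
merged = county_votes.merge(population, on='US County', how='left')
# Step 3: a boolean mask can be counted directly with sum()
over_half = merged['total_votes'] / merged['Population'] > 0.5
n_counties = int(over_half.sum())
```

With a left merge, a county whose key finds no match has a NaN population, so its comparison is False and it is silently left out of the count.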

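3.2 My approach:
Step 1: Sum each candidate's total votes and sort
Step 2: Take the column names in that order
Step 3: Turn the long table into a wide table with pivot_table
Step 4: Reorder the columns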

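# Step 1: Sum each candidate's total votes and sort
df3_6 = df3_2.groupby(['candidate'])['total_votes']
# Step 2: Take the column names in that order
cols = pd.DataFrame(df3_6.sum().sort_values(ascending = False)).index
# Step 3: Turn the long table into a wide table with pivot_table
# aggfunc='sum' gives each candidate's total per state; the default ('mean') averages over
# counties, and the printed values below come from a run with that default
df3_7 = df3_2.pivot_table(index = 'state',
                  columns = 'candidate',
                  values = 'total_votes',
                  aggfunc = 'sum')
# Step 4: Reorder the columns
df3_7 = df3_7[cols]
df3_7.head()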
candidate Joe Biden Donald Trump Jo Jorgensen Howie Hawkins Write-ins Rocky De La Fuente Gloria La Riva Kanye West Don Blankenship Brock Pierce ... Tom Hoefling Ricki Sue King Princess Jacob-Fambro Blake Huber Richard Duncan Joseph Kishore Jordan Scott Gary Swing Keith McCormic Zachary Scalf
state
Alabama 12681.313433 21509.970149 375.761194 NaN 109.134328 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Alaska 3835.125000 4747.300000 222.400000 NaN 877.179487 7.950000 NaN NaN 28.175000 20.625000 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Arizona 111476.200000 110779.066667 3431.000000 NaN 135.466667 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Arkansas 5652.426667 10141.960000 175.106667 39.733333 NaN 17.613333 17.813333 54.653333 28.106667 28.546667 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
California 191547.655172 103551.051724 3239.396552 1396.982759 5.333333 1037.155172 879.931034 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 38 columns
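The fractional values in the output above are what the default `aggfunc='mean'` of `pivot_table` produces, that is, per-county averages rather than state totals. A minimal sketch on hypothetical toy data (states A and B, candidates X and Y, all numbers made up) shows the difference and how `aggfunc='sum'` yields the totals the task asks for:

```python
import pandas as pd

# Hypothetical long-format votes: two states, three counties, two candidates
votes = pd.DataFrame({
    'state': ['A', 'A', 'A', 'A', 'B', 'B'],
    'county': ['a1', 'a1', 'a2', 'a2', 'b1', 'b1'],
    'candidate': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
    'total_votes': [10, 20, 30, 5, 7, 100],
})

# Column order: candidates by nationwide total, highest first
order = (votes.groupby('candidate')['total_votes']
              .sum().sort_values(ascending=False).index)

# Default aggregation is the mean over the counties of each state
table_mean = votes.pivot_table(index='state', columns='candidate',
                               values='total_votes')[order]

# Summing gives state totals; fill_value=0 replaces missing combinations
table_sum = votes.pivot_table(index='state', columns='candidate',
                              values='total_votes', aggfunc='sum',
                              fill_value=0)[order]
```

In this toy data candidate X has 10 and 30 votes in the two counties of state A, so the mean table shows 20 while the sum table shows 40.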

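3.3 My approach:
Step 1: Compute the BT indicator for every county, using the vote-count difference (Joe Biden_votes - Donald Trump_votes). Within a county both vote shares have the same denominator, so this difference has the same sign as the vote-share difference in the problem statement.
Step 2: Find the Biden States (median(BT) > 0)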

# Build a new table with one row per county and one column per candidate
df3_8 = df3_2.pivot_table(index = ['state', 'county'],
                          columns = 'candidate',
                          values = 'total_votes')
col2 = ['Joe Biden', 'Donald Trump']
df3_8 = df3_8[col2].reset_index()
# Step 1: Compute the BT indicator for every county (Joe Biden_votes - Donald Trump_votes)
df3_8['BT指标'] = df3_8['Joe Biden'] - df3_8['Donald Trump']
# Step 2: Find the Biden States (median(BT) > 0)
gb = df3_8.groupby(['state'])['BT指标']
df3_9 = pd.DataFrame(gb.median()>0).reset_index()
Biden_State = df3_9[df3_9['BT指标']]
len(Biden_State)
>>> 9
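The problem statement defines the BT indicator with vote shares, while the code above subtracts raw vote counts. A sketch of the share-based version on hypothetical toy data is given below; the states S1 and S2, the counties, and all vote numbers are made up, and each county's share is taken relative to the sum of all candidates' votes in that county.

```python
import pandas as pd

# Hypothetical rows: (state, county, candidate, votes)
rows = [
    ('S1', 'c1', 'Joe Biden', 60), ('S1', 'c1', 'Donald Trump', 30),
    ('S1', 'c1', 'Jo Jorgensen', 10),
    ('S1', 'c2', 'Joe Biden', 55), ('S1', 'c2', 'Donald Trump', 45),
    ('S1', 'c3', 'Joe Biden', 10), ('S1', 'c3', 'Donald Trump', 90),
    ('S2', 'c4', 'Joe Biden', 30), ('S2', 'c4', 'Donald Trump', 70),
    ('S2', 'c5', 'Joe Biden', 45), ('S2', 'c5', 'Donald Trump', 55),
]
votes = pd.DataFrame(rows, columns=['state', 'county', 'candidate', 'total_votes'])

# One row per county, one column per candidate
county = votes.pivot_table(index=['state', 'county'], columns='candidate',
                           values='total_votes', aggfunc='sum', fill_value=0)

# Divide each row by its own total to turn counts into vote shares
share = county.div(county.sum(axis=1), axis=0)
bt = share['Joe Biden'] - share['Donald Trump']

# Median BT per state, then keep the states with a positive median
median_bt = bt.groupby(level='state').median()
biden_states = median_bt[median_bt > 0].index.tolist()
```

Count and share differences always agree in sign within a single county, but the two medians can differ in sign when a state has an even number of counties and the two middle values straddle zero, so the share-based version is the safer reading of the definition.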
