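The diversity of a company's income across industries can be measured with an income entropy index defined by analogy with information entropy: $I=-\sum_{i}p(x_i)\log(p(x_i))$, where $p(x_i)$ is the share of the company's income from one industry in a given year relative to its total income from all industries in that year. The file company.csv lists the companies and years to compute, and company_data.csv holds the companies, the income amount for each income type, and the income dates. Using the data in the second table, add a column to the first table giving the income entropy index $I$ for each company and year.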
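Before working on the real tables, it can help to check the definition on a few hand-made numbers. The sketch below is only an illustration: the function name `income_entropy` is made up for this example, and it assumes the natural logarithm (the same choice as `np.log` used later in this post) and the usual convention that a zero share contributes nothing.

```python
import numpy as np

def income_entropy(incomes):
    """Return I = -sum(p * log(p)) for the shares of the given income amounts."""
    incomes = np.asarray(incomes, dtype=float)
    p = incomes / incomes.sum()   # share of each income type
    p = p[p > 0]                  # convention: 0 * log(0) counts as 0
    return float(-(p * np.log(p)).sum())

# Four equally sized income sources: the entropy equals log(4).
print(income_entropy([25, 25, 25, 25]))
# A single income source: no diversity, so the entropy is 0.
print(income_entropy([100, 0, 0]))
```

The more evenly income is spread over industries, the larger the index; a company with one income source scores 0.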
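import numpy as np
import pandas as pd
# df1_1 is assumed to hold company.csv and df1_2 to hold company_data.csv, both loaded beforehand
# Add a year column to df1_2
df1_2['年份'] = pd.to_datetime(df1_2['日期']).dt.year
# Total income for each company, income type, and year
s1 = df1_2.groupby(['证券代码','收入类型','年份'])['收入额'].sum()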
len(s1)
>>> 964022
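This check shows that s1 is just the income amounts themselves: the data has one record per company, income type, and year, so p(xi) can be computed directly.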
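One way to confirm that each (company, income type, year) combination appears only once is to compare the number of groups with the number of rows, or to look for duplicated keys. This is a minimal sketch on made-up data; the frame `toy` and its values are hypothetical, only the column names follow the post.

```python
import pandas as pd

# Hypothetical miniature of company_data.csv after the year column is added.
toy = pd.DataFrame({
    '证券代码': [1, 1, 1, 2],
    '收入类型': ['A', 'B', 'A', 'A'],
    '年份': [2008, 2008, 2009, 2008],
    '收入额': [10.0, 30.0, 5.0, 7.0],
})

keys = ['证券代码', '收入类型', '年份']
n_groups = toy.groupby(keys).ngroups               # number of distinct key combinations
has_duplicates = toy.duplicated(subset=keys).any() # True if any key repeats

# If both checks pass, grouping by the keys aggregates nothing.
print(n_groups == len(toy), has_duplicates)
```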
# Compute p(xi) directly: each income amount divided by the company's total for that year
s2 = df1_2.groupby(['证券代码','年份'])['收入额'].transform(lambda x: x/x.sum())
# Add the share to df1_2
df1_2.insert(df1_2.shape[1], '占比', s2)
# Compute the income entropy index I for each company and year
df1_2['对数占比'] = df1_2['占比'].apply(np.log)
df1_2['p(xi)*log(p(xi))'] = df1_2['占比'] * df1_2['对数占比']
s3 = -df1_2.groupby(['证券代码','年份'])['p(xi)*log(p(xi))'].sum()
df1_3 = s3.reset_index()
df1_3.columns = ['证券代码', '年份', '信息熵']
df1_3.head()
| | 证券代码 | 年份 | 信息熵 |
---|---|---|---|
0 | 1 | 2008 | 2.085159 |
1 | 1 | 2009 | 1.671752 |
2 | 1 | 2010 | 2.108355 |
3 | 1 | 2011 | 3.150479 |
4 | 1 | 2012 | 2.718759 |
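If some income amounts are zero, `np.log` returns `-inf` for that share and the product `p * log(p)` becomes NaN; the grouped sum skips NaN by default, so the total is normally unaffected, but it may be cleaner to make the convention 0 · log 0 = 0 explicit. The sketch below does that on made-up data. The frame `toy` and the helper column `plogp` are hypothetical; the other column names follow the post.

```python
import numpy as np
import pandas as pd

# Hypothetical data: company 1 has one zero income line in 2008.
toy = pd.DataFrame({
    '证券代码': [1, 1, 1, 2, 2],
    '年份': [2008, 2008, 2008, 2008, 2008],
    '收入额': [50.0, 50.0, 0.0, 10.0, 30.0],
})

keys = ['证券代码', '年份']
share = toy['收入额'] / toy.groupby(keys)['收入额'].transform('sum')

# Replace non-positive shares by 1 so that their term is 1 * log(1) = 0.
safe = share.where(share > 0, 1.0)
toy['plogp'] = safe * np.log(safe)

entropy = (toy.groupby(keys)['plogp'].sum()
              .mul(-1)
              .rename('信息熵')
              .reset_index())
print(entropy)
```

Company 1 splits its income evenly between two sources, so its index is log(2); the zero line does not change it.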
# Convert 证券代码 in df1_1 to numbers (drop the leading '#') so it matches df1_3
s4 = pd.to_numeric(df1_1['证券代码'].str.lstrip('#'))
df1_1.insert(df1_1.shape[1], '证券代码1', s4)
# Left merge keeps every company-year in df1_1; pairs with no income data get NaN
df1_1 = df1_1.merge(df1_3, left_on=['证券代码1','日期'],right_on=['证券代码','年份'],how='left')
df1_final = df1_1[['证券代码_x','日期','信息熵']].copy()
df1_final.columns = ['证券代码','日期','信息熵']
df1_final.head()
| | 证券代码 | 日期 | 信息熵 |
---|---|---|---|
0 | #000007 | 2014 | 3.070462 |
1 | #000403 | 2015 | 2.790585 |
2 | #000408 | 2016 | 2.818541 |
3 | #000408 | 2017 | NaN |
4 | #000426 | 2015 | 3.084266 |
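A NaN in the result means the left merge found no matching company-year in the entropy table. One way to list such rows is the `indicator=True` option of `merge`, sketched here on made-up data; the frames `targets` and `entropy` and their values are hypothetical, only the column names follow the post.

```python
import pandas as pd

# Hypothetical miniature of company.csv and of the entropy table.
targets = pd.DataFrame({'证券代码': ['#000007', '#000408', '#000408'],
                        '日期': [2014, 2016, 2017]})
entropy = pd.DataFrame({'证券代码': [7, 408],
                        '年份': [2014, 2016],
                        '信息熵': [3.07, 2.82]})  # made-up values

# Numeric version of the code, as in the post.
targets['证券代码1'] = pd.to_numeric(targets['证券代码'].str.lstrip('#'))

merged = targets.merge(entropy,
                       left_on=['证券代码1', '日期'],
                       right_on=['证券代码', '年份'],
                       how='left', indicator=True)

# Rows marked 'left_only' had no partner in the entropy table.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['证券代码_x', '日期']])
```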
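Reshape the team-learning roster table into long form, with one row per person and the columns 是否队长, 队伍名称, 昵称, and 编号, where the “是否队长” (is captain) column is 1 for the captain and 0 otherwise.
My approach:
Step 1: Merge each ID and nickname into one column
Step 2: Use melt to turn the wide table into a long table
Step 3: Split the merged column
Step 4: Add the 是否队长 column
Step 5: Sort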
df2_1 = pd.read_excel('data/final/组队信息汇总表(Pandas).xlsx')
# Step 1: merge ID and nickname (a clumsy approach, to be optimized later)
df2_1 = df2_1.astype(str)
df2_1['队长'] = df2_1['队长_群昵称'] + '|' + df2_1['队长编号']
df2_1['队员1'] = df2_1['队员_群昵称'] + '|' + df2_1['队员1 编号']
df2_1['队员2'] = df2_1['队员_群昵称.1'] + '|' + df2_1['队员2 编号']
df2_1['队员3'] = df2_1['队员_群昵称.2'] + '|' + df2_1['队员3 编号']
df2_1['队员4'] = df2_1['队员_群昵称.3'] + '|' + df2_1['队员4 编号']
df2_1['队员5'] = df2_1['队员_群昵称.4'] + '|' + df2_1['队员5 编号']
df2_1['队员6'] = df2_1['队员_群昵称.5'] + '|' + df2_1['队员6 编号']
df2_1['队员7'] = df2_1['队员_群昵称.6'] + '|' + df2_1['队员7 编号']
df2_1['队员8'] = df2_1['队员_群昵称.7'] + '|' + df2_1['队员8 编号']
df2_1['队员9'] = df2_1['队员_群昵称.8'] + '|' + df2_1['队员9 编号']
df2_1['队员10'] = df2_1['队员_群昵称.9'] + '|' + df2_1['队员10编号']
# Create a new table
df2_2 = pd.DataFrame(df2_1, columns = ['队伍名称', '队长', '队员1', '队员2', '队员3', '队员4', '队员5', '队员6', '队员7', '队员8', '队员9', '队员10'])
df2_2 = df2_2.reset_index()
# Step 2: use melt to turn the wide table into a long table
df_melted = df2_2.melt(id_vars = ['index','队伍名称'],
value_vars = ['队长', '队员1', '队员2', '队员3', '队员4','队员5', '队员6','队员7', '队员8','队员9', '队员10'],
var_name = 'title',
value_name = '昵称|编号')
# Step 3: split the merged column back into nickname and ID
df2_3 = pd.concat([df_melted['index'],df_melted['队伍名称'],df_melted['title'], df_melted['昵称|编号'].str.split('|', expand=True)], axis=1)
df2_3.columns = ['index','队伍名称','title','昵称','编号']
# Drop empty member slots (missing IDs became the string 'nan' after astype(str))
df2_4 = df2_3[~df2_3['编号'].isin(['nan'])].copy()
# Step 4: add the 是否队长 column
s1 = df2_4['title'].apply(lambda x: 1 if '队长' in x else 0)
df2_4.insert(1,'是否队长',s1)
# Step 5: sort by team, captain first
df2_4 = df2_4.sort_values(by=['index','是否队长'], ascending=[True, False])
df2_4 = df2_4.reset_index(drop=True)
df2_final = df2_4[['是否队长','队伍名称','昵称','编号']]
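The eleven near-identical assignments in Step 1 could probably be avoided by stacking one (ID, nickname) column pair at a time. The sketch below is only one possible way to do it, on made-up data: it assumes that after 队伍名称 the columns come in (ID, nickname) pairs with the captain first, which may not match the real spreadsheet, and the helper columns `team_order` and `slot` are hypothetical.

```python
import pandas as pd

# Hypothetical roster with a captain and two member slots.
df = pd.DataFrame({
    '队伍名称': ['A队', 'B队'],
    '队长编号': [1, 4],
    '队长_群昵称': ['甲', '丁'],
    '队员1 编号': [2, 5],
    '队员_群昵称': ['乙', '戊'],
    '队员2 编号': [3, None],
    '队员_群昵称.1': ['丙', None],
})

# Assumption: columns after the team name are (ID, nickname) pairs, captain first.
value_cols = df.columns[1:]
parts = []
for slot in range(len(value_cols) // 2):
    id_col, name_col = value_cols[2 * slot], value_cols[2 * slot + 1]
    parts.append(pd.DataFrame({
        'team_order': range(len(df)),   # keeps the original team order
        'slot': slot,                   # 0 = captain, 1.. = members
        '是否队长': int(slot == 0),
        '队伍名称': df['队伍名称'],
        '昵称': df[name_col],
        '编号': df[id_col],
    }))

long_df = (pd.concat(parts, ignore_index=True)
             .dropna(subset=['编号'])            # drop empty member slots
             .sort_values(['team_order', 'slot'])
             .reset_index(drop=True))
result = long_df[['是否队长', '队伍名称', '昵称', '编号']]
print(result)
```

Working with the pairs directly avoids the string concatenation and the later split, so missing members stay as real missing values instead of the string 'nan'.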
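Two data tables give the population of each US county and the county-level results of the presidential election. Answer the following questions:
- How many counties have a total number of votes that exceeds half of the county population?
- Build a table with the state as the row index and the candidates as the columns, with the columns ordered by each candidate's nationwide vote total from high to low; each cell holds the total votes the candidate received in that state.
- Each state contains several counties. Define a county's BT index as Biden's vote share in the county minus Trump's vote share in the county. If the median of the BT indices over all counties of a state is greater than 0, the state is called a Biden State. Find all Biden States.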
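3.1 My approach:
Step 1: Compute the total votes for each county and store them in a new table df3_3
Step 2: Combine the county and state columns of df3_3 into the form ".county, state" so they match df3_1
Step 3: Count the counties whose total votes exceed half of the county population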
df3_1 = pd.read_csv('data/final/county_population.csv')
df3_2 = pd.read_csv('data/final/president_county_candidate.csv')
# Step 1: total votes per county, stored in the new table df3_3
df3_3 = df3_2.groupby(['state','county'])['total_votes'].sum().reset_index()
# Step 2: build a '.county, state' key in df3_3, the format of df3_1['US County']
df3_3['US County'] = '.' + df3_3['county'] + ', ' + df3_3['state']
# Match against df3_1 on the new key
df3_4 = df3_3.merge(df3_1, on='US County', how='left')
# Step 3: keep the rows where total votes exceed half of the county population
df3_5 = df3_4[(df3_4['total_votes']/df3_4['Population']>0.5)]
len(df3_5)
>>> 1434
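The three steps can be traced on a tiny made-up example. The frames `votes` and `population` and all numbers below are hypothetical; the column names and the ".county, state" key format follow the post.

```python
import pandas as pd

# Hypothetical vote and population tables for two counties.
votes = pd.DataFrame({
    'state': ['Alabama'] * 4,
    'county': ['Autauga County', 'Autauga County',
               'Baldwin County', 'Baldwin County'],
    'candidate': ['Joe Biden', 'Donald Trump', 'Joe Biden', 'Donald Trump'],
    'total_votes': [300, 400, 100, 200],
})
population = pd.DataFrame({
    'US County': ['.Autauga County, Alabama', '.Baldwin County, Alabama'],
    'Population': [1000, 1000],
})

# Step 1: total votes per county.
county_votes = votes.groupby(['state', 'county'], as_index=False)['total_votes'].sum()
# Step 2: build the key used by the population table and merge.
county_votes['US County'] = '.' + county_votes['county'] + ', ' + county_votes['state']
merged = county_votes.merge(population, on='US County', how='left')
# Step 3: count counties whose votes exceed half of the population.
n_over_half = int((merged['total_votes'] > 0.5 * merged['Population']).sum())
print(n_over_half)
```

Counties that find no match in the population table get a NaN population, the comparison is then False, and they are not counted.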
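3.2 My approach:
Step 1: Sum each candidate's total votes and sort
Step 2: Take the column names in that order
Step 3: Long table to wide table with pivot_table
Step 4: Reorder the columns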
# Step 1: sum each candidate's total votes and sort
df3_6 = df3_2.groupby(['candidate'])['total_votes']
# Step 2: take the column names in descending order of nationwide votes
cols = df3_6.sum().sort_values(ascending = False).index
# Step 3: long table to wide table with pivot_table; aggfunc='sum' adds up the county votes
# (the default aggfunc='mean' gives per-county averages instead of state totals)
df3_7 = df3_2.pivot_table(index = 'state',
                          columns = 'candidate',
                          values = 'total_votes',
                          aggfunc = 'sum')
# Step 4: reorder the columns
df3_7 = df3_7[cols]
df3_7.head()
The output below comes from a run with the default aggfunc='mean', so the values shown are per-county averages rather than state totals:
state | Joe Biden | Donald Trump | Jo Jorgensen | Howie Hawkins | Write-ins | Rocky De La Fuente | Gloria La Riva | Kanye West | Don Blankenship | Brock Pierce | ... | Tom Hoefling | Ricki Sue King | Princess Jacob-Fambro | Blake Huber | Richard Duncan | Joseph Kishore | Jordan Scott | Gary Swing | Keith McCormic | Zachary Scalf |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alabama | 12681.313433 | 21509.970149 | 375.761194 | NaN | 109.134328 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Alaska | 3835.125000 | 4747.300000 | 222.400000 | NaN | 877.179487 | 7.950000 | NaN | NaN | 28.175000 | 20.625000 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Arizona | 111476.200000 | 110779.066667 | 3431.000000 | NaN | 135.466667 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Arkansas | 5652.426667 | 10141.960000 | 175.106667 | 39.733333 | NaN | 17.613333 | 17.813333 | 54.653333 | 28.106667 | 28.546667 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
California | 191547.655172 | 103551.051724 | 3239.396552 | 1396.982759 | 5.333333 | 1037.155172 | 879.931034 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 38 columns
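Since `pivot_table` aggregates with the mean unless told otherwise, fractional vote counts are a sign that averages, not totals, are being shown. The difference is easy to see on a small made-up table; the frame `votes` and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical votes: state S1 has two counties, state S2 has one.
votes = pd.DataFrame({
    'state': ['S1', 'S1', 'S1', 'S1', 'S2', 'S2'],
    'county': ['c1', 'c1', 'c2', 'c2', 'c3', 'c3'],
    'candidate': ['A', 'B', 'A', 'B', 'A', 'B'],
    'total_votes': [10, 20, 30, 5, 1, 2],
})

# Column order: nationwide totals from high to low.
order = (votes.groupby('candidate')['total_votes'].sum()
              .sort_values(ascending=False).index)

# State totals need aggfunc='sum'.
totals = votes.pivot_table(index='state', columns='candidate',
                           values='total_votes', aggfunc='sum')[order]
# The default aggfunc='mean' averages over the counties of each state.
means = votes.pivot_table(index='state', columns='candidate',
                          values='total_votes')[order]

print(totals)
print(means)
```

For state S1 and candidate A the total is 40 while the default gives the county average 20.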
3.3 My approach:
Step 1: Compute the BT index for every county (Joe Biden_votes - Donald Trump_votes)
Step 2: Find the Biden States (median(BT) > 0)
# Build a county-level table with one column per candidate
df3_8 = df3_2.pivot_table(index = ['state', 'county'],
                          columns = 'candidate',
                          values = 'total_votes')
col2 = ['Joe Biden', 'Donald Trump']
df3_8 = df3_8[col2].reset_index()
# Step 1: BT index for every county, computed here as a vote difference
# (within a county it has the same sign as the vote-share difference)
df3_8['BT指标'] = df3_8['Joe Biden'] - df3_8['Donald Trump']
# Step 2: find the Biden States (median(BT) > 0)
gb = df3_8.groupby(['state'])['BT指标']
df3_9 = pd.DataFrame(gb.median()>0).reset_index()
Biden_State = df3_9[df3_9['BT指标']]
len(Biden_State)
>>> 9
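The problem defines the BT index through vote shares, while the code above uses the raw vote difference. Within one county the two always have the same sign, but the state median can differ when a state has an even number of counties and the two middle values have opposite signs. A share-based version could look like the sketch below. It assumes that a candidate's vote share is their votes divided by all votes cast in the county, and the frame `votes` with its numbers is made up.

```python
import pandas as pd

# Hypothetical votes: S1 has three counties, S2 has two.
votes = pd.DataFrame({
    'state': ['S1'] * 6 + ['S2'] * 4,
    'county': ['c1', 'c1', 'c2', 'c2', 'c3', 'c3', 'd1', 'd1', 'd2', 'd2'],
    'candidate': ['Joe Biden', 'Donald Trump'] * 5,
    'total_votes': [60, 40, 55, 45, 10, 90, 30, 70, 45, 55],
})

# One row per county, one column per candidate.
wide = votes.pivot_table(index=['state', 'county'], columns='candidate',
                         values='total_votes', aggfunc='sum')
# Assumption: the share denominator is all votes cast in the county.
county_total = votes.groupby(['state', 'county'])['total_votes'].sum()

# BT = Biden share - Trump share, aligned on the (state, county) index.
bt = (wide['Joe Biden'] - wide['Donald Trump']) / county_total
median_bt = bt.groupby(level='state').median()
biden_states = median_bt[median_bt > 0].index.tolist()
print(biden_states)
```

In this toy data S1 has county BT values 0.2, 0.1, and -0.8, so its median is 0.1 and it counts as a Biden State, while S2 does not.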