Study reference: http://datawhale.club/t/topic/579
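[Problem description] The diversity of a company's business income can be measured by an income entropy indicator, defined by analogy with information entropy:
$$I = -\sum_i p(x_i)\log_2 p(x_i)$$
where $p(x_i)$ is the share of the company's total income across all business lines in a given year that comes from one business line. company.csv lists the companies and years to be computed; company_data.csv holds the companies, the income amount for each income type, and the income year. Using the data in the second table, add a column to the first table that gives the income entropy indicator I for each company and year.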
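To make the definition concrete, here is a minimal sketch of the entropy calculation on made-up income figures. The helper name `entropy_bits` and the sample values are illustrative assumptions, not part of the exercise data. It normalizes by the sum of absolute values, which agrees with the notebook's own function only when all incomes share the same sign.

```python
import numpy as np

def entropy_bits(incomes):
    """Base-2 entropy of income shares (hypothetical helper for illustration)."""
    values = np.abs(np.asarray(incomes, dtype=float))
    shares = values / values.sum()
    shares = shares[shares > 0]  # treat 0 * log2(0) as 0 by dropping zero shares
    return float(-(shares * np.log2(shares)).sum())

even = entropy_bits([25, 25, 25, 25])   # income spread evenly over four types
single = entropy_bits([100, 0, 0])      # all income from a single type
print(even, single)
```

Evenly spread income gives the largest entropy for a given number of income types, while income concentrated in one type gives zero.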
import numpy as np
import pandas as pd
df1 = pd.read_csv('../data/第一次综合练习-数据集/任务一/company.csv')
df2 = pd.read_csv('../data/第一次综合练习-数据集/任务一/company_data.csv')
df1.head()
df2.head()
df1['code'] = df1['证券代码'].str.replace('#[0]*','',regex=True).astype('int64') # new integer `code` column: strip the leading '#' and any zeros after it
df1.head()
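The cleanup pattern used for the `code` column can be checked on a few strings first. This is a small sketch; the sample codes are invented and only assume the '#' plus zero-padded format that the regular expression implies.

```python
import pandas as pd

# Invented sample codes in the assumed '#'-prefixed, zero-padded format.
codes = pd.Series(['#000007', '#000403', '#600000'])

# '#[0]*' matches the '#' and the run of zeros directly after it,
# so zeros inside or at the end of the number are kept.
cleaned = codes.str.replace('#[0]*', '', regex=True).astype('int64')
print(cleaned.tolist())  # [7, 403, 600000]
```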
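def income_entropy(x):
    # x holds one company-year's income amounts, one entry per income type
    income_sum = np.abs(x.sum())
    ratio = np.abs(x)/income_sum  # share of each income type
    # np.sum on a Series skips NaN, so missing or zero-income types drop out
    entropy = -1*(np.sum(ratio*np.log2(ratio)))
    return entropy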
df2_w = df2.pivot(index=['证券代码','日期'], columns='收入类型', values='收入额')
df2_w1 = df2_w.apply(income_entropy,axis=1).to_frame()
df2_w1.reset_index(inplace = True)
df2_w1.head()
df2_w1['日期'] = df2_w1['日期'].str.replace('/12/31','',regex=True).astype('int64')
df2_w1['证券代码'] = df2_w1['证券代码'].astype('int64')
df2_w1.columns = ['code','日期','收入熵']
df2_w1.head()
df3 = df1.merge(df2_w1, on=['code','日期'], how='left')
df3.shape #(1048, 4)
df3.head()
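A left merge silently leaves NaN where a key has no match, so it can be worth checking how many rows were matched. The sketch below uses invented frames with the same key columns; `indicator` and `validate` are standard `DataFrame.merge` parameters, while the data and variable names are assumptions for illustration.

```python
import pandas as pd

# Invented stand-ins for the company list and the computed entropy table.
left = pd.DataFrame({'code': [7, 403, 600000], '日期': [2014, 2015, 2016]})
right = pd.DataFrame({'code': [7, 403], '日期': [2014, 2015], '收入熵': [1.5, 0.8]})

merged = left.merge(
    right,
    on=['code', '日期'],
    how='left',
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
    validate='one_to_one',  # raises if a key is duplicated on either side
)
n_unmatched = int((merged['_merge'] == 'left_only').sum())
print(n_unmatched)
```

If `n_unmatched` is larger than expected, the key columns probably differ in type or format between the two tables.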
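[Problem description] Reshape the team information table for the group-study program into the following form, where the column "是否队长" (is team leader) is 1 for a team leader and 0 otherwise.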
file2 = '../data/第一次综合练习-数据集/任务二/组队信息汇总表(Pandas).xlsx'
df = pd.read_excel(file2)
df.head()
from openpyxl import load_workbook
name_dict = {}    # {nickname: team name}
id_dict = {}      # {nickname: member ID}
leader_dict = {}  # {nickname: 1 if team leader, else 0}
workbook = load_workbook(file2,data_only=True)
booksheet = workbook.active # the workbook has only one worksheet
# Skip the header row; each remaining row describes one team.
for row in booksheet.iter_rows(min_row=2):
    line_list = [col.value for col in row][1:]
    team_name = line_list[0]
    # After the team name, cells alternate between ID and nickname; drop empty cells.
    name_list = [i for i in line_list[2::2] if i is not None]
    id_list = [i for i in line_list[1::2] if i is not None]
    for index, name in enumerate(name_list):
        name_dict[name] = team_name
        id_dict[name] = id_list[index]
        # The first member listed in each row is the team leader.
        leader_dict[name] = 1 if index == 0 else 0
df = pd.DataFrame([leader_dict,name_dict,id_dict]).T.reset_index()
df.columns = ['昵称',"是否队长","队伍名称","编号"]
df
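The same wide-to-long reshape can also be sketched in pandas alone with `pd.wide_to_long`. The column names (`team`, `id_0`, `name_0`, ...) and the data below are hypothetical; the real sheet would first need its columns renamed to such a stub-plus-suffix pattern.

```python
import pandas as pd

# Hypothetical wide table: one row per team, slot 0 is the leader.
wide = pd.DataFrame({
    'team': ['Alpha', 'Beta'],
    'id_0': [1, 3], 'name_0': ['Ann', 'Cid'],
    'id_1': [2, None], 'name_1': ['Bob', None],
})

# Stack the repeated (id, name) column pairs into one row per member slot.
long = pd.wide_to_long(wide, stubnames=['id', 'name'],
                       i='team', j='slot', sep='_').reset_index()
long = long.dropna(subset=['name'])                  # drop empty member slots
long['is_leader'] = (long['slot'] == 0).astype(int)  # slot 0 holds the leader
long = (long.sort_values(['team', 'slot'])
            [['is_leader', 'team', 'name', 'id']]
            .reset_index(drop=True))
print(long)
```

Unlike the dictionary-based loop, this keeps two members who happen to share a nickname as separate rows.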
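[Problem description] Two data tables give the population of each US county and the county-level voting results of the presidential election. Answer the following questions:
1. How many counties have a total vote count greater than half of the county's population?
2. Build a table with state as the row index and the candidates as column names, with the columns ordered by each candidate's nationwide vote total from highest to lowest; each cell holds the total number of votes that candidate received in that state.
3. Each state contains several counties. Define a county's BT index as Biden's vote share in the county minus Trump's vote share in the county. If the median BT index over all counties of a state is greater than 0, the state is called a Biden State. Find all Biden States.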
file_pop = "../data/第一次综合练习-数据集/任务三/county_population.csv"
file_vote = "../data/第一次综合练习-数据集/任务三/president_county_candidate.csv"
df_pop = pd.read_csv(file_pop)
df_pop.head()
df_vote = pd.read_csv(file_vote)
df_vote.head()
# 1. How many counties have a total vote count greater than half of the county's population?
df_pop = pd.read_csv(file_pop)
print(df_pop.shape)
df_pop.head()
df_pop['US County'] = df_pop['US County'].str.lstrip('.') # remove the '.' in front of each name
df_pop.head()
df_vote = pd.read_csv(file_vote)
df_vote.head()
df_pop_vote = df_vote.groupby(['state','county'])['total_votes'].sum().to_frame().reset_index()
df_pop_vote['US County'] = df_pop_vote['county']+', '+df_pop_vote['state']
print(df_pop_vote.shape)
df_pop_vote.head()
df_pop_vote = df_pop_vote.merge(df_pop,on = 'US County',how = 'inner')
df_pop_vote['vote_rate'] = df_pop_vote['total_votes']/df_pop_vote['Population']
df_pop_vote.head()
df_pop_vote.loc[df_pop_vote['vote_rate']> 0.5,'US County'] # 1419 counties in total have a vote rate above 0.5
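The count itself can be obtained by summing the boolean mask instead of reading the length of the displayed result. The following sketch repeats the steps on invented data; the state, county, and candidate names are placeholders, and the leading '.' in the population table mimics the format described above.

```python
import pandas as pd

# Invented data: two counties in a placeholder state "A".
votes = pd.DataFrame({
    'state': ['A', 'A', 'A', 'A'],
    'county': ['X County', 'X County', 'Y County', 'Y County'],
    'candidate': ['P', 'Q', 'P', 'Q'],
    'total_votes': [30, 30, 10, 5],
})
population = pd.DataFrame({
    'US County': ['.X County, A', '.Y County, A'],
    'Population': [100, 100],
})

population['US County'] = population['US County'].str.lstrip('.')
county_votes = votes.groupby(['state', 'county'], as_index=False)['total_votes'].sum()
county_votes['US County'] = county_votes['county'] + ', ' + county_votes['state']
merged = county_votes.merge(population, on='US County', how='inner')

# True counts as 1, so summing the mask gives the number of counties.
n_high_turnout = int((merged['total_votes'] / merged['Population'] > 0.5).sum())
print(n_high_turnout)  # 1
```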
# 2. State as the row index, candidates as columns ordered by nationwide vote total (highest first); each cell is the candidate's total votes in that state
df_candidate = df_vote.groupby(['candidate','state'])['total_votes'].sum().to_frame().reset_index()
df_candidate_towide = df_candidate.pivot(index = 'state', columns='candidate', values='total_votes')
df_candidate_towide.head()
# Append a row holding each candidate's nationwide total (DataFrame.append was removed in pandas 2.0, so use pd.concat).
candidate_vote_count = df_candidate_towide.sum(axis=0).rename('state vote count')
df_candidate_vote = pd.concat([df_candidate_towide, candidate_vote_count.to_frame().T])
df_candidate_vote.tail()
df_candidate_vote.sort_values(by='state vote count',axis=1,ascending = False) # sort the columns by the values in the given row
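The column ordering can also be done without adding a totals row to the table, which keeps the result limited to state rows. This is a sketch on invented data with placeholder state and candidate names, using `pivot_table` with `aggfunc='sum'` in place of the groupby-then-pivot steps.

```python
import pandas as pd

# Invented data with placeholder state and candidate names.
votes = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B', 'B'],
    'candidate': ['P', 'Q', 'P', 'Q', 'R'],
    'total_votes': [10, 40, 20, 50, 5],
})

# One row per state, one column per candidate; missing pairs become 0.
wide = votes.pivot_table(index='state', columns='candidate',
                         values='total_votes', aggfunc='sum', fill_value=0)

# Order the columns by nationwide total, highest first.
order = wide.sum(axis=0).sort_values(ascending=False).index
wide = wide[order]
print(wide)
```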
# 3. A county's BT index is Biden's vote share minus Trump's vote share; a state whose median county BT index is greater than 0 is a Biden State. Find all Biden States.
df_state_vote = df_vote.groupby(['state','county'])['total_votes'].sum().to_frame().reset_index()
df_BT = df_vote.groupby(['state','county','candidate'])['total_votes'].sum().to_frame().reset_index()
df_BT = df_BT.query('candidate == ["Donald Trump","Joe Biden"]').rename(columns={'total_votes':'votes'})
print(df_BT.shape)
df_BT.head()
df_BT = df_BT.merge(df_state_vote,on=['state','county'], how='left')
df_BT['vote ratio'] = df_BT['votes']/df_BT['total_votes']
df_BT.head()
df_BT = df_BT.pivot(index = ['state','county'],columns='candidate', values='vote ratio').reset_index()
df_BT['BT_index'] = df_BT['Joe Biden'] - df_BT['Donald Trump']
df_BT.head()
df_BT_state = df_BT.groupby('state')[['BT_index']].median() # median BT index over each state's counties
df_BT_state.columns.name = None # drop the leftover 'candidate' label on the column axis
df_BT_state.head()
df1 = df_BT_state.query('BT_index > 0')
df1.reset_index()['state'] # 9 Biden States in total
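The BT calculation can be checked by hand on a tiny example. The sketch below uses invented vote counts and placeholder state and county names; as in the steps above, the vote shares are taken relative to all votes cast in the county, including those for other candidates.

```python
import pandas as pd

# Invented vote counts: (state, county, candidate, votes).
rows = [
    ('S1', 'a', 'Joe Biden', 60), ('S1', 'a', 'Donald Trump', 30), ('S1', 'a', 'Other', 10),
    ('S1', 'b', 'Joe Biden', 30), ('S1', 'b', 'Donald Trump', 70),
    ('S1', 'c', 'Joe Biden', 55), ('S1', 'c', 'Donald Trump', 45),
    ('S2', 'd', 'Joe Biden', 40), ('S2', 'd', 'Donald Trump', 60),
]
votes = pd.DataFrame(rows, columns=['state', 'county', 'candidate', 'total_votes'])

# One row per county, one column per candidate.
county_wide = votes.pivot_table(index=['state', 'county'], columns='candidate',
                                values='total_votes', aggfunc='sum', fill_value=0)
county_total = county_wide.sum(axis=1)  # all votes cast in the county

# BT index: Biden's share minus Trump's share.
bt = (county_wide['Joe Biden'] - county_wide['Donald Trump']) / county_total

# A state is a Biden State when its median county BT index is positive.
state_median = bt.groupby(level='state').median()
biden_states = state_median[state_median > 0].index.tolist()
print(biden_states)  # ['S1']
```

Here S1 has county BT values of 0.3, -0.4, and 0.1, so its median is 0.1, while S2's single county gives -0.2.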