Project background: using the available data, analyze the given car models and work out each model's real monthly search index in every city over the past six months.
Data source: the data come from Baidu Index. By scraping the keywords for the given car models, we obtain each model's national search index as well as its province-level and city-level search interest.
Data: link: https://pan.baidu.com/s/1ZDd8kaKlKPMItMNEzt0gpQ
Extraction code: lnal
Opening the Baidu Index trend table, we find the following issues to handle: missing values, a date field that needs reformatting, and keywords with inconsistent whitespace and casing:
In [1] :import numpy as np
import pandas as pd
# Read the Baidu Index table and the province/city ID lookup tables
In [2] :index = pd.read_excel('baidu_index_0625.xlsx')
prov_id = pd.read_excel('prov_id.xlsx')    # province ID lookup table
city_id = pd.read_excel('city_id.xlsx')    # city ID lookup table (filename assumed)
# Handle missing values (fill with 0)
In [3] :index = index.fillna(0)
# Format the date field and convert it to a month name
In [4] :index['date'] = pd.to_datetime(index['date'])
index['date'] = index['date'].dt.strftime('%B')
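Note that strftime('%B') keeps only the full English month name (with Python's default locale) and drops the year, so months from different years would fall into the same bucket. A minimal standalone sketch, separate from the session above, just to see what it returns:
import pandas as pd

sample = pd.to_datetime(pd.Series(['2018-12-01', '2019-06-01']))
print(sample.dt.strftime('%B').tolist())  # ['December', 'June'] -- the year is gone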
# Normalize the keyword field: strip whitespace and uppercase
In [5] :index['keyword'] = index['keyword'].apply(lambda x: x.strip(' \r\n\t').upper())
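Why normalize the keywords: without stripping and uppercasing, variants such as 'ix25 ' and 'IX25' would end up in separate groups in the next step. A tiny illustration with made-up strings:
raw = ['ix25 ', ' IX25\n', 'T-Cross']
print([k.strip(' \r\n\t').upper() for k in raw])  # ['IX25', 'IX25', 'T-CROSS']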
# Group by keyword and date and sum the search index
In [6] :new_index_mean = index.groupby(['keyword','date'])['_index'].sum()
# Show the result
In [7] :new_index_mean
Out[7] :
keyword date
IX25 April 29144.0
December 32422.0
February 28511.0
January 32204.0
June 882.0
March 30081.0
May 27164.0
November 30810.0
October 31611.0
T-CROSS April 150414.0
December 0.0
February 0.0
January 0.0
June 29702.0
March 77619.0
May 120753.0
November 0.0
October 0.0
XR-V April 6656.0
December 7034.0
February 6835.0
January 6864.0
June 207.0
March 7227.0
May 7046.0
November 6745.0
October 7421.0
三菱ASX April 7067.0
December 6015.0
February 7012.0
...
长安CS15 May 28927.0
November 26988.0
October 28556.0
长安CS35 April 74132.0
December 469865.0
February 118634.0
January 200485.0
June 2167.0
March 103712.0
May 78706.0
November 347321.0
October 92131.0
长安CS35PLUS April 49649.0
December 413372.0
February 50194.0
January 85418.0
June 1444.0
March 57344.0
May 54125.0
November 281566.0
October 42239.0
雪铁龙C3-XR April 7291.0
December 6717.0
February 7118.0
January 6622.0
June 184.0
March 9967.0
May 6419.0
November 6346.0
October 7757.0
Name: _index, Length: 234, dtype: float64
Looking at the province search index table, we notice that the maximum value of the index is 1000, which shows it is not a real value but a relative value given as a proportion. Above we have already obtained each model's real national search index for every month of the past six months. So, for a given model, multiplying a province's share of the national search total by the real national search index gives that model's real search index in that province.
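The calculation can be sanity-checked with a small made-up example (the numbers here are illustrative only): if a model's real national index for December is 80,000, a province's relative index is 1,000, and the national relative total is 4,000, then the province's share is 0.25 and its real index is 0.25 * 80,000 = 20,000. As a toy snippet:
national_real = 80000          # real national index for one keyword/month (made-up)
prov_relative = 1000           # this province's relative index (made-up)
national_relative_sum = 4000   # sum of all provinces' relative indices (made-up)
share = prov_relative / national_relative_sum
print(share * national_real)   # 20000.0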
# Read the province search index data
In [8] :prov_index = pd.read_excel('province_index_0625.xlsx')
# Likewise, format the date field
In [9] :prov_index['date'] = prov_index['date'].apply(lambda x: x.split("|")[0])  # keep the start date of the range
prov_index['date'] = pd.to_datetime(prov_index['date'])  # parse the date
prov_index['date'] = prov_index['date'].dt.strftime('%B')  # convert the date to a month name
# Uppercase the keyword field and strip whitespace
In [10] :prov_index['keyword'] = prov_index['keyword'].apply(lambda x: x.strip(' \r\n\t').upper())
In [11] :prov_index.head()
Out[11] :
id keyword prov prov_index date
0 1 缤智 913 1000 December
1 2 缤智 901 272 December
2 3 缤智 917 246 December
3 4 缤智 916 234 December
4 5 缤智 920 203 December
# Group by keyword and date and sum the provincial indices, giving the (relative, not real) national search index
In [12] :prov_index_sum = prov_index.groupby(['keyword','date'])['prov_index'].sum()
# The groupby result is a Series; since it needs to be merged with a DataFrame, call reset_index()
In [13] :prov_index_sum = prov_index_sum.reset_index()
prov_index_sum.head()
Out[13] :
index keyword date prov_index
0 0 IX25 April 7987
1 1 IX25 December 8257
2 2 IX25 February 8778
3 3 IX25 January 9291
4 4 IX25 March 8352
# Merge the summed column back into the province DataFrame
In [14] :prov_index2 = pd.merge(prov_index,prov_index_sum,on=("keyword","date"))
# Provincial monthly index sum / national monthly index sum = the province's share of searches
In [15] :prov_index2['pct'] = prov_index2['prov_index_x']/prov_index2['prov_index_y']
In [16] :prov_index2.head()
Out[16] :
id keyword prov prov_index_x date prov_index_y pct
0 1 缤智 913 1000 December 4116 0.242954
1 2 缤智 901 272 December 4116 0.066084
2 3 缤智 917 246 December 4116 0.059767
3 4 缤智 916 234 December 4116 0.056851
4 5 缤智 920 203 December 4116 0.049320
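As an optional sanity check (not a step in the original workflow), the shares within each keyword/month group should add up to 1:
# Each keyword/month group's shares should sum to 1
print(prov_index2.groupby(['keyword', 'date'])['pct'].sum().head())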
In [17] :new_index_mean = new_index_mean.reset_index()
In [18] :new_index_mean
# Merge the real national search index into the province table
In [19] :prov_index_final = pd.merge(prov_index2,new_index_mean,on=('keyword','date'))
# Province share * real national index = real provincial search index
In [20] :prov_index_final['real_prov_index'] = prov_index_final['pct'] *prov_index_final['_index']
prov_index_final.head()
Out[20] :
id keyword prov prov_index_x date prov_index_y pct index _index real_prov_index
0 1 缤智 913 1000 December 4116 0.242954 163 82882.0 20136.540330
1 2 缤智 901 272 December 4116 0.066084 163 82882.0 5477.138970
2 3 缤智 917 246 December 4116 0.059767 163 82882.0 4953.588921
3 4 缤智 916 234 December 4116 0.056851 163 82882.0 4711.950437
4 5 缤智 920 203 December 4116 0.049320 163 82882.0 4087.717687
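Similarly, an optional check (not part of the original notebook): within each keyword/month group the real provincial indices should add back up to the real national index _index:
# Provincial real indices of one keyword/month should sum to the national real index
print(prov_index_final.groupby(['keyword', 'date'])['real_prov_index'].sum().head())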
# Provinces appear here as IDs, so merge with the province ID lookup table
In [21] :prov_index_final = pd.merge(prov_index_final,prov_id,left_on="prov",right_on="id")
# Export to Excel
In [22] :prov_index_final.to_excel('prov_index_final.xlsx')
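By default to_excel also writes the DataFrame's row index as an extra column; if that column is not wanted, index=False can be passed (a suggestion only, the code above keeps the default):
prov_index_final.to_excel('prov_index_final.xlsx', index=False)  # optional: drop the row index column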
In [23] :city_index = pd.read_excel('city_index_0625.xlsx')
In [24] :city_index.head()
Out[24] :
id keyword city city_index prov date
0 1 缤智 77 1000 901 2018-12-01|2018-12-31
1 2 缤智 1 934 901 2018-12-01|2018-12-31
2 3 缤智 79 699 901 2018-12-01|2018-12-31
3 4 缤智 80 414 901 2018-12-01|2018-12-31
4 5 缤智 78 165 901 2018-12-01|2018-12-31
# Format the date field: keep the start date and convert it to a month name
city_index['date'] = city_index['date'].apply(lambda x: x.split("|")[0])
city_index['date'] = pd.to_datetime(city_index['date'])
city_index['date'] = city_index['date'].dt.strftime('%B')
# Group by keyword, date, and prov and sum the city search index
In [25] :city_index_sum = city_index.groupby(['keyword','date','prov'])['city_index'].sum()
In [26] :city_index_sum = city_index_sum.reset_index()
# Merge the summed DataFrame back into the original
In [27] :city_index2 = pd.merge(city_index,city_index_sum,on=("keyword","date","prov"))
# Compute each city's share within its province
In [28] :city_index2['pct'] = city_index2['city_index_x']/city_index2['city_index_y']
In [29] :city_index2.head()
Out[29] :
id keyword city city_index_x prov date city_index_y pct
0 1 缤智 77 1000 901 December 4195 0.238379
1 2 缤智 1 934 901 December 4195 0.222646
2 3 缤智 79 699 901 December 4195 0.166627
3 4 缤智 80 414 901 December 4195 0.098689
4 5 缤智 78 165 901 December 4195 0.039333
# Merge the city search shares with the real provincial search index
In [30] :city_index_final = pd.merge(city_index2,prov_index_final,on=('keyword','prov','date'))
# City share * real provincial search index = real city search index
In [31] :city_index_final["real_city_index"] = city_index_final['pct_x']*city_index_final['real_prov_index']
# Merge with the city ID lookup table and export
In [32] :city_index_final = pd.merge(city_index_final,city_id,left_on='city',right_on="id")
city_index_final.to_excel('city_index_final_0625.xlsx')
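For reporting, the final table could also be reshaped so that each row is a keyword/city pair and each column a month. A possible follow-up sketch (column names follow the merges above; this step is not in the original workflow):
# One row per keyword/city, one column per month of the real city index
report = city_index_final.pivot_table(index=['keyword', 'city'],
                                      columns='date',
                                      values='real_city_index',
                                      aggfunc='sum')
print(report.head())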
This whole piece of code is just a simple Pandas data-processing exercise, mainly covering field cleaning, DataFrame merging, and a few calculations; overall it is not very difficult.
If I find time, I will look into data visualization. QAQ! Goodbye!!
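Since the closing line mentions visualization, here is a minimal matplotlib sketch of what a first chart could look like, plotting one model's real national index by month from new_index_mean (it assumes matplotlib is installed; the keyword 'IX25' and the unordered month axis are purely illustrative):
import matplotlib.pyplot as plt

one_model = new_index_mean[new_index_mean['keyword'] == 'IX25']  # pick one model for illustration
plt.bar(one_model['date'], one_model['_index'])
plt.title('IX25 real national search index by month')
plt.xlabel('month')
plt.ylabel('real search index')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()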