A Hands-On Pandas Data Analysis Project (Beginner)

I. Project Overview

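Background: using existing data, analyze a given set of car models to obtain each model's real monthly search index in every city over roughly the past half year.
Data source: Baidu Index. Scraping the keyword for each car model yields the model's national search index, plus its search popularity by province and by city.
Data: link: https://pan.baidu.com/s/1ZDd8kaKlKPMItMNEzt0gpQ
Extraction code: lnal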

II. Processing

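1. Processing the national index trend table

Opening the Baidu Index trend table reveals the following issues:

  1. Some car models only have recent data and nothing earlier, so missing values need to be handled;
  2. The result must be monthly, but the raw data is daily, so the dates need to be converted;
  3. The keyword field in the raw data should be normalized so that differences in letter case do not cause mismatches when merging.
Implementation: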
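In [1] :import numpy as np
		import pandas as pd
		
# Read the Baidu Index table and the ID lookup tables
In [2] :index = pd.read_excel('baidu_index_0625.xlsx')
		prov_id = pd.read_excel('prov_id .xlsx')
		# As published, city_id is loaded from the same file as prov_id;
		# this should presumably point to the city ID lookup file instead
		city_id = pd.read_excel('prov_id .xlsx')

# Fill missing values with 0
In [3] :index = index.fillna(0)

# Parse the date field and convert it to a month name
In [4] :index['date'] = pd.to_datetime(index['date'])
		index['date'] = index['date'].dt.strftime('%B')
		
# Normalize the keyword field: strip surrounding whitespace and upper-case
In [5] :index['keyword'] = index['keyword'].apply(lambda x: x.strip(' \r\n\t').upper())
		
# Group by keyword and date and sum the search index
# (the variable is called new_index_mean, but it holds monthly sums)
In [6] :new_index_mean = index.groupby(['keyword','date'])['_index'].sum()

# Show the result
In [7] :new_index_mean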
Out[7] :
			keyword     date    
IX25        April        29144.0
            December     32422.0
            February     28511.0
            January      32204.0
            June           882.0
            March        30081.0
            May          27164.0
            November     30810.0
            October      31611.0
T-CROSS     April       150414.0
            December         0.0
            February         0.0
            January          0.0
            June         29702.0
            March        77619.0
            May         120753.0
            November         0.0
            October          0.0
XR-V        April         6656.0
            December      7034.0
            February      6835.0
            January       6864.0
            June           207.0
            March         7227.0
            May           7046.0
            November      6745.0
            October       7421.0
三菱ASX       April         7067.0
            December      6015.0
            February      7012.0
                          ...   
长安CS15      May          28927.0
            November     26988.0
            October      28556.0
长安CS35      April        74132.0
            December    469865.0
            February    118634.0
            January     200485.0
            June          2167.0
            March       103712.0
            May          78706.0
            November    347321.0
            October      92131.0
长安CS35PLUS  April        49649.0
            December    413372.0
            February     50194.0
            January      85418.0
            June          1444.0
            March        57344.0
            May          54125.0
            November    281566.0
            October      42239.0
雪铁龙C3-XR    April         7291.0
            December      6717.0
            February      7118.0
            January       6622.0
            June           184.0
            March         9967.0
            May           6419.0
            November      6346.0
            October       7757.0
Name: _index, Length: 234, dtype: float64
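
Because `'%B'` keeps only the month name, the result above is sorted alphabetically (April, December, February, ...), and same-named months from different years would be summed together if the data ever covered more than twelve months. The following is a minimal sketch, on made-up toy data rather than the real Baidu Index file, of grouping by a year-month period instead. The values and the `month` column name are assumptions for illustration only.

```python
import pandas as pd

# Toy daily data (hypothetical values) using the same column names as the index table.
daily = pd.DataFrame({
    "keyword": ["IX25", "IX25", "IX25", "ix25 "],
    "date": ["2018-12-30", "2018-12-31", "2019-01-01", "2019-01-02"],
    "_index": [100.0, 200.0, 300.0, None],
})

# Same cleaning steps as above: fill gaps, normalize keywords, parse dates.
daily["_index"] = daily["_index"].fillna(0)
daily["keyword"] = daily["keyword"].str.strip().str.upper()

# A monthly Period keeps the year and sorts in calendar order.
daily["month"] = pd.to_datetime(daily["date"]).dt.to_period("M")

monthly = daily.groupby(["keyword", "month"])["_index"].sum().reset_index()
print(monthly)
```

The rest of this post keeps the month-name approach, which is sufficient as long as the data spans less than one year.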

2. Reading the province search index data

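   Looking at the province search index table, the maximum value of the index is 1000, which shows that it is not a real value but a relative one scaled proportionally. Above we already obtained the real monthly search index for each car model. So, for a given model, multiplying a province's share of the national searches by the real national search index gives that model's real search index in the province. This assumes the relative values are proportional to actual search volume.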
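
As a quick sanity check of that rule, here is the arithmetic for a single row, using figures that appear in this post's own outputs (relative index 1000, relative total 4116, real national index 82882.0). The variable names are only illustrative.

```python
# Worked example of the scaling rule for one province, one model, one month.
prov_relative = 1000      # relative index of the province (the maximum, 1000)
relative_total = 4116     # sum of relative indices over all provinces
national_real = 82882.0   # real national index for the same model and month

share = prov_relative / relative_total   # the province's share of national searches
real_prov_index = share * national_real  # scale the share by the real national figure

print(round(share, 6))            # 0.242954
print(round(real_prov_index, 2))  # 20136.54
```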

Implementation:
# Read the province search index data
In [8] :prov_index = pd.read_excel('province_index_0625.xlsx')

# Format the dates the same way, keeping only the part before "|"
In [9] :prov_index['date'] = prov_index['date'].apply(lambda x: x.split("|")[0])
		prov_index['date'] = pd.to_datetime(prov_index['date']) # parse the date
		prov_index['date'] = prov_index['date'].dt.strftime('%B') # convert the date to a month name
		
# Upper-case the keyword field and strip surrounding whitespace
In [10] :prov_index['keyword'] = prov_index['keyword'].apply(lambda x: x.strip(' \r\n\t').upper())
In [11] :prov_index.head()
Out[11] :
	id	keyword	prov	prov_index	date
0	1	缤智	913	1000	December
1	2	缤智	901	272	December
2	3	缤智	917	246	December
3	4	缤智	916	234	December
4	5	缤智	920	203	December

# Sum the relative index by keyword and date to get a relative (not real) national total
In [12] :prov_index_sum = prov_index.groupby(['keyword','date'])['prov_index'].sum()

# The grouped result is a Series; since we need to merge DataFrames, call reset_index()
In [13] :prov_index_sum = prov_index_sum.reset_index()
		 prov_index_sum.head()
Out[13] :
	index	keyword	date	prov_index
0	0	IX25	April	7987
1	1	IX25	December	8257
2	2	IX25	February	8778
3	3	IX25	January	9291
4	4	IX25	March	8352

# Merge the totals back onto the province DataFrame
In [14] :prov_index2 = pd.merge(prov_index,prov_index_sum,on=["keyword","date"])

# Province relative index / sum of relative indices over all provinces = the province's share of searches
In [15] :prov_index2['pct'] = prov_index2['prov_index_x']/prov_index2['prov_index_y']
In [16] :prov_index2.head()
Out[16] :
	id	keyword	prov	prov_index_x	date	prov_index_y	pct
0	1	缤智	913	1000	December	4116	0.242954
1	2	缤智	901	272	December	4116	0.066084
2	3	缤智	917	246	December	4116	0.059767
3	4	缤智	916	234	December	4116	0.056851
4	5	缤智	920	203	December	4116	0.049320
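
The share calculation above goes through an intermediate totals table and a merge. As an alternative sketch (not the approach used in this post), `groupby(...).transform('sum')` returns the group total aligned to every original row, so the share can be computed in place. The data below is made up for illustration.

```python
import pandas as pd

# Toy province table (hypothetical values) with the same column names as prov_index.
prov = pd.DataFrame({
    "keyword": ["A", "A", "A", "B", "B"],
    "date": ["December"] * 5,
    "prov": [913, 901, 917, 913, 901],
    "prov_index": [1000, 600, 400, 1000, 250],
})

# transform('sum') broadcasts each (keyword, date) total back to its rows,
# so no separate totals DataFrame or merge is needed.
group_total = prov.groupby(["keyword", "date"])["prov_index"].transform("sum")
prov["pct"] = prov["prov_index"] / group_total

print(prov)
```

This also avoids the `_x`/`_y` suffixes that the merge introduces, since the original `prov_index` column keeps its name.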
In [17] :new_index_mean = new_index_mean.reset_index()
In [18] :new_index_mean

# Merge the real national search index onto the province table
In [19] :prov_index_final = pd.merge(prov_index2,new_index_mean,on=['keyword','date'])

# Province share * real national index = real province index
In [20] :prov_index_final['real_prov_index'] = prov_index_final['pct'] * prov_index_final['_index']
		 prov_index_final.head()
Out[20] :
	id	keyword	prov	prov_index_x	date	prov_index_y	pct	index	_index	real_prov_index
0	1	缤智	913	1000	December	4116	0.242954	163	82882.0	20136.540330
1	2	缤智	901	272	December	4116	0.066084	163	82882.0	5477.138970
2	3	缤智	917	246	December	4116	0.059767	163	82882.0	4953.588921
3	4	缤智	916	234	December	4116	0.056851	163	82882.0	4711.950437
4	5	缤智	920	203	December	4116	0.049320	163	82882.0	4087.717687

# Provinces appear here as IDs, so merge with the province ID lookup table
In [21] :prov_index_final = pd.merge(prov_index_final,prov_id,left_on="prov",right_on="id")

# Export to Excel
In [22] :prov_index_final.to_excel('prov_index_final.xlsx')
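
`pd.merge` performs an inner join by default, so any row whose keyword or month has no counterpart in the other table is dropped without warning. That is exactly the failure the keyword normalization is meant to prevent. A minimal sketch of how one might check for such rows, using `how='left'`, `indicator=True` and `validate`: the `pct` values are invented, and the T-CROSS row is left out of the second table on purpose to show a mismatch.

```python
import pandas as pd

# Hypothetical share rows for three models.
shares = pd.DataFrame({
    "keyword": ["IX25", "XR-V", "T-CROSS"],
    "date": ["May", "May", "May"],
    "pct": [0.10, 0.20, 0.30],
})
# National table deliberately missing T-CROSS to illustrate an unmatched key.
national = pd.DataFrame({
    "keyword": ["IX25", "XR-V"],
    "date": ["May", "May"],
    "_index": [27164.0, 7046.0],
})

# how='left' keeps every share row; indicator=True adds a '_merge' column;
# validate='many_to_one' raises if the right-hand keys are not unique.
checked = pd.merge(shares, national, on=["keyword", "date"],
                   how="left", indicator=True, validate="many_to_one")

unmatched = checked.loc[checked["_merge"] == "left_only", "keyword"].tolist()
print(unmatched)  # ['T-CROSS']
```

An empty `unmatched` list would suggest that the default inner join is not discarding anything.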

3. Deriving city search indices from the province figures

In [23] :city_index = pd.read_excel('city_index_0625.xlsx')
		 city_index.head()
Out[23] :
	id	keyword	city	city_index	prov	date
0	1	缤智	77	1000	901	2018-12-01|2018-12-31
1	2	缤智	1	934	901	2018-12-01|2018-12-31
2	3	缤智	79	699	901	2018-12-01|2018-12-31
3	4	缤智	80	414	901	2018-12-01|2018-12-31
4	5	缤智	78	165	901	2018-12-01|2018-12-31

# Process date: keep the start of the range, parse it, and convert it to a month name
In [24] :city_index['date'] = city_index['date'].apply(lambda x: x.split("|")[0])
		 city_index['date'] = pd.to_datetime(city_index['date'])
		 city_index['date'] = city_index['date'].dt.strftime('%B')
		 
# Sum the city search index by keyword, date and prov
In [25] :city_index_sum = city_index.groupby(['keyword','date','prov'])['city_index'].sum()
In [26] :city_index_sum = city_index_sum.reset_index()

# Merge the totals back onto the original DataFrame
In [27] :city_index2 = pd.merge(city_index,city_index_sum,on=["keyword","date","prov"])

# Compute each city's share within its own province
In [28] :city_index2['pct'] = city_index2['city_index_x']/city_index2['city_index_y']
In [29] :city_index2.head()
Out[29] :
	id	keyword	city	city_index_x	prov	date	city_index_y	pct
0	1	缤智	77	1000	901	December	4195	0.238379
1	2	缤智	1	934	901	December	4195	0.222646
2	3	缤智	79	699	901	December	4195	0.166627
3	4	缤智	80	414	901	December	4195	0.098689
4	5	缤智	78	165	901	December	4195	0.039333

# Merge the city shares with the real province search index
In [30] :city_index_final = pd.merge(city_index2,prov_index_final,on=['keyword','prov','date'])

# City share * real province index = real city index
In [31] :city_index_final["real_city_index"] = city_index_final['pct_x']*city_index_final['real_prov_index']

# Merge with the city ID lookup table and export
In [32] :city_index_final = pd.merge(city_index_final,city_id,left_on='city',right_on="id")
		 city_index_final.to_excel('city_index_final_0625.xlsx')
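
One way to sanity-check the city step is to confirm that the real city indices within a province add back up to that province's real index, since the city shares sum to 1 per province. Below is a minimal sketch on made-up numbers. It reuses the column names from this post, but the data and the `keys` helper are assumptions for illustration.

```python
import pandas as pd

# Toy city table (hypothetical values): three cities in province 901, one in 913.
city = pd.DataFrame({
    "keyword": ["A"] * 4,
    "date": ["December"] * 4,
    "prov": [901, 901, 901, 913],
    "city": [77, 1, 79, 5],
    "city_index": [1000, 600, 400, 1000],
})
# Toy real province indices (hypothetical values).
prov_real = pd.DataFrame({
    "keyword": ["A", "A"],
    "date": ["December", "December"],
    "prov": [901, 913],
    "real_prov_index": [5000.0, 20000.0],
})

keys = ["keyword", "date", "prov"]

# City share within its province, then scale by the real province index.
city["pct"] = city["city_index"] / city.groupby(keys)["city_index"].transform("sum")
city = city.merge(prov_real, on=keys, how="left")
city["real_city_index"] = city["pct"] * city["real_prov_index"]

# Summing the city values per province should reproduce real_prov_index.
city_totals = city.groupby(keys)["real_city_index"].sum().reset_index()
print(city_totals)
```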

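III. Summary

  The whole post is just a simple Pandas data-processing exercise, mainly covering field cleaning, merging, and calculation. Overall it is not very hard.
  When I have time I'll look into visualizing the data. QAQ! Goodbye!!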
