Adventure Project 1 Study Notes

I. Importing modules

Import the MySQL module:

import pymysql
# For compatibility with MySQLdb, just add this line
pymysql.install_as_MySQLdb()
# Import create_engine from sqlalchemy
from sqlalchemy import create_engine

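PS: When I installed the ipython package earlier, pip install could not download third-party packages, apparently because my Windows version was too old or the network connection was poor. This time I installed pymysql directly through Anaconda instead, and the installation succeeded.
Installation steps: Anaconda Navigator > Environments > apply the pymysql package
The installation process is shown in the screenshots below: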

1.1.png

1.2.png

1.3.png

1.4.png

1.5.png

1.6.png

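II. Reading data

Open the database connection: engine = create_engine('dialect+driver://username:password@host:port/database')

Where:

dialect -- database type
driver -- database driver
username -- database username
password -- user password
host -- server address
port -- port
database -- database name
charset=gbk -- character encoding, appended as a query parameter after '?'
Example: engine = create_engine('mysql+pymysql://project1:mima123@localhost/foo?charset=gbk')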
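
The URL format above breaks if the password contains characters such as '@' or '/'. Below is a minimal sketch of assembling the string with the standard library; build_mysql_url is a hypothetical helper, and the password and default port 3306 are made-up values for illustration.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database, port=3306, charset="gbk"):
    """Assemble a 'mysql+pymysql://...' URL string (hypothetical helper)."""
    # quote_plus escapes characters such as '@' or '/' that would otherwise
    # be misread as URL delimiters.
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}?charset={charset}"
    )


# Made-up credentials, for illustration only
url = build_mysql_url("project1", "p@ss/word", "localhost", "foo")
print(url)
```

The resulting string can then be passed to create_engine(url).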

Error 1045:

2.1.png

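Cause: the database username was entered incorrectly (MySQL error 1045 means access was denied for the given user). Resolved after correcting it.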

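III. Inspecting the data

Data dictionary:

3.1.png

Initial analysis approach for the data source:

3.2.png

Initial processing approach for the data source:

3.3.png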

IV. Operations to watch out for

1.1.4 Extract the month dimension:

gather_customer_order['create_date'] = pd.to_datetime(gather_customer_order['create_date'])
gather_customer_order['create_year_month'] = gather_customer_order['create_date'].astype('str').str[0:7]
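
As an alternative to string slicing, the year-month label can be produced with dt.strftime. The sketch below uses a small made-up frame (orders) rather than the project data, and shows that both approaches give the same 'YYYY-MM' strings.

```python
import pandas as pd

# Made-up sample rows standing in for gather_customer_order
orders = pd.DataFrame({"create_date": ["2019-01-15", "2019-02-03", "2019-02-28"]})
orders["create_date"] = pd.to_datetime(orders["create_date"])

# strftime formats each timestamp directly as 'YYYY-MM'
orders["create_year_month"] = orders["create_date"].dt.strftime("%Y-%m")

# Same result as slicing the first 7 characters of the string form
sliced = orders["create_date"].astype("str").str[0:7]
print(orders["create_year_month"].tolist())
```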

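1.2.2 Add a column order_num_diff holding the month-over-month change in monthly bicycle order volume, i.e. this month compared with the previous month (for example, 2019-02 compared with 2019-01)

Step 1: compute the month-over-month change with diff() and list operations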

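# Month-over-month change in monthly bicycle order volume, to see the trend over the most recent year
# Month-over-month = (this month - previous month) / previous month, e.g. 2019-02 compared with 2019-01
# Assumes the rows are sorted by month in ascending order
order_num_diff = list(overall_sales_performance.order_num.diff() / overall_sales_performance.order_num.shift(1))
order_num_diff[0] = 0  # the first month has no previous month to compare with
order_num_diff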

Step 2: convert the result to a DataFrame and concatenate it with the table

order_num_diff = pd.DataFrame(order_num_diff)
overall_sales_performance = pd.concat([overall_sales_performance, order_num_diff], axis=1)

Step 3: rename the column, otherwise its name shows up as 0

overall_sales_performance = overall_sales_performance.rename(columns={0: 'order_num_diff'})
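
A more compact way to get month-over-month growth, (current - previous) / previous, is pct_change(). The sketch below assumes the rows are already sorted by month in ascending order and uses made-up numbers (the frame monthly is illustrative, not the project table).

```python
import pandas as pd

# Made-up monthly totals, sorted by month in ascending order
monthly = pd.DataFrame({
    "create_year_month": ["2019-01", "2019-02", "2019-03"],
    "order_num": [100, 120, 90],
})

# pct_change() computes (current - previous) / previous for each row;
# the first row has no previous month, so its NaN is replaced with 0
monthly["order_num_diff"] = monthly["order_num"].pct_change().fillna(0)
print(monthly)
```

This writes the column directly, so no separate concat or rename step is needed.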

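PS: how diff() works

df.diff() is equivalent to df - df.shift(): each row has the previous row's value subtracted from it.
Signature of diff():
DataFrame.diff(periods=1, axis=0)
Parameters:
periods: how far to shift, int, default 1.
axis: direction of the shift, {0 or 'index', 1 or 'columns'}. With 0 or 'index' the shift is down the rows; with 1 or 'columns' it is across the columns.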
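
The relationship between diff() and shift() can be checked on a small Series; the values below are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 13, 19, 18])

# diff() subtracts the previous element from the current one
print(s.diff().tolist())           # [nan, 3.0, 6.0, -1.0]

# The same result written out with shift()
manual = s - s.shift(1)
print(s.diff().equals(manual))     # True

# periods=2 compares each element with the one two positions earlier
print(s.diff(periods=2).tolist())  # [nan, nan, 9.0, 5.0]
```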

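1.2.4 Save the data to MySQL:

Error 1054:

1.2.4.png

Cause: the table name was already in use by an existing table; after changing the name, the error was resolved. MySQL error 1054 means an unknown column, which is consistent with writing into an existing table whose columns differ.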
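
For reference, pandas' to_sql has an if_exists parameter that controls what happens when the target table already exists. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite database from the standard library as a stand-in for MySQL, and the table name demo_table and both frames are made up.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL engine (assumption for this sketch)
conn = sqlite3.connect(":memory:")

old = pd.DataFrame({"a": [1]})
new = pd.DataFrame({"create_year_month": ["2019-01"], "order_num": [100]})

# First write creates the table
old.to_sql("demo_table", conn, index=False)

# The default if_exists='fail' refuses to write to an existing table
try:
    new.to_sql("demo_table", conn, index=False)
    failed = False
except ValueError:
    failed = True

# if_exists='replace' drops the old table and recreates it with the new columns
new.to_sql("demo_table", conn, if_exists="replace", index=False)
result = pd.read_sql("SELECT * FROM demo_table", conn)
print(result)
```

Using a new table name, as in the notes above, avoids touching the existing table at all.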

2.2.2 Month-over-month change for October and November by region:

#1. Get the de-duplicated list of regions, region_list
region_list = list(gather_customer_order_10_11_group.chinese_territory.unique())
region_list

#2. Loop over the regions, select each region's rows with loc, and apply pct_change() to get the month-over-month change as new Series
order_parts = []
amount_parts = []
for i in region_list:
    region_rows = gather_customer_order_10_11_group.loc[gather_customer_order_10_11_group['chinese_territory'] == i]
    order_parts.append(region_rows['order_num'].pct_change())
    amount_parts.append(region_rows['sum_amount'].pct_change())
# Series.append was removed in pandas 2.0, so the pieces are joined with pd.concat
order_x = pd.concat(order_parts)
amount_x = pd.concat(amount_parts)

#3. Assign the new Series as new columns
gather_customer_order_10_11_group['order_diff'] = order_x
gather_customer_order_10_11_group['amount_diff'] = amount_x

#4. Replace NaN with 0
gather_customer_order_10_11_group['order_diff'] = gather_customer_order_10_11_group['order_diff'].fillna(value=0)
gather_customer_order_10_11_group['amount_diff'] = gather_customer_order_10_11_group['amount_diff'].fillna(value=0)

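PS: the pct_change() function

It returns the relative change between the current element and the previous one, as a fraction (0.1 means 10%). Because the filtered data only covers October and November, only November has a month-over-month value. October has no previous month to compare with, so its result is NaN, which is replaced with 0 during cleaning.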
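
The per-region calculation can also be sketched without an explicit loop by combining groupby with pct_change(). The frame below is made up, with hypothetical region names, and assumes each region's rows are sorted by month.

```python
import pandas as pd

# Made-up October/November totals for two hypothetical regions
df = pd.DataFrame({
    "chinese_territory": ["East", "East", "North", "North"],
    "create_year_month": ["2019-10", "2019-11", "2019-10", "2019-11"],
    "order_num": [100, 110, 50, 40],
})

# Sort so that, within each region, October comes before November
df = df.sort_values(["chinese_territory", "create_year_month"])

# pct_change() runs separately inside each region; the first month of
# every region has nothing to compare with, so its NaN is replaced by 0
df["order_diff"] = df.groupby("chinese_territory")["order_num"].pct_change().fillna(0)
print(df)
```

Because the result keeps the original index, it can be assigned straight back as a column.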

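2.3.2 Group gather_customer_order_11 by city (chinese_city), sum the order quantity order_num, and assign the result to gather_customer_order_city_head

My first attempt looked like this and raised an error (embarrassing):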

grouped_city=gather_customer_order_11.groupby('chinese_city')
gather_customer_order_city_head=grouped_city['order_num'].agg({'order_num':sum}

1.png

2.png

Correct version:

grouped_city=gather_customer_order_11.groupby('chinese_city')
gather_customer_order_city_head=pd.DataFrame(grouped_city['order_num'].sum()).reset_index()

3.png

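PS: This gave me a better understanding of how the returned data structure differs between single-column and multi-column group aggregation, that is, whether the result is a Series or a DataFrame. Knowing which one comes back makes the follow-up steps easier.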
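
The Series-versus-DataFrame difference can be seen on a small made-up frame. The sketch also shows two other ways to get a DataFrame back directly: as_index=False, and named aggregation (available since pandas 0.25). In recent pandas versions, passing a renaming dict to agg on a single selected column raises a SpecificationError, which may be the error hit above.

```python
import pandas as pd

# Made-up rows standing in for gather_customer_order_11
df = pd.DataFrame({
    "chinese_city": ["Beijing", "Beijing", "Shanghai"],
    "order_num": [1, 2, 5],
})

# Selecting one column and aggregating returns a Series indexed by city
as_series = df.groupby("chinese_city")["order_num"].sum()
print(type(as_series).__name__)  # Series

# as_index=False keeps the group key as a regular column: a DataFrame
as_frame = df.groupby("chinese_city", as_index=False)["order_num"].sum()
print(type(as_frame).__name__)   # DataFrame

# Named aggregation: output column name = (input column, function)
named = df.groupby("chinese_city").agg(order_num=("order_num", "sum")).reset_index()
print(named)
```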

4.1.1 From sales_customer_order_11['birth_date'], extract the customer's birth year as a new column

Correct version:

sales_customer_order_11['birth_year'] = sales_customer_order_11['birth_date'].astype('str').str[0:4]

4.1.2 Convert the sales_customer_order_11['birth_year'] field to int

Correct version:

sales_customer_order_11['birth_year']=sales_customer_order_11['birth_year'].astype(float).fillna(value=0).astype(int)

Error encountered along the way: ValueError: Cannot convert non-finite values (NA or inf) to integer
Cause: some records in the ['birth_year'] column are empty, so the column cannot be converted to int.
Fix: fill the empty values in this column with 0 or another number, then convert with df['birth_year'].astype(int)
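
A slightly more defensive variant of the same fix is sketched below with made-up values: pd.to_numeric with errors="coerce" turns anything unparseable into NaN first, after which the column can either be filled with 0 or stored as pandas' nullable Int64 type, which keeps missing values as missing.

```python
import pandas as pd

# Made-up birth_year strings, including a missing and an unparseable entry
birth_year = pd.Series(["1975", "1982", None, "unknown"])

# Unparseable or missing entries become NaN instead of raising
as_number = pd.to_numeric(birth_year, errors="coerce")

# Option 1: fill the gaps with 0, then convert to a plain int column
filled = as_number.fillna(0).astype(int)
print(filled.tolist())  # [1975, 1982, 0, 0]

# Option 2: nullable integer dtype, missing values stay as <NA>
nullable = as_number.astype("Int64")
print(nullable)
```

Option 2 avoids inventing a birth year of 0, which would otherwise distort any age computed from this column.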

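4.1.3 Use the customer_age field to bin customers by age into the groups "30-34","35-39","40-44","45-49","50-54","55-59","60-64"

bins =[30,34,39,44,49,54,59,64]
group_names=["30-34","35-39","40-44","45-49","50-54","55-59","60-64"]
sales_customer_order_11['age_level']= pd.cut(sales_customer_order_11['customer_age'],bins=bins,labels =group_names)

PS: there must be exactly one fewer label than bin edges. The bins are right-closed by default, i.e. (30, 34], (34, 39] and so on,
so the 30-34 and 35-39 groups are written as bins=[30,34,39]. Note that an age of exactly 30 falls outside the first bin (30, 34] and becomes NaN unless include_lowest=True is passed to pd.cut.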
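
The edge behaviour of pd.cut can be checked on a few boundary ages. The ages below are made up; the bins and labels are the ones from this step.

```python
import pandas as pd

ages = pd.Series([30, 34, 35, 64])
bins = [30, 34, 39, 44, 49, 54, 59, 64]
labels = ["30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64"]

# Default: intervals are (30, 34], (34, 39], ... so 30 itself is left out
default_cut = pd.cut(ages, bins=bins, labels=labels)
print(default_cut.tolist())    # [nan, '30-34', '35-39', '60-64']

# include_lowest=True makes the first interval include its left edge
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(inclusive_cut.tolist())  # ['30-34', '30-34', '35-39', '60-64']
```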

V. Data dictionary for the chart data after processing

图.png