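Uniqlo's core concept is to sell the casual wear customers want at reasonable, trustworthy prices, using warehouse-style stores stripped of unnecessary decoration and a supermarket-style self-service shopping model. "UNIQLO" is short for Unique Clothing Warehouse. Its business philosophy of "low prices, good products, guaranteed quality" delivered remarkable results during Japan's economic downturn.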
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
RangeIndex: 22293 entries, 0 to 22292
Data columns (total 12 columns):
store_id 22293 non-null int64
city 22293 non-null object
channel 22293 non-null object
gender_group 22293 non-null object
age_group 22293 non-null object
wkd_ind 22293 non-null object
product 22293 non-null object
customer 22293 non-null int64
revenue 22293 non-null float64
order 22293 non-null int64
quant 22293 non-null int64
unit_cost 22293 non-null int64
dtypes: float64(1), int64(5), object(6)
memory usage: 2.0+ MB
# Clean and tidy the data
| store_id | city | channel | gender_group | age_group | wkd_ind | product | customer | revenue | order | quant | unit_cost |
0 | 658 | 深圳 | 线下 | Female | 25-29 | Weekday | 当季新品 | 4 | 796.0 | 4 | 4 | 59 |
1 | 146 | 杭州 | 线下 | Female | 25-29 | Weekday | 运动 | 1 | 149.0 | 1 | 1 | 49 |
2 | 70 | 深圳 | 线下 | Male | >=60 | Weekday | T恤 | 2 | 178.0 | 2 | 2 | 49 |
| store_id | customer | revenue | order | quant | unit_cost |
count | 22293.000000 | 22293.000000 | 22293.000000 | 22293.000000 | 22293.000000 | 22293.000000 |
mean | 335.391558 | 1.629480 | 159.531371 | 1.651998 | 1.858072 | 46.124658 |
std | 230.236167 | 1.785605 | 276.254066 | 1.861480 | 2.347301 | 19.124347 |
min | 19.000000 | 1.000000 | -0.660000 | 1.000000 | 1.000000 | 9.000000 |
25% | 142.000000 | 1.000000 | 64.000000 | 1.000000 | 1.000000 | 49.000000 |
50% | 315.000000 | 1.000000 | 99.000000 | 1.000000 | 1.000000 | 49.000000 |
75% | 480.000000 | 2.000000 | 175.000000 | 2.000000 | 2.000000 | 49.000000 |
max | 831.000000 | 58.000000 | 12538.000000 | 65.000000 | 84.000000 | 99.000000 |
# Anomaly found: row 20049 has a negative revenue (-0.66)
# 20049 91 武汉 线上 Female 55-59 Weekday 运动 1 -0.66 1 2 49
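The notebook does not show how the anomalous row was located. A minimal sketch, assuming the raw frame is held in a variable called `df` (a hypothetical name) and that "anomalous" simply means negative revenue, could look like this. The three rows are a made-up miniature of the dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset; only the columns needed here.
df = pd.DataFrame({
    "store_id": [658, 146, 91],
    "revenue": [796.0, 149.0, -0.66],
    "quant": [4, 1, 2],
})

# Revenue should never be negative, so any such row is suspect.
suspect = df[df["revenue"] < 0]
print(suspect)
```

The same boolean mask can be reused later to drop the row.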
## Result: no missing values were found
store_id False
city False
channel False
gender_group False
age_group False
wkd_ind False
product False
customer False
revenue False
order False
quant False
unit_cost False
dtype: bool
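The per-column `False` listing above has the shape of the output of `isnull().any()`. The call itself is not shown in the notebook, so the following is only a sketch on a toy frame (the name `df` and the values are assumptions, and missing values are planted deliberately so the check has something to report).

```python
import numpy as np
import pandas as pd

# Toy frame with two planted gaps, purely for illustration.
df = pd.DataFrame({
    "city": ["深圳", "杭州", None],
    "revenue": [796.0, np.nan, 178.0],
    "quant": [4, 1, 2],
})

# True for every column that contains at least one missing value.
has_missing = df.isnull().any()
# Number of missing values per column, useful when the answer is True.
missing_counts = df.isnull().sum()

print(has_missing)
print(missing_counts)
```

On the real data every entry is `False`, so no imputation or row dropping is needed for missing values.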
# Remove the anomalous value
| store_id | customer | revenue | order | quant | unit_cost |
count | 22292.000000 | 22292.000000 | 22292.000000 | 22292.000000 | 22292.000000 | 22292.000000 |
mean | 335.402521 | 1.629508 | 159.538557 | 1.652028 | 1.858066 | 46.124529 |
std | 230.235512 | 1.785640 | 276.258179 | 1.861517 | 2.347353 | 19.124766 |
min | 19.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 9.000000 |
25% | 142.000000 | 1.000000 | 64.000000 | 1.000000 | 1.000000 | 49.000000 |
50% | 315.000000 | 1.000000 | 99.000000 | 1.000000 | 1.000000 | 49.000000 |
75% | 480.000000 | 2.000000 | 175.000000 | 2.000000 | 2.000000 | 49.000000 |
max | 831.000000 | 58.000000 | 12538.000000 | 65.000000 | 84.000000 | 99.000000 |
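The summary above has one row fewer than before and a minimum revenue of 0, which is consistent with filtering on non-negative revenue. The exact filter is not shown, so the sketch below is an assumption; `df` is a hypothetical name for the raw frame, while `Data0` is the name the notebook uses for the cleaned one.

```python
import pandas as pd

# Toy stand-in for the raw data.
df = pd.DataFrame({
    "revenue": [796.0, -0.66, 0.0],
    "quant": [4, 2, 1],
    "unit_cost": [59, 49, 49],
})

# Keep rows with non-negative revenue. The .copy() makes Data0 an
# independent frame, so adding columns to it later does not trigger
# pandas' SettingWithCopyWarning.
Data0 = df[df["revenue"] >= 0].copy()

# Adding a column now affects Data0 only, not df.
Data0["unit_price"] = Data0["revenue"] / Data0["quant"]
print(Data0)
```

Taking the copy at filter time is a design choice: it costs a little memory but keeps every later assignment unambiguous.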
# Bar charts of units sold (quant), sales revenue (revenue), and number of customers (customer)
sns.barplot(x='wkd_ind',y='quant',data=Data0) #quant
sns.barplot(x='wkd_ind',y='revenue',data=Data0) #revenue
sns.barplot(x='wkd_ind',y='customer',data=Data0) #customer
Finding: quant, revenue, and customer are all higher on weekdays than on weekends (the bars show the mean per record).
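By default `sns.barplot` draws the mean of `y` for each `x` category, so the bar heights can be cross-checked with a plain `groupby`. The sketch below uses a made-up four-row frame (`df` and its values are assumptions) and also computes totals, since a per-record mean and a total can rank the two groups differently.

```python
import pandas as pd

# Made-up records, two per day type.
df = pd.DataFrame({
    "wkd_ind": ["Weekday", "Weekday", "Weekend", "Weekend"],
    "quant": [4, 2, 1, 3],
    "revenue": [796.0, 178.0, 99.0, 297.0],
    "customer": [4, 2, 1, 2],
})

cols = ["quant", "revenue", "customer"]
# Same quantity that the bar heights represent: the mean per record.
means = df.groupby("wkd_ind")[cols].mean()
# Totals answer a different question: overall volume per day type.
totals = df.groupby("wkd_ind")[cols].sum()

print(means)
print(totals)
```

Looking at both tables makes it clear whether "higher on weekdays" refers to the average transaction or to overall volume.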
# Descriptive statistics of revenue for each product category (product)
product | count | mean | std | min | 25% | 50% | 75% | max |
T恤 | 10610.0 | 145.027789 | 154.278714 | 0.0 | 79.0 | 99.0 | 158.0 | 6636.00 |
当季新品 | 2540.0 | 232.545228 | 597.253282 | 0.0 | 76.0 | 111.0 | 197.0 | 12538.00 |
毛衣 | 807.0 | 304.375217 | 290.733202 | 0.0 | 149.0 | 199.0 | 396.0 | 4975.00 |
牛仔裤 | 1412.0 | 174.311246 | 238.681718 | 0.0 | 59.0 | 79.0 | 199.0 | 2087.00 |
短裤 | 1694.0 | 63.450933 | 55.646467 | 0.0 | 37.0 | 40.0 | 77.0 | 676.00 |
袜子 | 2053.0 | 62.216931 | 51.183226 | 0.0 | 27.0 | 52.0 | 79.0 | 595.36 |
裙子 | 629.0 | 218.287409 | 172.449212 | 10.0 | 99.0 | 197.0 | 237.0 | 1442.00 |
运动 | 975.0 | 121.087528 | 142.760425 | 18.0 | 39.0 | 78.0 | 149.0 | 1257.00 |
配件 | 1572.0 | 282.878594 | 398.705054 | 0.0 | 99.0 | 149.0 | 298.0 | 4187.00 |
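A table with this layout is what `groupby(...).describe()` returns. The call is not shown in the notebook, so this is a sketch on toy data; `df` and the five rows are assumptions.

```python
import pandas as pd

# Toy data: two product categories with a handful of sales each.
df = pd.DataFrame({
    "product": ["T恤", "T恤", "袜子", "袜子", "袜子"],
    "revenue": [79.0, 99.0, 27.0, 52.0, 79.0],
})

# One row per product, with count/mean/std/min/quartiles/max of revenue.
summary = df.groupby("product")["revenue"].describe()
print(summary)
```

Sorting the result, for example with `summary.sort_values("mean", ascending=False)`, makes the ranking of categories easier to read.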
# Preference of each gender group (gender_group) for the online vs. offline purchase channel
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False
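# Preference of each age group (age_group) for the online vs. offline purchase channel
# Preference of customers in each city (city) for the online vs. offline purchase channel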
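The plotting code for these channel comparisons is not included in the notebook. One possible way to quantify the same preference, offered only as a sketch and not as the author's method, is a row-normalised cross-tabulation. The frame `df` and its rows are made up; swapping `gender_group` for `age_group` or `city` gives the other two comparisons.

```python
import pandas as pd

# Made-up records; 线上 = online, 线下 = offline, as in the dataset.
df = pd.DataFrame({
    "gender_group": ["Female", "Female", "Male", "Male", "Female"],
    "channel": ["线下", "线上", "线下", "线下", "线下"],
})

# Share of records per channel within each gender group (rows sum to 1).
share = pd.crosstab(df["gender_group"], df["channel"], normalize="index")
print(share)
```

This counts records rather than customers or orders; weighting by the `customer` or `order` column would be an alternative definition of preference.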
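Data0 = Data0.copy()  # work on an independent copy so the assignments below do not raise SettingWithCopyWarning
Data0['unit_price'] = Data0['revenue'] / Data0['quant']  # unit price
Data0['margin'] = Data0['unit_price'] - Data0['unit_cost']  # margin per unit (unit price minus unit cost)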
count 22292.000000
mean 38.144962
std 40.265440
min -99.000000
25% 14.000000
50% 30.000000
75% 50.000000
max 270.000000
Name: margin, dtype: float64
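As a sanity check, the unit price and margin definitions can be worked by hand on the first three rows of the preview table (revenue, quant, unit_cost). This is plain arithmetic on values that already appear in the notebook.

```python
# First three rows of the preview: (revenue, quant, unit_cost)
rows = [(796.0, 4, 59), (149.0, 1, 49), (178.0, 2, 49)]

# unit_price = revenue / quant
unit_prices = [rev / q for rev, q, _ in rows]
# margin = unit_price - unit_cost
margins = [rev / q - cost for rev, q, cost in rows]

print(unit_prices)  # [199.0, 149.0, 89.0]
print(margins)      # [140.0, 100.0, 40.0]
```

With this definition, the minimum margin of -99 in the summary can only come from a record with zero revenue and a unit cost of 99, since 99 is the largest unit cost in the data.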
# How is margin distributed? Plot a histogram
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False
Data0['margin'].plot(kind='hist',color='violet',legend=True,edgecolor = 'k',title='总体销售利润分布')
# Box plot of the margin distribution for each product category
sns.boxplot(y = 'product',x='margin',data = Data0)
product | | revenue | unit_cost |
T恤 | revenue | 1.0 | NaN |
T恤 | unit_cost | NaN | NaN |
当季新品 | revenue | 1.0 | NaN |
当季新品 | unit_cost | NaN | NaN |
毛衣 | revenue | 1.0 | NaN |
毛衣 | unit_cost | NaN | NaN |
牛仔裤 | revenue | 1.0 | NaN |
牛仔裤 | unit_cost | NaN | NaN |
短裤 | revenue | 1.0 | NaN |
短裤 | unit_cost | NaN | NaN |
袜子 | revenue | 1.0 | NaN |
袜子 | unit_cost | NaN | NaN |
裙子 | revenue | 1.0 | NaN |
裙子 | unit_cost | NaN | NaN |
运动 | revenue | 1.0 | NaN |
运动 | unit_cost | NaN | NaN |
配件 | revenue | 1.0 | NaN |
配件 | unit_cost | NaN | NaN |
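The NaN entries in the per-product correlation table, including the one for unit_cost against itself, are what Pearson correlation produces when a column has zero variance. That suggests unit_cost is constant within each product, so the coefficient is undefined inside any single product group. A small sketch on invented data (`df` and all values are assumptions) reproduces the pattern:

```python
import pandas as pd

# Invented data: unit_cost is fixed per product, revenue varies.
df = pd.DataFrame({
    "product": ["T恤"] * 3 + ["袜子"] * 3,
    "revenue": [79.0, 99.0, 158.0, 27.0, 52.0, 79.0],
    "unit_cost": [49, 49, 49, 9, 9, 9],
})

# Number of distinct unit_cost values per product (1 means constant).
n_costs = df.groupby("product")["unit_cost"].nunique()

# Within a product the cost has zero variance, so Pearson's r is NaN.
by_product = df.groupby("product")[["revenue", "unit_cost"]].corr()

# Across products the cost does vary, so r is defined.
overall = df[["revenue", "unit_cost"]].corr()

print(n_costs)
print(by_product)
print(overall)
```

Running the `nunique` check on the real data would confirm or refute the constant-cost explanation. If it holds, grouping by a variable that mixes products, such as city, is needed to get a defined coefficient.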
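DataFrame.corr(method='pearson', min_periods=1): parameter details
method: one of {'pearson', 'kendall', 'spearman'}
pearson: the Pearson correlation coefficient measures how closely two variables follow a straight line, so it captures linear association only; for non-linear relationships it can be misleading.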
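To illustrate the limitation of Pearson on non-linear data, the sketch below builds a perfectly monotonic but strongly non-linear relationship (y = exp(x), an invented example) and compares two `method` values. Spearman works on ranks, so it reports a perfect association, while Pearson does not.

```python
import numpy as np
import pandas as pd

# y grows exponentially with x: monotonic, but far from a straight line.
x = np.arange(1, 11, dtype=float)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

# Linear correlation: noticeably below 1 despite the exact relationship.
pearson = df.corr(method="pearson").loc["x", "y"]
# Rank correlation: 1 (up to rounding) because the ordering is preserved.
spearman = df.corr(method="spearman").loc["x", "y"]

print(round(pearson, 3), round(spearman, 3))
```

For skewed quantities such as revenue, comparing both coefficients is a cheap way to see whether a weak Pearson value reflects a weak relationship or just a non-linear one.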
# Explore the correlation between unit cost and revenue across cities and stores.
city | | revenue | unit_cost |
上海 | revenue | 1.000000 | 0.146763 |
上海 | unit_cost | 0.146763 | 1.000000 |
北京 | revenue | 1.000000 | 0.183747 |
北京 | unit_cost | 0.183747 | 1.000000 |
南京 | revenue | 1.000000 | 0.112418 |
南京 | unit_cost | 0.112418 | 1.000000 |
广州 | revenue | 1.000000 | 0.205299 |
广州 | unit_cost | 0.205299 | 1.000000 |
成都 | revenue | 1.000000 | 0.152857 |
成都 | unit_cost | 0.152857 | 1.000000 |
杭州 | revenue | 1.000000 | 0.157356 |
杭州 | unit_cost | 0.157356 | 1.000000 |
武汉 | revenue | 1.000000 | 0.164363 |
武汉 | unit_cost | 0.164363 | 1.000000 |
深圳 | revenue | 1.000000 | 0.133183 |
深圳 | unit_cost | 0.133183 | 1.000000 |
西安 | revenue | 1.000000 | 0.277920 |
西安 | unit_cost | 0.277920 | 1.000000 |
重庆 | revenue | 1.000000 | 0.138661 |
重庆 | unit_cost | 0.138661 | 1.000000 |
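The per-city table has the shape of a grouped `corr()` result, though the call is not shown in the notebook. The sketch below runs on invented data (`df` and its values are assumptions) and also collapses each 2x2 block to the single revenue/unit_cost coefficient, which is the only number of interest per city.

```python
import pandas as pd

# Invented data: four sales in each of two cities.
df = pd.DataFrame({
    "city": ["上海"] * 4 + ["北京"] * 4,
    "revenue": [79.0, 158.0, 40.0, 199.0, 27.0, 99.0, 149.0, 297.0],
    "unit_cost": [49, 59, 9, 99, 9, 49, 49, 99],
})

# One 2x2 Pearson correlation matrix per city.
by_city = df.groupby("city")[["revenue", "unit_cost"]].corr()

# Keep just the off-diagonal entry: one coefficient per city.
r_by_city = by_city.xs("revenue", level=1)["unit_cost"]
print(r_by_city.sort_values(ascending=False))
```

In the table above all ten coefficients lie between roughly 0.11 and 0.28, so the linear association between unit cost and revenue is positive but weak in every city. Replacing `"city"` with `"store_id"` in the `groupby` would give the per-store version.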