The tips dataset is one of the example datasets provided through the third-party Python library seaborn.
1: Data import
import numpy as np
from pandas import Series,DataFrame
import pandas as pd
import seaborn as sns  # seaborn provides the example tips dataset
In [6]:
tips = sns.load_dataset('tips')
tips.head()
Out[6]:
 | total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
2: Define the problem:
3: Data cleaning:
In [7]:
tips.shape  # size of the dataset as (rows, columns)
Out[7]:
(244, 7)
In [8]:
tips.describe()  # summary statistics for the numeric columns
Out[8]:
 | total_bill | tip | size |
---|---|---|---|
count | 244.000000 | 244.000000 | 244.000000 |
mean | 19.785943 | 2.998279 | 2.569672 |
std | 8.902412 | 1.383638 | 0.951100 |
min | 3.070000 | 1.000000 | 1.000000 |
25% | 13.347500 | 2.000000 | 2.000000 |
50% | 17.795000 | 2.900000 | 2.000000 |
75% | 24.127500 | 3.562500 | 3.000000 |
max | 50.810000 | 10.000000 | 6.000000 |
In [9]:
tips.info()  # column dtypes and non-null counts, to check for missing values
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 244 non-null float64
tip 244 non-null float64
sex 244 non-null category
smoker 244 non-null category
day 244 non-null category
time 244 non-null category
size 244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.2 KB
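The non-null counts above all equal the row count, so no values are missing. A more direct check counts missing entries per column with `isnull().sum()`. The sketch below is a minimal illustration that rebuilds the first five rows shown earlier by hand (so it runs without downloading anything); the `damaged` copy with one blanked-out tip is a hypothetical case added only to show what a nonzero count looks like.

```python
import numpy as np
import pandas as pd

# First five rows of the tips table shown earlier, rebuilt by hand
sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
    'sex': ['Female', 'Male', 'Male', 'Male', 'Female'],
})

# Count missing values per column; every count is 0 for complete data
missing = sample.isnull().sum()
print(missing)

# Hypothetical case: blank out one tip to see the count change
damaged = sample.copy()
damaged.loc[0, 'tip'] = np.nan
damaged_missing = damaged.isnull().sum()
print(damaged_missing)
```

On the full dataset the same call would be `tips.isnull().sum()`.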
4: Data exploration:
In [10]:
# A scatter plot shows how two variables relate; draw it with plot(kind='scatter')
# Here: tip amount against total bill, to see how the two are associated
tips.plot(kind='scatter', x='total_bill', y='tip')
Out[10]:
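A scatter plot gives a visual impression of the association; a correlation coefficient puts a number on it. The sketch below uses `Series.corr`, which computes the Pearson correlation by default, on the first eight rows of the dataset (copied by hand from the tables in this notebook). The value for these eight rows will differ from the value for all 244 rows.

```python
import pandas as pd

# First eight rows of total_bill and tip, copied from the notebook output
sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    'tip': [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
})

# Pearson correlation between bill and tip; a value near 1 means
# larger bills tend to come with larger tips
r = sample['total_bill'].corr(sample['tip'])
print(round(r, 3))
```

On the full dataset the equivalent call would be `tips['total_bill'].corr(tips['tip'])`.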
In [42]:
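# Line plot
# plot() draws a line plot by default, which is meant for showing the trend between two series
# Points are connected in row order, so sorting first (tips.sort_values('total_bill')) gives a more readable line
tips.plot(x = 'total_bill',y = 'tip')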
Out[42]:
In [11]:
# The following cells relate sex to tip size (using bar charts)
# First compute the mean tip for each value of the sex column
male_tip = tips[tips['sex'] == 'Male']['tip'].mean()
male_tip
Out[11]:
3.0896178343949052
In [14]:
female_tip = tips[tips['sex'] == 'Female']['tip'].mean()
female_tip
Out[14]:
2.833448275862069
In [15]:
# A Series is a one-dimensional array-like object holding a set of index labels and a set of data values; think of it as an array with an index.
s = Series([male_tip,female_tip],index=['male','female'])
s
Out[15]:
male 3.089618
female 2.833448
dtype: float64
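Filtering once per category works, but `groupby` produces the same kind of Series in one step. The sketch below is a minimal illustration on the first eight rows of the dataset (copied by hand from the notebook output), so its means describe those eight rows only, not the full-dataset values 3.089618 and 2.833448 shown above.

```python
import pandas as pd

# First eight rows of sex and tip, copied from the notebook output
sample = pd.DataFrame({
    'sex': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male'],
    'tip': [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
})

# One mean per distinct value of sex; the result is a Series indexed by sex
by_sex = sample.groupby('sex')['tip'].mean()
print(by_sex)
```

On the full dataset, `tips.groupby('sex')['tip'].mean()` would replace the two filtered means and the manual `Series` construction. Because `sex` is a categorical column there, recent pandas versions may ask for an explicit `observed=` argument.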
In [16]:
# Bar chart
# Draw it with plot(kind='bar')
s.plot(kind='bar')
Out[16]:
In [40]:
# Horizontal bar chart (useful when there are many categories)
# Draw it with plot(kind='barh')
s.plot(kind='barh')
Out[40]:
In [41]:
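# Stacked bar chart (useful when there are many categories)
# Draw it with plot(kind='barh') plus stacked=True
# Note: stacking only changes the picture for a DataFrame with several columns; for this single Series the bars look like the plain barh chart, apart from the transparency set by alpha
s.plot(kind='barh',stacked=True,alpha=0.5)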
Out[41]:
In [39]:
# unique() lists the distinct values in a column
# Distinct values of day
tips['day'].unique()
Out[39]:
[Sun, Sat, Thur, Fri]
Categories (4, object): [Sun, Sat, Thur, Fri]
In [24]:
# The following cells build the bar chart of mean tip per day
Sun_tip = tips[tips['day'] == 'Sun']['tip'].mean()
Sun_tip
Out[24]:
3.255131578947369
In [27]:
Sat_tip = tips[tips['day'] == 'Sat']['tip'].mean()
Sat_tip
Out[27]:
2.993103448275862
In [26]:
Thur_tip = tips[tips['day'] == 'Thur']['tip'].mean()
Thur_tip
Out[26]:
2.771451612903226
In [25]:
Fri_tip = tips[tips['day'] == 'Fri']['tip'].mean()
Fri_tip
Out[25]:
2.734736842105263
In [30]:
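# Use the *_tip variables computed above; the bare names Sun, Sat, Thur, Fri are undefined
day_tip = Series([Sun_tip,Sat_tip,Thur_tip,Fri_tip],index=['Sun_tip','Sat_tip','Thur_tip','Fri_tip'])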
day_tip
Out[30]:
Sun_tip 3.255132
Sat_tip 2.993103
Thur_tip 2.771452
Fri_tip 2.734737
dtype: float64
In [31]:
day_tip.plot(kind='bar')
Out[31]:
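With two categorical columns available (day and sex), a stacked bar chart becomes meaningful: each bar is one day, split by sex. The sketch below builds the underlying table with `pivot_table`. The rows in `toy` are made-up values for illustration only and are not taken from the tips dataset.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the tips dataset
toy = pd.DataFrame({
    'day': ['Sun', 'Sun', 'Sat', 'Sat', 'Thur', 'Fri'],
    'sex': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'tip': [3.0, 2.0, 4.0, 1.0, 2.5, 3.5],
})

# Rows: day, columns: sex, cells: total tip amount for that combination
# fill_value=0 replaces day/sex combinations that have no rows
table = toy.pivot_table(index='day', columns='sex', values='tip',
                        aggfunc='sum', fill_value=0)
print(table)
```

Calling `table.plot(kind='bar', stacked=True)` on such a table would draw one bar per day with the Female and Male totals stacked on top of each other.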
In [32]:
# Tip as a fraction of the total amount paid (bill plus tip)
tips['percent_tip'] = tips['tip']/(tips['total_bill']+tips['tip'])
tips.head(8)
Out[32]:
 | total_bill | tip | sex | smoker | day | time | size | percent_tip |
---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.056111 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.138333 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.142799 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.122638 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.128014 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 | 0.157000 |
6 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 | 0.185701 |
7 | 26.88 | 3.12 | Male | No | Sun | Dinner | 4 | 0.104000 |
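Note that `percent_tip` divides by the bill plus the tip, not by the bill alone, so it is the tip's share of everything the customer paid. The short check below reworks row 0 by hand to make the difference concrete; the variable names are chosen here for illustration.

```python
# Row 0 of the table above
tip, total_bill = 1.01, 16.99

# Definition used in the notebook: tip as a share of bill plus tip
share_of_payment = tip / (total_bill + tip)

# Alternative definition: tip relative to the bill alone (always larger)
share_of_bill = tip / total_bill

print(round(share_of_payment, 6))  # 0.056111, matching percent_tip in row 0
print(round(share_of_bill, 6))
```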
In [33]:
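# Density plot
# Kernel density estimation (KDE) approximates the distribution of the data as a sum of kernels, Gaussian (normal) by default
# Draw it with plot(kind='kde')
tips['percent_tip'].plot(kind='kde')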
Out[33]:
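To make the idea of "a sum of kernels" concrete, the sketch below computes a Gaussian kernel density estimate by hand with NumPy: one normal bump is centred on each observation and the bumps are averaged. The function name `gaussian_kde_1d` and the bandwidth choice (sample standard deviation times n^(-1/5), a Scott-style rule of thumb) are assumptions made for this illustration; it is not claimed to reproduce the curve pandas draws exactly. The input values are the eight `percent_tip` entries shown in the table above.

```python
import numpy as np

# percent_tip values of the eight rows shown in the table above
values = np.array([0.056111, 0.138333, 0.142799, 0.122638,
                   0.128014, 0.157000, 0.185701, 0.104000])

def gaussian_kde_1d(data, grid, bandwidth):
    """Average one Gaussian bump per observation, evaluated on grid."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Assumed bandwidth: sample std times n**(-1/5), a Scott-style rule of thumb
h = values.std(ddof=1) * len(values) ** (-1 / 5)

# Evaluate on a grid that extends five bandwidths past the data on each side
grid = np.linspace(values.min() - 5 * h, values.max() + 5 * h, 2001)
density = gaussian_kde_1d(values, grid, h)

# A density should integrate to about 1 (simple Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth makes the curve follow individual points more closely; a larger one smooths them together.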
In [38]:
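# Histogram
# A histogram shows a frequency distribution; the y-axis can be a count or a proportion, giving a rough picture of how the values are spread
# Draw it with hist(); bins sets how many intervals the values are split into (default 10), grid toggles the grid lines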
tips['tip'].hist(bins=10,grid=False)
Out[38]:
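The bars of a histogram are just counts per interval. As a rough illustration of what `bins=10` means, the sketch below uses `numpy.histogram` to split the first eight tip values (copied from the tables above) into 10 equal-width intervals between the minimum and maximum and to count the values in each. It is a sketch of the binning step only, not of the plotting.

```python
import numpy as np

# First eight tip values, copied from the notebook output
tips_sample = np.array([1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12])

# 10 equal-width bins spanning min..max; edges has one more entry than counts
counts, edges = np.histogram(tips_sample, bins=10)

# Print each interval with the number of values that fall into it
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {n}")
```

Each printed line corresponds to one bar: the interval gives its position and width, the count gives its height.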