https://www.kesci.com/apps/home/project/5a8afe517f2d695222327e14
Exercise 1 - Getting to Know Your Data
Step 6 How many columns does the dataset have: chipo.shape[1]
Step 9 Which item was ordered the most: chipo.item_name.value_counts().head(1) — value_counts sorts from largest to smallest by default
Step 10 How many distinct items were ordered in the item_name column: chipo.item_name.nunique() — nunique() counts the number of distinct values
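The difference between the two calls can be seen on a toy Series (the item names below are invented for illustration, not taken from the chipotle dataset):

```python
import pandas as pd

# a toy order column with repeated items
items = pd.Series(["Chicken Bowl", "Canned Soda", "Chicken Bowl",
                   "Chips", "Chicken Bowl"])

counts = items.value_counts()  # descending frequency table
print(counts.head(1))          # most frequent item with its count
print(items.nunique())         # number of distinct items -> 3
```

value_counts answers "which item and how often", nunique answers only "how many different items".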
Step 13 Convert item_price to float: dollarizer = lambda x: float(x[1:-1]) — the slice x[1:-1] strips the leading '$' and the trailing space before casting
chipo.item_price = chipo.item_price.apply(dollarizer)
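A sketch of the same conversion on synthetic price strings (in the chipotle data the prices look like "$2.39 " with a trailing space, which is why the slice drops both the first and the last character):

```python
import pandas as pd

# synthetic price strings in the same "$X.XX " shape as the dataset
prices = pd.Series(["$2.39 ", "$10.98 ", "$1.09 "])

# strip the leading '$' and the trailing space, then cast to float
dollarizer = lambda x: float(x[1:-1])
as_float = prices.apply(dollarizer)
print(as_float.sum())  # now arithmetic works on the column
```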
Exercise 2 - Filtering and Sorting
Step 5 How many teams took part in Euro 2012: euro12.shape[0] — note shape[0] (row count) here versus shape[1] (column count) in Exercise 1 Step 6
Step 6 How many columns does the dataset have: euro12.info() — unlike shape[1] in Exercise 1 Step 6, info() prints the column count together with each column's dtype and non-null count
Step 8 Sort the dataframe discipline by Red Cards first, then Yellow Cards: discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending = False)
Step 9 Calculate the average number of yellow cards per team: round(discipline['Yellow Cards'].mean())
Step 11 Select the teams whose names start with the letter G: euro12[euro12.Team.str.startswith('G')]
Step 14 Find the Shooting Accuracy of England, Italy and Russia: euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]
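The same loc pattern, row mask plus column list, can be sketched on a toy standings table (team order kept, accuracy values invented):

```python
import pandas as pd

# toy standings table; the accuracy strings are invented for illustration
euro = pd.DataFrame({
    "Team": ["England", "France", "Italy", "Russia"],
    "Shooting Accuracy": ["50.0%", "37.9%", "43.0%", "22.5%"],
})

# the boolean mask selects rows, the list after the comma selects columns
subset = euro.loc[euro.Team.isin(["England", "Italy", "Russia"]),
                  ["Team", "Shooting Accuracy"]]
print(subset)
```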
Exercise 3 - Grouping
Step 8 Print the mean, max and min spirit consumption per continent: drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
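A minimal groupby-agg sketch on invented numbers: agg with a list of function names produces one column per statistic, indexed by group key.

```python
import pandas as pd

# tiny stand-in for the drinks dataset (values invented)
drinks = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "spirit_servings": [100, 200, 50, 150],
})

# one row per continent, one column per aggregate
stats = drinks.groupby("continent").spirit_servings.agg(["mean", "min", "max"])
print(stats)
```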
Exercise 4 - The Apply Function
Step 4 What is the data type of each column: crime.info()
Step 5 Convert Year to datetime64: crime.Year = pd.to_datetime(crime.Year, format='%Y')
crime.info()
Step 6 Set the Year column as the dataframe's index: crime = crime.set_index('Year', drop = True)
Step 7 Delete the column named Total: del crime['Total']
Step 8 Group the dataframe by decade and sum — note the Population column: summing it directly would be wrong:
crimes = crime.resample('10AS').sum() # first sum the columns that can be added, one bin per decade
population = crime['Population'].resample('10AS').max() # a decade's population is the maximum within that decade
crimes['Population'] = population # replace the summed Population with population
crimes
Step 9 When was the most dangerous decade to live in the US: crime.idxmax(0) — idxmax() returns the index label of each column's maximum value
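Both steps can be sketched on invented yearly counts. Note the offset alias: '10AS' (ten-year bins anchored at year starts) is the older spelling; recent pandas spells it '10YS', which is used below.

```python
import pandas as pd

# yearly records; all numbers invented for illustration
idx = pd.to_datetime(["1960", "1965", "1970", "1975"], format="%Y")
crime = pd.DataFrame({"Population": [100, 120, 130, 160],
                      "Robbery": [5, 7, 2, 1]}, index=idx)

# '10YS' = ten-year bins anchored at year starts ('10AS' in older pandas)
crimes = crime.resample("10YS").sum()                              # sum the additive columns
crimes["Population"] = crime["Population"].resample("10YS").max()  # max, not sum
print(crimes)
print(crimes.idxmax(0))  # index label (decade) of each column's maximum
```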
Exercise 5 - Merging
Step 3 Name the dataframes above data1, data2 and data3:
data1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_id'])
Step 4 Concatenate data1 and data2 along the row axis and name the result all_data: all_data = pd.concat([data1, data2])
Step 9 Find all matches between data1 and data2 after merging: pd.merge(data1, data2, on='subject_id', how='outer') — how='outer' keeps the keys from both sides
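The two operations differ in shape: concat stacks rows as-is, while an outer merge aligns rows on the key. A sketch with two-row frames (names invented):

```python
import pandas as pd

data1 = pd.DataFrame({"subject_id": ["1", "2"], "first_name": ["Alex", "Amy"]})
data2 = pd.DataFrame({"subject_id": ["2", "3"], "first_name": ["Billy", "Brian"]})

all_data = pd.concat([data1, data2])  # stack rows: 4 rows, 2 columns
outer = pd.merge(data1, data2, on="subject_id", how="outer")  # align on the key
print(all_data.shape)
print(outer)  # one row per subject_id 1, 2, 3; unmatched sides become NaN
```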
Exercise 6 - Statistics
Step 3 Read the data and parse the first three columns into a proper date index: data = pd.read_table(path6, sep = "\s+", parse_dates = [[0,1,2]]) — parse_dates=[[0,1,2]] combines columns 0, 1 and 2 into a single datetime column
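The nested-list form of parse_dates has been removed from recent pandas, so here is the manual equivalent on a two-row synthetic file (the column names and values are invented in the shape of the wind dataset): combine the year/month/day columns with pd.to_datetime after reading.

```python
import io
import pandas as pd

raw = "Yr Mo Dy RPT\n61 1 1 15.04\n61 1 2 14.71\n"
data = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# rename the parts to year/month/day, which pd.to_datetime understands,
# and lift the two-digit years into the 1900s
parts = data[["Yr", "Mo", "Dy"]].rename(
    columns={"Yr": "year", "Mo": "month", "Dy": "day"})
parts["year"] += 1900
data["Yr_Mo_Dy"] = pd.to_datetime(parts)
data = data.drop(columns=["Yr", "Mo", "Dy"])
print(data)
```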
Step 4 The year 2061? Do we really have data for that year? Create a function and use it to fix this bug:
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
Step 5 Set the date as the index; mind the data type, it should be datetime64[ns]:
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"]) # convert Yr_Mo_Dy to datetime64
data = data.set_index('Yr_Mo_Dy') # set 'Yr_Mo_Dy' as the index
Step 6 For each location, how many values are missing:
data.isnull().sum()
isnull() marks missing entries as True; summing the boolean mask counts them per column.
Step 7 For each location, how many complete values are there: data.shape[0] - data.isnull().sum() — the row count minus the missing count per column
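Both counts in one sketch, on a tiny frame with deliberately placed NaNs (column names invented in the style of the wind data):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"RPT": [15.0, np.nan, 13.0],
                     "VAL": [np.nan, np.nan, 10.0]})

missing = data.isnull().sum()        # missing values per column
complete = data.shape[0] - missing   # rows minus missing = complete values
print(missing)   # RPT 1, VAL 2
print(complete)  # RPT 2, VAL 1
```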
Step 9 Create a dataframe named loc_stats to compute and store the wind speed min, max, mean and standard deviation for each location:
loc_stats = pd.DataFrame()
loc_stats['min'] = data.min() # min
loc_stats['max'] = data.max() # max
loc_stats['mean'] = data.mean() # mean
loc_stats['std'] = data.std() # standard deviations
Step 10 Create a dataframe named day_stats to compute and store the wind speed min, max, mean and standard deviation across all locations for each day:
day_stats = pd.DataFrame() # create the dataframe
day_stats['min'] = data.min(axis = 1) # min
day_stats['max'] = data.max(axis = 1) # max
day_stats['mean'] = data.mean(axis = 1) # mean
day_stats['std'] = data.std(axis = 1) # standard deviations
Step 11 For each location, calculate the mean wind speed in January:
data['date'] = data.index # creates a new column 'date' and gets the values from the index
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)
january_winds = data.query('month == 1')
january_winds.loc[:, 'RPT':'MAL'].mean()
Step 12 Downsample the records to a yearly frequency: data.query('month == 1 and day == 1') — keeps one record per year (each January 1st)
Step 13 Downsample the records to a monthly frequency: data.query('day == 1') — keeps one record per month (the first of each month)
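The query-based downsampling of steps 12 and 13 can be sketched on a synthetic two-year daily frame (column name invented):

```python
import pandas as pd

# two years of daily records with the helper columns built in step 11
dates = pd.date_range("1961-01-01", "1962-12-31", freq="D")
data = pd.DataFrame({"wind": range(len(dates))}, index=dates)
data["month"] = data.index.month
data["day"] = data.index.day

yearly = data.query("month == 1 and day == 1")  # one row per year (each Jan 1)
monthly = data.query("day == 1")                # one row per month
print(len(yearly), len(monthly))  # 2 24
```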
Exercise 7 - Visualization
Step 5 Draw a pie chart showing the proportion of male and female passengers:
# sum the instances of males and females
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()
# put them into a list called proportions
proportions = [males, females]
# create a pie chart
plt.pie(
    # using proportions
    proportions,
    # with these slice labels
    labels = ['Males', 'Females'],
    # with no shadows
    shadow = False,
    # with colors
    colors = ['blue', 'red'],
    # with one slice exploded out
    explode = (0.15, 0),
    # with the start angle at 90 degrees
    startangle = 90,
    # with the percent shown to one decimal place
    autopct = '%1.1f%%'
)
plt.axis('equal') # keep the pie circular
plt.title("Sex Proportion") # set the title
plt.tight_layout()
plt.show() # view the plot
Step 6 Draw a scatter plot of the fare against passenger age and sex:
lm = sns.lmplot(x = 'Age', y = 'Fare', data = titanic, hue = 'Sex', fit_reg = False) # create the plot
lm.set(title = 'Fare x Age') # set the title
axes = lm.axes # get the axes object and tweak it
axes[0, 0].set_ylim(-5,)
axes[0, 0].set_xlim(-5, 85)
Step 8 Draw a histogram of the ticket fares:
df = titanic.Fare.sort_values(ascending = False)
binsVal = np.arange(0, 600, 10) # create the bin intervals with numpy
plt.hist(df, bins = binsVal) # create the plot
plt.xlabel('Fare') # set the labels and title
plt.ylabel('Frequency')
plt.title('Fare Paid Histogram')
plt.show() # show the plot
Exercise 8 - Creating DataFrames
Step 4 The columns are in alphabetical order; reorder them as name, type, hp, evolution, pokedex:
pokemon = pokemon[['name', 'type', 'hp', 'evolution', 'pokedex']]
Step 6 Check the data type of each column: pokemon.dtypes
Exercise 9 - Time Series
Step 5 Convert the Date column to datetime: apple.Date = pd.to_datetime(apple.Date)
Step 7 Are there duplicate dates: apple.index.is_unique — is_unique returns True when every index label appears exactly once
Step 8 Sort the index in ascending order: apple.sort_index(ascending = True).head()
Step 9 Find the last business day of each month: apple_month = apple.resample('BM').mean()
Step 10 How many days lie between the earliest and the latest date in the dataset: (apple.index.max() - apple.index.min()).days
Step 12 Plot the Adj Close values in time order:
appl_open = apple['Adj Close'].plot(title = "Apple Stock") # make the plot and assign it to a variable
fig = appl_open.get_figure() # get the figure so its size can be changed
fig.set_size_inches(13.5, 9)
Exercise 10 - Deleting Data
Step 4 Set the dataframe's column names: iris.columns = ['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class']
Step 5 Are there missing values in the dataframe: pd.isnull(iris).sum()
Step 6 Set rows 10 to 19 of the petal_length column to missing: iris.iloc[10:20, 2:3] = np.nan
Step 7 Replace all the missing values with 1.0: iris.petal_length.fillna(1, inplace = True)
Step 8 Delete the class column: del iris['class']
Step 10 Drop the rows that have missing values: iris = iris.dropna(how='any') — how='any' drops a row if any of its values is missing
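Steps 6, 7 and 10 can be sketched together on a four-row frame (values invented): set a positional slice to NaN, fill one column, and drop incomplete rows.

```python
import numpy as np
import pandas as pd

iris = pd.DataFrame({"petal_length": [1.4, 1.3, 1.5, 1.6],
                     "petal_width": [0.2, 0.2, 0.3, 0.4]})

iris.iloc[1:3, 0:1] = np.nan  # positional slice (rows 1-2, column 0) set to missing

filled = iris.copy()
filled["petal_length"] = filled["petal_length"].fillna(1.0)  # replace NaN with 1.0

dropped = iris.dropna(how="any")  # drop every row containing a missing value
print(filled["petal_length"].tolist())  # [1.4, 1.0, 1.0, 1.6]
print(dropped.shape)                    # (2, 2)
```

Assigning back (or using inplace) matters: fillna and dropna return new objects by default.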