Dataset: a public Kaggle dataset of airplane crashes recorded since 1908
Project tasks:
# -*-coding: utf-8 -*-
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from bokeh.io import output_notebook, output_file, show
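# bokeh.charts was later split out of Bokeh into the separate bkcharts package,
# so this import only works on older Bokeh releases that still bundle it.
from bokeh.charts import Bar, TimeSeries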
from bokeh.layouts import column
from math import pi
- Inspect the data
data_path = './dataset/Airplane_Crashes_and_Fatalities_Since_1908.csv'
df_data = pd.read_csv(data_path)
print(u'Basic dataset information:')
df_data.info()  # info() prints its report itself and returns None
Basic dataset information:
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 13 columns):
Date            5268 non-null object
Time            3049 non-null object
Location        5248 non-null object
Operator        5250 non-null object
Flight #        1069 non-null object
Route           3562 non-null object
Type            5241 non-null object
Registration    4933 non-null object
cn/In           4040 non-null object
Aboard          5246 non-null float64
Fatalities      5256 non-null float64
Ground          5246 non-null float64
Summary         4878 non-null object
dtypes: float64(3), object(10)
memory usage: 535.1+ KB
print(u'The dataset has %i rows and %i columns' % (df_data.shape[0], df_data.shape[1]))
The dataset has 5268 rows and 13 columns
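The `info()` report shows that several columns are incomplete (for example, `Flight #` has only 1069 non-null values out of 5268 rows). As a minimal sketch of how the missing entries could be counted directly, the snippet below rebuilds three columns from the first five preview rows as a toy frame; the names `sample`, `missing_count`, and `missing_share` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame: three columns copied from the first five preview rows,
# not the full dataset.
sample = pd.DataFrame({
    'Date': ['09/17/1908', '07/12/1912', '08/06/1913', '09/09/1913', '10/17/1913'],
    'Time': ['17:18', '06:30', None, '18:30', '10:30'],
    'Aboard': [2.0, 5.0, 1.0, 20.0, 30.0],
})

# Number of missing entries per column.
missing_count = sample.isnull().sum()

# The same figure as a share of all rows.
missing_share = missing_count / len(sample)

print(missing_count)
print(missing_share)
```

On the real data, the same two lines applied to `df_data` should reproduce the gaps implied by the non-null counts above.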
print(u'Data preview:')
df_data.head()
Data preview:
| | Date | Time | Location | Operator | Flight # | Route | Type | Registration | cn/In | Aboard | Fatalities | Ground | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 09/17/1908 | 17:18 | Fort Myer, Virginia | Military - U.S. Army | NaN | Demonstration | Wright Flyer III | NaN | 1 | 2.0 | 1.0 | 0.0 | During a demonstration flight, a U.S. Army fly… |
| 1 | 07/12/1912 | 06:30 | AtlantiCity, New Jersey | Military - U.S. Navy | NaN | Test flight | Dirigible | NaN | NaN | 5.0 | 5.0 | 0.0 | First U.S. dirigible Akron exploded just offsh… |
| 2 | 08/06/1913 | NaN | Victoria, British Columbia, Canada | Private | - | NaN | Curtiss seaplane | NaN | NaN | 1.0 | 1.0 | 0.0 | The first fatal airplane accident in Canada oc… |
| 3 | 09/09/1913 | 18:30 | Over the North Sea | Military - German Navy | NaN | NaN | Zeppelin L-1 (airship) | NaN | NaN | 20.0 | 14.0 | 0.0 | The airship flew into a thunderstorm and encou… |
| 4 | 10/17/1913 | 10:30 | Near Johannisthal, Germany | Military - German Navy | NaN | NaN | Zeppelin L-2 (airship) | NaN | NaN | 30.0 | 30.0 | 0.0 | Hydrogen gas which was being vented was sucked… |
# def process_missing_data(df_data):
#     """
#     Handle missing data
#     """
#     if df_data.isnull().values.any():
#         # Missing data is present
#         print('Missing data found!')
#         df_data = df_data.fillna(0.)  # fill NaN
#         # df_data = df_data.dropna()  # drop NaN
#     return df_data.reset_index()
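The commented-out helper above is never called in the notebook. For readers who want to try it, here is a minimal runnable sketch of the same idea on a made-up toy frame. The `fill_value` parameter and the `drop=True` argument are assumptions added for illustration; the original calls `reset_index()` without arguments, which would add an extra `index` column.

```python
import pandas as pd


def process_missing_data(df, fill_value=0.0):
    """Return a copy of df with NaN replaced by fill_value.

    fill_value is a hypothetical parameter; the original hard-codes 0.
    """
    if df.isnull().values.any():
        print('Missing data found!')
        df = df.fillna(fill_value)  # fillna returns a new frame
    # drop=True avoids adding the old index as a new column (an assumption).
    return df.reset_index(drop=True)


# Made-up toy values, only to exercise the function.
toy = pd.DataFrame({
    'Aboard': [2.0, float('nan'), 1.0],
    'Ground': [0.0, 0.0, float('nan')],
})
cleaned = process_missing_data(toy)
print(cleaned)
```

Filling with 0 keeps every row but treats an unknown count as zero, which lowers yearly sums; `dropna()` avoids that at the cost of discarding rows.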
- Data conversion
df_data['Date'] = pd.to_datetime(df_data['Date'])
# df_data['Date']
df_data['Year'] = df_data['Date'].dt.year  # vectorized accessor instead of a per-row lambda
df_data.head()
| | Date | Time | Location | Operator | Flight # | Route | Type | Registration | cn/In | Aboard | Fatalities | Ground | Summary | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1908-09-17 | 17:18 | Fort Myer, Virginia | Military - U.S. Army | NaN | Demonstration | Wright Flyer III | NaN | 1 | 2.0 | 1.0 | 0.0 | During a demonstration flight, a U.S. Army fly… | 1908 |
| 1 | 1912-07-12 | 06:30 | AtlantiCity, New Jersey | Military - U.S. Navy | NaN | Test flight | Dirigible | NaN | NaN | 5.0 | 5.0 | 0.0 | First U.S. dirigible Akron exploded just offsh… | 1912 |
| 2 | 1913-08-06 | NaN | Victoria, British Columbia, Canada | Private | - | NaN | Curtiss seaplane | NaN | NaN | 1.0 | 1.0 | 0.0 | The first fatal airplane accident in Canada oc… | 1913 |
| 3 | 1913-09-09 | 18:30 | Over the North Sea | Military - German Navy | NaN | NaN | Zeppelin L-1 (airship) | NaN | NaN | 20.0 | 14.0 | 0.0 | The airship flew into a thunderstorm and encou… | 1913 |
| 4 | 1913-10-17 | 10:30 | Near Johannisthal, Germany | Military - German Navy | NaN | NaN | Zeppelin L-2 (airship) | NaN | NaN | 30.0 | 30.0 | 0.0 | Hydrogen gas which was being vented was sucked… | 1913 |
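The dates in the preview are written month first (`08/06/1913` becomes `1913-08-06`). A small sketch of the conversion with the format spelled out, so the parser does not have to guess between month-first and day-first; the `format` string is an assumption based on the preview rows, and `dates`, `parsed`, and `years` are illustrative names.

```python
import pandas as pd

# Three date strings copied from the data preview.
dates = pd.Series(['09/17/1908', '07/12/1912', '08/06/1913'])

# An explicit month/day/year format removes any ambiguity for values
# such as 08/06/1913 (August 6, not June 8).
parsed = pd.to_datetime(dates, format='%m/%d/%Y')

# Extract the calendar year with the vectorized .dt accessor.
years = parsed.dt.year
print(years.tolist())  # [1908, 1912, 1913]
```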
a) seaborn
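# SimHei supplies CJK glyphs; these two settings only matter when labels contain Chinese text.
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(15.0, 10.0))
# countplot draws one bar per year, with height equal to the number of rows (crashes).
sns.countplot(x='Year', data=df_data)
plt.title(u'Number of crashes vs. year')
plt.xlabel(u'Year')
plt.ylabel(u'Number of crashes')
plt.xticks(rotation=90)
plt.show()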
b) bokeh
p = Bar(df_data, 'Year', title=u'Number of crashes vs. year', plot_width=1000, legend=False,
        xlabel=u'Year', ylabel=u'Number of crashes')
p.xaxis.major_label_orientation = pi/2  # rotate the x-axis tick labels by 90 degrees
output_notebook()
show(p)
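Both charts above plot the number of rows per year. As a minimal sketch, the same numbers can be computed as a table with `value_counts`; the five years below are taken from the data preview, and `crashes_per_year` is an illustrative name.

```python
import pandas as pd

# Year values of the first five preview rows.
years = pd.Series([1908, 1912, 1913, 1913, 1913], name='Year')

# Count rows per year, then order by year rather than by frequency.
crashes_per_year = years.value_counts().sort_index()
print(crashes_per_year)
```

Having the counts as a Series makes it easy to check individual bars, for example the peak year, without reading them off the chart.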
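# Sum only the numeric columns; newer pandas versions no longer drop non-numeric columns silently.
grouped_year_sum_data = df_data.groupby('Year', as_index=False)[['Aboard', 'Fatalities', 'Ground']].sum()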
grouped_year_sum_data.head()
| | Year | Aboard | Fatalities | Ground |
|---|---|---|---|---|
| 0 | 1908 | 2.0 | 1.0 | 0.0 |
| 1 | 1912 | 5.0 | 5.0 | 0.0 |
| 2 | 1913 | 51.0 | 45.0 | 0.0 |
| 3 | 1915 | 60.0 | 40.0 | 0.0 |
| 4 | 1916 | 109.0 | 108.0 | 0.0 |
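The 1913 totals can be checked by hand against the data preview: the three 1913 rows have 1, 20, and 30 people aboard (51 in total) and 1, 14, and 30 fatalities (45 in total). A small sketch that reproduces this on the five preview rows only; `sample` and `yearly` are illustrative names.

```python
import pandas as pd

# Numeric columns of the first five preview rows.
sample = pd.DataFrame({
    'Year': [1908, 1912, 1913, 1913, 1913],
    'Aboard': [2.0, 5.0, 1.0, 20.0, 30.0],
    'Fatalities': [1.0, 5.0, 1.0, 14.0, 30.0],
    'Ground': [0.0, 0.0, 0.0, 0.0, 0.0],
})

# Group by year and sum the selected numeric columns.
yearly = sample.groupby('Year', as_index=False)[['Aboard', 'Fatalities', 'Ground']].sum()
print(yearly)
```

Selecting the numeric columns explicitly keeps the call independent of how a given pandas version treats text and date columns during aggregation.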
a) seaborn
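grouped_year_sum_data = df_data.groupby('Year', as_index=False)[['Aboard', 'Fatalities', 'Ground']].sum()
grouped_year_sum_data.head()
# NOTE: the plotting call is missing from the source at this point. The two
# plt.plot lines below are an assumed stand-in (plain matplotlib line plots).
plt.figure(figsize=(15.0, 10.0))
plt.plot(grouped_year_sum_data['Year'], grouped_year_sum_data['Aboard'], label='Aboard')
plt.plot(grouped_year_sum_data['Year'], grouped_year_sum_data['Fatalities'], label='Fatalities')
plt.legend()
plt.title(u'Passengers aboard vs. fatalities vs. year')
plt.xlabel(u'Year')
plt.ylabel(u'Passengers aboard vs. fatalities')
plt.xticks(rotation=90)
plt.show()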
b) bokeh
tsline = TimeSeries(data=grouped_year_sum_data, x='Year', y=['Aboard', 'Fatalities'],
                    color=['Aboard', 'Fatalities'], dash=['Aboard', 'Fatalities'],
                    title=u'Passengers aboard vs. fatalities vs. year', xlabel=u'Year',
                    ylabel=u'Passengers aboard vs. fatalities', legend=True)
tspoint = TimeSeries(data=grouped_year_sum_data, x='Year', y=['Aboard', 'Fatalities'],
                     color=['Aboard', 'Fatalities'], dash=['Aboard', 'Fatalities'],
                     builder_type='point', title=u'Passengers aboard vs. fatalities vs. year',
                     xlabel=u'Year', ylabel=u'Passengers aboard vs. fatalities', legend=True)
output_notebook()
show(column(tsline, tspoint))  # stack the line chart above the point chart
grouped_data = df_data.groupby(by='Type',as_index=False)['Date'].count()
grouped_data.rename(columns={'Date':'Count'},inplace=True)
top_n = 10
top_n_grouped_data = grouped_data.sort_values('Count',ascending=False).iloc[:top_n, :]
top_n_grouped_data
| | Type | Count |
|---|---|---|
| 1178 | Douglas DC-3 | 334 |
| 2388 | de Havilland Canada DHC-6 Twin Otter 300 | 81 |
| 1097 | Douglas C-47A | 74 |
| 1089 | Douglas C-47 | 62 |
| 1230 | Douglas DC-4 | 40 |
| 2340 | Yakovlev YAK-40 | 37 |
| 125 | Antonov AN-26 | 36 |
| 1598 | Junkers JU-52/3m | 32 |
| 1119 | Douglas C-47B | 29 |
| 1045 | De Havilland DH-4 | 28 |
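A shorter route to a top-N ranking like the one above is `value_counts`, which counts and sorts in one step. The sketch below uses a made-up toy list of aircraft types, so the counts are not the real ones; `types` and `top_2` are illustrative names.

```python
import pandas as pd

# Made-up toy data: only the type names come from the table above.
types = pd.Series([
    'Douglas DC-3', 'Douglas DC-3', 'Douglas DC-3',
    'Douglas C-47', 'Douglas C-47',
    'Antonov AN-26',
], name='Type')

# value_counts sorts by frequency in descending order and skips missing values.
top_n = 2
top_2 = types.value_counts().head(top_n)
print(top_2)
```

Unlike the `groupby` version, the result is a Series indexed by type, so it would need `reset_index()` before being passed to a plotting call that expects `Type` and `Count` columns.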
- Visualize the results
plt.figure(figsize=(15.0,10.0))
sns.barplot(x='Count',y='Type',data=top_n_grouped_data)
plt.title('Count vs Type',fontsize=20)
plt.xlabel('Count')  # the x-axis carries the counts (x='Count' in barplot)
plt.ylabel('Type')   # the y-axis carries the aircraft types
plt.show()