Airplane crash data analysis example

Dataset: a public Kaggle dataset of airplane crashes and fatalities recorded since 1908

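Project tasks:

  • Number of crashes per year
  • Number of people aboard
  • Survivors and fatalities
  • Which operators have the most crashes?
  • Which aircraft types have the most crashes?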
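# -*- coding: utf-8 -*-
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from bokeh.io import output_notebook, output_file, show
# bokeh.charts only exists in old Bokeh releases; it was later moved to the
# separate, unmaintained bkcharts package, so this import fails on recent Bokeh
from bokeh.charts import Bar,TimeSeries
from bokeh.layouts import column
from math import pi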
- Inspect the data
data_path = './dataset/Airplane_Crashes_and_Fatalities_Since_1908.csv'
df_data = pd.read_csv(data_path)
print('Dataset info:')
df_data.info()  # info() prints its report itself and returns None
Dataset info:
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 13 columns):
Date            5268 non-null object
Time            3049 non-null object
Location        5248 non-null object
Operator        5250 non-null object
Flight #        1069 non-null object
Route           3562 non-null object
Type            5241 non-null object
Registration    4933 non-null object
cn/In           4040 non-null object
Aboard          5246 non-null float64
Fatalities      5256 non-null float64
Ground          5246 non-null float64
Summary         4878 non-null object
dtypes: float64(3), object(10)
memory usage: 535.1+ KB
print('The dataset has %i rows and %i columns' % (df_data.shape[0], df_data.shape[1]))
The dataset has 5268 rows and 13 columns
print('Data preview:')
df_data.head()
Data preview:
Date Time Location Operator Flight # Route Type Registration cn/In Aboard Fatalities Ground Summary
0 09/17/1908 17:18 Fort Myer, Virginia Military - U.S. Army NaN Demonstration Wright Flyer III NaN 1 2.0 1.0 0.0 During a demonstration flight, a U.S. Army fly…
1 07/12/1912 06:30 AtlantiCity, New Jersey Military - U.S. Navy NaN Test flight Dirigible NaN NaN 5.0 5.0 0.0 First U.S. dirigible Akron exploded just offsh…
2 08/06/1913 NaN Victoria, British Columbia, Canada Private - NaN Curtiss seaplane NaN NaN 1.0 1.0 0.0 The first fatal airplane accident in Canada oc…
3 09/09/1913 18:30 Over the North Sea Military - German Navy NaN NaN Zeppelin L-1 (airship) NaN NaN 20.0 14.0 0.0 The airship flew into a thunderstorm and encou…
4 10/17/1913 10:30 Near Johannisthal, Germany Military - German Navy NaN NaN Zeppelin L-2 (airship) NaN NaN 30.0 30.0 0.0 Hydrogen gas which was being vented was sucked…
  • Handle missing data
# def process_missing_data(df_data):
#     """
#     Handle missing data.
#     """
#     if df_data.isnull().values.any():
#         # Missing data is present
#         print('Missing data found!')
#         df_data = df_data.fillna(0.)    # fill NaN with 0
#         # df_data = df_data.dropna()    # or drop the rows that contain NaN
#     return df_data.reset_index()
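The function above is left commented out, so it helps to look at how much data is actually missing before choosing between filling and dropping. A minimal sketch on a toy frame follows; the values are made up for illustration and `toy` merely stands in for `df_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_data; the values are made up for illustration.
toy = pd.DataFrame({
    'Operator': ['A', None, 'C', 'D'],
    'Aboard': [2.0, 5.0, np.nan, 20.0],
    'Fatalities': [1.0, 5.0, np.nan, 14.0],
})

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop only the rows where the numeric columns used later are missing,
# instead of filling every NaN in the frame with 0.
cleaned = toy.dropna(subset=['Aboard', 'Fatalities']).reset_index(drop=True)
print(cleaned)
```

Dropping on a column subset keeps rows whose only gap is in a text column such as `Operator`, which would otherwise be lost with a plain `dropna()`.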
- Data conversion
df_data['Date'] = pd.to_datetime(df_data['Date'])
# df_data['Date']
df_data['Year'] = df_data['Date'].dt.year  # extract the year from each parsed date
df_data.head()
Date Time Location Operator Flight # Route Type Registration cn/In Aboard Fatalities Ground Summary Year
0 1908-09-17 17:18 Fort Myer, Virginia Military - U.S. Army NaN Demonstration Wright Flyer III NaN 1 2.0 1.0 0.0 During a demonstration flight, a U.S. Army fly… 1908
1 1912-07-12 06:30 AtlantiCity, New Jersey Military - U.S. Navy NaN Test flight Dirigible NaN NaN 5.0 5.0 0.0 First U.S. dirigible Akron exploded just offsh… 1912
2 1913-08-06 NaN Victoria, British Columbia, Canada Private - NaN Curtiss seaplane NaN NaN 1.0 1.0 0.0 The first fatal airplane accident in Canada oc… 1913
3 1913-09-09 18:30 Over the North Sea Military - German Navy NaN NaN Zeppelin L-1 (airship) NaN NaN 20.0 14.0 0.0 The airship flew into a thunderstorm and encou… 1913
4 1913-10-17 10:30 Near Johannisthal, Germany Military - German Navy NaN NaN Zeppelin L-2 (airship) NaN NaN 30.0 30.0 0.0 Hydrogen gas which was being vented was sucked… 1913
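The preview shows dates written as month/day/year. As a hedged sketch, the format can be passed explicitly so that a value like 07/12/1912 is always read as July 12, and unparseable entries become NaT instead of raising. The toy frame and its 'unknown' entry are made up; whether the real file contains malformed dates is not something the original states.

```python
import pandas as pd

# Toy frame; the third value is a made-up malformed entry.
toy = pd.DataFrame({'Date': ['09/17/1908', '07/12/1912', 'unknown']})

# Explicit month/day/year format; errors='coerce' turns bad values into NaT.
parsed = pd.to_datetime(toy['Date'], format='%m/%d/%Y', errors='coerce')
toy['Date'] = parsed
toy['Year'] = parsed.dt.year  # becomes float because of the missing value

print(toy)
print('Unparseable dates:', parsed.isna().sum())
```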
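  • Analysis and visualization: number of crashes vs. year

a) seaborn

plt.figure(figsize=(15.0,10.0))
sns.countplot(x='Year', data=df_data)

# The SimHei font is only needed to render Chinese labels; enable these two lines for CJK text
# plt.rcParams['font.sans-serif']=['SimHei']
# plt.rcParams['axes.unicode_minus']=False
plt.title('Number of crashes vs. year')
plt.xlabel('Year')
plt.ylabel('Number of crashes')
plt.xticks(rotation=90)
plt.show()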

[Figure 1: number of crashes per year, seaborn count plot]

b) bokeh

p = Bar(df_data,'Year',title='Number of crashes vs. year',plot_width=1000,legend=False,xlabel='Year',ylabel='Number of crashes')
p.xaxis.major_label_orientation = pi/2
output_notebook()
show(p)

[Figure 2: number of crashes per year, bokeh bar chart]
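The `Bar` chart comes from `bokeh.charts`, which recent Bokeh releases no longer include. A rough equivalent can be sketched with `bokeh.plotting`: count the crashes per year with pandas, then draw the counts with `vbar`. This is an assumed substitute, not the code used for the figure above, and the toy `Year` column is made up.

```python
from math import pi

import pandas as pd
from bokeh.plotting import figure, show

# Toy data standing in for df_data['Year']; the values are made up.
toy = pd.DataFrame({'Year': [1908, 1912, 1913, 1913, 1913]})

# Crashes per year, ordered by year.
counts = toy['Year'].value_counts().sort_index()

p = figure(title='Number of crashes vs. year',
           x_axis_label='Year', y_axis_label='Number of crashes')
p.vbar(x=counts.index.tolist(), top=counts.tolist(), width=0.8)
p.xaxis.major_label_orientation = pi / 2

# show(p)  # uncomment to render the chart in a browser or notebook
```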

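  • Analysis and visualization: people aboard vs. fatalities vs. year
# Sum only the numeric columns; summing the date and text columns is meaningless and can raise a TypeError on recent pandas
grouped_year_sum_data = df_data.groupby('Year',as_index=False)[['Aboard','Fatalities','Ground']].sum()
grouped_year_sum_data.head()
   Year  Aboard  Fatalities  Ground
0  1908     2.0         1.0     0.0
1  1912     5.0         5.0     0.0
2  1913    51.0        45.0     0.0
3  1915    60.0        40.0     0.0
4  1916   109.0       108.0     0.0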

a) seaborn

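# The plotting calls are missing from the original post; the two line plots
# below are an assumed reconstruction (sns.lineplot needs seaborn 0.9 or later)
plt.figure(figsize=(15.0,10.0))
sns.lineplot(x='Year', y='Aboard', data=grouped_year_sum_data, label='Aboard')
sns.lineplot(x='Year', y='Fatalities', data=grouped_year_sum_data, label='Fatalities')
plt.title('People aboard vs. fatalities vs. year')
plt.xlabel('Year')
plt.ylabel('People aboard / fatalities')
plt.xticks(rotation=90)
plt.legend()
plt.show()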

[Figure 3: people aboard and fatalities per year, seaborn]

b) bokeh

tsline = TimeSeries(data=grouped_year_sum_data, x='Year',y=['Aboard','Fatalities'],color=['Aboard', 'Fatalities'],
                    dash=['Aboard', 'Fatalities'],title='People aboard vs. fatalities vs. year',xlabel='Year',ylabel='People aboard / fatalities',
                    legend=True)
tspoint = TimeSeries(data=grouped_year_sum_data,x='Year',y=['Aboard','Fatalities'],color=['Aboard', 'Fatalities'],
                    dash=['Aboard', 'Fatalities'],builder_type='point',title='People aboard vs. fatalities vs. year',xlabel='Year',
                    ylabel='People aboard / fatalities',legend=True)
output_notebook()
show(column(tsline,tspoint))

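[Figure 4: people aboard and fatalities per year, bokeh line chart]
[Figure 5: people aboard and fatalities per year, bokeh point chart]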
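The task list also asks for the number of survivors, which the dataset does not store directly. A minimal sketch, assuming survivors can be taken as `Aboard` minus `Fatalities`, using the first five yearly sums from the table above; the column names `Survivors` and `FatalityRate` are introduced here for illustration only.

```python
import pandas as pd

# First five rows of the yearly sums shown earlier in this post.
yearly = pd.DataFrame({
    'Year': [1908, 1912, 1913, 1915, 1916],
    'Aboard': [2.0, 5.0, 51.0, 60.0, 109.0],
    'Fatalities': [1.0, 5.0, 45.0, 40.0, 108.0],
})

# Assumption: everyone aboard who is not counted as a fatality survived.
yearly['Survivors'] = yearly['Aboard'] - yearly['Fatalities']

# Share of the people aboard who died, per year.
yearly['FatalityRate'] = yearly['Fatalities'] / yearly['Aboard']

print(yearly)
```

On the full data the same two lines can be applied to `grouped_year_sum_data`; years with zero people aboard would need extra care because of the division.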

  • Top-n analysis: aircraft types with the most crashes
grouped_data = df_data.groupby(by='Type',as_index=False)['Date'].count()
grouped_data.rename(columns={'Date':'Count'},inplace=True)
top_n = 10
top_n_grouped_data = grouped_data.sort_values('Count',ascending=False).iloc[:top_n, :]
top_n_grouped_data
                                          Type  Count
1178                              Douglas DC-3    334
2388  de Havilland Canada DHC-6 Twin Otter 300     81
1097                             Douglas C-47A     74
1089                              Douglas C-47     62
1230                              Douglas DC-4     40
2340                           Yakovlev YAK-40     37
125                              Antonov AN-26     36
1598                          Junkers JU-52/3m     32
1119                             Douglas C-47B     29
1045                     De Havilland DH-4     28

- Visualize the result

plt.figure(figsize=(15.0,10.0))
sns.barplot(x='Count',y='Type',data=top_n_grouped_data)
plt.title('Count vs Type',fontsize=20)
plt.xlabel('Count')  # counts are on the x axis of this horizontal bar plot
plt.ylabel('Type')
plt.show()

[Figure 6: top 10 aircraft types by number of crashes, seaborn bar plot]
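The task list also asks which operators have the most crashes, and the same top-n pattern applies to the `Operator` column. The sketch below wraps it in a hypothetical helper, `top_n_counts`, built on `value_counts`; the toy rows are made up and say nothing about the real ranking.

```python
import pandas as pd


def top_n_counts(df, column, n=10):
    """Return the n most frequent values of `column` with their counts.

    Hypothetical helper for illustration; missing values are ignored
    because value_counts drops NaN by default.
    """
    counts = df[column].value_counts().head(n)
    return counts.rename_axis(column).reset_index(name='Count')


# Toy data standing in for df_data; the rows are made up.
toy = pd.DataFrame({
    'Operator': ['Aeroflot', 'Aeroflot', 'Aeroflot',
                 'Air France', 'Air France', 'Private', None],
})

top_operators = top_n_counts(toy, 'Operator', n=2)
print(top_operators)
```

Calling `top_n_counts(df_data, 'Type')` should give the same kind of table as the groupby version above, which makes the helper reusable for both questions.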
