权当学习Python练手用的.
数据来data.gov.uk,大小有58.4MB
文件都是些什么内容?
LowMemory 方式读取文件
#read the file
filedir='/home/derek/Desktop/python-data-analyis/large-excel-files/Accidents_2013.csv'
data = pd.read_csv(filedir,low_memory=False)
print data.ix[:10]['Day_of_Week']
print 'Accidents'
print '----------'
#选择星期日发生的事故
accidents_sunday = data[data.Day_of_Week==1]
print 'Accidents which happended on a Sunday: ',len(accidents_sunday)
#选择星期日发生的且涉事人数在十人以上的事故
accidents_sunday_twenty_cars = data[(data.Day_of_Week==1) & (data.Number_of_Vehicles>10)]
print'Accidents which happened on a Sunday involving > 10 cars: ' , len(accidents_sunday_twenty_cars)
#选择星期日发生的且涉事人数在十人以上且天气情况是下雨的事故(2对应的是无风下雨)
accidents_sunday_twenty_cars_rain = data[(data.Day_of_Week==1) & (data.Number_of_Vehicles>10) & (data.Weather_Conditions==2)]
print'Accidents which happened on a Sunday involving > 10 cars with rainning: ' , len(accidents_sunday_twenty_cars_rain)
#选择在伦敦的星期日发生的事故
london_data = data[(data['Police_Force'] == 1) & (data.Day_of_Week==1)]
print 'Accidents in London on a Sunday',len(london_data)
#选择在2000年的伦敦的星期日发生的事故
london_data_2000 = london_data[((pd.to_datetime('2000-1-1', errors='coerce')) > (pd.to_datetime(london_data['Date'],errors='coerce'))) & (pd.to_datetime(london_data['Date'],errors='coerce') < (pd.to_datetime('2000-12-31', errors='coerce')))]
print 'Accidents in London on a Sunday in 2000:',len(london_data_2000)
给人的感觉是特别像SQL语句,DataFrame的这种切片,方式特别好用,对不对?
pd.to_datetime(london_data['Date'],errors='coerce')
这里是日期转换函数.
输出:
Accidents
----------
Accidents which happended on a Sunday: 14854
Accidents which happened on a Sunday involving > 10 cars: 1
Accidents which happened on a Sunday involving > 10 cars with rainning: 1
Accidents in London on a Sunday 2374
Accidents in London on a Sunday in 2000: 0
sudo pip install XlsxWriter
writer = pd.ExcelWriter('london_data.xlsx', engine='xlsxwriter')
london_data.to_excel(writer, 'sheet1')
writer.save()
writer.close()
import pandas as pd
from pandas import Series
import matplotlib.pyplot as plt
#read the file
dir='/home/derek/Desktop/python-data-analyis/large-excel-files/'
filedir=['Accidents_2013.csv','Accidents_2014.csv', 'Accidents_2015.csv']
tot = Series([])
for i in range(3):
#块读取文件, 每次读1000条记录
data = pd.read_csv(dir + filedir[i],chunksize=1000)
for piece in data:
tot = tot.add(piece['Day_of_Week'].value_counts(), fill_value=0)
day_index = ['Sun', 'Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat']
print 'data like:'
#tot = tot.sort_values(ascending=False)
print tot
#重新构造一个Series,是为了给索引命名
new_Series = Series(tot.values, index=day_index)
new_Series.plot()
plt.show()
plt.close()
控制台输出:
data like:
1 46052
2 60956
3 65006
4 64039
5 64445
6 69378
7 55162
dtype: float64
结论: 看来,英国人在工作日出行要比在休息日造成更多的事故.星期五的出行造成的事故最多,或许,星期五急着回家,哈哈.相比起来,星期五不适合外出.
参考文章来源
文件没有提供,是因为:读者可以自己去下载,可能找到更想更好用Python分析的数据.