根据提供的csv文件,计算北京空气质量数据:
1.汇总计算PM指数年平均值的变化情况
2.汇总计算每年中1-12月的PM指数数据变化情况
csv文件详见附件
问题1
实际是求PM指数的年平均值,即按年分组计算平均值;
问题2
实际是求PM指数的月平均值,即按年和月分组计算平均值。
import numpy as np
import pandas as pd
# 打开文件
FileNameStr = 'BeijingPM20100101_20151231.csv'
df = pd.read_csv(FileNameStr, encoding='utf-8')
# 禁止省略列信息
pd.set_option('display.max_columns', None)
# 打印信息
print("--------------head--------------")
print(df.head())
print("------------describe------------")
print(df.describe())
print("--------------info--------------")
print(df.info())
print("================================")
执行后得到以下信息:
--------------head--------------
No year month day hour season PM_Dongsi PM_Dongsihuan \
0 1 2010 1 1 0 4 NaN NaN
1 2 2010 1 1 1 4 NaN NaN
2 3 2010 1 1 2 4 NaN NaN
3 4 2010 1 1 3 4 NaN NaN
4 5 2010 1 1 4 4 NaN NaN
PM_Nongzhanguan PM_US Post DEWP HUMI PRES TEMP cbwd Iws \
0 NaN NaN -21.0 43.0 1021.0 -11.0 NW 1.79
1 NaN NaN -21.0 47.0 1020.0 -12.0 NW 4.92
2 NaN NaN -21.0 43.0 1019.0 -11.0 NW 6.71
3 NaN NaN -21.0 55.0 1019.0 -14.0 NW 9.84
4 NaN NaN -20.0 51.0 1018.0 -12.0 NW 12.97
precipitation Iprec
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
------------describe------------
No year month day hour \
count 52584.000000 52584.000000 52584.000000 52584.000000 52584.000000
mean 26292.500000 2012.499772 6.523962 15.726609 11.500000
std 15179.837614 1.707485 3.448452 8.798896 6.922252
min 1.000000 2010.000000 1.000000 1.000000 0.000000
25% 13146.750000 2011.000000 4.000000 8.000000 5.750000
50% 26292.500000 2012.000000 7.000000 16.000000 11.500000
75% 39438.250000 2014.000000 10.000000 23.000000 17.250000
max 52584.000000 2015.000000 12.000000 31.000000 23.000000
season PM_Dongsi PM_Dongsihuan PM_Nongzhanguan \
count 52584.000000 25052.000000 20508.000000 24931.000000
mean 2.491100 89.154439 92.560806 88.643737
std 1.116988 87.239267 88.027434 88.041166
min 1.000000 3.000000 3.000000 3.000000
25% 1.000000 24.000000 28.000000 24.000000
50% 2.000000 64.000000 68.000000 62.000000
75% 3.000000 124.000000 127.000000 122.000000
max 4.000000 737.000000 672.000000 844.000000
PM_US Post DEWP HUMI PRES TEMP \
count 50387.000000 52579.000000 52245.000000 52245.000000 52579.000000
mean 95.904241 2.074554 54.602421 1016.465442 12.587040
std 91.643772 14.222059 25.991338 10.295070 12.098527
min 1.000000 -40.000000 2.000000 991.000000 -19.000000
25% 27.000000 -10.000000 31.000000 1008.000000 2.000000
50% 69.000000 2.000000 55.000000 1016.000000 14.000000
75% 132.000000 15.000000 78.000000 1025.000000 23.000000
max 994.000000 28.000000 100.000000 1046.000000 42.000000
Iws precipitation Iprec
count 52579.000000 52100.000000 52100.000000
mean 23.261829 19.258683 19.519008
std 49.281706 4381.035532 4381.036040
min 0.450000 0.000000 0.000000
25% 1.790000 0.000000 0.000000
50% 4.920000 0.000000 0.000000
75% 21.020000 0.000000 0.000000
max 585.600000 999990.000000 999990.000000
--------------info--------------
RangeIndex: 52584 entries, 0 to 52583
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 No 52584 non-null int64
1 year 52584 non-null int64
2 month 52584 non-null int64
3 day 52584 non-null int64
4 hour 52584 non-null int64
5 season 52584 non-null int64
6 PM_Dongsi 25052 non-null float64
7 PM_Dongsihuan 20508 non-null float64
8 PM_Nongzhanguan 24931 non-null float64
9 PM_US Post 50387 non-null float64
10 DEWP 52579 non-null float64
11 HUMI 52245 non-null float64
12 PRES 52245 non-null float64
13 TEMP 52579 non-null float64
14 cbwd 52579 non-null object
15 Iws 52579 non-null float64
16 precipitation 52100 non-null float64
17 Iprec 52100 non-null float64
dtypes: float64(11), int64(6), object(1)
memory usage: 7.2+ MB
None
================================
经查阅资料,可知各变量含义:
No
: 行号
year
: 年份
month
: 月份
day
: 日期
hour
: 小时
season
: 季节
PM_Dongsi
: 来自Dongsi(东四?)
的PM2.5浓度 (ug/m^3)
PM_Dongsihuan
: 来自Dongsihuan(东四环?)
的PM2.5浓度 (ug/m^3)
PM_Nongzhanguan
: 来自Nongzhanguan(农展馆?)
的PM2.5浓度 (ug/m^3)
PM_US Post
: 来自US Post
的PM2.5浓度 (ug/m^3)
DEWP
: 露点 (摄氏温度) 指在固定气压之下,空气中所含的气态水达到饱和而凝结成液态水所需要降至的温度。
TEMP
: Temperature (摄氏温度)
HUMI
: 湿度 (%)
PRES
: 气压 (hPa)
cbwd
: 组合风向
Iws
: 累计风速 (m/s)
precipitation
: 降水量/时 (mm)
Iprec
: 累计降水量 (mm)
不难发现,仅有第2, 3, 7, 8, 9, 10列的数据与题目相关。但四列PM数据都存在不同程度的缺失。
import numpy as np
import pandas as pd
# 打开文件,仅读取第7至第10列
FileNameStr = 'BeijingPM20100101_20151231.csv'
df = pd.read_csv(FileNameStr, encoding='utf-8', usecols=[6, 7, 8, 9])
# 打印信息
# 总行数
print("Total of rows: {}".format(len(df.index)))
# 显示某列缺失值个数
# print("The number of missing data in PM_Dongsi: {}".format(len(df.index) - len(df['PM_Dongsi'].dropna())))
# 为了书写简便,使用循环,遍历所有列,获得每列缺失值个数
for col in df:
#dropna()会去除缺省值的数据,遂得到每列的缺失数据数
print("The number of missing data in {}: {}".format(col, len(df.index) - len(df[col].dropna())))
# 获取所有列均为缺失值的行的个数,how='all'是限定所有列均为缺省值
print("The number of missing data in BOTH: ", len(df.index) - len(df.dropna(how='all')))
得到信息:1
Total of rows: 52584
The number of missing data in PM_Dongsi: 27532
The number of missing data in PM_Dongsihuan: 32076
The number of missing data in PM_Nongzhanguan: 27653
The number of missing data in PM_US_Post: 2197
The number of missing data in BOTH: 1894
可见PM_US_Post的数据最为完整,但是仍存在部分四个值均缺失的情况。
为了数据的可靠性,博主采用四项求平均的方式来确定每日的PM数据。
原文件中变量PM_US Post
中包含空格,可能会对命令操作造成干扰,建议将' '
替换为'_'
。
问题1:
import numpy as np
import pandas as pd
# 打开文件,仅读取第7至第10列
FileNameStr = 'BeijingPM20100101_20151231.csv'
df = pd.read_csv(FileNameStr, encoding='utf-8', usecols=[1, 6, 7, 8, 9])
# 新建平均值列,并将平均值写入
# 其中,iloc[:, 1:5]指第2到第5列,mean(axis=1)为求行平均值
df['PM_ave'] = df.iloc[:, 1:5].mean(axis=1)
# 保存到文件,其中以'year'分组,计算'PM_ave'列的平均值。
df.groupby('year')['PM_ave'].mean().to_csv("solution1.csv")
# 为方便,再打印一份
print(df.groupby('year')['PM_ave'].mean())
得到:
year PM_ave
2010 104.0457298
2011 99.09324038
2012 90.53876764
2013 98.40266354
2014 93.9177037
2015 85.85894216
问题2:
import numpy as np
import pandas as pd
# 打开文件,仅读取第7至第10列
FileNameStr = 'BeijingPM20100101_20151231.csv'
df = pd.read_csv(FileNameStr, encoding='utf-8', usecols=[1, 2, 6, 7, 8, 9])
# 新建平均值列,并将平均值写入
# 其中,iloc[:, 2:6]指第3到第6列,mean(axis=1)为求行平均值
df['PM_ave'] = df.iloc[:, 2:6].mean(axis=1)
# 保存到文件,其中以'year'和'month'分组,计算'PM_ave'列的平均值。
df.groupby(['year', 'month'])['PM_ave'].mean().to_csv("solution2.csv")
# 为方便,再打印一份
print(df.groupby(['year', 'month'])['PM_ave'].mean())
得到:
year month PM_ave
2010 1 90.40366972
2010 2 97.23994039
2010 3 94.04654443
2010 4 80.0724234
2010 5 87.07191316
2010 6 109.0389381
2010 7 123.4260753
2010 8 97.68343195
2010 9 122.792735
2010 10 118.7843666
2010 11 138.3840361
2010 12 97.11574697
2011 1 44.87369985
2011 2 150.2901786
2011 3 57.99198718
2011 4 91.72067039
2011 5 65.10814607
2011 6 108.7946554
2011 7 107.3864865
2011 8 103.7338004
2011 9 94.96940195
2011 10 145.5568182
2011 11 109.434965
2011 12 108.7213997
2012 1 118.9223881
2012 2 84.44202899
2012 3 96.47432432
2012 4 87.83588317
2012 5 90.96671491
2012 6 96.63418079
2012 7 80.6497093
2012 8 81.16532905
2012 9 59.95224719
2012 10 94.95135135
2012 11 87.43696275
2012 12 109.1872964
2013 1 183.1952703
2013 2 113.5664683
2013 3 114.5726927
2013 4 63.04780093
2013 5 89.14852151
2013 6 111.3548611
2013 7 74.93283917
2013 8 67.92361111
2013 9 85.71782407
2013 10 102.2087814
2013 11 85.1462963
2013 12 90.31776434
2014 1 107.9117384
2014 2 160.5138889
2014 3 103.1832437
2014 4 92.16064815
2014 5 64.95855735
2014 6 59.15462963
2014 7 91.7999552
2014 8 65.66823688
2014 9 68.23263889
2014 10 135.2697133
2014 11 106.3375
2014 12 76.62253584
2015 1 110.0227375
2015 2 103.4455605
2015 3 94.48342294
2015 4 79.39699074
2015 5 61.16756272
2015 6 60.33240741
2015 7 60.22950269
2015 8 45.89605735
2015 9 50.92476852
2015 10 77.2578405
2015 11 125.803125
2015 12 162.1789875
点击下载
原文地址:https://www.boyinthesun.cn/post/python-pd/
上文所指本地下载来自我的个人云盘