项目源码请见:https://github.com/dennis0818/weather_data_analysis
通过北京市历史24小时气象和大气污染物数据,采用普通线性回归、岭回归和Lasso回归进行建模分析,对比模型预测效果选择最优模型,预测之后12小时天气数据中PM2.5的值
在北京,冬天最令人头疼的就是雾霾问题,每当雾霾天气来临,那种灰蒙蒙的空气和将口鼻掩盖在厚厚的口罩下呼吸困难的感觉,让人情绪低落。而雾霾的罪魁祸首就是PM2.5。
本次分析主要是想要使用线性回归模型对PM2.5值进行预测
8月8日已经对(北京)历史24小时即08/07日22时-08/0821时共24小时数据进行了采集,今天先来采集新生成的数据做测试数据集,由于已经过去了12个小时,所以共有12组数据可用,即08/08日22时-08/09日09时
本次分析数据来自心知天气网,该网站可以通过Restful风格URL直接获取Json格式气象和大气数据,获取方式较简单。
import pandas as pd
import numpy as np
from io import StringIO
from urllib import request
import json
from dateutil.parser import parse
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
通过请求request从网络获取数据并进行整理
url_beijing_w = 'https://api.seniverse.com/v3/weather/hourly_history.json?key=Sz6GmmiQ6SAjYTKbc&location=beijing&language=zh-Hans&unit=c'
url_beijing_p = 'https://api.seniverse.com/v3/air/hourly_history.json?key=Sz6GmmiQ6SAjYTKbc&location=beijing&language=zh-Hans&scope=city'
s_p = request.urlopen(url_beijing_p).read().decode('utf8')
s_w = request.urlopen(url_beijing_w).read().decode('utf8')
data_dict_p = json.loads(s_p)
data_dict_w = json.loads(s_w)
获取的数据是Json格式的,先将其解析为字典,在根据字典键值存入DataFrame中
def gen_table_p(list2):
data_dict = {}
for i, value in enumerate(list2):
data_dict[i] = value['city']
return data_dict
def gen_table_w(list2):
data_dict = {}
for i, value in enumerate(list2):
data_dict[i] = value
return data_dict
将气象和大气污染物数据转换成DataFrame表格
data_list_p = data_dict_p['results'][0]['hourly_history']
data_list_p = gen_table_p(data_list_p)
data_p = pd.DataFrame(data_list_p)
data_list_w = data_dict_w['results'][0]['hourly_history']
data_list_w = gen_table_w(data_list_w)
data_w = pd.DataFrame(data_list_w)
data_p = data_p.T
data_w = data_w.T
调整时间格式,用datetime格式将日期转换成想要的格式,再存回表格中。删除几个不重要的特征变量
def adjust_time(data):
time = data['last_update'].astype(str)
time = time.str[:19]
time = time.str.replace('T', ' ')
time = time.map(lambda x : parse(x))
time = time.dt.strftime('%H-%m/%d')
data['last_update'] = time
return data
data_p = adjust_time(data_p)
data_w = adjust_time(data_w)
data = pd.merge(data_p, data_w, on = 'last_update')
data_test = data[:12]
data_test.head()
aqi | co | last_update | no2 | o3 | pm10 | pm25 | quality | so2 | clouds | ... | feels_like | humidity | pressure | temperature | text | visibility | wind_direction | wind_direction_degree | wind_scale | wind_speed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 105 | 0.817 | 14-08/19 | 18 | 209 | 81 | 48 | 轻度污染 | 5 | 50 | ... | 29 | 46 | 1004 | 30 | 多云 | 19.4 | 南 | 188 | 3 | 19.44 |
1 | 97 | 0.908 | 13-08/19 | 19 | 197 | 84 | 49 | 良 | 6 | 50 | ... | 30 | 48 | 1004 | 30 | 多云 | 18.5 | 西 | 248 | 3 | 18.0 |
2 | 69 | 0.875 | 12-08/19 | 26 | 162 | 87 | 49 | 良 | 4 | 50 | ... | 28 | 53 | 1005 | 29 | 多云 | 13.7 | 西 | 250 | 3 | 15.84 |
3 | 74 | 0.808 | 11-08/19 | 41 | 114 | 98 | 51 | 良 | 3 | 90 | ... | 28 | 53 | 1005 | 28 | 阴 | 11.0 | 西南 | 228 | 2 | 11.16 |
4 | 74 | 0.800 | 10-08/19 | 51 | 77 | 98 | 48 | 良 | 2 | 90 | ... | 27 | 56 | 1005 | 28 | 阴 | 9.9 | 南 | 188 | 3 | 14.04 |
5 rows × 22 columns
data_test = data_test.drop(['dew_point', 'wind_direction', 'wind_direction_degree', 'text', 'code', 'wind_scale','last_update', 'quality'], axis = 1)
data_test
aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 105 | 0.817 | 18 | 209 | 81 | 48 | 5 | 50 | 29 | 46 | 1004 | 30 | 19.4 | 19.44 |
1 | 97 | 0.908 | 19 | 197 | 84 | 49 | 6 | 50 | 30 | 48 | 1004 | 30 | 18.5 | 18.0 |
2 | 69 | 0.875 | 26 | 162 | 87 | 49 | 4 | 50 | 28 | 53 | 1005 | 29 | 13.7 | 15.84 |
3 | 74 | 0.808 | 41 | 114 | 98 | 51 | 3 | 90 | 28 | 53 | 1005 | 28 | 11.0 | 11.16 |
4 | 74 | 0.800 | 51 | 77 | 98 | 48 | 2 | 90 | 27 | 56 | 1005 | 28 | 9.9 | 14.04 |
5 | 73 | 0.775 | 56 | 50 | 95 | 46 | 1 | 90 | 26 | 60 | 1005 | 26 | 9.6 | 12.24 |
6 | 67 | 0.750 | 56 | 28 | 83 | 40 | 1 | 90 | 25 | 64 | 1005 | 25 | 8.9 | 6.12 |
7 | 61 | 0.675 | 58 | 17 | 72 | 34 | 1 | 0 | 23 | 71 | 1005 | 23 | 9.0 | 6.12 |
8 | 59 | 0.650 | 54 | 20 | 68 | 33 | 2 | 0 | 21 | 75 | 1004 | 22 | 10.1 | 4.68 |
9 | 59 | 0.608 | 48 | 26 | 67 | 33 | 2 | 0 | 21 | 78 | 1004 | 21 | 11.6 | 6.48 |
10 | 58 | 0.592 | 48 | 32 | 66 | 31 | 2 | 0 | 21 | 79 | 1004 | 21 | 12.4 | 7.56 |
11 | 59 | 0.617 | 43 | 44 | 68 | 32 | 3 | 0 | 22 | 76 | 1004 | 22 | 12.5 | 6.84 |
数据备份,生成训练数据集和测试数据集
#data_test.to_excel('D:/python/practise/sample/weather/data_test(22-09).xlsx')
data_train = pd.read_excel('D:/Python/exercise/samples/WeatherData/backup/data_all(22-21).xlsx')
data_test = pd.read_excel('D:/Python/exercise/samples/WeatherData/backup/data_test(22-09).xlsx')
data_train.head()
Unnamed: 0 | aqi | co | last_update | no2 | o3 | pm10 | pm25 | so2 | station | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 49 | 0.7 | 21-08/08 | 29 | 111 | 49 | 31 | 3 | 万寿西宫 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
1 | 1 | 51 | 0.7 | 21-08/08 | 11 | 108 | 52 | 33 | 3 | 定陵 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
2 | 2 | 52 | 0.8 | 21-08/08 | 19 | 124 | 53 | 30 | 3 | 东四 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
3 | 3 | 48 | 0.6 | 21-08/08 | 20 | 88 | 46 | 33 | 3 | 天坛 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
4 | 4 | 55 | 0.7 | 21-08/08 | 23 | 118 | 59 | 32 | 3 | 农展馆 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
data_train.shape
(288, 17)
提取出各个站的aqi数据
aqi_all_station = data_train[['last_update','aqi', 'station']].set_index(['station', 'last_update']).unstack(0)
aqi_all_station = aqi_all_station.replace({0: np.nan}).fillna(aqi_all_station.median())
aqi_all_station
aqi | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
station | 万寿西宫 | 东四 | 农展馆 | 古城 | 天坛 | 奥体中心 | 官园 | 定陵 | 怀柔镇 | 昌平镇 | 海淀区万柳 | 顺义新城 |
last_update | ||||||||||||
00-08/08 | 55.0 | 62.0 | 30.0 | 36.0 | 28.0 | 52.5 | 30.0 | 35.0 | 22.0 | 31.0 | 28.0 | 60.0 |
01-08/08 | 45.0 | 51.0 | 39.0 | 31.0 | 40.0 | 45.0 | 40.0 | 22.0 | 24.0 | 22.0 | 26.0 | 50.0 |
02-08/08 | 60.0 | 53.0 | 39.0 | 31.0 | 52.0 | 51.0 | 43.0 | 26.0 | 24.0 | 23.0 | 35.0 | 60.0 |
03-08/08 | 65.0 | 54.0 | 42.0 | 29.0 | 57.0 | 45.0 | 32.0 | 23.0 | 26.0 | 23.0 | 44.0 | 60.0 |
04-08/08 | 64.0 | 58.0 | 38.0 | 26.0 | 49.0 | 45.0 | 36.0 | 25.0 | 24.0 | 22.0 | 49.0 | 58.0 |
05-08/08 | 50.0 | 48.0 | 49.0 | 26.0 | 42.0 | 43.0 | 39.0 | 25.0 | 26.0 | 24.0 | 54.0 | 49.0 |
06-08/08 | 39.0 | 50.0 | 45.0 | 22.0 | 54.0 | 46.0 | 51.0 | 29.0 | 31.0 | 30.0 | 57.0 | 36.0 |
07-08/08 | 53.0 | 62.0 | 42.0 | 54.0 | 58.0 | 56.0 | 63.0 | 33.0 | 29.0 | 35.0 | 45.0 | 43.0 |
08-08/08 | 64.0 | 62.0 | 69.0 | 63.0 | 69.0 | 60.0 | 65.0 | 39.0 | 35.0 | 38.0 | 52.0 | 51.0 |
09-08/08 | 70.0 | 77.0 | 77.0 | 53.0 | 79.0 | 52.0 | 60.0 | 31.0 | 24.0 | 39.0 | 54.0 | 63.0 |
10-08/08 | 73.0 | 82.0 | 83.0 | 52.0 | 85.0 | 67.0 | 62.0 | 32.0 | 29.0 | 27.0 | 60.0 | 65.0 |
11-08/08 | 64.0 | 74.0 | 74.0 | 44.0 | 88.0 | 53.0 | 65.0 | 37.0 | 40.0 | 36.0 | 58.0 | 63.0 |
12-08/08 | 64.0 | 73.0 | 67.0 | 50.0 | 60.0 | 57.0 | 68.0 | 38.0 | 40.0 | 33.0 | 52.0 | 58.0 |
13-08/08 | 64.0 | 94.0 | 75.0 | 65.0 | 65.0 | 78.0 | 99.0 | 48.0 | 45.0 | 44.0 | 83.0 | 79.0 |
14-08/08 | 84.0 | 111.0 | 104.0 | 88.0 | 100.0 | 88.0 | 111.0 | 53.0 | 50.0 | 77.0 | 94.0 | 89.0 |
15-08/08 | 77.0 | 122.0 | 114.0 | 98.0 | 106.0 | 107.0 | 118.0 | 74.0 | 85.0 | 100.0 | 97.0 | 92.0 |
16-08/08 | 57.0 | 112.0 | 108.0 | 84.0 | 89.0 | 101.0 | 109.0 | 102.0 | 111.0 | 113.0 | 101.0 | 109.0 |
17-08/08 | 48.0 | 88.0 | 88.0 | 62.0 | 60.0 | 72.0 | 75.0 | 120.0 | 112.0 | 130.0 | 63.0 | 94.0 |
18-08/08 | 50.0 | 85.0 | 80.0 | 50.0 | 65.0 | 53.0 | 63.0 | 128.0 | 108.0 | 127.0 | 47.0 | 83.0 |
19-08/08 | 48.0 | 74.0 | 70.0 | 52.0 | 55.0 | 60.0 | 54.0 | 122.0 | 84.0 | 82.0 | 48.0 | 65.0 |
20-08/08 | 46.0 | 52.0 | 46.0 | 59.0 | 51.0 | 52.0 | 47.0 | 57.0 | 57.0 | 48.0 | 53.0 | 54.0 |
21-08/08 | 49.0 | 52.0 | 55.0 | 72.0 | 48.0 | 49.0 | 51.0 | 51.0 | 60.0 | 53.0 | 52.0 | 62.0 |
22-08/07 | 40.0 | 59.0 | 59.0 | 41.0 | 43.0 | 51.0 | 40.0 | 24.0 | 27.0 | 41.0 | 61.0 | 54.0 |
23-08/07 | 55.0 | 62.0 | 37.0 | 51.0 | 36.0 | 52.5 | 39.0 | 35.0 | 24.0 | 37.0 | 39.0 | 60.0 |
plt.rcParams['font.sans-serif']=['SimHei'] #正常显示中文
aqi_all_station.plot(figsize = (20, 5))
plt.xticks(np.arange(24), aqi_all_station.index.values, fontsize = 15, rotation = 30)
plt.grid()
plt.show()
"aqi"表示空气质量指数(数值越低空气质量越好),从上图可以看出,各个城区全天24小时的aqi指数变化趋势,总体的最值大约出现在下午15时至18时,18时之后快速下降。其中,“万寿西宫,怀柔和昌平”aqi最高值出现在下午17时-18时时段,除此之外的其他城区aqi最高值均出现在下午15时。
计算各城区一天24小时aqi指数平均值
plt.figure(figsize = (12,6))
aqi_all_station.mean().reset_index(level = 0, drop = True).sort_values().plot.barh()
plt.xlabel('aqi', fontsize = 20)
Text(0.5, 0, 'aqi')
从图上可以看出,空气质量指数值最高的前五名分别为:东四、顺义新城、农展馆、天坛、官园,除顺义新城外其余四处均在三环以内,这其中天坛、官园和东四又均地处二环以内,这说明越靠近市中心,空气质量越差
data_test.head()
Unnamed: 0 | last_update | aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 09-08/08 | 78 | 0.817 | 29 | 61 | 61 | 57 | 1 | 50 | 28 | 77 | 1001 | 28 | 3.9 | 11.16 |
1 | 1 | 08-08/08 | 80 | 0.850 | 29 | 49 | 59 | 59 | 1 | 50 | 27 | 79 | 1001 | 27 | 3.7 | 8.64 |
2 | 2 | 07-08/08 | 83 | 0.842 | 30 | 39 | 62 | 61 | 2 | 50 | 26 | 83 | 1001 | 27 | 3.1 | 8.64 |
3 | 3 | 06-08/08 | 84 | 0.817 | 28 | 44 | 61 | 62 | 2 | 50 | 25 | 85 | 1001 | 26 | 3.1 | 6.12 |
4 | 4 | 05-08/08 | 80 | 0.792 | 29 | 49 | 62 | 59 | 2 | 50 | 25 | 86 | 1001 | 25 | 3.1 | 9.00 |
data_train = data_train.drop(['Unnamed: 0', ], axis = 1)
data_test = data_test.drop('Unnamed: 0', axis = 1)
def every_station_plot(station):
data_wanshou = data_train[data_train['station'] == station]
fig, ax = plt.subplots(2,1, figsize = (20,12))
data_wanshou = data_wanshou.set_index('last_update')
data_wanshou = data_wanshou.replace({0:np.nan}).fillna(data_wanshou.median())
data_wanshou.drop(['co', 'pressure', 'so2', 'visibility', 'wind_speed', 'clouds'], axis = 1).sort_values(by = 'last_update').plot(ax = ax[0], marker = 'o')
data_wanshou[['co', 'so2', 'visibility', 'wind_speed']].sort_values(by = 'last_update').plot(ax = ax[1], marker = 'o')
ax[0].set_title('aqi, no2, o3, pm10, pm25, feels_like, humidity, temperature')
ax[1].set_title('co, so2, visibility, wind_speed')
plt.suptitle(station, fontsize = 20)
根据不同的城区,分别绘制各城区全天24小时各项污染物随时间变化的趋势曲线图
由于co、so2、能见度、风速四项指标的数量级与其他指标相差较多,为使图形能够明显反映出各污染物变化趋势,故将这四项指标分开作图
for st in data_train['station'].unique():
every_station_plot(st)
各特征变量间相关性热力图
plt.figure(figsize = (15, 15))
sns.heatmap(data_train.corr(), annot = True, cmap='RdYlGn',linewidths=0.2)
data_train = data_train.drop(['last_update', 'station'], axis = 1)
data_train = data_train.replace({0:np.nan}).fillna(data_train.median())
data_train.head()
aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49.0 | 0.7 | 29.0 | 111.0 | 49.0 | 31.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
1 | 51.0 | 0.7 | 11.0 | 108.0 | 52.0 | 33.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
2 | 52.0 | 0.8 | 19.0 | 124.0 | 53.0 | 30.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
3 | 48.0 | 0.6 | 20.0 | 88.0 | 46.0 | 33.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
4 | 55.0 | 0.7 | 23.0 | 118.0 | 59.0 | 32.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
data_test = data_test.drop(['last_update'], axis = 1)
data_test
aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 78 | 0.817 | 29 | 61 | 61 | 57 | 1 | 50 | 28 | 77 | 1001 | 28 | 3.9 | 11.16 |
1 | 80 | 0.850 | 29 | 49 | 59 | 59 | 1 | 50 | 27 | 79 | 1001 | 27 | 3.7 | 8.64 |
2 | 83 | 0.842 | 30 | 39 | 62 | 61 | 2 | 50 | 26 | 83 | 1001 | 27 | 3.1 | 8.64 |
3 | 84 | 0.817 | 28 | 44 | 61 | 62 | 2 | 50 | 25 | 85 | 1001 | 26 | 3.1 | 6.12 |
4 | 80 | 0.792 | 29 | 49 | 62 | 59 | 2 | 50 | 25 | 86 | 1001 | 25 | 3.1 | 9.00 |
5 | 80 | 0.775 | 28 | 54 | 65 | 59 | 2 | 50 | 26 | 82 | 1001 | 26 | 3.5 | 9.00 |
6 | 75 | 0.767 | 27 | 61 | 64 | 55 | 2 | 50 | 26 | 79 | 1001 | 27 | 4.6 | 9.72 |
7 | 69 | 0.808 | 25 | 73 | 65 | 50 | 2 | 50 | 27 | 79 | 1002 | 27 | 4.3 | 9.36 |
8 | 67 | 0.817 | 25 | 81 | 69 | 48 | 2 | 50 | 27 | 79 | 1002 | 27 | 4.1 | 9.00 |
9 | 62 | 0.767 | 26 | 88 | 67 | 44 | 2 | 50 | 27 | 79 | 1002 | 27 | 5.1 | 7.56 |
10 | 59 | 0.758 | 25 | 103 | 60 | 42 | 2 | 90 | 28 | 78 | 1002 | 28 | 6.2 | 6.48 |
11 | 56 | 0.733 | 24 | 114 | 61 | 39 | 2 | 90 | 28 | 74 | 1002 | 28 | 8.3 | 10.08 |
y_train = data_train['pm25'].values
x_train = data_train.drop(['pm25'], axis = 1).values
把PM2.5提取出来做真实值
y_true = data_test['pm25'].values
x_test = data_test.drop(['pm25'], axis = 1).values
先用普通线性回归模型预测
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
y_esti = model.predict(x_test)
这是预测结果值
y_esti
array([41.62610649, 43.71780508, 45.51334405, 44.44962464, 39.64758352,
40.84696283, 39.40450324, 41.28101583, 41.70510816, 38.61434584,
38.2644185 , 33.7721591 ])
这是真实值
y_true
array([57, 59, 61, 62, 59, 59, 55, 50, 48, 44, 42, 39], dtype=int64)
用残差平方和查看预测效果,公式如下: s c o r e = 1 n ∑ 1 n ( y i − y ∗ ) 2 score = \frac{1}{n}\sum_1^n(y_i-y^*)^2 score=n11∑n(yi−y∗)2
((y_esti - y_true)**2).mean()
179.24659005909845
model.score(x_test, y_true)
-2.051366469855796
def gen_polynomial(x, power):
p = x[:,0].reshape((len(x), 1))
a = x[:,1].reshape((len(x), 1))
s = x[:,2].reshape((len(x), 1))
for i in range(2, power + 1):
for j in range(i + 1):
for k in range(i - j + 1):
#print(f'{j}{k}{i-j-k}')
add = (p**j) * (a**k) * (s**(i-j-k))
x = np.concatenate([x, add], axis = 1)
return x
x_train_poly = gen_polynomial(x_train, 3)
x_test_poly = gen_polynomial(x_test, 3)
model2 = LinearRegression()
model2.fit(x_train_poly, y_train)
y_esti_poly = model2.predict(x_test_poly)
((y_esti_poly - y_true)**2).mean()
35.94599749989958
model2.score(x_test_poly, y_true)
0.38808090318175426
再用Ridge岭回归模型建模
Ridge岭回归模型适用于解决两类问题:一是样本少于变量个数,二是变量间存在多重共线性
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
reg = Ridge(alpha = 0.000001)
reg.fit(x_train_poly, y_train)
D:\ProgramData\lib\site-packages\sklearn\linear_model\ridge.py:125: LinAlgWarning: Ill-conditioned matrix (rcond=4.03488e-18): result may not be accurate.
overwrite_a=True).T
Ridge(alpha=1e-06, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001)
y_esti_reg_poly = reg.predict(x_test_poly)
((y_esti_reg_poly - y_true)**2).mean()
35.9459549360529
用RidgeCV模型
多个alpha值输入,得出多个对应最佳的 θ \theta θ,然后计算最佳的 θ \theta θ及对应的alpha
reg_cv = RidgeCV(alphas=[0.001, 10, 10000])
reg_cv.fit(x_train_poly, y_train)
RidgeCV(alphas=array([1.e-03, 1.e+01, 1.e+04]), cv=None, fit_intercept=True,
gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)
y_esti_regcv_poly = reg_cv.predict(x_test_poly)
((y_esti_regcv_poly - y_true)**2).mean()
42.06173546500281
再用Lasso回归建模,用于估计稀疏系数的线性模型
Lasso模型适用于参数少的情况,产生稀疏矩阵
from sklearn.linear_model import Lasso
lasso = Lasso(alpha = 0.00001, max_iter = 1000000)
lasso.fit(x_train_poly, y_train)
Lasso(alpha=1e-05, copy_X=True, fit_intercept=True, max_iter=1000000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
y_esti_lasso_poly = lasso.predict(x_test_poly)
((y_esti_lasso_poly - y_true)**2).mean()
35.944247424908795
lasso.score(x_test_poly, y_true)
0.3881106952137526
plt.figure(figsize = (14,8))
plt.plot(y_true, '--', label = 'real')
plt.plot(y_esti, label = 'normal')
plt.plot(y_esti_poly, label = 'normal_poly', marker = 'o')
plt.plot(y_esti_reg_poly, label = 'ridge_poly')
plt.plot(y_esti_regcv_poly, label = 'ridgeCV_poly')
plt.plot(y_esti_lasso_poly, label = 'lasso_poly')
plt.legend()
uu = pd.read_excel('D:/Python/exercise/samples/WeatherData/backup/data_test(22-09).xlsx')
plt.xticks(np.arange(12), uu.last_update.values, rotation = 30, fontsize = 12)
plt.grid()
plt.title('预测值对比')
plt.show()
从上图可以看出各种模型建预测值的对比,误差最大的是用原数据进行普通线型回归的结果,其他四种方式的训练集加入了多项式,准确度大幅提高且结果相近,其中普通线性回归和ridge回归结果几乎重合
各模型残差平方和评分如下:
compare_table = pd.DataFrame(np.zeros((5,2)), index = ['normal', 'normal_poly', 'ridge_poly', 'ridgeCV_poly', 'lasso'], columns = ['MSE', 'score'])
mse = np.array([((y_esti - y_true)**2).mean(),
((y_esti_poly - y_true)**2).mean(),
((y_esti_reg_poly - y_true)**2).mean(),
((y_esti_regcv_poly - y_true)**2).mean(),
((y_esti_lasso_poly - y_true)**2).mean()])
score = np.array([model.score(x_test, y_true),
model2.score(x_test_poly, y_true),
reg.score(x_test_poly, y_true),
reg_cv.score(x_test_poly, y_true),
lasso.score(x_test_poly, y_true)])
compare_table['MSE'] = mse
compare_table['score'] = score
compare_table
MSE | score | |
---|---|---|
normal | 179.246590 | -2.051366 |
normal_poly | 35.945997 | 0.388081 |
ridge_poly | 35.945955 | 0.388082 |
ridgeCV_poly | 42.061735 | 0.283971 |
lasso | 45.371600 | 0.388111 |
预测值有一定偏差,但基本反映了变化趋势。经过几个模型的筛选,最终还是普通线性回归模型效果稍好。