今日to do list:
我在预测一个名字叫做elborn基站的下行链路流量,用过去29天的数据预测未来10天的数据
一般必须都要导入的库有
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # linear algebra
import warnings
warnings.filterwarnings('ignore') # 忽略警告信息
import matplotlib.pyplot as plt
对csv数据使用pandas.read_csv函数读取
一些参数:
elborn_df = pd.read_csv('dataset/ElBorn.csv')
elborn_test_df = pd.read_csv('dataset/ElBorn_test.csv')
basic_eda
def basic_eda(df):
print("-------------------------------TOP 5 RECORDS-----------------------------")
print(df.head(5))
print("-------------------------------INFO--------------------------------------")
print(df.info())
print("-------------------------------Describe----------------------------------")
print(df.describe())
print("-------------------------------Columns-----------------------------------")
print(df.columns)
print("-------------------------------Data Types--------------------------------")
print(df.dtypes)
print("----------------------------Missing Values-------------------------------")
print(df.isnull().sum())
print("----------------------------NULL values----------------------------------")
print(df.isna().sum())
print("--------------------------Shape Of Data---------------------------------")
print(df.shape)
print("============================================================================ \n")
basic_eda(elborn_df)
basic_eda(elborn_test_df)
然后画图看一下
# 我现在想把elborn_df画出来,横坐标是时间,纵坐标是down,并且横坐标的标签要旋转45度书写
plt.plot(elborn_df.index, elborn_df.down)
plt.xlabel('Time')
plt.ylabel('Down')
plt.title('Down')
# 我想把横坐标的日期标签旋转45
plt.xticks(rotation=45)
elborn_df.set_index(pd.DatetimeIndex(elborn_df["time"]), inplace=True)
elborn_df.drop(["time"], axis=1, inplace=True)
不填充的话后续fit模型的时候会出现loss全部为NAN的情况
elborn_df.down.fillna(elborn_df.down.mean(), inplace=True)
print(elborn_df.isna().sum())
在训练监督学习(深度学习)模型前,要把time series数据转化成samples的形式
那什么是sample?有一个输入组件 X X X和一个输出组件 y y y
深度学习模型就是一个映射函数: y = f ( X ) y=f(X) y=f(X)
对于一个单变量的one-step预测:输入组件就是前一个时间步的滞后数据,输出组件就是当前时间步的数据,如下:
X, y
[1, 2, 3], [4]
[2, 3, 4], [5]
[3, 4, 5], [6]
…
这里就是手动转换啦,之前写过使用TimeseriesGenerator自动转换的方法,看看对比
def series_to_supervised(data, window=3, lag=1, dropnan=True):
cols, names = list(), list()
# Input sequence (t-n, ... t-1)
for i in range(window, 0, -1):
cols.append(data.shift(i))
names += [('%s(t-%d)' % (col, i)) for col in data.columns]
# Current timestep (t=0)
cols.append(data)
names += [('%s(t)' % (col)) for col in data.columns]
# Target timestep (t=lag)
cols.append(data.shift(-lag))
names += [('%s(t+%d)' % (col, lag)) for col in data.columns]
# Put it all together
agg = pd.concat(cols, axis=1)
agg.columns = names
return agg
window =29
lag = 10
elborn_df_supervised = series_to_supervised(elborn_df, window, lag)
训练集和测试集的区别
训练集在建模过程中会被大量经常使用,验证集用于对模型少量偶尔的调整,而测试集只作为最终模型的评价出现,因此训练集,验证集和测试集所需的数据量也是不一致的,在数据量不是特别大的情况下一般遵循6:2:2的划分比例
为了使模型“训练”效果能合理泛化至“测试”效果,从而推广应用至现实世界中,因此一般要求训练集,验证集和测试集数据分布近似。但需要注意,三个数据集所用数据是不同的。
from sklearn.model_selection import train_test_split
label_name = 'down(t+%d)' % (lag)
label = elborn_df_supervised[label_name]
elborn_df_supervised = elborn_df_supervised.drop(label_name, axis=1)
X_train, X_valid, Y_train, Y_valid = train_test_split(elborn_df_supervised, label, test_size=0.4, random_state=0)
print('Train set shape', X_train.shape)
print('Validation set shape', X_valid.shape)
epochs = 40
batch = 256
lr = 0.0003
adam = optimizers.Adam(lr)
model_mlp = Sequential()
model_mlp.add(Dense(100, activation='relu', input_dim=X_train.shape[1]))
model_mlp.add(Dense(1))
model_mlp.compile(loss='mse', optimizer=adam)
model_mlp.summary()
mlp_hitstory = model_mlp.fit(X_train.values, Y_train, epochs=epochs, batch_size=batch, validation_data=(X_valid.values, Y_valid), verbose=2)
# 画图,横坐标是epochs,纵坐标是loss,分别画出train loss和validation loss
import matplotlib.pyplot as plt
plt.plot(mlp_hitstory.history['loss'])
plt.plot(mlp_hitstory.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
from sklearn.metrics import mean_squared_error
mlp_train_pred = model_mlp.predict(X_train.values)
mlp_valid_pred = model_mlp.predict(X_valid.values)
print('Train rmse:', np.sqrt(mean_squared_error(Y_train, mlp_train_pred)))
print('Validation rmse:', np.sqrt(mean_squared_error(Y_valid, mlp_valid_pred)))