In this section, we will fit an LSTM to the problem.
The first step is to prepare the pollution dataset for the LSTM.
This involves framing the dataset as a supervised learning problem and normalizing the input variables.
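Normalization here means rescaling each input variable to the [0, 1] range. A minimal MinMaxScaler sketch with made-up values:

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# made-up values for a single input variable
values = np.array([[10.0], [20.0], [30.0]])

# rescale to the [0, 1] range: x' = (x - min) / (max - min)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
print(scaled.ravel())  # [0.  0.5 1. ]
```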
We will frame the supervised learning problem as predicting the pollution at the current hour (t) given the pollution measurement and weather conditions at the prior time step. (The "pollution" value here is simply the PM2.5 concentration from the dataset.)
This formulation is straightforward and just for this demonstration. Some alternate formulations you could explore include:
We can transform the dataset using the series_to_supervised() function developed in the blog post:
First, the “pollution.csv” dataset is loaded. The wind speed feature is label encoded (integer encoded). This could further be one-hot encoded in the future if you are interested in exploring it.
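As a sketch of the two encodings (hypothetical wind-direction values; `pd.get_dummies` stands in for one-hot encoding):

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# hypothetical wind-direction strings, as in the dataset's categorical column
directions = ['SE', 'cv', 'NW', 'NE', 'SE']

# label (integer) encoding: one integer per category, a single column
encoder = LabelEncoder()
labels = encoder.fit_transform(directions)
print(labels)  # [2 3 1 0 2] -- classes are sorted: NE=0, NW=1, SE=2, cv=3

# one-hot encoding: one binary indicator column per category
onehot = pd.get_dummies(directions)
print(onehot)
```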
Next, all features are normalized, then the dataset is transformed into a supervised learning problem. The weather variables for the hour to be predicted (t) are then removed.
The complete code listing is provided below.
from pandas import DataFrame
from pandas import concat
from pandas import read_csv
from pandas import set_option
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode wind direction
encoder = LabelEncoder()
values[:, 4] = encoder.fit_transform(values[:, 4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9, 10, 11, 12, 13, 14, 15]], axis=1, inplace=True)
# show all columns when printing
set_option('display.max_columns', None)
print(reframed.head())
Running the example prints the first 5 rows of the transformed dataset. We can see the 8 input variables (input series) and the 1 output variable (pollution level at the current hour).
   var1(t-1)  var2(t-1)  var3(t-1)  var4(t-1)  var5(t-1)  var6(t-1)  var7(t-1)  \
1   0.129779   0.352941   0.245902   0.527273   0.666667   0.002290   0.000000
2   0.148893   0.367647   0.245902   0.527273   0.666667   0.003811   0.000000
3   0.159960   0.426471   0.229508   0.545454   0.666667   0.005332   0.000000
4   0.182093   0.485294   0.229508   0.563637   0.666667   0.008391   0.037037
5   0.138833   0.485294   0.229508   0.563637   0.666667   0.009912   0.074074

   var8(t-1)   var1(t)
1        0.0  0.148893
2        0.0  0.159960
3        0.0  0.182093
4        0.0  0.138833
5        0.0  0.109658
This data preparation is simple and there is more we could explore. Some ideas you could look at include:
This last point is perhaps the most important, given the use of Backpropagation Through Time by LSTMs when learning sequence prediction problems.
Key concepts used:
DataFrame.astype(dtype, copy=True, errors='raise', **kwargs)

Cast a pandas object to a specified dtype.

Parameters:
dtype : data type, or dict of column name -> data type
copy : bool, default True
errors : {'raise', 'ignore'}, default 'raise'
raise_on_error : raise on invalid input

Returns:
casted : type of caller
See also
pandas.to_datetime
pandas.to_timedelta
pandas.to_numeric
numpy.ndarray.astype
Examples
>>> ser = pd.Series([1, 2], dtype='int32')
>>> ser
0 1
1 2
dtype: int32
>>> ser.astype('int64')
0 1
1 2
dtype: int64
Convert to categorical type:
>>> ser.astype('category')
0 1
1 2
dtype: category
Categories (2, int64): [1, 2]
Convert to ordered categorical type with custom ordering:
>>> ser.astype('category', ordered=True, categories=[2, 1])
0 1
1 2
dtype: category
Categories (2, int64): [2 < 1]
Note that using copy=False and changing data on a new pandas object may propagate changes:
>>> s1 = pd.Series([1,2])
>>> s2 = s1.astype('int64', copy=False)
>>> s2[0] = 10
>>> s1 # note that s1[0] has changed too
0 10
1 2
dtype: int64
pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
Concatenate pandas objects along a particular axis with optional set logic along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.
Parameters:
objs : a sequence or mapping of Series, DataFrame, or Panel objects
axis : {0/'index', 1/'columns'}, default 0
join : {'inner', 'outer'}, default 'outer'
join_axes : list of Index objects
ignore_index : boolean, default False
keys : sequence, default None
levels : list of sequences, default None
names : list, default None
verify_integrity : boolean, default False
sort : boolean, default None
copy : boolean, default True

Returns:
concatenated : object, type of objs
See also
Series.append
DataFrame.append
DataFrame.join
DataFrame.merge
Notes
The keys, levels, and names arguments are all optional.
A walkthrough of how this method fits in with other tools for combining pandas objects can be found here.
Examples
Combine two Series.
>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2])
0 a
1 b
0 c
1 d
dtype: object
Clear the existing index and reset it in the result by setting the ignore_index option to True.
>>> pd.concat([s1, s2], ignore_index=True)
0 a
1 b
2 c
3 d
dtype: object
Add a hierarchical index at the outermost level of the data with the keys option.
>>> pd.concat([s1, s2], keys=['s1', 's2',])
s1 0 a
1 b
s2 0 c
1 d
dtype: object
Label the index keys you create with the names option.
>>> pd.concat([s1, s2], keys=['s1', 's2'],
... names=['Series name', 'Row ID'])
Series name Row ID
s1 0 a
1 b
s2 0 c
1 d
dtype: object
Combine two DataFrame objects with identical columns.
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
... columns=['letter', 'number'])
>>> df1
letter number
0 a 1
1 b 2
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
... columns=['letter', 'number'])
>>> df2
letter number
0 c 3
1 d 4
>>> pd.concat([df1, df2])
letter number
0 a 1
1 b 2
0 c 3
1 d 4
Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN values.
>>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
... columns=['letter', 'number', 'animal'])
>>> df3
letter number animal
0 c 3 cat
1 d 4 dog
>>> pd.concat([df1, df3])
animal letter number
0 NaN a 1
1 NaN b 2
0 cat c 3
1 dog d 4
Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument.
>>> pd.concat([df1, df3], join="inner")
letter number
0 a 1
1 b 2
0 c 3
1 d 4
Combine DataFrame objects horizontally along the x axis by passing in axis=1.
>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
... columns=['animal', 'name'])
>>> pd.concat([df1, df4], axis=1)
letter number animal name
0 a 1 bird polly
1 b 2 monkey george
Prevent the result from including duplicate index values with the verify_integrity option.
>>> df5 = pd.DataFrame([1], index=['a'])
>>> df5
0
a 1
>>> df6 = pd.DataFrame([2], index=['a'])
>>> df6
0
a 2
>>> pd.concat([df5, df6], verify_integrity=True)
Traceback (most recent call last):
...
ValueError: Indexes have overlapping values: ['a']
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Remove missing values.
See the User Guide for more on which values are considered missing, and how to work with missing data.
Parameters:
axis : {0 or 'index', 1 or 'columns'}, default 0
how : {'any', 'all'}, default 'any'
thresh : int, optional
subset : array-like, optional
inplace : bool, default False

Returns:
DataFrame
See also
DataFrame.isna
DataFrame.notna
DataFrame.fillna
Series.dropna
Index.dropna
Examples
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
... "toy": [np.nan, 'Batmobile', 'Bullwhip'],
... "born": [pd.NaT, pd.Timestamp("1940-04-25"),
... pd.NaT]})
>>> df
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna()
name toy born
1 Batman Batmobile 1940-04-25
Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns')
name
0 Alfred
1 Batman
2 Catwoman
Drop the rows where all elements are missing.
>>> df.dropna(how='all')
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2)
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'born'])
name toy born
1 Batman Batmobile 1940-04-25
Keep the DataFrame with valid entries in the same variable.
>>> df.dropna(inplace=True)
>>> df
name toy born
1 Batman Batmobile 1940-04-25
An aside: what is the difference between a Series and a DataFrame? In short, a Series is one-dimensional and a DataFrame is two-dimensional.
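A minimal sketch of the distinction, with made-up values:

```python
import pandas as pd

# a Series: one-dimensional, a single labeled column of values
s = pd.Series([10, 20, 30], name='pm25')
print(s.ndim)  # 1

# a DataFrame: two-dimensional, a table of rows and columns
df = pd.DataFrame({'pm25': [10, 20, 30], 'temp': [-5, -4, -6]})
print(df.ndim)  # 2

# each column of a DataFrame is itself a Series
print(type(df['pm25']))  # <class 'pandas.core.series.Series'>
```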
range(10, 0, -1) counts down from 10 to 1: it starts at 10 and stops before reaching 0, so the stop value is excluded. It is equivalent to range(1, 11) reversed, yielding [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]. In series_to_supervised(), range(n_in, 0, -1) therefore walks the lags from the oldest (t-n_in) down to the most recent (t-1).
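The same countdown drives the lag columns in series_to_supervised(); a toy shift() example with made-up values makes it concrete:

```python
from pandas import DataFrame

# toy single-variable series
df = DataFrame({'var1': [1, 2, 3, 4]})
n_in = 2

# range(n_in, 0, -1) yields 2, 1: the oldest lag comes first, t-1 last
for i in range(n_in, 0, -1):
    df['var1(t-%d)' % i] = df['var1'].shift(i)

# the first n_in rows contain NaN and would be dropped by dropna()
print(df)
```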
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file; if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns, e.g. [0, 1, 3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to not use the first column as the index (row names).
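A minimal sketch of these two parameters, with inline CSV text standing in for a file:

```python
from io import StringIO
import pandas as pd

# inline CSV text stands in for a file on disk
csv_text = "date,pm25,temp\n2010-01-02,129,-16\n2010-01-03,148,-15\n"

# header=0: take the column names from the first line;
# index_col=0: use the first column ('date') as the row index
df = pd.read_csv(StringIO(csv_text), header=0, index_col=0)
print(df.columns.tolist())  # ['pm25', 'temp']
print(df.index.tolist())    # ['2010-01-02', '2010-01-03']
```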
If the printed output is truncated with ellipses, add the following options as needed:
# 'display.height' only exists in older pandas versions and was removed later
# pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)