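sktime is a relatively new library for working with time series data. It covers several tasks (classification, regression, clustering, forecasting, and annotation) and is worth a closer look.
sktime works with a data layout called the nested format. To convert other kinds of data into nested form, sktime provides several conversion utilities, each suited to a different input layout: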
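1. from_2d_array_to_nested: converts 2D data, also called tabular data, a layout that can hold only a single variable. Axis 0 indexes the instances (for example, patients), and axis 1 holds that variable's measurements at successive time points (for example, a drug concentration measured at several times).
Import: from sktime.datatypes._panel._convert import (from_2d_array_to_nested, from_nested_to_2d_array, is_nested_dataframe)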
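2. from_long_to_nested: converts long-format data. Long-format data stacks every measurement along axis 0, while axis 1 lists a few descriptor columns next to the measured value: case_id (the instance number, e.g. a patient), dim_id (the dimension number, identifying which variable was measured, e.g. heart rate or blood pressure), and reading_id (the position of a reading within its series; the number of distinct values is the series length, series_len, e.g. 24 for hourly readings over one day).
3. from_multi_index_to_nested: converts a multi-indexed pandas DataFrame. Specifically, there are two index levels, case_id and reading_id, and the dimensions (variables) are spread along axis 1.
4. from_3d_numpy_to_nested: converts a 3D NumPy array. A 3D NumPy array is similar to the multi-indexed data, except that the indices are the default positional ones.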
These four should be enough for most purposes, although sktime also provides many other ways of reading data.
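To make "nested" concrete before the sktime demos, here is a small hand-built sketch using only NumPy and pandas. It is not how sktime constructs the frame internally; the column name var_0 and the toy numbers are my own assumptions. The point is only that every cell of a nested DataFrame holds a whole pd.Series.

```python
import numpy as np
import pandas as pd

# Three instances (e.g. patients), each with four readings of one variable.
X_2d = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
])

# Build the nested frame by hand: one row per instance, one column per
# variable, and each cell holds that instance's whole series as a pd.Series.
# "var_0" is an illustrative column name.
X_nested = pd.DataFrame({"var_0": [pd.Series(row) for row in X_2d]})

print(X_nested.shape)                # one row per instance, one column
print(type(X_nested.iloc[0, 0]))     # the cell is itself a Series
print(X_nested.iloc[1, 0].tolist())  # readings of the second instance
```

Each of the converters listed above should end up with a frame of this general shape: rows are instances, columns are variables, and cells are series.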
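Note: time series timestamps are usually written as year-month-day hour:minute:second. If a measurement is taken once per hour and each day is treated as one series, reading_id runs from 0 to 23 and case_id is the day number. If the same data are instead organized into monthly periods, reading_id runs from 0 to 24*30 - 1 (taking a month as 30 days) and case_id counts the months.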
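The arithmetic in this note can be sketched with plain integer division. The helper name to_case_and_reading is hypothetical (it is not part of sktime), and the 30-day month is the same simplification the note makes.

```python
def to_case_and_reading(hour_index, period):
    """Map a running (0-based) hour count to (case_id, reading_id)."""
    return divmod(hour_index, period)


HOURS_PER_DAY = 24
HOURS_PER_MONTH = 24 * 30  # assumes 30-day months, as in the note

# Hour index 50 with daily periods: day 2, reading 2 of that day.
print(to_case_and_reading(50, HOURS_PER_DAY))      # (2, 2)
# The same hour with monthly periods: still month 0, reading 50.
print(to_case_and_reading(50, HOURS_PER_MONTH))    # (0, 50)
# Hour index 1000 with monthly periods: month 1, reading 280.
print(to_case_and_reading(1000, HOURS_PER_MONTH))  # (1, 280)
```

The same raw measurements therefore give many short series or a few long ones, depending only on the period chosen.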
The following demonstrates these with code taken from the sktime examples:
# 1. Univariate 2D data
import sktime
import pandas as pd
from numpy.random import default_rng
from sktime.datatypes._panel._convert import (
from_2d_array_to_nested,
from_nested_to_2d_array,
is_nested_dataframe,
)
rng = default_rng()
X_2d = rng.standard_normal((50, 20))
print(f"The tabular data has the shape {X_2d.shape}")
print(pd.DataFrame(X_2d).head(3))
print(pd.DataFrame(X_2d).tail(3))
output:
The tabular data has the shape (50, 20)
0 1 2 3 4 5 6
0 0.951345 -1.315426 0.214762 -0.136433 0.247084 -1.932268 0.785887
1 -0.010882 1.698942 0.781392 0.007080 -0.734776 0.233316 -1.657900
2 1.018911 -0.964553 -0.682563 0.702798 -0.032204 -1.105005 0.095655
7 8 9 10 11 12 13 \
0 -0.042764 0.569522 1.017830 0.186783 0.249294 0.244153 -0.985478
1 -0.279308 0.831627 0.765397 1.470033 -0.484943 -0.084951 -1.571112
2 -0.861139 0.122383 -0.389639 1.494302 0.343313 0.297859 0.142106
14 15 16 17 18 19
0 1.862123 -0.193330 -0.433783 0.750041 -0.500827 -2.268721
1 0.047955 0.664481 -1.374092 -0.187755 1.350386 -0.360479
2 -0.572279 -0.419691 -0.341976 0.008623 0.901157 0.709582
0 1 2 3 4 5 6
47 -2.027696 0.868153 0.083681 0.045453 -0.199436 -0.960754 0.611705
48 1.278521 0.740739 -0.581048 -0.274030 0.559486 -1.960081 -0.527335
49 0.256974 -1.053948 -0.180352 0.495492 -0.229110 3.771682 0.350383
7 8 9 10 11 12 13 \
47 -0.909587 0.887509 0.960593 -0.712712 0.668460 0.539432 -1.276410
48 -1.160588 0.541308 -1.091403 2.419113 -0.700643 1.003165 -1.082824
49 0.555191 -1.780015 -0.549731 1.503096 -1.293898 0.111052 0.422830
14 15 16 17 18 19
47 2.669976 0.632130 -0.482352 -0.309763 1.390071 -0.773413
48 -0.796634 0.635805 0.935867 0.563172 -1.042389 -0.277242
49 1.761040 -1.392075 0.873080 -1.138395 -0.788783 1.283540
X_nested = from_2d_array_to_nested(X_2d)
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
print(X_nested.head(1))
print(X_nested.tail(1))
# This layout has a single dimension, i.e. one variable (hence one column), 20 time points (series_len=20), and 50 cases.
output:
X_nested is a nested DataFrame: True
The cell contains a <class 'pandas.core.series.Series'>.
The nested DataFrame has shape (50, 1)
0
0 0 0.951345
1 -1.315426
2 0.214762
3…
0
49 0 0.256974
1 -1.053948
2 -0.180352
3…
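The import block above also brings in from_nested_to_2d_array, which the demo never calls. As a rough illustration of what the reverse direction involves, here is a hand-rolled sketch using only NumPy and pandas. The hand-built nested frame and the name X_back are my own stand-ins and not sktime's implementation; in practice sktime's own function is the one to use.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X_2d = rng.standard_normal((5, 4))

# Hand-built single-column nested frame, standing in for the result of
# from_2d_array_to_nested.
X_nested = pd.DataFrame({0: [pd.Series(row) for row in X_2d]})

# Stack the series stored in the only column back into the rows of a 2D array.
X_back = np.vstack([series.to_numpy() for series in X_nested.iloc[:, 0]])

print(X_back.shape)
print(np.allclose(X_back, X_2d))  # the round trip preserves the values
```

This only works because every series has the same length; unequal-length series would not fit into a rectangular 2D array.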
from sktime.datasets import generate_example_long_table
X = generate_example_long_table(num_cases=50, series_len=20, num_dims=5)
print(X.head())
print(X.tail())
# Long-format data
case_id dim_id reading_id value
0 0 0 0 0.562555
1 0 0 1 0.280241
2 0 0 2 0.738769
3 0 0 3 0.258843
4 0 0 4 0.354176
case_id dim_id reading_id value
4995 49 4 15 0.734340
4996 49 4 16 0.991652
4997 49 4 17 0.665473
4998 49 4 18 0.272355
4999 49 4 19 0.695632
from sktime.datatypes._panel._convert import from_long_to_nested, from_nested_to_long
X_nested = from_long_to_nested(X)
X_nested.head()
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
X_nested.iloc[0, 0]
X_nested is a nested DataFrame: True
The cell contains a <class 'pandas.core.series.Series'>.
The nested DataFrame has shape (50, 5)
0 0.722682
1 0.710791
2 0.829011
3 0.594688
4 0.138771
5 0.974366
6 0.691353
7 0.645689
8 0.924583
9 0.076050
10 0.933042
11 0.904305
12 0.193533
13 0.371561
14 0.470711
15 0.199187
16 0.222599
17 0.593297
18 0.486712
19 0.805983
Name: 0, dtype: float64
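To see what the long-to-nested conversion has to do, the following sketch regroups a tiny long table by hand with pandas. It is an illustration under my own assumptions (the toy values and the dim_0/dim_1 column names are made up), not sktime's implementation of from_long_to_nested.

```python
import pandas as pd

# A tiny long table: 2 cases, 2 dimensions, 3 readings each.
# The value encodes its own position so the result is easy to check.
rows = [
    (case, dim, reading, 100 * case + 10 * dim + reading)
    for case in range(2)
    for dim in range(2)
    for reading in range(3)
]
X_long = pd.DataFrame(rows, columns=["case_id", "dim_id", "reading_id", "value"])

case_ids = sorted(X_long["case_id"].unique())
dim_ids = sorted(X_long["dim_id"].unique())

# For every dimension, collect one series per case, ordered by reading_id.
columns = {}
for dim in dim_ids:
    column = []
    for case in case_ids:
        mask = (X_long["case_id"] == case) & (X_long["dim_id"] == dim)
        series = X_long.loc[mask].sort_values("reading_id")["value"]
        column.append(series.reset_index(drop=True))
    columns[f"dim_{dim}"] = column  # illustrative column name

X_nested = pd.DataFrame(columns, index=case_ids)

print(X_nested.shape)                       # cases x dimensions
print(X_nested.loc[1, "dim_1"].tolist())    # readings of case 1, dimension 1
```

The 12 rows of the long table collapse into a 2 x 2 frame, in the same way the 5000-row table above becomes a (50, 5) nested frame.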
# Multi-index data
from sktime.datasets import make_multi_index_dataframe
from sktime.datatypes._panel._convert import (
from_multi_index_to_nested,
from_nested_to_multi_index,
)
X_mi = make_multi_index_dataframe(n_instances=50, n_columns=5, n_timepoints=20)
print(f"The multi-indexed DataFrame has shape {X_mi.shape}")
print(f"The multi-index names are {X_mi.index.names}")
X_mi.head()
output:
The multi-indexed DataFrame has shape (1000, 5)
The multi-index names are ['case_id', 'reading_id']
var_0 var_1 var_2 var_3 var_4
case_id reading_id
0 0 0.995135 0.038909 0.181639 0.104997 0.122326
1 0.355448 0.991423 0.807475 0.062603 0.175412
2 0.935102 0.403729 0.031880 0.664581 0.438267
3 0.724594 0.291621 0.833333 0.780480 0.201459
4 0.019985 0.743191 0.485572 0.444321 0.146891
X_nested = from_multi_index_to_nested(X_mi, instance_index="case_id")
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
X_nested.head()
output:
X_nested is a nested DataFrame: True
The cell contains a <class 'pandas.core.series.Series'>.
The nested DataFrame has shape (50, 5)
var_0 var_1 var_2 var_3 var_4
0 0 0.995135 1 0.355448 2 0.935102 3… 0 0.038909 1 0.991423 2 0.403729 3… 0 0.181639 1 0.807475 2 0.031880 3… 0 0.104997 1 0.062603 2 0.664581 3… 0 0.122326 1 0.175412 2 0.438267 3…
1 0 0.100698 1 0.147056 2 0.475139 3… 0 0.947874 1 0.008540 2 0.004873 3… 0 0.233057 1 0.089218 2 0.214873 3… 0 0.905125 1 0.200611 2 0.534587 3… 0 0.215696 1 0.870398 2 0.663107 3…
2 0 0.013395 1 0.643044 2 0.534492 3… 0 0.967722 1 0.393499 2 0.000600 3… 0 0.044322 1 0.706455 2 0.015284 3… 0 0.321464 1 0.262389 2 0.287553 3… 0 0.031653 1 0.506912 2 0.440595 3…
3 0 0.633218 1 0.858456 2 0.573660 3… 0 0.176980 1 0.217540 2 0.741566 3… 0 0.378840 1 0.260752 2 0.158162 3… 0 0.394054 1 0.326364 2 0.767337 3… 0 0.441756 1 0.962046 2 0.337529 3…
4 0 0.530016 1 0.641393 2 0.580072 3… 0 0.550959 1 0.762723 2 0.807829 3… 0 0.777979 1 0.423855 2 0.034303 3… 0 0.937339 1 0.197130 2 0.916071 3… 0 0.802547 1 0.274180 2 0.198586 3…
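The list at the top describes the 3D NumPy layout as similar to the multi-indexed one, just with default indices. A minimal sketch of that relationship, assuming the 3D array is laid out as (n_instances, n_columns, n_timepoints), which to my knowledge is the convention from_3d_numpy_to_nested expects; the toy sizes and values are my own.

```python
import numpy as np
import pandas as pd

n_instances, n_timepoints, n_columns = 3, 4, 2

# A small multi-indexed frame with the same index names as above.
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["case_id", "reading_id"]
)
values = np.arange(n_instances * n_timepoints * n_columns, dtype=float)
X_mi = pd.DataFrame(
    values.reshape(-1, n_columns), index=index, columns=["var_0", "var_1"]
)

# Rows are ordered by (case, reading), so reshape to (case, reading, variable)
# and then swap the last two axes to get (case, variable, reading).
X_3d = (
    X_mi.to_numpy()
    .reshape(n_instances, n_timepoints, n_columns)
    .transpose(0, 2, 1)
)

print(X_3d.shape)
print(X_3d[1, 0])  # var_0 of case 1 across its four readings
```

The reshape relies on every case having the same number of readings and on the rows being sorted by case_id and then reading_id.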
from sktime.datasets import load_airline
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.theta import ThetaForecaster
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error
import matplotlib.pyplot as plt
y = load_airline()  # time series data
print(y)
y_train, y_test = temporal_train_test_split(y)
fh = ForecastingHorizon(y_test.index, is_relative=False)
print(fh)
forecaster = ThetaForecaster(sp=12) # monthly seasonal periodicity
forecaster.fit(y_train)
y_pred = forecaster.predict(fh)
mean_absolute_percentage_error(y_test, y_pred)
y_train.plot()
y_test.plot()
y_pred.plot()
plt.show()
output:
Period
1949-01 112.0
1949-02 118.0
1949-03 132.0
1949-04 129.0
1949-05 121.0
…
1960-08 606.0
1960-09 508.0
1960-10 461.0
1960-11 390.0
1960-12 432.0
Freq: M, Name: Number of airline passengers, Length: 144, dtype: float64
ForecastingHorizon(['1958-01', '1958-02', '1958-03', '1958-04', '1958-05', '1958-06',
'1958-07', '1958-08', '1958-09', '1958-10', '1958-11', '1958-12',
'1959-01', '1959-02', '1959-03', '1959-04', '1959-05', '1959-06',
'1959-07', '1959-08', '1959-09', '1959-10', '1959-11', '1959-12',
'1960-01', '1960-02', '1960-03', '1960-04', '1960-05', '1960-06',
'1960-07', '1960-08', '1960-09', '1960-10', '1960-11', '1960-12'],
dtype='period[M]', name='Period', is_relative=False)
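Two ideas in the forecasting demo are worth spelling out: the split is chronological rather than shuffled, and the error metric is a mean of relative errors. The horizon printed above covers 36 of the 144 months, which is consistent with holding out the last 25%. The sketch below is plain Python with hypothetical helper names (chronological_split, mape); it is not sktime's code, and sktime's metric also accepts a symmetric flag, so the value it reports may be a different variant depending on the version and arguments.

```python
def chronological_split(values, test_fraction=0.25):
    """Keep the order and hold out the last `test_fraction` of the points."""
    n_test = int(round(len(values) * test_fraction))
    split = len(values) - n_test
    return values[:split], values[split:]


def mape(y_true, y_pred):
    """Mean of |actual - forecast| / |actual|, returned as a fraction."""
    errors = [abs(a - f) / abs(a) for a, f in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


series = list(range(1, 145))  # 144 made-up monthly values
train, test = chronological_split(series)
print(len(train), len(test))  # 108 36

# Two forecasts that are each off by 10% of the actual value.
print(mape([100.0, 200.0], [110.0, 180.0]))  # 0.1
```

Because the test block is always the most recent stretch of the series, the model is never scored on points that precede its training data.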
from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.datasets import load_arrow_head
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_arrow_head()  # this dataset has 1 variable, measured at 251 time points, for 211 cases
print(X.shape)
print(y)  # y holds three class labels and has no time dimension
X_train, X_test, y_train, y_test = train_test_split(X, y)
classifier = TimeSeriesForestClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy_score(y_test, y_pred)  # 0.8679245283018868 in this run
X.iloc[0,0]
output:
(211, 1)
['0' '1' '2' '0' '1' '2' '0' '1' '2' '0' '1' '2' '0' '1' '2' '0' '1' '2'
'0' '1' '2' '0' '1' '2' '0' '1' '2' '0' '1' '2' '0' '1' '2' '0' '1' '2'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '1' '1' '1'
'1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
'1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
'1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '2' '2' '2' '2'
'2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2'
'2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2'
'2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2']
0 -1.963009
1 -1.957825
2 -1.956145
3 -1.938289
4 -1.896657
…
246 -1.841345
247 -1.884289
248 -1.905393
249 -1.923905
250 -1.909153
Length: 251, dtype: float64
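The classifier here comes from the interval_based module. As I understand the time series forest idea, each series is summarised over intervals by simple statistics such as mean, standard deviation, and slope, and trees are then trained on those features. The sketch below computes these three statistics for one fixed interval with NumPy. The helper interval_features is hypothetical and only illustrates the feature idea, not sktime's implementation, which picks its intervals at random.

```python
import numpy as np


def interval_features(series, start, end):
    """Return (mean, std, slope) of series[start:end]."""
    window = np.asarray(series[start:end], dtype=float)
    positions = np.arange(len(window))
    # Slope of the least-squares line fitted through the window.
    slope = np.polyfit(positions, window, deg=1)[0]
    return window.mean(), window.std(), slope


series = np.arange(10, dtype=float)  # a straight line 0, 1, ..., 9
mean, std, slope = interval_features(series, 2, 6)  # covers values 2, 3, 4, 5
print(mean, std, slope)
```

For the 251-point ArrowHead series, features like these over many intervals turn each series into an ordinary feature vector that a tree ensemble can handle.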
from sktime.annotation.adapters import PyODAnnotator
from pyod.models.iforest import IForest
from sktime.datasets import load_airline
y = load_airline()
print(y.shape)
pyod_model = IForest()
pyod_sktime_annotator = PyODAnnotator(pyod_model)
pyod_sktime_annotator.fit(y)
annotated_series = pyod_sktime_annotator.predict(y)
annotated_series
output:
(144,)
1949-01 1
1949-02 0
1949-03 0
1949-04 0
1949-05 0
…
1960-08 1
1960-09 1
1960-10 0
1960-11 0
1960-12 1
Freq: M, Length: 144, dtype: int32
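The annotator returns a 0/1 series aligned with the input, where 1 marks a point flagged as an outlier. A minimal sketch of using such labels to pull out the flagged observations, with pandas only; the five values and labels are copied from the first rows of the outputs above, and the variable names are my own.

```python
import pandas as pd

# First five months of the airline series and their labels, as printed above.
index = pd.period_range("1949-01", periods=5, freq="M")
y_head = pd.Series([112.0, 118.0, 132.0, 129.0, 121.0], index=index)
labels_head = pd.Series([1, 0, 0, 0, 0], index=index)

# Boolean indexing keeps only the observations whose label is 1.
flagged = y_head[labels_head == 1]

print(flagged)
print(int(labels_head.sum()), "of", len(labels_head), "points flagged")
```

Applied to the full annotated_series, the same indexing would list every month the model considers anomalous.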