(1) sktime
(2) pyts
https://zhuanlan.zhihu.com/p/272691705
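While searching for shapelets code today, I came across a library called pyts. Its GitHub repository is https://github.com/johannfaouzi/pyts. The repository's README also links to the paper that introduces the library, "pyts: A Python Package for Time Series Classification", so I read through it. Here are the main points: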
First, the library depends on numpy, scipy, scikit-learn, joblib, and numba. joblib is there for "running Python functions as pipeline jobs". As for numba: "Numba-compiled numerical algorithms in Python can reach the speeds of C or FORTRAN."
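For computational efficiency, the vast majority of the algorithms in pyts only handle equal-length time series. Note, however, that the library's DTW algorithm and its variants do support variable-length time series. In the paper's words: "For computational efficiency, most algorithms implemented in pyts can only deal with data sets of equal-length time series." pyts supports both univariate and multivariate time series data sets.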
The figure above shows the algorithms pyts supports (more have since been added on GitHub).
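To illustrate why DTW, unlike most of the other algorithms, can cope with series of different lengths, here is a minimal pure-Python sketch of the classic dynamic-programming formulation. It is not the pyts implementation; the function name dtw_distance and the toy series are made up for this illustration.

```python
def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) DTW with a squared-difference local cost.

    Illustrative sketch only, not the pyts implementation.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] ** 0.5


# Toy series of different lengths: the second one repeats every value twice
short_series = [0.0, 1.0, 2.0]
long_series = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
print(dtw_distance(short_series, long_series))  # 0.0: warping absorbs the repeats
```

Nothing in the recurrence requires n and m to be equal, which is why the two inputs may differ in length.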
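The paper is short, so let's look at how to use the library. The only Chinese-language blog post I have seen that uses pyts is about converting time series into images 【1】, so here I'll write up how to call its shapelets implementation instead. Following the official documentation on shapelets 【2】, let's go straight to an example: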
from pyts.classification import LearningShapelets
from pyts.datasets import load_gunpoint
X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
clf = LearningShapelets(random_state=42, tol=0.01)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
The learned shapelets are available through clf.shapelets_. The documentation gives its shape as: array, shape = (n_tasks, n_shapelets). What is n_shapelets? It should be the number of shapelets. To pick out specific shapelets, the syntax looks like this:
shapelets = np.asarray([clf.shapelets_[0, -9], clf.shapelets_[0, -12]])
This takes the ninth-from-last and the twelfth-from-last of the learned shapelets. Here is a more complete piece of code:
import matplotlib.pyplot as plt
import numpy as np
from pyts.classification import LearningShapelets
from pyts.datasets import load_gunpoint
from pyts.utils import windowed_view
# Load the data set and fit the classifier
X, _, y, _ = load_gunpoint(return_X_y=True)
clf = LearningShapelets(random_state=42, tol=0.01)
clf.fit(X, y)
# Select two shapelets
shapelets = np.asarray([clf.shapelets_[0, -9], clf.shapelets_[0, -12]])
# Derive the distances between the time series and the shapelets
shapelet_size = shapelets.shape[1]
X_window = windowed_view(X, window_size=shapelet_size, window_step=1)
X_dist = np.mean(
    (X_window[:, :, None] - shapelets[None, :]) ** 2, axis=3).min(axis=1)
plt.figure(figsize=(14, 4))
# Plot the two shapelets
plt.subplot(1, 2, 1)
plt.plot(shapelets[0])
plt.plot(shapelets[1])
plt.title('Two learned shapelets', fontsize=14)
# Plot the distances
plt.subplot(1, 2, 2)
for color, label in zip('br', (1, 2)):
    plt.scatter(X_dist[y == label, 0], X_dist[y == label, 1],
                c=color, label='Class {}'.format(label))
plt.title('Distances between the time series and both shapelets',
          fontsize=14)
plt.legend()
plt.show()
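First, clf.fit(X, y) yields 90 shapelets in total: 30 of length 15, 30 of length 30, and 30 of length 45. Why 90, and why the lengths 15, 30, and 45? For that we need to look at 【3】. Start with the parameters used when instantiating LearningShapelets: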
LearningShapelets(n_shapelets_per_size=0.2, min_shapelet_length=0.1,
shapelet_scale=3, penalty='l2',
tol=0.001, C=1000, learning_rate=1.0,
max_iter=1000, multi_class='multinomial',
alpha=-100, fit_intercept=True,
intercept_scaling=1.0,
class_weight=None, verbose=0, random_state=None, n_jobs=None)
Here is the explanation of the parameter n_shapelets_per_size:
int or float (default = 0.2) Number of shapelets per size. If float, it represents a fraction of the number of timestamps and the number of shapelets per size is equal to ceil(n_shapelets_per_size * n_timestamps).
Each of our time series has length 150, and 150 * 0.2 = 30, so there are 30 shapelets for each size. Now the parameter shapelet_scale:
The different scales for the lengths of the shapelets. The lengths of the shapelets are equal to min_shapelet_length * np.arange(1, shapelet_scale + 1). The total number of shapelets (and features) is equal to n_shapelets_per_size * shapelet_scale.
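Since shapelet_scale defaults to 3 and min_shapelet_length=0.1, and each time series has length 150, we get 150 * 0.1 = 15 and 15 * 3 = 45, so the shapelets come in three lengths: 15, 30, and 45. Each length has 30 shapelets, which gives 90 in total.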
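The arithmetic above can be checked in a few lines. This is only a sketch of the formulas quoted from the documentation, not code taken from pyts. It makes two assumptions that are not confirmed here: that a float min_shapelet_length is rounded up the same way as n_shapelets_per_size, and that the shapelets are stored shortest-first. The helper shapelet_length is hypothetical.

```python
import math

n_timestamps = 150           # length of one GunPoint series, as stated above
n_shapelets_per_size = 0.2   # default value
min_shapelet_length = 0.1    # default value
shapelet_scale = 3           # default value

# Number of shapelets per size: ceil(n_shapelets_per_size * n_timestamps)
per_size = math.ceil(n_shapelets_per_size * n_timestamps)
# Assumption: a float min_shapelet_length is a fraction of the series length
min_length = math.ceil(min_shapelet_length * n_timestamps)
# Lengths: min_shapelet_length * (1, 2, ..., shapelet_scale)
lengths = [min_length * k for k in range(1, shapelet_scale + 1)]
total = per_size * shapelet_scale


def shapelet_length(index):
    """Length of the shapelet at `index` (hypothetical helper).

    Assumes the groups are stored shortest-first.
    """
    position = index % total  # resolves negative indices the way Python does
    return lengths[position // per_size]


print(per_size, lengths, total)                    # 30 [15, 30, 45] 90
print(shapelet_length(-9), shapelet_length(-12))   # 45 45
```

Under the shortest-first assumption, indices -9 and -12 both fall in the length-45 group, which would explain why those two shapelets can be stacked into a regular 2-D array with np.asarray.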
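I was curious about X_window and X_dist. X_window splits every time series into windows of size window_size; the code then computes a Euclidean-style distance (the mean squared difference) between each window and each shapelet and keeps the smallest one. So my idea was to use the library for the first step, obtaining the shapelets, and to implement the sliding-window split and the distance computation myself.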
from pyts.classification import LearningShapelets
from pyts.datasets import load_gunpoint
import numpy as np
import math
def getDataFromSlidingWindow(windowSize,data):
    # Only a step of 1 is implemented for now
    dataList=[]
    for i in range(data.shape[0]-windowSize+1):
        temp=[]
        for j in range(windowSize):
            temp.append(data[i+j])
        dataList.append(temp)
    return dataList
# Mean distance, based on the squared Euclidean distance
def meanDistance(windowData,shapelets):
    total=0  # renamed from `sum` to avoid shadowing the built-in
    for i in range(len(windowData)):
        total=total+(windowData[i]-shapelets[i])*(windowData[i]-shapelets[i])
    # result=math.sqrt(float(total))/len(windowData)
    result=float(total)/len(windowData)
    return result
X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
clf = LearningShapelets(random_state=42, tol=0.01)
clf.fit(X_train, y_train)
# The first and last shapelets may have different lengths, so keep them in a
# list; np.asarray on ragged input raises an error in recent NumPy versions
shapelets = [clf.shapelets_[0, 0], clf.shapelets_[0, -1]]
print([s.shape for s in shapelets])
print(clf.shapelets_[0,0].shape)
print(clf.shapelets_[0,-1].shape)
print(X_train.shape)
for i in range(X_train.shape[0]):
    print(X_train[i])
windowSize=clf.shapelets_[0,0].shape[0] # the window size matches the shapelet length
step=1 # how far the window slides each time (only 1 is implemented)
# print(len(getDataFromSlidingWindow(windowSize,X_train[0,:])))
# print(len(getDataFromSlidingWindow(windowSize,X_train[0,:])[0]))
windowData=getDataFromSlidingWindow(windowSize, X_train[0,:])
allDistanceList=[]
for item in windowData:
    allDistanceList.append(meanDistance(item,clf.shapelets_[0,0]))
# Smallest distance between the first series and the first shapelet
print(min(allDistanceList))
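For comparison, the same "slide a window, take the mean squared difference, keep the minimum" computation can be sketched in vectorized form with NumPy's sliding_window_view (available from NumPy 1.20). This is an illustrative alternative to the loops above, not what pyts does internally; the function name min_mean_squared_distance and the toy data are made up.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def min_mean_squared_distance(series, shapelet):
    """Smallest mean squared difference between `shapelet` and any window of `series`."""
    series = np.asarray(series, dtype=float)
    shapelet = np.asarray(shapelet, dtype=float)
    # All contiguous windows of the shapelet's length, with a step of 1:
    # shape (len(series) - len(shapelet) + 1, len(shapelet))
    windows = sliding_window_view(series, window_shape=shapelet.shape[0])
    # Mean squared difference per window, then the best (smallest) match
    return float(np.mean((windows - shapelet) ** 2, axis=1).min())


# Toy data: the shapelet occurs exactly at positions 2..4 of the series
series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
shapelet = [2.0, 3.0, 2.0]
best = min_mean_squared_distance(series, shapelet)
print(best)  # 0.0, because one window matches the shapelet exactly
```

On real data, calling this with X_train[0] and clf.shapelets_[0, 0] should give the same value as min(allDistanceList) above, up to floating-point rounding.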
OK, I'll stop here. What comes next relates to the paper.
【1】Converting a one-dimensional time series into a two-dimensional image
【2】3. Extracting features from time series
【3】pyts official documentation: pyts.classification.LearningShapelets