Extracting time series features with tsfresh in Python

tsfresh documentation: https://tsfresh.readthedocs.io/en/latest/
tsfresh GitHub repository: https://github.com/blue-yonder/tsfresh
Installing tsfresh: pip install tsfresh

Usage example

import pandas as pd
import numpy as np
from tsfresh import extract_features

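def convert_to_extract_df(dataframe: pd.DataFrame):
    """Convert a wide DataFrame (one column per series) into the long format that extract_features expects."""
    col_dfs = []
    # DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement
    for _col, col_series in dataframe.items():
        # one block of rows per original column, tagged with the column name as its id
        _col_df = pd.DataFrame({'value': col_series.values})
        _col_df['id'] = _col
        col_dfs.append(_col_df)
    # concatenate once at the end instead of growing an empty frame inside the loop
    convert_df = pd.concat(col_dfs, axis=0, ignore_index=True)
    convert_df['value'] = convert_df['value'].astype("float")
    return convert_df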

def get_line_features(dataframe: pd.DataFrame):
    """Extract the features of each curve (one row of features per id)."""
    from tsfresh import extract_features  # TODO: time-consuming
    ext_feature = extract_features(dataframe, column_id="id")
    return ext_feature

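Construct some time series. Note that in this example each column is one time series and each row is one time step (because of freq="10D", consecutive rows are 10 days apart).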

time_df = pd.DataFrame(
    np.arange(400).reshape((100, 4)),
    index=pd.date_range(start="20190101", periods=100, freq="10D"),
    columns=["col1", "col2", "col3", "col4"],
)

Once the series are built, they need to be converted into the input format that tsfresh accepts for feature extraction:

ext_df = convert_to_extract_df(time_df)
ext_df.head()
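
The same wide-to-long conversion can also be sketched with pandas' own melt(), without an explicit loop. This is an alternative to convert_to_extract_df, not a replacement the original post uses; wide_df and long_df are names chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Same wide layout as in this post: one column per series.
wide_df = pd.DataFrame(
    np.arange(400).reshape((100, 4)),
    columns=["col1", "col2", "col3", "col4"],
)

# melt() stacks the columns on top of each other: the former column
# name goes into "id" and the cell value into "value".
long_df = wide_df.melt(var_name="id", value_name="value")
long_df["value"] = long_df["value"].astype("float")

# Reorder the columns to match the output of convert_to_extract_df.
long_df = long_df[["value", "id"]]
print(long_df.head())
```

Both approaches yield 400 rows, with all rows of col1 first, then col2, and so on.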

ext_df now has the following format (the full frame is shown, not only the head):

     value    id
0      0.0  col1
1      4.0  col1
2      8.0  col1
3     12.0  col1
4     16.0  col1
..     ...   ...
395  383.0  col4
396  387.0  col4
397  391.0  col4
398  395.0  col4
399  399.0  col4

Each id identifies one series: rows that share an id belong to the same time series, and different ids are different time series.
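
Note that this long format has dropped the DatetimeIndex, so the order of the rows is the only thing that encodes time. A minimal sketch of how the timestamps could be kept as an explicit column is shown below; the column name "time" is an assumption made for this example, and the commented-out call relies on the column_sort argument of extract_features.

```python
import numpy as np
import pandas as pd

wide_df = pd.DataFrame(
    np.arange(400).reshape((100, 4)),
    index=pd.date_range(start="20190101", periods=100, freq="10D"),
    columns=["col1", "col2", "col3", "col4"],
)

# Turn the index into a regular "time" column, then melt the rest.
long_df = (
    wide_df.rename_axis("time")
    .reset_index()
    .melt(id_vars="time", var_name="id", value_name="value")
)
long_df["value"] = long_df["value"].astype("float")

# With an explicit time column, row order stops being significant:
# a shuffled copy can always be put back in order.
shuffled = long_df.sample(frac=1.0, random_state=0)
restored = shuffled.sort_values(["id", "time"]).reset_index(drop=True)

# Hypothetical call (not run here):
# extract_features(shuffled, column_id="id", column_sort="time", column_value="value")
```

When a sort column is present, column_value (or an equivalent restriction) matters, because otherwise every remaining column is treated as a series of values.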

Then extract the features:

ext_feature = get_line_features(ext_df)

This gives the result:

ext_feature.shape # (4, 787)

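The 4 matches the number of series: each series (that is, each distinct id) gets one row. The 787 columns are the extracted features; the exact count depends on the tsfresh version.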

Excerpt of ext_feature, transposed here for readability (features as rows, series ids as columns):

feature                                                   col1          col2          col3          col4
value__variance_larger_than_standard_deviation            1.0           1.0           1.0           1.0
value__has_duplicate_max                                  0.0           0.0           0.0           0.0
value__has_duplicate_min                                  0.0           0.0           0.0           0.0
value__has_duplicate                                      0.0           0.0           0.0           0.0
value__sum_values                                         19800.0       19900.0       20000.0       20100.0
value__abs_energy                                         5253600.0     5293300.0     5333200.0     5373300.0
value__mean_abs_change                                    4.0           4.0           4.0           4.0
value__mean_change                                        4.0           4.0           4.0           4.0
value__mean_second_derivative_central                     0.0           0.0           0.0           0.0
value__median                                             198.0         199.0         200.0         201.0
...                                                       ...           ...           ...           ...
value__permutation_entropy__dimension_5__tau_1            -0.0          -0.0          -0.0          -0.0
value__permutation_entropy__dimension_6__tau_1            -0.0          -0.0          -0.0          -0.0
value__permutation_entropy__dimension_7__tau_1            -0.0          -0.0          -0.0          -0.0
value__query_similarity_count__query_None__threshold_0.0  NaN           NaN           NaN           NaN
value__matrix_profile__feature_"min"__threshold_0.98      1.685874e-07  1.685874e-07  1.685874e-07  1.685874e-07
value__matrix_profile__feature_"max"__threshold_0.98      1.685874e-07  1.685874e-07  1.685874e-07  1.685874e-07
value__matrix_profile__feature_"mean"__threshold_0.98     1.685874e-07  1.685874e-07  1.685874e-07  1.685874e-07
value__matrix_profile__feature_"median"__threshold_0.98   1.685874e-07  1.685874e-07  1.685874e-07  1.685874e-07
value__matrix_profile__feature_"25"__threshold_0.98       1.685874e-07  1.685874e-07  1.685874e-07  1.685874e-07
value__matrix_profile__feature_"75"__threshold_0.98       1.685874e-07  1.685874e-07  1.685874e-07  1.685874e-07
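
Several of the values in this output can be cross-checked by hand. The sketch below recomputes a few of them for col1 with NumPy, assuming the feature names mean what they suggest (sum of values, sum of squared values, mean of successive differences, median); these definitions are assumptions made for the check, not quotations from tsfresh's source.

```python
import numpy as np

# col1 holds 0, 4, 8, ..., 396 (every fourth value of np.arange(400)).
col1 = np.arange(0, 400, 4, dtype=float)
steps = np.diff(col1)  # differences between consecutive points

sum_values = col1.sum()                 # compare with value__sum_values
abs_energy = np.sum(col1 ** 2)          # compare with value__abs_energy
mean_change = steps.mean()              # compare with value__mean_change
mean_abs_change = np.abs(steps).mean()  # compare with value__mean_abs_change
median = np.median(col1)                # compare with value__median

print(sum_values, abs_energy, mean_change, mean_abs_change, median)
```

These reproduce the col1 entries shown above (19800.0, 5253600.0, 4.0, 4.0 and 198.0), which is a quick way to confirm that the long-format conversion fed the intended values into tsfresh.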
