tsfresh official documentation
After reading it, you can quickly get started with extracting time-series features.
features_filtered_direct = extract_relevant_features(df_timeseries, y,
    fdr_level=0.002, column_id='stay_id', column_sort='time')
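df_timeseries is my time-series data.
df['stay_id', 'y'] was meant to pull my ids and target values from another table. I was sure that table contained both columns, so I first wrote the call like this: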
features_filtered_direct = extract_relevant_features(df_timeseries, df['stay_id', 'y'],
    column_id='stay_id', column_sort='time')
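Later I realized this was a habitual slip of mine: df['stay_id', 'y'] treats ('stay_id', 'y') as one single key, which is obviously meaningless, instead of as two separate column names.
The correct way to select both columns is df[['stay_id', 'y']]
However, y must be a Series, not a DataFrame.
See a pandas Series tutorial for details.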
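The difference between the three forms can be checked on a tiny table. This is a minimal sketch with made-up data: the label table `df` and its values are hypothetical, and indexing y by stay_id is an assumption based on tsfresh matching targets to series through the index of y.

```python
import pandas as pd

# Hypothetical label table: one row per stay_id with its target value y.
df = pd.DataFrame({"stay_id": [101, 102, 103], "y": [0, 1, 0]})

# df['stay_id', 'y'] looks up the single key ('stay_id', 'y'),
# which is not a column name, so pandas raises KeyError.
try:
    df["stay_id", "y"]
    tuple_lookup_failed = False
except KeyError:
    tuple_lookup_failed = True

# A list of names selects two columns, but the result is a DataFrame.
two_columns = df[["stay_id", "y"]]

# A Series of targets indexed by stay_id (assumed layout for the y argument).
y = df.set_index("stay_id")["y"]

print(tuple_lookup_failed, type(two_columns).__name__, type(y).__name__)
```

The resulting `y` can then be passed as the second argument of extract_relevant_features in place of df['stay_id', 'y'].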
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
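A little further up in the traceback you can see that this step uses multiprocessing.
So the code should be placed inside the main guard:
if __name__ == '__main__':
    ...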
Note that the extraction takes a long time.
For my data, even with extract_relevant_features, it still produced 900 features.
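In the function signature I noticed a parameter called fdr_level. The source code defines it as the expected percentage of irrelevant features among all created features. In other words, the smaller this value, the fewer irrelevant features remain and the stricter the filtering. Its default value is 0.05.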
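To make the effect of this threshold concrete, here is a simplified sketch of a Benjamini-Hochberg style cutoff applied to made-up p-values. It only illustrates how an FDR level behaves and is not tsfresh's actual selection code; the function name `benjamini_hochberg` and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr_level):
    """Return a list of booleans: True where a feature is kept as relevant."""
    m = len(p_values)
    # Indices of the features, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below rank / m * fdr_level.
        if p_values[idx] <= rank / m * fdr_level:
            cutoff_rank = rank
    kept = set(order[:cutoff_rank])
    return [i in kept for i in range(m)]


# Made-up p-values, one per candidate feature.
p_values = [0.0001, 0.004, 0.019, 0.03, 0.2, 0.6]

kept_default = benjamini_hochberg(p_values, fdr_level=0.05)
kept_strict = benjamini_hochberg(p_values, fdr_level=0.002)

print(sum(kept_default), sum(kept_strict))
```

With these made-up p-values, the default level of 0.05 keeps four features, while 0.002 keeps only one.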
I set it to 0.002, and then I ran out of memory…