Large-Scale Unusual Time Series Detection 2015

Exploring the feature space of large collections of time series

Video Hyndman.pdf
Time series anomaly detection code and tools
Workshop on Frontiers in Functional Data Analysis
Banff, Canada.

It is becoming increasingly common for organizations to collect very large amounts of data over time. Data visualization is essential for exploring and understanding structures and patterns, and for identifying unusual observations. However, the sheer quantity of data available challenges current time series visualization methods.

For example, Yahoo has banks of mail servers that are monitored over time. Many measurements on server performance are collected every hour for each of thousands of servers. We wish to identify servers that are behaving unusually.

Alternatively, we may have thousands of time series we wish to forecast, and we want to be able to identify the types of time series that are easy to forecast and those that are inherently challenging.

I will demonstrate a functional data approach to this problem using a vector of features on each time series, measuring characteristics of the series. For example, the features may include lag correlation, strength of seasonality, spectral entropy, etc. Then we use a principal component decomposition on the features, and plot the first few principal components. This enables us to explore a lower dimensional space and discover interesting structure and unusual observations.
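To make the idea of a per-series feature concrete, here is a minimal sketch of two of the features named above: lag-1 autocorrelation and spectral entropy. The function names are hypothetical, and the spectral entropy shown (Shannon entropy of a normalized raw periodogram, scaled to lie between 0 and 1) is one common definition assumed here for illustration; the talk does not say which estimator is used.

```python
import cmath
import math


def acf1(x):
    """First-order (lag-1) autocorrelation of a sequence."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:
        return 0.0  # constant series: autocorrelation undefined, 0 by convention here
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n))
    return num / denom


def spectral_entropy(x):
    """Shannon entropy of the normalized periodogram, scaled to [0, 1].

    Assumed definition for illustration: values near 0 mean the power sits
    at a few frequencies (easy to forecast), values near 1 mean a flat,
    noise-like spectrum.
    """
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    # Raw periodogram at the positive Fourier frequencies (naive O(n^2) DFT).
    power = []
    for k in range(1, n // 2 + 1):
        coef = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                   for t, c in enumerate(centered))
        power.append(abs(coef) ** 2)
    total = sum(power)
    if total == 0 or len(power) < 2:
        return 0.0
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))
```

Computing such functions for every series yields one feature vector per series, which is the input to the principal component step.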

Large-scale unusual time series detection

Rob J Hyndman (1), Earo Wang (1) and Nikolay Laptev (2)

(1) Monash Business School, Monash University, Clayton, Victoria, Australia.
(2) Yahoo Labs, Sunnyvale, California, USA.

Abstract: It is becoming increasingly common for organizations to collect very large amounts of data over time, and to need to detect unusual or anomalous time series. For example, Yahoo has banks of mail servers that are monitored over time. Many measurements on server performance are collected every hour for each of thousands of servers. We wish to identify servers that are behaving unusually.
We compute a vector of features on each time series, measuring characteristics of the series. The features may include lag correlation, strength of seasonality, spectral entropy, etc. Then we use a principal component decomposition on the features, and use various bivariate outlier detection methods applied to the first two principal components. This enables the most unusual series, based on their feature vectors, to be identified. The bivariate outlier detection methods used are based on highest density regions and α-hulls.
Download working paper
Associated R package

A new R package for detecting unusual time series

The anomalous package provides some tools to detect unusual time series in a large collection of time series. This is joint work with Earo Wang (an honours student at Monash) and Nikolay Laptev (from Yahoo Labs). Yahoo is interested in detecting unusual patterns in server metrics.
The package is based on this paper with Earo and Nikolay.
The basic idea is to measure a range of features of the time series (such as strength of seasonality, an index of spikiness, first order autocorrelation, etc.). Then a principal component decomposition of the feature matrix is calculated, and outliers are identified in the two-dimensional space of the first two principal component scores.
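The decomposition step can be sketched without any external library. The code below is only an illustration, not the anomalous package's implementation: it standardizes a feature matrix (one row per series), then extracts the first two principal components by power iteration with deflation. The standardization, the power-iteration solver and all function names are assumptions made for this sketch.

```python
import math


def standardize(rows):
    """Scale every feature column to zero mean and unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has zero spread; divide by 1 instead to avoid 0/0.
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def covariance(z):
    """Covariance matrix of rows that are already centred."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(a, iters=1000):
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix."""
    p = len(a)
    v = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero start vector
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            break
        v = [c / norm for c in w]
    value = sum(v[i] * a[i][j] * v[j] for i in range(p) for j in range(p))
    return value, v


def first_two_pc_scores(features):
    """Return [PC1, PC2] scores for each row of the feature matrix."""
    z = standardize(features)
    cov = covariance(z)
    p = len(cov)
    components = []
    for _ in range(2):
        value, vec = leading_eigenpair(cov)
        components.append(vec)
        # Deflate: remove the component just found so the next one dominates.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)]
               for i in range(p)]
    return [[sum(r[j] * c[j] for j in range(p)) for c in components] for r in z]
```

Each row of the result is a point in the two-dimensional space where the outlier search takes place. In practice a library routine for PCA would replace the hand-written solver.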

We use two methods to identify outliers.

1. A bivariate kernel density estimate of the first two PC scores is computed, and the points are ordered based on the value of the density at each observation. This gives us a ranking from most outlying (lowest density) to least outlying (highest density).
2. A series of α-convex hulls is computed on the first two PC scores with decreasing α, and points are classified as outliers when they become singletons separated from the main hull. This gives us an alternative ranking with the most outlying having separated at the highest value of α, and the remaining outliers with decreasing values of α.
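The first of these two methods can be sketched as follows. This is a rough illustration only: it uses a Gaussian product kernel with a Scott-style rule-of-thumb bandwidth, both of which are assumptions, since the post does not say which kernel or bandwidth the package uses. The function names are hypothetical. The α-hull method needs computational geometry and is not sketched here.

```python
import math


def rule_of_thumb_bandwidth(values):
    """Scott-style bandwidth for 2-D data: sd * n^(-1/6). An assumed choice."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return (sd or 1.0) * n ** (-1.0 / 6.0)


def density_ranking(points):
    """Rank 2-D points from most outlying (lowest density) to least.

    Returns (order, densities), where order lists point indices sorted by
    increasing kernel density estimate evaluated at each point.
    """
    n = len(points)
    hx = rule_of_thumb_bandwidth([p[0] for p in points])
    hy = rule_of_thumb_bandwidth([p[1] for p in points])
    densities = []
    for x, y in points:
        total = sum(
            math.exp(-0.5 * (((x - a) / hx) ** 2 + ((y - b) / hy) ** 2))
            for a, b in points
        )
        densities.append(total / (n * 2 * math.pi * hx * hy))
    order = sorted(range(n), key=lambda i: densities[i])
    return order, densities
```

For example, applied to a tight cluster of PC scores plus one distant point, the distant point should come first in `order`.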

I explained the ideas in a talk last Tuesday given at a joint meeting of the Statistical Society of Australia and the Melbourne Data Science Meetup Group. Slides are available here. A link to a video of the talk will also be added there when it is ready.
The density-ranking of PC scores was also used in my work on detecting outliers in functional data. See my 2010 JCGS paper and the associated rainbow package for R.
There are two versions of the package: one under an ACM licence, and a limited version under a GPL licence. Eventually we hope to make the GPL version contain everything, but we are currently dependent on the alphahull package which has an ACM licence.


A new open source data set for detecting time series outliers

Yahoo Labs has just released an interesting new data set useful for research on detecting anomalies (or outliers) in time series data. There are many contexts in which anomaly detection is important. For Yahoo, the main use case is in detecting unusual traffic on Yahoo servers.

The data set comprises real traffic to Yahoo services, along with some synthetic data. There are 367 time series in the data set, each of which contains between 741 and 1680 observations recorded at regular intervals. Each series is accompanied by an indicator series with a 1 if the observation was an anomaly, and 0 otherwise. The anomalies in the real data were determined by human judgement, while those in the synthetic data were generated algorithmically. For the synthetic data, some information about the components used to construct the data is also provided.

Although the Yahoo announcement claims that the data are publicly available, in fact they are only available to people with an edu address. Further, you have to apply to use them, and it takes about 24 hours before approval is granted. I have suggested that they remove these restrictions, and make the data available without restriction to anyone who wants to use them.

Research on anomaly detection in time series seems to be growing in popularity. Twitter has also released their own AnomalyDetection R package. Their approach has some similarities with my own tsoutliers function in the forecast package. The tso function in the tsoutliers package is another approach to the same problem.
Hopefully having a large public data set available will lead to improvements in time series outlier detection methods, at least for detecting outliers in internet traffic data.


2015/6/28 11:31:21
How this paper's problem differs (page 1)
We are interested in the time series that are anomalous relative to the other time series in the same cluster, or more generally, in the same set. This type of anomaly detection is different from univariate anomaly detection or even from a multivariate point anomaly detection [6] because we are interested in identifying entire time series that are behaving unusually in the context of other metrics.

An R package is already available (page 2)
Authors' contributions
First, we introduce a novel and accurate method of using PCA with α-convex hulls for finding anomalous time series. Second, we perform a study of possible features that are useful for the types of time series dynamics seen in web-traffic time series.

Why PCA works (page 2)
Therefore, loosely speaking, the first k principal components capture the k most prevalent patterns in the data.

Method used in this paper
To find anomalies in the first two PCs we use a multi-dimensional outlier detection algorithm. We have implemented both a density-based and an α-hull-based multi-dimensional outlier detection algorithm.
The density-based multi-dimensional anomaly detection algorithm [7] (Computing and Graphing Highest Density Regions) finds the points in the first two principal components with the lowest density. The α-hull method [15] (Generalizing the Convex Hull of a Sample: The R Package ...) is a generalization of the convex hull [6] (A Survey of Outlier Detection Methodologies), which is a bounding region of a point set. The α parameter in the α-hull method defines a generalized disk of radius α. When α is sufficiently large, the α-hull method is equivalent to the convex hull. Given α, an edge of the α-shape is drawn between two members of the finite point set if there exists a generalized disk of radius α containing the entire point set and the two points lie on its boundary.

2015/6/28 15:20:03
The variance of the variances across blocks measures the “lumpiness” of the series.
Some of our features rely on a robust STL decomposition.
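The lumpiness feature quoted above can be sketched in a few lines. The block width of 24 (one day of hourly data) is an assumption for illustration, the function names are hypothetical, and any trailing observations that do not fill a complete block are simply dropped.

```python
from statistics import variance


def block_variances(x, width):
    """Sample variance within each complete, non-overlapping block."""
    blocks = [x[i:i + width] for i in range(0, len(x) - width + 1, width)]
    return [variance(block) for block in blocks]


def lumpiness(x, width=24):
    """Variance of the block variances; needs at least two complete blocks."""
    return variance(block_variances(x, width))
```

A series whose variability is the same in every block scores zero, while a series that is calm in some blocks and volatile in others scores high.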

2015/6/28 15:44:43
“Flat spots” are computed by dividing the sample space of a time series into ten equal-sized intervals, and computing the maximum run length within any single interval.
Finally, “crossing points” are defined as the number of times a time series crosses the mean line.
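Both definitions translate almost directly into code. The sketch below follows the wording above; the function names are hypothetical, and two details are assumptions: the largest observation is placed in the last interval, and an observation exactly equal to the mean counts as being below it.

```python
def flat_spots(x, bins=10):
    """Longest run of consecutive observations falling in the same interval."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return len(x)  # a constant series is one long flat spot
    width = (hi - lo) / bins
    # Interval index 0..bins-1 for each value; the maximum goes in the last one.
    labels = [min(int((v - lo) / width), bins - 1) for v in x]
    longest = current = 1
    for prev, cur in zip(labels, labels[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest


def crossing_points(x):
    """Number of times consecutive observations fall on opposite sides of the mean."""
    mean = sum(x) / len(x)
    above = [v > mean for v in x]
    return sum(1 for a, b in zip(above, above[1:]) if a != b)
```

For instance, `crossing_points([1, -1, 1, -1, 1])` counts four crossings, because every step moves to the other side of the mean (0.2).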

2015/6/28 15:50:11
Results of our approach
Our approach first extracts the two most significant principal components (PCs) from all time series and then determines the outliers in the new 2D “feature space”. For multidimensional outlier detection on the PC space we show results for the density-based method (HDR) and for the α-hull method.

References

6 R: A Language and Environment for Statistical Computing
R: The R Project for Statistical Computing
R: a language and environment for statistical computing ...
21 A PCA-based Similarity Measure for Multivariate Time Series

