Large-Scale Unusual Time Series Detection 2015

Exploring the feature space of large collections of time series

Time series anomaly detection: code and tools
Workshop on Frontiers in Functional Data Analysis
Banff, Canada.

It is becoming increasingly common for organizations to collect very large amounts of data over time. Data visualization is essential for exploring and understanding structures and patterns, and for identifying unusual observations. However, the sheer quantity of data available challenges current time series visualization methods.

For example, Yahoo has banks of mail servers that are monitored over time. Many measurements on server performance are collected every hour for each of thousands of servers. We wish to identify servers that are behaving unusually.

Alternatively, we may have thousands of time series we wish to forecast, and we want to be able to identify the types of time series that are easy to forecast and those that are inherently challenging.

I will demonstrate a functional data approach to this problem using a vector of features on each time series, measuring characteristics of the series. For example, the features may include lag correlation, strength of seasonality, spectral entropy, etc. Then we use a principal component decomposition on the features, and plot the first few principal components. This enables us to explore a lower dimensional space and discover interesting structure and unusual observations.

Large-scale unusual time series detection

Rob J Hyndman (1), Earo Wang (1) and Nikolay Laptev (2)

(1) Monash Business School, Monash University, Clayton, Victoria, Australia.
(2) Yahoo Labs, Sunnyvale, California, USA.

Abstract: It is becoming increasingly common for organizations to collect very large amounts of data over time, and to need to detect unusual or anomalous time series. For example, Yahoo has banks of mail servers that are monitored over time. Many measurements on server performance are collected every hour for each of thousands of servers. We wish to identify servers that are behaving unusually.
We compute a vector of features on each time series, measuring characteristics of the series. The features may include lag correlation, strength of seasonality, spectral entropy, etc. Then we use a principal component decomposition on the features, and use various bivariate outlier detection methods applied to the first two principal components. This enables the most unusual series, based on their feature vectors, to be identified. The bivariate outlier detection methods used are based on highest density regions and α-hulls.

A new R package for detecting unusual time series

The anomalous package provides some tools to detect unusual time series in a large collection of time series. This is joint work with Earo Wang (an honours student at Monash) and Nikolay Laptev (from Yahoo Labs). Yahoo is interested in detecting unusual patterns in server metrics.
The package is based on this paper with Earo and Nikolay.
The basic idea is to measure a range of features of the time series (such as strength of seasonality, an index of spikiness, first order autocorrelation, etc.). Then a principal component decomposition of the feature matrix is calculated, and outliers are identified in the 2-dimensional space of the first two principal component scores.
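The step from a feature matrix to the first two principal component scores can be sketched as follows. This is a minimal illustration in pure Python, not the code of the anomalous package: the function names (`standardize`, `leading_eigenpair`, `first_two_pc_scores`) are hypothetical, the features are assumed to be already computed (one row per series), and the eigenvectors are found by simple power iteration with deflation, which assumes the two largest eigenvalues are distinct and nonzero.

```python
import math
import random


def standardize(rows):
    """Center each feature column and scale it to unit sample variance."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    # Constant columns carry no information; map them to zeros.
    return [[(r[j] - means[j]) / sds[j] if sds[j] > 0 else 0.0
             for j in range(p)] for r in rows]


def covariance(z):
    """Sample covariance matrix of already-centered rows."""
    n, p = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)]
            for a in range(p)]


def mat_vec(c, v):
    return [sum(cij * vj for cij, vj in zip(row, v)) for row in c]


def leading_eigenpair(c, iters=500, seed=0):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in c]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = mat_vec(c, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(c, v)))
    return eigenvalue, v


def first_two_pc_scores(features):
    """Return (PC1, PC2) scores for each row of the feature matrix."""
    z = standardize(features)
    c = covariance(z)
    lam1, v1 = leading_eigenpair(c, seed=0)
    # Deflation: remove the first component, then find the next one.
    size = len(c)
    deflated = [[c[a][b] - lam1 * v1[a] * v1[b] for b in range(size)]
                for a in range(size)]
    _, v2 = leading_eigenpair(deflated, seed=1)
    return [(sum(x * a for x, a in zip(r, v1)),
             sum(x * b for x, b in zip(r, v2))) for r in z]
```

Each returned pair is one point in the 2-dimensional space in which the outlier search is then carried out. In practice a library routine for PCA would normally replace the hand-written power iteration.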

We use two methods to identify outliers.

1. A bivariate kernel density estimate of the first two PC scores is computed, and the points are ordered based on the value of the density at each observation. This gives us a ranking of most outlying (least density) to least outlying (highest density).
2. A series of α-convex hulls are computed on the first two PC scores with decreasing α, and points are classified as outliers when they become singletons separated from the main hull. This gives us an alternative ranking with the most outlying having separated at the highest value of α, and the remaining outliers with decreasing values of α.
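The first of these two methods, ranking points by estimated density, can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a product Gaussian kernel with a rule-of-thumb bandwidth (Scott's rule for two dimensions), which is a stand-in and not necessarily the estimator used by the package, and the function name `density_ranking` is hypothetical.

```python
import math


def density_ranking(points):
    """Rank 2D points from most outlying (lowest density) to least outlying.

    Returns (order, densities), where order is a list of indices into points.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")

    def sample_sd(values):
        mean = sum(values) / n
        return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

    # Assumed bandwidth choice: Scott's rule, h = sd * n^(-1/(d+4)) with d = 2.
    hx = sample_sd([p[0] for p in points]) * n ** (-1 / 6)
    hy = sample_sd([p[1] for p in points]) * n ** (-1 / 6)
    if hx == 0 or hy == 0:
        raise ValueError("points have zero spread in one dimension")

    densities = []
    for x, y in points:
        total = 0.0
        for u, v in points:
            total += math.exp(-0.5 * (((x - u) / hx) ** 2 + ((y - v) / hy) ** 2))
        densities.append(total / (n * 2 * math.pi * hx * hy))

    order = sorted(range(n), key=lambda i: densities[i])
    return order, densities
```

The first index in `order` is the candidate for the most unusual series. The α-hull method is not sketched here because it needs a proper computational geometry routine.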

I explained the ideas in a talk last Tuesday given at a joint meeting of the Statistical Society of Australia and the Melbourne Data Science Meetup Group. Slides are available here. A link to a video of the talk will also be added there when it is ready.
The density-ranking of PC scores was also used in my work on detecting outliers in functional data. See my 2010 JCGS paper and the associated rainbow package for R.
There are two versions of the package: one under an ACM licence, and a limited version under a GPL licence. Eventually we hope to make the GPL version contain everything, but we are currently dependent on the alphahull package which has an ACM licence.

A new open source data set for detecting time series outliers

Yahoo Labs has just released an interesting new data set useful for research on detecting anomalies (or outliers) in time series data. There are many contexts in which anomaly detection is important. For Yahoo, the main use case is in detecting unusual traffic on Yahoo servers.

The data set comprises real traffic to Yahoo services, along with some synthetic data. There are 367 time series in the data set, each of which contains between 741 and 1680 observations recorded at regular intervals. Each series is accompanied by an indicator series with a 1 if the observation was an anomaly, and 0 otherwise. The anomalies in the real data were determined by human judgement, while those in the synthetic data were generated algorithmically. For the synthetic data, some information about the components used to construct the data is also provided.

Although the Yahoo announcement claims that the data are publicly available, in fact they are only available to people with an edu address. Further, you have to apply to use them, and it takes about 24 hours before approval is granted. I have suggested that they remove these restrictions, and make the data available without restriction to anyone who wants to use them.

Research on anomaly detection in time series seems to be growing in popularity. Twitter has also released their own Anomaly Detection R package. Their approach has some similarities with my own tsoutliers function in the forecast package. The tso function in the tsoutliers package is another approach to the same problem.
Hopefully having a large public data set available will lead to improvements in time series outlier detection methods, at least for detecting outliers in internet traffic data.

2015/6/28 11:31:21
How the problem in this paper differs (p. 1)
We are interested in the time series that are anomalous relative to the other time series in the same cluster, or more generally, in the same set. This type of anomaly detection is different from univariate anomaly detection or even from a multivariate point anomaly detection [6] because we are interested in identifying entire time series that are behaving unusually in the context of other metrics.

An R package already exists (p. 2)
Authors' contributions
First, we introduce a novel and accurate method of using PCA with α-convex hulls for finding anomalous time series. Second, we perform a study of possible features that are useful for the types of time series dynamics seen in web-traffic time series.

Why PCA works (p. 2)
Therefore, loosely speaking, the first k principal components capture the k most prevalent patterns in the data.

Method used in this paper
To find anomalies in the first two PCs we use a multi-dimensional outlier detection algorithm. We have implemented both a density-based and an α-hull-based multidimensional outlier detection algorithm.
The density-based multi-dimensional anomaly detection algorithm [7] (Computing and Graphing Highest Density Regions) finds points in the first two principal components with lowest density. The α-hull method [15] (Generalizing the Convex Hull of a Sample: The R Package ...) is a generalization of the convex hull [6] (A Survey of Outlier Detection Methodologies), which is a bounding region of a point set. The α parameter in the α-hull method defines a generalized disk of radius α. When α is sufficiently large, the α-hull method is equivalent to the convex hull. Given α, an edge of the α-shape is drawn between two members of the finite point set if there exists a generalized disk of radius α containing the entire point set and the two points lie on its boundary.

2015/6/28 15:20:03
The variance of the variances across blocks measures the “lumpiness” of the series.
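As a small sketch of the lumpiness feature: split the series into blocks, take the variance of each block, then take the variance of those variances. The block width is a parameter here because the note above does not state one; the use of non-overlapping blocks, dropping any incomplete trailing block, is also an assumption, and the function name `lumpiness` is just illustrative.

```python
from statistics import variance


def lumpiness(series, width):
    """Variance of the per-block sample variances of a series.

    Blocks are consecutive and non-overlapping; a trailing block shorter
    than `width` is ignored (an assumption of this sketch).
    """
    if width < 2:
        raise ValueError("width must be at least 2 to compute a variance")
    blocks = [series[i:i + width]
              for i in range(0, len(series) - width + 1, width)]
    if len(blocks) < 2:
        raise ValueError("need at least two full blocks")
    return variance([variance(block) for block in blocks])
```

A series whose spread is the same in every block gives a value of zero, while a series that alternates between quiet and volatile stretches gives a large value.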
Some of our features rely on a robust STL decomposition.

2015/6/28 15:44:43
“Flat spots” are computed by dividing the sample space of a time series into ten equal-sized intervals, and computing the maximum run length within any single interval.
Finally, “crossing points” are defined as the number of times a time series crosses the mean line.
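Both definitions translate almost directly into code. The sketch below is one reading of them, with hypothetical function names: the "sample space" is taken to be the range from the minimum to the maximum observed value, and a crossing is counted whenever two consecutive observations lie on opposite sides of the mean (values equal to the mean count as not above it).

```python
def flat_spots(series, bins=10):
    """Longest run of consecutive observations falling in the same interval,
    after cutting the observed range into `bins` equal-width intervals."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return len(series)  # a constant series is one long flat spot
    width = (hi - lo) / bins
    # The maximum value is placed in the last interval.
    labels = [min(int((x - lo) / width), bins - 1) for x in series]
    longest = run = 1
    for prev, cur in zip(labels, labels[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def crossing_points(series):
    """Number of times consecutive observations switch sides of the mean."""
    mean = sum(series) / len(series)
    above = [x > mean for x in series]
    return sum(a != b for a, b in zip(above, above[1:]))
```

For instance, `flat_spots([0, 0, 0, 0, 5, 10, 1, 9])` finds the run of four zeros, and `crossing_points([1, 3, 1, 3, 1])` counts four crossings of the mean 1.8.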

2015/6/28 15:50:11
Our approach and results
Our approach first extracts the two most significant principal components (PCs) from all time series and then determines the outliers in the new 2D “feature space”. For multidimensional outlier detection on the PC space we show results for the density-based method (HDR) and for the α-hull method.

References

[6] R: A Language and Environment for Statistical Computing
[21] A PCA-based Similarity Measure for Multivariate Time Series

Large-scale unusual time series detection
Jun 2015, Working paper
