Anomaly detection in time series



The use case:

Every day, we compute metrics to check that our systems are doing fine. From time to time, bugs occur in the workflow that builds these metrics, and I have to build an algorithm that alerts us when there seems to be a problem, so that we can check where it is coming from.

What I'm trying:

I don't know much about time series. The data I have displays weekly seasonality, and the volume isn't constant (it's growing). See image (I don't have enough rep to post multiple images). Therefore, from what I've read, ARIMA seems to be a relevant model.

I stumbled upon the tsoutliers package in R and its functions locate.outliers.oloop() and discard.outliers(), so I had the following idea: instead of trying to train something I don't master, let the automatic parts of the algorithm train on their own, and use that: if it detects outliers, throw an alert.

  • First question: does this seem like a good or a bad idea?
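(For reference, tsoutliers also ships a high-level wrapper, tso(), which fits a model — auto.arima by default — and locates outliers in a single call. A minimal sketch of the "let the package do the work" idea, assuming TS_train is the metric as a weekly ts object; the alert message is my own illustration, not part of the package:)

```r
# Sketch: tso() is the high-level entry point of the tsoutliers package.
# It fits a model (auto.arima by default) and locates outliers in one call.
library(tsoutliers)

# TS_train: the metric as a weekly ts object, e.g. ts(DT$col1, frequency = 7)
fit <- tso(TS_train, types = c("AO", "LS", "TC"), cval = 3)

print(fit$outliers)                  # one row per detected outlier
if (nrow(fit$outliers) > 0) {
  message("Alert: possible anomaly in the metric")
}
```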

If it's a good approach:

I'm having trouble getting something that works well. I've explored two approaches:

1) Train on the whole series and run the detection on that same series.

library(forecast)    # auto.arima()
library(tsoutliers)  # locate.outliers.oloop(), discard.outliers()

cval <- 3
arima_train <- auto.arima(TS_train)   # fit passed to both functions below

mo2 <- locate.outliers.oloop(TS_train, arima_train,
                             maxit.iloop = 100,
                             maxit.oloop = 100,
                             cval = cval)
d2 <- discard.outliers(mo2, TS_train, method = "bottom-up",
                       tsmethod.call = arima_train$call,
                       cval = cval)

OTL <- TS_train[sort(d2$outliers$ind)]

plot(1:length(TS_train), TS_train, type = "p")
lines(1:length(TS_train), TS_train, col = "grey")
points(sort(d2$outliers$ind), OTL, col = "red")

This hasn't given very good results. See image 1.

2) Use a sliding window: train the algorithm on days 1 to 60, evaluate day 61, slide by one day, train on days 2 to 61, evaluate day 62, and so on.

plage <- 60
cval  <- 3
ylim  <- range(DT$col1) + c(-1, 1) * .05 * diff(range(DT$col1))

for (i in (plage + 1):65) {
  TS_tmp <- ts(DT[(i - plage):i]$col1, frequency = 7)
  aa <- auto.arima(TS_tmp)
  mo_tmp <- locate.outliers.oloop(TS_tmp, aa,
                                  maxit.iloop = 100, maxit.oloop = 100,
                                  cval = cval)
  d_tmp <- discard.outliers(mo_tmp, TS_tmp, method = "bottom-up",
                            tsmethod.call = aa$call, cval = cval)

  Outl <- TS_tmp[sort(d_tmp$outliers$ind)]
  print(plot(DT[(i - plage):i]$col1, type = "b", xlim = c(1, nrow(DT)), ylim = ylim))
  print(points(sort(d_tmp$outliers$ind), Outl, col = "red"))
}

It seems to perform worse... (See image 2)

Then again, I don't know much about time series and don't really know how to evaluate how well a model performs.
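(Since there are no labelled anomalies here, one way to measure how well a detector performs is to inject artificial spikes into otherwise clean-looking data and count how many are recovered. The sketch below uses only base R and a deliberately simple stand-in detector — a running-median filter plus a 3·MAD threshold, both my own choices, not anything from tsoutliers — to show the evaluation idea:)

```r
# Build a clean series with trend + weekly seasonality + noise,
# inject known spikes, then score any detector against them.
set.seed(1)
n <- 200
clean <- 100 + 0.3 * seq_len(n) +
  rep(sin(2 * pi * (1:7) / 7) * 5, length.out = n) + rnorm(n)
truth <- sample(n, 5)            # positions of injected anomalies
y <- clean
y[truth] <- y[truth] + 30

detect <- function(y) {
  # stand-in detector: residual from a running median, robust threshold
  r <- y - stats::runmed(y, 11)
  which(abs(r) > 3 * mad(r))
}

found     <- detect(y)
recall    <- mean(truth %in% found)        # fraction of injected spikes caught
false_pos <- length(setdiff(found, truth)) # flagged points that were clean
cat(sprintf("recall = %.2f, false positives = %d\n", recall, false_pos))
```

The same harness works for either of the two approaches above: replace detect() with the tsoutliers-based pipeline and compare recall and false positives across candidate window sizes or cval values.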

My questions:

  • Is either of the two approaches a good idea? Which one is better? Any better ideas?

  • I'm using the 9 months of data I have in approach 1. In approach 2 (if it's relevant), how much data should my training window contain? (currently: 60 days)

  • Any advice on how to best calibrate the model(s) ?



Posted 2017-08-30T10:03:49.240

Reputation: 61



I suggest removing the (growing) trend from your data first and then modelling the detrended series. The difference between the model and the observed values could then serve as your outlier signal and trigger the alarm.

see this:
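(A minimal base-R sketch of that idea, on simulated data: stl() separates the growing trend and the weekly seasonality, and points whose remainder exceeds a robust threshold — 3·MAD, an arbitrary choice of mine — are flagged:)

```r
# Simulated weekly-seasonal series with a growing trend and two injected spikes.
set.seed(42)
n <- 180
trend  <- seq(100, 200, length.out = n)                   # growing volume
season <- rep(c(0, 2, 4, 6, 4, 2, -18), length.out = n)   # weekly pattern
y <- trend + season + rnorm(n, sd = 2)
y[c(50, 120)] <- y[c(50, 120)] + 40                       # inject anomalies

ts_y  <- ts(y, frequency = 7)
fit   <- stl(ts_y, s.window = "periodic", robust = TRUE)  # trend + season + remainder
resid <- as.numeric(fit$time.series[, "remainder"])

# Flag points whose remainder exceeds 3 robust standard deviations (MAD).
threshold <- 3 * mad(resid)
alerts <- which(abs(resid) > threshold)
print(alerts)   # the injected spikes at 50 and 120 are among the flagged indices
```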




You could possibly try the MDI algorithm based on this paper.

The authors have achieved great results in detecting anomalies for spatio-temporal time series data. It is based on comparing the probability distributions on specific intervals of the time series as compared to the rest of the time series. The paper describes how they approach this seemingly complicated combinatorial optimization problem.

Also, they have released their code. It even comes with a GUI for non-spatial but temporal data.

