# The use case:

Every day, metrics are computed to check that our systems are doing fine. From time to time, bugs occur in the workflow that builds these metrics, and I have to build an algorithm that alerts us when there seems to be a problem, so that we can investigate where it is coming from.

# What I'm trying:

I don't know much about time series. The data I have shows a weekly seasonality, and the volume isn't constant (it's growing). See image: http://www.blabala.com (I don't have enough rep to post multiple images). Therefore, from what I've read, ARIMA seems to be a relevant model.

I stumbled upon the `tsoutliers` package in `R`, and its functions `locate.outliers.oloop()` and `discard.outliers()`, so I had the following idea: instead of trying to train something I don't master, let the automatic parts of the algorithm train on their own, and use the result: if it detects outliers, throw an alert.
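
To make the idea concrete, here is a minimal base-R sketch of the alert logic I have in mind (it does not use `tsoutliers` itself, and the synthetic data, the ARIMA order and the `cval` threshold are just placeholders): flag points whose model residuals exceed `cval` robust standard deviations, and alert if anything is flagged.

```r
set.seed(1)
n <- 120
trend <- seq_len(n) * 0.5                               # growing volume
season <- rep(c(0, 1, 2, 3, 2, 1, 0), length.out = n)   # weekly pattern
y <- trend + season + rnorm(n)
y[100] <- y[100] + 15                                   # injected anomaly

ts_y <- ts(y, frequency = 7)
# simple + seasonal differencing with an ARMA(1,1) on top (placeholder order)
fit <- arima(ts_y, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 0)))
res <- residuals(fit)

cval <- 3
# flag residuals larger than cval robust standard deviations
flagged <- which(abs(res) > cval * mad(res))
alert <- length(flagged) > 0
```

In this sketch the injected point at index 100 ends up in `flagged`, which is the condition I would use to raise the alert.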

- First question: does it seem to be a good or a bad idea?

# If it's a good approach:

I'm having trouble getting something that works well. I've explored two ways of doing things:

**1) Train on the whole series, and run the detection on itself.**

```
library(tsoutliers)

cval <- 3
# locate outliers on the full series, using the ARIMA fit arima_train
mo2 <- locate.outliers.oloop(TS_train, arima_train,
                             maxit.iloop = 100, maxit.oloop = 100,
                             cval = cval)
# remove the effect of the detected outliers
d2 <- discard.outliers(mo2, TS_train, method = "bottom-up",
                       tsmethod.call = arima_train$call, cval = cval)
OTL <- TS_train[sort(d2$outliers$ind)]
# plot the series and highlight the detected outliers in red
plot(1:length(TS_train), TS_train, type = "p")
lines(1:length(TS_train), TS_train, col = "grey")
points(sort(d2$outliers$ind), OTL, col = "red")
```

This hasn't given very good results. See image 1.

**2) Use a sliding window:** train the algorithm on days 1 to 60, evaluate day 61; slide one day, train on days 2 to 61, evaluate day 62; and so on.

```
library(tsoutliers)
library(forecast)   # for auto.arima()
# DT is a data.table with the daily metric in column col1

plage <- 60   # window length, in days
cval <- 3
ylim <- range(DT$col1) + c(-1, 1) * 0.05 * diff(range(DT$col1))

for (i in (plage + 1):65) {
  # window covering the last `plage` days up to day i
  TS_tmp <- ts(DT[(i - plage):i]$col1, frequency = 7)
  aa <- auto.arima(TS_tmp)
  mo_tmp <- locate.outliers.oloop(TS_tmp, aa,
                                  maxit.iloop = 100, maxit.oloop = 100,
                                  cval = cval)
  d_tmp <- discard.outliers(mo_tmp, TS_tmp, method = "bottom-up",
                            tsmethod.call = aa$call, cval = cval)
  Outl <- TS_tmp[sort(d_tmp$outliers$ind)]
  # plot the current window and highlight the detected outliers in red
  plot(DT[(i - plage):i]$col1, type = "b",
       xlim = c(1, nrow(DT)), ylim = ylim)
  points(sort(d_tmp$outliers$ind), Outl, col = "red")
}
```

It seems to perform worse... (See image 2)

Then again, I don't know much about time series, and I don't really know how to evaluate how well a model performs.
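
For evaluation, one thing I considered is one-step-ahead backtesting: refit on each window, forecast the next day, and compare mean absolute errors across window lengths. A minimal base-R sketch on made-up data (the parameter-free (0,1,0)(0,1,0)[7] model and the window sizes are just illustrative assumptions):

```r
set.seed(2)
n <- 120
# fake daily series: random walk plus a weekly pattern
y <- cumsum(rnorm(n)) + rep(c(0, 1, 2, 3, 2, 1, 0), length.out = n)

# mean absolute one-step-ahead forecast error for a given window length
one_step_mae <- function(y, window) {
  errs <- sapply((window + 1):length(y), function(i) {
    train <- ts(y[(i - window):(i - 1)], frequency = 7)
    # simple + seasonal differencing only, so there is nothing to estimate
    fit <- arima(train, order = c(0, 1, 0),
                 seasonal = list(order = c(0, 1, 0)))
    abs(y[i] - predict(fit, n.ahead = 1)$pred[1])
  })
  mean(errs)
}

mae60 <- one_step_mae(y, 60)
mae90 <- one_step_mae(y, 90)
```

The window length with the lower error would then be the one to keep, but I'm not sure this is the right way to compare settings.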

My questions:

Is either of the two approaches a good idea? Which one is better? Any better ideas?

I'm using all 9 months of data I have in approach 1. In approach 2 (if it's relevant), how much data should the training set include? (currently: 60 days)

Any advice on how to best calibrate the model(s) ?