# The use case

Every day, we compute metrics to check that our systems are doing fine. From time to time, bugs occur in the workflow that builds these metrics, and I have to build an algorithm that alerts us when there seems to be a problem, so that we can check where it is coming from.

# What I'm trying

I don't know much about time series. The data I have displays a weekly seasonality, and the volume isn't constant (it's growing). See image: http://www.blabala.com (I don't have enough rep to post multiple images). Therefore, from what I've read, ARIMA seems to be a relevant model.
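To check that reading of the data, here is a minimal sketch (assuming the series lives in `DT$col1`, as in the code further down):

```r
library(forecast)

# weekly seasonality -> 7 observations per seasonal cycle
y <- ts(DT$col1, frequency = 7)
# decompose to eyeball the growing trend and the weekly pattern
plot(stl(y, s.window = "periodic"))
```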

I stumbled upon the tsoutliers package in R, and its functions locate.outliers.oloop() and discard.outliers(), so I had the following idea: instead of trying to train something I don't master, let the automatic parts of the algorithm train on their own, and use that: if it detects outliers, then throw an alert.

• First question: does this seem like a good or a bad idea?
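To make the idea concrete, here is a minimal sketch of the alerting logic I have in mind (assuming the series is in `DT$col1`; tso() is the package's all-in-one wrapper around the locate/discard steps):

```r
library(tsoutliers)

y <- ts(DT$col1, frequency = 7)
# automatic ARIMA fit + outlier search
# (AO = additive outlier, LS = level shift, TC = temporary change)
fit <- tso(y, types = c("AO", "LS", "TC"))
# throw an alert if the most recent observation is among the detected outliers
if (nrow(fit$outliers) > 0 && max(fit$outliers$ind) == length(y)) {
  message("ALERT: the latest point looks anomalous")
}
```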

# If it's a good approach

I'm having trouble getting something that works well. I've explored two ways of doing things:

1) train the whole thing, and run it on itself.

```r
library(tsoutliers)

cval <- 3
# locate candidate outliers with the automatic outer/inner loops
mo2 <- locate.outliers.oloop(TS_train,
                             arima_train,
                             maxit.iloop = 100,
                             maxit.oloop = 100,
                             cval = cval)
# keep only the outliers that survive the bottom-up discarding step
d2 <- discard.outliers(mo2, TS_train, method = "bottom-up",
                       tsmethod.call = arima_train$call, cval = cval)
OTL <- TS_train[sort(d2$outliers$ind)]
plot(1:length(TS_train), TS_train, type = 'p')
lines(1:length(TS_train), TS_train, col = "grey")
points(sort(d2$outliers$ind), OTL, col = "red")
```

This hasn't given very good results. See image 1.

2) have a sliding window: train the algorithm with days 1 to 60 as training data, evaluate day 61, slide one day, train with days 2 to 61, evaluate day 62, and so on.

```r
library(tsoutliers)
library(forecast)  # for auto.arima()

plage <- 60
cval <- 3
ylim <- range(DT$col1) + c(-1, 1) * .05 * diff(range(DT$col1))
for (i in (plage + 1):65) {
  TS_tmp <- ts(DT[(i - plage):i]$col1, frequency = 7)
  aa <- auto.arima(TS_tmp)
  mo_tmp <- locate.outliers.oloop(TS_tmp, aa,
                                  maxit.iloop = 100, maxit.oloop = 100,
                                  cval = cval)
  d_tmp <- discard.outliers(mo_tmp, TS_tmp, method = "bottom-up",
                            tsmethod.call = aa$call, cval = cval)
  Outl <- TS_tmp[sort(d_tmp$outliers$ind)]
  print(plot(DT[(i - plage):i]$col1, type = "b",
             xlim = c(1, nrow(DT)), ylim = ylim))
  print(points(sort(d_tmp$outliers$ind), Outl, col = "red"))
}
```


It seems to perform worse... (See image 2)

Then again, I don't know much about time series and don't really know how to evaluate how well a model performs.
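For comparison, here is the kind of simple baseline I could measure both approaches against — a sketch, assuming the same `DT$col1` layout as above: refit on a sliding window and raise an alert whenever the new day falls outside the one-step-ahead 99% forecast interval.

```r
library(forecast)

plage <- 60
alerts <- c()
for (i in (plage + 1):nrow(DT)) {
  # fit on the previous 60 days only
  y <- ts(DT[(i - plage):(i - 1)]$col1, frequency = 7)
  fc <- forecast(auto.arima(y), h = 1, level = 99)
  # flag day i if it lands outside the 99% prediction interval
  if (DT$col1[i] < fc$lower[1] || DT$col1[i] > fc$upper[1]) {
    alerts <- c(alerts, i)
  }
}
alerts  # days that would have triggered an alert
```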

# My questions

• Is either of the two approaches a good idea? Which one is better? Any better ideas?

• In approach #1 I'm using all 9 months of data I have. In approach #2 (if it's relevant), how much data should be included in the training set? (Currently: 60 days.)

• Any advice on how to best calibrate the model(s) ?