*Disclaimer, I have not tried any of these ideas.*

# Predict the CI directly

*Note: This method will require very large batch sizes, which may not be possible due to memory constraints.*

Set the batch size to something large. You'll also need to decide what you want your confidence interval to be; 95% matches a lot of expectations, but may be less numerically stable than something like 90% or 80%. We'll call this value $0 < c < 1$.

In addition to outputting a prediction $\hat{y}_i$ of the true value $y_i$, have your model also output a lower bound $l_i(c)$ and an upper bound $u_i(c)$ for each individual estimate.

Now, define
$$
\begin{aligned}
z = {} & \left(\frac{1-c}{2} - \frac{\sum_{i=1}^n (1\ \text{ if }\ (y_i < l_i(c))\ \text{ else }\ 0)}{n} \right)^2 \\
{} + {} & \left(\frac{1-c}{2} - \frac{\sum_{i=1}^n (1\ \text{ if }\ (y_i > u_i(c))\ \text{ else }\ 0)}{n} \right)^2
\end{aligned}
$$

In essence, $z$ is a (squared) measure of how often the predicted lower/upper bounds are exceeded vs. how often we expect them to be exceeded. In practice, some tweaks to exactly how $z$ is computed may be necessary, but the main idea of penalizing the model for being too lax or too restrictive in its estimates of the lower/upper bounds should be preserved.

Add $z$ to your MSE loss function. You'll likely need to balance the two losses, i.e. add an additional hyperparameter $\alpha$ to your model such that $\text{TOTAL LOSS} = \text{MSE} + \alpha z$.

As an example, if you set your batch size/time-step length so that there are 500 distinct estimates (i.e. $n = 500$) and set $c$ to 0.9, in theory this training procedure should encourage the model to estimate $l_i(c)$ and $u_i(c)$ such that exactly 25 of the actual $y_i$ values fall below $l_i(c)$ and exactly 25 fall above $u_i(c)$.
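The coverage penalty and combined loss above can be sketched as follows. This is a minimal NumPy illustration of the idea, not a drop-in training loss; the function names are made up for this example, and in a real model you would compute the same quantities on a batch of network outputs (with a differentiable surrogate for the hard indicator, which does not produce useful gradients as written).

```python
import numpy as np

def coverage_penalty(y, lower, upper, c):
    """Squared deviation between observed and expected tail exceedance rates.

    y, lower, upper: arrays of true values and predicted CI bounds, one entry
    per sample in the batch; c: target coverage, e.g. 0.9.
    """
    n = len(y)
    tail = (1.0 - c) / 2.0          # expected rate in each tail, e.g. 0.05
    below = np.sum(y < lower) / n   # observed lower-tail exceedance rate
    above = np.sum(y > upper) / n   # observed upper-tail exceedance rate
    return (tail - below) ** 2 + (tail - above) ** 2

def total_loss(y, y_hat, lower, upper, c=0.9, alpha=1.0):
    """MSE plus the alpha-weighted coverage penalty z."""
    mse = np.mean((y - y_hat) ** 2)
    return mse + alpha * coverage_penalty(y, lower, upper, c)
```

With $n = 500$ and $c = 0.9$, the penalty is exactly zero when 25 true values fall below the lower bounds and 25 fall above the upper bounds, matching the worked example above.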

# Predict the error directly

In addition to getting a prediction $\hat{y}$ of the true value $y$ and training with the MSE loss function, add another output to your model which attempts to predict the squared error $\epsilon = (\hat{y} - y)^2$ directly.

You can compute the MSE of the estimate $\hat{\epsilon}$ of $\epsilon$ and combine it with your existing MSE to get the total loss for training.

You can then use this value to normalize your errors to try to get an anomaly score that is usable across different sub-populations:

$$
\text{anomaly score} = \frac{(y - \hat{y})^2}{\hat{\epsilon}}
$$
where higher scores indicate that the true error is larger than expected.
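A minimal sketch of this second method, again in NumPy with illustrative function names (in practice `eps_hat` would be the output of an extra head on your network):

```python
import numpy as np

def combined_loss(y, y_hat, eps_hat):
    """MSE on y, plus MSE of the error head against the realized squared error."""
    sq_err = (y - y_hat) ** 2
    return np.mean(sq_err) + np.mean((eps_hat - sq_err) ** 2)

def anomaly_score(y, y_hat, eps_hat):
    """Squared error normalized by the model's own predicted squared error."""
    return (y - y_hat) ** 2 / eps_hat
```

A score near 1 means the error is roughly as large as the model expected for that input; scores well above 1 flag observations whose error is unexpectedly large, regardless of how noisy that sub-population normally is.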

Do you need the predicted CI? Have you tried using |actual - predicted| as the measure of anomaly-ness? – kbrose – 2018-04-25T00:56:13.760

There is no static threshold of |actual - predicted| being anomalous. E.g. at night there is less noise than during the day. – Mike Evers – 2018-04-25T07:28:11.027

Please can someone help? – Mike Evers – 2018-04-25T13:55:14.807

tried to give a couple ideas in an answer. – kbrose – 2018-04-25T20:11:15.127