## Detecting components in timeseries

26

20

I'm looking at this sequence of values:

I'd like to detect the points where the center of the time-series shifts (around x=1000 and x=2000). Many of the transforms and smoothing methods I have tried destroy this information.

We can visually see that there are 3 "major" subsequences centered around y=400, 200, and then 400 again, but I'm not sure how programmatically locate these major components as continuous subsequences.

How can I automatically detect what the underlying linear step function is here?

To get going, you can grab the data like this:

data = Uncompress[FromCharacterCode[
Flatten[ImageData[Import["http://i.stack.imgur.com/1D962.png"],"Byte"]]]]


I find this an interesting question. You are looking for cluster points in the series. Are there more than two? Do you know them in advance? (In this case it would be not so hard.) Maybe it's not the right place here to ask - hence no answers. Asking in a math forum? – Darwin1871 – 2016-01-28T00:55:15.327

Maybe you could do a windowed mean and standard deviation and make a hypothesis test whether the next (or the +k-th) data point lies within a certain confidence interval. Doing this with different window sizes. – Darwin1871 – 2016-01-28T01:12:11.510

Why not a Bai-Perron test for structural breaks? Perron has Matlab code on his website which is very easy to use. – Titus – 2019-02-11T20:55:11.803

34

I had a go with HiddenMarkovProcess[], based on the assumption that the data is normally distributed around two different means (it looks like it!). This approach should be fine for cases where the number of "states" is small, e.g. 2 in this case. Otherwise you're looking at Infinite Hidden Markov Models, or see the bottom of this answer.

To remove some spurious detections, I first applied a median filter to smooth the data. You could also chop off the first and last 50 points (that drop to zero) to improve some of the estimates:

(* data is the provided tabulated list *)
ydata = MedianFilter[data[[All, 2]], 40];

hmm = EstimatedProcess[ydata, HiddenMarkovProcess[2, "Gaussian"]];
foundStates =
FindHiddenMarkovStates[ydata, hmm, "PosteriorDecoding"];

(* Extract the mean positions from the Markov model *)
hmmMeans = First@(# /. NormalDistribution -> List) & /@ Last@hmm;
(* {184.383, 391.369} *)

(* Now generate the piecewise data *)
meanFoundStates = foundStates /. Table[i -> hmmMeans[[i]], {i, 2}];

(* Now plot for comparison *)
ListLinePlot[{data[[All, 2]], meanFoundStates}]


Finally, you can detect the positions of the shift (e.g. at x=1000) by using:

FoldList[Plus, 1, Length@# & /@ Split[foundStates, #2 - #1 == 0 &]]
(* {1, 253, 907, 2044, 2946, 3143} *)


You can see that the "big" changes you refer to at x=1000 and x=2000 are actually picked up at 907 and 2044.

Another good alternative would be to make the most of RLink and use some of the packages for changepoint detection available in R. There are a few examples in this blog post, and I would recommend looking at:

• bcp - Bayesian Change Point detection
• ecp - Nonparametric Multiple Change Point Analysis

2That is cool... +1 – garej – 2016-01-27T10:42:02.080

Nice work, but I'm still wondering how to find the x points 1k and 2k where the boundaries are? – M.R. – 2016-01-27T16:08:59.520

@M.R. see my update – dr.blochwave – 2016-01-27T18:39:32.570

Is there a way to flatten out the discontinuities at {517, 539, 596, 632} and at {2218, 2237, 2368, 2392, 2708, 2735} so that the graph looks more like @dr.belisarius's graph? And in general there many be many more plateaus not just the two means, but n means, will your approach work for that too? – M.R. – 2016-01-27T21:29:23.517

@M.R. It will work if n is finite (otherwise you need to look into infinite Markov models). Have you looked into the R examples I posted above? They are very well-explained in the documentation and I've had great success with both. – dr.blochwave – 2016-01-27T21:32:11.647

I'm reading the pdf on the Bayesian Change Point detection in R, this is a great resource, thanks! – M.R. – 2016-01-27T21:36:44.027

@M.R. great, also, see the update - a small filtering prior to the detection will help remove the discontinuities. – dr.blochwave – 2016-01-27T21:39:19.843

Thanks for the update this looks great now. – M.R. – 2016-01-27T21:39:29.330

@M.R. try the R code in the blogpost I linked to: http://www.r-bloggers.com/change-point-detection-in-time-series-with-r-and-tableau/

– dr.blochwave – 2016-01-27T21:41:48.450

Could I know why in First@(# /. NormalDistribution -> List),we use that First but not Last? – yode – 2017-04-14T16:12:06.347

18

ListPlot@{l1, msf = MeanShiftFilter[l1, IntegerPart[Length@l1/10], MedianDeviation@l1,
MaxIterations -> 10]}


And here are the detected means (assuming there are three):

fc = FindClusters[msf];
Mean /@ fc
( *{3.77282, 220.788, 387.444} *)


This is a beautiful answer! How to I get the x values where it jumps? – M.R. – 2016-01-27T18:19:57.460

How and how did you know to divide by 10? – M.R. – 2016-01-27T18:20:50.880

It's perhaps worth pointing out that this doesn't actually give discrete values for each "state" - just a smoother version of the data. Using my method for detecting the jumps (FoldList[Plus, 1, Length@# & /@ Split[msf, #2 - #1 == 0 &]]) reveals this. – dr.blochwave – 2016-01-27T18:42:30.390

Although subsequent clustering around the means might work! – dr.blochwave – 2016-01-27T18:44:42.477

12

Another approach is to use compound median filtering which returns a blocky function. Then threshold the jumps between blocks. No assumptions about the number or size of blocks is made.

Function to plot the input series as discrete jumps.

BlockPlot[s_] :=
Partition[
Flatten[{s[[1]],
Table[{{s[[i, 1]], s[[i - 1, 2]]}, s[[i]]}, {i, 2, Length[s]}]}],
2]


Median filter until the signal does not change, then repeat for successively wider window radii r.

MedianFilterRoot[x_, r_] := FixedPoint[Round[MedianFilter[#, r]]&, x]

CompoundMedianFilter[x_?VectorQ, r_] := Fold[MedianFilterRoot[#1,#2]&,x,Range[r]]

CompoundMedianFilter[x_?MatrixQ, r_] :=
Transpose[{x[[All,1]], Fold[MedianFilterRoot[#1, #2]&,x[[All,2]],Range[r]]}]


Find locations where the signal jumps more than the threshold t.

DifferenceThreshold[y_List, t_] :=
Pick[Most[y], UnitStep[Abs[Differences[y[[All, 2]]]] - t], 1]


Subsample the imported data v down to w, for faster execution, and find the maximum. Both are global variables.

w = v[[Range[1, 3142, 10]]];
max = Max[w[[All, 2]]];


Adjust the maximum filter window radius r, and the jump threshold t.

Manipulate[
Module[{y = CompoundMedianFilter[w, r]},
ListLinePlot[BlockPlot[y], PlotStyle -> Directive[Thick, Blue],
Prolog -> {Red, Point[w]},
Epilog -> {Darker[Green],
Map[Line[{{#[[1]], 0}, {#[[1]], max}}]&, DifferenceThreshold[y, t]]},
Frame -> True, PlotRange -> {0, max}, ImageSize -> 600]],
{{r, 10, "Max CMF Radius"}, 1, 30, 1, Appearance -> "Labeled"},
{{t, 30, "Jump Threshold"}, 10, 200, 10, Appearance -> "Labeled"}
]