Linearly increasing data with manual reset


I have a linearly increasing time series dataset from a sensor, with values ranging between 50 and 150. I've implemented a simple linear regression to fit a regression line to this data, and I'm predicting the date when the series will reach 120.

Everything works fine while the series moves upwards. But there are cases in which the sensor reaches around 110 or 115 and is reset; the values then start over at, say, 50 or 60.

This is where I start facing issues with the regression line: it starts sloping downwards and predicts a date in the past. I think I should consider only the subset of data since the last reset. However, I'd like to know whether there are any algorithms that handle this case.

I'm new to data science and would appreciate any pointers to move further.

Edit: nfmcclure's suggestions applied

Before applying the suggestions:

[image]

Below is a snapshot of what I got after splitting the dataset where the reset occurs, along with the slopes of the two sets.

[image]

Finding the mean of the two slopes and drawing the line using the mean slope:

[image]

Is this OK?


Posted 2014-07-04T05:12:44.707

Reputation: 183

You have the right idea, except that when plotting it you should restart the line wherever the series restarts after each reset. For estimating when it will hit, say, 120, see the first edit in my answer. – nfmcclure – 2014-08-27T16:13:41.443



I thought this was an interesting problem, so I wrote a sample data set and a linear slope estimator in R. I hope it helps you with your problem. I'm going to make some assumptions; the biggest is that you want to estimate a single constant slope from the linear segments in your data. The other assumption, needed to separate the blocks of linear data, is that a 'reset' can be found by comparing consecutive differences and flagging those that are more than X standard deviations below the mean difference. (I chose 4 standard deviations, but this can be changed.)

Here is a plot of the sample data; the code for generating it is at the bottom.

[image: Sample Data]

For starters, we find the breaks and fit each set of y-values and record the slopes.

# Find the differences between adjacent points
diffs = y_data[-1] - y_data[-length(y_data)]
# Find the break points (here I use 4 s.d.'s)
break_points = c(0,which(diffs < (mean(diffs) - 4*sd(diffs))),length(y_data))
# Create the lists of y-values (one entry per segment between break points)
y_lists = lapply(1:(length(break_points)-1),function(x){
  y_data[(break_points[x]+1):break_points[x+1]]
})
# Create the lists of x-values
x_lists = lapply(y_lists,function(x) 1:length(x))
#Find all the slopes for the lists of points
slopes = unlist(lapply(1:length(y_lists), function(x) lm(y_lists[[x]] ~ x_lists[[x]])$coefficients[2]))

Here are the slopes: (3.309110, 4.419178, 3.292029, 4.531126, 3.675178, 4.294389)

And we can just take the mean to find the expected slope (3.920168).
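As a quick check of that average, here is a minimal sketch that recomputes it from the six slopes listed above. The spread measures (`slope_sd`, `slope_se`) are an addition for illustration and are not part of the original answer; they only give a rough sense of how much the per-segment estimates vary.

```r
# Slopes reported above, one per segment between resets
slopes <- c(3.309110, 4.419178, 3.292029, 4.531126, 3.675178, 4.294389)

# Unweighted mean of the segment slopes, as used in the answer
mean_slope <- mean(slopes)

# Illustrative extras (assumption, not in the original answer):
# spread of the per-segment estimates and a rough standard error
slope_sd <- sd(slopes)
slope_se <- slope_sd / sqrt(length(slopes))

print(round(mean_slope, 6))  # 3.920168
```

Note that an unweighted mean treats short and long segments alike; weighting by segment length would be one alternative if the segments differ a lot in size.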

Edit: Predicting when series reaches 120

I realized I didn't finish predicting when the series reaches 120. If we estimate the slope to be m and we see a reset to a value x (x < 120), we can predict how much longer it will take to reach 120 with some simple algebra.

$$t = \frac{120 - x}{m}$$

Here, t is the time it would take to reach 120 after a reset, x is what it resets to, and m is the estimated slope. I'm not going to even touch the subject of units here, but it's good practice to work them out and make sure everything makes sense.
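The algebra amounts to dividing the remaining distance to 120 by the slope. A minimal sketch, assuming a reset value of 50 (as in the sample data) and the mean slope estimated above; the function name `time_to_target` and its arguments are hypothetical and chosen only for this illustration.

```r
# Time remaining until the series reaches `target`, assuming it rises
# linearly at `slope` units per time step starting from `reset_value`.
# (Hypothetical helper, not part of the original answer.)
time_to_target <- function(reset_value, slope, target = 120) {
  stopifnot(slope > 0, reset_value < target)
  (target - reset_value) / slope
}

m <- 3.920168  # mean slope estimated above
t_remaining <- time_to_target(reset_value = 50, slope = m)
print(t_remaining)  # roughly 17.9 time steps
```

The result is in whatever time units the x-axis uses, so with one reading per day it would be about 18 days after the reset.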

Edit: Creating The Sample Data

The sample data consist of 100 points with a slope of 4 plus random noise (hopefully we will recover this slope). When the y-values exceed a cutoff, they reset to 50. The cutoff is randomly chosen between 115 and 120 for each reset. Here is the R code to create the data set.

# Create Sample Data
x_data = 1:100 # x-data
y_data = rep(0,length(x_data)) # Initialize y-data
y_data[1] = 50 
reset_level = sample(115:120,1) # Select initial cutoff
for (i in x_data[-1]){ # Loop through rest of x-data
  if(y_data[i-1]>reset_level){ # check if y-value is above cutoff
    y_data[i] = 50             # Reset if it is and
    reset_level = sample(115:120,1) # rechoose cutoff
  }else {
    y_data[i] = y_data[i-1] + 4 + (10*runif(1)-5) # Or just increment y with random noise
  }
}
plot(x_data,y_data) # Plot data


Posted 2014-07-04T05:12:44.707

Reputation: 485

I think your answer is useful to the problem. Just some suggestions: I would move the data generation code to the bottom, or even to an external Gist, since it is not really part of the proposed solution. And I would elaborate a bit more on the fact that you are using 4 standard deviations to detect resets: right now, it is just a comment lost in the code, and it is the core of your solution. – logc – 2014-07-14T13:26:23.830

Good ideas. Will do. – nfmcclure – 2014-07-14T15:18:33.427

Hi nfmcclure, I've applied your suggestion and updated the post. Please provide your comments. – ArunDhaJ – 2014-08-27T07:16:43.810


Your problem is that the resets aren't part of your linear model. You can either cut your data into fragments at the resets, so that no reset occurs within any fragment, and fit a linear model to each fragment, or you can build a more complicated model that allows for resets. In the latter case, either the reset times have to be put into the model manually, or they have to be free parameters of the model that are determined by fitting the model to the data.
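One way to sketch the second option with manually entered reset times is a single regression with a common slope and a separate intercept for each fragment. The toy data, the reset positions in `reset_at`, and all variable names below are assumptions made up for this illustration; they are not the asker's sensor readings.

```r
set.seed(1)  # hypothetical toy data, for illustration only
x <- 1:60
# Three ramps of 20 points with slope 4, each restarting at 50
y <- 50 + 4 * ((x - 1) %% 20) + rnorm(length(x), sd = 1)

# Reset times entered manually: a new fragment starts at x = 21 and x = 41
reset_at <- c(21, 41)
segment <- factor(findInterval(x, reset_at))  # levels "0", "1", "2"

# One common slope, one intercept per fragment
fit <- lm(y ~ x + segment)
common_slope <- unname(coef(fit)["x"])

# Extrapolate the latest fragment to find where it crosses 120
last_seg <- levels(segment)[nlevels(segment)]
intercept_last <- unname(coef(fit)["(Intercept)"]) +
  unname(coef(fit)[paste0("segment", last_seg)])
x_at_120 <- (120 - intercept_last) / common_slope
```

Compared with averaging per-fragment slopes, this pools all points into one slope estimate, so longer fragments automatically carry more weight. Treating the reset times as free parameters would need a change-point or segmented-regression fit, which this sketch does not attempt.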


Posted 2014-07-04T05:12:44.707

Reputation: 805