I often build models (classification or regression) in which some of the predictor variables are sequences, and I have been looking for recommended techniques for summarizing those sequences so they can be included as predictors in the model.

As a concrete example, say a model is being built to predict whether a customer will leave the company in the next 90 days (at any time between t and t+90, so a binary outcome). One of the available predictors is the level of the customer's financial balance for periods t_0 to t-1. Suppose this consists of monthly observations for the prior 12 months (i.e. 12 measurements).

I am looking for ways to construct features from this series. I currently use descriptive statistics of each customer's series, such as the mean, high, low, and standard deviation, and I fit an OLS regression to get the trend. Are there other methods of calculating features? Other measures of change or volatility?
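
To make the baseline concrete, here is a minimal sketch of the descriptives listed above, computed for one customer's series. The function name `summarize_series` and the sample balances are made up for illustration, and the trend is simply the OLS slope of balance against the period index 0, 1, 2, ...

```python
from statistics import mean, stdev


def summarize_series(values):
    """Return simple summary features for one customer's balance series."""
    t = list(range(len(values)))          # period index 0..n-1
    t_bar, y_bar = mean(t), mean(values)
    # Closed-form OLS slope of value on period index.
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, values))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    return {
        "mean": y_bar,
        "high": max(values),
        "low": min(values),
        "std": stdev(values),             # sample standard deviation
        "trend": sxy / sxx,               # change in balance per period
    }


# Hypothetical 12 monthly balances for a single customer.
balances = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210]
features = summarize_series(balances)
```

Each customer then contributes one row of such features to the model's design matrix.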

ADD:

As mentioned in a response below, I also considered (but forgot to add here) using Dynamic Time Warping (DTW) and then running hierarchical clustering on the resulting distance matrix, creating some number of clusters and using cluster membership as a feature. Scoring test data would likely require computing the DTW distance between each new series and the cluster centroids, then assigning each new series to its closest centroid.
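
A rough sketch of that pipeline, in plain Python, might look like the following. Several things here are assumptions rather than part of the original idea: the cost is the absolute difference, the linkage is average linkage, and each cluster is represented by its medoid (the member with the smallest total DTW distance to the others) instead of a true centroid, since averaging series under DTW is not straightforward. All function names and the toy series are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def agglomerate(dist, k):
    """Naive average-linkage clustering on a square distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                pairs = [dist[i][j] for i in clusters[x] for j in clusters[y]]
                avg = sum(pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, x, y)
        _, x, y = best
        merged = clusters.pop(y)          # y > x, so index x stays valid
        clusters[x].extend(merged)
    return clusters


def medoid(cluster, dist):
    """Index of the member with the smallest total distance to its cluster."""
    return min(cluster, key=lambda i: sum(dist[i][j] for j in cluster))


def assign(series, train, medoids):
    """Cluster label for a new series: index of the nearest medoid under DTW."""
    return min(range(len(medoids)), key=lambda c: dtw_distance(series, train[medoids[c]]))


# Toy training series (assumed data): two flat, two declining balances.
train = [
    [100, 100, 100, 100],
    [101, 99, 100, 102],
    [100, 80, 60, 40],
    [105, 85, 62, 41],
]
dist = [[dtw_distance(a, b) for b in train] for a in train]
clusters = agglomerate(dist, k=2)
medoids = [medoid(c, dist) for c in clusters]

# Score a new, unseen series by matching it to the closest medoid.
new_label = assign([98, 79, 61, 39], train, medoids)
```

The resulting label would enter the model as a categorical feature. The pairwise DTW matrix grows quadratically with the number of customers, so for large data sets a sample may be needed to build the clusters.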

If the dependency between t and t+1 is a trend or seasonality, consider extracting it and treating the remainder as independent variables. – Diego – 2016-03-25T17:20:21.107

My concern with this answer is that PCA doesn't recognize the clear dependency between the series t and t+1. – B_Miner – 2014-06-24T01:50:57.443