Sequence pattern mining on continuous dataset


I have a dataset on broadband usage and bills for a set of customers over one year. For every metric, such as upload usage, I have 12 values, one for each month. Below are the columns and the format.

CustomerId,Upload_Jan,Upload_Feb...Upload_Dec, Download_Jan,Download_Feb....Download_Dec, bill_Jan....bill_Dec

The values in all columns are continuous. I want to see if there are some sequences appearing in the data frequently.

  1. Can we do sequential pattern mining on a continuous data set?
  2. Is it mandatory to discretize each column into separate bins? How do I decide the number of bins?
  3. What is the best way of preprocessing the data set for tools like SPMF, which require transactionized data?

I looked at the TraMineR package in R and wrote a few scripts, but I could not make much sense of the sequences generated.


Posted 2016-05-02T05:46:43.717


What's the data set size in number of instances? There might also not be any interesting sequence patterns to discover. – K3---rnc – 2016-05-04T12:09:28.707

Looks more like 37 columns? – K3---rnc – 2016-05-04T21:23:20.027

Yes, 3*12 columns + customer_id column = 37 columns and 1 lakh (100,000) rows. – pnv – 2016-05-05T04:01:24.420



The established procedure is called symbolic aggregate approximation (SAX). You first do a piecewise aggregate approximation (PAA), namely slide non-overlapping mean (boxcar) windows of width $w$ over the data, and then transform the resulting values into symbols by equal-frequency splitting of their distribution into $k$ letters of an alphabet.
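A minimal sketch of the two steps described above, in plain Python. It follows the equal-frequency splitting described here; the classic SAX formulation instead z-normalizes each series and uses breakpoints from a standard normal distribution. The function names and the sample numbers are made up for illustration.

```python
from bisect import bisect_right
from string import ascii_lowercase


def paa(series, w):
    """Piecewise aggregate approximation: means of non-overlapping windows of width w."""
    # A trailing window shorter than w is averaged over the points it has.
    return [sum(series[i:i + w]) / len(series[i:i + w])
            for i in range(0, len(series), w)]


def equal_frequency_breakpoints(values, k):
    """Return k - 1 cut points that split the values into k roughly equal-sized bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(j * n) // k] for j in range(1, k)]


def symbolize(values, breakpoints):
    """Map each value to a letter: 'a' for the lowest bin, 'b' for the next, and so on."""
    return "".join(ascii_lowercase[bisect_right(breakpoints, v)] for v in values)


def sax(all_series, w, k):
    """Apply PAA to every series, then symbolize against breakpoints pooled over all series."""
    reduced = [paa(s, w) for s in all_series]
    pooled = [v for r in reduced for v in r]
    cuts = equal_frequency_breakpoints(pooled, k)
    return [symbolize(r, cuts) for r in reduced]


# Made-up monthly upload volumes for two customers (illustration only).
uploads = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    [30, 30, 30, 21, 21, 21, 12, 12, 12, 3, 3, 3],
]
words = sax(uploads, w=3, k=4)
print(words)  # ['abbc', 'ddca']
```

Pooling the breakpoints over all customers keeps the letters comparable between customers; computing them per customer would instead describe each customer relative to their own usage.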

How do you determine appropriate $w$ and $k$? They trade off dimensionality reduction against information loss. One way to adapt $w$ is to look at the variance of the data points within each window: if the variance is too large, reduce the window size, and vice versa. Another, information-theoretic, criterion is minimum description length (MDL), as explained in this question.
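The variance heuristic could be sketched roughly as follows. The threshold `max_variance`, the candidate widths, and both function names are assumptions made for illustration; nothing here prescribes a particular cutoff.

```python
from statistics import pvariance


def mean_window_variance(series, w):
    """Average population variance of the points inside each non-overlapping window."""
    windows = [series[i:i + w] for i in range(0, len(series), w)]
    return sum(pvariance(win) for win in windows) / len(windows)


def pick_window(series, candidates, max_variance):
    """Return the widest candidate whose mean within-window variance stays under the threshold."""
    for w in sorted(candidates, reverse=True):
        if mean_window_variance(series, w) <= max_variance:
            return w
    return 1  # fall back to no aggregation


# Made-up series with three plateaus of four months each (illustration only).
monthly = [10, 11, 10, 12, 30, 31, 29, 30, 50, 52, 51, 50]
chosen = pick_window(monthly, candidates=[1, 2, 3, 4, 6], max_variance=2.0)
print(chosen)  # 4
```

A width of 6 mixes two plateaus and so has a large within-window variance, while a width of 4 matches the plateaus and passes the threshold.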

Since you only have 12 values in a series, the windowing step may not be necessary.
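Skipping the windowing step, the discretization alone might look like the sketch below. It assumes the wide column layout from the question (`CustomerId`, `Upload_Jan`, ...), shortened to three months, with made-up values; the `L`/`M`/`H` labels and the choice of $k = 3$ are arbitrary.

```python
import csv
import io
from bisect import bisect_right

MONTHS = ["Jan", "Feb", "Mar"]  # shortened to three months to keep the sample small

# Made-up sample in the question's wide layout (illustration only).
SAMPLE = """CustomerId,Upload_Jan,Upload_Feb,Upload_Mar
c1,1,2,3
c2,4,5,6
c3,7,8,9
c4,10,11,12
"""


def discretize_metric(rows, metric, k, labels="LMH"):
    """Equal-frequency bin one metric across all customers and months, without windowing."""
    columns = [f"{metric}_{m}" for m in MONTHS]
    values = sorted(float(r[c]) for r in rows for c in columns)
    n = len(values)
    # k - 1 cut points taken at equally spaced ranks of the pooled values.
    cuts = [values[(j * n) // k] for j in range(1, k)]
    return {
        r["CustomerId"]: "".join(labels[bisect_right(cuts, float(r[c]))] for c in columns)
        for r in rows
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
upload_words = discretize_metric(rows, "Upload", k=3)
print(upload_words)  # {'c1': 'LLL', 'c2': 'LMM', 'c3': 'MMH', 'c4': 'HHH'}
```

Each metric gets its own breakpoints because upload, download, and bill values live on different scales.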

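Alternatively, you could try reshaping the data into three series:

$$D, d_1^{(c_1)}, \ldots, d_{12}^{(c_1)}, D, d_1^{(c_2)}, \ldots, d_{12}^{(c_2)}, \ldots$$

$$U, u_1^{(c_1)}, \ldots, u_{12}^{(c_1)}, U, u_1^{(c_2)}, \ldots, u_{12}^{(c_2)}, \ldots$$

$$B, b_1^{(c_1)}, \ldots, b_{12}^{(c_1)}, B, b_1^{(c_2)}, \ldots, b_{12}^{(c_2)}, \ldots$$
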
where D, U, and B are string literals (used as start markers) and $d_m^{(c_i)}$, $u_m^{(c_i)}$, and $b_m^{(c_i)}$ are the download, upload, and billed values for customer $c_i$ in month $m$. Then you could run one of the frequent subsequence mining algorithms on each of those three sequences (separately) to find any prevailing patterns.
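As a rough sketch of this reshaping, the code below builds one marker-delimited series from already discretized values and counts contiguous runs that do not cross a marker. The n-gram counter is only a simple stand-in for a real frequent subsequence miner, and the input words are hypothetical. The last helper shows how one customer's word could be written in what I understand to be SPMF's sequence database format (positive integer items, `-1` after each itemset, `-2` at the end of each sequence).

```python
from collections import Counter

# Hypothetical discretized download words, one per customer (L/M/H per month).
download_words = {"c1": "LLMH", "c2": "LMHH", "c3": "LLMH"}


def marked_series(words, marker):
    """Concatenate all customers' symbols into one series, each block preceded by the marker."""
    series = []
    for customer in sorted(words):
        series.append(marker)
        series.extend(words[customer])
    return series


def frequent_ngrams(series, marker, n, min_support):
    """Count contiguous length-n runs without the marker; keep those seen at least min_support times."""
    counts = Counter()
    for i in range(len(series) - n + 1):
        gram = tuple(series[i:i + n])
        if marker not in gram:  # skip runs that span two customers
            counts[gram] += 1
    return {g: c for g, c in counts.items() if c >= min_support}


def to_spmf_line(word, codes):
    """Encode one word as an SPMF sequence line: each item followed by -1, the sequence by -2."""
    return " ".join(f"{codes[s]} -1" for s in word) + " -2"


d_series = marked_series(download_words, "D")
patterns = frequent_ngrams(d_series, "D", n=3, min_support=2)
spmf_line = to_spmf_line("LMH", {"L": 1, "M": 2, "H": 3})

print(patterns)   # {('L', 'L', 'M'): 2, ('L', 'M', 'H'): 3}
print(spmf_line)  # 1 -1 2 -1 3 -1 -2
```

The counts here are raw occurrences in the concatenated series; a miner that reports support per customer would count each pattern at most once per block.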


Posted 2016-05-02T05:46:43.717


Thanks a lot for the answer. I am reading about SAX, but isn't this for time series? Is it possible for me to apply time series methods to this dataset, given that it has just 12 values? – pnv – 2016-05-04T04:55:21.647

Anything that is a homogeneous variable and has a distinct time dimension, like your monthly bandwidth upload, download, and billed amount, is a time series (three series, likely correlated, in your case). Since you only have 12 values in each series, I don't think a length-wise reduction is necessary, so I would skip the windowing step and just do MDL or equal-frequency discretization. You can try both kinds in the Discretize widget in the Orange data mining suite. – K3---rnc – 2016-05-04T11:51:12.447

I updated the answer a bit. – K3---rnc – 2016-05-04T12:08:48.280