I need help deciding what the next step should be in an algorithm I am designing.

Due to NDAs, I can't disclose much, but I'll try to be generic and understandable.

Basically, after several steps in the algorithm, I have this:

For each customer, I have the events they perform during a month. In the first steps I clustered these events into categories: each customer's events are split into categories numbered 1 to x, where x is between 1 and 25. Generally the first categories have a higher density of events than the others.

For each category and customer I have created a time series by aggregating the month's events per hour (to capture the pattern of when these events happen). I also use two normalizing variables: the fraction of days in the month (30 days) on which the customer performs at least one event, and the number of days with at least one event over the total number of days with at least one event (aggregating all clusters). The first gives me a ratio of how active the customer is during the month, and the second weights the category against the others.
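As a rough illustration of the step above, here is a minimal sketch of how the two normalizing variables and the hourly shares could be derived from raw events. The input layout (`customer_id, cluster_id, day, hour` tuples) and the function name `build_features` are hypothetical. It also assumes that the denominator of the second variable is the sum of active days over all of the customer's clusters, so that the values add up to 1 per customer as they do in the sample table; the question does not state this explicitly.

```python
from collections import defaultdict

DAYS_IN_MONTH = 30  # the question normalizes by a 30-day month


def build_features(events, hours=range(24)):
    """events: iterable of (customer_id, cluster_id, day, hour) tuples (assumed layout).

    Returns {(customer_id, cluster_id): (days_over_30, days_over_total, hourly_shares)}.
    """
    active_days = defaultdict(set)                         # days with >= 1 event
    hourly_counts = defaultdict(lambda: defaultdict(int))  # events per hour of day
    for customer, cluster, day, hour in events:
        active_days[(customer, cluster)].add(day)
        hourly_counts[(customer, cluster)][hour] += 1

    # Assumed denominator: active days summed over all clusters of the customer.
    total_days = defaultdict(int)
    for (customer, _cluster), days in active_days.items():
        total_days[customer] += len(days)

    features = {}
    for key, days in active_days.items():
        customer = key[0]
        n_events = sum(hourly_counts[key].values())
        # Share of the category's events in each hour; sums to 1 when every
        # event hour is listed in `hours`.
        shares = [hourly_counts[key][h] / n_events for h in hours]
        features[key] = (
            len(days) / DAYS_IN_MONTH,
            len(days) / total_days[customer],
            shares,
        )
    return features


# Tiny made-up example: customer "xx" with two clusters and two hourly columns.
events = [
    ("xx", 1, 1, 9), ("xx", 1, 1, 10), ("xx", 1, 2, 9), ("xx", 1, 3, 9),
    ("xx", 2, 5, 10),
]
features = build_features(events, hours=[9, 10])
```

On a dataset of the size described, the same aggregation would presumably be done in SQL with grouped counts; the Python version is only meant to pin down the definitions.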

The final table looks like this:

```
Identifier | firstCat  | feature1   | feature2      | TIME SERIES
CustomerID | ClusterID | DaysOver30 | DaysOverTotal | Events9AM  Events10AM  ...
xx         | 1         | 0,69       | 0,72          | 0,2        0,13        ...
xx         | 2         | 0,11       | 0,28          | 0,1        0,45        ...
xy         | 1         | 0,23       | 0,88          | 0,00       0,60        ...
xy         | 2         | 0,11       | 0,08          | 1,00       0,00        ...
xy         | 3         | 0,10       | 0,04          | 0,40       0,60        ...
```

The time series variables are the percentage of the category's total events that fall in each hour (this means that, in each row, all the time series variables should add up to 1). The reason for doing it this way is that, for example, time series with events `0 0 0 1 0` and `1 1 1 2 1` are completely different, yet standardizing them to a normal scale would give similar results. And because of the high skew between categories, I normalize the values of each time series independently of the others.
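The difference between the two normalizations can be checked directly on the two example series. The sketch below assumes that "standardizing to normal" means z-scoring (subtract the mean, divide by the standard deviation); the function names are illustrative only.

```python
from statistics import mean, pstdev


def z_score(series):
    """Standardize to zero mean and unit variance (assumes a non-constant series)."""
    m, s = mean(series), pstdev(series)
    return [(v - m) / s for v in series]


def share_of_total(series):
    """Express each value as a fraction of the series total (assumes total > 0)."""
    total = sum(series)
    return [v / total for v in series]


a = [0, 0, 0, 1, 0]
b = [1, 1, 1, 2, 1]

# b is a shifted by a constant, so the z-scores coincide up to rounding.
z_a, z_b = z_score(a), z_score(b)

# The shares differ: a puts all its mass in one hour, b is spread across all of them.
p_a, p_b = share_of_total(a), share_of_total(b)
```

Z-scoring removes any constant offset, which is exactly the information that separates a single isolated event from a steady baseline with a small peak; the share-of-total form keeps it.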

What I need to do now is assign each of these categories (remember, a customer can have from 1 to x of them, x being any number from 1 to 25) one of 3 tags: tag A, tag B, or None of Them. Looking at these variables I can manually identify which tag a row belongs to, so the idea is to label as many rows as I can by hand and then train a classifier on those labels to tag all the rest.

My idea was to use multiple logistic regressions on the table, but all the time series variables are correlated (they sum to 1, so each one is a linear combination of the others). So I thought it would be better to run a clustering algorithm over the time series only, using Euclidean distance, to categorize the different patterns, and then use the resulting cluster together with the other two normalizing variables in the logistic regression.
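A minimal sketch of that two-stage idea, assuming plain k-means (Lloyd's algorithm) as the clustering method; the question only says "a clustering algorithm ... using euclidean distance", so k-means is a choice made here for illustration. The helper names, the value of `k` and the sample rows are all hypothetical. The pattern label is one-hot encoded with one level dropped, so the encoded columns do not reintroduce the same linear dependence alongside an intercept.

```python
import math
import random


def kmeans(rows, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm with Euclidean distance. Returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]  # random initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(row, centers[c])) for row in rows
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


def design_row(days_over_30, days_over_total, label, k):
    """Regression inputs: the two normalizing variables plus a one-hot pattern label."""
    one_hot = [1.0 if label == c else 0.0 for c in range(k - 1)]  # last level dropped
    return [days_over_30, days_over_total] + one_hot


# Made-up hourly shares: two "early" patterns and two "late" patterns.
shares = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
]
centers, labels = kmeans(shares, k=2)
x_first = design_row(0.69, 0.72, labels[0], k=2)
```

The centers are the only thing that would need to be carried over to SQL: a new row is assigned to whichever center has the smallest squared distance, which is a sum of squared differences per column.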

The other concern I have is that this approach treats each row independently of the others, whereas in theory each customer should have only 0 or 1 tag A and 0 or 1 tag B, and the remaining rows should be None. (Another hint is that tags A and B normally fall among the first categories, because they depend heavily on the normalizing features: if DaysOverTotal is high, there is a high probability that the row is either A or B, depending on the time series pattern.)

Edit: This is no longer a concern. I will just fit two separate logistic regressions, one for tag A versus other and one for tag B versus other, and use the resulting probabilities to select only the best row for each tag.
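One way the selection step might look, given the two probabilities per row. The 0.5 threshold and the greedy tie-breaking (a row cannot receive both tags; the higher probability wins) are assumptions added for this sketch, not something stated in the question, and `assign_tags` is a hypothetical name.

```python
from collections import defaultdict


def assign_tags(rows, threshold=0.5):
    """rows: list of (customer_id, cluster_id, p_a, p_b).

    Returns {(customer_id, cluster_id): "A" | "B" | "None"} with at most one A
    and at most one B per customer.
    """
    by_customer = defaultdict(list)
    for customer, cluster, p_a, p_b in rows:
        by_customer[customer].append((cluster, p_a, p_b))

    tags = {}
    for customer, items in by_customer.items():
        candidates = []
        for cluster, p_a, p_b in items:
            tags[(customer, cluster)] = "None"  # default tag
            candidates.append((p_a, "A", cluster))
            candidates.append((p_b, "B", cluster))

        used_tags, used_clusters = set(), set()
        # Greedy: walk candidates from most to least probable.
        for prob, tag, cluster in sorted(candidates, key=lambda c: c[0], reverse=True):
            if prob < threshold:  # assumed cut-off; everything below stays "None"
                break
            if tag in used_tags or cluster in used_clusters:
                continue
            tags[(customer, cluster)] = tag
            used_tags.add(tag)
            used_clusters.add(cluster)
    return tags


# Made-up probabilities for the customers in the sample table.
scored = [
    ("xx", 1, 0.91, 0.40),
    ("xx", 2, 0.70, 0.15),
    ("xy", 1, 0.20, 0.85),
    ("xy", 2, 0.10, 0.30),
    ("xy", 3, 0.05, 0.60),
]
tags = assign_tags(scored)
```

In SQL the same rule could likely be expressed with a ranking window function partitioned by customer, keeping only the top-ranked row per tag.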

The dataset is enormous and the final algorithm needs to be applied in SQL (on Teradata), but to get the coefficients of the logistic regression or the centers of the clustering I take a sample and use R.
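Since the fitted model has to end up as SQL, one option is to generate the scoring expression from the coefficients. The sketch below builds the logistic formula as a string; the coefficient values are placeholders rather than fitted results, the alias `p_tag_a` and the function name `logistic_sql` are hypothetical, and it assumes the target dialect provides an `EXP` function.

```python
def logistic_sql(intercept, coefficients, alias="p_tag_a"):
    """Build a SQL expression for 1 / (1 + exp(-(b0 + sum(b_i * x_i)))).

    coefficients: {column_name: beta}; column names are inserted verbatim,
    so they must be trusted identifiers.
    """
    terms = [repr(float(intercept))]
    for column, beta in coefficients.items():
        terms.append(f"({float(beta)!r} * {column})")
    linear = " + ".join(terms)
    return f"1.0 / (1.0 + EXP(-({linear}))) AS {alias}"


# Placeholder coefficients, for illustration only.
expr = logistic_sql(-2.0, {"DaysOver30": 1.5, "DaysOverTotal": 3.0})
print(expr)
```

The resulting expression can be pasted into the `SELECT` list of the scoring query, once per model (one for tag A, one for tag B).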

What is your exact question? I would simply compute features of the time-series and then add these features to the customers features. Then you will just have basic clustering. For choice of features of the time-series, domain knowledge is required. – Nikolas Rieble – 2016-12-02T11:48:24.240

Just a suggestion :) ... I'm not sure you will get a proper answer as long as the question is that long. For example, your tags are exactly my research direction, but I really have no time and energy to read it all! If you can post a shorter version, it would be better for your question and also for yourself, since in scientific reporting you need to state things briefly. – Kasra Manshaei – 2016-01-19T07:03:26.493

I will try to shorten the question. It's just that usually, if I don't explain myself, people misunderstand what I intended. Anyway, as soon as I get time at my job I'll try to reduce the size of the question. Thank you for the recommendation. – JusefPol – 2016-01-19T07:30:42.690