2

## The problem

I want to find out how router failures correlate with one another. For example, if a specific error occurs on router A and almost simultaneously on router B, the two routers are probably connected (they sit on the same line).

## The Data

Suppose I have a dataframe that looks like this:

|Router|Error|Duration|Timestamp          |
|------|-----|--------|-------------------|
|DB-XX |GSM  |26.5374 |2019-05-01 00:20:14|
|DT-XY |AUC  |15.5400 |2019-05-01 01:15:01|
|DR-YY |AUC  |02.0333 |2019-05-01 01:17:13|
|DP-YX |LOC  |45.2609 |2019-05-01 00:01:10|


## The question

What is the best way to approach this? One-vs-rest regression for each router? The problem is that this would mean hundreds of models, and I also want to keep the computational cost down.

A simple method would be to represent each router's state as a time series (1 = error, 0 = no error) and compute the correlation matrix. If errors occupy only a small fraction of the time, the correlation is approximately (time during which A and B are both in error) / sqrt((time A is in error) × (time B is in error)). – Valentas – 2019-07-05T10:36:33.067
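As a quick numerical check of the approximation in this comment, here is a minimal sketch. The two binary series are made up for illustration (10,000 time steps, errors active for 50 and 60 steps, overlapping for 30); they are not taken from the question's data.

```python
import numpy as np

# Hypothetical, sparse error indicators for two routers (1 = error, 0 = no error).
n = 10_000
a = np.zeros(n, dtype=int)
b = np.zeros(n, dtype=int)
a[100:150] = 1   # router A in error for 50 steps
b[120:180] = 1   # router B in error for 60 steps, 30 of them shared with A

# Exact Pearson correlation of the two 0/1 series.
exact = np.corrcoef(a, b)[0, 1]

# Approximation from the comment: shared error time divided by the
# geometric mean of the two individual error times.
approx = (a & b).sum() / np.sqrt(a.sum() * b.sum())

print(f"exact={exact:.4f}  approx={approx:.4f}")
```

The two values agree closely here because both routers are in error less than 1% of the time; the gap widens as errors become more frequent.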

Is your question language-specific? You used the python tag, in which case you could put your data into a pandas.DataFrame and use the corr() function. get_dummies may also be useful for turning categorical features into numeric ones.

– Manu H – 2019-07-05T11:47:14.950

0

Blindly dummy-coding the errors in pandas will introduce spurious numerical relationships between different error types, and that will not help you find true similarity.

First, convert your data into one time series per router and per error type, sampled at equal time steps: 1 for every time step during which the error is active, 0 for every time step during which it is not. This turns each router's data into a binary vector for each specific error.
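The conversion described above could be sketched as follows. This is only an illustration under assumptions the question does not confirm: `Timestamp` is taken to be the start of the error, `Duration` is taken to be in minutes, and the helper name `to_binary_series` and the 1-minute grid are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy data copied from the table in the question.
df = pd.DataFrame({
    "Router": ["DB-XX", "DT-XY", "DR-YY", "DP-YX"],
    "Error": ["GSM", "AUC", "AUC", "LOC"],
    "Duration": [26.5374, 15.5400, 2.0333, 45.2609],
    "Timestamp": pd.to_datetime([
        "2019-05-01 00:20:14", "2019-05-01 01:15:01",
        "2019-05-01 01:17:13", "2019-05-01 00:01:10",
    ]),
})

def to_binary_series(df, error, freq="1min", unit="min"):
    """Return a 0/1 frame for one error type: rows = time steps, columns = routers."""
    # ASSUMPTION: Timestamp marks the start of the error, Duration is given in `unit`.
    starts = df["Timestamp"]
    ends = starts + pd.to_timedelta(df["Duration"], unit=unit)
    # One shared, equally spaced grid so that all routers are comparable.
    grid = pd.date_range(starts.min().floor(freq), ends.max().ceil(freq), freq=freq)
    out = pd.DataFrame(0, index=grid, columns=sorted(df["Router"].unique()), dtype=np.int8)
    mask = df["Error"] == error
    for router, s, e in zip(df.loc[mask, "Router"], starts[mask], ends[mask]):
        # Flag every grid point that falls inside the interval [start, end).
        out.loc[(out.index >= s) & (out.index < e), router] = 1
    return out

auc = to_binary_series(df, "AUC")
print(auc.sum())  # number of flagged time steps per router
```

Because the grid is sampled at fixed instants, an error shorter than one time step can be missed entirely; a finer `freq` trades memory for resolution.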

Next, calculating a Pearson or Spearman correlation between binary vectors is not a good idea. As explained brilliantly here:

> Correlations arise naturally for some problems involving 0s and 1s, e.g. in the study of binary processes in time or space. On the whole, however, there will be better ways of thinking about such data, depending largely on the main motive for such a study. For example, the fact that correlations make much sense does not mean that linear regression is a good way to model a binary response.

You would want a similarity metric designed specifically for binary vectors. For example, Jaccard similarity, which computes intersection over union (the number of time steps at which both vectors are 1, divided by the number of time steps at which at least one of them is 1), is a good choice. A great summary of such similarity measures can be found in this article.
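A minimal sketch of Jaccard similarity on binary vectors, including an all-pairs version that uses a single matrix product instead of a loop over router pairs. The function names and the example vectors are hypothetical, and returning 0 when neither vector ever contains a 1 is a convention chosen for this sketch.

```python
import numpy as np

def jaccard(a, b):
    """Intersection over union of two 0/1 vectors of equal length."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # ASSUMPTION: define similarity as 0 when neither vector has a 1
    return np.logical_and(a, b).sum() / union

def jaccard_matrix(X):
    """Pairwise Jaccard similarity between the columns of a 0/1 matrix X (time x routers)."""
    X = np.asarray(X, dtype=np.int64)
    inter = X.T @ X                      # inter[i, j] = steps where both i and j are 1
    counts = np.diag(inter)              # counts[i] = steps where i is 1
    union = counts[:, None] + counts[None, :] - inter
    return np.divide(inter, union, out=np.zeros(inter.shape, dtype=float), where=union > 0)

# Hypothetical error indicators for three routers over six time steps.
a = [0, 1, 1, 1, 0, 0]
b = [0, 0, 1, 1, 1, 0]
c = [0, 0, 0, 0, 0, 0]

pair = jaccard(a, b)
J = jaccard_matrix(np.column_stack([a, b, c]))
print(pair)
print(J)
```

For a few hundred routers the matrix version needs only one product of a routers-by-time matrix with its transpose, which avoids fitting a separate model per router.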

Computing these similarity measures is not computationally intensive.

Depending on how sparse the data is, it might be better to use frequent itemset mining, so that you learn which routers tend to go down together when an error occurs.
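To illustrate the idea, the sketch below counts only frequent pairs of routers, which is a simplified stand-in for a full frequent itemset algorithm such as Apriori. The transactions are hypothetical: each set is assumed to list the routers that reported the error within the same time window, and the function name and `min_support` threshold are made up for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: routers that were in error during the same time window.
transactions = [
    {"DT-XY", "DR-YY"},
    {"DT-XY", "DR-YY", "DB-XX"},
    {"DP-YX"},
    {"DT-XY", "DR-YY"},
    {"DB-XX", "DP-YX"},
]

def frequent_pairs(transactions, min_support=0.5):
    """Return router pairs whose support (share of windows containing both) is >= min_support."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        # Sort so that each pair is always counted under the same key.
        counts.update(combinations(sorted(t), 2))
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(transactions)
print(pairs)
```

Only windows in which at least one error occurred need to be stored, so sparse data keeps the transaction list short.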

0

Correlation is a metric that can be used when two features have the same total number of observations. In your scenario, each router can fail a different number of times, so we cannot estimate the correlation between two routers directly: they will have different numbers of observations.

One thing you can do is convert every router into a time series signal, using the provided duration and start time. You can then compute the correlation. However, it might not be very informative about the actual situation, as the signal will be predominantly sparse.