## Incorrect correlation results

I am working on a system that predicts a binary outcome (-1 or 1) from the values detected by 10 sensors. The output of pandas.DataFrame.describe() for the data is:

       class_label     sensor0     sensor1     sensor2     sensor3     sensor4     sensor5     sensor6     sensor7     sensor8     sensor9
count   400.000000  400.000000  400.000000  400.000000  400.000000  400.000000  400.000000  400.000000  400.000000  400.000000  400.000000
mean      0.000000    0.523661    0.509223    0.481238    0.509752    0.497875    0.501065    0.490480    0.482372    0.482822    0.541933
std       1.001252    0.268194    0.276878    0.287584    0.297712    0.288208    0.287634    0.289954    0.282714    0.296180    0.272490
min      -1.000000    0.007775    0.003865    0.004473    0.001466    0.000250    0.000425    0.000173    0.003322    0.003165    0.000452
25%      -1.000000    0.299792    0.283004    0.235544    0.262697    0.249369    0.269430    0.226687    0.242848    0.213626    0.321264
50%       0.000000    0.534906    0.507583    0.460241    0.510066    0.497842    0.497108    0.477341    0.463438    0.462251    0.578389
75%       1.000000    0.751887    0.727843    0.734937    0.768975    0.743401    0.738854    0.735304    0.732483    0.740542    0.768990
max       1.000000    0.999476    0.998680    0.992963    0.995119    0.999412    0.997367    0.997141    0.998230    0.996098    0.999465


As part of my work, I am trying to rank the importance of these sensors using the correlation matrix (computed with pandas.DataFrame.corr):

The problem is that this matrix shows no correlation between the outcome and sensor6, which seems wrong: as the following scatter plots show, the outcome can easily be predicted from sensor6 (this is also confirmed by a decision tree, k-NN, etc.).

Questions:

• Why is the correlation matrix wrong? Is it still reliable for the other sensors?
• What alternatives can I use to rank the importance of the sensors?

## Edits:

For reference, the data types as reported by pandas.DataFrame.info():

RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
class_label    400 non-null float64
sensor0        400 non-null float64
sensor1        400 non-null float64
sensor2        400 non-null float64
sensor3        400 non-null float64
sensor4        400 non-null float64
sensor5        400 non-null float64
sensor6        400 non-null float64
sensor7        400 non-null float64
sensor8        400 non-null float64
sensor9        400 non-null float64
dtypes: float64(11)
memory usage: 34.5 KB


For data content, pandas.DataFrame.head() and pandas.DataFrame.tail() return:

data.head()
Out[5]:
   class_label   sensor0   sensor1   sensor2   sensor3   sensor4   sensor5   sensor6   sensor7   sensor8   sensor9
0          1.0  0.834251  0.726081  0.535904  0.214896  0.873788  0.767605  0.111308  0.557526  0.599650  0.665569
1          1.0  0.804059  0.253135  0.869867  0.334285  0.604075  0.494045  0.833575  0.194190  0.014966  0.802918
2          1.0  0.694404  0.595777  0.581294  0.799003  0.762857  0.651393  0.075905  0.007186  0.659633  0.831009
3          1.0  0.783690  0.038780  0.285043  0.627305  0.800620  0.486340  0.827723  0.339807  0.731343  0.892359
4          1.0  0.788835  0.174433  0.348770  0.938244  0.692065  0.377620  0.183760  0.616805  0.492899  0.930969

data.tail()
Out[6]:
     class_label   sensor0   sensor1   sensor2   sensor3   sensor4   sensor5   sensor6   sensor7   sensor8   sensor9
395         -1.0  0.433150  0.816109  0.452945  0.065469  0.237093  0.719321  0.577969  0.085598  0.357115  0.070060
396         -1.0  0.339346  0.914610  0.097827  0.077522  0.484140  0.690568  0.420054  0.482845  0.395148  0.438641
397         -1.0  0.320118  0.444951  0.401896  0.970993  0.960264  0.138345  0.354927  0.230749  0.204612  0.558889
398         -1.0  0.059132  0.337426  0.772847  0.099038  0.966042  0.975086  0.532891  0.035839  0.258723  0.709958
399         -1.0  0.379778  0.460256  0.229257  0.768975  0.321882  0.118572  0.448964  0.546324  0.363127  0.176632


Pearson Correlation is a technique for investigating the relationship between two quantitative, continuous variables, for example, age and blood pressure. Pearson's correlation coefficient (r) is a measure of the strength of the association between the two variables, a number between -1 and 1. [reference]

The following scatter plots from an SPSS tutorial illustrate different Pearson correlations visually:

From your scatter plots, however, it looks like there is no linear correlation between sensor6 and the class label, so I do not see why you expect one. pandas.DataFrame.corr is based on Pearson correlation by default, so it should in principle give a near-zero value here. What strikes me, though, is that it gives you high correlations for some of the sensors, namely sensor0, sensor4 and sensor8. Maybe post part of your actual dataset rather than just the statistics; I am curious why we see those correlations.
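To illustrate why Pearson can miss such a relationship, here is a minimal sketch on synthetic data (not your dataset). It assumes a hypothetical sensor whose label is -1 inside a middle band and 1 outside it, which is only my guess at the kind of pattern your scatter plot shows; the column name `sensor_band` and the band width of 0.25 are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a sensor: NOT the asker's data.
rng = np.random.default_rng(0)
sensor = rng.uniform(0.0, 1.0, size=400)

# Assumed rule: label is -1 in the middle band, 1 at both extremes.
# The label is fully determined by the sensor, but not monotonically.
label = np.where(np.abs(sensor - 0.5) < 0.25, -1.0, 1.0)

df = pd.DataFrame({"class_label": label, "sensor_band": sensor})

# Pearson (the default of DataFrame.corr / Series.corr) only measures
# linear association, so the symmetric pattern cancels out.
pearson_r = df["class_label"].corr(df["sensor_band"])

# Folding the sensor around 0.5 makes the relationship monotonic,
# and the same coefficient now picks it up.
folded_r = df["class_label"].corr((df["sensor_band"] - 0.5).abs())

print(f"Pearson r, raw sensor:    {pearson_r:.3f}")
print(f"Pearson r, folded sensor: {folded_r:.3f}")
```

Under these assumptions the coefficient on the raw sensor stays close to zero while the folded version is strongly correlated, even though the label is equally predictable in both cases.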

Two questions:

• What are the data types of the class_label and sensor columns? Wrong data types can lead to wrong correlation values in pandas. See this post.
• Shouldn't you be looking at the correlation between a continuous variable and a categorical variable? See this post as a good reference.
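One way to score a continuous sensor against a categorical label without assuming linearity is mutual information. The following is a rough sketch rather than a reference implementation: `binned_mutual_information` is a hypothetical helper that cuts the sensor into equal-width bins (10 bins is an arbitrary choice), and the data is synthetic, with made-up column names.

```python
import numpy as np
import pandas as pd


def binned_mutual_information(feature, label, bins=10):
    """Estimate I(feature; label) in nats after equal-width binning.

    Hypothetical helper for illustration; the estimate depends on the
    number of bins and is biased upwards for small samples.
    """
    binned = pd.cut(feature, bins=bins, labels=False)
    # Joint probability table: rows are bins, columns are classes.
    joint = pd.crosstab(binned, label, normalize=True).to_numpy()
    p_bin = joint.sum(axis=1, keepdims=True)
    p_class = joint.sum(axis=0, keepdims=True)
    independent = p_bin @ p_class
    nonzero = joint > 0  # skip empty cells to avoid log(0)
    return float((joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])).sum())


# Synthetic data: one sensor that determines the label non-linearly,
# one sensor that is pure noise. NOT the asker's data.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "sensor_band": rng.uniform(0.0, 1.0, size=400),
    "sensor_noise": rng.uniform(0.0, 1.0, size=400),
})
df["class_label"] = np.where((df["sensor_band"] - 0.5).abs() < 0.25, -1.0, 1.0)

scores = pd.Series({
    name: binned_mutual_information(df[name], df["class_label"])
    for name in ["sensor_band", "sensor_noise"]
}).sort_values(ascending=False)

print(scores)
```

Unlike Pearson, this score does not care about the direction or shape of the dependence, so it ranks the band-shaped sensor well above the noise sensor. scikit-learn's `mutual_info_classif` offers a nearest-neighbour based estimate of the same quantity if you prefer not to choose bins by hand.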

This is a great answer. You raise some interesting points. I still need to go through the references and the posts though. As for your questions, I edited my question accordingly and included some parts of the data and their types. – SuperKogito – 2019-07-22T13:53:11.207

You are welcome. Oh, your target class_label is float64! Shouldn't it be a categorical data type with the two classes 1 and -1? This may be the root cause of those correlations. Please change this column's data type to categorical or object with the labels 1 and -1 and compute the correlation again; I expect near-zero values for all sensors vs. class_label. And as I suspected, you are correlating a continuous variable against a categorical one, so Pearson is not the right choice anyway (see the table above). Honestly, though, continuous vs. categorical correlations are not easy either! – TwinPenguins – 2019-07-22T14:20:11.917

Your features sensor0, 1, 3, 4 and 8 show a good amount of correlation, so why not try using them? Have you tried RFECV with a gradient boosted tree regressor? It might give you a good starting point for how many features you actually need. You can also try an ensemble of feature selection methods using the feature importances that many models expose (e.g. feature_importances_ on scikit-learn tree ensembles), and then use some kind of voting mechanism to get your optimal subset of features. Also, have you looked at the p-values of your correlations? They will at least give you an idea of how significant a correlation is, even when its value is close to 0. – khwaja wisal – 2019-08-19T20:32:46.087
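The model-based suggestions in the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic, the column names `sensor_a` to `sensor_d` are hypothetical, and because the target is a class label the sketch uses scikit-learn's `GradientBoostingClassifier` instead of the regressor mentioned in the comment. The settings (`n_estimators=50`, `cv=3`) are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV

# Synthetic data: sensor_a determines the label non-linearly,
# the other three sensors are noise. NOT the asker's data.
rng = np.random.default_rng(2)
X = pd.DataFrame(
    rng.uniform(0.0, 1.0, size=(400, 4)),
    columns=["sensor_a", "sensor_b", "sensor_c", "sensor_d"],
)
y = np.where((X["sensor_a"] - 0.5).abs() < 0.25, -1, 1)

# 1) Rank features by the impurity-based importances of a fitted model.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X, y)
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

# 2) Recursive feature elimination with cross-validation: repeatedly
#    drops the least important feature and keeps the subset with the
#    best cross-validated score.
selector = RFECV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    step=1,
    cv=3,
)
selector.fit(X, y)
selected = list(X.columns[selector.support_])

print(importances)
print("Selected by RFECV:", selected)
```

Tree-based models split on thresholds, so they can pick up a band-shaped dependence that a linear correlation coefficient misses; the importances are still specific to the fitted model and should be read as a ranking aid, not as a definitive measure.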