## Standard correlation coefficient of various datasets

I understand the correlation coefficients in the first row of figure 1. But the correlation coefficients in the second and third rows behave differently from those in the first row. Why is that? Even the shapes differ in the second and third rows. Can someone help me understand the picture below?

The second question is about the correlation between attributes. My intention in asking is to gain more understanding from Stack Overflow data science users. The one thing I learnt from the picture below is:

• if you plot an attribute against itself, you end up with a straight line, which is of no use. For instance, the plots of median_house_value on the y-axis against median_house_value on the x-axis are basically histograms.

Is there any other valuable insight you can share about the picture below? If so, please let me know.

Can you be more specific about your question? And btw, what book is this? – David Masip – 2018-06-04T09:52:45.633

@DavidMasip For example, in the first line, for the coefficient 0.8, the plot is slanted slightly towards the right. But in the second line, even though the coefficient remains the same (i.e. 1), the plot is slanted towards the right – James K J – 2018-06-04T10:00:42.303

The correlation coefficient doesn't really care whether the plot is slightly slanted; it cares about the linear relation between the two variables. – David Masip – 2018-06-04T10:03:41.003

@DavidMasip What can I infer from the second and third line of the picture? – James K J – 2018-06-04T10:05:22.823

## First part:

The question you have to ask yourself is: "Given that I know the x-value, what knowledge do I gain about the y-value?" A small side note: the correlation coefficient only considers linear relationships.

Consider the first row. In the first image, if you know the x-value, then the y-value is y = x, i.e. you know the y-value exactly. In the following two images, if the x-value is high, then the y-value tends to be high too, but there is also a random component. Thus, the correlation coefficient is positive (i.e. knowing x helps in estimating y), but not 1 (i.e. there is no exact equation y = a*x). In the middle image, there is no relationship between the x- and y-coordinates: they are purely random, so there is no correlation. In the right three images, it is the same story as in the left three images, but the sign is flipped.

Now consider the second row: in the first three cases, given that you know the x-value, you can always infer the y-value, e.g. y = x, y = 0.5 x, and y = 0.1 x. There is no random component in y, so if you know x, you also know the exact value of y, and thus the correlation coefficient is 1. It doesn't matter whether the formula is y = 1 x or y = 0.1 x (i.e. the steepness of the line doesn't matter!). All that matters for the correlation coefficient is that some linear relationship matches the data exactly; only the sign of the slope shows up, so an exact line with a negative slope gives -1.
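To make the point about the slope concrete, here is a minimal sketch with NumPy. The data are synthetic and the helper name `pearson` is my own; it only wraps `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, synthetic data only
x = rng.normal(size=1000)

def pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Exact linear relationships: the steepness of the line is irrelevant.
r_steep = pearson(x, 1.0 * x)   # y = x
r_flat = pearson(x, 0.1 * x)    # y = 0.1 x
r_neg = pearson(x, -0.5 * x)    # negative slope flips the sign

# Same trend plus a random component: positive, but clearly below 1.
r_noisy = pearson(x, x + rng.normal(size=x.size))

print(r_steep, r_flat, r_neg, r_noisy)
```

The first two values come out as 1 (up to floating-point error), the third as -1, and the noisy one lands somewhere between 0 and 1, like the second and third images of the first row.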

In the third row, knowing x does give some knowledge about y. For example, in the middle plot you have a relationship of y = x^2 + random, so there is indeed some relationship between x and y, but it is non-linear. Since the correlation coefficient only measures linear relationships, it is zero here. There is no way for you to say "a high x-value also leads to a high y-value" or "a high x-value leads to a low y-value".
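A quick way to convince yourself of this is to simulate it. The sketch below assumes x is spread symmetrically around zero (as in the picture) and uses a noise level I picked arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, synthetic data only

# x symmetric around 0, y = x^2 plus a small random component
x = np.linspace(-1.0, 1.0, 2001)
y = x**2 + rng.normal(scale=0.05, size=x.size)

# Linear correlation between x and y: essentially zero
r_xy = float(np.corrcoef(x, y)[0, 1])

# y is still strongly related to x, just not linearly:
# against x^2 the relationship is linear again.
r_x2y = float(np.corrcoef(x**2, y)[0, 1])

print(r_xy, r_x2y)
```

So a correlation coefficient near zero does not mean "no relationship", only "no linear relationship".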

## Second part:

I assume this plot is generated e.g. with Seaborn's pairplot() function, though of course other functions exist for creating this kind of plot. It is important to note that the plots on the diagonal are fundamentally different from the other plots.

The off-diagonal plots are two-dimensional scatter plots, i.e. you draw a point for each data sample. The diagonal plots are histograms of each feature, not scatter plots. This is because a scatter plot of a variable against itself would always give you the straight line y = x. The histograms are actually quite useful, as you can get a feeling for the data and make a guess about the underlying distribution.

For example, your total_rooms variable is heavily skewed: most houses have very few rooms, but there are a few outliers with huge numbers of rooms. I thus wouldn't trust the mean value of total_rooms much and would rather use the median. Of course, this is highly dependent on what kind of analysis you are doing.
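I don't have your data, so here is a small sketch on a synthetic, right-skewed sample (a log-normal draw that I am using as a stand-in for something like total_rooms; the name `rooms_like` and all parameters are made up). It shows what the histogram on the diagonal is telling you and why the mean gets pulled upwards.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed, synthetic data only

# Hypothetical stand-in for a skewed feature: mostly small values,
# a few very large ones.
rooms_like = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# What a histogram on the diagonal shows: counts per bin.
counts, edges = np.histogram(rooms_like, bins=10)

mean_value = float(rooms_like.mean())
median_value = float(np.median(rooms_like))

print(counts)                    # almost everything falls in the first bin
print(mean_value, median_value)  # the mean sits above the median
```

The few large values drag the mean up, while the median stays with the bulk of the data.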