Directly proportional trend in training and cross-validation curves


Following up on a question I asked earlier, I plotted the same learning curve on a different dataset I found. My model is a simple logistic regression with a OneVsRest classifier, but the graph I obtained this time was different. [learning curve plot]

What could be a possible reason for this? As explained in the other answer, if the model fits better with less data, shouldn't the training accuracy decrease here as well?

Hima Varsha

Posted 2016-09-26T10:10:34.397

Reputation: 2 146

I think this is unusual. I have a suspicion though - could you try to randomly shuffle your data (X and y) and check if you're still getting the same learning curve when you use the shuffled data? – stmax – 2016-09-27T08:41:44.440

I used sklearn's learning_curve to do this. sklearn does actually randomly shuffle internally while plotting this curve. – Hima Varsha – 2016-09-27T08:51:52.290

PS.: with shuffling X and y I don't mean to use KFold(..., shuffle=True), but something like permutation = np.random.permutation(len(X)), X = X[permutation], y = y[permutation]. Let me know if that changes your learning curve. – stmax – 2016-09-27T08:52:02.160
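To make the suggestion in the comment above concrete, here is a minimal sketch of that permutation-based shuffle (the toy `X` and `y` arrays are illustrative; the key point is using a single permutation to reorder both arrays so each label stays with its row):

```python
import numpy as np

rng = np.random.RandomState(42)        # fixed seed so the shuffle is reproducible
X = np.arange(20).reshape(10, 2)       # toy feature matrix: row i is [2i, 2i+1]
y = np.arange(10)                      # toy labels: label i belongs to row i

permutation = rng.permutation(len(X))  # one permutation shared by X and y
X_shuffled = X[permutation]            # rows are reordered...
y_shuffled = y[permutation]            # ...together, preserving the (row, label) pairing
```

Note this differs from `KFold(..., shuffle=True)`: the permutation reorders the dataset itself, before any cross-validation splitting happens.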

I don't think it does - I get very different results when I shuffle or not. Please try shuffling it yourself with np.random.permutation as explained above. – stmax – 2016-09-27T08:53:08.487

Answers


These learning curves seem unusual - the training accuracy should start high and get lower as more samples are added.

I suspect that your data is sequential (it has some kind of time dependency). sklearn's learning_curve function does not seem to shuffle the data (should it?), so the training accuracy can change/increase once new structures appear in the data over time.
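A small self-contained sketch of this hypothesis (not the asker's actual data): sort a synthetic dataset by label to fake "sequential" data, then compare `learning_curve` results on the sorted versus the original (effectively shuffled) ordering. The `make_classification` dataset, split sizes, and model settings here are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.multiclass import OneVsRestClassifier

# Toy 3-class dataset; shuffled by default when generated.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           random_state=0)
order = np.argsort(y)                  # sort by label to simulate sequential data
X_seq, y_seq = X[order], y[order]

model = OneVsRestClassifier(LogisticRegression(max_iter=1000))

results = {}
for name, Xc, yc in [("sequential", X_seq, y_seq), ("shuffled", X, y)]:
    # learning_curve takes the first n samples of each training split,
    # so the row order of the data directly affects the curve.
    sizes, train_sc, val_sc = learning_curve(model, Xc, yc, cv=5,
                                             train_sizes=[0.5, 0.75, 1.0])
    results[name] = (sizes, train_sc.mean(axis=1), val_sc.mean(axis=1))
    print(name, results[name][1].round(2), results[name][2].round(2))
```

Because the subsets are taken in order, the "sequential" run trains early points on only a slice of the classes, which can make the curves move in unexpected directions as more data (and new structure) arrives.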

Here's a notebook trying to reproduce the effect: https://gist.github.com/stmax82/79b744877b0a482f8739d372c4777e0d

Two images from the notebook:

The learning curves on the shuffled data look like this (as expected):

[learning curves on shuffled data]

While the learning curves on the original / sequential data look like this (unusual because training accuracy rises as samples are added):

[learning curves on original data]

That's just my attempt at explaining your learning curves. It might be something completely different; try shuffling your data to find out.

stmax

Posted 2016-09-26T10:10:34.397

Reputation: 1 536

I shuffled the dataset and tried it again as you suggested. It seems to follow the same trend. – Hima Varsha – 2016-09-29T12:59:13.077

Alright. I guess we need a new hypothesis then :p – stmax – 2016-09-29T15:05:33.577

It may be useful to have a pointer to the data – Pablo Suau – 2016-09-30T12:38:57.330