What's cooking Kaggle - Improve model

I'm participating in the Kaggle contest "What's Cooking". The goal is to predict the type of cuisine of a recipe from its ingredients, so it's a multiclass classification problem. I have an existing model, and I have been trying to improve it for two weeks without success. I am using scikit-learn, and my existing model is:

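from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipelineLR = Pipeline([
    ('vect', CountVectorizer(max_df=0.75)),  # drop terms found in over 75% of documents
    ('lr', LogisticRegression(C=0.4, solver='lbfgs', multi_class='ovr', fit_intercept=True, warm_start=True)),
])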

It's really basic. But the problem of this dataset is that we have unbalanced data. We also have a huge sparse matrix because of the CountVectorizer. I tried a lot of different models, with and without TF-IDF, and this one is the best I can get. I also have a confusion matrix, a classification report, and learning curves for it.

What I notice is that the model has high variance. I think I can do something by preprocessing the data, but I am really stuck. I don't know whether my model is good enough to be improved or whether I have to find a totally new approach. What I also get from the classification report is that the model works well for classes where I have a lot of data, but makes a lot of mistakes for classes with less data.
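To put a number on that last observation, here is a minimal sketch of per-class recall, which is one of the columns a classification report shows. The function name `per_class_recall` and the toy labels are hypothetical and are not taken from the contest data.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return, for each true class, the fraction of its samples predicted correctly."""
    support = Counter(y_true)  # number of true samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / support[label] for label in support}


# Hypothetical toy labels: one frequent class and one rare class.
y_true = ["italian"] * 6 + ["thai"] * 2
y_pred = ["italian"] * 6 + ["italian", "thai"]

recall = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(recall, accuracy)
```

In this toy case the overall accuracy looks fine while the rare class is recognized only half of the time, which is the pattern described above.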

I am very new to data science and I am trying to learn by myself. But I think I really need another point of view on this problem, because I can't manage to improve my score.

[Images: classification report, learning curves]

Baptiste Em

Posted 2015-11-23T09:40:01.580

Reputation: 21

Answers

"But the problem of this dataset is that we have unbalanced data"

I think the way to fix your problem is to use something like SMOTE or one of its variants. (My favorite is SMOTE-ENN, but go with whatever works best.) Here is an implementation that you can use to fix the class imbalance. I don't know if this will solve your specific problem, but it is worth a shot.
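To give a rough idea of what SMOTE does, here is a minimal pure-Python sketch of its core step: pick a sample from the rare class, pick one of its nearest neighbours from the same class, and create a new point somewhere on the line between them. This is an illustration only and is not the implementation mentioned above; the function name `smote_oversample` and the toy data are hypothetical.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (brute force, Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: four samples of a rare class with two features each.
rare_class = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]]
new_points = smote_oversample(rare_class, n_new=6, k=2)
print(len(new_points))
```

If you try this kind of oversampling, apply it to the training split only, so that the validation score is still measured on real samples.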

Ryan

Posted 2015-11-23T09:40:01.580

Reputation: 674