I'm participating in the Kaggle contest "What's Cooking". The goal is to predict which type of cuisine a recipe belongs to from its ingredients, so it's a multiclass classification problem. I have an existing model that I have been trying to improve for two weeks without success. I am using scikit-learn, and my existing model is:
    pipelineLR = Pipeline([
        ('vect', CountVectorizer(max_df=0.75)),
        ('lr', LogisticRegression(C=0.4, solver='lbfgs', multi_class='ovr',
                                  fit_intercept=True, warm_start=True)),
    ])
It's really basic. The problem with this dataset is that the classes are unbalanced. The CountVectorizer also produces a huge sparse matrix. I tried a lot of different models, with and without TF-IDF, and this one is the best I can get. I also have a confusion matrix, a classification report, and a learning curve for it.
What I notice is that the model has high variance. I think I could do something by preprocessing the data, but I am really stuck. I don't know whether my model is good enough to be improved or whether I need a totally new approach. What I can also see from the classification report is that the model works well for classes with a lot of data but makes a lot of mistakes for classes with less data.
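For context, here is a minimal sketch of how diagnostics like these can be produced with scikit-learn. The toy recipes and the names `docs`, `labels`, and `class_names` are made up for illustration and are not the Kaggle data. The pipeline is a simplified version of mine (`multi_class` and `warm_start` are left out, and `max_iter` is an added assumption).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict, learning_curve
from sklearn.pipeline import Pipeline

# Hypothetical toy data: each document is one recipe's ingredients as a string.
# Deliberately unbalanced: 12 italian, 12 mexican, 6 japanese.
docs = (
    ["tomato basil olive_oil garlic pasta",
     "mozzarella tomato basil",
     "parmesan pasta garlic"] * 4
    + ["tortilla beans chili cumin",
       "avocado lime cilantro chili",
       "corn tortilla salsa beans"] * 4
    + ["soy_sauce rice nori",
       "miso tofu rice"] * 3
)
labels = ["italian"] * 12 + ["mexican"] * 12 + ["japanese"] * 6
class_names = ["italian", "mexican", "japanese"]

# Simplified stand-in for the pipeline in the question.
pipeline = Pipeline([
    ("vect", CountVectorizer(max_df=0.75)),
    ("lr", LogisticRegression(C=0.4, solver="lbfgs", max_iter=1000)),
])

# Out-of-fold predictions: every sample is predicted by a model
# that did not see it during training (stratified 3-fold by default).
preds = cross_val_predict(pipeline, docs, labels, cv=3)

# Per-class precision/recall/F1 shows which classes the model struggles with.
report = classification_report(labels, preds)
print(report)

# Rows are true classes, columns are predicted classes, in class_names order.
cm = confusion_matrix(labels, preds, labels=class_names)
print(cm)

# Learning curve: a persistent gap between training and validation scores
# as the training set grows is the usual sign of high variance.
train_sizes, train_scores, test_scores = learning_curve(
    pipeline, docs, labels,
    cv=3,
    train_sizes=[0.5, 1.0],
    shuffle=True,       # shuffle so small training subsets contain several classes
    random_state=0,
)
for n, tr, te in zip(train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n_train={n}: train accuracy={tr:.2f}, validation accuracy={te:.2f}")
```

On the real data, `docs` would be the ingredient lists joined into strings and `labels` the cuisine column.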
I am very new to data science and I am trying to learn by myself. I think I really need another point of view on this problem, because I can't manage to improve my score.