I am working on a Kaggle dataset and I am trying to build a predictive model for the "Chance of Admit" (dependent variable) of students to the university of their interest.

Below you can find the correlations among all (independent and dependent) variables. We can quickly observe that only "GRE Score", "TOEFL Score" and "CGPA" are strongly correlated with the "Chance of Admit" variable. So it seems to make sense to eliminate all other variables from the predictive model.

Now, the "GRE Score", "TOEFL Score" and "CGPA" variables are all highly correlated with each other (this also makes sense in real life, as you would expect a good student to do well on all of these). I cannot decide which variables to keep for my final model. Could I keep all of them? If not, how do I decide which one to exclude?

Any help is appreciated.

What kind of predictive model are you building? Some models are sensitive to "double counting" highly correlated variables, but others are implicitly feature-selective and will only keep variables that add new information, and some are latent variable models that can summarize correlated variables into one new feature. Correlated input variables may or may not be a problem for your downstream task, depending on what you plan to do. – Nuclear Hoagie – 2020-01-03T19:30:31.637

So far I have tried recurrent neural networks, multiple linear regression and a k-NN regressor. I will also try SVR and decision trees. However, I am still new to machine learning and I do not know much about how each specific model treats highly correlated variables. Could you give me more info? – batman – 2020-01-03T23:11:11.513
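
For the linear regression case mentioned above, one common way to quantify how much correlated predictors overlap is the variance inflation factor (VIF), which for each predictor equals the corresponding diagonal entry of the inverse correlation matrix. The following is only a minimal sketch on synthetic data: the column names, the shared "ability" factor and the noise levels are assumptions made up for illustration, not the Kaggle data, and the helper functions are hypothetical.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Return {name: VIF} using the diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


# Synthetic stand-in data (assumption): three scores driven by one hidden
# "ability" factor, plus one unrelated predictor for comparison.
rng = random.Random(0)
ability = [rng.gauss(0, 1) for _ in range(500)]
data = {
    "gre": [a + rng.gauss(0, 0.3) for a in ability],
    "toefl": [a + rng.gauss(0, 0.3) for a in ability],
    "cgpa": [a + rng.gauss(0, 0.3) for a in ability],
    "unrelated": [rng.gauss(0, 1) for _ in range(500)],
}

vifs = vif(data)
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.2f}")
```

In this sketch the three correlated scores should show a clearly higher VIF than the unrelated column, whose VIF should stay close to 1. A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign that a linear model's coefficients may be unstable, though that threshold is a convention rather than a hard rule, and it says little about models such as k-NN or decision trees.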