Finding parameters with extreme values (classification with scikit-learn)



I am currently working with the Forest Cover Type Prediction competition from Kaggle, using classification models from scikit-learn. My main purpose is to learn about the different models, so I am not trying to discuss which one is better.

When working with logistic regression, I wonder whether I need the 'penalty' parameter (where I can choose L1 or L2 regularization). From what I have found, these regularization terms help to avoid over-fitting, especially when the parameter values are extreme. By extreme I understand that the range of some parameter values is very large compared to that of other parameters; correct me if I am wrong. In this case, wouldn't it be enough to apply log-scaling or normalization to these values?

The main questions are: since the number of parameters is large, are there visualization techniques and tools in scikit-learn that can help me find parameters with extreme values? Is there any statistical function or tool that reports how extreme the parameter values are?


Posted 2015-04-21T09:47:05.917

Reputation: 757



If by "parameters" you mean features (called "Data Fields" at Kaggle), then yes, you can log-scale those. To visualize them you can simply use histograms. To do this for all features in Python, for example, you can put your data in a pandas DataFrame (let us call it "data") and then call data.hist(). This has nothing to do with the regularization in any model.
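As a rough illustration of the feature case, here is a minimal sketch. The DataFrame is synthetic and the column names are made up for the example (they are not the actual Kaggle fields). It draws one histogram per feature with data.hist(), uses describe() and skew() as a simple numeric summary of how spread out or lopsided each feature is, and then log-transforms the most skewed column with np.log1p.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to get plot windows
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; column names are hypothetical.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "elevation": rng.normal(2800, 300, size=500),
    "distance_to_road": rng.lognormal(mean=7, sigma=1, size=500),
    "slope": rng.uniform(0, 50, size=500),
})

# One histogram per numeric column.
axes = data.hist(bins=30, figsize=(9, 6))

# Numeric summary: per-feature range and skewness, most skewed first.
summary = data.describe().T
summary["range"] = summary["max"] - summary["min"]
summary["skew"] = data.skew()
summary = summary.sort_values("skew", ascending=False)
print(summary[["min", "max", "range", "skew"]])

# Log-scale the most skewed feature (log1p also handles zeros).
most_skewed = summary.index[0]
skew_before = data[most_skewed].skew()
data[most_skewed + "_log"] = np.log1p(data[most_skewed])
skew_after = data[most_skewed + "_log"].skew()
print(most_skewed, "skew before:", skew_before, "after:", skew_after)
```

Skewness is only one possible measure of "extreme"; comparing the range column across features is another quick check before deciding what to rescale.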

If by "parameters" you mean the coefficients obtained after fitting the logistic regression, then regularization is what one uses to keep them from becoming extreme. This, however, is not directly related to the log-transform. How you list or visualize your coefficients depends on the programming tool you use for logistic regression (or another model).
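For the coefficient case, a minimal sketch assuming scikit-learn is the tool in use (as in the question). The data come from make_classification rather than the Kaggle set, and the feature names and C values are arbitrary choices for illustration. The fitted coefficients are read from the coef_ attribute and placed side by side for an L2 and an L1 penalty.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem; feature names are hypothetical.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Put features on a common scale so coefficient sizes are comparable.
X_scaled = StandardScaler().fit_transform(X)

# Same model with the two penalties; liblinear supports both.
# Recent scikit-learn releases may warn that 'penalty' is deprecated.
models = {
    "l2": LogisticRegression(penalty="l2", C=1.0, solver="liblinear"),
    "l1": LogisticRegression(penalty="l1", C=0.05, solver="liblinear"),
}

# One column of coefficients per penalty, one row per feature.
coefs = pd.DataFrame(
    {name: model.fit(X_scaled, y).coef_.ravel() for name, model in models.items()},
    index=feature_names,
)
print(coefs)

# Coefficients ranked by absolute size under the L2 penalty.
largest = coefs["l2"].abs().sort_values(ascending=False)
print(largest.head())

# L1 tends to set some coefficients exactly to zero.
print("zero coefficients per penalty:")
print((coefs == 0).sum())
```

A bar chart of the same table, for example coefs.plot.bar(), is one way to see the largest coefficients at a glance.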


Posted 2015-04-21T09:47:05.917

Reputation: 1 251