How to understand feature impact in a non-linear case?

3

Here is a simple example: I have a set of houses with different features (# rooms, perimeter, # neighbours, etc.), about 15 in total, and a price for each house. The features are also quite correlated (e.g. perimeter is often correlated with # rooms). I want to establish which features (or non-linear combinations of them) mainly determine the price.

In a linear case, for instance, I can fit a Lasso regression and read the importance of each feature from the coefficients. In my case, every feature (or maybe a combination of them) has a non-linear impact. For example, the # of neighbours can have a quadratic impact (the price increases while # neighbours < 10 and decreases once it is > 10).
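To make the concern concrete, here is a minimal sketch on synthetic data (the numbers are made up purely for illustration). It uses plain correlation as a stand-in for a linear coefficient: with an inverted-U relationship centred at 10 neighbours, the raw feature looks almost irrelevant, while a hand-built squared feature captures nearly everything.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
# Hypothetical data: price peaks at 10 neighbours and falls off on both sides.
neighbours = [random.randint(0, 20) for _ in range(2000)]
price = [100 - (n - 10) ** 2 + random.gauss(0, 5) for n in neighbours]

# Raw feature versus an engineered (centred, squared) feature.
r_linear = pearson(neighbours, price)
r_squared = pearson([(n - 10) ** 2 for n in neighbours], price)

print(f"corr(price, neighbours)        = {r_linear:+.2f}")
print(f"corr(price, (neighbours-10)^2) = {r_squared:+.2f}")
```

The first correlation should come out close to zero and the second close to -1, which is why a purely linear ranking would understate this feature.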

I want to identify the most important relationships between the features and the price. I don't need a predictor. For example, in the end I might discover that the price depends mainly on #rooms/perimeter and #neighbours^2.

I was thinking of applying kernel methods in combination with regression or PCA, but I don't know much about kernel methods.

Thank you in advance.

A M

Posted 2019-01-08T21:51:06.690

Reputation: 61

Answers

1

As far as I know, kernel methods cannot deal with categorical variables (I don't know whether that is relevant here). In addition, you will have to use indirect methods to evaluate variable importance. This could work, although I haven't tested it yet:

Giam, X., Olden, J.D., 2015. A new R2-based metric to shed greater insight on variable importance in artificial neural networks. Ecol. Modell. 313, 307–313. http://dx.doi.org/10.1016/j.ecolmodel.2015.06.034

I would definitely go for a tree-based approach. Since you already know that some variables are correlated, I would advocate conditional random forests (which address many drawbacks of the standard random forest implementation). Check:

Strobl, C., Hothorn, T., Zeileis, A., 2009. Party on! R J. 1 (2), 14–17.

And references therein. At least in R there are complementary packages such as pdp (https://cran.r-project.org/web/packages/pdp/index.html) that allow plotting the effect of each predictor variable on the target variable (house prices). That complements the variable importance ranking pretty well.
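For readers working in Python rather than R, the same two-step idea (rank the variables, then look at the shape of each effect) can be sketched roughly as follows. This is not the conditional forest from the party package: it swaps in a standard scikit-learn random forest with permutation importance, which can be misleading when features are strongly correlated. The data, feature names and coefficients are all made up, and the partial dependence curve is computed by hand rather than with a plotting package.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1500
# Hypothetical synthetic houses; the price formula is an assumption.
rooms = rng.integers(1, 9, size=n).astype(float)
perimeter = 20 + 8 * rooms + rng.normal(0, 5, size=n)  # correlated with rooms
neighbours = rng.integers(0, 21, size=n).astype(float)
noise_feature = rng.normal(size=n)                     # unrelated to price
price = 10 * rooms - (neighbours - 10) ** 2 + rng.normal(0, 3, size=n)

names = ["rooms", "perimeter", "neighbours", "noise_feature"]
X = np.column_stack([rooms, perimeter, neighbours, noise_feature])
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Permutation importance: drop in held-out R^2 when one column is shuffled.
result = permutation_importance(forest, X_test, y_test,
                                n_repeats=10, random_state=0)
importances = dict(zip(names, result.importances_mean))

def partial_dependence_curve(model, X, column, grid):
    """Average prediction when `column` is forced to each grid value."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, column] = value
        curve.append(model.predict(X_mod).mean())
    return np.array(curve)

grid = np.arange(0, 21, 2, dtype=float)
pd_neighbours = partial_dependence_curve(
    forest, X_test, names.index("neighbours"), grid)

for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {value:.3f}")
for g, v in zip(grid, pd_neighbours):
    print(f"neighbours={g:4.0f}  average predicted price={v:8.2f}")
```

The importance table gives the ranking, and the printed curve for neighbours should rise and then fall, which is the kind of non-linear shape a single coefficient cannot show.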

Good luck.

Rafael Muñoz-Mas

Posted 2019-01-08T21:51:06.690

Reputation: 81

A tree-based approach is able to detect the most important features but not their relationship with the output. I think I will first create non-linear features myself and then apply Lasso regression. – A M – 2019-01-15T08:55:34.723

A tree-based approach is able to predict the output (provided it works properly), so is it not finding the relationship between the inputs (X) and the output (Y)? In fact, a single decision tree can be read as a series of rules describing the conditions that 'trigger' a given response (a relationship?). Could you (@Alessio) expand on your comment? I may have misunderstood. – Rafael Muñoz-Mas – 2019-01-16T08:14:49.613

If, for instance, I use linear regression, I know the linear relationship between a feature x and the output y from the coefficient. If I use a random forest (not a single decision tree), I use an ensemble of trees and it is very hard to get rules from x to y. It's more of a black box. – A M – 2019-01-16T10:41:26.593

@Alessio Indeed, for forests an indirect method is needed. That's why I indicated that packages such as pdp complement the variable importance ranking. On the other hand, the relationships do not need to be linear, which can be quite an advantage. – Rafael Muñoz-Mas – 2019-01-16T11:57:43.907
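On the point raised in these comments that a single decision tree can be read as a series of rules, a small sketch with scikit-learn's `export_text` may help. The data and the price formula are invented for illustration, and the tree is kept deliberately shallow so that the rules stay readable.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
# Hypothetical synthetic data; the price formula is an assumption.
rooms = rng.integers(1, 9, size=500).astype(float)
neighbours = rng.integers(0, 21, size=500).astype(float)
price = 10 * rooms - (neighbours - 10) ** 2 + rng.normal(0, 3, size=500)

X = np.column_stack([rooms, neighbours])
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, price)

# Each root-to-leaf path is one "if ... then predicted price" rule.
rules = export_text(tree, feature_names=["rooms", "neighbours"])
print(rules)
```

A forest averages hundreds of such trees, which is why the rules are no longer readable directly and an indirect method is needed.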

1

I want to identify the most important relationships between the features and the price. I don't need a predictor. For example, in the end I might discover that the price depends mainly on #rooms/perimeter and #neighbours^2.

If it depends mainly on #neighbours^2, it depends on #neighbours to the same extent. The same holds for the other combinations.

But if you wish to clearly identify the linear dependency on #neighbours^2 rather than on #neighbours, or on #rooms/perimeter rather than simply #rooms, this is no different from building a predictor.

Weka has a rich toolkit for ranking and selecting features by their importance; see this blog post for a tutorial.

Dmytro Prylipko

Posted 2019-01-08T21:51:06.690

Reputation: 676

If the price first increases with #neighbours and then decreases, this is a quadratic relationship, very different from a linear one. I will first build non-linear features by hand by combining the others, and then I will apply Lasso regression or one of the methods you suggested. – A M – 2019-01-15T08:52:29.900

I am not talking about the exact nature of the dependency, I am talking about the importance. The method you want to use is probably a good one. Just keep in mind that you need to build a huge feature vector covering all possible combinations of features, with no guarantee that you have covered all of them. – Dmytro Prylipko – 2019-01-15T09:11:33.283

0

I'm not familiar with many methods for feature importance, but you could try a random forest. It is explained in:

Breiman L (2001). "Random Forests". Machine Learning. 45 (1): 5–32. doi:10.1023/A:1010933404324

Nemanja Boskovic

Posted 2019-01-08T21:51:06.690

Reputation: 111