Here is a simple example: I have a set of houses, each described by almost 15 features (# rooms, perimeter, # neighbours, etc.) and a price. The features are also quite correlated (e.g. perimeter is often correlated with # rooms). I want to establish which features (or non-linear combinations of them) mainly determine the price.

In the linear case, for instance, I can fit a Lasso regression and read the importance of each feature from its coefficient. In my case, every feature (or perhaps a combination of them) has a non-linear impact. For example, the # of neighbours can have a quadratic impact (the price increases while # neighbours < 10 and decreases once it exceeds 10).

I want to identify the most important relationships between the features and the price. I don't need a predictor. For example, in the end I might discover that the price depends mainly on #rooms/perimeter and #neighbours^2.

I was thinking of applying kernel methods, in combination with regression or PCA, but I don't know much about kernel methods.

Thank you in advance.

A tree-based approach can detect the most important features, but not their relationship with the output. I think I will first create non-linear features myself and then apply Lasso regression. – A M – 2019-01-15T08:55:34.723
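
The idea in the comment above (hand-built non-linear features, then Lasso) could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the house data are synthetic, the data-generating formula for `price` is made up, and `standardize`, `soft_threshold` and `lasso` are small hypothetical helpers written here in plain Python (a plain coordinate-descent Lasso), not any library's API. In practice one would more likely reach for an existing implementation such as scikit-learn's `Lasso`.

```python
import random


def standardize(col):
    """Rescale one column to zero mean and unit variance."""
    n = len(col)
    mean = sum(col) / n
    std = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / std for v in col]


def soft_threshold(z, t):
    """Shrink z towards zero by t; anything in [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso(columns, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n) * ||y - Xw||^2 + alpha * ||w||_1.

    Assumes every column is standardized and y is centred.
    """
    n = len(y)
    w = [0.0] * len(columns)
    residual = list(y)  # equals y - Xw, and w starts at zero
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Covariance of feature j with the residual, with j's own part added back.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha)
            step = new_w - w[j]
            if step != 0.0:
                residual = [r - step * c for r, c in zip(residual, col)]
                w[j] = new_w
    return w


# Synthetic houses (made-up data-generating process, for illustration only).
rng = random.Random(0)
n_houses = 300
rooms = [rng.randint(2, 8) for _ in range(n_houses)]
# Perimeter grows with the number of rooms, so the two are correlated.
perimeter = [4 * (15 * r * rng.uniform(0.7, 1.5)) ** 0.5 for r in rooms]
neighbours = [rng.randint(0, 20) for _ in range(n_houses)]
price = [
    100 + 500 * r / p - (k - 10) ** 2 + rng.gauss(0, 5)
    for r, p, k in zip(rooms, perimeter, neighbours)
]

# Candidate features: the raw ones plus two hand-built non-linear ones.
mean_k = sum(neighbours) / n_houses
features = {
    "rooms": rooms,
    "perimeter": perimeter,
    "neighbours": neighbours,
    "rooms/perimeter": [r / p for r, p in zip(rooms, perimeter)],
    "(neighbours - mean)^2": [(k - mean_k) ** 2 for k in neighbours],
}
names = list(features)
X = [standardize(features[name]) for name in names]
coefs = lasso(X, standardize(price), alpha=0.05)

# Rank candidate features by the size of their standardized coefficient.
for name, c in sorted(zip(names, coefs), key=lambda item: -abs(item[1])):
    print(f"{name:>22}: {c:+.3f}")
```

Because the columns are standardized, the coefficient sizes are comparable across features. Note that the squared term is centred before squaring, which keeps it from being almost collinear with the raw `neighbours` column; with strongly correlated candidates such as `rooms` and `rooms/perimeter`, Lasso may split or swap the weight between them, so the ranking should be read with some caution.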

A tree-based approach is able to predict the output (provided it works properly), so is it not finding the relationship between the input(s) (X) and the output (Y)? In fact, a single decision tree can be read as a series of rules describing the conditions that 'trigger' a given response (a relationship?). Could you (@Alessio) expand on your comment? Maybe I got lost. – Rafael Muñoz-Mas – 2019-01-16T08:14:49.613

If, for instance, I use linear regression, the coefficient tells me the linear relationship between a feature x and the output y. If I use a random forest (not a single decision tree), I use an ensemble of trees and it is very hard to extract rules from x to y. It's more of a black box. – A M – 2019-01-16T10:41:26.593

@Alessio Indeed, for forests an indirect method is needed. That is why I indicated that packages such as pdp complement the variable importance ranking. On the other hand, the relationships do not need to be linear, which can be quite an advantage. – Rafael Muñoz-Mas – 2019-01-16T11:57:43.907
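
The "indirect method" mentioned above can be illustrated with a hand-rolled partial dependence curve: force one feature to each value on a grid, keep the other features as observed, and average the model's predictions. The sketch below is an assumption-laden illustration, not the pdp package's API: `partial_dependence` is a hypothetical helper, `black_box` is a made-up stand-in for a fitted forest's predict function, and the houses are synthetic.

```python
import random


def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value.

    `predict` is any callable mapping one row (a dict) to a number, so the
    model itself can stay a black box.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)  # copy, so the original data are not changed
            modified[feature] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def black_box(row):
    """Hypothetical stand-in for a fitted forest's predict function."""
    return 100 + 500 * row["rooms"] / row["perimeter"] - (row["neighbours"] - 10) ** 2


# Synthetic houses (made-up data, for illustration only).
rng = random.Random(0)
houses = []
for _ in range(200):
    n_rooms = rng.randint(2, 8)
    houses.append({
        "rooms": n_rooms,
        "perimeter": 20 + 4 * n_rooms + rng.uniform(-3, 3),
        "neighbours": rng.randint(0, 20),
    })

grid = list(range(0, 21, 2))
curve = partial_dependence(black_box, houses, "neighbours", grid)
for k, value in zip(grid, curve):
    print(f"neighbours = {k:2d} -> average predicted price {value:7.2f}")
```

Plotting `curve` against `grid` would show the shape of the relationship (here rising up to 10 neighbours and falling afterwards) without needing to read rules out of the individual trees. The curve averages over the other features, so strong interactions or strongly correlated inputs can make it misleading.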