## How to use Random Forest to reduce dimensions

I am working on the Boston competition on Kaggle and at the moment I am trying to use Random Forest to find the columns with the highest correlation with the target variable SalePrice. However, the implementation returned almost every single variable in the dataset:

       0   1      2      3     4     5    6    ... 252 253 254 255 256 257 258
0        1  RL   65.0   8450  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
1        2  RL   80.0   9600  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
2        3  RL   68.0  11250  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
3        4  RL   60.0   9550  Pave   NaN  IR1  ...   0   0   0   0   1   0   1
4        5  RL   84.0  14260  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
5        6  RL   85.0  14115  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
6        7  RL   75.0  10084  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
7        8  RL    NaN  10382  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
8        9  RM   51.0   6120  Pave   NaN  Reg  ...   0   0   0   0   1   0   1
9       10  RL   50.0   7420  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
10      11  RL   70.0  11200  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
11      12  RL   85.0  11924  Pave   NaN  IR1  ...   0   0   1   0   1   0   1
12      13  RL    NaN  12968  Pave   NaN  IR2  ...   0   1   0   0   1   0   1
13      14  RL   91.0  10652  Pave   NaN  IR1  ...   0   0   1   0   1   0   1
14      15  RL    NaN  10920  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
15      16  RM   51.0   6120  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
16      17  RL    NaN  11241  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
17      18  RL   72.0  10791  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
18      19  RL   66.0  13695  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
19      20  RL   70.0   7560  Pave   NaN  Reg  ...   0   0   0   0   1   0   1
20      21  RL  101.0  14215  Pave   NaN  IR1  ...   0   0   1   0   1   0   1
21      22  RM   57.0   7449  Pave  Grvl  Reg  ...   0   1   0   0   1   0   1
22      23  RL   75.0   9742  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
23      24  RM   44.0   4224  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
24      25  RL    NaN   8246  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
25      26  RL  110.0  14230  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
26      27  RL   60.0   7200  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
27      28  RL   98.0  11478  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
28      29  RL   47.0  16321  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
29      30  RM   60.0   6324  Pave   NaN  IR1  ...   0   1   0   0   1   1   0
...    ...  ..    ...    ...   ...   ...  ...  ...  ..  ..  ..  ..  ..  ..  ..
1430  1431  RL   60.0  21930  Pave   NaN  IR3  ...   0   1   0   0   1   0   1
1431  1432  RL    NaN   4928  Pave   NaN  IR1  ...   0   1   0   0   1   0   1


Not only that, but some of these columns also contain NaN values, even though I had already handled the NaN values before returning anything.

Caveat: I am using Random Forest right after one-hot encoding my categorical variables so that is part of the reason why the return has such a high dimension.

Here is my implementation so far:

I have gathered the names of my categorical, ranked and numerical variables in separate lists:

categorical_columns = ['MSSubClass', 'MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'Neighborhood', 'Condition1',
'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
'Foundation', 'Heating', 'Electrical', 'Functional', 'GarageType', 'PavedDrive', 'Fence',
'MiscFeature', 'SaleType', 'SaleCondition', 'Street', 'CentralAir']

ranked_columns = ['Utilities', 'LandSlope', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond',
'PoolQC', 'OverallQual', 'OverallCond']

numerical_columns = ['LotArea', 'LotFrontage', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
'3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']


I've defined a function, feature_encoding(df, categorical_list), and the following code is taken from its body:

Here, I am going through every categorical variable in categorical_columns in a loop to one-hot encode each of them. At the end, I am inserting them back into the data frame:

for col in categorical_list:
    # take one-hot encoding
    OHE_sdf = pd.get_dummies(df[categorical_list])

    # drop the old categorical column from original df
    df.drop(col, axis = 1, inplace = True)

    # attach one-hot encoded columns to original dataframe
    df = pd.concat([df, OHE_sdf], axis = 1, ignore_index = True)

return df


Here, I am encoding my ranked values (for example: Excellent, Good, Average) with integers:

df['Utilities'] = df['Utilities'].replace(['AllPub', 'NoSeWa'], [2, 1])  # Utilities
df['ExterQual'] = df['ExterQual'].replace(['Ex', 'Gd', 'TA', 'Fa'], [4, 3, 2, 1])  # Exterior Quality
df['LandSlope'] = df['LandSlope'].replace(['Gtl', 'Mod', 'Sev'], [3, 2, 1])  # Land Slope
df['ExterCond'] = df['ExterCond'].replace(['Ex', 'Gd', 'TA', 'Fa', 'Po'], [4, 3, 2, 1, 0])  # Exterior Condition
df['HeatingQC'] = df['HeatingQC'].replace(['Ex', 'Gd', 'TA', 'Fa', 'Po'], [4, 3, 2, 1, 0])  # Heating Quality and Condition
df['KitchenQual'] = df['KitchenQual'].replace(['Ex', 'Gd', 'TA', 'Fa'], [3, 2, 1, 0])  # Kitchen Quality


Some of the columns had values abbreviated as NA, which meant something like "No pavement" but pandas interpreted it as NaN. To avoid this, I replaced each of these abbreviations with something like XX:

# Replace the NA values of each column with XX to keep pandas from listing them as NaN
na_data = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

for i in na_data:
    df[i] = df[i].fillna('XX')

# Replaced the NaN values of LotFrontage and MasVnrArea with the mean of their column
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
df['MasVnrArea'] = df['MasVnrArea'].fillna(df['MasVnrArea'].mean())


And finally, this is my Random Forest implementation to find correlated variables:

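from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df, df['SalePrice'], test_size=0.3, random_state=42)

# fit the selector and get a boolean mask of the columns it keeps
sel = SelectFromModel(RandomForestClassifier(n_estimators=100))
sel.fit(x_train, y_train)
sel.get_support()

# names of the columns that were kept
selected_feat = x_train.columns[sel.get_support()]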


I apologize for such a wordy post. I wanted to be as clear in my question as possible. If you'd like to see the entire .py file, it is in the same repository as the hyperlinked dataset.

– Peter – 2019-07-20T22:11:23.697

Have you tried playing with the 'threshold' param of SelectFromModel? Set it very high (e.g. 30*mean or something) and see if the number of returned features is lower. This will help you understand whether it's the data (the feature-to-target correlation) or something created in the code. – yoav_aaa – 2019-07-21T10:47:10.090

@yoav_aaa Could you expand on that a little more? – Andros Adrianopolos – 2019-07-22T20:21:57.943

Expanding on my comment,

SelectFromModel selects the best features based on an importance criterion. When the estimator (Random Forest in your case) is fitted, SelectFromModel obtains a feature importance for each of the features the estimator was fitted with.

Then SelectFromModel 'filters' out the features that don't meet a specific criterion, for example a minimum feature importance value. Setting this criterion (named threshold in sklearn) can have a big effect on the number of features being filtered out.
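As a rough illustration of that filtering step, here is a minimal sketch on synthetic data, not on your dataset. The data from make_regression, the sizes, and the use of RandomForestRegressor (chosen here because SalePrice is a continuous target) are all assumptions made for the example. It shows that, when no threshold is passed, the cutoff applied to a random forest is the mean of the feature importances.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 20 features, only 4 of them informative.
X, y = make_regression(n_samples=300, n_features=20, n_informative=4,
                       noise=10.0, random_state=0)

# No threshold is given, so SelectFromModel falls back to the mean importance.
selector = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=0))
selector.fit(X, y)

# One importance per feature; for a random forest they sum to 1.
importances = selector.estimator_.feature_importances_
# The numeric cutoff that was actually applied.
cutoff = selector.threshold_

# A feature is kept when its importance is at least the cutoff.
manual_mask = importances >= cutoff
n_selected = int(selector.get_support().sum())

print("cutoff:", cutoff)
print("selected", n_selected, "of", X.shape[1], "features")
```

Comparing manual_mask with selector.get_support() is a quick way to check that the selection is driven only by the importances and the cutoff.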

Based on your question, it's hard to tell whether all the features are indeed valuable to the estimator's fit quality. One way to examine whether it's a code-related issue is to try different values for the threshold param.

One would expect the number of selected features (features with support) to decrease when the threshold increases. If this works as expected, you can then think about how to determine the threshold value that will best serve your needs.
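A quick way to run that check is sketched below, again on synthetic data rather than your dataset. The helper count_selected is a hypothetical name introduced for this example, and RandomForestRegressor is an assumption made because the target is continuous. The threshold can be a float or a string such as "mean", "median" or "2*mean".

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 20 features, only 4 of them informative.
X, y = make_regression(n_samples=300, n_features=20, n_informative=4,
                       noise=10.0, random_state=0)


def count_selected(threshold):
    """Fit a fresh selector with the given threshold and count the kept features."""
    selector = SelectFromModel(
        RandomForestRegressor(n_estimators=50, random_state=0),
        threshold=threshold,
    )
    selector.fit(X, y)
    return int(selector.get_support().sum())


# Scaled thresholds are passed as ONE string, e.g. "2*mean".
# Note: 2 * "mean" in Python is string repetition ("meanmean"), not a scaled threshold.
scaled = ["0.5*mean", "mean", "2*mean", "5*mean"]
counts = [count_selected(t) for t in scaled]

# A plain float is compared directly with the importances. Random forest
# importances sum to 1, so no feature can reach a cutoff of 5.0.
none_left = count_selected(5.0)

for t, c in zip(scaled, counts):
    print(f"threshold={t!r}: {c} features kept")
print(f"threshold=5.0: {none_left} features kept")
```

If the counts do not shrink as the threshold grows, that points to something in the surrounding code rather than to the data.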

And how do I alter the threshold param? – Andros Adrianopolos – 2019-07-23T15:24:56.980

There is a threshold param you can pass to the SelectFromModel constructor. See the docs: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

– yoav_aaa – 2019-07-24T08:10:01.513

I see. But what is the appropriate unit/value representation for threshold? For example, what would threshold=5 do? – Andros Adrianopolos – 2019-07-24T11:41:38.127

I'm trying threshold=0.5, for example, but it doesn't make a difference. – Andros Adrianopolos – 2019-07-24T11:47:19.300

read the docs and try "300*mean" – yoav_aaa – 2019-07-24T12:51:08.343

I did threshold = 300 * "mean" and it still returns the same number of columns. – Andros Adrianopolos – 2019-07-25T10:55:10.643