I am working in Azure ML Studio and trying to build a regression model to predict a numerical value. Below I describe my features and what I have done so far.

My data has about 3 million rows.

Features:

- 8 integer features from 1 to 25
- 2 boolean features with 0 and 1
- 3 integer features from 1 to 10
- 2 integer features from 0 to 500.000 and from 0 to 1.000.000 respectively, with about 4.500 unique values
- 1 integer feature from 20 to 50
- 1 integer feature from 1 to 15
- 1 integer feature from 0 to 100

Label:

- Integer from 10.000 to 100.000.000 with about 5.000 unique values

What I have done:

- Split the dataset into 80% (train) and 20% (test), then split the training set again into 60% (actual train) and 40% (validation).
- Normalized the features with many unique values (the 4th bullet in the list above).
- Trained a Boosted Decision Tree Regression model.
- Used the Sweep Parameters module to find the best parameter combination.
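To make the effective proportions of the two-stage split explicit, here is a minimal sketch in plain Python. It is not the Azure ML Studio split module; the function name `two_stage_split`, its parameters, and the 1,000-row example are assumptions made for illustration only.

```python
import random


def two_stage_split(n_rows, test_frac=0.20, val_frac=0.40, seed=0):
    """Return (train, validation, test) lists of row indices.

    Hypothetical helper: first hold out test_frac of all rows for testing,
    then hold out val_frac of the remaining rows for validation.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle so the split is random

    n_test = round(n_rows * test_frac)
    test = indices[:n_test]
    remainder = indices[n_test:]

    n_val = round(len(remainder) * val_frac)
    validation = remainder[:n_val]
    train = remainder[n_val:]
    return train, validation, test


# Example with 1,000 rows instead of 3 million, just to show the proportions.
train_idx, val_idx, test_idx = two_stage_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With these fractions the model is actually fitted on 48% of all rows (60% of the 80%), validated on 32%, and tested on 20%.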

I also tried Neural Networks and Bayesian Linear Regression, but Boosted Decision Tree Regression (BDTR) gave the best score.

I also tried excluding columns: I started with only a few (the ones I think affect the model most) and then added more columns one by one.

However, the lowest MSE I could achieve was 1.500.000 (and many of the scored values were negative).

So I am wondering what other techniques I could use to improve the model.

If 10 is the minimum value the response can take, I would set any prediction below 10 equal to 10. You could also look into random forests/bagging, or average the predictions of many boosted-tree/neural-net models to see if it helps your results a little. A different loss function might also help: an estimate of 11 against a truth of 10 is a 10% error with a loss of only 1, whereas an estimate of 1,100,000 against a truth of 1,000,000 is also a 10% error but with a loss of 100,000, so the loss prioritizes the higher values. Just something to consider; a lot depends on the context of the variables. – TBSRounder – 2015-12-24T17:44:30.203
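The two ideas in the comment above (clipping predictions at the known minimum, and comparing absolute with relative error) can be sketched in plain Python as follows. The helper names `clip_to_min`, `absolute_error`, `relative_error`, and `log_error` are hypothetical, and the log-space error is only one possible scale-independent alternative, not something the comment prescribes.

```python
import math


def clip_to_min(predictions, min_value):
    """Raise every prediction below min_value up to min_value."""
    return [max(p, min_value) for p in predictions]


def absolute_error(truth, estimate):
    """Error in the units of the label; grows with the label's magnitude."""
    return abs(truth - estimate)


def relative_error(truth, estimate):
    """Error as a fraction of the true value; independent of magnitude."""
    return abs(truth - estimate) / abs(truth)


def log_error(truth, estimate):
    """Absolute difference of logarithms (assumes strictly positive values)."""
    return abs(math.log(estimate) - math.log(truth))


# Negative or too-small scores are replaced by the assumed minimum of 10.
clipped = clip_to_min([-5.0, 3.0, 42.0], 10)

# The two cases from the comment: same 10% relative error, very different loss.
small = (10, 11)
large = (1_000_000, 1_100_000)
for truth, estimate in (small, large):
    print(absolute_error(truth, estimate),
          relative_error(truth, estimate),
          log_error(truth, estimate))
```

Both cases have the same relative and log-space error, while the absolute error differs by a factor of 100,000. One common way to get this behavior from a squared-error learner is to train on the logarithm of the label and exponentiate the predictions, which also keeps them positive; whether that suits this dataset depends on the context of the variables.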