I am trying to build a linear regression model for the sale price of a house from many variables, using the data from this Kaggle challenge: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The plot above shows 2nd-floor size in square feet on the x-axis and sale price on the y-axis. The relationship looks clearly linear, except that homes without a 2nd floor (size 0) obviously do not sell for 0 dollars.

Many of my variables behave like this: at some upper or lower threshold, a single value of the predictor has a wide spread of responses. If I simply exclude the rows at that value, the fitted line for this variable will pass through the origin. Should I accept that and assume the $0 price tag will be corrected by the other variables in my regression?
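To make the situation concrete, here is a minimal sketch on synthetic data, not the Kaggle data. All the numbers (sizes, prices, noise levels) and the `fit_line` helper are made up for illustration. It compares a one-variable least-squares fit on all rows against a fit with the zero-size rows excluded.

```python
import random

def fit_line(x, y):
    """Hypothetical helper: ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic stand-in for the real data; every number here is an assumption.
rng = random.Random(0)
n = 100

# Homes with a 2nd floor: price roughly proportional to 2nd-floor size.
size_2nd = [rng.uniform(400, 1500) for _ in range(n)]
price_2nd = [100 * s + rng.gauss(0, 10_000) for s in size_2nd]

# Homes without a 2nd floor: size is 0, but prices vary widely.
size_none = [0.0] * n
price_none = [rng.gauss(150_000, 50_000) for _ in range(n)]

# Fit 1: all rows, including the cluster at size 0.
slope_all, intercept_all = fit_line(size_2nd + size_none, price_2nd + price_none)

# Fit 2: only rows with a 2nd floor (zeros excluded).
slope_2nd, intercept_2nd = fit_line(size_2nd, price_2nd)

print(f"all rows:       slope={slope_all:.1f}, intercept={intercept_all:.0f}")
print(f"zeros excluded: slope={slope_2nd:.1f}, intercept={intercept_2nd:.0f}")
```

In this toy setup the second fit has an intercept near $0 by construction, which is the case I am asking about, while the first fit is pulled around by the cluster of prices at size 0.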

What is the best way to treat/fit data such as this?

Thanks!