How best to use the resale transaction year in predicting housing prices?


I'm looking into the classic problem of predicting apartment prices (resale market) based on their type, size, location, etc. It's pretty straightforward, and Linear Regression or Regression Trees give decent first results. I'm still in the exploratory phase.

However, I'm not sure how to best incorporate the year of the resale transaction, since there are clear long-term trends over the years. Right now, I just keep it as a feature, which seems to be a valid approach. I just wonder if there might be alternative approaches. For example, I also have the overall price movement on a quarterly basis. So I assume I could adjust each resale price based on those trends and ignore the year as a feature. Would this make sense?

What are other approaches? (Again, I'm not even sure if this is an issue at all.)


Posted 2020-12-29T00:34:46.480

Reputation: 131



I think you have identified the two main options:

  1. Model the price trend, i.e. fit your model so that it captures the price trend over time.
  2. Clean your price data so that prices are expressed in "real terms" (not including the price trend).

Option 2 would imply that you "deflate" a standardised price (e.g. price per square meter). Thus the price per sqm in 2018, 2019, and 2020 would be "adjusted" to the price level of (for instance) 2017, so that all prices are "prices of 2017". You need a standardised price (e.g. per sqm) because you need to control for possible unobserved effects in the composition of your data, e.g. when the average house sold in 2020 is "larger" than the average house in 2017. In essence, you need to make sure that the "deflated" prices are comparable. This can be a problem, for instance, when there are changes in the market over time. You could imagine that the willingness to pay for "large" houses changes over time, so that one sqm of a "large" house becomes more expensive over time. It can be hard to capture such effects by simply "deflating" (average) prices per sqm.
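As a rough sketch of what "deflating" a price per sqm to a base year could look like, assuming you have a fixed-base price index per year: the index values, the sales records, and the helper name `deflated_price_per_sqm` below are all made up for illustration.

```python
# Hypothetical fixed-base price index (base year 2017 = 100); values are made up.
price_index = {2017: 100.0, 2018: 104.0, 2019: 110.0, 2020: 121.0}

# Hypothetical sales: (year of sale, nominal price, size in sqm).
sales = [
    (2017, 300_000.0, 100.0),
    (2019, 330_000.0, 100.0),
    (2020, 484_000.0, 160.0),
]

BASE_YEAR = 2017


def deflated_price_per_sqm(year, price, sqm, index=price_index, base_year=BASE_YEAR):
    """Return the price per sqm expressed at the price level of `base_year`."""
    nominal_per_sqm = price / sqm
    # Rescale from the price level of `year` to the price level of `base_year`.
    return nominal_per_sqm * index[base_year] / index[year]


real_prices = [deflated_price_per_sqm(y, p, s) for (y, p, s) in sales]
```

Note that this only removes the average price movement. If large flats appreciate faster than small ones, a single index per year will not capture that.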

Option 1 can partly capture the effect(s) described above. Consider the case of a linear regression. Say you have two years (2019, 2020) and you want to "control" for inflation over time. Your (simplified) base model, with price $p$ as the dependent variable and $sqm$ as the independent variable, would look like:

$$ p = \beta_0 + \beta_1 sqm + u. $$

Now you can add a "year dummy" (=1 if year==2020):

$$ p = \beta_0 + \beta_1 sqm + \beta_2 t_{2020} + u. $$

Coefficient $\beta_2$ captures the average effect on $p$ in 2020 compared to 2019. This is sometimes called a "fixed effect" since the variable simply is a "shift" in prices in 2020 compared to 2019 for all levels of $sqm$.
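A minimal sketch of the year-dummy model, using ordinary least squares via NumPy. The data are synthetic and noise-free (an assumption purely for illustration), generated so that the true shift in 2020 is 20,000:

```python
import numpy as np

# Synthetic, noise-free data (made up for illustration):
# p = 50_000 + 3_000 * sqm + 20_000 * t_2020
sqm = np.array([50.0, 80.0, 120.0, 60.0, 90.0, 110.0])
year = np.array([2019, 2019, 2019, 2020, 2020, 2020])
t_2020 = (year == 2020).astype(float)  # year dummy: 1 if year == 2020, else 0
p = 50_000 + 3_000 * sqm + 20_000 * t_2020

# Design matrix with columns [1, sqm, t_2020].
X = np.column_stack([np.ones_like(sqm), sqm, t_2020])

# Ordinary least squares fit of p on X.
beta, *_ = np.linalg.lstsq(X, p, rcond=None)
beta0, beta1, beta2 = beta
```

Here `beta2` recovers the parallel shift between the two years, while `beta1` (the price per additional sqm) is shared by both years.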

If you think $sqm$ and "time" are somehow related, you can also add interaction terms, e.g.:

$$ p = \beta_0 + \beta_1 sqm + \beta_2 t_{2020} + \beta_3 (sqm \cdot t_{2020}) + u. $$

In this model you allow for a different intercept (in 2019 and 2020) and for a different slope of $sqm$ in both years. Instead of an interaction of "time" and $sqm$, you could also add an interaction with "size dummies" (e.g. "small" vs. "large" houses).
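The interaction model can be sketched the same way. Again the data are synthetic and noise-free (an assumption for illustration), constructed so that the slope of $sqm$ is 3,000 in 2019 and 3,200 in 2020:

```python
import numpy as np

# Synthetic, noise-free data (made up for illustration):
# p = 50_000 + 3_000 * sqm + 5_000 * t_2020 + 200 * sqm * t_2020
sqm = np.array([50.0, 80.0, 120.0, 60.0, 90.0, 110.0])
t_2020 = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # 1 if sold in 2020
p = 50_000 + 3_000 * sqm + 5_000 * t_2020 + 200 * sqm * t_2020

# Design matrix with columns [1, sqm, t_2020, sqm * t_2020].
X = np.column_stack([np.ones_like(sqm), sqm, t_2020, sqm * t_2020])
beta, *_ = np.linalg.lstsq(X, p, rcond=None)
beta0, beta1, beta2, beta3 = beta

# Implied intercept and slope per year.
intercept_2019, slope_2019 = beta0, beta1
intercept_2020, slope_2020 = beta0 + beta2, beta1 + beta3
```

With only two years this is equivalent to fitting a separate line per year. With many years and real, noisy data, the number of interaction terms grows quickly, so it may be worth checking whether they actually improve test results.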

In essence, option 1 gives you more flexibility since linear regression allows you to "deflate" prices inside the model. Note that linear regression is a parametric approach, so you need to find a proper parameterisation of the model (just as you would need to find the right strategy to deflate prices when you do this outside the model).

When you use regression trees, you don't need to worry about the functional representation of the model. The advantage of linear regression is that the "time dummy" is forced to be fitted on all data. In regression trees the effect of dummies is less prevalent. So in this case "deflating" data outside the model could be worth a try.

However, if you are after predictions with low variance, you ultimately need to check which approach works best based on test results.

Edit (20-12-30): Dummies

Suppose you have a vector of IDs:

1  1
2  1
3  2
4  2
5  3
6  3

Dummy encoding will look like:

  id1 id2 id3
1   1   0   0
2   1   0   0
3   0   1   0
4   0   1   0
5   0   0   1
6   0   0   1

In a linear regression, dummies generally work as "contrasts", e.g. the effect of id2 vs. id1 and id3 vs. id1, so that you include n-1 of the dummies.
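A small plain-Python sketch of the two encodings for the ID vector above. The helper `dummy_encode` and its `drop_first` flag are hypothetical names chosen for this illustration:

```python
ids = [1, 1, 2, 2, 3, 3]


def dummy_encode(values, drop_first=False):
    """Return (column_names, rows) with one 0/1 column per distinct value.

    With drop_first=True the first level is left out and serves as the
    reference category, which gives n-1 columns for n levels.
    """
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]
    names = [f"id{level}" for level in levels]
    rows = [[int(v == level) for level in levels] for v in values]
    return names, rows


# Full encoding: n columns (id1, id2, id3), as in the table above.
full_names, full_rows = dummy_encode(ids)

# Contrast encoding for a regression with an intercept: n-1 columns (id2, id3).
names, rows = dummy_encode(ids, drop_first=True)
```

In the n-1 version, rows belonging to id1 are all zeros, so the coefficients on id2 and id3 are read as differences relative to id1.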


Posted 2020-12-29T00:34:46.480

Reputation: 4 724

thanks for the very detailed reply! For option 1, does that boil down to treating year as a categorical feature: having data for n years (say, n=30), I will have n or n-1 dummy features? I haven't fully grasped the difference between dummy and one-hot encoding :). For Option 2, what I have is for every quarter, the change compared to the previous one. For example for Q2/2020 it is +1.2% meaning the resale price index (not sure how it gets calculated) increased by that amount compared to Q1/2020. Does that mean I could increase the Q2/2020 prices by 1.2%? – Christian – 2020-12-30T00:22:26.753


Regarding dummies/one-hot: you would include $n-1$ dummies (one per time period, leaving out a reference period). See my edit and this post: Regarding the other option: Q1/20 * 1.012 = Q2/20 on average (!). It appears to be a chain index, so you have a quarter-to-quarter change which you need to "map" onto the entire period of time (with a fixed base). If you are not familiar with this, I suggest not using the method.

– Peter – 2020-12-30T10:52:17.193
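The "mapping" of quarter-to-quarter changes onto a fixed base, as mentioned in the last comment, could be sketched roughly as follows. Only the +1.2% for Q2/2020 comes from the thread; the other changes, the period labels, and the helper name `chain_to_fixed_base` are assumptions for illustration.

```python
# Quarter-on-quarter changes of a resale price index, in percent.
# Only the +1.2% for 2020Q2 is from the thread; the rest is made up.
qoq_change_pct = [
    ("2020Q1", 0.0),   # change vs. the unobserved previous quarter (ignored)
    ("2020Q2", 1.2),
    ("2020Q3", -0.5),
    ("2020Q4", 2.0),
]


def chain_to_fixed_base(changes, base_period):
    """Turn period-on-period % changes into an index with base_period = 100."""
    level = 1.0
    levels = {}
    for i, (period, pct) in enumerate(changes):
        if i > 0:  # the first period only serves as the starting level
            level *= 1.0 + pct / 100.0
        levels[period] = level
    base_level = levels[base_period]
    return {period: 100.0 * lvl / base_level for period, lvl in levels.items()}


index = chain_to_fixed_base(qoq_change_pct, base_period="2020Q1")

# Express a (hypothetical) 2020Q2 sale at the 2020Q1 price level.
price_q2 = 506_000.0
price_q2_at_q1_level = price_q2 * index["2020Q1"] / index["2020Q2"]
```

So a Q2/2020 price is divided by 1.012 to bring it to the Q1/2020 level, not increased by 1.2%, and changes over several quarters compound multiplicatively rather than add up.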