I think you have identified the two main options:

1. Model the price trend, i.e. let the model itself capture how prices change over time.
2. Clean your [price] data so that prices are expressed in "real terms" (not including the price trend).

**Option 2** implies that you "deflate" a standardised price (e.g. price per square meter). The price per sqm in 2018, 2019, and 2020 would be "adjusted" to the price level of, for instance, 2017, so that all prices are "prices of 2017". You need a standardised price (e.g. per sqm) to control for changes in the composition of your data, e.g. when the average house sold in 2020 is larger than the average house sold in 2017. In essence, you need to make sure that the "deflated" prices are comparable. This can be a problem when the market itself changes over time. For instance, the willingness to pay for "large" houses may change, so that one sqm of a "large" house becomes more expensive over time. It can be hard to capture such effects by simply "deflating" (average) prices per sqm.
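The adjustment described above can be sketched in a few lines of Python. This is only an illustration: the index values, the sales records, and the helper name `deflate_price_per_sqm` are made up, and a real price index would come from a statistical office or similar source.

```python
# Hypothetical price index by year (base year 2017 = 100); the values are made up.
price_index = {2017: 100.0, 2018: 103.0, 2019: 107.0, 2020: 112.0}

# Hypothetical sales records: (year, nominal price, size in sqm).
sales = [
    (2017, 300_000.0, 100.0),
    (2019, 321_000.0, 100.0),
    (2020, 448_000.0, 125.0),
]

BASE_YEAR = 2017


def deflate_price_per_sqm(year, price, sqm, index=price_index, base_year=BASE_YEAR):
    """Return the price per sqm expressed at the price level of base_year."""
    nominal_per_sqm = price / sqm  # standardise first: price per sqm
    return nominal_per_sqm * index[base_year] / index[year]  # then deflate


# All entries are now "prices of 2017" per sqm and can be compared directly.
real_prices = [deflate_price_per_sqm(y, p, s) for y, p, s in sales]
```

Note that this applies one index to every house, so it cannot capture a change in the relative price of "large" versus "small" houses.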

**Option 1** can partly capture the effect(s) described above. Consider the case of a linear regression. Say you have two years (2019, 2020) and you want to "control" for inflation over time. Your (simplified) base model, with price $p$ as the dependent variable and $sqm$ as the independent variable, would look like:

$$ p = \beta_0 + \beta_1 sqm + u. $$

Now you can add a "year dummy" (`=1 if year==2020`):

$$ p = \beta_0 + \beta_1 sqm + \beta_2 t_{2020} + u. $$

Coefficient $\beta_2$ captures the average difference in $p$ in 2020 compared to 2019. This is sometimes called a "fixed effect", since the variable is simply a "shift" in prices in 2020 compared to 2019 for *all* levels of $sqm$.

If you think $sqm$ and "time" are somehow related, you can also add interaction terms, e.g.:

$$ p = \beta_0 + \beta_1 sqm + \beta_2 t_{2020} + \beta_3 sqm * t_{2020} + u. $$

In this model you allow for a different intercept (in 2019 and 2020) *and* for a different slope of $sqm$ in the two years. Instead of an interaction of "time" and $sqm$, you could also add an interaction of "time" with "size dummies" (e.g. "small" vs. "large" houses).
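As a rough sketch, the interaction model can be fitted by ordinary least squares with NumPy. The data here are simulated, the "true" coefficients are invented for illustration, and `fit_ols` is a hypothetical helper name. In practice you would probably use a regression library that also reports standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data: house size and a dummy that is 1 for sales in 2020, 0 for 2019.
sqm = rng.uniform(40, 200, size=n)
t2020 = rng.integers(0, 2, size=n).astype(float)

# Design matrix with columns: intercept, sqm, t2020, sqm * t2020.
X = np.column_stack([np.ones(n), sqm, t2020, sqm * t2020])

# Made-up coefficients (beta_0, beta_1, beta_2, beta_3) used to simulate prices.
beta_true = np.array([50_000.0, 2_000.0, 10_000.0, 100.0])
p = X @ beta_true + rng.normal(0.0, 5_000.0, size=n)


def fit_ols(design, target):
    """Return the least-squares coefficient vector for target ~ design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


beta_hat = fit_ols(X, p)

# beta_hat[2] estimates the shift of the intercept in 2020,
# beta_hat[3] estimates the change in the sqm slope in 2020.
intercept_2020 = beta_hat[0] + beta_hat[2]
slope_2020 = beta_hat[1] + beta_hat[3]
```

Dropping the last column of `X` gives the simpler model with only the year dummy, i.e. a pure shift in the intercept.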

In essence, option 1 gives you more flexibility since linear regression allows you to "deflate" prices inside the model. Note that linear regression is a parametric approach, so you need to find a proper parameterisation of the model (just as you would need to find the right strategy to deflate prices when you do this outside the model).

When you use regression trees, you don't need to worry about the functional form of the model. The advantage of linear regression is that the "time dummy" is forced to be fitted on all data. In a regression tree, a dummy only has an effect in the branches where the tree happens to split on it, so the effect of dummies is less prevalent. So in this case "deflating" the data outside the model could be worth a try.

However, if your goal is predictions with low variance, you ultimately need to check which approach works best based on test results.

**Edit (20-12-30): Dummies**

Suppose you have a vector of IDs:

```
  id
1  1
2  1
3  2
4  2
5  3
6  3
```

Dummy encoding will look like:

```
  id1 id2 id3
1   1   0   0
2   1   0   0
3   0   1   0
4   0   1   0
5   0   0   1
6   0   0   1
```

In a linear regression, dummies generally work as "contrasts", e.g. the effect of `id2` vs. `id1` and `id3` vs. `id1`, so that you include `n-1` of the dummies.
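A minimal sketch of the `n-1` encoding in plain Python follows. The helper `dummy_encode` and its parameters are hypothetical names chosen for this example. The first level is dropped and serves as the reference category, because a full set of `n` dummies together with an intercept would be perfectly collinear.

```python
ids = [1, 1, 2, 2, 3, 3]


def dummy_encode(values, prefix="id", drop_first=True):
    """Return (column names, rows of 0/1 indicators) for a categorical vector."""
    levels = sorted(set(values))
    if drop_first:
        # The first level becomes the reference category (here: id1).
        levels = levels[1:]
    columns = [f"{prefix}{level}" for level in levels]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return columns, rows


# n-1 dummies: id2 and id3, each measured against the reference id1.
columns, rows = dummy_encode(ids)

# Full one-hot encoding (n columns), as in the table above.
full_columns, full_rows = dummy_encode(ids, drop_first=False)
```

With `drop_first=True`, a row of all zeros identifies the reference category `id1`.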

Thanks for the very detailed reply! For option 1, does that boil down to treating `year` as a categorical feature: having data for `n` years (say, `n=30`), I will have `n` or `n-1` dummy features? I haven't fully grasped the difference between dummy and one-hot encoding :). For Option 2, what I have is for every quarter, the change compared to the previous one. For example for Q2/2020 it is +1.2% meaning the resale price index (not sure how it gets calculated) increased by that amount compared to Q1/2020. Does that mean I could increase the Q2/2020 prices by 1.2%? – Christian – 2020-12-30T00:22:26.7531

Regarding dummies/one-hot: you would include $n-1$ dummies (one for each time period except the reference period). See my edit and this post: https://datascience.stackexchange.com/a/84061/71442. Regarding the other option: Q1/20 * 1.012 = Q2/20 on average (!). It appears to be a chain index, so you have a quarter-to-quarter change which you need to "map" onto the entire period of time (with a fixed base), see https://stats.oecd.org/glossary/detail.asp?ID=3742. If you are not familiar with this, I suggest not using the method.

– Peter – 2020-12-30T10:52:17.193
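To illustrate the "mapping" of quarter-to-quarter changes onto a fixed base mentioned in the last comment, here is a minimal sketch. It assumes the published figures are simple percentage changes relative to the previous quarter. Only the +1.2% for Q2/2020 comes from the comments; the other values, the base quarter, and the function name `chain_to_fixed_base` are made up.

```python
# Quarter-on-quarter changes in percent. Only +1.2 for 2020Q2 is from the
# comments above; the other values are hypothetical.
qoq_change_pct = [("2020Q1", 0.5), ("2020Q2", 1.2), ("2020Q3", -0.3)]


def chain_to_fixed_base(base_period, changes_pct, base_value=100.0):
    """Cumulate period-on-period changes into an index with a fixed base."""
    index = {base_period: base_value}
    level = base_value
    for period, pct in changes_pct:
        level *= 1.0 + pct / 100.0  # each change applies to the previous level
        index[period] = level
    return index


# Assumed base quarter: 2019Q4 = 100.
index = chain_to_fixed_base("2019Q4", qoq_change_pct)


def to_base_level(price, period, index=index, base_period="2019Q4"):
    """Express a price observed in `period` at the price level of the base period."""
    return price * index[base_period] / index[period]


# A sale in 2020Q2 expressed in "prices of 2019Q4".
adjusted = to_base_level(250_000.0, "2020Q2")
```

The point is that each quarter's adjustment depends on all changes since the base quarter, not only on the most recent one.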