I have data on a unit $i$ that enters an entry state $S_0$. The unit has covariates $x_i$, and I would like to predict the probability that it reaches each of the terminal states, $S_{pos}$ or $S_{neg}$. The unit can spend time in each state, and that time affects the probabilities as well. At some point the unit can jump to a non-terminal intermediate state $S_j$, which also changes the probability of reaching either terminal state.

I have the "dimension" data of the unit, $x_i$, in one table, and the state transitions with their times in a "fact" table.

I would like to build a model $M(x, s, t)$ that predicts the probability of reaching $S_{pos}$ given the covariates $x$, the current state $s$, and the time $t$ spent in that state. My idea was to replicate the data: for each unit $i$ I would create $T$ rows, one for each timestamp at which it was alive before reaching a terminal state, and feed these rows to a general-purpose classifier.
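To make the replication idea concrete, here is a minimal sketch of the expansion in plain Python. The table contents, the covariate name `x_age`, the helper `expand_unit_days`, and the daily granularity are all made up for illustration; my real tables have more columns.

```python
from datetime import date

# Hypothetical "dimension" table: one record of covariates per unit.
units = {
    1: {"x_age": 30},
    2: {"x_age": 45},
}

# Hypothetical "fact" table: (unit_id, state, date the unit entered that state).
transitions = [
    (1, "S_0", date(2024, 1, 1)),
    (1, "S_j", date(2024, 1, 3)),
    (1, "S_pos", date(2024, 1, 5)),
    (2, "S_0", date(2024, 1, 1)),
    (2, "S_neg", date(2024, 1, 3)),
]

TERMINAL = {"S_pos", "S_neg"}


def expand_unit_days(units, transitions):
    """Return one row per unit per day spent in a non-terminal state."""
    # Group each unit's transitions in chronological order.
    by_unit = {}
    for unit_id, state, entered in sorted(transitions, key=lambda r: (r[0], r[2])):
        by_unit.setdefault(unit_id, []).append((state, entered))

    rows = []
    for unit_id, path in by_unit.items():
        final_state = path[-1][0]
        if final_state not in TERMINAL:
            # Outcome not observed yet; this sketch simply skips such units.
            continue
        label = int(final_state == "S_pos")
        # Pair each state with the next one to get the length of the stay.
        for (state, entered), (_, left) in zip(path, path[1:]):
            for t in range((left - entered).days):
                rows.append(
                    {"unit_id": unit_id, **units[unit_id],
                     "state": state, "t": t, "label": label}
                )
    return rows


rows = expand_unit_days(units, transitions)
for row in rows:
    print(row)
```

Every row of a unit carries the same label, so the rows are not independent; if I go this way I assume any train/test split has to be done by `unit_id` rather than by row.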

The question is whether this is redundant. Is there a representation of the data in which I do not have to duplicate rows? I could feed this to an RNN, but there I would still need to feed a "nothing happened" timestamp every day until the state changes.

How would you model this?

Survival analysis seems less relevant, as I do not want to predict the duration, only the terminal state.