In general, time series problems are not fundamentally different from other machine learning problems: you want your test set to 'look like' your training set, because you want the model learned on the training set to still be appropriate for the test set. That is the important underlying concept behind stationarity. Time series have the additional complexity that there may be *long-term* structure in the data that your model may not be sophisticated enough to learn. For example, an autoregressive model with lag N cannot learn dependencies over intervals longer than N. Hence, when using simple models like ARIMA, we also want the data to be *locally* stationary.

As you said, stationarity just means that the series' statistics don't change over time ('locally' stationary). ARIMA models are essentially regression models where you use the past N values as inputs to a linear regression that predicts the (N+1)st value (at least, that's what the AR part does). When you fit the model, you are learning the regression coefficients. If you learn the relationship between the past N points and the next point on one stretch of the series, and then apply it to a different set of N points to predict the following value, you are implicitly assuming that the same relationship holds there too. That's stationarity. If you split your training set into two intervals, trained on each separately, and got two very different models, what would you conclude from that? Would you feel confident applying either model to predict *new* data? Which one would you use? These issues arise when the data is 'non-stationary'.
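To make the "two intervals, two models" thought experiment concrete, here is a small sketch (the function name `fit_ar` and all details are my own, not from the answer) that fits AR coefficients by ordinary least squares on two disjoint halves of a simulated stationary series. Because the series is stationary, both halves recover essentially the same coefficients; on non-stationary data they would disagree.

```python
import numpy as np

def fit_ar(series, n_lags):
    """Fit an AR(n_lags) model by ordinary least squares.

    Each row of X holds n_lags consecutive past values; y is the value
    that follows them. The returned vector holds the learned regression
    coefficients (oldest lag first, most recent lag last).
    """
    X = np.column_stack([series[i:len(series) - n_lags + i]
                         for i in range(n_lags)])
    y = series[n_lags:]
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

rng = np.random.default_rng(0)
# Simulate a stationary AR(2) process: x_t = 0.5 x_{t-1} - 0.3 x_{t-2} + noise
x = np.zeros(2000)
for t in range(2, len(x)):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

# Fit on two disjoint halves; for stationary data the coefficients agree.
first = fit_ar(x[:1000], n_lags=2)
second = fit_ar(x[1000:], n_lags=2)
print(np.round(first, 2), np.round(second, 2))
```

If you instead simulated a series whose dynamics change halfway through, `first` and `second` would differ, which is exactly the red flag described above.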

My take on RNNs is this: you are still learning a pattern from one segment of a time series, and you still want to apply it to another part of the series to get predictions. The model learns a simplified representation of the time series, and if that representation holds on the training set but not on the test set, it won't perform well. However, unlike ARIMA, RNNs are capable of learning nonlinearities, and specialized units like LSTM cells are even better at this. In particular, LSTMs and GRUs are very good at learning long-term dependencies. See for example this blog post. Effectively this means that 'stationarity' is a less brittle requirement for RNNs, so it's somewhat less of a concern. To be able to learn long-term dependencies, however, you need LOTS of data to train on.
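As a rough illustration of *why* LSTMs can carry information over long intervals, here is a minimal numpy sketch of one step of a standard LSTM cell (untrained random weights, purely illustrative; `lstm_step` and the shapes are my own choices). The key detail is that the cell state `c` is updated *additively* through the forget and input gates, which is what lets information, and gradients, survive over many steps:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell (illustrative, untrained weights)."""
    n = h.shape[0]
    z = W @ x + U @ h + b              # all four gate pre-activations at once
    i = sigmoid(z[0:n])                # input gate: what to write
    f = sigmoid(z[n:2 * n])            # forget gate: what to keep
    o = sigmoid(z[2 * n:3 * n])        # output gate: what to expose
    g = np.tanh(z[3 * n:])             # candidate cell update
    c_new = f * c + i * g              # additive memory update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h = np.zeros(n_hid)
c = np.zeros(n_hid)
for _ in range(10):                    # run the cell over a short sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h.shape, c.shape)
```

Compare the `c_new = f * c + i * g` line with a plain RNN, where the entire state is squashed through a nonlinearity at every step; that squashing is what makes long-range dependencies hard for simple recurrent models.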

Ultimately, the proof is in the pudding: do model validation as you would with any other machine learning project. If your model predicts hold-out data well, you can feel somewhat confident using it. But as with any other ML project, if your test data is ever significantly different from your training data, your model will not perform well.
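One wrinkle worth spelling out: with time series, hold-out validation should respect time order, since random shuffling would let the model peek at the future. A common approach is walk-forward validation; here is a small sketch (the function name `walk_forward_splits` and the parameters are my own illustration):

```python
import numpy as np

def walk_forward_splits(n, n_train, horizon):
    """Yield (train_idx, test_idx) index pairs that respect time order.

    Unlike a shuffled split, each test block comes strictly after its
    training block, mimicking how the model will be used on new data.
    """
    start = 0
    while start + n_train + horizon <= n:
        train = np.arange(start, start + n_train)
        test = np.arange(start + n_train, start + n_train + horizon)
        yield train, test
        start += horizon               # slide the window forward

splits = list(walk_forward_splits(n=20, n_train=10, horizon=5))
for train, test in splits:
    print(train[0], train[-1], '->', test[0], test[-1])
```

Each fold trains on one window and evaluates on the interval immediately after it, so a stretch of non-stationary data shows up directly as a fold with poor hold-out performance.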

+2 – This answer is excellent. Well-considered and thorough. – StatsSorceress – 2018-03-02T14:18:23.457

+1 – It's been a while. Has anyone tested this assumption? – compguy24 – 2019-03-19T20:40:43.300