## Predicting next number in a sequence - data analysis


I am a machine learning newbie working on a project where I'm given a sequence of integers, all in the range 0 to 70. My goal is to predict the next integer in the sequence given the previous 5 integers from the same sequence. There isn't much more information about the sequence itself (for example, how it was obtained).

The following are the things I tried.

1. The first thing that came to mind was to use an LSTM regression model with 5 input time steps and one output (corresponding to the next integer in the sequence - in Keras this would be return_sequences=False). I passed the 5 previous integers themselves as the input. This resulted in the model predicting pretty much the average (~30) all the time.
2. I tried the model in (1) with more input time steps (say 100), but still saw no improvement.
3. I then tried (1) and (2), but this time using the difference between consecutive integers as the input and trying to predict the difference to the next integer in the sequence. The results with this are still bad.
4. I then tried an LSTM classification model, one-hot encoding the input and output since I know that all integers in the sequence are in the range 0 to 70. Again, no improvement.
5. I then tried a seq2seq (encoder-decoder) LSTM model with 5 inputs to the encoder and 5 outputs from the decoder, with the correct outputs also fed into the decoder (teacher forcing). The results are still bad.
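For reference, attempt (1) above can be sketched in Keras roughly as follows. The window size of 5 comes from the question; the layer width, optimizer, and the random stand-in sequence are assumptions, since the real data isn't shown:

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

WINDOW = 5  # number of previous integers fed to the model

# Stand-in data: the real sequence from the question is not available here.
rng = np.random.default_rng(0)
seq = rng.integers(0, 71, size=500)

# Sliding windows: X has shape (samples, 5, 1); y is the next integer.
X = np.array([seq[i:i + WINDOW] for i in range(len(seq) - WINDOW)])[..., None]
y = seq[WINDOW:]

model = Sequential([
    LSTM(32, input_shape=(WINDOW, 1)),  # return_sequences=False is the default
    Dense(1),                           # single real-valued output (regression)
])
model.compile(optimizer="adam", loss="mse")
```

On data with no learnable structure, minimizing MSE drives a model of this shape toward predicting the unconditional mean, which matches the ~30 behavior described above.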

At this point I started doubting whether a model can be trained on the given data at all, or whether the data is just a bunch of random integers.

I looked for statistical tests to determine whether the data is random and found pandas' autocorrelation plot. This is what the plot looks like when computed on the differences between consecutive integers (it looks similar when computed on the integers themselves). As I understand it, since the values are very close to zero, the data is random. Is that right?
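The pandas plot referred to here can be produced along these lines (a stand-in random series is used in place of the real data):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen

import numpy as np
import pandas as pd
from pandas.plotting import autocorrelation_plot

# Stand-in series; substitute the real sequence of integers here.
seq = pd.Series(np.random.default_rng(0).integers(0, 71, size=500))
diffs = seq.diff().dropna()  # differences between consecutive integers

ax = autocorrelation_plot(diffs)  # x-axis: lag, y-axis: autocorrelation
ax.figure.savefig("acf_pandas.png")
```

`autocorrelation_plot` draws the autocorrelation at every lag up to the series length, together with 95% and 99% confidence bands; values inside the bands are consistent with randomness.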

I also used statsmodels' "plot_acf", and the following is the plot I got for the differences between consecutive integers. I see that there is some negative correlation at lag 1. Why doesn't this show up in the plot from pandas' autocorrelation_plot()?

I tried building an AR (autoregression) model as well, but the results are still bad.

The histogram of the integers in the sequence also seems to suggest that the integers are random (all values have about the same count, except for some of the higher integers). Am I wasting my time trying to build a machine learning model to predict the next integer in this sequence?


Interesting problem. The pandas autocorrelation plot suggests that the data is random.

How much do you know about the source? Is it believable that the sequence is, in fact, random?

Have you gone through the data and plotted a histogram of the integer appearance counts? Do they appear uniformly or are some more frequent than others?

One thing I think you should try is to reformulate your LSTM model. I don't think this is a regression problem, even though the target is a single integer. It's a classification problem with 71 classes (the integers 0 through 70). Try treating it that way and using a categorical crossentropy loss function. In such a scenario, ~1.5% accuracy would represent random guessing. Can the model do better than that?
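The classification reformulation suggested here could be sketched like this in Keras (the layer width is a guess; `sparse_categorical_crossentropy` is used so the integer labels don't need explicit one-hot encoding):

```python
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

NUM_CLASSES = 71  # the integers 0 through 70
WINDOW = 5        # previous integers used as input

model = Sequential([
    LSTM(32, input_shape=(WINDOW, 1)),
    Dense(NUM_CLASSES, activation="softmax"),  # one probability per integer
])

# sparse_categorical_crossentropy accepts integer targets directly,
# avoiding a one-hot encoding step for the labels.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

With 71 equally likely classes, random guessing gives 1/71 ≈ 1.4% accuracy, which is the baseline the validation accuracy needs to beat.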

I don't know much about the source other than that the integers are the absolute positions of a ball on a number line (no idea whether they are random or follow a pattern).

I just added the histogram to my question. All values appear uniformly except a few at the edges.

I did try an LSTM classification model (a many-to-one model) by one-hot encoding the inputs and outputs. I trained it for 100 epochs with 'categorical_crossentropy' loss and the 'accuracy' metric (in Keras), and this is what I get.

```
Epoch 100/100
1s - loss: 0.1049 - acc: 0.9994 - val_loss: 8.3997 - val_acc: 0.0243
```

– varun – 2018-12-14T14:43:45.570

The first 'accuracy' value (0.9994) in my first comment is possibly due to overfitting, because the LSTM dimension I used was 256. The model was probably able to memorize the data completely. After adding a bit of L2 regularization, I get these values.

```
Epoch 100/100
1s - loss: 4.0315 - acc: 0.0489 - val_loss: 4.2761 - val_acc: 0.0216
```
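The L2 regularization mentioned above could be added along these lines; the 256 units come from the comment, while the 1e-3 penalty strength and the choice to regularize both the input and recurrent kernels are assumptions:

```python
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2

model = Sequential([
    # L2 penalties on both the input-to-hidden and recurrent weights
    LSTM(256, input_shape=(5, 1),
         kernel_regularizer=l2(1e-3),
         recurrent_regularizer=l2(1e-3)),
    Dense(71, activation="softmax"),  # one class per integer 0..70
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

The penalty discourages the large weights needed to memorize individual training windows, which is why the training accuracy drops back toward the validation accuracy once it is applied.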

The accuracy on the validation set did not improve. – varun – 2018-12-14T14:56:47.697

Great. Well, it seems that the model wasn't able to recognize a generalizable pattern. I'm not sure how much time you want to spend on this, as it's getting pretty likely the data is random (or depends on something that's not in the data you have), but if you did want to keep going you could try 1) not using one-hot encoding of the inputs, because there is actual ordinality in the case of number-line position (meaning 5 and 6 are close to each other, and 5 and 60 are not - with one-hot encoding, 5 and 6 are totally different and 5 and 60 are equally totally different), and – Matthew – 2018-12-14T15:20:36.203

2) try binning. If you treat 0-9, 10-19, 20-29, etc. as bins, are you able to successfully get a rough estimate of where the sequence is going? – Matthew – 2018-12-14T15:21:39.657

I get your point that we lose the ordering information when using one-hot encoding for numerical data like this. Are you suggesting "binning" as a way to convert continuous data into categorical data while still preserving some partial ordering (within a bin)? The other extreme would be to treat it as a regression problem (which I have already tried), wouldn't it? – varun – 2018-12-14T15:50:29.613

My thought was to leave the original data as-is to preserve the ordinal relationship, but to try to predict whether the next number would be 0-9, 10-19, 20-29, etc - basically, trying to relax the problem a little bit to see if the relaxed problem was solvable. If a model trying to predict the bin couldn't do better than ~20%, it'd be really hard to believe that there was much information in the data. – Matthew – 2018-12-14T15:58:11.280
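The width-10 binning described here amounts to integer division; note that for the range 0..70 this produces 8 bins (0-9 → 0 through 70 → 7), so a 10-bin setup as reported below would need slightly narrower bins:

```python
import numpy as np

# Stand-in values in the 0..70 range.
vals = np.array([3, 15, 27, 42, 70])

bins = vals // 10  # 0-9 -> bin 0, 10-19 -> bin 1, ..., 70 -> bin 7
print(bins)        # [0 1 2 4 7]
```

The model's inputs stay as the raw integers (preserving ordinality); only the targets are replaced by these coarse bin labels.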

I just tried framing the problem as a classification problem with real-valued (continuous) input and 10 bins on the output. The best I could get was a validation set accuracy of 23%.

```
Epoch 100/100
2s - loss: 1.7020 - acc: 0.2597 - val_loss: 1.7610 - val_acc: 0.2297
```

– varun – 2018-12-14T18:03:04.743