Recurrent neural networks in 11.1: explicit examples?



I heard that RNNs were implemented in Mathematica as of 11.1. Searching online, I find some general information about neural networks in Mathematica, or a list of related functions. My trouble is that this list lumps purely statistical machine learning functions like Classify and Predict together with neural network functions, as well as (I presume) recurrent neural network functions, so it is really hard to tell what I actually need in order to do RNNs.

Perhaps there is a resource I missed that shows an explicit example of how to tackle a time series forecasting problem with several inputs, making use of Mathematica's RNN functions?

If none are known, perhaps someone knowledgeable could write a short example, e.g. using example data from here?

Thanks for any suggestion!


Posted 2017-08-29T13:19:50.683

Reputation: 10 970


Perhaps this YouTube video will help.

– m_goldberg – 2017-08-30T22:55:09.463

@m_goldberg Thank you, that was very helpful indeed! Even though the video did not address any cases in which the sequence has continuity properties in the mathematical sense, I am starting to suspect that this property is implied to be captured by these techniques without doing anything extra. – M.Z. – 2017-08-30T23:48:10.707



Here is a simple example that may help you get started. In this example, we are going to predict a simple time series: a sine wave.

data = Table[Sin[x], {x, 0, 100, 0.04}];

(plot of the sine-wave data)

We will cut the data into windows of 51 data points. The first 50 points as a whole are our X, and the last data point is our Y.

training = 
   List /@ Most[#] -> List @ Last[#] & /@ Partition[data, 51, 1];

We use a single gated recurrent layer in our neural network:

net = NetChain[{GatedRecurrentLayer[10], SequenceLastLayer[], 
   LinearLayer[1]}, "Input" -> {50, 1}, "Output" -> 1]

and train with the training data

trained = NetTrain[net, training]

After training, we can use it to predict the time series. We first feed the neural network 50 data points and then repeatedly feed the data it generates back into the network to generate the next data point. Here is a comparison between the ground truth and our predictions, which shows very good agreement.

ListPlot[{
  NestList[Append[Rest[#], trained[#]] &, 
    List /@ Sin[Range[-49*0.04, 0, 0.04]], 500][[All, -1]], 
  Table[Sin[x], {x, 0, 500*0.04, 0.04}]}, Joined -> True, 
 PlotLegends -> {"predicted", "ground truth"}]

(plot comparing the prediction with the ground truth)



This looks interesting! What should be changed if we have more features than just the observable itself? Do I understand correctly, that we would need to predict each feature as well as the observable (plus 49 past steps) in order to make a second prediction step, and so forth, in this case? --- (for the sake of example, maybe you could just duplicate the data and feed it as several features in parallel, to illustrate the syntax?) – M.Z. – 2017-08-30T20:31:19.430

Also, which criterion tells you that you should use only one layer, and with exactly 10 nodes? – M.Z. – 2017-08-30T20:34:10.317

I decided to accept this answer because it made the syntax most clear and gave me the ability to write my own code. – M.Z. – 2017-08-31T12:58:52.387

@Kagaratsch For multiple inputs, the first thing you can try is to use CatenateLayer to join the inputs into one long sequence before feeding it into the recurrent layers. For example, something like: NetGraph[{CatenateLayer[], GatedRecurrentLayer[10], LinearLayer[1]}, {{NetPort["Input1"], NetPort["Input2"], NetPort["Input3"]} -> 1, 1 -> 2 -> 3}, "Input1" -> {50, 1}, "Input2" -> {50, 1}, "Input3" -> {50, 1}, "Output" -> 1], where Input1, 2, 3 can be your par1, 2, 3 and obs. – xslittlegrass – 2017-08-31T14:18:23.687
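For readability, here is the NetGraph from the comment above written out as a code block. This is a sketch: the layer sizes and port names are the commenter's, the three {50, 1} inputs are assumed to be windows like the one in the answer, and a SequenceLastLayer[] has been added so the recurrent layer's sequence output reduces to a single vector before the final LinearLayer:

```mathematica
(* Sketch: catenate three parallel input sequences into one long
   sequence before the recurrent layer, as suggested in the comment *)
net = NetGraph[
  {CatenateLayer[], GatedRecurrentLayer[10], SequenceLastLayer[],
   LinearLayer[1]},
  {{NetPort["Input1"], NetPort["Input2"], NetPort["Input3"]} -> 1,
   1 -> 2 -> 3 -> 4},
  "Input1" -> {50, 1}, "Input2" -> {50, 1}, "Input3" -> {50, 1},
  "Output" -> 1]
```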

I set the number of layers and the number of neurons by trial and error. In general, increasing the number of neurons makes the network more powerful at capturing the features in the data, but it also becomes more prone to overfitting and requires more data to train. – xslittlegrass – 2017-08-31T14:26:41.477

@Kagaratsch Thanks for accepting my answer! – xslittlegrass – 2017-08-31T14:26:52.167

Hi, sorry to comment here. Could I have a chat if you are available? – yode – 2017-09-14T04:04:19.080

@xslittlegrass I tried using the same framework to teach an LSTM a sine function. My training data is Sin[A*t] with A being {0.5, 0.7, 0.1, 0.2, ..., 2}... I wanted the LSTM to learn to generate a sine function, but it failed. Any idea on how to approach this problem? – psimeson – 2019-12-27T17:04:56.863


Taking inspiration from the answer by xslittlegrass, I came up with the following solution.

Recall the sample data from this question. We have an observable obs we are interested to predict:

(plot of the observable obs)

and three parameters par1, par2, par3 that are correlated with the observable to some extent:

(plots of the parameters par1, par2, par3)

We only use the data for the first 700 time steps to train the model, and will try to predict the next 300 time steps.

We create a training set in which each input is a window of tlen consecutive data points (each of length featn), and the output is the data point (also of length featn) that immediately follows the window. Then we train a model that returns featn outputs:

dat = Transpose[{par1/Max[par1], par2/Max[par2], par3/Max[par3], obs/Max[obs]}];
tlen = 300; featn = Length[dat[[1]]];
training = Table[dat[[i ;; i + tlen - 1]] -> dat[[i + tlen]], {i, 1, Length[dat] - tlen}];
net = NetChain[{GatedRecurrentLayer[tlen, "Dropout" -> {"VariationalInput" -> 0.1 , "VariationalState" -> 0.5}], LinearLayer[featn]}, "Input" -> {tlen, featn}, "Output" -> {featn}]
trained = NetTrain[net, training, Method -> {"ADAM", "InitialLearningRate" -> 0.0001}]

The training takes about two minutes. Finally, we can iteratively predict the future 300 steps

datt = dat;
Do[
 start = datt[[-tlen ;;]];
 AppendTo[datt, trained[start]],
 {i, 1, 300}]

Amazingly, the prediction is qualitatively correct, with amplitude deviation growing to about 15% over the course of 300 time steps!

ListPlot[{datt[[;; , 4]], tab}]

(plot of the predicted series against the data)

Any suggestions for how to improve upon the above?



Add dropout ("VariationalInput" -> 0.1 and "VariationalState" -> 0.5); try different values. Also, during training, tweak the learning rate (try values from 0.1 to 0.0001) and see if you get an improvement. – user34018 – 2017-08-31T10:46:08.700

@user34018 Thanks! Those options helped improve the prediction quite a bit! – M.Z. – 2017-08-31T12:18:23.903

tab is not defined in the ListPlot code. You can include it for the sake of completeness. – PlatoManiac – 2017-08-31T13:50:58.427


@PlatoManiac I added tab to the data at – M.Z. – 2017-08-31T17:29:53.013


For instance, let's assume you have sequences of length 3 and 8 input variables (features) in X.

Let Y be the output with values "yes" or "no" for each sequence of X

Let X have 195 samples (so its dimensions are 195×8).

You create sequences of 3 from X (195 samples yield 65 sequences, so Y must have 65 labels):

Xpartition = Partition[X, 3]

Now, you create your trainingData:

trainingData = MapThread[Rule, {Xpartition, Y}]

You build your model:

net = NetChain[
    { LongShortTermMemoryLayer[32]
    , SequenceLastLayer[]
    , LinearLayer[2]
    , SoftmaxLayer[]
    }
  , "Input" -> {3, 8}
  , "Output" -> NetDecoder[{"Class", {"no", "yes"}}]
  ]

where 3 is the number of vectors in the sequence, and 8 is the length of each vector.
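For completeness, here is how the pieces above might fit together end to end, using made-up random data with the stated dimensions (195 samples, 8 features, sequence length 3); the labels in Y are hypothetical:

```mathematica
(* Hypothetical data matching the dimensions in the text *)
X = RandomReal[1, {195, 8}];              (* 195 samples, 8 features *)
Xpartition = Partition[X, 3];             (* 65 sequences of length 3 *)
Y = RandomChoice[{"no", "yes"}, Length[Xpartition]];
trainingData = MapThread[Rule, {Xpartition, Y}];
trained = NetTrain[net, trainingData];    (* net as defined above *)
trained[Xpartition[[1]]]                  (* returns "yes" or "no" *)
```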



Thank you for your hints! May I ask a few questions? By sequence of 3 you mean, your time series has 3 consecutive time steps? What is the dimension 195 of X, I do not see it enter the code anywhere? You do not define any Y, what is it? What are all the SequenceLastLayer, LinearLayer and SoftmaxLayer for, why do they appear in this order in the syntax? A yes or no output is not really what I'd like to learn how to do, rather I'd like to predict the next value in a sequence of values of an observable, that also enters as one of the features. Could you use the data from my question? – M.Z. – 2017-08-30T11:41:37.120

Basically, I still do not see any consecutive time steps in the answer, nor how to predict the next one. What I am looking for is a worked example like the following, but for Mathematica: In the end, if no humanly comprehensible examples exist for how to use this functionality in Mathematica, I'll just end up using tensorflow, which would be sad...

– M.Z. – 2017-08-30T11:45:02.670

In my problem, I have to predict device failure. I have a device monitored by 8 variables, all connected to time. To use an RNN, I have to decide on the sequence that will predict failure (since I have more than one variable, instead of having xt-1, xt-2, xt-3 where each x is a single data point, each xt has 8 data points, but I still want to predict failure using all these variables). Therefore, I have to provide the RNN with a sequence that has all of (xt-1, xt-2, xt-3). By the way, this gave me great results. – user34018 – 2017-08-31T10:37:27.703

195 is my sample size – user34018 – 2017-08-31T10:47:06.403


Key applied RNN examples from the developers are located in the documentation at:

See the "Sequence Learning and NLP" section.

Vitaliy Kaurov


I have seen these examples before. Integer addition, sorting sequences, question answering, language modeling, sentiment analysis – none of these address my questions, sadly. – M.Z. – 2017-08-30T21:09:06.587

As a note, this page does not exist anymore. – dearN – 2019-05-17T14:30:15.880

@dearN URL corrected, thanks. – Vitaliy Kaurov – 2019-05-17T16:24:21.427


Here, in RNN in Mathematica?, they are talking about RNNs.

In the help section, if you look for LongShortTermMemoryLayer, you will also find the other RNN models that have been implemented.



As far as I understand, a LongShortTermMemoryLayer is just a basic building block that can be part of a model. However, there does not seem to be a comprehensive example on how to use it in a concrete model e.g. to generate a forecast of several time steps of an observable based on several input parameters. – M.Z. – 2017-08-30T02:05:34.853

The example they present uses LongShortTermMemoryLayer[20], SequenceLastLayer[], LinearLayer[1] – user34018 – 2017-08-30T02:31:05.707

There they are doing addition with strings, which has no time series properties. It is still unclear how to take a time flow into account and generate a forecast. – M.Z. – 2017-08-30T04:16:45.173

You are supposed to have a sequence of vectors of size k, where each vector contains your several inputs (predictors, also called independent variables or features). So your input will be defined as "Input" -> {k, length_of_vector} – user34018 – 2017-08-30T09:58:22.963
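A minimal sketch of that input specification (k = 5 and the vector length 8 are placeholder values, and the layer sizes follow the documentation example mentioned above):

```mathematica
(* Each training example: a sequence of k = 5 steps,
   each step a vector of 8 features, regressed to one value *)
net = NetChain[
  {LongShortTermMemoryLayer[20], SequenceLastLayer[], LinearLayer[1]},
  "Input" -> {5, 8}, "Output" -> 1]
```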