Hyperparameter search for LSTM-RNN using Keras (Python)

From Keras RNN Tutorial: "RNNs are tricky. Choice of batch size is important, choice of loss and optimizer is critical, etc. Some configurations won't converge."

So this is more of a general question about tuning the hyperparameters of an LSTM-RNN in Keras. I would like to know about an approach to finding the best parameters for your RNN.

I began with the IMDB example on Keras' GitHub.

The main model looks like this:

from keras.preprocessing import sequence
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

max_features = 20000
maxlen = 100  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features,
                                                      test_split=0.2)
# pad/truncate every review to a fixed length of maxlen words
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(LSTM(128))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              class_mode="binary")

print("Train...")
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=3,
          validation_data=(X_test, y_test), show_accuracy=True)
score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size,
                            show_accuracy=True)

print('Test accuracy:', acc)
Test accuracy:81.54321846

81.5 is a fair score and, more importantly, it means that the model, even though not fully optimized, works.

My data is a time series and the task is binary prediction, the same as in the example. My problem now looks like this:

import os
from numpy import genfromtxt
from sklearn import metrics
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

batch_size = 32

# Training data
train = genfromtxt(os.getcwd() + "/Data/trainMatrix.csv", delimiter=',', skip_header=1)
validation = genfromtxt(os.getcwd() + "/Data/validationMatrix.csv", delimiter=',', skip_header=1)

# Targets
miniTrainTargets = [int(x) for x in genfromtxt(os.getcwd() + "/Data/trainTarget.csv", delimiter=',', skip_header=1)]
validationTargets = [int(x) for x in genfromtxt(os.getcwd() + "/Data/validationTarget.csv", delimiter=',', skip_header=1)]

# LSTM
model = Sequential()
model.add(Embedding(train.shape[0], 64, input_length=train.shape[1]))
model.add(LSTM(64))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              class_mode="binary")

model.fit(train, miniTrainTargets, batch_size=batch_size, nb_epoch=5,
          validation_data=(validation, validationTargets), show_accuracy=True)
valid_preds = model.predict_proba(validation, verbose=0)
roc = metrics.roc_auc_score(validationTargets, valid_preds)
print("ROC:", roc)
ROC:0.5006526

The model is basically the same as the IMDB one, though the result means it's not learning anything. However, when I use a vanilla MLP-NN I don't have the same problem: the model learns and the score increases. I tried increasing the number of epochs and increasing/decreasing the number of LSTM units, but the score won't increase.

So I would like to know a standard approach to tuning the network, because in theory the algorithm should perform better than a multilayer perceptron network, especially for this time-series data.

wacax

Posted 2016-01-17T18:26:54.320

Reputation: 3 000

How much data do you have? What is the length of your sequences? LSTMs are only really useful for problems with lots of data and long-term dependencies. – pir – 2016-01-18T12:52:00.547

Random search or Bayesian optimization are standard ways of finding hyperparameters :) – pir – 2016-01-18T12:52:37.343

Are you sure you need the embedding layer? Many time series datasets would not need it. – pir – 2016-01-18T12:54:15.233

I have nearly 100k data points and twice as many features as the IMDB example, so I don't think that's the problem. As for the embedding layer, how exactly would you connect the LSTM layer to the input? According to the documentation http://keras.io/layers/recurrent/#lstm Keras' LSTM only takes initializations, activations and output_dim as arguments. If that is the source of the error, code describing how to eliminate the embedding layer would be greatly appreciated.

– wacax – 2016-01-18T16:58:32.903

Please see my answer. It seems you don't need the embedding layer. – pir – 2016-01-18T23:08:20.360

Answers

5

An embedding layer turns positive integers (indexes) into dense vectors of fixed size. For instance, [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]. This representation conversion is learned automatically with the embedding layer in Keras (see the documentation).

However, it seems that your data does not need any such embedding layer to perform a conversion. Having an unnecessary embedding layer is likely why you cannot get your LSTM to work properly. If that is the case then you should simply remove the embedding layer.

The first layer in your network should then take an input_shape argument with information on the dimensions of your data (see the examples). Note that you can add this argument to any layer; it will not be listed in the documentation for any specific layer.
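
For illustration, here is a minimal sketch of what that could look like without the embedding layer. It assumes the time-series matrix has shape (samples, features) and that each column is treated as one time step with a single feature per step; the dummy data, the 64-unit layer size and the reshape are my assumptions, not something given in the question:

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.recurrent import LSTM

# dummy stand-in for trainMatrix.csv: 1000 samples, 50 columns treated as 50 time steps
train = np.random.rand(1000, 50)
targets = np.random.randint(0, 2, size=1000)

# reshape the 2D matrix into the 3D input (samples, timesteps, features) an LSTM expects
timesteps = train.shape[1]
train3d = train.reshape((train.shape[0], timesteps, 1))

model = Sequential()
model.add(LSTM(64, input_shape=(timesteps, 1)))  # first layer carries input_shape, no Embedding
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary")
model.fit(train3d, targets, batch_size=32, nb_epoch=1, show_accuracy=True)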


By the way, hyperparameters are often tuned using random search or Bayesian optimization. I would use RMSProp and focus on tuning batch size (sizes like 32, 64, 128, 256 and 512), gradient clipping (on the interval 0.1-10) and dropout (on the interval of 0.1-0.6). The specifics of course depend on your data and model architecture.
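
A bare-bones random-search loop over those ranges could look roughly like this. Here build_and_eval is a hypothetical helper that would build, train and validate the model with the sampled settings, and the number of trials is arbitrary:

import random
from keras.optimizers import RMSprop

def sample_config():
    # draw one hyperparameter combination from the ranges suggested above
    return {
        'batch_size': random.choice([32, 64, 128, 256, 512]),
        'clipnorm': 10 ** random.uniform(-1, 1),  # gradient clipping in [0.1, 10]
        'dropout': random.uniform(0.1, 0.6),
    }

best_score, best_config = None, None
for _ in range(20):  # arbitrary number of random trials
    config = sample_config()
    optimizer = RMSprop(clipnorm=config['clipnorm'])  # clip gradients by norm
    score = build_and_eval(config, optimizer)  # hypothetical: train, then return e.g. validation ROC AUC
    if best_score is None or score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)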

pir

Posted 2016-01-17T18:26:54.320

Reputation: 780

What do you propose to replace the embedding layer with? I tried simply removing the embedding layer but that doesn't work. – wacax – 2016-03-29T18:00:07.760

Look at the other examples - start e.g. directly with the Dense layer. Remember to set the input_shape parameter. – pir – 2016-03-29T18:27:32.067

5

I would recommend Bayesian optimization for hyperparameter search and have had good results with Spearmint. You might have to use an older version for commercial use.

Mutian Zhai

Posted 2016-01-17T18:26:54.320

Reputation: 71

3

I would suggest using hyperopt, which uses a kind of Bayesian optimization to search for optimal values of hyperparameters given the objective function. It is more intuitive to use than Spearmint.

PS: There is a wrapper of hyperopt specifically for Keras, hyperas. You can also use it.
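
As a rough illustration of how hyperopt drives such a search: train_and_score below is a hypothetical objective that would train the LSTM with the sampled parameters and return a loss to minimise (for example 1 - ROC AUC), and the search space is only an example:

from hyperopt import fmin, tpe, hp

# example search space; adjust the ranges to your model
space = {
    'batch_size': hp.choice('batch_size', [32, 64, 128, 256]),
    'dropout': hp.uniform('dropout', 0.1, 0.6),
    'lstm_units': hp.choice('lstm_units', [32, 64, 128]),
}

# TPE suggests the next hyperparameter combination based on previous trials
best = fmin(fn=train_and_score,  # hypothetical objective: returns a loss to minimise
            space=space,
            algo=tpe.suggest,
            max_evals=50)
print(best)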

SHASHANK GUPTA

Posted 2016-01-17T18:26:54.320

Reputation: 2 845

2

Talos is exactly what you're looking for: an automated solution for searching hyperparameter combinations for Keras models. I might not be objective, as I'm the author, but the intention has been to provide an alternative with the lowest possible learning curve while exposing Keras functionality entirely.

Alternatively, as has already been mentioned, you can look into Hyperas, or into SKlearn or AutoKeras. To my knowledge, at the time of writing, these four are the options for Keras users specifically.

mikkokotila

Posted 2016-01-17T18:26:54.320

Reputation: 181