## Can the learning rate be considered both a parameter AND a hyper-parameter?


Here is my understanding of these two terms:

Hyper-parameter: A variable that is set by a human before the training process starts. Examples are the number of hidden layers in a neural network, the number of neurons in each layer, etc. Some models, such as the simple linear model, don't have any hyper-parameters.

Parameter: A variable that the training process updates. For instance, the weights of a neural network are parameters, as they are updated as we train the network with no human intervention in the process. Another example would be the slope and the y-intercept in a simple linear model.

Having said that, what would the learning rate parameter ($\eta$) be?

$$\Theta_{i+1} = \Theta_{i} - \eta \nabla J(\Theta_{i})$$

My understanding is that $\eta$ is set to a large value before training starts, but then, as training progresses and the function gets closer and closer to a local minimum, the learning rate is decreased. In that case, doesn't the learning rate satisfy both the definition of a parameter and that of a hyper-parameter?
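As a concrete illustration (a minimal sketch, not from the question itself): gradient descent on $J(\theta) = \theta^2$ with a simple exponential decay of the learning rate. Note that both the initial rate and the decay factor are fixed by hand before training:

```python
# Hypothetical sketch: gradient descent on J(theta) = theta^2 with a
# decaying learning rate. All schedule values are chosen before training.
def grad_J(theta):
    return 2.0 * theta  # gradient of theta^2

theta = 5.0   # parameter: updated by the training process
eta = 0.4     # hyper-parameter: chosen by a human before training
decay = 0.99  # hyper-parameter: the schedule's decay factor

for step in range(200):
    theta = theta - eta * grad_J(theta)  # descent step
    eta *= decay                         # schedule shrinks eta each step

# theta approaches the minimum at 0; eta shrank, but only as a fixed
# function of the step count, never in response to the data
```

Even though `eta` changes during training, it is never *estimated* from the data the way `theta` is.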

Your definitions are wrong. If you are setting it, it is a hyperparameter. If you are estimating it, then it is a parameter. This can in fact be done, but not as you indicated. In your model, your learning rate has a schedule, but it is still a hyperparameter. – Emre – 2017-07-31T04:39:28.543

"then, as the training progresses and the function gets closer and closer to a local minimum, the learning rate is decreased" – that depends only on your learning strategy (i.e., the optimizer). There is usually no "learning" of the learning rate, except in a few cases in the literature (meta-learning). – E_net4 wants more flags – 2017-07-31T13:49:30.747


> My understanding is that $\eta$ is set before the training starts to a large value but then, as the training progresses and the function gets closer and closer to a local minimum, the learning rate is decreased. In that case, doesn't the learning rate satisfy both the definitions of a parameter and of a hyper-parameter?

No, it does not, because you are at all times controlling the learning rate without reference to the data. Typically both the learning rate and the learning rate schedule are considered hyper-parameters, so by adding a learning rate schedule you have not turned the learning rate into a parameter; instead you have added a new hyper-parameter that may need tuning!
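To make this concrete (a hypothetical sketch, not from the answer): a typical schedule is a fixed function of the step count, controlled only by its own hyper-parameters, and it never consults the data:

```python
# Hypothetical sketch: an exponential learning rate schedule. It introduces
# two NEW hyper-parameters (initial_eta, decay) and looks only at the step
# count -- never at the training data.
def exponential_schedule(step, initial_eta=0.1, decay=0.95):
    return initial_eta * (decay ** step)

etas = [exponential_schedule(t) for t in range(5)]
# eta decreases deterministically: 0.1, 0.095, 0.09025, ...
```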

The learning rate is also not part of the model as it would be used in production, so it is a training hyper-parameter. Although that is not a strict separation, the same applies to things that are neither hyper-parameters nor parameters, such as the current momentum values, which are used to speed up training and are usually stored in some data structure alongside the model being trained. Likewise, most adaptive learning rate structures (used in e.g. RMSProp, Adagrad, etc.) are in this category – neither model parameters nor hyper-parameters.
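For instance, here is a minimal Adagrad-style sketch (hypothetical, not from the answer) in which the accumulator `sq_grad` is exactly this third kind of quantity – optimizer state stored alongside the model, but neither a model parameter nor a hyper-parameter:

```python
import numpy as np

# Hypothetical Adagrad-style sketch on J(theta) = sum(theta**2).
eta = 0.5                        # hyper-parameter: base step size
eps = 1e-8                       # hyper-parameter: numerical stability
theta = np.array([5.0, -3.0])    # model parameters: used for predictions
sq_grad = np.zeros_like(theta)   # optimizer state: neither parameter nor
                                 # hyper-parameter, discarded after training

def grad(theta):
    return 2.0 * theta  # gradient of sum(theta**2)

for _ in range(500):
    g = grad(theta)
    sq_grad += g ** 2  # accumulate squared gradients seen so far
    theta -= eta * g / (np.sqrt(sq_grad) + eps)  # per-parameter step size

# the effective step size shrinks automatically as gradients accumulate,
# with no human adjusting eta during training
```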

It might be possible to argue that for a continuous online model, the learning rate could be a model parameter, as it would be part of the production model being used to make predictions – provided it was still in use, but adaptive, depending on the data seen to date and/or current learning progress. Then, if you had to feed in an initial learning rate, you could perhaps say the learning rate was both a parameter and a hyper-parameter of the model. Although you would likely have to explain the scenario in full to be understood, it would be more usual to simply say you were using an adaptive learning rate.

"both the learning rate and the learning rate schedule are considered hyper-parameters" – Once you define a learning rate schedule, doesn't it make the original hyperparameter (i.e., the original learning rate) obsolete, in the sense that you won't have to tweak it anymore? It is not like you now have 2 independent hyperparameters to tweak, but only 1 (the scheduler). Am I missing something? – Tiago – 2018-07-19T21:33:36.510

@Tiago: Yes, although it depends on how it is defined. You may have a starting rate and a decay factor; that would make 2 scalar hyper-parameters. If you have a set of multiple (rate, num_epochs) pairs, that could be 6 or 8 hyper-parameters. In general you could say that a learning rate scheduler is one complex hyper-parameter, and it may have zero or more scalar factors that can be adjusted (zero is possible for some adaptive schemes; then the hyper-parameter is just the choice to use that scheme). – Neil Slater – 2018-07-20T06:43:39.790
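A hypothetical sketch of the (rate, num_epochs) case described in the comment above – three pairs here means six scalar hyper-parameters to tune:

```python
# Hypothetical sketch: a piecewise-constant schedule given as a list of
# (rate, num_epochs) pairs. Each scalar in the list is a hyper-parameter.
schedule = [(0.1, 10), (0.01, 20), (0.001, 30)]

def eta_at(epoch, schedule):
    """Return the learning rate in force at a given (0-indexed) epoch."""
    for rate, num_epochs in schedule:
        if epoch < num_epochs:
            return rate
        epoch -= num_epochs  # move past this segment
    return schedule[-1][0]   # past the last segment, keep the final rate
```

Collapsing all of this into "one complex hyper-parameter" just means treating the whole `schedule` list as a single choice to tune.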