
After reading quite a lot of papers (20-30 or so), I feel that I still do not really understand things.

Let us focus on supervised learning (for example). Given a set of data $\mathcal{D}_{train}=\{(x_i^{train},y_i^{train})\}$ and $\mathcal{D}_{test}=\{(x_i^{test},y_i^{test})\}$, where we assume the $y_i^{test}$ are unknown, the goal is to find a function $$ f_\theta(x), \qquad \text{such that} \quad f_\theta(x_i^{test}) \approx y_i^{test}. $$ To do this, we need a model for $f$. Typically, neural networks are employed, so we have $$ f_\theta(x) = W^{(L+1)}\sigma(W^{(L)}\sigma(\cdots \sigma(W^{(1)}\sigma(W^{(0)}x+b^{(0)})+b^{(1)})\cdots )+b^{(L)})+b^{(L+1)} $$ where $\theta = \{W^{(i)},b^{(i)}\}_{i=0}^{L+1}$. Then $f_\theta$ is a neural network with $L+1$ hidden layers. In order to find $\theta$, one typically defines a loss function $\mathcal{L}$. One popular choice is $$ \mathcal{L}(\mathcal{D}_{train}):= \sum_{(x_i^{train},y_i^{train})\in \mathcal{D}_{train}} \left(f_\theta(x_i^{train}) - y_i^{train} \right)^2. $$ In order to find the $\theta^*$ which minimizes the loss function $\mathcal{L}$, the typical (and, it seems, essentially the only) approach is to apply a gradient method.
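To make the setup concrete, here is a minimal sketch of that procedure in NumPy. Everything in it is an illustrative assumption, not something from the papers discussed: a one-hidden-layer network with $\tanh$ activation, hidden width 16, a toy 1-D target $y=\sin(3x)$, and plain gradient descent on the squared-error loss above.

```python
import numpy as np

# Toy setup (all choices are illustrative assumptions):
# f_theta(x) = W1 @ tanh(W0 @ x + b0) + b1, fit by plain gradient descent.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))   # x_i^{train}
Y = np.sin(3.0 * X)                        # y_i^{train} (toy target)

H = 16                                     # hidden width (arbitrary)
W0 = rng.normal(0.0, 0.5, (H, 1)); b0 = np.zeros((H, 1))
W1 = rng.normal(0.0, 0.5, (1, H)); b1 = np.zeros((1, 1))

def forward(x):
    z = W0 @ x.T + b0          # (H, n) pre-activations
    a = np.tanh(z)             # (H, n) hidden activations
    return (W1 @ a + b1).T, a  # predictions of shape (n, 1)

def loss():
    pred, _ = forward(X)
    return float(np.sum((pred - Y) ** 2))  # the squared-error loss above

lr = 1e-4                                  # step size (hand-picked)
loss_before = loss()
for _ in range(5000):
    pred, a = forward(X)
    g = 2.0 * (pred - Y)                   # dL/dpred, shape (n, 1)
    dW1 = g.T @ a.T                        # (1, H)
    db1 = g.T.sum(axis=1, keepdims=True)   # (1, 1)
    dz = (W1.T @ g.T) * (1.0 - a ** 2)     # (H, n); tanh' = 1 - tanh^2
    dW0 = dz @ X                           # (H, 1)
    db0 = dz.sum(axis=1, keepdims=True)    # (H, 1)
    W0 -= lr * dW0; b0 -= lr * db0
    W1 -= lr * dW1; b1 -= lr * db1
loss_after = loss()  # decreases, but with no global-optimality guarantee
```

Note that the run only demonstrates that the loss went down; it says nothing about whether the final $\theta$ is anywhere near $\theta^*$, which is exactly the concern below.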

As far as I know, gradient methods do not guarantee convergence to the minimizer: for the non-convex losses arising from neural networks, they may converge to a local minimum or a saddle point rather than a global minimizer.
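This failure mode is easy to exhibit even in one dimension. The following sketch (my own toy example, not from any paper) runs plain gradient descent on $g(t) = t^4 - 2t^2 + 0.3t$, which has a global minimum near $t \approx -1.04$ and a strictly worse local minimum near $t \approx +0.96$:

```python
# Non-convex toy objective: global min near t = -1.04 (g ~ -1.31),
# local min near t = +0.96 (g ~ -0.71), local max near t = 0.075.
def g(t):  return t**4 - 2*t**2 + 0.3*t
def dg(t): return 4*t**3 - 4*t + 0.3

t = 0.5                    # initialization on the "wrong" side of the barrier
for _ in range(10000):
    t -= 0.01 * dg(t)      # plain gradient descent

# t has converged (dg(t) ~ 0), but to the *local* minimum near +0.96:
# the iterates never cross the barrier at t ~ 0.075 to reach the global one.
```

The method converges perfectly well here, just not to the minimizer, and no amount of extra iterations fixes that; only a different initialization would.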

However, it seems that a lot of research papers simply mention something like

> We apply the standard gradient method (e.g., Adam, Adadelta, Adagrad, etc.) to find the parameters.

It seems that we do not know whether those methods actually return the minimizer. This makes me think that all the papers relying on this argument (using the parameters found by gradient methods) might possibly be wrong. Typically, their justification rests heavily on examples showing that it works well.

In addition, they sometimes mention that they tuned some parameters to run the gradient methods. What does "tune" mean here? The gradient method depends heavily on the initialization $\theta^{(0)}$ of the parameters.
If the initial choice were already close enough to the minimizer, i.e., $\theta^{(0)} \approx \theta^*$, it would not be surprising that it works well.
But it seems that a lot of trial and error is necessary to find a proper (good, working) initialization. **It sounds to me as if they already found the good solution via trial and error, not via gradient methods.** Thus "tuning" sounds to me as if they already found a parameter which is already close to $\theta^*$.
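For what it is worth, the common practice is closer to *random restarts* than to guessing $\theta^*$: run the same optimizer from several starting points and keep the best run. A sketch on the same kind of double-well objective (the fixed starting points below stand in for random draws; none is placed near the global minimizer in advance):

```python
# Random-restart sketch: same gradient descent, several starts, keep the best.
def g(t):  return t**4 - 2*t**2 + 0.3*t    # global min ~ -1.31, local min ~ -0.71
def dg(t): return 4*t**3 - 4*t + 0.3

def descend(t, lr=0.01, steps=10000):
    for _ in range(steps):
        t -= lr * dg(t)
    return t

# Four starting points standing in for random initializations.
results = {t0: g(descend(t0)) for t0 in (-2.0, -0.5, 0.5, 2.0)}
best = min(results.values())    # ~ -1.31: some starts reach the global minimum
worst = max(results.values())   # ~ -0.71: others get stuck in the local one
```

This is indeed selection over runs rather than a convergence guarantee, but each run still does all the work of descending; nobody supplies a near-optimal $\theta^{(0)}$ by hand, which is the distinction your question raises.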

I am starting to think that there may be something I am not aware of, since the volume of such research is huge. Did I miss something? Or can we really do research in this manner? I am quite confused... I am not trying to attack or criticize a specific paper or piece of research. I am trying to understand.

Any comments/answers will be greatly appreciated.

What's wrong with the justification that it works well? Also, trying different initial weights is standard procedure in most optimization methods and doesn't mean that all the weights in the neural network were simply guessed. – oW_ – 2018-10-05T22:06:06.340

@oW_ Thanks for your comment. I am not saying that it is wrong. But it does not sound rigorous to me. Even though trying different initial weights is standard procedure, that does not mean it is the right thing to do, unless it is rigorously proved. What I would want is a condition on the initial error $|\theta^{(0)}-\theta^*|$ which guarantees the performance of the optimization procedure. – induction601 – 2018-10-06T04:15:37.493

Furthermore, showing examples does not replace a proof. It seems to me that they are just showing specific examples on which they managed to produce good results. Such good results could also be obtained if one found, by brute force, some specific value $\theta_{brute}$ which works well. – induction601 – 2018-10-06T04:20:22.577

a proof for what exactly? not sure I understand your concern – oW_ – 2018-10-09T15:17:29.450