From what I understand, the *Gumbel-Softmax* trick is a technique that enables us to sample discrete random variables in a way that is differentiable (and therefore suited for end-to-end deep learning).

Many papers and articles describe it as a way of selecting instances in the input (i.e. 'pointers') without using the non-differentiable argmax function. What confuses me is that this effect can be achieved without randomness by just using *Softmax with temperature*:

**Softmax with temperature**
$$y_i=\frac{\exp\left(\frac{x_i}{\tau}\right)}{\sum_{j}\exp\left(\frac{x_j}{\tau}\right)}$$

**Gumbel-Softmax**
$$y_i=\frac{\exp\left(\frac{\log(\pi_i)+g_i}{\tau}\right)}{\sum_{j}\exp\left(\frac{\log(\pi_j)+g_j}{\tau}\right)}$$

where the $g_i$ are i.i.d. samples from a standard Gumbel(0, 1) distribution.
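To make the difference concrete, here is a minimal NumPy sketch of both formulas (function names and the example probabilities are my own, for illustration). The key point it demonstrates: the tempered softmax is deterministic and always peaks at the same class, while the Gumbel-Softmax output is a sample, and its argmax follows the categorical distribution $\pi$ (the Gumbel-max trick).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_with_temperature(logits, tau):
    # Deterministic: the same logits always give the same output.
    z = logits / tau
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gumbel_softmax(logits, tau, rng):
    # Perturb the logits with i.i.d. Gumbel(0, 1) noise before the
    # tempered softmax; the argmax of the result is then a *sample*
    # from Categorical(pi) when logits = log(pi).
    g = rng.gumbel(size=logits.shape)
    return softmax_with_temperature(logits + g, tau)

log_pi = np.log(np.array([0.5, 0.3, 0.2]))   # class probabilities pi
print(softmax_with_temperature(log_pi, tau=0.1))  # always peaks at class 0
print(gumbel_softmax(log_pi, tau=0.1, rng=rng))   # peak varies per draw
```

Running the Gumbel-Softmax sampler many times and tallying the argmax recovers the original probabilities (0.5, 0.3, 0.2), which the deterministic softmax can never do.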

**My question**

*From a practical and theoretical perspective, when is it beneficial to incorporate Gumbel noise into a neural network, as opposed to just using Softmax with temperature?*

A couple of observations:

- When the temperature is low, both Softmax with temperature and Gumbel-Softmax approximate a one-hot vector. However, before the temperature has annealed, the Gumbel-Softmax output may abruptly 'change' its decision between forward passes because of the noise.
- When the temperature is higher, the Gumbel noise has a larger influence and the output distribution becomes more uniform. Why is this desired?
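The two temperature regimes in the observations above can be checked numerically. This sketch (my own helper names, illustrative probabilities) draws one Gumbel-Softmax sample at a very low and a very high temperature: the low-temperature sample is nearly one-hot, while at high temperature the division by $\tau$ flattens both the logits and the noise, pushing the output toward uniform.

```python
import numpy as np

rng = np.random.default_rng(1)

def gumbel_softmax_sample(log_pi, tau, rng):
    # One Gumbel-Softmax sample: perturb the log-probabilities with
    # Gumbel(0, 1) noise, then apply a tempered softmax.
    z = (log_pi + rng.gumbel(size=log_pi.shape)) / tau
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

log_pi = np.log(np.array([0.6, 0.3, 0.1]))
low  = gumbel_softmax_sample(log_pi, tau=0.05, rng=rng)  # near one-hot
high = gumbel_softmax_sample(log_pi, tau=20.0, rng=rng)  # near uniform
print(low)
print(high)
```

Which entry of `low` is (nearly) 1 varies from draw to draw, which is exactly the stochastic 'decision change' mentioned above; the high-temperature sample barely depends on the logits at all.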

My best guess is that the Gumbel noise enforces stronger exploration before convergence, but I can't recall reading any paper that uses this as a motivation for introducing the extra randomness.

Does anyone have any experience or insights on this? Maybe I've completely missed the key point of Gumbel-Softmax :)

The hard sampling gradient is a smart trick! – Shaohua Li – 2020-07-21T12:23:32.977