Understanding notation of Goodfellow's GAN objective function




What is the meaning of $V(D,G)$? How do we get these expectation parts? I was trying to understand it by following this article: Understanding Generative Adversarial Networks (D. Seita), but after many tries I still can't understand how he got from $\sum_{n=1}^{N} \log D(x)$ to $\mathbb{E}(\log(D(x)))$.


Posted 2019-04-12T20:53:48.807


Welcome to SE:AI! – DukeZhou – 2019-04-12T21:11:21.013

I only glanced at the article, but it sounds like 1/2 of the data is coming from each source, so y_i was replaced with 1/2. To get expectation of a random variable you sum over all values of the variable and multiply each value by the probability of it occurring. Since p(.) = 0.5, then by summing over the value of each variable D and multiplying by the probability, it yields an expectation. – Hanzy – 2019-04-13T01:44:45.240

I also noticed this is similarly asked in the comment section below the article. I think it’s the first question, which the author (sort of) explains, although briefly. – Hanzy – 2019-04-13T01:45:52.200

@Hanzy but I still don't understand why he could replace the sum of values with an expectation. If there really are half 0s and half 1s, wouldn't the summation then be equal to N/2 rather than 1/2? – i_rezic – 2019-04-13T08:09:19.900



To understand this equation first you need to understand the context in which it is first introduced. We have two neural networks (i.e. $D$ and $G$) that are playing a minimax game. This means that they have competing goals. Let's look at each one separately:


Before we start, you should note that throughout the whole paper the notion of the data-generating distribution is used; in short the authors will refer to the samples through their underlying distributions, i.e. if a sample $a$ is drawn from a distribution $p_a$, we'll denote this as $a \sim p_a$. Another way to look at this is that $a$ follows distribution $p_a$.

The generator ($G$) is a neural network that produces samples from a distribution $p_g$. It is trained to bring $p_g$ as close to $p_{data}$ as possible, so that samples from $p_g$ become indistinguishable from samples from $p_{data}$. The catch is that it never gets to actually see $p_{data}$. Its inputs are samples $z$ from a noise distribution $p_z$.


The discriminator ($D$) is a simple binary classifier that tries to identify which class a sample $x$ belongs to. There are two possible classes, which we'll refer to as the fake and the real. Their respective distributions are $p_{data}$ for the real samples and $p_g$ for the fake ones (note that $p_g$ is actually the distribution of the outputs of the generator, but we'll get back to this later).

Since it is a simple binary classification task, the discriminator is trained on a binary cross-entropy error:

$$ J^{(D)} = H(y, \hat y) = H(y, D(x)) $$

where $H$ is the cross-entropy and $x$ is sampled from either $p_{data}$ or $p_g$, each with a probability of $50\%$. More formally:

$$ x \sim \begin{cases} p_{data} \rightarrow & y = 1, & \text{with prob 0.5}\\ p_g \;\;\;\,\rightarrow & y = 0, & \text{otherwise} \end{cases} $$
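The sampling scheme above can be sketched in a few lines of NumPy. This is only an illustration: the choices $p_{data} = \mathcal{N}(2, 1)$, $p_z = \mathcal{N}(0, 1)$ and the toy generator $G(z) = 0.5z$ are assumptions made for the example, not anything taken from the paper.

```python
import numpy as np

def sample_labeled_batch(n, rng):
    """Draw n pairs (x, y): real samples get y = 1, generated ones y = 0,
    and each sample is real or generated with probability 0.5."""
    y = rng.integers(0, 2, size=n)           # fair coin flip per sample
    x_real = rng.normal(2.0, 1.0, size=n)    # assumed p_data = N(2, 1)
    z = rng.normal(0.0, 1.0, size=n)         # assumed noise p_z = N(0, 1)
    x_fake = 0.5 * z                         # assumed toy generator G(z) = 0.5 z
    x = np.where(y == 1, x_real, x_fake)     # pick the source according to the label
    return x, y

rng = np.random.default_rng(0)
x, y = sample_labeled_batch(10_000, rng)
print(y.mean())  # fraction of real samples, close to one half
```

Because the label is a fair coin flip, roughly half of the dataset ends up in each class, which is the fact the derivation relies on next.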

We consider $y$ to be $1$ if $x$ is sampled from the real distribution and $0$ if it is sampled from the fake one. Finally, $D(x)$ represents the probability with which $D$ thinks that $x$ belongs to $p_{data}$. By writing the cross-entropy formula we get:

$$ H(y, D(x)) = - \frac{1}{N} \sum_{i=1}^{N} \left[ \; y_i \; \log(D(x_i)) + (1 - y_i) \; \log(1 - D(x_i)) \right] $$

where $N$ is the size of the dataset. Since each class has $N/2$ samples, we can order the samples so that the first $N/2$ are real ($y_i = 1$) and the remaining $N/2$ are generated ($y_i = 0$), and split the sum into two parts:

$$ = - \left[ \frac{1}{N} \sum_{i=1}^{N/2}{ \; y_i \; \log(D(x_i))} + \frac{1}{N} \sum_{i=N/2+1}^{N} \; (1 - y_i) \; \log(1 - D(x_i)) \right] $$

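The first of the two terms covers the samples from the $p_{data}$ distribution (where $y_i = 1$), the second one the samples from the $p_g$ distribution (where $1 - y_i = 1$). Each sum has only $N/2$ terms, so $\frac{1}{N} \sum = \frac{1}{2} \cdot \frac{1}{N/2} \sum$, i.e. a factor of $\frac{1}{2}$ times the average over that class. For large $N$, the average of $\log D(x_i)$ over samples drawn from $p_{data}$ approximates its expectation under $p_{data}$ (and likewise for $p_g$), so we can convert the sums into expectations: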

$$ = - \left[ \frac{1}{2} \; \mathbb{E}_{x \sim p_{data}}[\log \; D(x)] + \frac{1}{2} \; \mathbb{E}_{x \sim p_{g}}[\log \; (1 - D(x))] \right] $$
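The step from sums to expectations can be checked numerically. The sketch below uses a made-up fixed discriminator (a logistic function) and assumed Gaussian distributions for $p_{data}$ and $p_g$; none of these come from the paper. It computes the cross-entropy once as a single sum over all $N$ samples and once as two per-class averages weighted by $\frac{1}{2}$.

```python
import numpy as np

def D(x):
    # Hypothetical fixed discriminator: a logistic function of x
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
N = 100_000
x_real = rng.normal(2.0, 1.0, size=N // 2)   # assumed p_data, labels y_i = 1
x_fake = rng.normal(0.0, 1.0, size=N // 2)   # assumed p_g, labels y_i = 0

x = np.concatenate([x_real, x_fake])
y = np.concatenate([np.ones(N // 2), np.zeros(N // 2)])

# Binary cross-entropy written as one sum over all N samples
bce = -np.mean(y * np.log(D(x)) + (1 - y) * np.log(1 - D(x)))

# The same quantity as two per-class sample means, each weighted by 1/2
mean_real = np.mean(np.log(D(x_real)))       # estimates E_{x ~ p_data}[log D(x)]
mean_fake = np.mean(np.log(1 - D(x_fake)))   # estimates E_{x ~ p_g}[log(1 - D(x))]
split = -(0.5 * mean_real + 0.5 * mean_fake)

print(bce, split)  # the two values agree up to floating-point error
```

The $\frac{1}{2}$ appears because each class contributes $N/2$ terms to a sum that is divided by $N$. The per-class sample means are Monte Carlo estimates of the two expectations and get closer to them as $N$ grows.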

At this point, we'll drop the factor of $\frac{1}{2}$ from the equations since it's a constant and thus irrelevant when optimizing. Now, remember that the samples drawn from $p_g$ are actually outputs of the generator (this affects only the second term). If we substitute $D(x), x \sim p_g$ with $D(G(z)), z \sim p_z$ we get:

$$ L_D = - \left[\; \mathbb{E}_{x \sim p_{data}}[\log \; D(x)] + \; \mathbb{E}_{z \sim p_{z}}[\log \; (1 - D(G(z)))] \right] $$

This is the final form of the discriminator loss.
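As a rough sketch, the discriminator loss can be estimated from samples like this. The functions `D` and `G` and the two distributions are hypothetical stand-ins chosen for the example; in a real GAN both would be neural networks.

```python
import numpy as np

def discriminator_loss(D, G, x_real, z):
    """Monte Carlo estimate of L_D = -(E_x[log D(x)] + E_z[log(1 - D(G(z)))])."""
    real_term = np.mean(np.log(D(x_real)))       # samples x ~ p_data
    fake_term = np.mean(np.log(1.0 - D(G(z))))   # samples G(z) with z ~ p_z
    return -(real_term + fake_term)

def G(z):
    # Hypothetical toy generator
    return 0.5 * z

def D(x):
    # Hypothetical discriminator: logistic function centred at x = 1
    return 1.0 / (1.0 + np.exp(-(x - 1.0)))

def D_blind(x):
    # A discriminator that always answers 0.5
    return np.full_like(x, 0.5)

rng = np.random.default_rng(0)
x_real = rng.normal(2.0, 1.0, size=50_000)   # assumed p_data = N(2, 1)
z = rng.normal(0.0, 1.0, size=50_000)        # assumed p_z = N(0, 1)

loss = discriminator_loss(D, G, x_real, z)
loss_blind = discriminator_loss(D_blind, G, x_real, z)
print(loss, loss_blind)
```

A discriminator that always outputs $0.5$ gets a loss of exactly $\log 4$ (both terms equal $\log \frac{1}{2}$), while one that separates the two sets of samples reasonably well gets a lower loss.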

Zero-sum game setting

The discriminator's goal, through training, is to minimize its loss $L_D$, which is the $J^{(D)}$ from before with the constant factor dropped. Equivalently, we can think of it as trying to maximize the negative of the loss:

$$ \max_D{[-J^{(D)}]} = \max_D \left[\; \mathbb{E}_{x \sim p_{data}}[\log \; D(x)] + \; \mathbb{E}_{z \sim p_{z}}[\log \; (1 - D(G(z)))] \right] $$

The generator, however, wants to maximize the discriminator's loss $J^{(D)}$ (i.e. its uncertainty), or equivalently minimize $-J^{(D)}$, which makes its own loss:

$$ J^{(G)} = - J^{(D)} $$

Because the two are tied, we can summarize the whole game through a value function $V(D, G) = -J^{(D)}$. At this point I like to think of it like we are seeing the whole game through the eyes of the generator. Knowing that $D$ tries to maximize the aforementioned quantity, the goal of $G$ is:

$$ \min_G\max_D{V(D, G)} = \min_G\max_D \left[\; \mathbb{E}_{x \sim p_{data}}[\log \; D(x)] + \; \mathbb{E}_{z \sim p_{z}}[\log \; (1 - D(G(z)))] \right] $$
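To see the two opposing roles in numbers, here is a small sketch that evaluates $V(D, G)$ for a few hand-picked players. Everything concrete in it (the Gaussian $p_{data}$ and $p_z$, the logistic discriminator, the two shift generators) is an assumption made for illustration only.

```python
import numpy as np

def value_function(D, G, x_real, z):
    """Monte Carlo estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    return np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(G(z))))

rng = np.random.default_rng(0)
x_real = rng.normal(2.0, 1.0, size=50_000)   # assumed p_data = N(2, 1)
z = rng.normal(0.0, 1.0, size=50_000)        # assumed p_z = N(0, 1)

def D_blind(x):
    # Discriminator that cannot tell the classes apart
    return np.full_like(x, 0.5)

def D_informed(x):
    # Hypothetical discriminator that favours "real" for x > 1
    return 1.0 / (1.0 + np.exp(-2.0 * (x - 1.0)))

def G_far(z):
    # Generated samples ~ N(0, 1), away from the assumed data
    return z

def G_close(z):
    # Generated samples ~ N(2, 1), matching the assumed data
    return z + 2.0

v_blind = value_function(D_blind, G_far, x_real, z)      # equals -log 4
v_far = value_function(D_informed, G_far, x_real, z)     # better D: V goes up
v_close = value_function(D_informed, G_close, x_real, z) # better G: V goes down

J_D, J_G = -v_far, v_far   # zero-sum: the two losses always cancel
print(v_blind, v_far, v_close)
```

With the generator held fixed, the better discriminator raises $V$; with the discriminator held fixed, the generator that matches the data lowers it. That is the tug of war the $\min_G \max_D$ notation expresses.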


This whole endeavor (on both my part and the authors' part) was to provide a mathematical formulation for training GANs. In practice, many tricks are used to train a GAN effectively that are not reflected in the above equations.


Posted 2019-04-12T20:53:48.807
