The question is about a mismatch between the loss functions in two papers on GANs. The first paper is *Generative Adversarial Nets*, Ian J. Goodfellow et al., 2014, and the excerpt imaged in the question is this.

The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator’s distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $G(z; \theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$. We also define a second multilayer perceptron $D(x; \theta_d)$ that outputs a single scalar. $D(x)$ represents the probability that $x$ came from the data rather than $p_g$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$. We simultaneously train $G$ to minimize $\log(1 - D(G(z)))$:

In other words, $D$ and $G$ play the following two-player minimax game with value function $V (G, D)$:

$$ \min_G \, \max_D V (D, G) = \mathbb{E}_{x \sim p_{data}(x)} \, [\log \, D(x)] \\
\quad\quad\quad\quad\quad\quad\quad + \, \mathbb{E}_{z \sim p_z(z)} \, [\log \, (1 - D(G(z)))] \, \text{.} \quad \text{(1)} $$
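Concretely, each expectation in equation (1) can be estimated as a sample mean over draws from the corresponding distribution. A minimal sketch, using hypothetical toy stand-ins for $D$ and $G$ rather than trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D setting: take p_data to be N(0, 1) and the prior p_z to be U(0, 1).
x = rng.normal(0.0, 1.0, size=10_000)    # x drawn according to p_data(x)
z = rng.uniform(0.0, 1.0, size=10_000)   # z drawn according to p_z(z)

def G(z):
    # Hypothetical fixed generator: maps noise into data space.
    return 2.0 * z - 1.0

def D(v):
    # Hypothetical fixed discriminator: outputs a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-v))

# V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))],
# with each expectation replaced by the mean over the samples.
V = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
print(V)
```

Training would alternate gradient steps that increase $V$ with respect to $D$'s parameters and decrease it with respect to $G$'s; here both functions are fixed, so the value is simply evaluated once.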

The second paper is *Image-to-Image Translation with Conditional Adversarial Networks*, Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, 2018, and the excerpt imaged in the question is this.

The objective of a conditional GAN can be expressed as

$$ \mathcal{L}_{cGAN} (G, D) = \mathbb{E}_{x, y} \, [\log \, D(x, y)] \\ \quad\quad\quad\quad\quad\quad\quad + \mathbb{E}_{x, z} \, [\log \, (1 - D(x, G(x, z)))], \quad \text{(1)} $$

where $G$ tries to minimize this objective against an adversarial $D$ that tries to maximize it, i.e.

$$ G^{∗} = \arg \, \min_G \, \max_D \mathcal{L}_{cGAN} (G, D) \, \text{.} $$
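Note the order of operations on the right hand side: for each fixed $G$ the inner maximization over $D$ is resolved first, and the outer minimization then selects the $G$ whose maximized value is smallest, i.e.

$$ G^{*} = \arg \, \min_G \left( \max_D \, \mathcal{L}_{cGAN} (G, D) \right) \, \text{.} $$

The expectations belong to the definition of $\mathcal{L}_{cGAN}$ itself; the min-max is then applied to that already-defined quantity.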

To test the importance of conditioning the discriminator, we also compare to an unconditional variant in which the discriminator does not observe $x$:

$$ \mathcal{L}_{GAN} (G, D) = \mathbb{E}_y \, [\log \, D(y)] \\
\quad\quad\quad\quad\quad\quad\quad + \mathbb{E}_{x, z} \, [\log \, (1 - D(G(x, z)))] \, \text{.} \quad \text{(2)} $$

In the above, $G$ refers to the generator network, $D$ refers to the discriminator network, and $G^{*}$ refers to the $G$ attaining the minimum, with respect to $G$, of the maximum with respect to $D$. As the question author tentatively put forward, $\mathbb{E}$ is the expectation with respect to its subscripts.

The question of discrepancy is that the right hand sides do not match between the first paper's equation (1) and the second paper's equation (2), the unconditional variant in which the discriminator no longer observes the condition $x$.

First paper:

$$ \mathbb{E}_{x \sim p_{data}(x)} \, [\log \, D(x)] \\
\quad\quad\quad\quad\quad\quad\quad + \, \mathbb{E}_{z \sim p_z(z)} \, [\log \, (1 - D(G(z)))] \, \text{.} \quad \text{(1)} $$

Second paper:

$$ \mathbb{E}_y \, [\log \, D(y)] \\
\quad\quad\quad\quad\quad\quad\quad + \mathbb{E}_{x, z} \, [\log \, (1 - D(G(x, z)))] \, \text{.} \quad \text{(2)} $$

The second, later paper further states this.

GANs are generative models that learn a mapping from random noise vector $z$ to output image $y$, $G : z \rightarrow y$. In contrast, conditional GANs learn a mapping from observed image $x$ and random noise vector $z$ to $y$, $G : \{x, z\} \rightarrow y$.

Notice that there is no $y$ in the first paper, and that the removal of the conditioning in the second paper corresponds to the removal of $x$ as the first parameter of $D$. This is one cause of the confusion when comparing the right hand sides. The others are the choice of variable names and the degree of explicitness in the notation.

The tilde $\sim$ means *drawn according to*. The right hand side in the first paper indicates that the expectation involving $x$ is taken over draws of $x$ from the data distribution $p_{data}(x)$, and that the expectation involving $z$ is taken over draws of $z$ from the prior distribution $p_z(z)$.

Removing the observation $x$ from the second right hand term of the second paper's equation (2), where it appears as the first parameter of $G$, renaming that equation's $y$ to the now freed up $x$, and accepting the second paper's abbreviation of the first paper's tilde notation then brings the two papers into exact agreement.
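This agreement can be checked numerically with toy stand-ins for the networks (hypothetical functions, assumed only for illustration): once the second paper's generator ignores its conditioning input and its $y$ is identified with the first paper's $x$, the two right hand sides evaluate to the same number on the same samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy stand-ins, not trained networks.
def D(v):
    return 1.0 / (1.0 + np.exp(-v))      # discriminator: probability in (0, 1)

def G1(z):
    return 2.0 * z - 1.0                 # first paper: G maps z alone to a sample

def G2(x_cond, z):
    # The second paper's generator also takes the observed image x; mirroring
    # the removal of x described above, it simply ignores its first argument.
    return G1(z)

real = rng.normal(0.0, 1.0, size=5_000)    # first paper's x; second paper's y
noise = rng.uniform(0.0, 1.0, size=5_000)  # z, drawn from the prior p_z
cond = rng.uniform(0.0, 1.0, size=5_000)   # second paper's observed image x

# First paper, right hand side of equation (1).
rhs_paper1 = np.mean(np.log(D(real))) + np.mean(np.log(1.0 - D(G1(noise))))

# Second paper, right hand side of equation (2), with y playing the role of x.
rhs_paper2 = np.mean(np.log(D(real))) + np.mean(np.log(1.0 - D(G2(cond, noise))))

print(np.isclose(rhs_paper1, rhs_paper2))  # prints True
```

Because `G2` discards its conditioning input, both expressions apply identical operations to identical samples, so the match is exact rather than approximate.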

Thank you for your reply and sorry for the late response. Regarding this paragraph: "The question of discrepancy is that the right hand sides do not match between the first paper's equation (1) and the second paper's equation (2) which is absent of the condition involving ." No, that is not what confuses me. The question of discrepancy is the mismatch of the left hand sides of the two papers: the first paper defines the expectations to be the result of the min-max operation, whereas the second paper suggests taking the min-max of the expectations. – AugLe – 2019-02-25T10:39:17.463