Goodfellow, Ian and Pouget-Abadie, Jean and Mirza, Mehdi and Xu, Bing and Warde-Farley, David and Ozair, Sherjil and Courville, Aaron and Bengio, Yoshua
NIPS 2014
Summary:
This is the original GAN paper, which first introduced a practical neural-network approach to generation tasks. The goal is to train a neural network G that takes noise z as input and learns to generate samples G(z) that resemble the training data x; in other words, to make the distribution of G(z) close to the data distribution of x.
To achieve this with a neural network, we need a loss function whose gradient can be backpropagated to update G. If we used an MSE loss between generated images and sampled training images, I assume G would still learn to produce images somewhat similar to x from z. However, because MSE compares images pixel by pixel, it is sensitive to local detail rather than global similarity, so this approach might work for simple image datasets but would be hard to generalize to complex real-world images. The bottleneck is that we lack a loss function that evaluates the quality of a generated image as a whole.
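The point about MSE being sensitive to local information can be illustrated with a toy case. The sketch below is my own illustration, not an experiment from the paper: the 8x8 striped "image", the one-pixel shift, and the helper name `mse` are all assumptions chosen to make the effect easy to see.

```python
def mse(img_a, img_b):
    """Mean squared error between two equally sized 2D lists of floats."""
    total, count = 0.0, 0
    for row_a, row_b in zip(img_a, img_b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count


SIZE = 8  # hypothetical toy image size

# Vertical stripes: pixel value alternates 0, 1, 0, 1, ... along each row.
stripes = [[float(c % 2) for c in range(SIZE)] for _ in range(SIZE)]

# The same pattern shifted right by one pixel (with wrap-around).
# To a human this is still "the same stripes".
shifted = [[row[(c - 1) % SIZE] for c in range(SIZE)] for row in stripes]

# A flat gray image with no structure at all.
gray = [[0.5] * SIZE for _ in range(SIZE)]

mse_shifted = mse(stripes, shifted)  # every pixel differs by 1
mse_gray = mse(stripes, gray)        # every pixel differs by 0.5

print("MSE(stripes, shifted stripes):", mse_shifted)
print("MSE(stripes, flat gray):      ", mse_gray)
```

Under this toy setup, the structureless gray image scores a lower MSE than the shifted copy of the target, which is one way to see why a purely pixel-wise loss may reward blurry averages over globally plausible images.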
A straightforward idea is to train another neural network to evaluate generation quality as a whole, since NNs are good at capturing global information. The authors adopt this idea and train a second network, the ‘Discriminator’ D, which returns the probability that its input comes from the same distribution as the training data x. However, building a high-quality training set for this task is not easy. We need the original dataset as positive samples, since it carries the target distribution. We also need negative samples so that D can learn where that distribution ends. Random noise could serve as negatives, but a discriminator trained that way may fail when the input is close to the positive samples yet does not belong to the distribution of x. In other words, the performance of D relies on a well-designed dataset that carefully covers the cases near the boundary, and building one by hand is impractical for high-dimensional data.
A more practical way is to let D learn from the images produced by a not-yet-converged G. As G converges, D sees samples that are harder and harder to distinguish, which is exactly the ‘well-designed dataset at boundary conditions’ argued for above. Meanwhile, G learns to generate more and more realistic data to fool the discriminator, using the loss backpropagated through D. The whole optimization can be modeled as a two-player minimax game with value function V(G,D), which is the negative binary cross-entropy of D's classification: a log D(x) term for the original images and a log(1 - D(G(z))) term for the generated images. D tries to maximize the value function by assigning high probability to real images and rejecting G's outputs, which reflects its ability to distinguish fake data. G, in contrast, tries to minimize the value function so that it can fool D.
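For reference, the value function described above can be written out as follows. The symbols p_data (the data distribution) and p_z (the noise prior) are the standard names for these distributions; the argument order V(G,D) follows these notes.

```latex
\min_G \max_D V(G, D)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The first expectation is the "log part" on real images and the second is the "1 - log part" on generated images; D pushes both terms up, while G can only affect the second term and pushes it down.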
In the training stage, D and G are trained in an alternating way: D is updated for k steps, and then G is updated once. The motivation is that the two networks need to be trained at a similar pace to avoid saturation. If D is too strong compared with G, the outputs of G are almost always rejected and D provides almost no useful guidance to G. Conversely, if D is too weak to tell real from generated data, G is misled by false positives. k is a hyperparameter: the paper's experiments use k = 1, the least expensive option, while a larger k gives D more training so that it stays closer to its optimal solution and provides relatively correct guidance.
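The alternating schedule can be sketched on a deliberately tiny problem. The code below is a toy illustration under my own assumptions, not the paper's setup: the data is a 1D Gaussian N(4, 1), G is a linear map a*z + b, D is a logistic unit sigmoid(w*x + c), gradients are derived by hand, and the name `train_toy_gan` and all hyperparameter values are hypothetical.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=200, k=1, batch=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator G(z) = a*z + b (toy assumption)
    w, c = 0.0, 0.0  # discriminator D(x) = sigmoid(w*x + c) (toy assumption)
    d_updates = g_updates = 0

    for _ in range(steps):
        # k discriminator steps: gradient ASCENT on
        # V = mean log D(x_real) + mean log(1 - D(G(z))).
        for _ in range(k):
            real = [rng.gauss(4.0, 1.0) for _ in range(batch)]
            fake = [a * rng.gauss(0.0, 1.0) + b for _ in range(batch)]
            grad_w = grad_c = 0.0
            for x in real:  # d/ds log(sigmoid(s)) = 1 - sigmoid(s)
                p = sigmoid(w * x + c)
                grad_w += (1.0 - p) * x
                grad_c += 1.0 - p
            for x in fake:  # d/ds log(1 - sigmoid(s)) = -sigmoid(s)
                p = sigmoid(w * x + c)
                grad_w -= p * x
                grad_c -= p
            w += lr * grad_w / batch
            c += lr * grad_c / batch
            d_updates += 1

        # One generator step: gradient DESCENT on mean log(1 - D(G(z))),
        # with D's parameters held fixed.
        grad_a = grad_b = 0.0
        for _ in range(batch):
            z = rng.gauss(0.0, 1.0)
            p = sigmoid(w * (a * z + b) + c)
            grad_a += -p * w * z
            grad_b += -p * w
        a -= lr * grad_a / batch
        b -= lr * grad_b / batch
        g_updates += 1

    return {"a": a, "b": b, "w": w, "c": c,
            "d_updates": d_updates, "g_updates": g_updates}


result = train_toy_gan(steps=200, k=2)
print(result)
```

Calling `train_toy_gan` with different values of k changes only how many discriminator updates happen per generator update. Because D here is linear, it can only react to a difference in means, so this sketch shows the update schedule rather than a generator that fully matches the data distribution.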
Strength:
Its design of making the adversarial process automatic and differentiable is amazing. Even though there is still a hyperparameter space to explore to make training converge, it is already a huge leap compared with non-interactive training approaches.
Critique:
There are several possible improvements in my mind after reading this paper:
- Add class guidance by conditioning both D and G on a class label during training.
- Could the output of D be multi-dimensional, so that it passes more information to G during training?
- Could k be adaptive, so that it adjusts the training pace of D and G?
Code:
A tutorial-level implementation
https://interactiveaudiolab.github.io/course-deep-learning/notebooks/notebook_6_gan.ipynb