StyleGAN

20 Nov 2022

StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks

Tero Karras, Samuli Laine, Timo Aila

CVPR 2019

Summary:

StyleGAN demonstrated astonishingly controllable face generation results in 2019, breaking the stereotype that GANs were unfit for producing photorealistic results. StyleGAN's revision of the traditional GAN structure happens mainly in the generator G, where an extra mapping network is added before the synthesis network so that the latent space Z no longer maps directly to the image feature space. The paper argues for and experiments on many insightful hypotheses, which helps readers understand both the motivation behind the design and the interpretation of GANs.

The style-based generator takes the latent variable z as input and first maps it to the W space; the resulting w later provides the scaling and offset controls for the adaptive instance normalization (AdaIN) operations. The backbone of the synthesis network is the progressive GAN (ProGAN), which upscales the feature map size progressively. However, there are several differences from ProGAN's building blocks: the input is no longer the latent z but a learned constant; noise is added to the convolved feature maps at every scale; and AdaIN operations are applied to all of the noised, convolved feature maps. Under this setting, noise is injected at every scale of the generation process, which better models stochastic variation, and w controls the generation at each scale in a spatially invariant way, providing disentanglement at different levels. Below, I answer several core questions.
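
To make the dataflow concrete, below is a minimal sketch of one synthesis block in PyTorch (my assumption; the official release uses TensorFlow). Names like `SynthesisBlock`, `noise_weight`, and `affine` are illustrative, and the paper actually applies the conv/noise/AdaIN sequence twice per resolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SynthesisBlock(nn.Module):
    """One style-based block: upsample -> conv -> add noise ("B") -> AdaIN ("A")."""
    def __init__(self, in_ch, out_ch, w_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.noise_weight = nn.Parameter(torch.zeros(1, out_ch, 1, 1))  # learned per-channel noise scale
        self.affine = nn.Linear(w_dim, out_ch * 2)                      # "A": w -> (style scale, style bias)

    def forward(self, x, w):
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        x = self.conv(x)
        noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        x = x + self.noise_weight * noise           # per-pixel noise, broadcast over channels
        x = F.leaky_relu(x, 0.2)
        # AdaIN: wipe out incoming per-channel statistics, then modulate with the style.
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8
        scale, bias = self.affine(w).chunk(2, dim=1)
        return scale[:, :, None, None] * (x - mu) / sigma + bias[:, :, None, None]
```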

Why map to the W space first rather than directly use the Z space?

The training set's feature space may lack some style combinations (e.g., long hair + man), so the data inherently does not support well-trained generation of such combinations. The z space, however, is sampled from a fixed distribution (a normal distribution), so the mapping from z to the image feature space must warp itself non-linearly to avoid those missing combinations, and this warping causes entanglement. The w space is not tied to any fixed sampling distribution: its sampling density is induced by the learned, piecewise continuous mapping f(z), which can adapt to the image feature space.
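
For reference, a minimal sketch of the mapping network f: Z -> W, following the paper's description (8 fully connected layers, 512-d latents, leaky ReLU, normalization of the input latent); PyTorch is again my assumption:

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """8-layer MLP f: Z -> W (both 512-d in the paper)."""
    def __init__(self, z_dim=512, w_dim=512, n_layers=8):
        super().__init__()
        layers, dim = [], z_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, w_dim), nn.LeakyReLU(0.2)]
            dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        z = z / (z.pow(2).mean(dim=1, keepdim=True) + 1e-8).sqrt()  # normalize z, as in the paper
        return self.net(z)
```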

Why would AdaIN work?

AdaIN first normalizes each channel and then applies a scale and bias derived from the style w. The new per-channel statistics, as dictated by the style, modify the relative importance of features for the subsequent convolution, but do not depend on the original statistics because of the normalization. Each style therefore controls only one convolution before being overridden by the next AdaIN. The styles are thus localized, so modifying a specific subset of them can be expected to affect only certain aspects of the image. This helps the network disentangle different features.
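
In formula form, AdaIN(x_i, y) = y_{s,i} * (x_i - mu(x_i)) / sigma(x_i) + y_{b,i}, applied to each feature map i. A minimal standalone sketch (tensor layout is my choice):

```python
import torch

def adain(x, style_scale, style_bias, eps=1e-8):
    """x: (B, C, H, W) feature maps; style_scale, style_bias: (B, C) from the affine transform of w."""
    mu = x.mean(dim=(2, 3), keepdim=True)      # per-channel mean
    sigma = x.std(dim=(2, 3), keepdim=True)    # per-channel std
    x = (x - mu) / (sigma + eps)               # normalization erases the incoming statistics
    return style_scale[:, :, None, None] * x + style_bias[:, :, None, None]
```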

Why add explicit noise before every AdaIN?

There is pressure on the network to introduce new stochastic content at every scale as soon as possible. The easiest route for the network is to consume the fresh set of noise supplied at each layer, so there is no incentive to synthesize stochastic effects from earlier activations, which would consume network capacity and risk repetitive artifacts.
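
A sketch of the noise injection ("B" in the paper's figure): per-pixel Gaussian noise scaled by a learned per-channel weight. Passing a fixed `noise` tensor instead of sampling a fresh one is how one can hold the stochastic detail constant while varying the style; the `noise` argument is my interface choice, not the official API:

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Adds per-pixel Gaussian noise, scaled per channel by a learned weight."""
    def __init__(self, channels):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x, noise=None):
        if noise is None:  # fresh single-channel noise, broadcast across all channels
            noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        return x + self.weight * noise
```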

Why and how does style mixing work?

The mapping network can be thought of as a way to draw samples for each style from a learned distribution, and the synthesis network as a way to generate a novel image based on a collection of styles. Because of the progressive upsampling, the earlier a style is applied, the larger the image scale it affects. Coarse-resolution styles therefore map to global features (pose, hair style, face shape, etc.), while high-resolution styles map to finer details such as color scheme and microstructure. Thanks to the localization property introduced above, one can simply take styles at different levels from different latent vectors and combine them to achieve the desired effect, as the sketch below shows.
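
A hedged sketch of the mixing mechanics: run two latents through the mapping network, then hand the early layers styles from w1 and the later layers styles from w2. The 18-layer count matches the paper's 1024x1024 generator (two styles per resolution); the `mapping`/`synthesis` callables in the usage comments are stand-ins for the trained networks:

```python
def mix_styles(w1, w2, crossover, n_layers=18):
    """Per-layer style list: w1 drives layers [0, crossover), w2 drives the rest."""
    return [w1 if i < crossover else w2 for i in range(n_layers)]

# Usage sketch (mapping/synthesis are stand-ins for trained networks):
# w1, w2 = mapping(z1), mapping(z2)
# synthesis(mix_styles(w1, w2, crossover=4))  # w1 sets coarse styles (pose, face shape), w2 the rest
# synthesis(mix_styles(w1, w2, crossover=8))  # w2 sets only fine styles (color scheme, microstructure)
```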

Why and how does the network automatically separate the global effects from stochasticity?

Suppose the network relied on the noise to generate high-level information (pose, face shape, etc.): the resulting spatially inconsistent decisions would be penalized by the discriminator. Thus the network learns to use the global (style) and local (noise) channels appropriately without explicit guidance.

Strengths:

The writing of this paper is intuitive, insightful, and thorough. The interpretation of the design motivation behind the architecture clearly shows the authors' deep thinking.

The method provides nice disentanglement of the styles along with highly photorealistic results. It also introduces two metrics, perceptual path length and linear separability, for measuring the perceptual smoothness and the disentanglement of the latent space, which is very helpful for future work.
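
As an illustration of the first metric, here is a rough Monte Carlo sketch of perceptual path length measured in W (the paper interpolates linearly in W, uses eps = 1e-4, and a VGG16-based perceptual distance; the `perceptual_dist` callable here is a stand-in for that):

```python
import torch

def path_length_w(mapping, synthesis, perceptual_dist, n_samples=10_000, eps=1e-4, z_dim=512):
    """Estimate of the perceptual path length in W: E[d(g(lerp(t)), g(lerp(t+eps))) / eps^2]."""
    total = 0.0
    for _ in range(n_samples):
        z1, z2 = torch.randn(2, 1, z_dim)
        t = torch.rand(())
        w1, w2 = mapping(z1), mapping(z2)
        w_a = torch.lerp(w1, w2, t)          # linear interpolation in W
        w_b = torch.lerp(w1, w2, t + eps)
        d = perceptual_dist(synthesis(w_a), synthesis(w_b))
        total += (d / eps**2).item()
    return total / n_samples
```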

Critique:

Adding noise near the final layers might cause 'white dot' noise artifacts in the generated images. Would it be reasonable to skip the noise injection in the final part? Likewise, is it reasonable to keep applying full-strength AdaIN even in the later layers? The goal of the later layers is to generate local features of the image, which is inconsistent with AdaIN's tendency to override what the previous layer established. The high-level features should be preserved, so it may be more reasonable to explicitly shrink the style modulation as generation approaches the final layers, as sketched below.
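
Purely as my own speculation (not anything from the paper), one way to express this would be to fade the AdaIN modulation toward the identity transform (scale 1, bias 0) with a per-layer damping factor:

```python
def damped_style(scale, bias, layer_idx, n_layers):
    """Speculative: fade AdaIN toward identity (scale=1, bias=0) in later layers."""
    alpha = 1.0 - layer_idx / (n_layers - 1)   # 1 at the first layer, 0 at the last
    return 1.0 + alpha * (scale - 1.0), alpha * bias
```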

Resources:

I strongly recommend reading the original paper several times rather than looking for online explanations. Distributing so much deep discussion across all the sections is arguably not great for readability, but the authors think things through so deeply, and the paper contains very thorough design motivation. Reading such a paper is like listening to a master teach.

Video:

A good entry-level introduction, though it lacks depth.

https://www.youtube.com/watch?v=4LNO8nLxF4Y&ab_channel=AIBites

Code:

https://github.com/NVlabs/stylegan