Efficient Geometry-aware 3D Generative Adversarial Networks

26 Oct 2022

Eric R. Chan, Gordon Wetzstein, et al.

CVPR 2022

Summary:

I reviewed this paper mainly out of curiosity about the terms ‘geometry-aware’ and ‘3D’. In 2D face generation, the physical correctness of the geometry is only implicitly encoded in the training dataset. How to add explicit geometric correctness to a generative network, and how to extend such methods to 3D, are the questions that motivated me to look at this work from a top research group in my field of computational imaging.

This paper demonstrates a new 3D GAN framework that generates photorealistic 3D human face geometry from collections of 2D images taken from different viewpoints, which means 3D geometry can be obtained with far cheaper instruments than dedicated 3D capture setups.

The information flow of the whole framework is as follows:

  1. The camera pose information, together with a latent code, is encoded and fed into a general GAN generator network (the paper uses StyleGAN2 as the backbone); a small conditioning sketch appears after this list.

  2. Rather than generating an RGB image as most backbones do, the paper proposes a tri-plane representation for encoding geometric and color information. This hybrid design keeps the features explicitly geometric (they are queried by explicit 3D sampling) while avoiding the computational cost of a pure voxel-based representation: each target 3D point is projected onto three orthogonal feature planes and the sampled features are aggregated (see the tri-plane sketch after this list).

  3. The neural rendering part, which decodes the aggregated features into color and density and composites them along camera rays, is quite lightweight thanks to the new representation (see the volume-rendering sketch below). Note that the camera pose information is also involved in conditioning, since some facial attributes are correlated with pose in the training data (people tend to smile when directly facing the camera).

  4. Super-resolution part. To keep training with the tri-plane representation cheap, volume rendering is performed at a modest resolution, and a super-resolution module upsamples the rendered feature image to the final high-resolution output. This further saves computational cost.

  5. Finally, the high-resolution image from the super-resolution module, concatenated with an upsampled copy of the raw rendered image, is fed to the backbone GAN discriminator (‘dual discrimination’; sketched after this list). Through adversarial training, the generator progressively learns to produce more and more realistic images, both geometrically and visually.
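
To make step 1 concrete, here is a minimal sketch of pose conditioning, assuming PyTorch. The 25-dimensional camera vector (a flattened 4x4 extrinsic matrix plus 3x3 intrinsics) follows the paper's setup, but the `mapping` module below is a hypothetical stand-in for the StyleGAN2 mapping network, not the authors' code.

```python
import torch

# Hypothetical stand-in for the StyleGAN2 mapping network (an MLP).
mapping = torch.nn.Sequential(
    torch.nn.Linear(512 + 25, 512), torch.nn.LeakyReLU(0.2),
    torch.nn.Linear(512, 512))

z = torch.randn(4, 512)    # latent codes for a batch of 4
cam = torch.randn(4, 25)   # flattened 4x4 extrinsics + 3x3 intrinsics
w = mapping(torch.cat([z, cam], dim=1))  # pose-conditioned style vectors
```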
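
For steps 2 and 3, a sketch of tri-plane feature sampling and the lightweight decoder. This is my own illustration under assumed sizes (32 feature channels, a 64-unit hidden layer); the paper's exact decoder may differ.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, points):
    """planes: (3, C, H, W) feature maps for the XY, XZ, YZ planes.
    points: (N, 3) query coordinates, assumed normalized to [-1, 1].
    Returns (N, C): the sum of the bilinearly sampled plane features."""
    projections = (points[:, [0, 1]],   # project onto the XY plane
                   points[:, [0, 2]],   # project onto the XZ plane
                   points[:, [1, 2]])   # project onto the YZ plane
    feats = 0
    for plane, coords in zip(planes, projections):
        grid = coords.view(1, -1, 1, 2)                       # (1, N, 1, 2)
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                mode='bilinear', align_corners=False)
        feats = feats + sampled.view(plane.shape[0], -1).t()  # (N, C)
    return feats

class TinyDecoder(torch.nn.Module):
    """Hypothetical lightweight MLP turning an aggregated tri-plane
    feature into a scalar density and a color feature (step 3)."""
    def __init__(self, c_in=32, hidden=64, c_color=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(c_in, hidden), torch.nn.Softplus(),
            torch.nn.Linear(hidden, 1 + c_color))

    def forward(self, feats):
        out = self.net(feats)
        return out[:, :1], out[:, 1:]  # density, color feature
```

Because each query costs only three bilinear lookups plus a tiny MLP, memory and compute scale with the square of the plane resolution, rather than cubically as in a dense voxel grid.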
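
The decoded densities and colors are then composited along camera rays with the standard NeRF-style volume-rendering quadrature. A sketch under assumed tensor shapes (R rays, S samples per ray):

```python
import torch

def composite(sigmas, colors, deltas):
    """sigmas: (R, S, 1) densities for S samples on each of R rays.
    colors: (R, S, C) per-sample color features.
    deltas: (R, S, 1) distances between consecutive samples.
    Returns (R, C): the composited feature per ray."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)               # (R, S, 1)
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]),
                   1.0 - alphas + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alphas * trans                                 # (R, S, 1)
    return (weights * colors).sum(dim=1)                     # (R, C)
```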
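
For step 5, a sketch of assembling the dual-discrimination input: the raw rendered RGB is upsampled and stacked with the super-resolved image, so the discriminator sees a 6-channel input and can penalize inconsistency between the two. The resolutions here are illustrative.

```python
import torch
import torch.nn.functional as F

raw_rgb = torch.randn(4, 3, 128, 128)  # low-res RGB from the renderer
sr_rgb = torch.randn(4, 3, 512, 512)   # output of the super-res module
raw_up = F.interpolate(raw_rgb, size=sr_rgb.shape[-2:],
                       mode='bilinear', align_corners=False)
disc_input = torch.cat([sr_rgb, raw_up], dim=1)  # (4, 6, 512, 512)
```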

Why Good:

  1. This work provides a new 3D-geometry-aware data representation, which mitigates the view inconsistency and the heavy compute/memory requirements that previous SOTA representations suffer from (voxel-based, mesh-based, low-resolution 3D, block-based, and NN-based representations).

  2. This work is especially useful for 3D VR/AR applications, since (combined with GAN inversion) it can produce a photorealistic 3D face model from only a single image, an important capability for 3D avatar generation.

  3. This paper picks a core trade-off problem for 3D GANs, ‘how to make the synthesized volume view-consistent while keeping the computational cost relatively low’, and provides a new solution.

Possible Follow-On:

  1. The model still contains artifacts in finer details because the representation is not explicit enough; there is a trade-off between efficiency and geometric expressiveness.

  2. Pose information helps the framework learn pose-related attributes, but requiring accurate pose estimates raises the setup cost when deploying the method. How about using a looser pose estimate?

  3. Direct generation from a single image, without per-image optimization, would be the holy grail for this task.