Abstract


Enabling photorealistic avatar animations in virtual and augmented reality (VR/AR) has been challenging because of the difficulty of obtaining the ground truth state of the face. It is physically impossible to obtain synchronized images from the head-mounted camera (HMC) sensing input, which provides partial observations in infrared (IR), and an array of outside-in dome cameras, which provide full observations that match the avatars' appearance. Prior works relying on analysis-by-synthesis methods can generate accurate ground truth, but suffer from imperfect disentanglement between expression and style in their personalized training. The reliance on extensive paired captures (HMC and dome) of the same subject makes it operationally expensive to collect large-scale datasets, and the resulting data cannot be reused for different HMC viewpoints and lighting. In this work, we propose a novel generative approach, Generative HMC (GenHMC), that leverages large unpaired HMC captures, which are much easier to collect, to directly generate high-quality synthetic HMC images given any conditioning avatar state from dome captures. We show that our method properly disentangles the input conditioning signal, which specifies facial expression and viewpoint, from facial appearance, leading to more accurate ground truth. Furthermore, our method generalizes to unseen identities, removing the reliance on paired captures. We demonstrate these breakthroughs by evaluating both the synthetic HMC images and the universal face encoders trained on these new HMC-avatar correspondences, which achieve better data efficiency and state-of-the-art accuracy.

GenHMC


GenHMC makes it straightforward to establish correspondences between HMC images and avatars built from dome captures. It also enables the generation of synthetic images with diverse natural augmentations for a single expression, improving both flexibility and usability.

During training, we load a real, monochrome HMC image \( \mathbf{x} \in \mathbb{R}^{256 \times 256} \) and pass it through a pre-trained face keypoint detection model \( \phi_\text{kpt} \) and a face segmentation model \( \phi_\text{seg} \). We overlay the outputs of the two models onto a single map, which serves as the conditioning signal \( \mathbf{c} \) for the generative model.
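The snippet below is a minimal sketch of how such a conditioning signal could be assembled; the interfaces of \( \phi_\text{kpt} \) and \( \phi_\text{seg} \), the two-channel layout, and the helper name `build_condition` are illustrative assumptions rather than the exact implementation.

```python
# Sketch: building the conditioning signal c from a real HMC image during training.
# `phi_kpt` and `phi_seg` stand in for the pre-trained keypoint detector and face
# segmentation model; their exact interfaces are assumptions for illustration.
import torch


def build_condition(x_hmc: torch.Tensor, phi_kpt, phi_seg) -> torch.Tensor:
    """Overlay keypoint and segmentation outputs into one conditioning map.

    x_hmc:   monochrome HMC image, shape (1, 256, 256), values in [0, 1].
    phi_kpt: callable returning K keypoints as (K, 2) pixel coordinates.
    phi_seg: callable returning a (256, 256) integer segmentation label map.
    """
    h, w = x_hmc.shape[-2:]
    cond = torch.zeros(2, h, w)            # channel 0: keypoints, channel 1: segmentation

    # Rasterize detected keypoints as a sparse map.
    kpts = phi_kpt(x_hmc)                  # (K, 2) pixel coordinates
    for u, v in kpts.round().long():
        if 0 <= v < h and 0 <= u < w:
            cond[0, v, u] = 1.0

    # Normalize segmentation labels to [0, 1] and place them in channel 1.
    seg = phi_seg(x_hmc).float()
    cond[1] = seg / max(seg.max().item(), 1.0)

    return cond                            # conditioning signal c for the generator
```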

At inference time, no real HMC image is available (after all, we are generating them). We instead take avatar renderings from simulated HMC viewpoints, detect or project the keypoint and segmentation maps on them, and use the result as the conditioning signal \( \mathbf{c} \).
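A corresponding sketch of the inference path is shown below; the renderer interface, the `gen.sample` call, and the reuse of `build_condition` from the training sketch above are assumptions for illustration, not the system's actual API.

```python
# Sketch: synthesizing an HMC image at inference time, when no real HMC capture exists.
import torch


@torch.no_grad()
def synthesize_hmc(avatar_state, hmc_camera, renderer, phi_kpt, phi_seg, gen) -> torch.Tensor:
    # 1. Render the avatar from a simulated HMC viewpoint.
    rendering = renderer(avatar_state, hmc_camera)        # (1, 256, 256) monochrome render

    # 2. Reuse the same conditioning construction as in training,
    #    applied to the rendering instead of a real capture
    #    (build_condition is defined in the previous sketch).
    cond = build_condition(rendering, phi_kpt, phi_seg)   # (2, 256, 256)

    # 3. Sample a synthetic HMC image from the trained generator given c.
    x_synth = gen.sample(cond.unsqueeze(0))               # (1, 1, 256, 256)
    return x_synth.squeeze(0)
```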

Encoder Training


We present a novel encoder training system that minimizes the need for real HMC captures.

The training system comprises two major components: GenHMC inference and encoder training on the resulting synchronized dome assets. For the first time, dome assets such as ground truth expression codes, or even dome images, can be directly used to train face encoders for VR headsets. This approach offers several key benefits: (1) more accurate ground truth supervision from multi-view dome captures, compared to the pseudo correspondences established by [Wei et al. 2019]; (2) a reduced need for paired HMC and dome captures; and (3) increased diversity in HMC inputs generated by GenHMC.
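As a rough illustration of the second component, the sketch below shows one possible encoder training step that supervises the encoder with dome-derived assets: an expression code regression term plus a photometric \( L_1 \) term against a dome rendering. The names `encoder`, `avatar_decoder`, the batch layout, and the loss weighting are assumptions for illustration, not the system's actual interfaces.

```python
# Sketch: one encoder training step on GenHMC-synthesized inputs paired with dome assets.
import torch
import torch.nn.functional as F


def encoder_training_step(encoder, avatar_decoder, optimizer, batch):
    x_synth = batch["hmc_image"]        # synthetic HMC images from GenHMC, (B, 1, 256, 256)
    z_gt = batch["expression_code"]     # ground truth expression codes from dome assets
    img_gt = batch["dome_image"]        # dome renderings used for photometric supervision

    z_pred = encoder(x_synth)           # predict expression code from HMC-style input
    img_pred = avatar_decoder(z_pred)   # decode an avatar image for the photometric term

    # Photometric L1 term plus a simple code regression term (weighting is illustrative).
    loss = F.l1_loss(img_pred, img_gt) + F.mse_loss(z_pred, z_gt)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```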

Experiments

Scalability of GenHMC


We observe that reducing the number of training subjects leads to a decrease in the quality and diversity of the generated images. Specifically, the inference results become increasingly degenerate, and the pixel-wise alignment between the generated output and the keypoint/segmentation maps degrades. In the extreme case where we train with only one subject, the model tends to generate images that closely resemble the attributes of that specific subject, indicating a significant loss of diversity.

Universal Facial Encoders with GenHMC Data

Scaling Law of Number of GenHMC Subjects


Increasing the number of GenHMC training subjects generally leads to lower photometric \( L_1 \) error in universal encoder (UE) training. While performance improves with more diverse GenHMC training subjects when training solely on synthetic data, combining real and GenHMC images yields robust UE performance even with GenHMC models trained on as few as 60 subjects.

Scaling Law of Number of Subjects with UE


Incorporating synthetic data generated by GenHMC consistently improves UE performance, even with a limited number of real training subjects. This highlights the ability of GenHMC to augment limited real datasets, enhancing the performance and robustness of downstream UE training.

Comparisons with Different Training Configurations


Additional Qualitative Results