Abstract

[Overview figure]

Neural Radiance Field (NeRF) has broken new ground in novel view synthesis due to its simple concept and state-of-the-art quality. However, it suffers from severe performance degradation unless trained with a dense set of images captured from different camera poses, which hinders its practical application. Although previous methods addressing this problem achieved promising results, they relied heavily on additional training resources, which goes against the philosophy of sparse-input novel view synthesis and its pursuit of training efficiency. In this work, we propose MixNeRF, an effective training strategy for novel view synthesis from sparse inputs that models a ray with a mixture density model. MixNeRF estimates the joint distribution of RGB colors along the samples of a ray by modeling it with a mixture of distributions. We also propose ray depth estimation as a useful auxiliary training objective, which is highly correlated with 3D scene geometry. Moreover, we remodel the colors with blending weights regenerated from the estimated ray depth, which further improves robustness to shifts of colors and viewpoints. MixNeRF outperforms other state-of-the-art methods on standard benchmarks with superior training and inference efficiency.

Video

Modeling a Ray with a Mixture Density Model

First, our MixNeRF estimates the distribution of the RGB color values along the samples of a ray through a pixel with a mixture model, i.e., a weighted combination of component distributions. The conventional NeRF outputs for each sampled point, the RGB color and the density, are used as the location parameter and to compute the mixing coefficient π, respectively. In addition to these, a scale parameter β is also estimated in our model:

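Concretely, each component is a Laplace distribution over the color value, with the predicted RGB color as its location and β as its scale. A plausible reconstruction of the formula (the ℓ1 distance over RGB channels is an assumption):

$$\mathcal{L}\big(\mathbf{x};\,\boldsymbol{\mu}_{ij},\beta_{ij}\big)=\frac{1}{2\beta_{ij}}\exp\!\left(-\frac{\lVert\mathbf{x}-\boldsymbol{\mu}_{ij}\rVert_{1}}{\beta_{ij}}\right)$$

where the subscript ij denotes the j-th sample on the i-th ray.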

The pdf of our mixture model formed by the component distributions above is defined as:

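With M samples per ray, the mixture pdf for the ground-truth color of the i-th ray takes the following form (a plausible reconstruction from the definitions above):

$$p\big(\mathbf{C}_i\mid\mathbf{r}_i\big)=\sum_{j=1}^{M}\pi_{ij}\,\mathcal{L}\big(\mathbf{C}_i;\,\boldsymbol{\mu}_{ij},\beta_{ij}\big)$$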

The mixing coefficient πij is derived from the density output σij as follows:
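
A plausible reconstruction, normalizing the standard NeRF volume-rendering weights, where δij is the distance between adjacent samples:

$$\pi_{ij}=\frac{w_{ij}}{\sum_{k=1}^{M}w_{ik}},\qquad w_{ij}=T_{ij}\big(1-\exp(-\sigma_{ij}\delta_{ij})\big),\qquad T_{ij}=\exp\!\left(-\sum_{k=1}^{j-1}\sigma_{ik}\delta_{ik}\right)$$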

Depth Estimation by Mixture Density Model

We propose ray depth estimation as an effective auxiliary task for training our MixNeRF with sparse inputs. MixNeRF estimates the ray's depth d, which is defined as the length of the unnormalized ray direction vector, using the samples along the ray. The pdf of our mixture model for the depth of the i-th ray is as follows:

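A plausible reconstruction: the mixing coefficients and scales are shared with the color mixture, and the location of the j-th component is assumed to be the depth of that sample, i.e., its distance along the ray scaled by the length of the unnormalized direction vector:

$$p\big(d_i\mid\mathbf{r}_i\big)=\sum_{j=1}^{M}\pi_{ij}\,\mathcal{L}\big(d_i;\,t_{ij}\lVert\mathbf{d}_i\rVert,\beta_{ij}\big)$$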

Since the mixing coefficient π and the scale parameter β are optimized through the supervision of depth as well as color values, this improves the robustness of our MixNeRF to slight changes in geometry. In addition, we exploit the estimated depth to regenerate the blending weights along the samples and model the RGB color values with a mixture of distributions once again. Since the estimated depth of each sample is trained to be nearly, but not exactly, identical to the ground-truth depth, it can play the role of pseudo geometry for points adjacent to the sample, without any pre-generation of extra training data. The new blending weights along a ray, based on the estimated ray depths, are defined as follows:

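One plausible form, assumed here rather than quoted verbatim: each sample's depth is scored under a Laplace density centered at the estimated depth of the i-th ray and then normalized into new mixing coefficients:

$$\hat{w}_{ij}=\frac{1}{2\beta_{ij}}\exp\!\left(-\frac{\big|t_{ij}\lVert\mathbf{d}_i\rVert-\hat{d}_i\big|}{\beta_{ij}}\right),\qquad\hat{\pi}_{ij}=\frac{\hat{w}_{ij}}{\sum_{k=1}^{M}\hat{w}_{ik}}$$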

Finally, we model the color values along a ray based on the new mixing coefficients; the corresponding pdf is as follows:

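By analogy with the color mixture above, using the regenerated coefficients (a plausible reconstruction):

$$\hat{p}\big(\mathbf{C}_i\mid\mathbf{r}_i\big)=\sum_{j=1}^{M}\hat{\pi}_{ij}\,\mathcal{L}\big(\mathbf{C}_i;\,\boldsymbol{\mu}_{ij},\beta_{ij}\big)$$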

Since the estimated ray depths are likely to be close to the ground truths, we reuse the same GT color values of the input rays when modeling the mixture distribution with the newly generated mixing coefficients. Simply modeling a ray once again with regenerated blending weights further improves robustness to shifts of colors and ray viewpoints, while eliminating the pre-generation and extra inference of unseen views and adding little computational overhead.
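
To make the pipeline concrete, below is a minimal PyTorch-style sketch of the three objectives described above: the color NLL, the depth NLL, and the remodeled color NLL with regenerated blending weights. All tensor shapes, the function names (laplace_nll, mixnerf_style_losses), and the mixture-mean depth estimator are illustrative assumptions, not the authors' implementation:

```python
import torch

def laplace_nll(target, mu, beta, pi, eps=1e-8):
    """NLL of `target` (R, 3) under a mixture of Laplace components with
    locations `mu` (R, M, 3), positive scales `beta` (R, M), and mixing
    coefficients `pi` (R, M) that sum to 1 over the M samples."""
    # Per-component log-density; L1 distance over RGB channels is assumed.
    log_comp = (-(target.unsqueeze(-2) - mu).abs().sum(-1) / beta
                - torch.log(2.0 * beta))
    # Stable mixture NLL via log-sum-exp over the M samples of each ray.
    return -torch.logsumexp(torch.log(pi + eps) + log_comp, dim=-1)

def mixnerf_style_losses(rgb, sigma, beta, t_vals, ray_d, gt_rgb):
    """Sketch of the three objectives: color NLL, depth NLL, and the
    remodeled color NLL. Assumed shapes: rgb (R, M, 3); sigma, beta,
    t_vals (R, M); ray_d (R, 3); gt_rgb (R, 3)."""
    # Standard volume-rendering blending weights w = T * (1 - exp(-sigma*delta)).
    delta = t_vals[:, 1:] - t_vals[:, :-1]
    delta = torch.cat([delta, delta[:, -1:]], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * delta)
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], -1), -1)[:, :-1]
    w = trans * alpha
    pi = w / (w.sum(-1, keepdim=True) + 1e-8)  # mixing coefficients

    # Depth of each sample, measured along the unnormalized ray direction.
    sample_depth = t_vals * ray_d.norm(dim=-1, keepdim=True)
    # Pseudo depth label: the length of the unnormalized direction vector,
    # following the definition in the text (no real GT depth is needed).
    gt_depth = ray_d.norm(dim=-1)

    # (1) Color NLL under the Laplace mixture.
    loss_rgb = laplace_nll(gt_rgb, rgb, beta, pi).mean()

    # (2) Depth NLL: same pi and beta, component locations are sample depths.
    log_comp_d = (-(gt_depth.unsqueeze(-1) - sample_depth).abs() / beta
                  - torch.log(2.0 * beta))
    loss_depth = -torch.logsumexp(torch.log(pi + 1e-8) + log_comp_d, -1).mean()

    # (3) Remodeling: regenerate blending weights around the estimated ray
    #     depth (mixture-mean estimator is an assumption) and reuse gt_rgb.
    est_depth = (pi * sample_depth).sum(-1, keepdim=True)
    w_hat = torch.exp(-(sample_depth - est_depth).abs() / beta) / (2.0 * beta)
    pi_hat = w_hat / (w_hat.sum(-1, keepdim=True) + 1e-8)
    loss_remodel = laplace_nll(gt_rgb, rgb, beta, pi_hat).mean()

    return loss_rgb, loss_depth, loss_remodel
```

In practice, the three terms would be combined with weighting hyperparameters and minimized jointly over the sparse input rays.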

Benefit of Mixture Density Model

For the unimodal distribution in blue, mip-NeRF does not estimate the mode well and produces degenerate geometry. In contrast, RegNeRF and our MixNeRF recover unimodal weight distributions that lead to higher-quality novel views, and MixNeRF in particular achieves a sharper mode than RegNeRF, closer to that of mip-NeRF (All-view). In the case of the bimodal distribution in red, our MixNeRF estimates the weight distribution successfully, while both mip-NeRF and RegNeRF fail to locate the modes accurately. Since the predicted 3D geometry is directly correlated with how well the density is estimated, MixNeRF is able to learn the geometry more efficiently from limited input views through mixture density modeling.

Depth Map Estimation

We observe that RegNeRF fails to learn the geometry with its smoothing strategy and produces degenerate results due to its overly strong depth-smoothness prior. In contrast, since our MixNeRF learns the depth of a ray through a mixture density model, without smoothing over additional unseen rays, it predicts depth maps far more efficiently and precisely.

Efficiency in Training and Inference

Although DietNeRF takes a similar amount of time to train as MixNeRF, its performance is significantly inferior to ours in the 3- and 6-view scenarios. Compared to RegNeRF, MixNeRF performs better while requiring about 42% less training time per scene under the same number of input views, owing to the elimination of extra inference for additional unseen rays.

Results

DTU 3-view


LLFF 3-view


Blender 4-view

Citation

Acknowledgements

This work was supported by an NRF grant (2021R1A2C3006659) and IITP grants (2021-0-01343, 2022-0-00953) funded by the Korean government.