Abstract


Recent advancements in the Neural Radiance Field (NeRF) have enhanced its capabilities for novel view synthesis, yet its reliance on dense multi-view training images poses a practical challenge, often leading to artifacts and a lack of fine object details. Addressing this, we propose ARC-NeRF, an effective regularization-based approach with a novel Area Ray Casting strategy. While the previous ray augmentation methods are limited to covering only a single unseen view per extra ray, our proposed Area Ray covers a broader range of unseen views with just a single ray and enables an adaptive high-frequency regularization based on target pixel photo-consistency. Moreover, we propose luminance consistency regularization, which enhances the consistency of relative luminance between the original and Area Ray, leading to more accurate object textures. The relative luminance, as a free lunch extra data easily derived from RGB images, can be effectively utilized in few-shot scenarios where available training data is limited. Our ARC-NeRF outperforms its baseline and achieves competitive results on multiple benchmarks with sharply rendered fine details.

Area Ray Casting

Compared to existing ray augmentation schemes, where each augmented ray corresponds to a single unseen view, our proposed Area Ray covers an area of continuous unseen views via Integrated Positional Encoding (IPE), providing more efficient extra training resources.


First, we reparameterize the metric distance \( t \in [t_{near}, t_{far}] \) as \( \tilde{t} \) to derive the variance \( \tilde{\sigma}^{2}_{\rho} \) perpendicular to the Area Ray (a). Next, as shown in (b), we derive the base radius \( \tilde{\rho} \) of the Area Ray from the angle \( \tilde{\theta} \) between \( r \) and \( \hat{n} \) via simple trigonometry: $$\tilde{\rho} = \tilde{\delta}\tan\tilde{\theta},$$ where \( \tilde{\delta} = 1 - \tilde{t}_{s} \), so that \( \tilde{\rho} \) is obtained from the sample located at \( \tilde{t} = 1 \). However, directly employing the obtained \( \tilde{\rho} \) in IPE results in a significantly large \( \tilde{\sigma}^2_\rho \), leading to over-regularization of the high-frequency components for the samples along an Area Ray. Thus, we rescale \( \tilde{\rho} \) to \( [0, 1] \), contracting a \( \tilde{\rho} \) with a large \( \tilde{\theta} \) into a proper range while leaving one with a small \( \tilde{\theta} \) nearly unchanged: $$\tilde{\rho} = \exp{(-1 / (\tilde{\delta}\tan{\tilde{\theta}}))}.$$
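The contraction above can be sketched in a few lines of NumPy. This is a minimal illustration of the \( \exp(-1/x) \) squashing, not the paper's implementation; the function and argument names are ours:

```python
import numpy as np

def area_ray_base_radius(delta, theta):
    """Base radius of an Area Ray, contracted into [0, 1].

    delta: tilde-delta = 1 - t_s (reparameterized distance term).
    theta: angle between the original ray r and the normal n_hat.

    The raw radius delta * tan(theta) is squashed by exp(-1/x):
    a large angle is contracted toward 1, while a small angle
    (raw radius near 0) stays near 0, i.e. is affected little.
    """
    raw = delta * np.tan(theta)  # the unscaled radius delta * tan(theta)
    return np.exp(-1.0 / raw)
```

Note the monotone behavior: as \( \tilde{\theta} \to 0 \) the radius tends to 0, and as \( \tilde{\theta} \to \pi/2 \) it saturates at 1, which keeps the IPE variance bounded.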

Then, \( \tilde{\sigma}^2_\rho \) is derived from \( \tilde{t} \) and \( \tilde{\rho} \) to featurize the conical frustums of the Area Ray as multivariate Gaussians, by simply replacing the original metric distance \( t \) used in mip-NeRF with \( \tilde{t} \): $$\tilde{\sigma}^2_\rho = \tilde{\rho}^2\left( \frac{\tilde{t}^2_\mu}{4} + \frac{5\tilde{t}^2_\delta}{12} - \frac{4\tilde{t}^4_\delta}{15(3\tilde{t}^2_\mu + \tilde{t}^2_\delta)} \right),$$ where \( \tilde{t}_\delta \) and \( \tilde{t}_\mu \) denote the half-width and mid-point of adjacent \( \tilde{t} \) values, respectively. Note that, as in mip-NeRF, we use the same \( \mu_t \) and \( \sigma^2_t \) for the mean and variance along the Area Ray.
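The variance formula above translates directly into code. A minimal sketch (names are ours, assuming scalar or broadcastable NumPy inputs):

```python
import numpy as np

def area_ray_variance(rho, t_mu, t_delta):
    """Perpendicular variance of an Area Ray frustum (mip-NeRF style).

    rho:     contracted base radius of the Area Ray.
    t_mu:    mid-point of adjacent reparameterized distances t~.
    t_delta: half-width of adjacent reparameterized distances t~.
    """
    return rho**2 * (
        t_mu**2 / 4.0
        + 5.0 * t_delta**2 / 12.0
        - 4.0 * t_delta**4 / (15.0 * (3.0 * t_mu**2 + t_delta**2))
    )
```

As a sanity check, with \( \tilde{t}_\delta = 0 \) the expression collapses to \( \tilde{\rho}^2 \tilde{t}^2_\mu / 4 \), and the whole variance scales with \( \tilde{\rho}^2 \), which is why contracting \( \tilde{\rho} \) to \( [0, 1] \) prevents over-regularization.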

Finally, we generate an Area Ray \( \mathbf{\tilde{r}}(t) = \mathbf{\tilde{o}} + t\mathbf{\tilde{d}} \), where \( \mathbf{\tilde{d}} = -\mathbf{\hat{n}} \) and \( \mathbf{\tilde{o}} = \mathbf{p}_s - t_s\mathbf{\tilde{d}} \). The Area Ray is thus cast from a newly set camera origin \( \mathbf{\tilde{o}} \), located at the same distance from \( \mathbf{p}_s \) as the original ray's origin, and covers the unseen view area between the original ray and its reflection about the axis of \( \mathbf{\hat{n}} \).
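Putting the origin and direction definitions together, a minimal sketch of the ray construction (function names are ours; the surface point \( \mathbf{p}_s \) and normal \( \mathbf{\hat{n}} \) are assumed given):

```python
import numpy as np

def cast_area_ray(p_s, n_hat, t_s):
    """Build the Area Ray origin and direction.

    p_s:   estimated surface point hit by the original ray, shape (3,).
    n_hat: unit surface normal at p_s, shape (3,).
    t_s:   distance from the original camera origin to p_s.

    The Area Ray points along -n_hat, and its origin is placed so
    that it sits at the same distance t_s from p_s as the original
    camera origin.
    """
    d_tilde = -n_hat                  # d~ = -n_hat
    o_tilde = p_s - t_s * d_tilde     # o~ = p_s - t_s * d~
    return o_tilde, d_tilde

def area_ray_point(o_tilde, d_tilde, t):
    """Point on the Area Ray: r~(t) = o~ + t * d~."""
    return o_tilde + t * d_tilde
```

By construction, \( \mathbf{\tilde{r}}(t_s) = \mathbf{p}_s \), so the Area Ray reaches the surface point at the same distance as the original ray.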

Luminance Consistency Regularization


We propose the luminance map as an effective additional training resource for few-shot scenarios with limited data, providing 'free lunch' information easily derived from RGB images, and introduce luminance consistency regularization.
For simplicity, we use relative luminance, normalized to \( [0, 1] \), and derive the GT relative luminance \( y_\text{GT} \) of a target pixel as follows: $$y_\text{GT} = \sum_{\bar{c} \in \{\bar{r}, \bar{g}, \bar{b}\}} \lambda_{\bar{c}} \bar{c},$$ where \( \bar{c} = c_\text{GT}^{2.2} \) denotes a linear RGB component converted from the gamma-compressed one by applying a simple power curve.
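A minimal sketch of this derivation follows. The weights \( \lambda_{\bar{c}} \) are not specified above, so we assume the standard Rec. 709 luminance coefficients here; the paper's values may differ:

```python
import numpy as np

# Assumed lambda_c weights (Rec. 709 relative luminance); these sum to 1,
# so a white pixel maps to luminance 1.0.
LUMA_WEIGHTS = np.array([0.2126, 0.7152, 0.0722])

def gt_relative_luminance(rgb_gamma):
    """GT relative luminance in [0, 1] of a gamma-compressed RGB pixel.

    rgb_gamma: array-like of shape (..., 3), channel values in [0, 1].
    Each channel is linearized with the simple power curve c_bar = c**2.2,
    then the weighted channels are summed.
    """
    rgb_linear = np.asarray(rgb_gamma) ** 2.2  # gamma-compressed -> linear
    return rgb_linear @ LUMA_WEIGHTS
```

Since the luminance map is computed purely from the GT RGB images, it really is a "free lunch" supervision signal requiring no extra data collection.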

In addition to the existing outputs, our ARC-NeRF estimates the luminance \( y \) as additional outputs per sample along a ray and renders the final luminance \( \hat{y} \) by volume rendering as follows: $$\hat{y}(\mathbf{r}) = \sum_{i=1}^{N}w_i y_i,$$ where \( y_i \in [0, 1] \) is the estimated relative luminance of the \( i \)-th sample along a ray \( \mathbf{r} \).
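The luminance rendering above mirrors standard NeRF color compositing. A minimal sketch, assuming the per-sample weights \( w_i \) are already computed by the usual volume-rendering quadrature:

```python
import numpy as np

def render_luminance(weights, luminances):
    """Volume-render per-sample luminances into a pixel luminance.

    weights:    per-sample rendering weights w_i along the ray,
                shape (..., N), as in standard NeRF compositing.
    luminances: per-sample estimated relative luminances y_i in [0, 1],
                shape (..., N).
    """
    return np.sum(weights * luminances, axis=-1)
```

The luminance consistency loss can then compare the renderings of the original ray and the Area Ray against \( y_\text{GT} \) and each other, using the same weights machinery as the RGB branch.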

Experiments

Frequency regularization effect of Area Ray


Compared to FreeNeRF, which forcibly masks the high-frequency spectrum in the early training phase, ours adaptively regularizes the high-frequency components of additional ray samples based on the target pixel photo-consistency (i.e., the angle between the original ray and the Area Ray) throughout training. As a result, our ARC-NeRF already achieves sharper fine details at 25K iterations than the fully trained FreeNeRF.

Effectiveness of Area Ray as a bundle of rays


Our ARC-NeRF outperforms FlipNeRF in all scenarios by a large margin. The training time per scene is measured using the same GPU, number of iterations, and batch size. Circle size is proportional to \( \kappa \), i.e., the number of augmented rays per original ray.

Comparison with Other Baselines

Our ARC-NeRF achieves competitive rendering quality with better capturing fine details.

Qualitative comparison under the 4-view and 8-view settings.

FreeNeRF is trained with a black-and-white prior, which assumes the estimated black and white colors correspond to the background and the table, respectively; this is a strong, dataset-specific assumption, and FreeNeRF produces degenerate results without it. In contrast, our ARC-NeRF achieves competitive performance without any heuristic prior, since the Area Ray enables adaptive regularization of high-frequency components.


3-view; Notable improvement in the detail of the tail.


6-view; Apple surface textures are more stably reconstructed across changing views.


9-view; Brick textures are also more consistently reproduced.

Citation

Acknowledgements

This work was supported by NRF grant (2021R1A2C3006659) and IITP grant (RS-2021-II211343), both funded by MSIT of the Korean Government. The work was also supported by Samsung Electronics (IO201223-08260-01).