I'm interested in computer vision, deep learning, and neural rendering.
My current focus is on efficient training frameworks for NeRF/3D-GS and on training with synthetic data generated by diffusion models.
On the job market now—looking for research scientist/engineer roles where I can contribute, learn, and collaborate with great folks. Always happy to chat!
We present GenHMC, a generative diffusion framework that synthesizes photorealistic head-mounted camera (HMC) images from avatar renderings. By enabling high-quality unpaired training data generation, GenHMC facilitates scalable training of facial encoders for Codec Avatars and generalizes well across diverse identities and expressions.
We propose ROODI, a method for extracting and reconstructing 3D objects in the presence of occlusions using Gaussian Splatting. It first removes irrelevant splats with a KNN-based pruning strategy, then completes the occluded regions with a diffusion-based generative inpainting model, enabling high-quality geometry recovery even under heavy occlusion.
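For intuition, here is a minimal sketch of what a KNN-style pruning step could look like, assuming splat centers are stored as an (N, 3) array and that an initial set of target splats is available (e.g. lifted from a 2D segmentation). The `prune_splats_knn` helper, its inputs, and the distance threshold are illustrative, not the exact criterion used in ROODI.

```python
import numpy as np
from scipy.spatial import cKDTree

def prune_splats_knn(centers, target_mask, k=8, dist_thresh=0.05):
    """Keep splats whose k nearest target splats are close by; prune the rest.

    centers:     (N, 3) Gaussian splat centers.
    target_mask: (N,) boolean mask of splats initially tagged as the target
                 object (hypothetical input, e.g. lifted from a 2D mask).
    """
    tree = cKDTree(centers[target_mask])        # KD-tree over the target splats
    dists, _ = tree.query(centers, k=k)         # distance to the k nearest target splats
    keep = dists.mean(axis=1) < dist_thresh     # splats far from the object are irrelevant
    return centers[keep], keep
```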
We propose DivCon-NeRF, a novel ray augmentation method designed for few-shot novel view synthesis. By introducing surface-sphere and inner-sphere augmentation techniques, our method effectively balances ray diversity and geometric consistency, which helps suppress floaters and appearance artifacts often seen in sparse-input settings.
We introduce ARC-NeRF, a few-shot rendering method that casts area rays to cover a broader set of unseen viewpoints, improving spatial generalization with minimal input. In addition, we propose adaptive frequency regularization and a luminance consistency loss to further refine textures and high-frequency details in rendered outputs.
We introduce HL-CLIP, a CLIP-based video highlight detection framework that leverages the strong semantic alignment of pre-trained vision-language models. By fine-tuning the visual encoder and applying a saliency-based temporal pooling technique, our method achieves state-of-the-art performance with minimal domain-specific supervision.
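As a rough illustration of saliency-based temporal pooling, the snippet below weights per-frame CLIP features by their similarity to a query embedding; the function name, the softmax form, and the temperature value are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def saliency_pooling(frame_feats, query_feat, temperature=0.07):
    """Pool per-frame CLIP features with saliency weights.

    frame_feats: (T, D) visual features from the fine-tuned CLIP encoder.
    query_feat:  (D,)   text/query embedding; both assumed L2-normalized.
    """
    sal = frame_feats @ query_feat                           # (T,) frame-query similarities
    w = F.softmax(sal / temperature, dim=0)                  # saliency weights over time
    pooled = (w.unsqueeze(1) * frame_feats).sum(dim=0)       # (D,) clip-level feature
    return pooled, sal
```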
We present SR-TensoRF, a sun-aligned relighting approach for NeRF-style outdoor scenes that does not rely on environment maps. By aligning lighting with solar movement and using a cubemap-based TensoRF backbone, our method enables realistic and fast relighting for dynamic outdoor scenes with consistent directional light simulation.
We propose ConcatPlexer, a simple yet effective batching strategy that accelerates Vision Transformer (ViT) inference by concatenating visual tokens along an additional dimension. This approach preserves model accuracy while improving inference throughput, requiring no architectural changes and offering easy integration into existing ViT pipelines.
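The core batching trick can be sketched in a couple of lines, assuming patch tokens of shape (B, N, D) after the ViT embedding layer; the `group_size` parameter and the plain reshape are simplifications (the actual ConcatPlexer adds further components on top of this).

```python
import torch

def concat_batch(tokens, group_size=2):
    """Concatenate the tokens of `group_size` images along the token dimension.

    tokens: (B, C=N, D) patch tokens after the ViT embedding layer.
    Returns (B // group_size, group_size * N, D), so one forward pass through
    the transformer blocks processes several images at once.
    """
    B, N, D = tokens.shape
    assert B % group_size == 0, "batch must be divisible by group_size"
    return tokens.view(B // group_size, group_size * N, D)
```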
We present FlipNeRF, a framework that utilizes flipped reflection rays derived from input images to simulate novel training views. This approach enhances surface normal estimation and rendering fidelity, enabling better generalization in few-shot novel view synthesis without requiring additional images or supervision.
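The underlying geometry is simple to sketch: reflect the ray direction about the estimated surface normal at the estimated hit point, then cast a new ray back toward that point. The helper below captures only this geometric idea, with `back_dist` as a made-up parameter, not FlipNeRF's actual implementation.

```python
import torch

def flipped_reflection_ray(ray_d, normal, hit_point, back_dist=1.0):
    """Build a flipped reflection ray from an input ray and an estimated normal.

    ray_d:     (R, 3) unit directions of the original training rays.
    normal:    (R, 3) estimated unit surface normals at the hit points.
    hit_point: (R, 3) estimated ray-surface intersection points.
    """
    d_dot_n = (ray_d * normal).sum(dim=-1, keepdim=True)
    refl = ray_d - 2.0 * d_dot_n * normal      # mirror the direction about the normal
    new_o = hit_point + back_dist * refl       # back off along the reflected direction
    new_d = -refl                              # and look back at the same surface point
    return new_o, new_d
```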
We introduce MDPose, a real-time multi-person pose estimation method based on mixture density modeling. By randomly grouping keypoints and modeling their joint distribution without relying on person-specific instance IDs, MDPose achieves high accuracy and real-time performance even in crowded scenes with complex pose variations.
We present D-RMM, an end-to-end multi-object detection framework that models object locations using a regularized mixture density model. The training objective includes a novel Maximum Component Maximization (MCM) loss that prevents duplicate detections, resulting in improved accuracy and stability in both dense and sparse detection scenarios.
We propose MixNeRF, which models each camera ray as a mixture of Laplacian densities to better capture the multi-modal RGB distribution in sparsely sampled scenes. Our framework includes an auxiliary depth estimation task and a mixture regularization loss, allowing for more accurate novel view synthesis in few-shot NeRF settings.
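A generic mixture-of-Laplacians negative log-likelihood in this spirit looks as follows, assuming per-sample RGB means, scales, and volume-rendering weights along each ray; the exact parameterization in MixNeRF may differ.

```python
import torch

def laplacian_mixture_nll(rgb_gt, mu, beta, weights, eps=1e-6):
    """NLL of a ray's color under a per-ray mixture of Laplacians.

    rgb_gt:  (R, 3)    ground-truth ray colors.
    mu:      (R, S, 3) per-sample RGB means along each ray.
    beta:    (R, S, 3) per-sample Laplacian scales (> 0).
    weights: (R, S)    mixture weights, e.g. volume-rendering weights,
                       assumed to approximately sum to 1 over S.
    """
    diff = (rgb_gt.unsqueeze(1) - mu).abs()                    # (R, S, 3)
    log_comp = -(diff / beta) - torch.log(2.0 * beta)          # Laplace log-density per channel
    log_comp = log_comp.sum(dim=-1)                            # channels treated as independent
    log_mix = torch.logsumexp(torch.log(weights + eps) + log_comp, dim=1)
    return -log_mix.mean()
```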
We introduce MUM, a semi-supervised object detection framework that applies strong spatial data augmentation by mixing image tiles and unmixing their corresponding features. This strategy allows the model to benefit from mixed inputs without corrupting label supervision, leading to improved performance in low-label regimes on COCO and VOC benchmarks.
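A compact sketch of the Mix/UnMix operation: tiles are shuffled across the batch with a per-tile permutation, and the inverse of the same permutation restores the feature maps. The 2x2 grid matches a common default, but the function and its arguments are illustrative rather than the released implementation.

```python
import torch

def mix_unmix(x, perm, grid=2):
    """Shuffle (mix) spatial tiles across the batch; call with the inverse perm to unmix.

    x:    (B, C, H, W) images (mixing) or feature maps (unmixing),
          with H and W divisible by `grid`.
    perm: (grid, grid, B) long tensor; perm[i, j] permutes the batch for tile (i, j).
    """
    B, C, H, W = x.shape
    th, tw = H // grid, W // grid
    out = x.clone()
    for i in range(grid):
        for j in range(grid):
            tile = x[:, :, i*th:(i+1)*th, j*tw:(j+1)*tw]
            out[:, :, i*th:(i+1)*th, j*tw:(j+1)*tw] = tile[perm[i, j]]
    return out

# Mixing inputs:   mixed = mix_unmix(images, perm)
# Unmixing feats:  restored = mix_unmix(features, perm.argsort(dim=-1))
```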
Thanks for sharing the website template, Jon Barron. :)