- Introduction
Can we do 3D reconstruction from a single image of an object?
Recent 2D image generation work has achieved very strong performance by scaling up to large models
► Apply the same recipe to 3D reconstruction (with transformers)
1. Take an image and predict a NeRF in the form of a triplane representation (proposed in EG3D)
- Requires less computation than volume or point-cloud representations
2. Propose an encoder-decoder architecture (with a DINO image encoder)
- Method
Image Encoder
Uses a ViT pre-trained with DINO
Input: 512 x 512 x 3
Conv (patch embedding) output / ViT output: (32 x 32) tokens x 768 dims
Unlike the common practice of using only the CLS token, the entire feature sequence is used
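To make the shapes concrete, here is a minimal PyTorch sketch of this interface, with a small toy transformer standing in for the actual DINO-pretrained ViT (all names below are ours): a 512x512 image is patchified into a (32 x 32) = 1024-token sequence of 768-d features, and the full sequence is returned instead of a single CLS token.

```python
import torch
import torch.nn as nn

class ToyImageEncoder(nn.Module):
    """Toy stand-in for the DINO ViT; positional embeddings omitted for brevity."""
    def __init__(self, patch=16, dim=768, depth=2, heads=12):
        super().__init__()
        # Patchify: 512 / 16 = 32, so the image becomes a 32 x 32 token grid.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img):               # img: (B, 3, 512, 512)
        x = self.patch_embed(img)         # (B, 768, 32, 32)
        x = x.flatten(2).transpose(1, 2)  # (B, 1024, 768)
        return self.blocks(x)             # full token sequence {h_i}

h = ToyImageEncoder()(torch.randn(1, 3, 512, 512))
print(h.shape)  # torch.Size([1, 1024, 768])
```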
Image-to-triplane Decoder
Camera Features
The camera feature is built by concatenating the flattened 4x4 camera extrinsic matrix with the focal length and principal point
$\boldsymbol{c}=\left[\boldsymbol{E}_{1 \times 16}, \mathrm{foc}_x, \mathrm{foc}_y, \mathrm{pp}_x, \mathrm{pp}_y\right]$
LRM is not tied to any canonical object pose, and these camera parameters are used only during training
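A minimal sketch of assembling this 20-d camera feature (the helper name is ours; intrinsics are assumed to be already normalized):

```python
import torch

def camera_feature(E, foc_x, foc_y, pp_x, pp_y):
    """E: (4, 4) extrinsic matrix -> c = [E_{1x16}, foc_x, foc_y, pp_x, pp_y]."""
    intr = torch.tensor([foc_x, foc_y, pp_x, pp_y], dtype=E.dtype)
    return torch.cat([E.reshape(16), intr])  # (20,)

c = camera_feature(torch.eye(4), 1.0, 1.0, 0.5, 0.5)
print(c.shape)  # torch.Size([20])
```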
Triplane Representation
$\left(\boldsymbol{T}_{x y}, \boldsymbol{T}_{y z}, \boldsymbol{T}_{x z}\right)$
A 3D point is projected onto each plane, the corresponding features are interpolated (bilinearly sampled), and the result is passed through an MLP
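A minimal sketch of this triplane query, assuming points already lie in $[-1, 1]^3$; the axis pairing and the choice to sum the three plane features (rather than, e.g., concatenate them) are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, pts):
    """planes: (3, C, H, W) feature maps (T_xy, T_yz, T_xz); pts: (N, 3) in [-1, 1]."""
    coords = torch.stack([pts[:, [0, 1]],   # project onto xy
                          pts[:, [1, 2]],   # project onto yz
                          pts[:, [0, 2]]])  # project onto xz -> (3, N, 2)
    feats = F.grid_sample(planes, coords.unsqueeze(2),   # (3, C, N, 1)
                          mode='bilinear', align_corners=False)
    return feats.squeeze(-1).sum(dim=0).transpose(0, 1)  # (N, C)

planes = torch.randn(3, 32, 64, 64)       # toy plane sizes
pts = torch.rand(4096, 3) * 2 - 1
print(query_triplane(planes, pts).shape)  # torch.Size([4096, 32])
```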
Modulation with Camera Features
Camera-modulated layer norm (ModLN)
$\begin{aligned} \gamma, \beta & =\operatorname{MLP}^{\bmod }(\tilde{\boldsymbol{c}}) \\ \operatorname{ModLN}_{\mathrm{c}}\left(\boldsymbol{f}_j\right) & =\operatorname{LN}\left(\boldsymbol{f}_j\right) \cdot(1+\gamma)+\beta\end{aligned}$
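A minimal sketch of $\operatorname{ModLN}_c$, with a single linear layer standing in for $\operatorname{MLP}^{\bmod}$:

```python
import torch
import torch.nn as nn

class ModLN(nn.Module):
    def __init__(self, dim=768, cam_dim=20):
        super().__init__()
        # LayerNorm without its own affine params; gamma/beta come from c.
        self.ln = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp_mod = nn.Linear(cam_dim, 2 * dim)  # predicts [gamma, beta]

    def forward(self, f, c):  # f: (B, N, dim) tokens, c: (B, cam_dim) camera feature
        gamma, beta = self.mlp_mod(c).chunk(2, dim=-1)
        return self.ln(f) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

f = torch.randn(2, 3072, 768)  # tokens to modulate (count is arbitrary here)
c = torch.randn(2, 20)
print(ModLN()(f, c).shape)     # torch.Size([2, 3072, 768])
```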
Transformer Layers
$\begin{aligned} \boldsymbol{f}_j^{\text {cross }} & =\operatorname{CrossAttn}\left(\operatorname{ModLN}_c\left(\boldsymbol{f}_j^{\text {in }}\right) ;\left\{\boldsymbol{h}_i\right\}_{i=1}^n\right)+\boldsymbol{f}_j^{\text {in }} \\ \boldsymbol{f}_j^{\text {self }} & =\operatorname{SelfAttn}\left(\operatorname{ModLN}_c\left(\boldsymbol{f}_j^{\text {cross }}\right) ;\left\{\operatorname{ModLN}_c\left(\boldsymbol{f}_j^{\text {cross }}\right)\right\}_j\right)+\boldsymbol{f}_j^{\text {cross }} \\ \boldsymbol{f}_j^{\text {out }} & =\operatorname{MLP}^{\text {tfm }}\left(\operatorname{ModLN}_c\left(\boldsymbol{f}_j^{\text {self }}\right)\right)+\boldsymbol{f}_j^{\text {self }}\end{aligned}$
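Putting the three residual equations together, one decoder block can be sketched as below (ModLN is repeated from the previous sketch so the block runs standalone; head count and MLP width are assumptions):

```python
import torch
import torch.nn as nn

class ModLN(nn.Module):  # same as the previous sketch
    def __init__(self, dim=768, cam_dim=20):
        super().__init__()
        self.ln = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp_mod = nn.Linear(cam_dim, 2 * dim)

    def forward(self, f, c):
        gamma, beta = self.mlp_mod(c).chunk(2, dim=-1)
        return self.ln(f) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

class DecoderBlock(nn.Module):
    def __init__(self, dim=768, heads=12, cam_dim=20):
        super().__init__()
        self.mod1, self.mod2, self.mod3 = (ModLN(dim, cam_dim) for _ in range(3))
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, f, h, c):
        # f: (B, N_tri, dim) triplane tokens; h: (B, N_img, dim) image tokens {h_i}
        q = self.mod1(f, c)
        f = f + self.cross(q, h, h, need_weights=False)[0]      # f^cross
        q = self.mod2(f, c)
        f = f + self.self_attn(q, q, q, need_weights=False)[0]  # f^self
        f = f + self.mlp(self.mod3(f, c))                       # f^out
        return f

f = torch.randn(2, 3072, 768)  # triplane tokens
h = torch.randn(2, 1024, 768)  # image tokens from the encoder
c = torch.randn(2, 20)         # camera feature
print(DecoderBlock()(f, h, c).shape)  # torch.Size([2, 3072, 768])
```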
Triplane NeRF
RGB and density are predicted from the queried triplane features via $\mathrm{MLP}^{\text{nerf}}$
Dimension (4) = RGB (3) + Density (1)
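A sketch of $\mathrm{MLP}^{\text{nerf}}$ as a small MLP over a queried triplane feature; the hidden widths and output activations below are assumptions:

```python
import torch
import torch.nn as nn

# Map each queried triplane feature to 4 channels: RGB (3) + density (1).
mlp_nerf = nn.Sequential(
    nn.Linear(768, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),
)

feat = torch.randn(4096, 768)  # a batch of queried point features (dim assumed)
out = mlp_nerf(feat)           # (4096, 4)
rgb, sigma = out[..., :3].sigmoid(), out[..., 3:].relu()  # activation choice assumed
```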
Training Objectives
$\mathcal{L}_{\text {recon }}(\boldsymbol{x})=\frac{1}{V} \sum_{v=1}^V\left(\mathcal{L}_{\mathrm{MSE}}\left(\hat{\boldsymbol{x}}_v, \boldsymbol{x}_v^{G T}\right)+\lambda \mathcal{L}_{\mathrm{LPIPS}}\left(\hat{\boldsymbol{x}}_v, \boldsymbol{x}_v^{G T}\right)\right)$
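A sketch of this objective, assuming the `lpips` package for the perceptual term and images in [-1, 1]:

```python
import torch
import lpips  # pip install lpips; using this package is our assumption

lpips_fn = lpips.LPIPS(net='vgg')  # perceptual distance L_LPIPS

def recon_loss(pred, gt, lam=1.0):
    """pred, gt: (V, 3, H, W) rendered and ground-truth views in [-1, 1]."""
    mse = ((pred - gt) ** 2).mean(dim=(1, 2, 3))  # L_MSE per view
    perc = lpips_fn(pred, gt).view(-1)            # L_LPIPS per view
    return (mse + lam * perc).mean()              # average over the V views
```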
- Experiment
Dataset
- Objaverse
- MVImgNet
Camera Normalization
- Normalize the input camera pose to [0, -2, 0] for Objaverse
- Normalize the input camera pose to [0, -dis, 0] for video data (dis: distance between the world origin and the camera origin)
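One way to implement this, as a sketch under assumed conventions (camera-to-world extrinsics, OpenCV-style camera axes, world +z up): re-express every pose relative to the input view, then place the input camera on the -y axis at distance dis, looking at the origin.

```python
import torch

def normalize_cameras(extrinsics, dis=2.0):
    """extrinsics: (V, 4, 4) camera-to-world matrices; view 0 is the input view."""
    # Canonical camera-to-world: position (0, -dis, 0), camera +z (forward)
    # pointing at the origin, world +z up. Axis conventions are assumptions.
    canon = torch.tensor([[1., 0., 0., 0.],
                          [0., 0., 1., -dis],
                          [0., -1., 0., 0.],
                          [0., 0., 0., 1.]])
    rel = torch.linalg.inv(extrinsics[0])  # world -> input-camera frame
    return canon @ rel @ extrinsics        # the input view maps exactly to canon
```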
Network Architecture
- ViT-B/16 (DINO)
- Input: 512x512
Training
- 128 x A100 (40G)
- Batch size 1024
- Epochs 30
- 3 Days
Inference
- ~5 seconds on an A100
- Discussion
- Reference
[1] Hong, Yicong, et al. "LRM: Large Reconstruction Model for Single Image to 3D." ICLR 2024.