- Introduction
For image-conditioned models such as pixel-NeRF, rendering quality degrades significantly when pixels in the target view cannot be observed in the input image.
This paper therefore extracts global features with a ViT and local features with a CNN, and incorporates both to improve rendering quality in occluded regions.
For example, to render a car wheel that is not visible in the input view, an image-conditioned model can only query the features projected along the ray, which carry little information about the occluded part.
Instead, the method in this paper uses self-attention to learn long-range dependencies and retrieve the features most useful for the query pixel.
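A minimal sketch of the scaled dot-product self-attention behind this idea (generic, not the paper's exact implementation): every token can attend to every other token, so information from visible regions can reach queries about occluded ones.

```python
# Minimal self-attention sketch: each token mixes information from all tokens,
# regardless of spatial distance (long-range dependencies).
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """x: (N, D) patch tokens; w_q/w_k/w_v: (D, D) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # project tokens
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)    # (N, N) weights
    return attn @ v                                           # weighted mixture

# Usage: 64 tokens of dimension 128 attend to one another.
D = 128
tokens = torch.randn(64, D)
out = self_attention(tokens, torch.randn(D, D), torch.randn(D, D), torch.randn(D, D))
```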
- Method
Novel View Synthesis From a Single Image
Goal: infer a 3D representation from a single image
Synthesizing Occluded Regions
- 1D latent code
- $(\sigma, \mathbf{c})=\mathcal{F}_{1 \mathrm{D}}(\mathbf{z} ; \mathbf{x} ; \mathbf{d})$
- $\mathbf{z}$: 1D global latent vector
- $\mathbf{x}$: Position
- $\mathbf{d}$: Viewing direction
- 2D spatially-variant image feature
- $(\sigma, \mathbf{c})=\mathcal{F}_{2 \mathrm{D}}\left(\mathbf{W}(\pi(\mathbf{x})) ; \mathbf{x}_{c} ; \mathbf{d}_{c}\right)$
- $\mathbf{W}(\pi(\mathbf{x}))$: Image feature map sampled at the 2D projection $\pi(\mathbf{x})$ of the query point (see the projection sketch after this list)
- $\mathbf{x}_c$: 3D position in the source-camera frame
- $\mathbf{d}_c$: Viewing direction in the source-camera frame
- 3D volume-based approaches
- $(\sigma, \mathbf{c})=\mathcal{F}_{3 \mathrm{D}}\left(\mathbf{W}\left(\pi\left(\mathbf{x}_{n}\right)\right) ; \mathbf{x}_{n}\right)$
- $\mathbf{W}(\pi(\mathbf{x}_n))$: Feature map sampled at the projections of the neighboring voxels
- $\mathbf{x}$: 3D position of the query point
- $\mathbf{x}_n$: 3D positions of the voxels neighboring $\mathbf{x}$
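A minimal sketch of the 2D spatially-variant query $\mathbf{W}(\pi(\mathbf{x}))$, assuming a pinhole camera with intrinsics `K` and points already expressed in the source-camera frame (illustrative assumptions, not the paper's exact pipeline):

```python
# Project a 3D point onto the source image plane and bilinearly sample the
# feature map at that pixel location: feature = W(pi(x)).
import torch
import torch.nn.functional as F

def query_feature(W: torch.Tensor, x_c: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """W: (C, H, W) feature map; x_c: (B, 3) points in camera coords; K: (3, 3)."""
    uv = (K @ x_c.T).T                       # perspective projection pi(x)
    uv = uv[:, :2] / uv[:, 2:3]              # divide by depth -> pixel coords
    H, Wd = W.shape[1], W.shape[2]
    # normalize pixel coordinates to [-1, 1] as required by grid_sample
    grid = torch.stack([2 * uv[:, 0] / (Wd - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).view(1, -1, 1, 2)
    feat = F.grid_sample(W.unsqueeze(0), grid, align_corners=True)
    return feat[0, :, :, 0].T                # (B, C) feature per queried point

# Usage with an example intrinsics matrix and two query points.
W_map = torch.randn(64, 32, 32)
K = torch.tensor([[30.0, 0.0, 16.0], [0.0, 30.0, 16.0], [0.0, 0.0, 1.0]])
pts = torch.tensor([[0.1, -0.2, 2.0], [0.0, 0.0, 3.0]])
print(query_feature(W_map, pts, K).shape)    # torch.Size([2, 64])
```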
While 3D methods offer advantages such as shape refinement and better visual quality, they are computationally expensive.
► Ours combines the 1D-based approach, which encodes global information, with the 2D-based approach, which leverages local information.
ViT architecture - Global features
- Input: $\mathbf{I}_{s} \in \mathbb{R}^{H \times W \times 3}$
- Reshape: $\mathbf{P} \in \mathbb{R}^{N \times P^{2} \times 3},\ N=\frac{H W}{P^{2}}$ patches
- Project: $\mathbf{P}_{l} \in \mathbb{R}^{N \times D}$, cls token as 'background' token
- Positional encodings: $\mathbf{P}_{e}^{i}=\mathbf{P}_{l}^{i}+\mathbf{e}^{i}$
- Transformer encoder with $J$ layers; the output of the $j$-th layer is $f^{j}$
- Convolutional decoder
- Drop class token: $\mathcal{O}: \mathbb{R}^{(N+1) \times D} \rightarrow \mathbb{R}^{N \times D}$
- Unflatten: $\mathcal{U}: \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{\frac{H}{P} \times \frac{W}{P} \times D}$
- Convolutional decoder $\mathcal{D}$: upsamples the unflattened token grid back toward the image resolution to produce spatial feature maps
- Feature maps: $\mathbf{W}_{G}^{j}=(\mathcal{D} \circ \mathcal{U} \circ \mathcal{O})\left(f^{j}\right), \text { where } j \in\{0,1, \ldots, J\}$
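A minimal sketch of the reassembly pipeline $\mathbf{W}_{G}^{j}=(\mathcal{D} \circ \mathcal{U} \circ \mathcal{O})(f^{j})$; the patch size, channel widths, and transposed-convolution decoder below are illustrative choices, not the paper's exact configuration.

```python
# Turn the tokens f^j from the j-th ViT layer into a spatial feature map:
# drop the class token (O), unflatten to a grid (U), then decode with convs (D).
import torch
import torch.nn as nn

P, D, H, W = 16, 768, 128, 128            # patch size, token dim, image size

def to_feature_map(f_j: torch.Tensor, decoder: nn.Module) -> torch.Tensor:
    """f_j: (1, N+1, D) tokens from the j-th transformer layer."""
    tokens = f_j[:, 1:, :]                                       # O: drop cls token
    grid = tokens.transpose(1, 2).reshape(1, D, H // P, W // P)  # U: unflatten
    return decoder(grid)                                         # D: conv decoder

# Illustrative decoder: upsample the coarse token grid toward image resolution.
decoder = nn.Sequential(
    nn.ConvTranspose2d(D, 256, kernel_size=2, stride=2),
    nn.ReLU(),
    nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2),
)
f_j = torch.randn(1, (H // P) * (W // P) + 1, D)
W_G_j = to_feature_map(f_j, decoder)      # (1, 128, H/4, W/4)
```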
2D CNN - Local features
- Local features: $\mathbf{W}_{L}=\mathcal{G}_{L}\left(\mathbf{I}_{s}\right), \mathcal{G}_{L}: \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times D_{L}}$
Hybrid feature map: $\mathbf{W}=\mathcal{G}\left(\mathbf{W}_{G}^{0}, \mathbf{W}_{G}^{1}, \ldots, \mathbf{W}_{G}^{J} ; \mathbf{W}_{L}\right)$
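A minimal sketch of one possible fusion operator $\mathcal{G}$ (resize, concatenate, 1x1 conv); the paper's actual fusion module may differ.

```python
# Fuse multi-layer global ViT maps with the local CNN map into one hybrid map W.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse(global_maps, local_map, proj: nn.Conv2d) -> torch.Tensor:
    """global_maps: list of (1, C_g, h, w); local_map: (1, C_l, H/2, W/2)."""
    target = local_map.shape[-2:]
    resized = [F.interpolate(g, size=target, mode='bilinear', align_corners=False)
               for g in global_maps]                       # align all resolutions
    return proj(torch.cat(resized + [local_map], dim=1))   # concat + 1x1 conv

# Usage: two global maps (C_g=128) and one local map (C_l=64) fused to D=256.
g_maps = [torch.randn(1, 128, 32, 32), torch.randn(1, 128, 16, 16)]
l_map = torch.randn(1, 64, 64, 64)
proj = nn.Conv2d(128 * 2 + 64, 256, kernel_size=1)
W_hyb = fuse(g_maps, l_map, proj)                          # (1, 256, 64, 64)
```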
Volumetric Rendering with NeRF
Positional encoding: $\gamma(p)=\left(\sin \left(2^{0} \pi p\right), \cos \left(2^{0} \pi p\right), \ldots, \sin \left(2^{M-1} \pi p\right), \cos \left(2^{M-1} \pi p\right)\right)$
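A minimal sketch of this positional encoding with $M$ frequency bands:

```python
# NeRF-style positional encoding gamma(p): sin/cos at M exponentially spaced frequencies.
import math
import torch

def positional_encoding(p: torch.Tensor, M: int = 10) -> torch.Tensor:
    """p: (..., 3) coordinates; returns (..., 3 * 2 * M) encoded features."""
    freqs = (2.0 ** torch.arange(M, dtype=torch.float32)) * math.pi   # 2^0 pi .. 2^(M-1) pi
    angles = p.unsqueeze(-1) * freqs                                   # (..., 3, M)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)    # sin and cos per band
    return enc.flatten(start_dim=-2)                                   # (..., 3 * 2M)

x = torch.rand(4, 3)                               # 4 sample points
print(positional_encoding(x).shape)                # torch.Size([4, 60])
```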
MLP outputs: $(\sigma, \mathbf{c})=\operatorname{MLP}\left(\gamma\left(\mathbf{x}_{c}\right) ; \mathbf{d}_{c} ; \mathbf{W}(\pi(\mathbf{x}))\right)$
Render target view: $\hat{\mathbf{C}}(\mathbf{r})=\int_{t_{n}}^{t_{f}} T(t) \sigma(t) \mathbf{c}(t)dt$, $T(t)=\exp \left(-\int_{t_{n}}^{t} \sigma(\mathbf{r}(s)) d s\right)$
Loss: mean squared error between the rendered and ground-truth pixel colors, $\mathcal{L}=\sum_{\mathbf{r}}\left\|\hat{\mathbf{C}}(\mathbf{r})-\mathbf{C}(\mathbf{r})\right\|_{2}^{2}$
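A minimal sketch of the discrete quadrature commonly used to approximate the rendering integral, together with the per-ray MSE loss; the sample count and near/far bounds below are illustrative.

```python
# Composite color along one ray: alpha = 1 - exp(-sigma * delta), T = cumulative
# transparency, C_hat = sum_i T_i * alpha_i * c_i; then MSE against ground truth.
import torch

def render_ray(sigma: torch.Tensor, color: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """sigma: (S,) densities, color: (S, 3), t: (S,) sample depths along one ray."""
    delta = torch.cat([t[1:] - t[:-1], torch.full((1,), 1e10)])              # bin widths
    alpha = 1.0 - torch.exp(-sigma * delta)                                  # per-bin opacity
    T = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)   # transmittance
    return (T * alpha).unsqueeze(-1).mul(color).sum(dim=0)                   # composited RGB

t = torch.linspace(2.0, 6.0, 64)                   # samples between t_n and t_f
sigma, color = torch.rand(64), torch.rand(64, 3)
pred = render_ray(sigma, color, t)
loss = torch.mean((pred - torch.rand(3)) ** 2)     # MSE against ground-truth pixel color
```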
- Experiment
Datasets
- Category-specific view synthesis
- Category-agnostic view synthesis
- ShapeNet: 13 categories
- View synthesis on real images
- Trained on ShapeNet, tested on the Stanford Cars dataset
- Background removed with a segmentation model
- Discussion
- Reference
[1] Lin, Kai-En, et al. "Vision Transformer for NeRF-Based View Synthesis from a Single Input Image." WACV 2023.