[WACV 2023] Vision Transformer for NeRF-Based View Synthesis from a Single Input Image

이성훈 Ethan 2023. 4. 15. 22:37

- Introduction

For image-conditioned models such as pixel-NeRF, rendering quality degrades significantly when a pixel in the target view cannot be observed in the input image.

 

This paper therefore extracts global features with a ViT and local features with a CNN, then combines the two, with the goal of improving rendering quality in occluded regions.

 

 

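For example, to render a car wheel that is not visible in the input view, an image-conditioned model queries features along the ray, so the features it retrieves come from whatever visible surface the ray projects onto rather than from the wheel itself.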

 

The proposed method uses self-attention to learn long-range dependencies, which lets it find the useful features that are most relevant to the query pixel.


- Method

 

Novel View Synthesis From a Single Image

 

Goal: infer a 3D representation from a single image

 

 

 

Synthesizing Occluded Regions

  • 1D latent code
    • $(\sigma, \mathbf{c})=\mathcal{F}_{1 \mathrm{D}}(\mathbf{z} ; \mathbf{x} ; \mathbf{d})$
    • $\mathbf{z}$: 1D global latent vector
    • $\mathbf{x}$: Position
    • $\mathbf{d}$: Viewing direction

 

  • 2D spatially-variant image feature
    • $(\sigma, \mathbf{c})=\mathcal{F}_{2 \mathrm{D}}\left(\mathbf{W}(\pi(\mathbf{x})) ; \mathbf{x}_{c} ; \mathbf{d}_{c}\right)$
    • $\mathbf{W}(\pi(\mathbf{x}))$: Feature map with spatial information
    • $\mathbf{x}_c$: 3D position
    • $\mathbf{d}_c$: Viewing direction
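
The 2D formulation depends on the lookup $\mathbf{W}(\pi(\mathbf{x}))$: project the 3D point onto the image plane, then sample the feature map at that pixel. The following is a minimal sketch of that lookup, not the paper's implementation. It assumes a pinhole camera, a point already expressed in camera coordinates, and bilinear interpolation; the names `project`, `sample_bilinear`, `focal`, `cx`, and `cy` are hypothetical.

```python
import math

def project(x_c, focal, cx, cy):
    """Pinhole projection pi(x): camera-space point (X, Y, Z) -> pixel (u, v)."""
    X, Y, Z = x_c
    return focal * X / Z + cx, focal * Y / Z + cy

def sample_bilinear(W, u, v):
    """Bilinearly sample feature map W[row][col][channel] at pixel (u, v)."""
    h, w = len(W), len(W[0])
    # Clamp so that queries outside the image reuse the border features.
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    channels = len(W[0][0])
    return [
        (1 - dv) * ((1 - du) * W[v0][u0][c] + du * W[v0][u1][c])
        + dv * ((1 - du) * W[v1][u0][c] + du * W[v1][u1][c])
        for c in range(channels)
    ]

# Toy 2x2 feature map with a single channel per pixel (illustrative values).
W = [[[0.0], [1.0]],
     [[2.0], [3.0]]]
u, v = project((0.0, 0.0, 2.0), focal=1.0, cx=0.5, cy=0.5)
feature = sample_bilinear(W, u, v)
```

Every point along the same camera ray projects to the same pixel, so all of them receive the same feature vector from the input view. This is why a purely pixel-aligned lookup has little to offer for points hidden behind a visible surface.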

 

  • 3D volume-based approaches
    • $(\sigma, \mathbf{c})=\mathcal{F}_{3 \mathrm{D}}\left(\mathbf{W}\left(\pi\left(\mathbf{x}_{n}\right)\right) ; \mathbf{x}_{n}\right)$
    • $\mathbf{W}(\pi(\mathbf{x}_n))$: Feature map sampled at the projections of the neighboring voxels
    • $\mathbf{x}$: 3D position
    • $\mathbf{x}_n$: 3D positions of the neighboring voxels of $\mathbf{x}$

 

3D methods offer advantages such as shape refinement and better visual quality, but they are computationally expensive.

 

► The proposed method combines the 1D-based approach, which encodes global information, with the 2D-based approach, which leverages local information.

 

 

 

ViT architecture - Global features

  • Input: $\mathbf{I}_{s} \in \mathbb{R}^{H \times W \times 3}$
  • Reshape: $\mathbf{P} \in \mathbb{R}^{N \times P^{2} \times 3},\  N=\frac{H W}{P^{2}}$ patches
  • Project: $\mathbf{P}_{l} \in \mathbb{R}^{N \times D}$, plus a class (cls) token that serves as a 'background' token
    • Positional encodings: $\mathbf{P}_{e}^{i}=\mathbf{P}_{l}^{i}+\mathbf{e}^{i}$
  • Transformer encoder: $J$ layers; the output of the $j$-th layer is $f^j$
  • Convolutional decoder
    • Drop class token: $\mathcal{O}: \mathbb{R}^{(N+1) \times D} \rightarrow \mathbb{R}^{N \times D}$
    • Unflatten: $\mathcal{U}: \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{\frac{H}{P} \times \frac{W}{P} \times D}$
    • Convolutional decoder: $\mathcal{D}$, applied to the unflattened map in $\mathbb{R}^{\frac{H}{P} \times \frac{W}{P} \times D}$
    • Feature maps: $\mathbf{W}_{G}^{j}=(\mathcal{D} \circ \mathcal{U} \circ \mathcal{O})\left(f^{j}\right), \text { where } j \in\{0,1, \ldots, J\}$
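
The shape bookkeeping in the list above can be traced with a small sketch. This is illustrative only: it uses nested Python lists instead of tensors, assumes the class token sits at index 0 of the token sequence (the usual ViT layout, not stated in these notes), and replaces the encoder output $f^j$ with stand-in values. The helper names are hypothetical.

```python
def patchify(image, P):
    """Split an H x W x 3 image (nested lists) into N = HW / P^2 patches of P^2 pixels."""
    H, W = len(image), len(image[0])
    assert H % P == 0 and W % P == 0, "image size must be divisible by the patch size"
    patches = []
    for top in range(0, H, P):
        for left in range(0, W, P):
            patch = [image[top + i][left + j] for i in range(P) for j in range(P)]
            patches.append(patch)  # P^2 pixels, each with 3 channels
    return patches

def drop_cls_and_unflatten(tokens, H, W, P):
    """Apply O then U: drop the class token, reshape N tokens to an (H/P) x (W/P) grid."""
    spatial = tokens[1:]  # O: (N+1) x D -> N x D (assumes the class token comes first)
    rows, cols = H // P, W // P
    assert len(spatial) == rows * cols
    return [spatial[r * cols:(r + 1) * cols] for r in range(rows)]  # U

H, W, P, D = 4, 6, 2, 8
image = [[[0.0, 0.0, 0.0] for _ in range(W)] for _ in range(H)]
patches = patchify(image, P)
# Stand-in for an encoder output f^j: one class token plus N tokens of width D.
tokens = [[float(n)] * D for n in range(len(patches) + 1)]
grid = drop_cls_and_unflatten(tokens, H, W, P)
```

With $H=4$, $W=6$, $P=2$ this gives $N=6$ patches and a $2 \times 3 \times D$ grid, which is the input a convolutional decoder would then process.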

 

2D CNN - Local feature 

  • Local features: $\mathbf{W}_{L}=\mathcal{G}_{L}\left(\mathbf{I}_{s}\right), \mathcal{G}_{L}: \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times D_{L}}$

 

Hybrid feature map: $\mathbf{W}=\mathcal{G}\left(\mathbf{W}_{G}^{0}, \mathbf{W}_{G}^{1}, \ldots, \mathbf{W}_{G}^{J} ; \mathbf{W}_{L}\right)$

 

 

Volumetric Rendering with NeRF

 

Positional encoding: $\gamma(p)=\left(\sin \left(2^{0} \pi p\right), \cos \left(2^{0} \pi p\right), \ldots, \sin \left(2^{M-1} \pi p\right), \cos \left(2^{M-1} \pi p\right)\right)$
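
A direct transcription of $\gamma(p)$ into code may help make the output size concrete. This is a sketch under assumptions: the encoding is applied independently to each coordinate of the 3D position and the results are concatenated, and the value $M=6$ in the usage lines is arbitrary rather than taken from the paper.

```python
import math

def positional_encoding(p, M):
    """gamma(p) = (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^(M-1) pi p), cos(2^(M-1) pi p))."""
    out = []
    for k in range(M):
        angle = (2 ** k) * math.pi * p
        out.extend([math.sin(angle), math.cos(angle)])
    return out

def encode_point(x, M):
    """Apply gamma to each coordinate of a 3D point and concatenate the results."""
    return [value for coord in x for value in positional_encoding(coord, M)]

gamma = positional_encoding(0.5, M=2)          # 2 * M = 4 values
encoded = encode_point((0.1, 0.2, 0.3), M=6)   # 3 * 2 * M = 36 values
```

Each scalar expands to $2M$ values, so a 3D position becomes a $6M$-dimensional input to the MLP.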

 

MLP outputs: $(\sigma, \mathbf{c})=\operatorname{MLP}\left(\gamma\left(\mathbf{x}_{c}\right) ; \mathbf{d}_{c} ; \mathbf{W}(\pi(\mathbf{x}))\right)$

 

Render target view: $\hat{\mathbf{C}}(\mathbf{r})=\int_{t_{n}}^{t_{f}} T(t) \sigma(t) \mathbf{c}(t)dt$, $T(t)=\exp \left(-\int_{t_{n}}^{t} \sigma(\mathbf{r}(s)) d s\right)$
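
In practice the integral is evaluated numerically. The sketch below uses the piecewise-constant quadrature from the original NeRF formulation, $\hat{\mathbf{C}}(\mathbf{r}) \approx \sum_i T_i\left(1-e^{-\sigma_i \delta_i}\right)\mathbf{c}_i$ with $T_i=\exp\left(-\sum_{j<i}\sigma_j\delta_j\right)$. These notes do not state which discretization the paper uses, so treat this choice, the function name `render_ray`, and the toy sample values as assumptions.

```python
import math

def render_ray(sigmas, colors, t_vals, t_far):
    """Approximate C(r) = int T(t) sigma(t) c(t) dt with piecewise-constant samples."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # T at the near bound: nothing has absorbed light yet
    weights = []
    for i, (sigma, color) in enumerate(zip(sigmas, colors)):
        t_next = t_vals[i + 1] if i + 1 < len(t_vals) else t_far
        delta = t_next - t_vals[i]               # length of this ray segment
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha             # accumulates exp(-sum sigma_j delta_j)
    return pixel, weights

# Three samples along one ray: empty space, then a dense red and a dense green sample.
t_vals = [0.0, 1.0, 2.0]
sigmas = [0.0, 50.0, 50.0]
colors = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pixel, weights = render_ray(sigmas, colors, t_vals, t_far=3.0)
```

In this toy ray the dense red sample absorbs almost all of the transmittance, so the green sample behind it contributes essentially nothing to the pixel. That is the occlusion behavior $T(t)$ encodes.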

 

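Loss: $\mathcal{L}=\sum_{\mathbf{r}}\left\|\hat{\mathbf{C}}(\mathbf{r})-\mathbf{C}(\mathbf{r})\right\|_{2}^{2}$, the L2 error between the rendered color $\hat{\mathbf{C}}(\mathbf{r})$ and the ground-truth pixel color $\mathbf{C}(\mathbf{r})$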


- Experiment

 

Datasets

  • Category-specific view synthesis
  • Category-agnostic view synthesis
    • ShapeNet: 13 categories
  • View synthesis on real images
    • Trained on ShapeNet, tested on Stanford Cars
    • Removed background with segmentation model

 

 


- Discussion

 


- Reference

[1] Lin, Kai-En, et al. "Vision Transformer for NeRF-Based View Synthesis from a Single Input Image." WACV 2023.