- Introduction
For image-conditioned models such as pixel-NeRF, rendering quality degrades significantly when pixels in the target view cannot be observed in the input image.
This paper therefore extracts global features with a ViT and local features with a CNN, and incorporates both to improve rendering quality in occluded regions.
For example, to render a car wheel that is not visible in the input view, an image-conditioned model can only query the features projected along the ray, which carry little information about the occluded part.
Instead, the method in this paper uses self-attention to learn long-range dependencies and retrieve the features most useful for the query pixel.
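A minimal sketch of the scaled dot-product self-attention behind this idea (generic, not the paper's exact implementation): every token can attend to every other token, so information from visible regions can reach queries about occluded ones.

```python
# Minimal self-attention sketch: each token mixes information from all tokens,
# regardless of spatial distance (long-range dependencies).
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """x: (N, D) patch tokens; w_q/w_k/w_v: (D, D) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # project tokens
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)    # (N, N) weights
    return attn @ v                                           # weighted mixture

# Usage: 64 tokens of dimension 128 attend to one another.
D = 128
tokens = torch.randn(64, D)
out = self_attention(tokens, torch.randn(D, D), torch.randn(D, D), torch.randn(D, D))
```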
- Method
Novel View Synthesis From a Single Image
Goal: infer a 3D representation from a single image
Synthesizing Occluded Regions
- 1D latent code
- $(\sigma, \mathbf{c})=\mathcal{F}_{1 \mathrm{D}}(\mathbf{z} ; \mathbf{x} ; \mathbf{d})$
- $\mathbf{z}$: 1D global latent vector
- $\mathbf{x}$: Position
- $\mathbf{d}$: Viewing direction
- 2D spatially-variant image feature
- $(\sigma, \mathbf{c})=\mathcal{F}_{2 \mathrm{D}}\left(\mathbf{W}(\pi(\mathbf{x})) ; \mathbf{x}_{c} ; \mathbf{d}_{c}\right)$
- $\mathbf{W}(\pi(\mathbf{x}))$: Image feature map sampled at the 2D projection $\pi(\mathbf{x})$ of the query point (see the projection sketch after this list)
- $\mathbf{x}_c$: 3D position in the source-camera frame
- $\mathbf{d}_c$: Viewing direction in the source-camera frame
- 3D volume-based approaches
- $(\sigma, \mathbf{c})=\mathcal{F}_{3 \mathrm{D}}\left(\mathbf{W}\left(\pi\left(\mathbf{x}_{n}\right)\right) ; \mathbf{x}_{n}\right)$
- $\mathbf{W}(\pi(\mathbf{x}_n))$: Feature map sampled at the projections of the neighboring voxels
- $\mathbf{x}$: 3D position of the query point
- $\mathbf{x}_n$: 3D positions of the voxels neighboring $\mathbf{x}$
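A minimal sketch of the 2D spatially-variant query $\mathbf{W}(\pi(\mathbf{x}))$, assuming a pinhole camera with intrinsics `K` and points already expressed in the source-camera frame (illustrative assumptions, not the paper's exact pipeline):

```python
# Project a 3D point onto the source image plane and bilinearly sample the
# feature map at that pixel location: feature = W(pi(x)).
import torch
import torch.nn.functional as F

def query_feature(W: torch.Tensor, x_c: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """W: (C, H, W) feature map; x_c: (B, 3) points in camera coords; K: (3, 3)."""
    uv = (K @ x_c.T).T                       # perspective projection pi(x)
    uv = uv[:, :2] / uv[:, 2:3]              # divide by depth -> pixel coords
    H, Wd = W.shape[1], W.shape[2]
    # normalize pixel coordinates to [-1, 1] as required by grid_sample
    grid = torch.stack([2 * uv[:, 0] / (Wd - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).view(1, -1, 1, 2)
    feat = F.grid_sample(W.unsqueeze(0), grid, align_corners=True)
    return feat[0, :, :, 0].T                # (B, C) feature per queried point

# Usage with an example intrinsics matrix and two query points.
W_map = torch.randn(64, 32, 32)
K = torch.tensor([[30.0, 0.0, 16.0], [0.0, 30.0, 16.0], [0.0, 0.0, 1.0]])
pts = torch.tensor([[0.1, -0.2, 2.0], [0.0, 0.0, 3.0]])
print(query_feature(W_map, pts, K).shape)    # torch.Size([2, 64])
```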
While 3D methods offer advantages such as shape refinement and better visual quality, they are computationally expensive.
► Ours combines the 1D-based approach, which encodes global information, with the 2D-based approach, which leverages local information.
ViT architecture - Global features
- Input: $\mathbf{I}_{s} \in \mathbb{R}^{H \times W \times 3}$
- Reshape: $\mathbf{P} \in \mathbb{R}^{N \times P^{2} \times 3},\ N=\frac{H W}{P^{2}}$ patches
- Project: $\mathbf{P}_{l} \in \mathbb{R}^{N \times D}$, cls token as 'background' token
- Positional encodings: $\mathbf{P}_{e}^{i}=\mathbf{P}_{l}^{i}+\mathbf{e}^{i}$
- Transformer encoder with $J$ layers; the output of the $j$-th layer is $f^{j}$
- Convolutional decoder
- Drop class token: $\mathcal{O}: \mathbb{R}^{(N+1) \times D} \rightarrow \mathbb{R}^{N \times D}$
- Unflatten: $\mathcal{U}: \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{\frac{H}{P} \times \frac{W}{P} \times D}$
- Convolutional decoder $\mathcal{D}$: upsamples the unflattened token grid back toward the image resolution to produce spatial feature maps
- Feature maps: $\mathbf{W}_{G}^{j}=(\mathcal{D} \circ \mathcal{U} \circ \mathcal{O})\left(f^{j}\right), \text { where } j \in\{0,1, \ldots, J\}$
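A minimal sketch of the reassembly pipeline $\mathbf{W}_{G}^{j}=(\mathcal{D} \circ \mathcal{U} \circ \mathcal{O})(f^{j})$; the patch size, channel widths, and transposed-convolution decoder below are illustrative choices, not the paper's exact configuration.

```python
# Turn the tokens f^j from the j-th ViT layer into a spatial feature map:
# drop the class token (O), unflatten to a grid (U), then decode with convs (D).
import torch
import torch.nn as nn

P, D, H, W = 16, 768, 128, 128            # patch size, token dim, image size

def to_feature_map(f_j: torch.Tensor, decoder: nn.Module) -> torch.Tensor:
    """f_j: (1, N+1, D) tokens from the j-th transformer layer."""
    tokens = f_j[:, 1:, :]                                       # O: drop cls token
    grid = tokens.transpose(1, 2).reshape(1, D, H // P, W // P)  # U: unflatten
    return decoder(grid)                                         # D: conv decoder

# Illustrative decoder: upsample the coarse token grid toward image resolution.
decoder = nn.Sequential(
    nn.ConvTranspose2d(D, 256, kernel_size=2, stride=2),
    nn.ReLU(),
    nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2),
)
f_j = torch.randn(1, (H // P) * (W // P) + 1, D)
W_G_j = to_feature_map(f_j, decoder)      # (1, 128, H/4, W/4)
```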
2D CNN - Local features
- Local features: $\mathbf{W}_{L}=\mathcal{G}_{L}\left(\mathbf{I}_{s}\right), \mathcal{G}_{L}: \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times D_{L}}$
Hybrid feature map: $\mathbf{W}=\mathcal{G}\left(\mathbf{W}_{G}^{0}, \mathbf{W}_{G}^{1}, \ldots, \mathbf{W}_{G}^{J} ; \mathbf{W}_{L}\right)$
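A minimal sketch of one possible fusion operator $\mathcal{G}$ (resize, concatenate, 1x1 conv); the paper's actual fusion module may differ.

```python
# Fuse multi-layer global ViT maps with the local CNN map into one hybrid map W.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse(global_maps, local_map, proj: nn.Conv2d) -> torch.Tensor:
    """global_maps: list of (1, C_g, h, w); local_map: (1, C_l, H/2, W/2)."""
    target = local_map.shape[-2:]
    resized = [F.interpolate(g, size=target, mode='bilinear', align_corners=False)
               for g in global_maps]                       # align all resolutions
    return proj(torch.cat(resized + [local_map], dim=1))   # concat + 1x1 conv

# Usage: two global maps (C_g=128) and one local map (C_l=64) fused to D=256.
g_maps = [torch.randn(1, 128, 32, 32), torch.randn(1, 128, 16, 16)]
l_map = torch.randn(1, 64, 64, 64)
proj = nn.Conv2d(128 * 2 + 64, 256, kernel_size=1)
W_hyb = fuse(g_maps, l_map, proj)                          # (1, 256, 64, 64)
```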
Volumetric Rendering with NeRF
Positional encoding: $\gamma(p)=\left(\sin \left(2^{0} \pi p\right), \cos \left(2^{0} \pi p\right), \ldots, \sin \left(2^{M-1} \pi p\right), \cos \left(2^{M-1} \pi p\right)\right)$
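A minimal sketch of this positional encoding with $M$ frequency bands:

```python
# NeRF-style positional encoding gamma(p): sin/cos at M exponentially spaced frequencies.
import math
import torch

def positional_encoding(p: torch.Tensor, M: int = 10) -> torch.Tensor:
    """p: (..., 3) coordinates; returns (..., 3 * 2 * M) encoded features."""
    freqs = (2.0 ** torch.arange(M, dtype=torch.float32)) * math.pi   # 2^0 pi .. 2^(M-1) pi
    angles = p.unsqueeze(-1) * freqs                                   # (..., 3, M)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)    # sin and cos per band
    return enc.flatten(start_dim=-2)                                   # (..., 3 * 2M)

x = torch.rand(4, 3)                               # 4 sample points
print(positional_encoding(x).shape)                # torch.Size([4, 60])
```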
MLP outputs: $(\sigma, \mathbf{c})=\operatorname{MLP}\left(\gamma\left(\mathbf{x}_{c}\right) ; \mathbf{d}_{c} ; \mathbf{W}(\pi(\mathbf{x}))\right)$
Render target view: $\hat{\mathbf{C}}(\mathbf{r})=\int_{t_{n}}^{t_{f}} T(t) \sigma(t) \mathbf{c}(t)dt$, $T(t)=\exp \left(-\int_{t_{n}}^{t} \sigma(\mathbf{r}(s)) d s\right)$
Loss: mean squared error between the rendered and ground-truth pixel colors, $\mathcal{L}=\sum_{\mathbf{r}}\left\|\hat{\mathbf{C}}(\mathbf{r})-\mathbf{C}(\mathbf{r})\right\|_{2}^{2}$
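A minimal sketch of the discrete quadrature commonly used to approximate the rendering integral, together with the per-ray MSE loss; the sample count and near/far bounds below are illustrative.

```python
# Composite color along one ray: alpha = 1 - exp(-sigma * delta), T = cumulative
# transparency, C_hat = sum_i T_i * alpha_i * c_i; then MSE against ground truth.
import torch

def render_ray(sigma: torch.Tensor, color: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """sigma: (S,) densities, color: (S, 3), t: (S,) sample depths along one ray."""
    delta = torch.cat([t[1:] - t[:-1], torch.full((1,), 1e10)])              # bin widths
    alpha = 1.0 - torch.exp(-sigma * delta)                                  # per-bin opacity
    T = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)   # transmittance
    return (T * alpha).unsqueeze(-1).mul(color).sum(dim=0)                   # composited RGB

t = torch.linspace(2.0, 6.0, 64)                   # samples between t_n and t_f
sigma, color = torch.rand(64), torch.rand(64, 3)
pred = render_ray(sigma, color, t)
loss = torch.mean((pred - torch.rand(3)) ** 2)     # MSE against ground-truth pixel color
```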
- Experiment
Datasets
- Category-specific view synthesis
- Category-agnostic view synthesis
- ShapeNet: 13 categories
- View synthesis on real images
- Trained on ShapeNet, tested on the Stanford Cars dataset
- Background removed with a segmentation model
- Discussion
- Reference
[1] Lin, Kai-En, et al. "Vision Transformer for NeRF-Based View Synthesis from a Single Input Image." WACV 2023.