- Introduction
Problem define: 기존 NeRF 는 너무 많은 수의 image 를 요구하며 너무 긴 optimization 시간으로 인해 impractical
► pixelNeRF 는 image feature 를 사용하지 않는 NeRF 와 달리, 각 pixel 에 aligned 된 spatial image feature 를 input 으로 사용
► pixelNeRF 는 NeRF 와 달리 few input image 로 잘 작동함
Framework
- Single Image
- Input image → Fully convolutional image feature grid
- Sample the corresponding image feature via projection and bilinear interpolation
- Query specification is sent along with the image features to NeRF network
- Multiple Image
- Input image → Latent representation in each camera's coordinate frame
- Pooled in an intermediate layer

View space: Viewer-centered coordinate system
Canonical space: Objected-centered coordinate system
- Method
Total architecture
- Fully-convolutional image encoder E: Input image 를 pixel-aligned feature grid 로 encode
- Pretrained ResNet-34
- NeRF network f: Outputs color and density

Single-Image pixelNeRF
Input image I
Feature volume W=E(I)
Camera ray x: Retrieve the corresponding image feature by projecting x to π(x)
Feature vector W(π(x)): Bilinearly interpolating between the pixelwise features
Image features, position, viewing direction is passed into the NeRF network
f(γ(x),d;W(π(x)))=(σ,c)
γ(⋅): Positional encoding
d: Viewing direction
x: Query point
Multiple Views
기존 연구에선 test time 에 single input view 만을 사용하던 것과는 다르게, pixelNeRF 에선 test time 에 arbitrary number 의 input view 를 사용하여 additional information 을 제공
ith input image I(i)
Camera transform from the world space to its view space with P(i)=[R(i)t(i)]
x(i)=P(i)x,d(i)=R(i)d

V(i)=f1(γ(x(i)),d(i);W(i)(π(x(i))))
V(1) 부터 V(i) 의 mean 을 구해 f2 로 이동
► (σ,c)=f2(ψ(V(1),…,V(n)))
만약 single input 인 경우엔 그냥 f=f2∘f1
- Experiment
Datasets
- ShapeNet
- Category-specific
- Category-agnostic
- ShapeNet scenes
- Unseen categories
- Multiple objects
- Domain transfer to real car photos
- DTU MVS dataset
- Real scenes
Baselines
- SRN
- DVR
- SoftRas for category-agnostic setting
Metrics
- PSNR
- SSIM
- LPIPS
Category-specific
Separate model for cars and chairs

Test time 에 2개의 input view 를 넣었을 때 reconstruction 이 더 잘 됨을 알 수 있음


Category-agnostic
Single model for 13 largest ShapeNet categories

Unseen-categories

Multiple-objects

- Discussion
- Reference
[1] Yu, Alex, et al. "pixelnerf: Neural radiance fields from one or few images." CVPR 2021 [Paper link]