[CVPR 2022] Masked Autoencoders Are Scalable Vision Learners (MAE)

이성훈 Ethan 2023. 7. 27. 16:39

- Introduction

 

In NLP, models such as GPT and BERT remove a portion of the data and are trained to predict the removed content

 

► What is the difference in masked autoencoding between CV and NLP?

 

  1. Architectures were different
    • CNNs were traditionally used in the field of vision, and mask tokens or positional embeddings are not straightforward to integrate into them
    • ViT addressed this gap
  2. Information density is different
    • Language: human-generated signals that are highly semantic and information-dense
    • Vision: natural signals with heavy spatial redundancy
    • Because of this redundancy, masking a very high portion of random patches is effective for image pre-training
  3. The decoder's role is different
    • Vision: the decoder reconstructs pixels, which are low-level targets
    • Language: the decoder predicts missing words, which contain rich semantic information

 

 

This masking design (1) improves accuracy and (2) reduces overall pre-training time and memory


- Method

 

Masking

 

Image → non-overlapping patches (as in ViT) → random sampling of patches to keep, masking out the rest (uniform, without replacement)

 

Example reconstructions on ImageNet validation images (80% masked)
Example reconstructions on COCO validation images, using an MAE trained on ImageNet
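
Below is a minimal PyTorch sketch of this masking step (not the authors' code; the patch size of 16, the 0.75 mask ratio, and the helper names `patchify` / `random_masking` are illustrative choices here):

```python
import torch

def patchify(imgs, patch_size=16):
    """Split images into non-overlapping patches, flattened to vectors.
    imgs: (B, 3, H, W) with H and W divisible by patch_size."""
    B, C, H, W = imgs.shape
    h, w = H // patch_size, W // patch_size
    x = imgs.reshape(B, C, h, patch_size, w, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch_size * patch_size * C)
    return x  # (B, num_patches, patch_dim)

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patches per image; mask out the rest."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N)                          # per-sample random scores
    ids_shuffle = torch.argsort(noise, dim=1)         # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation, used to unshuffle later
    ids_keep = ids_shuffle[:, :num_keep]

    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # binary mask in original patch order: 0 = visible, 1 = masked
    mask = torch.ones(B, N)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore, ids_keep
```

Sampling uniformly per image (rather than masking a contiguous block) avoids a center bias and makes the task hard enough that it cannot be solved by simple extrapolation from neighbors.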

 

 

MAE encoder

 

The encoder is a ViT applied only to the visible (unmasked) patches; mask tokens are not processed here, so the encoder operates on only a small subset (~25%) of the tokens
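
A rough sketch of this idea, using `torch.nn.TransformerEncoder` as a stand-in for the ViT blocks (the width, depth, and the name `MAEEncoderSketch` are assumptions for illustration, not the paper's ViT-L configuration; the class token is omitted):

```python
import torch
import torch.nn as nn

class MAEEncoderSketch(nn.Module):
    """Stand-in encoder: a plain Transformer applied only to the visible patches."""
    def __init__(self, patch_dim=768, embed_dim=1024, depth=4, num_heads=16, num_patches=196):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)                        # patch embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, visible_patches, ids_keep):
        # visible_patches: (B, N_visible, patch_dim); ids_keep: (B, N_visible)
        x = self.proj(visible_patches)
        pos = self.pos_embed.expand(x.shape[0], -1, -1)
        pos = torch.gather(pos, 1, ids_keep.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
        return self.blocks(x + pos)   # attention runs over only the ~25% visible tokens
```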

 

 

MAE decoder

 

The MAE decoder takes the full set of tokens: the encoded visible tokens together with (shared, learned) mask tokens

 

Positional embeddings are added to all tokens, so that the mask tokens carry information about their location in the image
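
A matching sketch of the decoder (again a stand-in: the narrower width, small depth, and the name `MAEDecoderSketch` are illustrative; the real decoder is a lightweight ViT that is used only during pre-training and discarded afterwards):

```python
import torch
import torch.nn as nn

class MAEDecoderSketch(nn.Module):
    """Stand-in for the lightweight MAE decoder."""
    def __init__(self, embed_dim=1024, dec_dim=512, depth=2, num_heads=16,
                 num_patches=196, patch_dim=768):
        super().__init__()
        self.embed = nn.Linear(embed_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))         # shared, learned mask token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        layer = nn.TransformerEncoderLayer(dec_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.pred = nn.Linear(dec_dim, patch_dim)                           # last layer: linear projection to pixels

    def forward(self, encoded_visible, ids_restore):
        x = self.embed(encoded_visible)                                     # (B, N_visible, dec_dim)
        B, n_vis, D = x.shape
        mask_tokens = self.mask_token.expand(B, ids_restore.shape[1] - n_vis, -1)
        x = torch.cat([x, mask_tokens], dim=1)                              # append mask tokens
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D)) # unshuffle to image order
        x = x + self.pos_embed                                              # positional embedding on all tokens
        return self.pred(self.blocks(x))                                    # (B, num_patches, patch_dim)
```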

 

 

Reconstruction target

 

The decoder's output is reshaped to form a reconstructed image; each output vector holds the pixel values of one patch

 

The last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in a patch

 

Trained with an MSE loss between the reconstructed and original images in pixel space, computed only on the masked patches (as in BERT)
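
A possible sketch of this loss, assuming `pred` and `target_patches` are per-patch pixel vectors and `mask` marks the masked patches (the per-patch pixel normalization is the optional variant reported in the paper):

```python
import torch

def mae_loss(pred, target_patches, mask, normalize_pixels=False):
    """MSE between reconstructed and original patches, averaged over masked patches only.
    pred, target_patches: (B, N, patch_dim); mask: (B, N) with 1 = masked, 0 = visible."""
    if normalize_pixels:
        # optional variant: normalize each target patch by its own mean and std
        mean = target_patches.mean(dim=-1, keepdim=True)
        var = target_patches.var(dim=-1, keepdim=True)
        target_patches = (target_patches - mean) / (var + 1e-6).sqrt()
    loss = ((pred - target_patches) ** 2).mean(dim=-1)   # per-patch MSE
    return (loss * mask).sum() / mask.sum()              # ignore visible patches
```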

 

 

Simple implementation

 

  1. Generate a token for every input patch (linear projection with an added positional embedding)
  2. Randomly shuffle the list of tokens and keep only a subset, removing the last portion according to the masking ratio
  3. Encode the kept (visible) tokens
  4. Append mask tokens to the list of encoded tokens and unshuffle it, so every token is aligned with its target patch
  5. Decode and compute the MSE loss on the masked patches (see the sketch below)
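
Putting the five steps together, a minimal end-to-end sketch reusing the toy helpers defined above (`patchify`, `random_masking`, `MAEEncoderSketch`, `MAEDecoderSketch`, `mae_loss` are this post's illustrative names, not the official implementation):

```python
import torch

# Dummy batch of ImageNet-sized images: 224x224 -> 14x14 = 196 patches of 16x16
imgs = torch.randn(8, 3, 224, 224)

patches = patchify(imgs)                                              # 1. one token per patch
visible, mask, ids_restore, ids_keep = random_masking(patches, 0.75)  # 2. shuffle, keep 25%
encoder = MAEEncoderSketch(patch_dim=patches.shape[-1])
decoder = MAEDecoderSketch()

latent = encoder(visible, ids_keep)                                   # 3. encode visible tokens only
pred = decoder(latent, ids_restore)                                   # 4. mask tokens + unshuffle + decode
loss = mae_loss(pred, patches, mask)                                  # 5. MSE on masked patches
loss.backward()
```

The shuffle/unshuffle trick means no sparse operations are needed; masking is just indexing, which is why the implementation stays simple and fast.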

- Experiment

 

Self-supervised pre-training on ImageNet-1k

 

Baseline: ViT-L/16

 

 

Masking ratio: a high masking ratio (75%) works best, for both fine-tuning and linear probing

 

 

Ablations (Decoder design, Mask token, Reconstruction target, Data augmentation, Mask sampling strategy)

 

 

 

Comparisons with self-supervised methods on the IN1K dataset

 

 

MAE vs Supervised pre-training

 


- Discussion

 


- Reference

[1] He, Kaiming, et al. "Masked Autoencoders Are Scalable Vision Learners." CVPR 2022.