- Introduction
In NLP, models such as GPT and BERT remove a portion of the input and learn to predict the removed content
► What is the difference in masked autoencoding between CV and NLP?
- Architectures were different
- CNNs were traditionally dominant in vision, where mask tokens and positional embeddings do not fit in naturally
- ViT closed this architectural gap
- Information density is different
- Language: human-generated signals that are highly semantic and information-dense
- Vision: natural signals with heavy spatial redundancy
- Therefore, masking a very high proportion of random patches (e.g., 75%) creates a nontrivial and effective self-supervisory task for images
- Decoder's role is different
- Vision: the decoder reconstructs pixels, which are low-level and less semantic outputs
- Language: the decoder predicts missing words, which carry rich semantic information
This asymmetric masking design (the encoder sees only the visible patches; a lightweight decoder reconstructs the rest) (1) improves accuracy and (2) reduces overall pre-training time and memory [1]
- Method
Masking
Image → non-overlapping patches (as in ViT) → random sampling of the patches to keep, with a high masking ratio (e.g., 75%)
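A minimal PyTorch sketch of this step, assuming square 224×224 inputs and the paper's default 16×16 patches and 75% masking ratio; `patchify` and the batch-level sampling below are illustrative simplifications, not the official code.

```python
import torch

def patchify(imgs: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split (B, C, H, W) images into (B, N, patch_size*patch_size*C) non-overlapping patches."""
    B, C, H, W = imgs.shape
    h, w = H // patch_size, W // patch_size
    x = imgs.reshape(B, C, h, patch_size, w, patch_size)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch_size * patch_size * C)

imgs = torch.randn(4, 3, 224, 224)
patches = patchify(imgs)                                   # (4, 196, 768)
# keep a random 25% of the patches (one subset for the whole batch here for brevity;
# MAE samples an independent subset per image)
keep = torch.randperm(patches.shape[1])[: int(196 * 0.25)]
visible = patches[:, keep]                                 # (4, 49, 768), fed to the encoder
```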
MAE encoder
A ViT encoder is applied only to the visible (unmasked) patches; mask tokens never enter the encoder
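A minimal sketch of the asymmetric encoder, assuming the visible patches and kept indices produced by the masking step; a stock `nn.TransformerEncoder` with only two layers stands in for the ViT-L blocks, so this is an illustrative approximation rather than the paper's model.

```python
import torch
import torch.nn as nn

patch_dim, embed_dim, num_patches, num_visible = 16 * 16 * 3, 1024, 196, 49

patch_proj = nn.Linear(patch_dim, embed_dim)               # linear patch embedding
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
blocks = nn.TransformerEncoder(                            # stand-in for the ViT blocks
    nn.TransformerEncoderLayer(embed_dim, nhead=16, batch_first=True),
    num_layers=2,                                          # ViT-L uses 24 blocks
)

def encode(visible_patches: torch.Tensor, ids_keep: torch.Tensor) -> torch.Tensor:
    """Encode only the ~25% visible patches; mask tokens never enter the encoder."""
    x = patch_proj(visible_patches)
    pos = pos_embed[0][ids_keep]                           # positions of the kept patches only
    return blocks(x + pos)

# usage: 2 images, 49 visible patches each; ids_keep records which patches were kept
vis = torch.randn(2, num_visible, patch_dim)
ids_keep = torch.stack([torch.randperm(num_patches)[:num_visible] for _ in range(2)])
encoded = encode(vis, ids_keep)                            # (2, 49, 1024)
```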
MAE decoder
A lightweight decoder takes the encoded visible tokens together with shared, learnable mask tokens; it is used only during pre-training
Positional embeddings are added to all tokens so that mask tokens carry location information
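A minimal sketch of the decoder-side assembly under the same assumptions; the decoder width of 512 follows the paper's lightweight default, while the two-layer `nn.TransformerEncoder` is again only a stand-in.

```python
import torch
import torch.nn as nn

embed_dim, dec_dim, num_patches = 1024, 512, 196

decoder_embed = nn.Linear(embed_dim, dec_dim)              # project to the narrower decoder width
mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))      # one shared, learnable mask token
dec_pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
dec_blocks = nn.TransformerEncoder(                        # stand-in for the decoder blocks
    nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True),
    num_layers=2,                                          # the paper's default decoder has 8 blocks
)

def decode(encoded_visible: torch.Tensor, ids_restore: torch.Tensor) -> torch.Tensor:
    """Fill in mask tokens, restore the original patch order, add positions, decode."""
    x = decoder_embed(encoded_visible)                     # (B, N_visible, dec_dim)
    B, n_vis, D = x.shape
    x = torch.cat([x, mask_token.repeat(B, num_patches - n_vis, 1)], dim=1)
    x = torch.gather(x, 1, ids_restore.unsqueeze(-1).repeat(1, 1, D))   # unshuffle
    return dec_blocks(x + dec_pos_embed)                   # positional embedding on ALL tokens

# usage with dummy encoder outputs and a dummy restore permutation
enc_out = torch.randn(2, 49, embed_dim)
ids_restore = torch.stack([torch.randperm(num_patches) for _ in range(2)])
dec_out = decode(enc_out, ids_restore)                     # (2, 196, dec_dim)
```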
Reconstruction target
The decoder's output is reshaped to form a reconstructed image
The last layer of the decoder is a linear projection back to the pixel values of each patch
Trained with MSE loss between the reconstruction and the original image, computed only on the masked patches
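A minimal sketch of the reconstruction head and loss, assuming the mask convention (1 = masked, 0 = visible) and the patchified pixel targets from the earlier sketches; `pred_head` and `mae_loss` are illustrative names.

```python
import torch
import torch.nn as nn

dec_dim, patch_dim = 512, 16 * 16 * 3
pred_head = nn.Linear(dec_dim, patch_dim)        # decoder's last layer: linear projection to pixel values

def mae_loss(decoder_out: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Per-patch MSE between prediction and original pixels, averaged over masked patches only."""
    pred = pred_head(decoder_out)                          # (B, N, patch_dim)
    per_patch = ((pred - target) ** 2).mean(dim=-1)        # (B, N)
    # (the paper also evaluates a variant with per-patch normalized pixel targets)
    return (per_patch * mask).sum() / mask.sum()           # mask: 1 = masked, 0 = visible
```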
Simple implementation
- Generate a token for every input patch (linear projection with an added positional embedding)
- Randomly shuffle the tokens and keep only a small subset (e.g., 25% for the default 75% masking ratio); remove the rest
- Encode the kept tokens
- Append mask tokens to the encoded tokens and unshuffle back to the original order before decoding
- Compute the MSE loss on the masked patches
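The five steps above map almost line-for-line onto code. Below is a compact, runnable toy version in PyTorch: plain `nn.Linear` layers stand in for the transformer encoder and decoder so the example stays short, and the names and shapes are illustrative assumptions rather than the official implementation (only the 75% masking ratio and the ViT-L/decoder widths follow the paper's defaults).

```python
import torch
import torch.nn as nn

B, N, patch_dim, enc_dim, dec_dim = 2, 196, 16 * 16 * 3, 1024, 512
mask_ratio = 0.75
len_keep = int(N * (1 - mask_ratio))                       # 49 visible patches

patch_embed   = nn.Linear(patch_dim, enc_dim)              # stand-in for the ViT encoder
decoder_embed = nn.Linear(enc_dim, dec_dim)                # stand-in for the lightweight decoder
pred_head     = nn.Linear(dec_dim, patch_dim)              # linear projection back to pixels
mask_token    = nn.Parameter(torch.zeros(1, 1, dec_dim))

patches = torch.randn(B, N, patch_dim)                     # (1) a token for every input patch

# (2) randomly shuffle, keep 25%, remove the rest
noise = torch.rand(B, N)
ids_shuffle = noise.argsort(dim=1)
ids_restore = ids_shuffle.argsort(dim=1)                   # inverse permutation for unshuffling
ids_keep = ids_shuffle[:, :len_keep]
visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).repeat(1, 1, patch_dim))

# (3) encode the visible tokens only
latent = patch_embed(visible)

# (4) append mask tokens, unshuffle to the original order, decode
x = decoder_embed(latent)
x = torch.cat([x, mask_token.repeat(B, N - len_keep, 1)], dim=1)
x = torch.gather(x, 1, ids_restore.unsqueeze(-1).repeat(1, 1, dec_dim))
pred = pred_head(x)

# (5) MSE on the masked patches only
mask = torch.ones(B, N)
mask.scatter_(1, ids_keep, 0)                              # 1 = masked, 0 = visible
loss = (((pred - patches) ** 2).mean(-1) * mask).sum() / mask.sum()
print(loss.item())
```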
- Experiment
Self-supervised pre-training on ImageNet-1k
Baseline: ViT-L/16
Masking ratio
Ablations (Decoder design, Mask token, Reconstruction target, Data augmentation, Mask sampling strategy)
Comparisons with self-supervised methods on IN1K dataset
MAE vs Supervised pre-training
- Discussion
- Reference
[1] He, Kaiming, et al. "Masked Autoencoders Are Scalable Vision Learners." CVPR 2022. [Paper link]