- Introduction
In NLP, models such as GPT and BERT remove a portion of the input and learn to predict the removed content
► What is the difference in masked autoencoding between CV and NLP?
- Architectures were different
- CNNs were traditionally dominant in vision, where mask tokens and positional embeddings do not fit in naturally
- ViT closed this architectural gap
- Information density is different
- Language: human-generated signals that are highly semantic and information-dense
- Vision: natural signals with heavy spatial redundancy
- Therefore, masking a very high proportion of random patches (e.g., 75%) creates a nontrivial and effective self-supervisory task for images
- Decoder's role is different
- Vision: the decoder reconstructs pixels, which are low-level and less semantic outputs
- Language: the decoder predicts missing words, which carry rich semantic information
This asymmetric masking design (the encoder sees only the visible patches; a lightweight decoder reconstructs the rest) (1) improves accuracy and (2) reduces overall pre-training time and memory [1]
- Method
Masking
Image → non-overlapping patches (as in ViT) → random sampling of the patches to keep, with a high masking ratio (e.g., 75%)
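A minimal PyTorch sketch of this step, assuming square 224×224 inputs and the paper's default 16×16 patches and 75% masking ratio; `patchify` and the batch-level sampling below are illustrative simplifications, not the official code.

```python
import torch

def patchify(imgs: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split (B, C, H, W) images into (B, N, patch_size*patch_size*C) non-overlapping patches."""
    B, C, H, W = imgs.shape
    h, w = H // patch_size, W // patch_size
    x = imgs.reshape(B, C, h, patch_size, w, patch_size)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch_size * patch_size * C)

imgs = torch.randn(4, 3, 224, 224)
patches = patchify(imgs)                                   # (4, 196, 768)
# keep a random 25% of the patches (one subset for the whole batch here for brevity;
# MAE samples an independent subset per image)
keep = torch.randperm(patches.shape[1])[: int(196 * 0.25)]
visible = patches[:, keep]                                 # (4, 49, 768), fed to the encoder
```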
MAE encoder
A ViT encoder is applied only to the visible (unmasked) patches; mask tokens never enter the encoder
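A minimal sketch of the asymmetric encoder, assuming the visible patches and kept indices produced by the masking step; a stock `nn.TransformerEncoder` with only two layers stands in for the ViT-L blocks, so this is an illustrative approximation rather than the paper's model.

```python
import torch
import torch.nn as nn

patch_dim, embed_dim, num_patches, num_visible = 16 * 16 * 3, 1024, 196, 49

patch_proj = nn.Linear(patch_dim, embed_dim)               # linear patch embedding
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
blocks = nn.TransformerEncoder(                            # stand-in for the ViT blocks
    nn.TransformerEncoderLayer(embed_dim, nhead=16, batch_first=True),
    num_layers=2,                                          # ViT-L uses 24 blocks
)

def encode(visible_patches: torch.Tensor, ids_keep: torch.Tensor) -> torch.Tensor:
    """Encode only the ~25% visible patches; mask tokens never enter the encoder."""
    x = patch_proj(visible_patches)
    pos = pos_embed[0][ids_keep]                           # positions of the kept patches only
    return blocks(x + pos)

# usage: 2 images, 49 visible patches each; ids_keep records which patches were kept
vis = torch.randn(2, num_visible, patch_dim)
ids_keep = torch.stack([torch.randperm(num_patches)[:num_visible] for _ in range(2)])
encoded = encode(vis, ids_keep)                            # (2, 49, 1024)
```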
MAE decoder
A lightweight decoder takes the encoded visible tokens together with shared, learnable mask tokens; it is used only during pre-training
Positional embeddings are added to all tokens so that mask tokens carry location information
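A minimal sketch of the decoder-side assembly under the same assumptions; the decoder width of 512 follows the paper's lightweight default, while the two-layer `nn.TransformerEncoder` is again only a stand-in.

```python
import torch
import torch.nn as nn

embed_dim, dec_dim, num_patches = 1024, 512, 196

decoder_embed = nn.Linear(embed_dim, dec_dim)              # project to the narrower decoder width
mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))      # one shared, learnable mask token
dec_pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
dec_blocks = nn.TransformerEncoder(                        # stand-in for the decoder blocks
    nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True),
    num_layers=2,                                          # the paper's default decoder has 8 blocks
)

def decode(encoded_visible: torch.Tensor, ids_restore: torch.Tensor) -> torch.Tensor:
    """Fill in mask tokens, restore the original patch order, add positions, decode."""
    x = decoder_embed(encoded_visible)                     # (B, N_visible, dec_dim)
    B, n_vis, D = x.shape
    x = torch.cat([x, mask_token.repeat(B, num_patches - n_vis, 1)], dim=1)
    x = torch.gather(x, 1, ids_restore.unsqueeze(-1).repeat(1, 1, D))   # unshuffle
    return dec_blocks(x + dec_pos_embed)                   # positional embedding on ALL tokens

# usage with dummy encoder outputs and a dummy restore permutation
enc_out = torch.randn(2, 49, embed_dim)
ids_restore = torch.stack([torch.randperm(num_patches) for _ in range(2)])
dec_out = decode(enc_out, ids_restore)                     # (2, 196, dec_dim)
```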
Reconstruction target
The decoder's output is reshaped to form a reconstructed image
The last layer of the decoder is a linear projection back to the pixel values of each patch
Trained with MSE loss between the reconstruction and the original image, computed only on the masked patches
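A minimal sketch of the reconstruction head and loss, assuming the mask convention (1 = masked, 0 = visible) and the patchified pixel targets from the earlier sketches; `pred_head` and `mae_loss` are illustrative names.

```python
import torch
import torch.nn as nn

dec_dim, patch_dim = 512, 16 * 16 * 3
pred_head = nn.Linear(dec_dim, patch_dim)        # decoder's last layer: linear projection to pixel values

def mae_loss(decoder_out: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Per-patch MSE between prediction and original pixels, averaged over masked patches only."""
    pred = pred_head(decoder_out)                          # (B, N, patch_dim)
    per_patch = ((pred - target) ** 2).mean(dim=-1)        # (B, N)
    # (the paper also evaluates a variant with per-patch normalized pixel targets)
    return (per_patch * mask).sum() / mask.sum()           # mask: 1 = masked, 0 = visible
```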
Simple implementation
- Generate a token for every input patch (linear projection with an added positional embedding)
- Randomly shuffle the tokens and keep only a small subset (e.g., 25% for the default 75% masking ratio); remove the rest
- Encode the kept tokens
- Append mask tokens to the encoded tokens and unshuffle back to the original order before decoding
- Compute the MSE loss on the masked patches
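The five steps above map almost line-for-line onto code. Below is a compact, runnable toy version in PyTorch: plain `nn.Linear` layers stand in for the transformer encoder and decoder so the example stays short, and the names and shapes are illustrative assumptions rather than the official implementation (only the 75% masking ratio and the ViT-L/decoder widths follow the paper's defaults).

```python
import torch
import torch.nn as nn

B, N, patch_dim, enc_dim, dec_dim = 2, 196, 16 * 16 * 3, 1024, 512
mask_ratio = 0.75
len_keep = int(N * (1 - mask_ratio))                       # 49 visible patches

patch_embed   = nn.Linear(patch_dim, enc_dim)              # stand-in for the ViT encoder
decoder_embed = nn.Linear(enc_dim, dec_dim)                # stand-in for the lightweight decoder
pred_head     = nn.Linear(dec_dim, patch_dim)              # linear projection back to pixels
mask_token    = nn.Parameter(torch.zeros(1, 1, dec_dim))

patches = torch.randn(B, N, patch_dim)                     # (1) a token for every input patch

# (2) randomly shuffle, keep 25%, remove the rest
noise = torch.rand(B, N)
ids_shuffle = noise.argsort(dim=1)
ids_restore = ids_shuffle.argsort(dim=1)                   # inverse permutation for unshuffling
ids_keep = ids_shuffle[:, :len_keep]
visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).repeat(1, 1, patch_dim))

# (3) encode the visible tokens only
latent = patch_embed(visible)

# (4) append mask tokens, unshuffle to the original order, decode
x = decoder_embed(latent)
x = torch.cat([x, mask_token.repeat(B, N - len_keep, 1)], dim=1)
x = torch.gather(x, 1, ids_restore.unsqueeze(-1).repeat(1, 1, dec_dim))
pred = pred_head(x)

# (5) MSE on the masked patches only
mask = torch.ones(B, N)
mask.scatter_(1, ids_keep, 0)                              # 1 = masked, 0 = visible
loss = (((pred - patches) ** 2).mean(-1) * mask).sum() / mask.sum()
print(loss.item())
```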
- Experiment
Self-supervised pre-training on ImageNet-1k
Baseline: ViT-L/16
Masking ratio
Ablations (Decoder design, Mask token, Reconstruction target, Data augmentation, Mask sampling strategy)
Comparisons with self-supervised methods on IN1K dataset
MAE vs Supervised pre-training
- Discussion
- Reference
[1] He, Kaiming, et al. "Masked Autoencoders Are Scalable Vision Learners." CVPR 2022. [Paper link]