
[CVPR 2023] GEN: Pushing the Limits of Softmax-Based Out-of-Distribution Detection

이성훈 Ethan 2023. 7. 4. 02:57

- Introduction

 

OOD detection scenarios

  1. Covariate shift: Change in the input distribution
  2. Semantic shift: Change in the label distribution

 

Existing OOD detection works

  • Predictive distribution
  • Incorporate feature statistics for ID data
  • Requires a portion of training data
  • Internal feature activation

 

GOAL: Explore and push the limits of OOD detection when the output of a softmax layer is the only available source of information - Post-hoc method

 

 

Main Contribution

  1. Only uses predictive distribution
  2. Performs very well (interesting that, if you are confident in the performance, strong results can themselves be claimed as a contribution..)

- Related Work

 

Score Design

  • Maximum Softmax Probability (MSP)
  • Minimum Mahalanobis distance between feature and class-wise centroids
  • Energy Score
  • Hard Maximum of logits
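
For reference, here is a minimal NumPy sketch of three of the logit-based scores listed above (MSP, energy, max logit), written in the usual "higher score = more ID-like" convention. This is my own illustration rather than code from any of these papers; the Mahalanobis score is omitted since it also needs feature statistics from the training set.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    # Maximum Softmax Probability (MSP): confidence of the predicted class
    return softmax(logits).max(axis=-1)

def energy_score(logits, T=1.0):
    # Energy-based score: T * logsumexp(logits / T), larger for peaked logits
    z = logits / T
    m = z.max(axis=-1)
    return T * (m + np.log(np.exp(z - m[..., None]).sum(axis=-1)))

def max_logit_score(logits):
    # Hard maximum of the raw logits (MaxLogit)
    return logits.max(axis=-1)
```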

 

Previous Methods

  • GradNorm
  • pNML
  • ViM
  • ...

 


- Method

 

Generalized Entropy Score

 

Aim: Rely solely on the logits and predictive distribution

  1. Agnostic to any information about the classifier's training, the training set, or explicit OOD samples
  2. Neural Collapse Hypothesis: features from the penultimate layer carry very limited additional information compared to the logits

 

Generalized Entropy $\mathit{G}$: Differentiable and concave function on the categorical distributions $\Delta^\mathit{C}$

 

Bregman Divergence: $D_G(\mathbf{p} \| \mathbf{q}):=G(\mathbf{q})-G(\mathbf{p})+(\mathbf{p}-\mathbf{q})^{\top} \nabla G(\mathbf{q})$
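
As a quick sanity check (my own derivation, not spelled out in the post): plugging the Shannon entropy $G(\mathbf{p})=-\sum_j p_j \log p_j$, with $\nabla G(\mathbf{q})=-\log \mathbf{q}-\mathbf{1}$, into this definition recovers the familiar KL divergence,

$\begin{aligned} D_G(\mathbf{p} \| \mathbf{q}) &= -\sum_j q_j \log q_j + \sum_j p_j \log p_j - \sum_j (p_j - q_j)(\log q_j + 1) \\ &= \sum_j p_j \log \frac{p_j}{q_j} = \mathrm{KL}(\mathbf{p} \| \mathbf{q}), \end{aligned}$

where the constant terms cancel because $\sum_j p_j = \sum_j q_j = 1$.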

 

Assumption: $G$ is invariant under permutations of the elements of $\mathbf{p}$

 

The Bregman divergence between $\mathbf{p}$ and the uniform categorical distribution $\mathbf{u}=\mathbf{1}/C$ reduces, up to an additive constant, to the negated generalized entropy

 

$\begin{aligned} D_G(\mathbf{p} \| \mathbf{u}) & =G(\mathbf{u})-G(\mathbf{p})+(\mathbf{p}-\mathbf{u})^{\top} \nabla G(\mathbf{u}) \\ & \doteq-G(\mathbf{p})+\underbrace{(\mathbf{p}-\mathbf{u})^{\top} \nabla G(\mathbf{u})}_{=0} .\end{aligned}$

 

Because $G$ is permutation-invariant, $\nabla G(\mathbf{u})=\nabla G(\mathbf{1} / C)=\kappa \mathbf{1}$ for some constant $\kappa$, so the last term vanishes: $(\mathbf{p}-\mathbf{u})^{\top} \kappa \mathbf{1}=\kappa\left(\sum_j p_j-\sum_j u_j\right)=0$.

In conclusion, using the negated entropy as a score can be interpreted as a statistical distance between the predictive distribution $\mathbf{p}$ and the uniform distribution $\mathbf{u}$.

 

$G_\gamma(\mathbf{p})=\sum_j p_j^\gamma\left(1-p_j\right)^\gamma,\, \gamma \in (0,1)$
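
A minimal NumPy sketch of the resulting GEN score, i.e. the negated generalized entropy of the softmax output. The defaults below ($\gamma = 0.1$ and truncation to the top-$M$ probabilities with $M = 100$) are my reading of the paper's ImageNet setup, so treat them as illustrative rather than the official configuration.

```python
import numpy as np

def gen_score(logits, gamma=0.1, M=100):
    """Negated generalized entropy G_gamma of the softmax distribution.

    gamma=0.1 and the top-M truncation (M=100) reflect my understanding of
    the paper's ImageNet-1K setup; treat them as assumptions. A higher score
    means a more peaked prediction, i.e. more in-distribution-like.
    """
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # keep only the M largest probabilities per sample
    top_m = np.sort(probs, axis=-1)[..., -M:]

    # G_gamma(p) = sum_j p_j^gamma * (1 - p_j)^gamma
    g = np.sum(top_m ** gamma * (1.0 - top_m) ** gamma, axis=-1)
    return -g  # near 0 for a one-hot p, strongly negative for a flat p
```

A near one-hot prediction gives a score close to 0 (its maximum), while flatter predictions push it strongly negative, which is what thresholding the score exploits.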

 

We can see that $G_\gamma$ is concave for $\gamma$ between 0 and 1.

 

Compared to the Shannon entropy, the generalized entropy becomes more sensitive to minor deviations of the predictive distribution as $\gamma$ gets smaller: for example, a tail probability of $p_j = 10^{-4}$ contributes only about $9\times10^{-4}$ to the Shannon entropy ($-p_j \log p_j$), but about $0.4$ to $G_{0.1}$, since $(10^{-4})^{0.1} \approx 0.4$.

 

So the motivation of GEN is simple and straightforward (or so the authors say... it doesn't fully click for me, since I'm not very familiar with information theory..)

 

In any case, the key point is using the generalized entropy to amplify minor deviations in the predictive distribution → presumably it picks up on the finer details a bit better..


- Experiment

 

Datasets

  • ID
    • ImageNet-1K
  • OOD
    • OpenImage-O
    • Texture
    • iNaturalist
    • ImageNet-O

 

ImageNet-1K penultimate layer output dimension and top-1 accuracy

 

Baseline

  • Big Transfer (BiT)
  • ViT
  • RepVGG
  • ResNet-50-D
  • DeiT
  • Swin

 

Per-Dataset performance of OOD detection methods
Average performance of OOD detection methods


- Discussion

 

Honestly, I would have liked it to perform best with ViT so I could use it in my own research... but oh well...

 

Still, since such a simple method achieves SOTA across various datasets and baselines, I think it is quite a meaningful piece of work.


- Reference

[1] Liu, Xixi, et al. "GEN: Pushing the Limits of Softmax-Based Out-of-Distribution Detection." CVPR 2023.