
0x451 Representations

1. Classical Features

1.1. SIFT

1.2. SURF

1.3. BRIEF

Model (BRIEF, Binary Robust Independent Elementary Features)

2. Self-Supervised Features

This section follows the Cookbook of SSL

2.1. Deep Metric Learning Model


The deep metric learning family encourages similarity between semantically transformed versions of an image.

Model (SimCLR) Check illustration of this blog

The learning steps are:

  • apply data augmentation (e.g. crop, distortion) to every data point \(x\) in the batch, yielding a positive pair \((\tilde{x}_i, \tilde{x}_j)\)
  • use an encoder (e.g. ResNet) with pooling (e.g. mean pooling) to encode \(h_i = f(\tilde{x}_i)\)
  • project the representations with \(z_i = g(h_i) = W_2(\text{ReLU}(W_1h_i))\); the contrastive loss is computed on \(z_i\), which empirically works better than computing it on \(h_i\)
  • apply the contrastive loss

Suppose the minibatch contains \(N\) examples, so augmentation gives \(2N\) data points. With \(\text{sim}\) denoting cosine similarity and \(\tau\) a temperature parameter, the loss function of a positive pair \((i,j)\) is defined as

\[l_{i,j} = - \log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} 1_{k \neq i}\exp(\text{sim}(z_i, z_k)/\tau)}\]
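The loss above can be sketched in NumPy as follows. This is a minimal illustration rather than the reference SimCLR code: the function name `nt_xent_loss`, the default `tau=0.5`, and the convention that rows `2k` and `2k+1` of `z` form a positive pair are all assumptions made for this example, and \(\text{sim}\) is taken to be cosine similarity.

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """Average l_{i,j} over all 2N rows of z.

    z has shape (2N, d). Assumption for this sketch: rows 2k and 2k+1
    are the two augmented views of the same example.
    """
    # Unit-normalise so the dot product equals cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                      # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)           # implements the indicator 1_{k != i}

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))

    n = z.shape[0]
    partner = np.arange(n) ^ 1               # 0<->1, 2<->3, ... (positive pair index)
    return -log_prob[np.arange(n), partner].mean()
```

Averaging over all \(2N\) rows counts both \((i,j)\) and \((j,i)\) for each positive pair. When the two views of every example coincide, the numerator term is as large as it can be, so the loss is lower than for mismatched pairs.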


2.2. Self-Distillation Model

This family feeds two different views to two encoders and maps the output of one to the output of the other by means of a predictor.

Model (DINO) self distillation


Model (MoCo v3)

Model (BYOL, Bootstrap Your Own Latent)

2.3. Canonical Correlation Analysis Model

2.4. Masked Image Models

2.4.1. Pixel Masked Model

Model (Masked Autoencoder, MAE) The core ideas are

  • the encoder operates only on the visible subset of patches
  • a high proportion of the image patches is masked (e.g. 75%)

Figure: masked autoencoder
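The two ideas above can be illustrated with a small random-masking routine. This is a sketch under assumptions, not the MAE reference implementation: the function name `random_masking`, the `(num_patches, dim)` input layout, and the returned tuple are choices made for this example.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches; only these would be fed to the encoder.

    patches: array of shape (num_patches, dim).
    Returns (visible_patches, keep_idx, mask) where mask[i] is True
    if patch i is hidden from the encoder.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))

    # A random permutation gives a uniformly random subset of patches.
    perm = rng.permutation(num_patches)
    keep_idx = np.sort(perm[:num_keep])

    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask
```

For a 14 x 14 grid of patches (196 in total) and a 75% mask ratio, the encoder in this sketch would see only 49 patches, which is where most of the compute saving comes from.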

2.4.2. Token Masked Model

Model (BEiT, Masked Language Model) BERT Pre-Training of Image Transformers


2.4.3. High-level Feature Masked Model

Model (BEiT v2) The core ideas are

  • the decoder attempts to reconstruct semantic features distilled from teacher models (as shown in the 1st figure)
  • the final-layer CLS token is concatenated with an intermediate layer's outputs to reconstruct the features (as shown in the 2nd figure); this aggregates information into the CLS token



3. Visual Language Modeling

3.1. Multimodal Encoder

Model (VisualBERT) Pretrains the model to align image regions with the input text.

Each image embedding corresponds to a bounding region detected by an object detector. The embedding is the sum of a CNN feature embedding, a segment embedding (indicating that it is an image embedding), and a position embedding.

The pretraining tasks are

  • MLM with image: only text tokens can be masked
  • sentence-image prediction: predict whether the sentence and the image are aligned


Model (VL-BERT)


Model (SimVLM, prefix language model) Images are treated as a prefix with bidirectional attention. The remaining tokens have an autoregressive structure.

Figure: prefix LM
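The prefix attention pattern can be written down as a boolean mask. The sketch below is illustrative only: `prefix_lm_mask` is a hypothetical helper, and the convention that `mask[i, j]` being `True` means "position `i` may attend to position `j`" is an assumption of this example.

```python
import numpy as np

def prefix_lm_mask(prefix_len, total_len):
    """Attention mask for a prefix LM.

    Positions [0, prefix_len) form the prefix (e.g. image tokens) and
    attend to each other bidirectionally. Later positions attend to the
    whole prefix and, causally, to earlier non-prefix positions.
    """
    # Start from a causal (lower-triangular) mask.
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))
    # Every position may see the entire prefix.
    mask[:, :prefix_len] = True
    return mask
```

With `prefix_len=2` and `total_len=4`, the first two rows see exactly the two prefix positions, while rows 2 and 3 see the prefix plus the tokens up to and including themselves.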

3.2. Unimodal Encoder

Model (CLIP, Contrastive Language Image Pretraining)

Jointly train a text encoder and an image encoder so that correctly aligned (text, image) pairs receive a higher probability than mismatched pairs (using a cross-entropy loss).

This enables zero-shot prediction by using label texts as the candidate captions.

Figure: CLIP training
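The training objective and the zero-shot step can be sketched in NumPy. This is a simplified illustration, not CLIP's actual code: the function names, the fixed `temperature=0.07` (CLIP learns this value during training), and the use of precomputed embeddings in place of real encoders are assumptions of this example.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch of B aligned (image, text) pairs.

    Row i of image_emb is assumed to match row i of text_emb.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature    # (B, B) similarity matrix

    diag = np.arange(logits.shape[0])                # correct pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

def zero_shot_predict(image_emb, label_text_emb):
    """Pick, for each image, the label whose text embedding is most similar."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    label_text_emb = label_text_emb / np.linalg.norm(label_text_emb, axis=1, keepdims=True)
    return (image_emb @ label_text_emb.T).argmax(axis=1)
```

In practice `label_text_emb` would come from encoding one text per class label with the text encoder, so no classifier has to be trained for the new label set.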