0x451 Representations
1. Classical Features
1.1. SIFT
1.2. SURF
1.3. BRIEF
Model (BRIEF, Binary Robust Independent Elementary Features)
2. Self-Supervised Features
This section follows the Cookbook of Self-Supervised Learning.
2.1. Deep Metric Learning Model
SimCLR/NNCLR/MeanSHIFT/SCL
The deep metric learning family encourages similarity between semantically transformed versions of an image.
Model (SimCLR) See the illustration in this blog
The learning steps are:
- apply data augmentation (e.g. crop, color distortion) to every data point \(x\) in the batch, resulting in a positive pair \((\tilde{x}_i, \tilde{x}_j)\)
- use an encoder (e.g. ResNet) and pooling (e.g. mean pooling) to encode \(h_i = f(\tilde{x}_i)\)
- project the representations \(z_i = g(h_i) = W_2\,\text{ReLU}(W_1 h_i)\) for the contrastive loss (using \(z_i\) works better than \(h_i\) here)
- apply the contrastive loss
Suppose the minibatch contains \(N\) examples; augmentation gives \(2N\) data points. The loss function for a positive pair \((i, j)\) is defined as
\[
\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}
\]
where \(\text{sim}(u, v) = u^\top v / (\lVert u \rVert \lVert v \rVert)\) is the cosine similarity and \(\tau\) is a temperature.
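A minimal PyTorch sketch of this NT-Xent loss, assuming the projected views are stacked so that rows \(2k\) and \(2k+1\) form a positive pair; the function name and temperature value are illustrative:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross entropy) loss.

    z: (2N, d) projected views, where rows 2k and 2k+1 are the two
    augmented views of example k (a positive pair).
    """
    z = F.normalize(z, dim=1)                      # cosine similarity becomes a dot product
    sim = z @ z.t() / temperature                  # (2N, 2N) similarity matrix
    n2 = z.shape[0]
    self_mask = torch.eye(n2, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)         # exclude the k = i term
    pos = torch.arange(n2, device=z.device) ^ 1    # positive of row 2k is 2k+1 and vice versa
    return F.cross_entropy(sim, pos)               # averaged over all 2N rows

# toy usage: N = 4 examples -> 2N = 8 projected views of dimension 128
z = torch.randn(8, 128)
print(nt_xent_loss(z).item())
```

Averaging the cross entropy over all \(2N\) rows already covers both \(\ell_{i,j}\) and \(\ell_{j,i}\).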
2.2. Self-Distillation Model
This family feeds two different views to two encoders and maps one to the other by means of a predictor.
Model (DINO) self-distillation with no labels
Model (MoCo v3)
Model (BYOL, Bootstrap Your Own Latent)
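A rough sketch of this setup in the style of BYOL, assuming a generic backbone; the MLP sizes, momentum value, and class name are illustrative, and only one direction of the usual symmetrized loss is shown:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(dim_in, dim_hidden, dim_out):
    return nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.BatchNorm1d(dim_hidden),
                         nn.ReLU(), nn.Linear(dim_hidden, dim_out))

class SelfDistillation(nn.Module):
    """Online encoder + predictor, mapped onto an EMA (momentum) target encoder."""

    def __init__(self, backbone, feat_dim=512, proj_dim=256, momentum=0.99):
        super().__init__()
        self.online_encoder = nn.Sequential(backbone, mlp(feat_dim, 1024, proj_dim))
        self.target_encoder = copy.deepcopy(self.online_encoder)
        for p in self.target_encoder.parameters():   # teacher gets no gradients
            p.requires_grad_(False)
        self.predictor = mlp(proj_dim, 1024, proj_dim)
        self.momentum = momentum

    @torch.no_grad()
    def update_target(self):
        # EMA update of the target (teacher) weights
        for po, pt in zip(self.online_encoder.parameters(),
                          self.target_encoder.parameters()):
            pt.mul_(self.momentum).add_(po, alpha=1 - self.momentum)

    def forward(self, view1, view2):
        # predict the teacher's projection of view2 from the student's projection of view1
        p1 = self.predictor(self.online_encoder(view1))
        with torch.no_grad():                        # stop-gradient branch
            z2 = self.target_encoder(view2)
        return 2 - 2 * F.cosine_similarity(p1, z2, dim=-1).mean()

# toy usage with a tiny stand-in backbone producing 512-d features
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
model = SelfDistillation(backbone)
v1, v2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
loss = model(v1, v2)
loss.backward()
model.update_target()
```

DINO follows the same pattern but drops the predictor and instead centers and sharpens the teacher outputs to avoid collapse.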
2.3. Canonical Correlation Analysis Model
2.4. Masked Image Models
2.4.1. Pixel Masked Model
Model (Masked Autoencoder, MAE) The core ideas are
- the encoder operates only on the visible subset of patches
- mask a high proportion of the image patches (e.g. 75\%)
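A small sketch of the random-masking step, assuming the image has already been split into patch embeddings; apart from the 75% ratio noted above, the names and shapes are illustrative:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch embeddings; the encoder sees only these.

    patches: (batch, num_patches, dim) patch embeddings.
    Returns the visible patches, a 0/1 mask (1 = masked out), and the indices
    needed to restore the original patch order for the decoder.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))

    noise = torch.rand(b, n, device=patches.device)   # one random score per patch
    ids_shuffle = noise.argsort(dim=1)                # lowest scores are kept
    ids_restore = ids_shuffle.argsort(dim=1)

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    mask = torch.ones(b, n, device=patches.device)
    mask.scatter_(1, ids_keep, 0)                     # 0 = visible, 1 = masked
    return visible, mask, ids_restore

# toy usage: 196 patches (14x14) with 768-d embeddings; the encoder sees only 49
x = torch.randn(2, 196, 768)
visible, mask, ids_restore = random_masking(x)
print(visible.shape, mask.mean().item())              # torch.Size([2, 49, 768]) 0.75
```

The decoder later inserts a learned mask token at the masked positions (using ids_restore) and reconstructs the pixels of those patches.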
2.4.2. Token Masked Model
Model (BEiT, BERT Pre-Training of Image Transformers) masks image patches and predicts the corresponding discrete visual tokens produced by an image tokenizer
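A rough sketch of this token-masked objective, assuming a frozen image tokenizer supplies the discrete token ids (replaced here by random placeholders); the vocabulary size, depth, and class name are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedTokenPrediction(nn.Module):
    """Predict discrete visual-token ids at masked patch positions."""

    def __init__(self, dim=768, vocab_size=8192, depth=2, heads=12):
        super().__init__()
        self.mask_embedding = nn.Parameter(torch.zeros(dim))    # learnable [MASK] patch
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)                  # classify into the visual vocabulary

    def forward(self, patch_emb, token_ids, mask):
        # patch_emb: (b, n, dim) patch embeddings
        # token_ids: (b, n) ids from a frozen image tokenizer
        # mask:      (b, n) bool, True where the patch is masked
        x = torch.where(mask.unsqueeze(-1), self.mask_embedding, patch_emb)
        logits = self.head(self.encoder(x))
        return F.cross_entropy(logits[mask], token_ids[mask])   # loss only on masked positions

# toy usage: random placeholder ids stand in for the tokenizer output
model = MaskedTokenPrediction()
patch_emb = torch.randn(2, 196, 768)
token_ids = torch.randint(0, 8192, (2, 196))
mask = torch.rand(2, 196) < 0.4
print(model(patch_emb, token_ids, mask).item())
```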
2.4.3. High-level Feature Masked Model
Model (BEiT v2) The core ideas are
- the decoder attempts to reconstruct semantic features distilled from teacher models (as shown in the 1st figure)
- the final-layer CLS token is concatenated with an intermediate layer's patch outputs to reconstruct the features (as shown in the 2nd figure); this encourages information to aggregate into the CLS token
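A loose sketch of the second idea, following the note's description rather than the exact BEiT v2 architecture: the final-layer CLS token is concatenated with intermediate-layer patch tokens and a shallow decoder reconstructs the teacher features at masked positions. Names, depth, and the cosine-style loss are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLSAggregationHead(nn.Module):
    """Reconstruct teacher features from the final CLS token plus intermediate patch tokens."""

    def __init__(self, dim=768, heads=12, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, depth)

    def forward(self, cls_final, patches_intermediate, teacher_feats, mask):
        # cls_final:            (b, 1, dim) final-layer CLS token
        # patches_intermediate: (b, n, dim) patch outputs of an intermediate layer
        # teacher_feats:        (b, n, dim) semantic features from a frozen teacher
        # mask:                 (b, n) bool, True at masked patch positions
        x = torch.cat([cls_final, patches_intermediate], dim=1)
        pred = self.decoder(x)[:, 1:]    # drop the CLS slot, keep the patch predictions
        # cosine-style reconstruction loss on the masked positions
        return 1 - F.cosine_similarity(pred[mask], teacher_feats[mask], dim=-1).mean()

# toy usage
head = CLSAggregationHead()
loss = head(torch.randn(2, 1, 768), torch.randn(2, 196, 768),
            torch.randn(2, 196, 768), torch.rand(2, 196) < 0.4)
print(loss.item())
```

Because the patch tokens come from an intermediate layer, global information has to flow through the final CLS token for the reconstruction to succeed.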
3. Visual Language Modeling
3.1. Multimodal Encoder
Model (VisualBERT) pretrain the model to align image regions with the input text.
Each image embedding corresponds to a bounding region detected by an object detector. The embedding is the sum of a CNN feature embedding, a segment embedding (indicating that it is an image embedding), and a position embedding.
The pretraining tasks are
- MLM with the image: only text tokens are masked
- sentence-image prediction: whether the sentence and the image are aligned
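A minimal sketch of the image-embedding construction described above (sum of a CNN region-feature projection, a segment embedding, and a position embedding); the dimensions and class name are illustrative:

```python
import torch
import torch.nn as nn

class VisualBERTImageEmbedding(nn.Module):
    """Embed detected regions as feature embedding + segment embedding + position embedding."""

    def __init__(self, region_feat_dim=2048, hidden_dim=768, max_regions=36):
        super().__init__()
        self.feature_proj = nn.Linear(region_feat_dim, hidden_dim)  # CNN region features
        self.segment_emb = nn.Embedding(2, hidden_dim)              # 0 = text, 1 = image
        self.position_emb = nn.Embedding(max_regions, hidden_dim)

    def forward(self, region_feats):
        # region_feats: (batch, num_regions, region_feat_dim) from an object detector
        b, r, _ = region_feats.shape
        image_segment = torch.ones(b, r, dtype=torch.long, device=region_feats.device)
        positions = torch.arange(r, device=region_feats.device)
        return (self.feature_proj(region_feats)
                + self.segment_emb(image_segment)
                + self.position_emb(positions))

# toy usage: 36 detected regions with 2048-d detector features
emb = VisualBERTImageEmbedding()(torch.randn(2, 36, 2048))
print(emb.shape)   # torch.Size([2, 36, 768])
```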
Model (VL-BERT)
Model (SimVLM, prefix language model) Images are treated as a prefix that receives bidirectional attention; the remaining text tokens have an autoregressive structure.
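A small sketch of the resulting attention pattern: bidirectional within the image prefix, causal over the text suffix; the function name is illustrative:

```python
import torch

def prefix_lm_attention_mask(prefix_len, total_len):
    """Boolean attention mask for a prefix LM (True = attention allowed).

    The first prefix_len positions (image tokens) attend to each other
    bidirectionally; the remaining (text) positions attend causally, but can
    always see the whole prefix.
    """
    allowed = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))  # causal base
    allowed[:, :prefix_len] = True   # every position sees the full prefix;
                                     # the prefix still cannot see the text suffix (tril keeps it False)
    return allowed

# toy usage: 3 image (prefix) tokens followed by 4 text tokens
print(prefix_lm_attention_mask(3, 7).int())
```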
3.2. Unimodal Encoder
Model (CLIP, Contrastive Language-Image Pretraining)
Jointly train a text encoder and an image encoder so that correctly aligned (text, image) pairs get higher similarity scores than mismatched pairs, using a symmetric cross-entropy loss over the similarity matrix.
This enables zero-shot prediction: embed each candidate label as text and pick the label whose text embedding is most similar to the image embedding.
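A minimal sketch of the symmetric contrastive objective and the zero-shot step, with illustrative names and a placeholder temperature:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross entropy over the image-text similarity matrix.

    image_emb, text_emb: (N, d) embeddings of N aligned (image, text) pairs;
    pair i is the positive for row/column i.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, labels)              # image -> text direction
            + F.cross_entropy(logits.t(), labels)) / 2   # text -> image direction

def zero_shot_predict(image_emb, label_text_emb):
    """Pick the label whose text embedding is most similar to the image embedding."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(label_text_emb, dim=-1).t()
    return sims.argmax(dim=-1)

# toy usage with random embeddings standing in for encoder outputs
images, texts = torch.randn(8, 512), torch.randn(8, 512)
print(clip_loss(images, texts).item())
print(zero_shot_predict(torch.randn(4, 512), torch.randn(10, 512)))  # 10 candidate labels
```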