# 0x451 Representations

## 1. Classical Features

### 1.3. BRIEF

Model (BRIEF, Binary Robust Independent Elementary Features)

## 2. Self-Supervised Features

This section follows the Cookbook of SSL.

### 2.1. Deep Metric Learning Model

SimCLR / NNCLR / MeanShift / SCL

The deep metric learning family encourages similarity between semantically transformed versions of an image.

Model (SimCLR) See the illustration in this blog.

Learning steps are:

• apply data augmentation (e.g. crop, distortion) to every data point $$x$$ in the batch, producing a positive pair $$(\tilde{x}_i, \tilde{x}_j)$$
• use an encoder (e.g. ResNet) with pooling (e.g. mean pooling) to encode $$h_i = f(\tilde{x}_i)$$
• project the representations, $$z_i = g(h_i) = W_2(\text{ReLU}(W_1h_i))$$, before computing the contrastive loss ($$z_i$$ works better than $$h_i$$ for the contrastive loss)
• apply the contrastive loss

Suppose the minibatch contains $$N$$ examples; augmentation then yields $$2N$$ data points. The loss for a positive pair $$(i,j)$$ is defined as

$$l_{i,j} = - \log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} 1_{k \neq i}\exp(\text{sim}(z_i, z_k)/\tau)}$$
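The loss above can be sketched in NumPy as follows. This is a minimal sketch, not the reference implementation: it assumes cosine similarity for $$\text{sim}(\cdot,\cdot)$$ and that rows $$2k$$ and $$2k{+}1$$ of the batch are the two views of example $$k$$.

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent loss over 2N projections, where z[2k] and z[2k+1]
    are the two augmented views of example k (illustrative layout)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine sim = dot of unit vectors
    sim = z @ z.T / tau                               # pairwise similarities / temperature
    n = z.shape[0]
    np.fill_diagonal(sim, -np.inf)                    # drop sim(z_i, z_i), i.e. 1_{k != i}
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = np.arange(n) ^ 1                            # index of i's positive: 0<->1, 2<->3, ...
    return -log_prob[np.arange(n), pos].mean()
```

Averaging over all $$2N$$ anchors gives the batch loss; each anchor's positive appears once in the numerator and all other $$2N-1$$ points appear in the denominator.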

### 2.2. Self-Distillation Model

This family feeds two different views to two encoders and maps one to the other by means of a predictor.

Model (DINO) self-distillation with no labels

Model (MoCo v3)

Model (BYOL (bootstrap your own latent))
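A component shared by the models above (the teacher in DINO, the momentum encoder in MoCo v3, the target network in BYOL) is a second network whose weights are an exponential moving average of the student's. A minimal sketch of that update, with illustrative parameter lists:

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """EMA update of the teacher/target network: the teacher slowly
    tracks the student and receives no gradients itself."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

With a momentum close to 1, the teacher changes slowly, which stabilizes the targets the student is trained to match.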

### 2.3. Canonical Correlation Analysis Model

### 2.4. Masked Image Model

Model (Masked Autoencoder, MAE) The core ideas are

• mask a high proportion of the image patches (e.g. 75%)
• the encoder operates only on the visible subset of patches
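The masking step above can be sketched as follows, assuming the image is already split into flattened patches; `random_masking` is an illustrative helper name, not MAE's exact API.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches; only these are fed to the
    encoder. patches: (num_patches, dim). Returns the visible patches
    and their (sorted) indices, so the decoder can restore positions."""
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))     # e.g. 49 of 196 patches at 75% masking
    perm = rng.permutation(n)              # random shuffle of patch indices
    keep = np.sort(perm[:n_keep])          # indices of visible patches
    return patches[keep], keep
```

Because the encoder sees only ~25% of the patches, pre-training is much cheaper than running the full ViT on every patch.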

Model (BEiT) BERT Pre-Training of Image Transformers: masked image modeling in the style of BERT's masked language model, predicting discrete visual tokens for the masked patches.

#### 2.4.3. High-level Feature Masked Model

Model (BEiT v2) The core ideas are

• the decoder attempts to reconstruct semantic features distilled from a teacher model (as shown in the 1st figure)
• the final-layer CLS token is concatenated with an intermediate layer's patch outputs to reconstruct the features (as shown in the 2nd figure); this aggregates global information into the CLS token
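The concatenation in the second idea can be sketched as below; shapes and names are illustrative assumptions, not the paper's exact ones.

```python
import numpy as np

def cls_bottleneck_input(cls_final, patch_intermediate):
    """Concatenate the final-layer CLS token with an intermediate
    layer's patch outputs; the result feeds a shallow decoder head, so
    the reconstruction signal must flow through the CLS token."""
    cls_final = cls_final[None, :]                       # (1, dim)
    return np.concatenate([cls_final, patch_intermediate], axis=0)
```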

## 3. Visual Language Modeling

### 3.1. Multimodal Encoder

Model (VisualBERT) pretrains the model to align image regions with the input text.

Each image embedding corresponds to a bounding region detected by an object detector. The embedding is the sum of a CNN feature embedding, a segment embedding (indicating that it is an image embedding), and a position embedding.
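The embedding sum described above can be sketched as follows; the class name, dimensions, and initialization are illustrative assumptions, not VisualBERT's exact implementation.

```python
import numpy as np

class VisualEmbedder:
    """Sketch of a VisualBERT-style visual token embedding: each detected
    region's CNN feature is projected to the model dimension and summed
    with a segment embedding (marking the visual modality) and a
    position embedding."""
    def __init__(self, feat_dim, hidden_dim, max_regions, rng=None):
        rng = rng or np.random.default_rng(0)
        self.proj = rng.normal(scale=0.02, size=(feat_dim, hidden_dim))
        self.segment = rng.normal(scale=0.02, size=(hidden_dim,))       # "image" segment
        self.position = rng.normal(scale=0.02, size=(max_regions, hidden_dim))

    def __call__(self, region_feats):
        """region_feats: (num_regions, feat_dim) detector CNN features."""
        n = region_feats.shape[0]
        return region_feats @ self.proj + self.segment + self.position[:n]
```

The resulting visual tokens can then be fed into the transformer alongside the text token embeddings.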