
0x443 Neural Model

1. Model

1.1. CNN Model

1.1.1. R-CNN

Model (R-CNN, Region-Based CNN)

Step 1: find region proposals (for example, with an external algorithm)

Region proposal finds a small set of boxes that is likely to cover all objects. For example, Selective Search gives about 2000 regions per image.

Step 2: warp each image region to 224x224 and run a convolutional net on it to predict the class distribution and a transformation of the bounding box \((t_x, t_y, t_h, t_w)\)

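\[b_x = p_x + p_w t_x\]
\[b_y = p_y + p_h t_y\]
\[b_w = p_w \exp(t_w)\]
\[b_h = p_h \exp(t_h)\]

Here \(p\) is the proposal box and \(b\) is the predicted box. Note that width and height are transformed on an exponential scale, so the predicted size always stays positive.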
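The box transformation can be sketched in a few lines of Python. This is only an illustration: the function name `decode_box` and the (center x, center y, width, height) layout of the proposal are assumptions of this sketch, and the `b_y` and `b_h` lines follow the same pattern as the `b_x` and `b_w` equations above.

```python
import math

def decode_box(proposal, t):
    """Apply predicted offsets t = (t_x, t_y, t_h, t_w) to a proposal box.

    The proposal is (p_x, p_y, p_w, p_h); this layout is an assumption.
    """
    p_x, p_y, p_w, p_h = proposal
    t_x, t_y, t_h, t_w = t
    b_x = p_x + p_w * t_x          # shift the center, scaled by proposal width
    b_y = p_y + p_h * t_y          # shift the center, scaled by proposal height
    b_w = p_w * math.exp(t_w)      # exp keeps the width positive
    b_h = p_h * math.exp(t_h)      # exp keeps the height positive
    return (b_x, b_y, b_w, b_h)

# All-zero offsets leave the proposal unchanged.
print(decode_box((100.0, 80.0, 40.0, 20.0), (0.0, 0.0, 0.0, 0.0)))
```

A zero transformation returns the proposal itself, which is why the network only has to learn small corrections around each proposal.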

One problem that arises here is that object detectors often output many overlapping detections. The solution is to postprocess the raw detections with Non-Max Suppression (NMS): greedily pick the highest-scoring box, delete every remaining box whose overlap with it is high enough, and repeat on the boxes that are left. An open issue is that NMS may also eliminate good boxes, for example when two real objects overlap heavily.
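A minimal sketch of the greedy procedure, in plain Python. The corner format (x1, y1, x2, y2), the helper names `iou` and `nms`, and the default threshold of 0.5 are assumptions of this example, not something fixed by the notes.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS; returns indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)            # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU 81/119 (about 0.68), so it is suppressed at threshold 0.5 but would survive at 0.7. In practice NMS is usually run separately per class.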

Model (Fast R-CNN)

One issue with R-CNN is that it is very slow: the full CNN runs once for each of the ~2000 proposals. Fast R-CNN splits the full CNN into two networks, where the first one (the backbone network) runs once per image and is shared by all proposals. Most of the computation happens in the backbone, so sharing it makes the model fast.

Each proposed region takes its part of the shared output of the backbone CNN and forwards it through the second, small CNN.

Model (Faster R-CNN)

With Fast R-CNN, most of the remaining computational cost comes from the region proposals. To make it even faster, Faster R-CNN uses a Region Proposal Network (RPN) to predict proposals from the features.

1.1.2. Modern CNN

Model (ConvNeXt)

1.2. Vision Transformer

Model (Vision Transformer, ViT) uses a Transformer instead of a CNN

  • the image is split into patches: a 224x224 image is split into patches of 16x16 pixels, which gives a 14x14 grid (196 patches). Each flattened patch is linearly projected and plays the role of a word embedding, so there are 196 "words" in total.
  • a learnable embedding (like BERT's class token) is prepended to the patch sequence.
  • position embeddings (trainable 1D position embeddings) are added.
  • the model can be pretrained in a self-supervised way with masked patch prediction.
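The patch pipeline can be sketched with NumPy, assuming the standard ViT-B/16 style setup in which each patch is 16x16 pixels, so a 224x224 RGB image yields 14x14 = 196 patches of 16*16*3 = 768 values. The function name `patchify`, the small `embed_dim`, and the random weights are placeholders for this illustration; in a real model the projection, class token, and position embedding are learned.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image)                       # one row per patch

embed_dim = 64                                  # hypothetical, small for the demo
projection = rng.normal(size=(patches.shape[1], embed_dim)) * 0.02
class_token = np.zeros((1, embed_dim))          # learnable in a real model
tokens = np.concatenate([class_token, patches @ projection], axis=0)
position = rng.normal(size=tokens.shape) * 0.02  # stands in for the 1D position embedding
tokens = tokens + position

print(patches.shape, tokens.shape)  # (196, 768) (197, 64)
```

The sequence has 197 entries because the class token is prepended to the 196 patch embeddings.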


Model (DeiT, Data-efficient Image Transformer) trains a ViT student by distilling information from a teacher model; in the paper a convnet teacher works better than a Transformer teacher


1.2.1. Hierarchical Model

Model (swin transformer)

Swin Transformer block

  • attention is limited to a local window
  • the windows are shifted between consecutive layers, so information can flow across window boundaries
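A small NumPy sketch of the two partitions. The helper name `window_partition`, the 8x8 toy feature map, and the window size of 4 are assumptions of this example; the shift of half a window via a cyclic roll mirrors how shifted windows are commonly implemented, and the attention mask needed for the wrapped-around regions is omitted here.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) feature map into non-overlapping windows of tokens."""
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)

window = 4
feature = np.arange(8 * 8, dtype=np.float32).reshape(8, 8, 1)

# Layer l: regular windows; attention would run inside each window only.
regular = window_partition(feature, window)

# Layer l+1: cyclically shift by half a window, then partition again.
shift = window // 2
rolled = np.roll(feature, shift=(-shift, -shift), axis=(0, 1))
shifted = window_partition(rolled, window)

print(regular.shape, shifted.shape)  # (4, 16, 1) (4, 16, 1)
```

Each of the 4 windows holds 16 tokens, so attention cost grows with the window size rather than with the full image size.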


These blocks are organized into a hierarchy of stages; between stages, a layer merges neighboring patches.


Model (HIPT, Hierarchical Image Pyramid Transformer) A Transformer for high-resolution images that uses a hierarchical architecture


2. Tasks

2.1. Classification

2.2. Detection

Task (Object Detection) The task of object detection is as follows:

  • input: an RGB image
  • output: a set of detected objects; for each object we have a category label (from a fixed, known set of categories) and a bounding box (x, y, width, height)

The challenges of the object detection task are:

  • multiple outputs: need to output a variable number of objects per image
  • multiple output types: need to predict both what and where
  • large images: detection often needs a higher resolution, ~800x600 (versus 224x224 for classification)

A simple approach is the sliding window: apply a CNN to many different crops of the image and classify each crop. However, this generates far too many boxes to be feasible.
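To see why the sliding window is infeasible, one can count every axis-aligned box with integer pixel coordinates in an image. The function name `count_boxes` is just for this illustration; the count follows from choosing a box height h (with H - h + 1 vertical positions) and a box width w (with W - w + 1 horizontal positions).

```python
def count_boxes(height, width):
    """Number of axis-aligned pixel boxes that fit in a height x width image."""
    vertical = height * (height + 1) // 2    # sum over h of (H - h + 1)
    horizontal = width * (width + 1) // 2    # sum over w of (W - w + 1)
    return vertical * horizontal

print(count_boxes(600, 800))  # 57768120000
```

For an 800x600 image this is roughly 58 billion candidate crops, far more than any CNN can evaluate, which motivates region proposals.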

Metric (IoU, Intersection over Union) measures the overlap between the ground-truth box and the predicted box. IoU > 0.5 is "decent", > 0.7 is "pretty good", and > 0.9 is "almost perfect"

\[\frac{\text{Area of Intersection}}{\text{Area of Union}}\]
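A minimal sketch of the formula for boxes in the (x, y, width, height) format used in the task definition. It assumes (x, y) is the top-left corner, and the name `iou_xywh` is chosen for this example.

```python
def iou_xywh(box_a, box_b):
    """IoU of two boxes given as (x, y, width, height), (x, y) = top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 10, 10)
print(iou_xywh(ground_truth, prediction))  # 50 / 150, about 0.33
```

A prediction shifted by half the box width already drops to an IoU of one third, which shows how strict a threshold such as 0.7 is.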

Metric (mAP, Mean Average Precision) computes the AP (area under the precision-recall curve) for each category and takes the mean over categories. For COCO mAP, this is repeated at several IoU thresholds and the results are averaged.
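A rough sketch of AP for a single category, in plain Python. It assumes each detection has already been matched against the ground truth at some IoU threshold (so we only know whether it is a true positive) and that there is at least one ground-truth object. The name `average_precision` and the monotone precision envelope are choices of this example; benchmark implementations differ in interpolation details.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one category."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:                     # sweep detections by decreasing score
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (a common interpolation).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p     # rectangle under the curve
        prev_recall = r
    return ap

ap_cat = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
ap_dog = average_precision([0.6, 0.5], [True, True], num_ground_truth=2)
mean_ap = (ap_cat + ap_dog) / 2         # mAP is the mean over categories
print(ap_cat, ap_dog, mean_ap)
```

The category names are made up for the demo. A false positive ranked between two true positives lowers AP, while detections that are all correct and cover every ground-truth object give an AP of 1.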

2.3. Segmentation

3. Generative Model

3.1. GAN

3.2. Autoregressive Model

Model (Image GPT) uses GPT to model the pixel sequence autoregressively. The pretrained model can be evaluated on a classification task by fitting a linear classifier on the features of some hidden layer (middle layers tend to perform better).


3.2.1. Quantized Model

Model (VQVAE) See the VQVAE section

Model (VQGAN)

  • encoder: a CNN learns a context-rich vocabulary of local image parts
  • sequential model: a Transformer models the long-range interactions between vocabulary entries, like a language model
  • decoder: a CNN decoder
  • discriminator: promotes high-quality reconstruction. It ensures the vocabulary captures important local structure, which alleviates the need to model low-level features with the Transformer


Model (ViT-VQGAN) Improved VQGAN. Blog Article

Architecture differences from VQGAN:

  • encoder: CNN -> ViT
  • decoder: CNN -> ViT



Metric (FID, Fréchet Inception Distance) assesses the quality of a generated dataset (and thus of the underlying generative model) using the Wasserstein distance between

  • a multivariate Gaussian fitted to the Inception network activations of generated images
  • a multivariate Gaussian fitted to the Inception network activations of real images

Recall that the Wasserstein-2 distance between two Gaussians has a closed form.
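For reference, the closed form can be written as follows, where \((\mu_g, \Sigma_g)\) and \((\mu_r, \Sigma_r)\) (notation introduced here) are the mean and covariance of the Gaussians fitted to the generated and real activations:

\[\text{FID} = \lVert \mu_g - \mu_r \rVert_2^2 + \operatorname{Tr}\left(\Sigma_g + \Sigma_r - 2\,(\Sigma_g \Sigma_r)^{1/2}\right)\]

The first term compares the means and the second compares the covariances; the value is zero when the two Gaussians coincide.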

Note that FID is not fully aligned with perceptual quality. Low-level image processing (e.g. resizing, compression) can also affect it, according to this work

Here is an implementation on GitHub

4. Multimodal Model

4.1. Text-to-Image Generation

Model (DALLE-2)

Model (Parti)


5. Reference

[0] original papers. All images are taken from the original papers or blogs

[1] EECS 498-007 Deep Learning for Computer Vision

[2] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.