0x560 Representation

1. Image Embedding

1.1. Vision Transformer

Model (Vision Transformer, ViT) uses a transformer instead of a CNN

  • An image is split into fixed-size patches: a 224x224 image is split into 14x14 = 196 patches of 16x16 pixels each. Each patch is flattened (16x16x3 = 768 values for RGB) and linearly projected, so it plays the role of a word embedding; there are 196 "words" in total.
  • A learnable embedding (like BERT's class token) is prepended to the patch sequence.
  • Trainable 1D position embeddings are added to the token sequence.
  • Can also be pre-trained in a self-supervised way with masked patch prediction.
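A minimal NumPy sketch of how the input sequence described above could be assembled. The function names (`patchify`, `vit_input_sequence`), the embedding width `dim = 64`, and the random weights are illustrative assumptions, not taken from any reference implementation.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> (gh * gw, p * p * C)
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch_size * patch_size * c)

def vit_input_sequence(image, patch_size, w_proj, cls_token, pos_embed):
    """Project patches, prepend the class token, add position embeddings."""
    patches = patchify(image, patch_size)                  # (N, p*p*C)
    tokens = patches @ w_proj                              # (N, D)
    tokens = np.concatenate([cls_token, tokens], axis=0)   # (N + 1, D)
    return tokens + pos_embed                              # (N + 1, D)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
dim = 64                                                   # assumed width, for illustration only
w_proj = rng.standard_normal((16 * 16 * 3, dim)) * 0.02    # stands in for the learned projection
cls_token = np.zeros((1, dim))                             # stands in for the learned class token
pos_embed = rng.standard_normal((197, dim)) * 0.02         # one position per token, incl. class token
seq = vit_input_sequence(image, 16, w_proj, cls_token, pos_embed)
print(seq.shape)
```

With 16x16 patches, a 224x224 RGB image gives 196 patch tokens of 768 raw values each, and 197 tokens once the class token is prepended.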

vit

Model (DeiT, Data-efficient Image Transformer) trains a ViT student by distilling information from a teacher model (a convnet in the original paper) through an extra distillation token
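As a rough illustration of the distillation objective, the sketch below follows the hard-label variant: the class-token head is trained on the true labels and the distillation-token head on the teacher's predicted labels, with equal weight. The function names and toy logits are assumptions for illustration only.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """0.5 * CE(class head, true labels) + 0.5 * CE(distill head, teacher labels)."""
    teacher_labels = teacher_logits.argmax(axis=1)          # teacher's hard decision
    return (0.5 * cross_entropy(cls_logits, labels)
            + 0.5 * cross_entropy(dist_logits, teacher_labels))

# Toy batch: 2 samples, 4 classes (values are made up).
labels = np.array([0, 3])
teacher_logits = np.array([[2.0, 0.1, 0.0, 0.3], [0.2, 0.1, 1.5, 0.4]])
cls_logits = np.zeros((2, 4))
dist_logits = np.zeros((2, 4))
loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
print(loss)
```

The teacher is only queried for its predictions; no gradient flows into it.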

deit

1.1.1. Hierarchical Model

Model (Swin Transformer)

Swin Transformer block

  • Attention is limited to a local window.
  • The windows are shifted between consecutive layers, so information can flow across window boundaries.
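The window partitioning and shifting above can be sketched in NumPy as follows. This assumes a cyclic shift of half the window size, and it omits the attention mask that a full implementation needs for regions that wrap around the border; the function names are illustrative.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) feature map into (num_windows, window * window, C)."""
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window * window, c)

def shifted_window_partition(x, window):
    """Cyclically shift the map by half a window, then partition it."""
    shift = window // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)

feat = np.arange(8 * 8).reshape(8, 8, 1)       # toy 8x8 map, each value is its own index
regular = window_partition(feat, 4)            # attention would run inside each of these windows
shifted = shifted_window_partition(feat, 4)    # next layer: windows straddle the old boundaries
print(regular.shape, shifted.shape)
```

Self-attention is then computed independently inside each window, so the cost grows linearly with the number of windows rather than quadratically with the number of patches.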

swin_transformer

These blocks are grouped into stages that form a hierarchy: between stages, a patch merging layer combines each 2x2 group of neighboring patches, halving the spatial resolution.
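A minimal sketch of a patch merging step, assuming each 2x2 neighborhood is concatenated along the channel axis (4C) and then linearly reduced to 2C. Normalization is omitted, and `w_reduce` is a random stand-in for the learned weight.

```python
import numpy as np

def patch_merging(x, w_reduce):
    """Merge each 2x2 group of patches: (H, W, C) -> (H/2, W/2, out_dim)."""
    h, w, c = x.shape
    assert h % 2 == 0 and w % 2 == 0
    # Gather the four neighbors of every 2x2 block into the channel axis.
    x = x.reshape(h // 2, 2, w // 2, 2, c).transpose(0, 2, 1, 3, 4)
    x = x.reshape(h // 2, w // 2, 4 * c)
    return x @ w_reduce                        # linear reduction, e.g. 4C -> 2C

rng = np.random.default_rng(0)
feat = rng.standard_normal((56, 56, 96))       # assumed stage input size, for illustration
w_reduce = rng.standard_normal((4 * 96, 2 * 96)) * 0.02
merged = patch_merging(feat, w_reduce)
print(merged.shape)
```

Each merge quarters the number of tokens while widening the channels, which gives the pyramid of feature maps that CNN-style backbones provide.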

swin_transformer

Model (HIPT, Hierarchical Image Pyramid Transformer) a transformer for high-resolution images that aggregates features hierarchically

hipt

2. Image Tokenizer