Xinjian Li

0x560 Representation

1. Image Embedding
- 1.1. Vision Transformer
  - 1.1.1. Hierarchical Model
2. Image Tokenizer

1. Image Embedding

1.1. Vision Transformer

Model (Vision Transformer, ViT): applies a standard transformer directly to image patches instead of using a CNN.

The image is split into non-overlapping patches. A 224x224 image with 16x16-pixel patches gives a 14x14 grid, i.e. 196 patches. Each patch (16x16x3 = 768 values) is flattened and linearly projected, so it plays the role of a word embedding, and the input sequence has 196 "words".
A learnable embedding (like BERT's [class] token) is prepended to the patch sequence; its output state serves as the image representation.
Learnable 1D position embeddings are added to the patch embeddings.
ViT can also be pretrained in a self-supervised way with masked patch prediction.

vit
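The input pipeline above can be sketched in NumPy. This is a minimal illustration, not the paper's code: the projection matrix, class token, and position embeddings are random stand-ins for learned parameters, and the helper names `patchify` and `vit_embed` are hypothetical.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), row-major over the patch grid.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh*gw, patch*patch*c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def vit_embed(image, patch, w_proj, cls_token, pos_embed):
    """Build the ViT input sequence: [class] token + projected patches, plus position embeddings."""
    tokens = patchify(image, patch) @ w_proj                       # (N, D) patch embeddings
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)  # prepend [class] -> (N+1, D)
    return tokens + pos_embed                                      # learnable 1D position embeddings

rng = np.random.default_rng(0)
H = W = 224; P = 16; C = 3; D = 768
num_patches = (H // P) * (W // P)                   # 14 * 14 = 196
image = rng.standard_normal((H, W, C))
w_proj = rng.standard_normal((P * P * C, D)) * 0.02  # random stand-in for the learned projection
cls_token = np.zeros(D)                              # stand-in for the learned [class] embedding
pos_embed = rng.standard_normal((num_patches + 1, D)) * 0.02

seq = vit_embed(image, P, w_proj, cls_token, pos_embed)
print(patchify(image, P).shape)  # (196, 768)
print(seq.shape)                 # (197, 768)
```

The sequence length is 197 because of the prepended [class] token; the transformer encoder then processes this sequence exactly as it would a sentence of word embeddings.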

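Model (DeiT, Data-efficient image Transformer): trains a ViT with knowledge distillation from a teacher model, using an extra distillation token added to the sequence alongside the class token. In the paper, a CNN teacher worked better than a transformer teacher.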

deit
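A minimal sketch of DeiT's hard-label distillation objective, assuming the student already produced two sets of logits (one from the class-token head, one from the distillation-token head) and the teacher's logits are given. The model itself is not shown, and the function names here are hypothetical.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logits, labels):
    """Mean cross-entropy for integer labels; logits has shape (B, K)."""
    return -log_softmax(logits)[np.arange(len(labels)), labels].mean()

def deit_hard_distill_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Hard-label distillation: the class-token head fits the true labels,
    the distillation-token head fits the teacher's argmax predictions."""
    teacher_labels = teacher_logits.argmax(axis=-1)
    return 0.5 * cross_entropy(cls_logits, labels) + 0.5 * cross_entropy(dist_logits, teacher_labels)

rng = np.random.default_rng(0)
B, K = 4, 10
cls_logits = rng.standard_normal((B, K))
dist_logits = rng.standard_normal((B, K))
teacher_logits = rng.standard_normal((B, K))  # would come from a frozen teacher, e.g. a CNN
labels = rng.integers(0, K, size=B)
print(deit_hard_distill_loss(cls_logits, dist_logits, teacher_logits, labels))
```

Splitting the two targets across two tokens lets the class token follow the ground truth while the distillation token absorbs the teacher's (possibly different) predictions.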

1.1.1. Hierarchical Model

Model (Swin Transformer): a hierarchical vision transformer using shifted windows

Swin Transformer block

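Self-attention is computed only within local, non-overlapping windows of MxM patches, so the cost grows linearly with image size instead of quadratically.
The window partition is shifted (by M/2 patches) between consecutive blocks, so information can flow across window boundaries.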

swin_transformer
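The regular and shifted window partitions can be sketched with NumPy, assuming an (H, W, C) feature map with H and W divisible by the window size M. This sketch only shows which tokens end up in the same window: it omits the attention itself, the attention mask that the real Swin applies to tokens that wrapped around during the cyclic shift, and the reverse shift afterwards. The helper names are hypothetical.

```python
import numpy as np

def window_partition(x, m):
    """Split an (H, W, C) feature map into non-overlapping m x m windows.
    Returns (num_windows, m*m, C); self-attention would run inside each window independently."""
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, m * m, c)

def shifted_windows(x, m):
    """Cyclically shift the map by m // 2 before partitioning, as in every other Swin block,
    so the new windows straddle the previous window boundaries."""
    s = m // 2
    return window_partition(np.roll(x, shift=(-s, -s), axis=(0, 1)), m)

x = np.arange(64).reshape(8, 8, 1)  # toy 8x8 map, 1 channel, value = token index
regular = window_partition(x, 4)    # 4 windows of 16 tokens
shifted = shifted_windows(x, 4)
# tokens 27 (row 3, col 3) and 36 (row 4, col 4) sit in different regular windows...
print(27 in regular[0], 36 in regular[0])  # True False
# ...but share the first window after the shift
print(27 in shifted[0], 36 in shifted[0])  # True True
```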

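The blocks are grouped into stages that form a hierarchy: between stages, a patch merging layer concatenates the features of each 2x2 group of neighboring patches and linearly projects them (4C to 2C), halving the spatial resolution and doubling the channel dimension.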

swin_transformer
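A minimal NumPy sketch of patch merging, assuming an (H, W, C) feature map with even H and W. The reduction matrix is a random stand-in for the learned linear layer, the LayerNorm that Swin applies before it is omitted, and `patch_merging` is a hypothetical name.

```python
import numpy as np

def patch_merging(x, w_reduce):
    """Swin-style patch merging: concatenate each 2x2 group of neighboring patches
    (C -> 4C channels, H x W -> H/2 x W/2), then project 4C -> 2C with a linear layer."""
    x0 = x[0::2, 0::2]  # top-left of each 2x2 group
    x1 = x[1::2, 0::2]  # bottom-left
    x2 = x[0::2, 1::2]  # top-right
    x3 = x[1::2, 1::2]  # bottom-right
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)  # (H/2, W/2, 4C)
    return merged @ w_reduce                            # (H/2, W/2, 2C) for a (4C, 2C) matrix

rng = np.random.default_rng(0)
C = 96                                                 # e.g. Swin-T's stage-1 width
x = rng.standard_normal((56, 56, C))                   # 56x56 tokens: 224x224 image, 4x4 patches
w_reduce = rng.standard_normal((4 * C, 2 * C)) * 0.02  # random stand-in for learned weights
y = patch_merging(x, w_reduce)
print(y.shape)  # (28, 28, 192)
```

Repeating this between stages yields feature maps at 1/4, 1/8, 1/16 and 1/32 of the input resolution, similar to a CNN's feature pyramid.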

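Model (HIPT, Hierarchical Image Pyramid Transformer): a transformer for very high-resolution images (e.g. gigapixel whole-slide pathology images) that stacks ViTs hierarchically, aggregating representations of small patches into progressively larger regions.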

hipt

2. Image Tokenizer