0x554 Scaling
According to this work ("When Do You Need Billions of Words of Pretraining Data?"), an LM requires only ~10M–100M words of pretraining data to learn most syntactic/semantic features, but a much larger corpus (~1B–30B words) is needed to acquire commonsense knowledge.
The scaling-law paper (Kaplan et al., 2020) shows that cross-entropy loss scales as a power law with respect to model size, dataset size, and training compute:
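A sketch of the power-law forms from that paper; the exponents below are its reported approximate fits (for non-embedding parameters N, dataset size D in tokens, and minimum compute C_min), quoted from memory and meant as indicative values:

```latex
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095
L(C_{\min}) \approx \left(\tfrac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}, \qquad \alpha_C^{\min} \approx 0.050
```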
The Chinchilla paper suggests training on roughly 20 tokens per parameter under a fixed compute budget. However, Llama 3 trained on far more data than that and performance continued to improve log-linearly: the 8B and 70B models were trained on up to 15T tokens, whereas the Chinchilla-optimal amount for the 8B model would be only ~200B tokens (see the sketch below).
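A quick back-of-the-envelope comparison, assuming the simple ~20 tokens-per-parameter rule of thumb (the actual Chinchilla fit gives roughly 200B tokens for an 8B model, slightly above this rule):

```python
# Rough comparison of Chinchilla-optimal token counts vs. Llama 3's ~15T-token corpus.
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb (approximation of the paper's fit)

def chinchilla_optimal_tokens(num_params: float) -> float:
    """Approximate compute-optimal training tokens for a given parameter count."""
    return TOKENS_PER_PARAM * num_params

for name, params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]:
    optimal = chinchilla_optimal_tokens(params)
    actual = 15e12  # Llama 3 pretraining corpus: ~15T tokens
    print(f"{name}: Chinchilla-optimal ~{optimal / 1e9:.0f}B tokens, "
          f"actually trained on {actual / 1e12:.0f}T ({actual / optimal:.0f}x over)")
```

For the 8B model this gives ~160B optimal tokens under the 20x rule, so 15T tokens is roughly two orders of magnitude beyond the compute-optimal point, yet the loss kept improving.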
Another relevant work is U-shaped scaling, which shows that a few tasks appear to get worse as models grow (inverse scaling) but actually follow a U-shaped curve: performance recovers at even larger scale, and the dip at medium scale may be explained by a "distractor task" that medium-sized models latch onto.
Check this lecture series