0x430 Foundation

1. Historical Linguistics

Historical linguistics studies how languages change over time and across space.

  • Diachrony: the study of how languages develop and evolve through time
  • Synchrony: the study of a language at a single moment in time

1.1. Language Family

1.1.1. Indo-European

Derived from Proto-Indo-European (PIE)

  • Germanic: English, German (Grimm's law)
  • Romance: Italian, French, Spanish (derived from Vulgar Latin)
  • Hellenic: Greek
  • Balto-Slavic: Russian, Czech, Polish
  • Indo-Iranian: Hindi, Bengali
  • Celtic: Welsh (VSO)
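
Grimm's law, mentioned for Germanic above, is a regular chain of consonant shifts from PIE to Proto-Germanic. The following is a minimal sketch of it as a lookup table, assuming the input is already segmented into consonants; the names `GRIMM_SHIFT` and `apply_grimm` are made up for illustration, and the table ignores labiovelars and later exceptions such as Verner's law.

```python
# Grimm's law as a one-step lookup over pre-segmented PIE consonants.
# Simplified on purpose: no labiovelars, no Verner's law.
GRIMM_SHIFT = {
    "p": "f", "t": "θ", "k": "h",      # voiceless stops -> fricatives
    "b": "p", "d": "t", "g": "k",      # voiced stops -> voiceless stops
    "bʰ": "b", "dʰ": "d", "gʰ": "g",   # voiced aspirates -> plain voiced stops
}


def apply_grimm(segments):
    """Shift each segment exactly once; unknown segments pass through."""
    return [GRIMM_SHIFT.get(seg, seg) for seg in segments]


print(apply_grimm(["p", "t", "k"]))
print(apply_grimm(["b", "d", "g"]))
print(apply_grimm(["bʰ", "dʰ", "gʰ"]))
```

Each segment is looked up once rather than rewritten repeatedly, so a shifted b (now p) is not shifted again to f. This is the pattern behind pairs such as Latin pater / English father and Latin decem / English ten.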

1.1.2. Sino-Tibetan

  • Chinese, Tibetan languages

1.1.3. Niger-Congo

The largest language family by number of languages (~1500) and the third largest by number of speakers (~700m)

  • Bantu: Swahili, Zulu
  • Yoruba
  • Igbo

1.1.4. Afro-Asiatic

  • Arabic (VSO in the classical language), Hebrew (VSO in Biblical Hebrew), Aramaic

1.1.5. Austronesian

  • Often verb-initial (e.g. VOS in Malagasy)
  • Malay, Indonesian, Malagasy (Warlpiri is Pama-Nyungan, not Austronesian)

2. Writing Systems

A grapheme is the smallest functional unit in a writing system. Broadly speaking, it can represent one of three levels of linguistic information: individual sounds (alphabets), syllables (syllabaries) or words (logographic systems).

See the Wikipedia page for more information.

2.1. Abjad

An abjad is a type of alphabetic writing system in which there is one symbol per consonant and vowels are usually not marked. Many Semitic languages are written with abjads, including Arabic and Hebrew.
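
As a rough illustration of "vowels are usually not marked": in Unicode, Hebrew and Arabic vowel points are stored as separate combining marks on top of the consonant letters, so removing those marks gives the everyday unpointed spelling. This is only a sketch; `strip_vowel_points` is a hypothetical helper, and filtering on category `Mn` is a blunt approximation that also removes non-vowel marks such as the Hebrew shin dot or the Arabic shadda.

```python
import unicodedata


def strip_vowel_points(text):
    """Drop nonspacing combining marks (Unicode category Mn).

    Hebrew niqqud and Arabic harakat fall in this category, but so do
    some marks that are not vowels, so this is an approximation.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Hebrew "shalom" with vowel points, written as escapes so that the
# combining marks are visible in the source.
pointed = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"
plain = strip_vowel_points(pointed)

print(pointed, len(pointed))
print(plain, len(plain))
```

The pointed form has seven code points and the plain form four: only the four consonant letters remain.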

2.1.1. Arabic

This playlist is a good resource for learning the Arabic script



2.1.2. Hebrew


2.1.3. Persian


Modern Persian is written in a version of the Arabic script. The Arabic alphabet has 28 letters; the letters marked in red belong only to the Persian version, which has 32 letters.

The 4 additional letters used in Persian are "پ" (P), "چ" (CH), "ژ" (ZH) and "گ" (G).
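
The four extra letters can be inspected with Python's standard `unicodedata` module. This is a small sketch rather than a full description of the Persian alphabet; the constant names are made up for illustration, and the count of 28 is taken from the text above.

```python
import unicodedata

# The four letters listed above, in alphabetical order: pe, che, zhe, gaf.
PERSIAN_EXTRA = "\u067E\u0686\u0698\u06AF"
ARABIC_LETTER_COUNT = 28  # figure from the text above

for letter in PERSIAN_EXTRA:
    # Unicode files these under the Arabic block, hence "ARABIC LETTER ...".
    print(letter, f"U+{ord(letter):04X}", unicodedata.name(letter))

print("Persian letters:", ARABIC_LETTER_COUNT + len(PERSIAN_EXTRA))
```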


2.2. Abugida

In an abugida, consonant-vowel sequences are written as units; each unit is based on a consonant letter, and vowel notation is secondary.
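
Devanagari is not mentioned in these notes, but it is a convenient abugida to sketch this with: a consonant letter carries an inherent vowel, and other vowels are attached to it as dependent signs. The helper `syllable` and the small `VOWEL_SIGNS` table are assumptions made for illustration.

```python
import unicodedata

KA = "\u0915"  # DEVANAGARI LETTER KA, read "ka" with its inherent vowel

# A few dependent vowel signs; they modify the preceding consonant.
VOWEL_SIGNS = {"aa": "\u093E", "i": "\u093F", "u": "\u0941"}


def syllable(consonant, vowel=None):
    """Build one written unit: a consonant plus an optional vowel sign."""
    return consonant if vowel is None else consonant + VOWEL_SIGNS[vowel]


for vowel in (None, "aa", "i", "u"):
    unit = syllable(KA, vowel)
    categories = [unicodedata.category(ch) for ch in unit]
    print(vowel or "a (inherent)", unit, categories)
```

The consonant is a letter (category `Lo`), while every vowel sign is a combining mark (a category starting with `M`), which mirrors the "vowel notation is secondary" part of the definition.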

2.3. Alphabetic

An alphabet is a small set of letters (basic written symbols), each of which roughly represents, or historically represented, a segmental phoneme of a spoken language.

2.3.1. Cyrillic

Check the actual pronunciation at this website

It's easier to remember the pronunciation of Cyrillic letters by associating each one with the corresponding Greek letter (e.g. р -> ρ, rho)
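
The Greek association trick can be kept as a small lookup table. This is a sketch covering only a handful of letters whose shapes come from Greek; the names `GREEK_HINTS` and `hint` and the rough Latin values are my own illustrative choices, not a full transliteration scheme.

```python
# Cyrillic letter -> (Greek counterpart, Greek letter name, rough Latin value)
GREEK_HINTS = {
    "г": ("γ", "gamma", "g"),
    "д": ("δ", "delta", "d"),
    "л": ("λ", "lambda", "l"),
    "п": ("π", "pi", "p"),
    "р": ("ρ", "rho", "r"),
    "ф": ("φ", "phi", "f"),
    "х": ("χ", "chi", "kh"),
}


def hint(cyrillic):
    """Return a one-line mnemonic for a Cyrillic letter in the table."""
    greek, name, latin = GREEK_HINTS[cyrillic.lower()]
    return f"{cyrillic} ~ {greek} ({name}), roughly '{latin}'"


for letter in "рпд":
    print(hint(letter))
```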



2.3.2. Georgian

Georgian has been written in three scripts; Mkhedruli (third line) is the standard one today


2.4. Logographic

2.5. Syllabic

3. Morphology

According to Wikipedia, morphology is the study of words, how they are formed, and their relationship to other words in the same language

3.1. Subword

3.2. Word

4. Syntax

4.1. Generative Grammar

Generative grammar is a small and finite set of rules that can produce a large and potentially infinite number of well-formed structures.
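
A toy example makes the "finite rules, unbounded output" point concrete. The grammar below is a hypothetical three-rule fragment (not taken from any particular theory) in which VP can contain another S, so every extra level of embedding yields new sentences.

```python
from itertools import product

# Hypothetical toy grammar. The second VP rule is recursive through S.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Mary"], ["John"]],
    "VP": [["sleeps"], ["thinks", "that", "S"]],
}


def expand(symbol, budget):
    """All word sequences derivable from `symbol` with at most `budget`
    S nodes nested inside one another."""
    if symbol not in RULES:  # a terminal word
        return [[symbol]]
    if symbol == "S":
        if budget == 0:  # no S expansions left on this path
            return []
        budget -= 1
    results = []
    for production in RULES[symbol]:
        parts = [expand(sym, budget) for sym in production]
        for combo in product(*parts):
            results.append([word for part in combo for word in part])
    return results


for depth in (1, 2, 3):
    print(depth, len(expand("S", depth)))  # 2, then 6, then 14

for words in expand("S", 2):
    print(" ".join(words))
```

The rule set never changes, yet the number of sentences grows with each allowed level of embedding, and without the `budget` cap the enumeration would not terminate.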

There appear to be two main schools within generative grammar:

  • Transformational grammar
  • Monostratal (non-transformational) grammar

There are some interesting papers on what infants know about syntax, for example "What infants know about syntax but couldn't have learned: experimental evidence for syntactic structure at 18 months".

4.1.1. Transformational Grammar

Definition (deep structure vs surface structure) Surface structure is the outward syntactic form of a sentence; deep structure is its abstract underlying syntactic form.

Definition (structural ambiguity) A sentence is structurally ambiguous if it has two distinct underlying interpretations that must be represented differently in deep structure.

Standard Theory (1956 - 1965)

The original model proposed by Chomsky in 1965

Extended Standard Theory, X-bar theory (1965 - 1973)

Grammatical Category

  • tense: present, past
  • number: singular, plural
  • gender: masculine, feminine, neuter
  • voice: passive, applicative
  • valency: number of arguments controlled by a verb
  • aspect: progressive, perfect
  • case: nominative, accusative, genitive, dative, ergative, instrumental, absolutive

Minimalist Program (1990 - )

4.2. Dependency Grammar

4.3. Categorial Grammar

5. Semantics

5.1. Semantic Roles

Semantic roles are roles of entities with respect to the action described by the governing verb

  • Agent: entity that performs the action
  • Patient: entity that receives the action / affected by the action (state changing)
  • Theme: entity described but without state changing
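
A small data structure shows how these roles attach to a verb independently of word order. The sentences and annotations below are hand-written examples, and `Frame` is a hypothetical container, not the format of any particular annotation scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Frame:
    """Roles of the entities around one governing verb."""
    predicate: str
    agent: Optional[str] = None
    patient: Optional[str] = None
    theme: Optional[str] = None


# Hand-annotated examples (illustrative only).
ANNOTATED = {
    "The boy broke the window":
        Frame("break", agent="the boy", patient="the window"),
    "The window was broken by the boy":
        Frame("break", agent="the boy", patient="the window"),
    "The boy read the book":
        Frame("read", agent="the boy", theme="the book"),
}

for sentence, frame in ANNOTATED.items():
    print(sentence, "->", frame)
```

The active and passive sentences differ in syntax but get the same frame. The window changes state (patient), whereas the book is involved without changing state (theme).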

6. Reference


[2] Foundations of Statistical Natural Language Processing