0x44 NLP

Computational Linguistics

POS-Tagging

Parsing

Machine Translation

Statistical MT

IBM Models 1–5

Neural Network MT

Introduction

Attention

Architecture

  • Global Attention: attends over all source positions; essentially the original attention from Bahdanau's paper (sketched below)
  • Input-Feeding Attention: the attentional context vector is fed back into the decoder input at the next time step
  • Local Attention: position-aware attention restricted to a window around a predicted source position

Reference: https://arxiv.org/pdf/1508.04025.pdf
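
A minimal NumPy sketch of global (dot-product) attention as described in the paper above; the variable names and tensor shapes are illustrative assumptions, not taken from a reference implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(decoder_state, encoder_states):
    """Global (dot) attention: score every source position,
    then build the context vector as the weighted sum.

    decoder_state:  (hidden,)         current target hidden state
    encoder_states: (src_len, hidden) all source hidden states
    """
    scores = encoder_states @ decoder_state   # (src_len,) alignment scores
    align = softmax(scores)                   # alignment weights
    context = align @ encoder_states          # (hidden,) context vector
    return context, align

# Input feeding (sketch): concatenate [context; decoder_state] into an
# attentional vector and feed it into the decoder input at the next step.
```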

IME

Information Retrieval

Recommender System

Collaborative Filtering

  • User-based CF: compute the similarity between users from their rating histories, then recommend items that similar users liked (sketched below)
  • Item-based CF: compute the similarity between items from users' rating histories, then recommend items similar to those the user already liked
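
A minimal user-based CF sketch in NumPy, using cosine similarity over a dense rating matrix; the matrix, neighborhood size, and function name are illustrative assumptions.

```python
import numpy as np

def user_based_cf(ratings, user, k=2):
    """User-based CF on a dense user-item rating matrix (0 = unrated).
    Returns predicted scores for the target user's unrated items."""
    norms = np.linalg.norm(ratings, axis=1, keepdims=True) + 1e-9
    sims = (ratings / norms) @ (ratings[user] / norms[user])  # cosine to all users
    sims[user] = 0.0                                          # exclude self
    neighbors = np.argsort(sims)[-k:]                         # top-k similar users
    scores = sims[neighbors] @ ratings[neighbors]             # similarity-weighted vote
    scores[ratings[user] > 0] = -np.inf                       # mask already-rated items
    return scores

ratings = np.array([[5, 3, 0, 1],
                    [4, 0, 0, 1],
                    [1, 1, 0, 5],
                    [0, 0, 5, 4]], dtype=float)
print(user_based_cf(ratings, user=1))
```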

Authority Metrics

PageRank

PageRank is like voting: popular pages receive more votes.

Each page distributes its vote equally among all of the pages it links to, and each page collects votes from all of its inlink pages; in addition, a random surfer occasionally teleports to a random page, which is what the damping factor $$d$$ models.

$$Rank(p_k) = \frac{1-d}{|C|} + d \sum_{p_j \in InLinks(p_k)} \frac{Rank(p_j)}{|OutLinks(p_j)|}$$

where $$|C|$$ is the number of pages in the collection.
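
A minimal power-iteration sketch of this formula in Python; the graph representation and function name are illustrative choices, and dangling pages (no outlinks) are not handled.

```python
def pagerank(outlinks, d=0.85, iters=50):
    """Power-iteration PageRank for the formula above.

    outlinks: dict page -> list of pages it links to
    d:        damping factor; (1 - d) is the teleportation mass
    """
    pages = sorted(outlinks)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}     # teleportation share
        for p, outs in outlinks.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)  # distribute votes equally
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))
```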

Topic Sensitive PageRank

PageRank with topics.

  • during indexing, each page is assigned to a topic
  • for each topic, a topic-specific PageRank is computed in which teleportation is allowed only to pages within that topic
  • at query time, the query (or document) is given a topic distribution
  • the final TSPR score is a linear combination of the topic-specific PageRank scores, weighted by that topic distribution (sketched below)
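
A toy sketch of the final scoring step, assuming the topic-specific PageRank vectors were already computed offline with topic-biased teleportation; all names and numbers here are hypothetical.

```python
# Hypothetical precomputed topic-specific PageRank values PR_t(page).
topic_pagerank = {
    "sports":  {"page1": 0.4, "page2": 0.1},
    "finance": {"page1": 0.1, "page2": 0.5},
}
query_topics = {"sports": 0.7, "finance": 0.3}   # P(topic | query)

def tspr_score(page):
    # Final score: linear combination of topic-specific ranks,
    # weighted by the query's topic distribution.
    return sum(w * topic_pagerank[t][page] for t, w in query_topics.items())

print(tspr_score("page1"), tspr_score("page2"))
```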

Hyperlink-Induced Topic Search (HITS)

HITS computes scores only for a local graph rather than the entire web, which makes it useful for finding communities. It defines a root set and a base set: the root set is the query-specific document set, and the base set is the root set expanded with the documents it links to and the documents linking to it. The algorithm is as follows (sketched below):

  • each page in the base set has a hub score and an authority score
  • a page's hub score is computed from the authority scores of the pages it links to
  • a page's authority score is computed from the hub scores of the pages linking to it
  • normalize both scores and repeat until convergence
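
A minimal NumPy sketch of this iteration on a small base-set adjacency matrix; the matrix and iteration count are illustrative assumptions.

```python
import numpy as np

def hits(adjacency, iters=50):
    """HITS on the base-set subgraph.

    adjacency: (n, n) matrix, A[i, j] = 1 if page i links to page j
    Returns (hub, authority) score vectors.
    """
    n = adjacency.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iters):
        auth = adjacency.T @ hub          # authority: sum of pointing hubs
        hub = adjacency @ auth            # hub: sum of pointed-to authorities
        auth /= np.linalg.norm(auth)      # normalize to prevent blow-up
        hub /= np.linalg.norm(hub)
    return hub, auth

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(hits(A))
```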

Neural Network Models

Representation-Based Models

Deep Structured Semantic Models (DSSM) [ref]

  1. hash both the document and the query into fixed-length vectors using letter-trigram histograms (word hashing)
  2. apply an MLP to both the document vector and the query vector
  3. compute the cosine similarity between the two vectors and normalize over candidate documents with a softmax (sketched below)
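
A rough NumPy sketch of the three steps, with random untrained weights and Python's built-in hash standing in for the paper's word-hashing scheme; it shows the data flow only, not the trained model.

```python
import numpy as np

def letter_trigram_hash(text, dim=5000):
    """Step 1 (word hashing): histogram of letter trigrams, folded
    into a fixed-length vector (illustrative hashing only)."""
    vec = np.zeros(dim)
    padded = f"#{text}#"
    for i in range(len(padded) - 2):
        vec[hash(padded[i:i + 3]) % dim] += 1
    return vec

def mlp(x, weights):
    """Step 2: a small MLP with tanh activations (random weights here)."""
    for w in weights:
        x = np.tanh(w @ x)
    return x

rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (300, 5000)), rng.normal(0, 0.1, (128, 300))]

q = mlp(letter_trigram_hash("deep learning"), weights)
d = mlp(letter_trigram_hash("neural networks tutorial"), weights)

# Step 3: cosine similarity (softmax over candidate documents omitted).
print(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
```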

Interaction-Based Models

Deep Relevance Matching Model [ref]

DRMM is an interaction-based neural IR model that identifies local matches between two pieces of text. The key ideas are the following:

  • initialize term embeddings with word2vec
  • measure the interaction (e.g. cosine similarity) between each term pair $$(q_i, d_j)$$
  • bin the interactions into a matching histogram per query term (a hard bin for exact matches, soft bins for the rest)
  • feed each histogram through an MLP and aggregate the per-term scores (sketched below)
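
A small NumPy sketch of the interaction-and-binning steps with random stand-in embeddings; the bin layout and names are illustrative assumptions, and the MLP/term-gating aggregation is omitted.

```python
import numpy as np

def matching_histogram(q_vec, d_vecs, n_soft=4):
    """Matching histogram for one query term: soft bins over [-1, 1)
    for partial matches plus one hard bin for exact matches."""
    cos = d_vecs @ q_vec / (
        np.linalg.norm(d_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    hist = np.zeros(n_soft + 1)
    for c in cos:
        if c >= 1.0 - 1e-6:
            hist[n_soft] += 1                    # hard bin: exact match
        else:
            idx = int((c + 1.0) / 2.0 * n_soft)  # which soft bin in [-1, 1)
            hist[min(idx, n_soft - 1)] += 1
    return np.log1p(hist)                        # log-count histogram

# Random stand-ins for word2vec embeddings: two query terms, ten doc terms.
rng = np.random.default_rng(0)
query = rng.normal(size=(2, 50))
doc = rng.normal(size=(10, 50))

# One histogram per query term; an MLP would score each histogram and a
# term-gating network would aggregate the per-term scores.
for q_vec in query:
    print(matching_histogram(q_vec, doc))
```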