0x504 Multitask Learning
Multitask Learning
Sampling Strategy
When training over a set of imbalanced datasets, there are a few sampling strategies (sketched in code after this list):
- equal mixing: the baseline; typically overfits low-resource tasks and underfits high-resource tasks
- examples-proportional sampling: sample in proportion to dataset size, capping each dataset's effective size at \(K\), i.e. \(r_m = \min(e_m, K) / \sum_n \min(e_n, K)\) where \(e_m\) is the number of examples in dataset \(m\); the T5 paper finds a sweet spot in \(K\) at which each task achieves its best performance
- temperature-based sampling: as adopted by Arivazhagan et al. (2019), sample dataset \(m\) with probability proportional to \(p_m^{1/T}\); an appropriate temperature (e.g. \(T=5\)) strikes a good balance between high-resource and low-resource tasks (the transfer vs. interference trade-off)
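A minimal sketch of the three strategies as sampling distributions over tasks, assuming NumPy; the task sizes, the cap \(K\), and the temperature here are made-up illustrative values:

```python
import numpy as np

# Hypothetical example counts for four imbalanced tasks.
sizes = np.array([1_000_000, 200_000, 10_000, 1_000], dtype=np.float64)

def equal_mixing(sizes):
    """Baseline: every task is sampled with the same probability."""
    return np.full(len(sizes), 1.0 / len(sizes))

def examples_proportional(sizes, K):
    """Proportional to dataset size, with each size capped at K (T5)."""
    capped = np.minimum(sizes, K)
    return capped / capped.sum()

def temperature_sampling(sizes, T):
    """Rates raised to 1/T: T=1 is proportional, T -> inf is uniform."""
    p = sizes / sizes.sum()
    p = p ** (1.0 / T)
    return p / p.sum()

print(equal_mixing(sizes))
print(examples_proportional(sizes, K=2**15))  # K is a tunable cap
print(temperature_sampling(sizes, T=5))       # T=5 as in Arivazhagan et al. 2019
```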
Transfer Learning
Continual Learning
From Hung-yi Lee's video
Regularization
- Elastic Weight Consolidation (EWC): https://arxiv.org/abs/1612.00796
- Synaptic Intelligence (SI): https://arxiv.org/abs/1703.04200
- Memory Aware Synapses (MAS): https://arxiv.org/abs/1711.09601
- RWalk: https://arxiv.org/abs/1801.10112
- Sliced Cramer Preservation (SCP): https://openreview.net/forum?id=BJge3TNKwH
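As a concrete instance of this regularization family, here is a minimal PyTorch sketch of the EWC penalty, assuming a diagonal Fisher estimated from squared gradients on the old task; `model`, `data_loader`, `loss_fn`, and the weight `lam` are placeholder names, and the scale of `lam` is illustrative:

```python
import torch

def estimate_fisher(model, data_loader, loss_fn):
    """Diagonal Fisher approximation: average squared gradients on the old task."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(data_loader) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam):
    """Quadratic penalty pulling each parameter toward its old-task value,
    weighted by how important it was (its Fisher value)."""
    penalty = torch.zeros(())
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return lam / 2 * penalty

# During new-task training (sketch):
#   loss = task_loss + ewc_penalty(model, fisher, old_params, lam=100.0)
# where old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# is snapshotted right after training on the old task.
```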
Gradient Episodic Memory
Memory-Replay
Neural Resource Allocation
Curriculum Learning
- reference gemax: 4.78 WER
- precompute token + wiz server: 5.05 WER
- export tokenizer + wiz server: 5.24 WER