# Low-Resource NLP

Status: public · Confidence: medium (0.855) · Basis: verified_sources

## TL;DR

Low-resource NLP studies language technology for languages that lack labeled data, parallel text, digital corpora, or tool support. The recurring technical pattern is transfer: multilingual pretraining, multilingual translation, and multilingual speech models let systems reuse what they learn from high-resource languages, while community-created data helps cover what transfer alone cannot.

## Core Claims

XLM-R is a multilingual masked language model designed for cross-lingual transfer. It is a Transformer pretrained on filtered CommonCrawl text in 100 languages, and it shows that a single model can learn shared representations across languages and transfer them to downstream tasks.
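One mechanism behind this kind of pretraining is how training text is sampled across languages. The XLM-R paper describes exponentially smoothed sampling, where a language with corpus share p_i is drawn with probability proportional to p_i raised to a power alpha below 1. The following is a minimal sketch of that idea, not the authors' code; the function name, the corpus sizes, and the default `alpha=0.3` are illustrative assumptions.

```python
def smoothed_sampling_probs(sentence_counts, alpha=0.3):
    """Return per-language sampling probabilities q_i proportional to p_i ** alpha.

    sentence_counts maps a language code to its corpus size.
    alpha is an assumed smoothing exponent; alpha=1.0 keeps the raw shares.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    total = sum(sentence_counts.values())
    # Empirical share of each language in the raw corpus.
    raw = {lang: n / total for lang, n in sentence_counts.items()}
    # An exponent below 1 flattens the distribution, which raises the
    # relative weight of small languages.
    weights = {lang: p ** alpha for lang, p in raw.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical corpus sizes in sentences, invented for illustration only.
counts = {"en": 1_000_000, "sw": 10_000, "yo": 1_000}
probs = smoothed_sampling_probs(counts, alpha=0.3)
for lang in counts:
    raw_share = counts[lang] / sum(counts.values())
    print(f"{lang}: raw={raw_share:.4f} sampled={probs[lang]:.4f}")
```

Lower values of `alpha` show the model more low-resource text per step, at the cost of repeating small corpora more often.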

No Language Left Behind (NLLB) focuses on multilingual machine translation across about 200 languages, many of them low-resource. Treat it as a core reference for large-scale multilingual translation, not as a guarantee of quality for every language pair.
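The gap between broad coverage and per-pair quality is easy to hide behind an average. The sketch below, under stated assumptions, reports a mean score alongside the language pairs that fall below a chosen threshold. The pair labels, scores, and threshold are hypothetical placeholders, not results from NLLB or any other system, and `summarize_pairs` is an illustrative helper name.

```python
from statistics import mean


def summarize_pairs(scores, threshold):
    """Return the mean score and the pairs scoring below threshold, worst first."""
    if not scores:
        raise ValueError("scores must not be empty")
    average = mean(scores.values())
    weak = sorted(
        (pair for pair, score in scores.items() if score < threshold),
        key=lambda pair: scores[pair],
    )
    return average, weak


# Hypothetical quality scores on a 0-100 scale, invented for illustration.
scores = {
    ("src", "tgt_high_1"): 62.0,
    ("src", "tgt_high_2"): 58.0,
    ("src", "tgt_low_1"): 31.0,
    ("src", "tgt_low_2"): 24.0,
}
average, weak = summarize_pairs(scores, threshold=40.0)
print(f"mean score: {average:.1f}")
for pair in weak:
    print(f"below threshold: {pair[0]} -> {pair[1]} ({scores[pair]:.1f})")
```

Reporting the weak pairs next to the mean keeps a strong aggregate from standing in for languages where the system still performs poorly.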

Massively Multilingual Speech (MMS) extends the scaling pattern to speech: automatic speech recognition, spoken language identification, and speech synthesis, covering more than 1,000 languages.
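Recognition quality in multilingual speech work is commonly reported as an error rate over characters or words. As a minimal sketch, assuming plain Levenshtein distance over Unicode code points, character error rate can be computed as below. The function names and the example strings are illustrative and do not come from the MMS release.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences with unit edit costs."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by reference length, counted in code points."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative strings only; not taken from any dataset.
print(character_error_rate("habari ya asubuhi", "habari za asubui"))
```

For scripts that use combining marks, the result depends on Unicode normalization, so both strings should be normalized the same way before scoring.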

## Citation Boundaries

Use this article for stable low-resource NLP concepts. Do not use it to claim that a specific endangered or Indigenous language should be modeled without community consent, or that benchmark scores imply real-world usability for a language community.

## Further Reading

- [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116)
- [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672)
- [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516)