---
id: nlp-advanced-techniques
title: "Advanced NLP: Tokenization, Embeddings, and Decoding"
schema_type: Article
category: ai
language: en
confidence: high
last_verified: "2026-05-24"
created_date: "2026-05-24"
generation_method: ai_assisted
ai_models:
  - claude-opus
derived_from_human_seed: true
conflict_of_interest: none_declared
is_live_document: false
data_period: static
atomic_facts:
  - id: f1
    statement: BERT (Devlin et al. 2019, Google) introduced bidirectional pretraining with masked language modeling, achieving SOTA on 11 NLP benchmarks and establishing the pretrain-finetune paradigm.
    source_title: "Devlin, Jacob, et al. BERT: Pre-training of Deep Bidirectional Transformers. NAACL 2019"
    source_url: https://arxiv.org/abs/1810.04805
    confidence: high
  - id: f2
    statement: >-
      GPT-3 (Brown et al. 2020, OpenAI) demonstrated that scaling language models to 175B parameters enables strong few-shot learning without gradient updates — simply providing a few examples in the
      prompt.
    source_title: Brown, Tom B., et al. Language Models are Few-Shot Learners. NeurIPS 2020
    source_url: https://arxiv.org/abs/2005.14165
    confidence: high
  - id: f3
    statement: >-
      T5 (Raffel et al. 2020, Google) unified all NLP tasks into a text-to-text format, enabling a single model to perform translation, summarization, QA, and classification through the same
      encoder-decoder Transformer.
    source_title: Raffel, Colin, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR 2020
    source_url: https://jmlr.org/papers/v21/20-074.html
    confidence: high
completeness: 0.9
primary_sources:
  - title: Neural Machine Translation of Rare Words with Subword Units (BPE)
    type: academic_paper
    year: 2016
    url: https://arxiv.org/abs/1508.07909
    institution: ACL
  - title: The Curious Case of Neural Text Degeneration
    type: academic_paper
    year: 2020
    url: https://arxiv.org/abs/1904.09751
    institution: ICLR
known_gaps:
  - SentencePiece and Unigram tokenization
  - Speculative decoding advances
disputed_statements:
  - statement: No major disputed statements identified
secondary_sources:
  - title: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
    type: conference_paper
    year: 2019
    authors:
      - Devlin, Jacob
      - Chang, Ming-Wei
      - Lee, Kenton
      - Toutanova, Kristina
    institution: Google AI / NAACL
    url: https://arxiv.org/abs/1810.04805
  - title: "Large Language Models: A Comprehensive Survey"
    type: survey_paper
    year: 2024
    authors:
      - Zhou, Ce
      - Li, Qian
      - Li, Chen
      - et al.
    institution: International Journal of Machine Learning & Cybernetics (Springer)
    url: https://doi.org/10.1007/s13042-024-02443-6
  - title: Attention Is All You Need (Transformer)
    type: conference_paper
    year: 2017
    authors:
      - Vaswani, Ashish
      - Shazeer, Noam
      - Parmar, Niki
      - et al.
    institution: Google Brain / NeurIPS
    url: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  - title: GPT-4 Technical Report
    type: technical_report
    year: 2024
    authors:
      - OpenAI
      - Achiam, Josh
      - Adler, Steven
      - et al.
    institution: OpenAI
    url: https://arxiv.org/abs/2303.08774
updated: "2026-05-24"
---
## TL;DR
Modern NLP pipelines transform raw text to vector representations suitable for neural processing. Tokenization, embeddings, and decoding strategies are the critical infrastructure between human language and machine understanding.

## Core Explanation
Tokenization: splitting text into model-digestible units (subwords, words, characters). BPE trains a subword vocabulary from corpus statistics. WordPiece (BERT) uses likelihood-based merging. SentencePiece treats input as raw byte sequences, language-agnostic.

## Detailed Analysis
Embeddings: Word2Vec (static), GloVe (global co-occurrence), contextual (BERT, GPT — same word has different vectors depending on context). Decoding: greedy (argmax each step), beam search (maintains k hypotheses), sampling (temperature + top-k/p for diversity).

## Further Reading
- Hugging Face Tokenizers Library
- Jay Alammar: The Illustrated Word2Vec
- NLP Course (Hugging Face)