## TL;DR
Modern NLP pipelines transform raw text to vector representations suitable for neural processing. Tokenization, embeddings, and decoding strategies are the critical infrastructure between human language and machine understanding.
## Core Explanation
Tokenization: splitting text into model-digestible units (subwords, words, characters). BPE trains a subword vocabulary from corpus statistics. WordPiece (BERT) uses likelihood-based merging. SentencePiece treats input as raw byte sequences, language-agnostic.
## Detailed Analysis
Embeddings: Word2Vec (static), GloVe (global co-occurrence), contextual (BERT, GPT — same word has different vectors depending on context). Decoding: greedy (argmax each step), beam search (maintains k hypotheses), sampling (temperature + top-k/p for diversity).
## Further Reading
- Hugging Face Tokenizers Library
- Jay Alammar: The Illustrated Word2Vec
- NLP Course (Hugging Face)