BERT (Bidirectional Encoder Representations from Transformers)

Status: public · Confidence: medium (0.815) · Basis: verified_sources

## TL;DR

BERT is an encoder-only Transformer language model designed for language understanding tasks. Its lasting citation value lies in the pretrain-then-fine-tune pattern for bidirectional text encoders: masked language modeling during pretraining, followed by task-specific fine-tuning.

## Core Claims

BERT differs from autoregressive language models, which predict each token from the preceding context only, because it trains a bidirectional encoder. During pretraining, masked language modeling hides a fraction of the input tokens and asks the model to recover them from both left and right context.
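The masking step can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: it assumes the corruption rates reported in the original BERT paper (select about 15% of positions; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged), and the names `mask_tokens`, `vocab`, `select_prob`, and `seed` are hypothetical. It also skips details such as subword tokenization and protecting special tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Corrupt a token list for masked language modeling.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions the model must predict, and None everywhere else.
    Rates (15% / 80-10-10) follow the original BERT paper; the function
    itself is an illustrative sketch.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected: no prediction target
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # most selected tokens are masked
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # some become a random token
        # otherwise the token is left unchanged but is still a target
    return corrupted, labels

if __name__ == "__main__":
    sentence = ["the", "cat", "sat", "on", "the", "mat"]
    toy_vocab = ["dog", "ran", "under", "rug"]  # hypothetical toy vocabulary
    print(mask_tokens(sentence, toy_vocab, select_prob=0.5, seed=3))
```

Because a prediction target can sit anywhere in the sequence, the encoder may attend to tokens on both sides of it, which is what makes the representation bidirectional.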

BERT also popularized a practical recipe: pretrain one general-purpose encoder, then fine-tune it for each downstream task with a small task-specific head. This made the same pretrained model useful for classification, natural language inference, and extractive question answering.
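To make "small task-specific heads" concrete, the sketch below shows two toy heads operating on encoder outputs that are assumed to be already computed. It is a hedged illustration in plain Python: `classification_head` applies a linear layer and softmax to a pooled `[CLS]`-style vector, and `span_head` scores each token as an answer start or end for extractive question answering. All names, shapes, and weights are hypothetical, and real fine-tuning also updates the encoder weights, which is not shown.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def classification_head(cls_vector, weights, bias):
    """Linear layer plus softmax over a pooled sentence vector.

    weights has one row per class; returns class probabilities.
    """
    logits = [dot(row, cls_vector) + b for row, b in zip(weights, bias)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def span_head(token_vectors, start_w, end_w):
    """Pick the (start, end) token span with the highest combined score.

    Each token gets a start score and an end score; only spans with
    start <= end are considered.
    """
    n = len(token_vectors)
    start_scores = [dot(v, start_w) for v in token_vectors]
    end_scores = [dot(v, end_w) for v in token_vectors]
    candidates = ((s, e) for s in range(n) for e in range(s, n))
    return max(candidates, key=lambda p: start_scores[p[0]] + end_scores[p[1]])

if __name__ == "__main__":
    # Hypothetical 2-dimensional encoder outputs, chosen only for illustration.
    probs = classification_head([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    print(probs)
    print(span_head([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]], [1.0, 0.0], [0.0, 1.0]))
```

The point of the design is that both heads consume the same kind of encoder output, so switching tasks means swapping a few parameters on top rather than retraining the encoder from scratch.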

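RoBERTa and ELECTRA mark useful boundaries around the original BERT recipe. RoBERTa showed that BERT's training setup could be improved by changing data, batch, and objective choices: more data and longer training, larger batches, dynamic masking, and dropping the next-sentence prediction objective. ELECTRA showed a different, discriminator-style pretraining objective for text encoders, in which the model classifies each token as original or replaced.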
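The contrast with masked language modeling shows up in the training targets. In a discriminator-style setup, the encoder receives a corrupted sequence and predicts one binary label per token. The sketch below builds those labels; it assumes the corrupted sequence has already been produced (in ELECTRA, by a small generator network, which is not modeled here), and the function name `replaced_token_labels` is hypothetical.

```python
def replaced_token_labels(original, corrupted):
    """Build per-token targets for a discriminator-style objective.

    Returns 1 where the corrupted token differs from the original and 0
    elsewhere. A position where the replacement happens to equal the
    original token is therefore labeled 0 (original).
    """
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]

if __name__ == "__main__":
    original = ["the", "chef", "cooked", "the", "meal"]
    corrupted = ["the", "chef", "ate", "the", "meal"]
    print(replaced_token_labels(original, corrupted))  # [0, 0, 1, 0, 0]
```

Every position carries a training signal here, whereas masked language modeling only produces a loss at the selected positions.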

## Citation Boundaries

Use this article for stable BERT-family encoder concepts. Do not use it for current benchmark leaderboards, current model popularity, or claims that BERT remains state of the art on modern generative-model tasks.

## Further Reading

- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)
- [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555)