## TL;DR
Vision Transformers (ViTs) have largely displaced CNNs as the default backbone for many computer vision tasks, particularly at large data and model scale. DINOv2 showed that self-supervised ViTs can produce general-purpose visual features that transfer across tasks, while SAM 2 extends promptable segmentation from images to video.
## Core Explanation
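A ViT splits the image into fixed-size patches (16×16 pixels in the standard configuration), linearly projects each flattened patch to a token, adds positional embeddings, and processes the resulting sequence with standard Transformer blocks. Compared with CNNs, this gives a global receptive field from the first layer (self-attention lets every patch attend to every other patch), better scaling with data, and a shared architecture with NLP models.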
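The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the embedding width of 64, the random weights, and the helper names `patchify`, `embed_patches`, and `self_attention` are all assumptions made for the example, and a real ViT would also add a class token, multiple heads, layer norm, and MLP blocks.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, w_proj, pos_embed, patch_size=16):
    """Linearly project each patch to a token and add positional embeddings."""
    tokens = patchify(image, patch_size) @ w_proj  # (num_patches, embed_dim)
    return tokens + pos_embed

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention; returns the outputs and the attention weights."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Illustrative sizes and random weights (assumptions, not trained parameters).
rng = np.random.default_rng(0)
patch_size, embed_dim = 16, 64
image = rng.random((224, 224, 3))
num_patches = (224 // patch_size) ** 2
w_proj = rng.normal(scale=0.02, size=(patch_size * patch_size * 3, embed_dim))
pos_embed = rng.normal(scale=0.02, size=(num_patches, embed_dim))

tokens = embed_patches(image, w_proj, pos_embed, patch_size)
w_q, w_k, w_v = (rng.normal(scale=0.02, size=(embed_dim, embed_dim)) for _ in range(3))
out, attn = self_attention(tokens, w_q, w_k, w_v)

print(tokens.shape)  # one token per patch
print(attn.shape)    # every patch has a weight for every other patch
```

The attention matrix has one row and one column per patch, which is the global receptive field in concrete form: even in the first layer, each token's output is a weighted mix of all patches rather than of a local neighbourhood.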
## Detailed Analysis
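Three self-supervised approaches to training ViTs stand out: DINO (self-distillation with no labels, in which a student network learns to match the output of a momentum-averaged teacher), MAE (masked autoencoding, in which the model reconstructs masked patches from the visible ones), and DINOv2 (scaled-up self-supervised training on curated data). SAM 2 (Meta, 2024) extends the Segment Anything Model to video, enabling promptable segmentation across frames with memory-based tracking: a prompt given on one frame is propagated to the others using a memory of previously processed frames.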
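To make the two training signals more concrete, here is a rough NumPy sketch of MAE-style random masking and a DINO-style self-distillation loss. The function names, the 75% mask ratio, the temperatures, and the momentum value are illustrative assumptions chosen for the example; the logits are random stand-ins for network outputs, and the sketch omits the networks themselves, multi-crop augmentation, and the running update of the teacher center.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """MAE-style masking: keep a random subset of patch tokens.

    Returns the visible tokens and a boolean mask (True = masked patch).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask

def log_softmax(x, temp):
    """Numerically stable log-softmax with a temperature."""
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a centered, sharpened teacher and the student."""
    teacher_probs = np.exp(log_softmax(teacher_logits - center, teacher_temp))
    student_log_probs = log_softmax(student_logits, student_temp)
    return -(teacher_probs * student_log_probs).sum(axis=-1).mean()

def ema_update(teacher_param, student_param, momentum=0.996):
    """Teacher weights track the student as an exponential moving average."""
    return momentum * teacher_param + (1 - momentum) * student_param

rng = np.random.default_rng(0)

# MAE: the encoder would only see the visible tokens.
patch_tokens = rng.random((196, 64))
visible, mask = random_masking(patch_tokens, mask_ratio=0.75, rng=rng)
print(visible.shape, int(mask.sum()))

# DINO: random logits stand in for student and teacher outputs on two views.
student_logits = rng.normal(size=(8, 32))
teacher_logits = rng.normal(size=(8, 32))
center = teacher_logits.mean(axis=0, keepdims=True)
loss = dino_loss(student_logits, teacher_logits, center)
print(float(loss))
```

In MAE the reconstruction loss is computed on the masked positions, so `mask` is what selects the prediction targets. In DINO no gradient flows through the teacher; it changes only through `ema_update`, which is why the two networks do not collapse into trivially copying each other step by step.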
## Further Reading
- Meta AI: DINOv2 Demo
- Hugging Face: Vision Transformers
- Papers With Code: Image Classification