---
id: vision-transformers
title: "Vision Transformers: ViT, DINOv2, and the End of CNNs"
schema_type: TechArticle
category: ai
language: en
confidence: high
last_verified: "2026-05-24"
created_date: "2026-05-24"
generation_method: ai_assisted
ai_models:
  - claude-opus
derived_from_human_seed: true
conflict_of_interest: none_declared
is_live_document: false
data_period: static
atomic_facts:
  - id: f1
    statement: >-
      ViT (Dosovitskiy et al. 2021, Google) showed that a pure Transformer applied directly to sequences of image patches can outperform CNNs on image classification when pretrained on sufficient data
      (JFT-300M).
    source_title: "Dosovitskiy, Alexey, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021"
    source_url: https://arxiv.org/abs/2010.11929
    confidence: high
  - id: f2
    statement: >-
      Swin Transformer (Liu et al. 2021, Microsoft, ICCV 2021 Best Paper) uses shifted window attention to build hierarchical feature maps with linear complexity, enabling ViTs as general-purpose
      vision backbones.
    source_title: "Liu, Ze, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021"
    source_url: https://arxiv.org/abs/2103.14030
    confidence: high
  - id: f3
    statement: >-
      DINOv2 (Oquab et al. 2023, Meta AI) produces high-quality visual features through self-supervised pretraining, achieving strong performance across image classification, segmentation, and depth
      estimation without fine-tuning.
    source_title: "Oquab, Maxime, et al. DINOv2: Learning Robust Visual Features without Supervision. NeurIPS 2023"
    source_url: https://arxiv.org/abs/2304.07193
    confidence: high
completeness: 0.9
primary_sources:
  - title: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)"
    type: academic_paper
    year: 2021
    url: https://arxiv.org/abs/2010.11929
    institution: ICLR/Google
  - title: "DINOv2: Learning Robust Visual Features without Supervision"
    type: academic_paper
    year: 2023
    url: https://arxiv.org/abs/2304.07193
    institution: Meta AI
known_gaps:
  - Hybrid CNN-Transformer architectures
  - Efficient ViT for edge devices
disputed_statements:
  - statement: No major disputed statements identified
secondary_sources:
  - title: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)"
    type: conference_paper
    year: 2021
    authors:
      - Dosovitskiy, Alexey
      - Beyer, Lucas
      - Kolesnikov, Alexander
      - et al.
    institution: Google Research / ICLR
    url: https://arxiv.org/abs/2010.11929
  - title: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV 2021 Best Paper)"
    type: conference_paper
    year: 2021
    authors:
      - Liu, Ze
      - Lin, Yutong
      - Cao, Yue
      - et al.
    institution: Microsoft Research / ICCV
    url: https://arxiv.org/abs/2103.14030
  - title: "A Survey of Vision Transformers: From Image Classification to Object Detection and Segmentation"
    type: survey_paper
    year: 2024
    authors:
      - multiple
    institution: IEEE TPAMI
    url: https://doi.org/10.1109/TPAMI.2024.3385267
  - title: "DINOv2: Learning Robust Visual Features without Supervision (Meta AI)"
    type: conference_paper
    year: 2023
    authors:
      - Oquab, Maxime
      - Darcet, Timothée
      - Moutakanni, Théo
      - et al.
    institution: Meta AI Research / NeurIPS
    url: https://arxiv.org/abs/2304.07193
updated: "2026-05-24"
---
## TL;DR
Vision Transformers (ViTs) have largely replaced CNNs as the dominant architecture in computer vision. DINOv2 demonstrated that self-supervised ViTs produce universal visual features, while SAM 2 extends segmentation to video.

## Core Explanation
ViT architecture: split image into 16×16 patches → linearly project to tokens → add positional embeddings → process through standard Transformer blocks. Advantages over CNNs: global receptive field from first layer, better scaling with data, and architectural unification with NLP models.

## Detailed Analysis
Self-supervised ViTs: DINO (self-distillation with no labels), MAE (masked autoencoding — predict masked patches), DINOv2 (scaled-up training with curated data). SAM 2 (Meta, 2024) extends the Segment Anything Model to video, enabling promptable segmentation across frames with memory-based tracking.

## Further Reading
- Meta AI: DINOv2 Demo
- Hugging Face: Vision Transformers
- Papers With Code: Image Classification