---
id: "kb-2026-00275"
title: "Computer Vision"
schema_type: "TechArticle"
category: "ai"
language: "en"
confidence: "high"
last_verified: "2026-05-22"
generation_method: "human_only"
derived_from_human_seed: true
primary_sources:
  - title: "Computer Vision: Algorithms and Applications (2nd Edition)"
    authors: ["Szeliski, Richard"]
    type: "book"
    year: 2022
    url: "https://szeliski.org/Book/"
    institution: "Springer"
    note: "Comprehensive CV textbook covering image formation, recognition, 3D reconstruction, and deep learning"
secondary_sources:
  - title: "ImageNet Large Scale Visual Recognition Challenge"
    authors: ["Russakovsky, Olga", "Deng, Jia", "Su, Hao", "et al."]
    type: "academic_paper"
    year: 2015
    doi: "10.1007/s11263-015-0816-y"
    url: "https://arxiv.org/abs/1409.0575"
    institution: "IJCV"
    note: "The ImageNet paper — benchmark that drove modern CV progress. 50,000+ citations."
completeness: 0.88
ai_citations:
  last_citation_check: "2026-05-22"
---

## TL;DR

Computer vision enables machines to extract meaning from visual data. Key tasks: image classification (what is in the image?), object detection (what is it and where, expressed as bounding boxes), segmentation (pixel-level labeling), pose estimation, depth estimation, and 3D reconstruction. Deep learning has dominated the field since 2012, first with CNNs and more recently with Vision Transformers (ViT).
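To make the difference between these task outputs concrete, here is a minimal sketch on a toy image. The mask values and the helper names `mask_to_box` and `image_label` are made up for illustration; real models predict these outputs from pixels rather than deriving one from another.

```python
from typing import List, Tuple

# Toy segmentation output: one class id per pixel (0 = background, 1 = object).
# The values are invented for illustration only.
mask: List[List[int]] = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def mask_to_box(mask: List[List[int]], class_id: int = 1) -> Tuple[int, int, int, int]:
    """Return the tightest (x_min, y_min, x_max, y_max) box around class_id pixels.

    Coordinates are inclusive pixel indices. This is the detection-style
    summary of a pixel-level mask.
    """
    coords = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value == class_id
    ]
    if not coords:
        raise ValueError(f"no pixels with class_id={class_id}")
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def image_label(mask: List[List[int]], class_id: int = 1) -> bool:
    """Classification-style output: a single answer for the whole image."""
    return any(value == class_id for row in mask for value in row)


box = mask_to_box(mask)      # detection: where is the object?
present = image_label(mask)  # classification: is the object in the image?
print("segmentation mask:", len(mask), "x", len(mask[0]), "pixel labels")
print("bounding box (x_min, y_min, x_max, y_max):", box)
print("object present:", present)
```

The three outputs carry decreasing spatial detail: the mask labels every pixel, the box keeps only four coordinates, and the image-level label keeps none.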

## Core Explanation

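- Object detection: the R-CNN family (two-stage, built on region proposals), YOLO (single-shot, fast enough for real-time use), and DETR (Transformer-based).
- Segmentation: U-Net (originally for biomedical images), Mask R-CNN (instance segmentation), and SAM (Segment Anything Model, Meta, 2023).
- Vision Transformer (ViT, 2020): splits an image into fixed-size patches and feeds them to a Transformer as a sequence of tokens; competitive with CNNs.
- Multimodal: CLIP (OpenAI, 2021) learns joint image-text embeddings, so images and captions can be compared in the same vector space.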
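The patch step behind ViT can be sketched with NumPy. This is an illustration under stated assumptions, not the reference implementation: the helper name `patchify`, the 224x224 input, the 16x16 patch size, the 64-dimensional embedding, and the random projection matrix (a stand-in for a learned linear layer) are all choices made for this example.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # (grid_h, p, grid_w, p, C) -> (grid_h, grid_w, p, p, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)  # random stand-in for an RGB image

tokens = patchify(image, patch_size=16)  # one row per patch

# Hypothetical embedding size; a trained model would learn this projection.
embed_dim = 64
projection = rng.standard_normal((tokens.shape[1], embed_dim)).astype(np.float32)
embeddings = tokens @ projection  # the token sequence a Transformer would consume

print(tokens.shape)      # (num_patches, values per patch)
print(embeddings.shape)  # (num_patches, embed_dim)
```

A full ViT would also add position embeddings and a classification token before the Transformer layers; those are omitted here to keep the focus on how an image becomes a sequence.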

## Further Reading

- [Computer Vision: Algorithms and Applications (2nd Ed, Szeliski)](https://szeliski.org/Book/)
