Human Pose Estimation: 2D/3D Keypoint Detection and Transformer-Based Body Tracking

## TL;DR
Human pose estimation, the task of detecting body keypoints (shoulders, elbows, knees) in images and video, is a foundational computer vision task powering applications from fitness tracking and motion capture to gesture control and sports analytics. Plain vision transformers such as ViTPose now match or exceed specialized CNN designs in accuracy, while lightweight models such as RTMPose bring real-time multi-person estimation to mobile devices.

## Core Explanation
Problem: given an image containing K people, output 2D or 3D coordinates for N body joints per person (N = 17 in COCO: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles).

Approaches:

1. Top-down: detect person bounding boxes first, then estimate the pose within each box (HRNet, ViTPose). Accurate, but the pose network runs once per detected person, so cost grows with crowd size.
2. Bottom-up: detect all keypoints in the image, then group them into individuals via part affinity fields (OpenPose) or associative embedding (HigherHRNet). Runtime is largely independent of the number of people, which suits crowded scenes.
3. Single-stage: directly predict person instances together with their keypoints (PETR, ED-Pose).

Output representation: heatmap-based (one map per joint, trained against a 2D Gaussian centered at the keypoint) or regression-based (direct coordinate prediction).
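To make the heatmap representation concrete, here is a minimal sketch of encoding one keypoint as a 2D Gaussian target and decoding it back with an argmax. It uses plain Python lists to stay dependency-free. The function names, the 64x48 grid size, and `sigma=2.0` are illustrative assumptions, not taken from any particular library or paper.

```python
import math

def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Return a height x width grid holding a 2D Gaussian centered at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def decode_heatmap(heatmap):
    """Return (x, y, score) of the highest-scoring cell (integer argmax)."""
    best_x, best_y, best = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best:
                best_x, best_y, best = x, y, value
    return best_x, best_y, best

# Hypothetical example: one keypoint at (20, 12) on a 64x48 heatmap.
hm = gaussian_heatmap(64, 48, cx=20, cy=12)
x, y, score = decode_heatmap(hm)

# A sub-pixel ground truth is snapped to the nearest grid cell on decoding.
qx, qy, _ = decode_heatmap(gaussian_heatmap(64, 48, cx=20.4, cy=12.0))
```

The second call illustrates one trade-off between the two output types: a plain argmax over a heatmap is limited to grid resolution, whereas a regression head predicts continuous coordinates directly.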

## Detailed Analysis
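- ViTPose (NeurIPS 2022): the key insight is simplicity. A plain ViT encoder is followed by a lightweight decoder (deconvolution layers that produce heatmaps), with no task-specific architectural design. Accuracy scales with model size (ViT-B: 75.8 AP; ViT-H: 79.1 AP on COCO), matching specialized architectures.
- MotionBERT (ICCV 2023): pretrains on large-scale motion capture data (AMASS) with a masked motion modeling objective: given noisy, partially masked 2D skeleton sequences, the model recovers the underlying 3D motion. The pretrained representation transfers to 3D pose estimation, skeleton-based action recognition, and mesh recovery.
- Real-time: RTMPose (2023) combines a CSPNeXt backbone with a SimCC head, which predicts each coordinate by classification over bins along the x and y axes rather than through a 2D heatmap. The paper reports 75.8 AP on COCO at 90+ FPS on a desktop CPU (RTMPose-m) and 72.2 AP at 70+ FPS on a Snapdragon 865 mobile chip (RTMPose-s).
- Whole-body pose: DWPose (ICCV 2023) extends estimation to 133 keypoints (body, feet, face, and hands) and applies knowledge distillation to RTMPose-based models.
- Applications: fitness (form correction), sports analytics (athlete biomechanics), AR/VR (full-body avatars), healthcare (gait analysis), and autonomous driving (inferring pedestrian intention from pose).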
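The SimCC head mentioned above for RTMPose treats localization as two 1D classification problems, one per axis. The following is a minimal sketch of the decoding step only, under stated assumptions: `decode_simcc` and `split_ratio` (bins per pixel) are hypothetical names, the scores are hand-made rather than network outputs, and this is not RTMPose's actual implementation.

```python
def decode_simcc(x_scores, y_scores, split_ratio=2.0):
    """Decode one keypoint from per-axis classification scores.

    x_scores, y_scores: one score per bin along the x and y axes.
    split_ratio: bins per pixel (assumed name); values above 1 give
    sub-pixel resolution.
    """
    # Pick the highest-scoring bin independently on each axis.
    x_bin = max(range(len(x_scores)), key=lambda i: x_scores[i])
    y_bin = max(range(len(y_scores)), key=lambda i: y_scores[i])
    # Convert bin indices back to pixel coordinates.
    return x_bin / split_ratio, y_bin / split_ratio

# Hypothetical example: a 64x48 pixel input with 2 bins per pixel
# gives 128 x-bins and 96 y-bins.
x_scores = [0.0] * 128
y_scores = [0.0] * 96
x_scores[41] = 0.9   # peak at x-bin 41
y_scores[24] = 0.8   # peak at y-bin 24

kx, ky = decode_simcc(x_scores, y_scores)
```

With two bins per pixel, the decoded x coordinate can land on a half-pixel position, which a plain argmax over a same-resolution 2D heatmap cannot represent. The output also costs two 1D vectors per joint instead of a full 2D map.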

## Further Reading
- MMPose: Open-Source Pose Estimation Toolbox (OpenMMLab)
- COCO Keypoint Dataset (200K images, 250K persons)
- OpenPose: Real-Time Multi-Person 2D Pose Detection