## TL;DR
Scene text recognition reads text in the wild -- street signs, storefronts, license plates, and handwritten notes captured by smartphone cameras. Transformer-based architectures have moved OCR from fragile multi-stage pipelines toward robust end-to-end models that handle curved text, diverse fonts, and challenging lighting conditions.
## Core Explanation
Scene text vs. document OCR: document OCR (clean white background, standard fonts, high resolution) is largely solved by tools such as Tesseract and Google Vision. Scene text remains challenging because of complex backgrounds, varied fonts, sizes, and colors, perspective distortion, curved text, low resolution, occlusion, and uneven lighting.

Pipeline:

1. Text detection -- locate text regions in the image (DETR, CRAFT, DB).
2. Text recognition -- take a cropped text region and output a character sequence (CRNN, ASTER, TrOCR).
3. End-to-end -- joint detection and recognition (ABCNet, SwinTextSpotter).

Output: transcription plus spatial location for each text instance.
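The two-stage pipeline above (detect, crop, recognize) can be sketched in a few lines. This is a minimal illustration, not any library's API: `TextInstance`, `crop`, and `read_scene_text` are hypothetical names, the image is assumed to be a plain nested list of pixel rows, and boxes are assumed to be axis-aligned `(x_min, y_min, x_max, y_max)` tuples. The detector and recognizer are passed in as callables, so a real model such as CRAFT or TrOCR could be wrapped behind them.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumed conventions for this sketch (not from any specific library):
Box = Tuple[int, int, int, int]        # (x_min, y_min, x_max, y_max), max exclusive
Image = Sequence[Sequence[int]]        # rows of pixel values


@dataclass
class TextInstance:
    """One piece of scene text: where it is and what it says."""
    box: Box
    text: str


def crop(image: Image, box: Box) -> List[List[int]]:
    """Cut the axis-aligned region described by `box` out of `image`."""
    x0, y0, x1, y1 = box
    return [list(row[x0:x1]) for row in image[y0:y1]]


def read_scene_text(
    image: Image,
    detect: Callable[[Image], List[Box]],
    recognize: Callable[[List[List[int]]], str],
) -> List[TextInstance]:
    """Run detection, then recognition on each cropped region."""
    results = []
    for box in detect(image):
        region = crop(image, box)
        results.append(TextInstance(box=box, text=recognize(region)))
    return results
```

An end-to-end model replaces the separate `detect` and `recognize` callables with a single network, but the output has the same shape: a list of locations paired with transcriptions.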
## Detailed Analysis
TrOCR (2021) pairs two pretrained transformers. A ViT-style image encoder splits the input into 16x16 patches and processes them as tokens; a text decoder initialized from RoBERTa generates the transcription autoregressively. There is no CNN backbone and no CTC loss: alignment between image patches and output characters is handled by the decoder's standard cross-attention, with no dedicated alignment module. The design is simple and scalable. Pretraining uses synthetic data from a text rendering engine (TextRender) with 1000+ fonts.

MDPI 2025 end-to-end framework: a DETR-based text detector feeds a transformer recognizer, and the joint training signal improves both stages.

Benchmarks: ICDAR 2013 (focused scene text), ICDAR 2015 (incidental scene text), Total-Text (curved text), and COCO-Text. Reported state of the art is roughly ~95% F1 on ICDAR 2015 detection and ~98% word accuracy on recognition.

Remaining challenges: non-Latin scripts (Arabic cursive, Chinese, Devanagari) and handwritten historical documents with degraded ink and non-standard characters.
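To make the detection F1 and word accuracy figures quoted above concrete, here is a rough sketch of how such metrics are typically computed. It rests on simplifying assumptions: boxes are axis-aligned rather than the polygons used by official benchmark protocols, predictions are matched to ground truth greedily at an IoU threshold of 0.5, and word comparison is case-insensitive. The function names are hypothetical, and official evaluation scripts differ in detail.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_f1(pred: List[Box], gt: List[Box], thresh: float = 0.5) -> float:
    """F1 score with greedy one-to-one matching (a simplification)."""
    unmatched = list(range(len(gt)))
    true_pos = 0
    for p in pred:
        best_j, best_iou = -1, thresh
        for j in unmatched:
            value = iou(p, gt[j])
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j >= 0:
            unmatched.remove(best_j)  # each ground-truth box matches at most once
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


def word_accuracy(pred: List[str], gt: List[str]) -> float:
    """Fraction of words transcribed exactly (case-insensitive here)."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    if not gt:
        return 0.0
    correct = sum(p.lower() == g.lower() for p, g in zip(pred, gt))
    return correct / len(gt)
```

Word accuracy is all-or-nothing per word, so a single wrong character counts the whole word as an error. This is one reason recognition scores drop sharply on long or degraded text.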