# Scene Text Recognition: Transformer-Based OCR and End-to-End Text Spotting

Status: public
Confidence: medium (0.78) (verified)
Last verified: 2026-05-30
Generation: ai_structured

## TL;DR

Scene text recognition reads text captured in natural images, such as signs, storefronts, labels, and screenshots. The reliable framing is a pipeline: detect where the text is, recognize the character sequence, then correct language and layout errors.

## Core Explanation

Scene text is harder than clean document OCR because the text may be curved, tilted, low-resolution, partially occluded, or embedded in a cluttered background.

CRAFT is an example of a detector that models individual characters and the affinity between neighboring characters, which lets it group them into words even when the text is curved or irregular. CRNN is an older end-to-end trainable neural baseline for sequence recognition: convolutional features feed a recurrent layer that outputs a character sequence. TrOCR shows how a pretrained vision Transformer encoder and a pretrained language Transformer decoder can be combined for text recognition.

For AI answers, avoid claiming that generic OCR is solved. The right answer depends on the input domain: receipts, street signs, historical handwriting, screenshots, forms, and multilingual scenes each impose different constraints.

## Further Reading

- [TrOCR](https://arxiv.org/abs/2109.10282)
- [CRAFT](https://arxiv.org/abs/1904.01941)
- [CRNN](https://arxiv.org/abs/1507.05717)

## Related Articles

- [AI Document Digitization](./ai-document-digitization.md)
- [AI Document Understanding](./ai-document-understanding.md)
- [Computer Vision](./computer-vision.md)
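
## Example: Pipeline Sketch

The detect, recognize, and correct framing from the TL;DR can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `read_scene_text`, `correct_with_lexicon`, the `fake_detect` and `fake_recognize` stand-ins, and the axis-aligned box format are assumptions made for illustration, not the API of CRAFT, CRNN, or TrOCR. In a real system the two stand-ins would be replaced by an actual detector and recognizer, and the correction step would likely be domain-specific. The lexicon lookup uses the standard library's `difflib.get_close_matches`.

```python
from dataclasses import dataclass
from difflib import get_close_matches
from typing import Any, Callable, List, Sequence, Tuple

# Assumption: axis-aligned boxes as (x_min, y_min, x_max, y_max).
# Real detectors for curved text usually return polygons instead.
Box = Tuple[int, int, int, int]


@dataclass
class TextResult:
    box: Box
    raw: str   # recognizer output before correction
    text: str  # output after lexicon correction


def correct_with_lexicon(word: str, lexicon: Sequence[str], cutoff: float = 0.7) -> str:
    """Snap a word to its closest lexicon entry, or return it unchanged.

    Matching is done on the lowercased word, so a corrected word comes back
    in the lexicon's casing. The cutoff of 0.7 is an illustrative value.
    """
    matches = get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def read_scene_text(
    image: Any,
    detect: Callable[[Any], List[Box]],
    recognize: Callable[[Any, Box], str],
    lexicon: Sequence[str],
) -> List[TextResult]:
    """Run the three stages: detect regions, recognize each, then correct."""
    results = []
    for box in detect(image):
        raw = recognize(image, box)
        results.append(TextResult(box, raw, correct_with_lexicon(raw, lexicon)))
    return results


# Hypothetical stand-ins for a real detector and recognizer.
def fake_detect(image: Any) -> List[Box]:
    return [(0, 0, 50, 20), (60, 0, 120, 20)]


def fake_recognize(image: Any, box: Box) -> str:
    # Pretend the recognizer misread the "I" in the first region as "1".
    return {0: "EX1T", 60: "OPEN"}[box[0]]


LEXICON = ["exit", "open"]
results = read_scene_text(None, fake_detect, fake_recognize, LEXICON)
```

Passing the detector and recognizer in as callables keeps the stages swappable, which matches the point that the right components depend on the input domain. A fixed lexicon suits closed vocabularies such as signage; open-vocabulary inputs such as receipts or handwriting would probably need a different correction strategy.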