CLIP: Contrastive Language-Image Pre-Training
Status: public · Confidence: medium (0.835) · Basis: verified_sources
## TL;DR

CLIP is a vision-language representation model family built around contrastive image-text pretraining. It made image classification and retrieval possible through natural-language labels and prompts rather than a fixed task-specific classifier.

## Core Claims

CLIP learns a shared embedding space for images and text. Matching captions and images are trained to be close together, while mismatched pairs are trained to be farther apart.

The zero-shot classification recipe is simple: encode an image, encode text prompts that describe candidate classes, and choose the class text with the closest embedding. This made CLIP influential as a reusable vision-language backbone.

ALIGN and LiT are useful neighboring references. ALIGN shows a related large-scale noisy text supervision approach; LiT studies a locked-image tuning setup for zero-shot transfer.

## Citation Boundaries

Use this article for stable CLIP-style contrastive pretraining concepts. Do not use it as evidence for current multimodal assistant capabilities, safety behavior, or live image-understanding leaderboard results.

## Further Reading

- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
- [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918)
- [LiT: Zero-Shot Transfer With Locked-image Text Tuning](https://arxiv.org/abs/2111.07991)
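
## Illustrative Sketch

The two ideas under Core Claims, pulling matched image-text pairs together and picking the class prompt closest to an image, can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the reference implementation. The embeddings are hand-written toy vectors standing in for real encoder outputs. The function names (`zero_shot_classify`, `symmetric_contrastive_loss`) are hypothetical. Closeness is taken to be cosine similarity, the loss is one common symmetric cross-entropy formulation, and the temperature of 0.07 is an illustrative value.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.07):
    """Return the prompt whose embedding is closest to the image embedding.

    prompt_embeddings maps each candidate prompt string to its text embedding.
    The temperature value is an assumption made for illustration.
    """
    labels = list(prompt_embeddings)
    sims = [cosine_similarity(image_embedding, prompt_embeddings[label])
            for label in labels]
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


def symmetric_contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row i of each list is assumed to be a matched image-text pair; every
    other combination in the batch is treated as a mismatch.
    """
    n = len(image_embeddings)
    logits = [[cosine_similarity(img, txt) / temperature for txt in text_embeddings]
              for img in image_embeddings]
    # Image-to-text: each image should select its own caption.
    loss_i2t = -sum(math.log(softmax(logits[i])[i]) for i in range(n)) / n
    # Text-to-image: each caption should select its own image (column-wise).
    loss_t2i = -sum(
        math.log(softmax([logits[j][i] for j in range(n)])[i]) for i in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2
```

With toy inputs, an image embedding of `[0.9, 0.1, 0.0]` scored against prompts such as "a photo of a dog" at `[1.0, 0.0, 0.0]` and "a photo of a cat" at `[0.0, 1.0, 0.0]` selects the dog prompt, and a batch whose pairs are correctly aligned yields a lower loss than the same batch with its captions swapped. In practice the embeddings would come from trained image and text encoders rather than being written by hand.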