CLIP: Contrastive Language-Image Pre-Training

Status: public · Confidence: medium (0.835) · Basis: verified_sources

## TL;DR

CLIP is a family of vision-language representation models built around contrastive image-text pretraining. It enables zero-shot image classification and image-text retrieval driven by natural-language labels and prompts rather than a fixed, task-specific classifier head.

## Core Claims

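CLIP learns a shared embedding space for images and text. During pretraining, an image encoder and a text encoder are trained jointly so that, within each batch, the embeddings of matching image-caption pairs have high cosine similarity and the embeddings of mismatched pairs have low cosine similarity.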
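
A minimal sketch of the contrastive objective described above, assuming the symmetric cross-entropy formulation from the CLIP paper. The function names and toy vectors are illustrative, not taken from any CLIP codebase, and the temperature is held fixed here for simplicity, whereas CLIP learns it during training.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss over one batch.

    Row k of image_embs is assumed to be paired with row k of text_embs.
    temperature=0.07 is an assumed fixed value for this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j] = scaled cosine similarity of image i and text j
    logits = [[dot(img, txt) / temperature for txt in texts] for img in images]

    # Each image should pick its own caption (rows) ...
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # ... and each caption should pick its own image (columns).
    loss_t2i = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batch: aligned pairs give a far lower loss than swapped pairs.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

Minimizing this loss pulls each matching pair together and pushes apart every other image-text combination in the batch, which is why larger batches supply more negatives per step.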

The zero-shot classification recipe is simple: encode the image, encode one text prompt per candidate class (for example, "a photo of a {label}"), and choose the class whose prompt embedding has the highest cosine similarity to the image embedding. No task-specific training is needed, which made CLIP influential as a reusable vision-language backbone.
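
The recipe above can be sketched as follows. This is an illustration under stated assumptions, not a CLIP implementation: the embeddings are hand-written stand-ins for what the image and text encoders would produce, and `zero_shot_classify`, the label set, and the temperature value are hypothetical choices made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, prompt_embs, labels, temperature=0.01):
    """Return the label whose prompt embedding is closest to the image.

    temperature is an assumed value; it only sharpens the probabilities
    and does not change which label wins.
    """
    sims = [cosine(image_emb, p) for p in prompt_embs]
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]

# Hypothetical embeddings. In practice, prompt_embs would come from the
# text encoder applied to `prompts`, and image_emb from the image encoder.
prompt_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.2, 0.8, 0.1]

predicted, probs = zero_shot_classify(image_emb, prompt_embs, labels)
```

Changing the set of classes only requires encoding new prompts; the image embedding and the comparison step stay the same.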

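ALIGN and LiT are useful neighboring references. ALIGN applies a similar dual-encoder contrastive objective to a much larger, noisier set of image and alt-text pairs; LiT keeps a pretrained image encoder frozen ("locked") and trains only the text encoder to align with it for zero-shot transfer.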

## Citation Boundaries

Use this article for stable CLIP-style contrastive pretraining concepts. Do not use it as evidence for current multimodal assistant capabilities, safety behavior, or live image-understanding leaderboard results.

## Further Reading

- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
- [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918)
- [LiT: Zero-Shot Transfer With Locked-image Text Tuning](https://arxiv.org/abs/2111.07991)