# Embedding Model Selection and Vector Distance

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Embedding model selection determines how text, code, images, or other items are represented as vectors. The vector distance metric determines how the retrieval engine scores and ranks their similarity.

## Core Explanation

Good retrieval is not just "use embeddings." Engineers need to choose an embedding model that fits the domain and language, store vectors in an index whose dimensionality matches the model's output, pick the distance metric expected by the model or database, and test recall on known answer-bearing examples. When embeddings are normalized to unit length, cosine similarity, dot product, and Euclidean distance produce the same ranking, but whether a given model or pipeline actually emits unit-length vectors is a property to verify rather than assume.
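The recall test mentioned above can be sketched in a few lines. This is a minimal illustration, not a production harness: the function names (`cosine_similarity`, `top_k`, `recall_at_k`), the toy three-dimensional vectors, and the document ids are all hypothetical, and a real evaluation would use vectors produced by the candidate embedding model and a labeled query set from the target domain.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; rejects mismatched dimensionality."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus, k):
    """Return the ids of the k corpus items most similar to query_vec."""
    scored = sorted(
        corpus.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


def recall_at_k(queries, corpus, k):
    """Fraction of queries whose known answer-bearing id is in the top k."""
    hits = sum(
        1 for query_vec, expected_id in queries
        if expected_id in top_k(query_vec, corpus, k)
    )
    return hits / len(queries)


# Hypothetical toy data: in practice these come from the embedding model.
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
queries = [
    ([0.9, 0.1, 0.0], "doc_a"),
    ([0.1, 0.8, 0.1], "doc_b"),
    ([0.7, 0.0, 0.3], "doc_c"),  # the expected document only ranks second
]

print("recall@1:", recall_at_k(queries, corpus, k=1))
print("recall@2:", recall_at_k(queries, corpus, k=2))
```

Running the same labeled queries against two candidate models, or the same model under two metrics, gives a like-for-like comparison before committing to an index configuration.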

## Source-Mapped Facts

- OpenAI embeddings documentation says an embedding is a vector of floating point numbers and distance between two vectors measures their relatedness. ([source](https://platform.openai.com/docs/guides/embeddings))
- OpenAI embeddings documentation says cosine similarity and Euclidean distance produce identical rankings for OpenAI embeddings because the embeddings are normalized to length 1. ([source](https://platform.openai.com/docs/guides/embeddings))
- Qdrant search documentation lists Dot product, Euclidean distance, and Cosine as available vector similarity metrics. ([source](https://qdrant.tech/documentation/search/search/))
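The normalization fact above follows from a simple identity: for unit-length vectors, cosine similarity equals the dot product, and the squared Euclidean distance is `2 - 2 * dot(a, b)`, so a larger dot product always means a smaller distance. The sketch below checks this on toy data; the helper names and the two-dimensional vectors are illustrative assumptions, not output from any real model.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rankings(query, docs):
    """Return document indices ordered best-first under each metric."""
    idx = range(len(docs))
    return {
        # Higher similarity is better, so sort on the negated score.
        "cosine": sorted(idx, key=lambda i: -cosine(query, docs[i])),
        "dot": sorted(idx, key=lambda i: -dot(query, docs[i])),
        # Lower distance is better.
        "euclidean": sorted(idx, key=lambda i: euclidean(query, docs[i])),
    }


# Hypothetical unnormalized vectors; index 0 is deliberately long.
raw_docs = [[3.0, 0.5], [0.5, 0.5], [0.2, 1.0]]
raw_query = [1.0, 1.0]

raw = rankings(raw_query, raw_docs)
unit = rankings(normalize(raw_query), [normalize(d) for d in raw_docs])

print("raw: ", raw)   # dot product favors the long vector
print("unit:", unit)  # all three metrics agree after normalization
```

On the raw vectors the dot product ranks the longest vector first while cosine does not; after normalization the three orderings coincide. Running a check like this on a sample of stored vectors is one way to confirm that a pipeline really emits unit-length embeddings before relying on metric equivalence.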

## Further Reading

- [OpenAI vector embeddings](https://platform.openai.com/docs/guides/embeddings)
- [Qdrant search metrics](https://qdrant.tech/documentation/search/search/)
- [pgvector](https://github.com/pgvector/pgvector)