TLDR:

An embedding is a dense vector representation of data (text, images, audio, code) that captures semantic meaning in a continuous numerical space. Embeddings underpin much of modern AI: they enable semantic search, recommendations, and clustering, and they serve as inputs to retrieval pipelines around LLMs and to other downstream models.

How Embeddings Work

An embedding model (e.g., OpenAI's text-embedding-3 family, Cohere Embed, Voyage AI, or open-source models available through the sentence-transformers library) takes input data and produces a fixed-length vector, typically 384 to 3,072 dimensions. Semantically similar inputs produce vectors that are close together in the embedding space, where closeness is measured by cosine similarity or Euclidean distance. The model learns these relationships during training on large datasets.
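
A minimal sketch of the two closeness measures named above, in plain Python. The four-dimensional vectors are hand-made for illustration and do not come from any real embedding model; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy "embeddings", invented for this example.
cat = [0.9, 0.8, 0.1, 0.0]
kitten = [0.85, 0.75, 0.2, 0.05]
invoice = [0.0, 0.1, 0.9, 0.8]

print("cat vs kitten :", cosine_similarity(cat, kitten), euclidean_distance(cat, kitten))
print("cat vs invoice:", cosine_similarity(cat, invoice), euclidean_distance(cat, invoice))
```

With these toy values, the cat/kitten pair has a higher cosine similarity and a smaller Euclidean distance than the cat/invoice pair, which is the behavior an embedding model is trained to produce for related inputs.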

Use Cases

Embeddings power many production AI applications: semantic search (find documents by meaning rather than keywords), recommendation systems (find items similar to user preferences), clustering and classification (group similar items), deduplication (find near-duplicate content), and retrieval for RAG pipelines and other downstream ML models. They are foundational infrastructure: many production AI systems use embeddings somewhere in the stack.
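
As a rough illustration of semantic search, the sketch below ranks stored vectors by cosine similarity to a query vector and returns the closest matches. The document IDs, the three-dimensional vectors, and the `top_k` helper are all hypothetical; a production system would obtain vectors from an embedding model and usually query a vector database instead of scanning a dictionary.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the IDs of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical index: document ID -> toy embedding (invented values).
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "return-instructions": [0.8, 0.3, 0.1],
    "office-party": [0.0, 0.2, 0.9],
}

# Stand-in for the embedding of a query such as "how do I get my money back".
query = [0.85, 0.2, 0.05]

print(top_k(query, index, k=2))
```

The same nearest-neighbor lookup is the core of recommendation and deduplication as well; what changes is what the vectors represent and how the similarity threshold is chosen.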

Choosing an Embedding Model

Selection criteria include: semantic quality (measured on benchmarks like MTEB, the Massive Text Embedding Benchmark), dimensionality (higher dimensions can capture more nuance but cost more in storage and compute), domain specialization (general-purpose vs. legal/medical/code-specialized models), supported languages (multilingual capability), and pricing/licensing. Many production teams maintain multiple embedding models for different content types and re-embed their content periodically when models improve significantly. Vectors produced by different models are not comparable with each other, so switching models means re-embedding the whole corpus.
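
To make the dimensionality trade-off concrete, here is a back-of-the-envelope storage estimate. It assumes each vector component is stored as a 4-byte float32 and counts raw vector data only; index structures, metadata, and replication add more, and quantized storage would need less. The function name is illustrative.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming float32 by default."""
    return num_vectors * dimensions * bytes_per_value


for dims in (384, 1536, 3072):
    size_gb = raw_vector_bytes(1_000_000, dims) / 1e9  # decimal gigabytes
    print(f"{dims} dimensions, 1M vectors: {size_gb:.2f} GB")
```

Under these assumptions, one million 384-dimensional vectors take about 1.5 GB and one million 3,072-dimensional vectors about 12.3 GB, an eightfold difference that also carries over to similarity computation.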

Embeddings under data-protection law

Embeddings complicate the easy claim that “we anonymised the data”: vectors derived from personal text can remain personal data where individuals are identifiable through inversion, linkage or membership inference, and regulator thinking on AI increasingly assumes so. The operational consequences: treat vector databases as personal-data stores by default (lawful basis, retention, and deletion mechanics, including index rebuilds where required), scope data processing agreements (DPAs) with model providers to cover embedding generation, and remember that deleting source documents without deleting their vectors does not satisfy erasure obligations under KVKK or GDPR. RAG architectures inherit all of this: the index is a copy of your documents in another form, and access controls on it deserve the same review as those on the original repository.
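
The deletion point can be sketched as follows: an erasure routine that removes a source document and every vector derived from it in one step. `ToyVectorStore` and `erase_document` are hypothetical names for an in-memory illustration, not the API of any real vector database; a real store may additionally need an index rebuild or compaction before deleted vectors are physically gone.

```python
class ToyVectorStore:
    """In-memory stand-in for a vector database, keyed by (doc_id, chunk_no)."""

    def __init__(self):
        self._vectors = {}

    def add(self, doc_id, chunk_no, vector):
        self._vectors[(doc_id, chunk_no)] = vector

    def delete_document(self, doc_id):
        """Remove every chunk vector derived from doc_id; return how many."""
        keys = [key for key in self._vectors if key[0] == doc_id]
        for key in keys:
            del self._vectors[key]
        return len(keys)

    def count(self, doc_id):
        return sum(1 for key in self._vectors if key[0] == doc_id)


def erase_document(doc_id, documents, store):
    """Delete the source document and its vectors together."""
    documents.pop(doc_id, None)
    return store.delete_document(doc_id)


# Hypothetical data: two documents, one of them split into two chunks.
documents = {"cv-123": "source text", "memo-7": "other text"}
store = ToyVectorStore()
store.add("cv-123", 0, [0.1, 0.2])
store.add("cv-123", 1, [0.3, 0.4])
store.add("memo-7", 0, [0.5, 0.6])

removed = erase_document("cv-123", documents, store)
print(removed, store.count("cv-123"), store.count("memo-7"))
```

Keying vectors by source document ID is the design choice that makes this possible; without that link, finding every vector that belongs to an erased document becomes guesswork.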