The numbers that turn meaning into geometry. What embeddings are, how to pick an embedding model, how vector databases store and search them, and how to chunk text so the vectors actually capture what a passage is about.
An embedding is the trick that lets a computer treat "meaning" as something it can measure. You hand a piece of text to an embedding model, and it hands back a list of numbers, usually a few hundred to a few thousand of them. That list is the embedding. On its own a single embedding tells you nothing. The magic shows up when you compare two of them.
The cleanest way to picture an embedding is as a point on a map. Imagine a giant map where every possible sentence has a location. The embedding model decides where to drop each pin. Its one job is to place texts with similar meaning close together and texts with unrelated meaning far apart.
On that map, "the cat sat on the mat" lands near "a feline rested on the rug" even though they share almost no words. "king" lands near "queen" and "monarch." "I need to reset my password" lands near "how do I recover my account" but far from "what time does the store close." The model learned these positions from huge amounts of text, so it captures meaning rather than spelling.
The map is not two-dimensional, of course. It has as many dimensions as the model outputs. You cannot picture 1,024 dimensions, and you do not need to. The intuition holds: nearby means similar.
To turn "near" into a number, you compare two embeddings with cosine similarity. It returns a score between -1 and 1, where 1 means the two vectors point in the same direction (very similar), 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions. In practice, text embeddings mostly score between roughly 0 and 1.
The computation is a dot product divided by the two vectors' lengths. You almost never write it by hand. In NumPy it is one line:
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths (L2 norms).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
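As a quick sanity check, the function can be run on a few hand-made toy vectors. These are not real embeddings. The values are assumptions chosen only to illustrate the three reference cases: same direction, orthogonal, and opposite.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
same_direction = cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))

print(round(float(same_direction), 6))  # about 1: same direction, different length
print(round(float(orthogonal), 6))      # 0: nothing in common
print(round(float(opposite), 6))        # -1: pointing opposite ways
```

Note that the first pair differs in length but not in direction, so the score is still about 1. Cosine similarity ignores magnitude.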
If your embedding model returns unit-length vectors (many do, or can be configured to), both norms are 1, so the denominator is 1 and cosine similarity is just the dot product. That is one reason vector databases optimize so hard for fast dot products.
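A minimal sketch of the dot-product shortcut, using made-up toy vectors rather than real model output. The helpers `normalize` and `top_k` are hypothetical names introduced here for illustration, and the brute-force scan over every row only stands in for the indexed search a real vector database would use.

```python
import numpy as np

def normalize(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

def top_k(query, docs, k=2):
    """Return the indices and scores of the k rows of docs most similar to query.

    Assumes query and every row of docs are already unit length.
    """
    scores = docs @ query            # one dot product per document
    order = np.argsort(-scores)[:k]  # highest score first
    return order, scores[order]

# Toy 3-dimensional "embeddings", invented for illustration.
docs = normalize(np.array([
    [0.9, 0.1, 0.0],   # hypothetically: "I need to reset my password"
    [0.8, 0.3, 0.1],   # hypothetically: "how do I recover my account"
    [0.0, 0.2, 0.9],   # hypothetically: "what time does the store close"
]))
query = normalize(np.array([[1.0, 0.2, 0.0]]))[0]

indices, scores = top_k(query, docs, k=2)
print(indices, scores)  # the two account-related rows rank above the store-hours row
```

Normalizing once at write time means each query costs only a matrix-vector product, with no per-comparison division.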
Once you can score how related any two texts are, a stack of useful tools falls out:
An embedding does not store the original text and you cannot read it back out reliably. It is a lossy fingerprint of meaning, not a compression of the words. It also reflects whatever the model learned, including its blind spots: a model trained mostly on English will place non-English text less precisely, and a general model will blur fine distinctions in a specialized domain like law or medicine.
That is the whole core idea. Text in, vector out, compare by distance. Everything else in this cluster (picking a model, storing the vectors, and chunking the text before you embed it) is engineering on top of this one foundation.