Understanding VectorDB and Vector Embedding

Currently working as a Software Engineer, with strong knowledge of and experience with Python (Flask/Django, Selenium), JavaScript (Node.js, LoopBack, Node-RED), and backend design and development.

A vector database (vectorDB) is a specialized database that stores and manages unstructured data in the form of vectors.

Vectors are mathematical representations of data points in an n-dimensional space. Each vector is an array of numbers that captures the essential features of the data. Converting data into a meaningful vector form is called vector embedding.

Example:

Text: "Hello world"  
Vector: [0.1, 0.3, 0.5, 0.7]

In short, a vectorDB stores the vector embeddings of data, which allows for efficient similarity search and retrieval.

Multiple embedding models are available to convert data into vector form. Some popular ones include:

  • BERT

  • FastText

  • Sentence Transformers

  • OpenAI Embeddings

How is text converted into vector form to be stored in VectorDB?

Step 1: Tokenization

The text is first broken down into smaller units called tokens. For example, the sentence "hello, My name is vinayak" can be tokenized into ["hello", ",", "My", "name", "is", "vinayak"].
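
The split above can be reproduced with a minimal sketch, assuming a simple rule that separates words from punctuation. The `simple_tokenize` helper is hypothetical and only for illustration; real embedding models use trained subword tokenizers (such as WordPiece or BPE) that may break a single word into several pieces.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("hello, My name is vinayak")
print(tokens)  # ['hello', ',', 'My', 'name', 'is', 'vinayak']
```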

Step 2: Embedding

For each token, the embedding model looks up a vector representation, called the base vector.
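
As a rough illustration, the lookup can be pictured as a table from token to vector. The sketch below assumes a hypothetical `build_embedding_table` helper that fills the table with random numbers; a trained model learns these values, so the numbers here only show the shape of the data.

```python
import random

def build_embedding_table(vocab, dim=4, seed=0):
    # Hypothetical table: one random vector per token. A trained model
    # learns these values; random numbers only illustrate the structure.
    rng = random.Random(seed)
    return {token: [rng.uniform(-1, 1) for _ in range(dim)] for token in vocab}

tokens = ["hello", ",", "My", "name", "is", "vinayak"]
table = build_embedding_table(sorted(set(tokens)))

# One base vector per token, in sentence order.
token_vectors = [table[token] for token in tokens]
print(len(token_vectors), len(token_vectors[0]))  # 6 4
```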

Step 3: Contextualization

The model processes these vectors and captures context and relationships between words.
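
One common way models do this is self-attention. The following is a toy sketch, assuming a single attention step with no learned weights: each token's new vector becomes a weighted average of all token vectors, weighted by similarity. Real transformer models add learned query/key/value projections, multiple heads, and many layers, so treat this only as an intuition aid.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    # Each output vector is a weighted average of all input vectors,
    # weighted by scaled dot-product similarity to the current token.
    dim = len(vectors[0])
    contextual = []
    for query in vectors:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in vectors])
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
        contextual.append(mixed)
    return contextual

base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy base vectors for three tokens
contextual = self_attention(base)
print(contextual[0])  # no longer [1.0, 0.0]: it now mixes in the other tokens
```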

Step 4: Sentence-level Vector using Pooling

After Contextualization, the model combines vectors to create a single embedding for the whole sentence by pooling. Pooling means combining multiple token embeddings into a single fixed-length vector.

| Pooling Type | Description | Example Result |
| --- | --- | --- |
| Mean Pooling (Averaging) | Take the mean of all token vectors | Average over tokens → captures overall sentence meaning |
| Max Pooling | Take the maximum value per dimension | Keeps the most activated feature per dimension |
| CLS Pooling (BERT) | Use only the [CLS] token's embedding | Uses the special summary token |
| Last Token Pooling (GPT-like) | Use the embedding of the last token | Works for autoregressive models |
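
The four pooling strategies can be sketched in plain Python on a tiny made-up example. The function names and the three 2-dimensional token vectors are illustrative assumptions; the sketch also assumes the CLS token sits at position 0, as it does in BERT-style inputs.

```python
def mean_pool(vectors):
    # Average each dimension across all tokens.
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def max_pool(vectors):
    # Keep the largest value seen in each dimension.
    return [max(column) for column in zip(*vectors)]

def cls_pool(vectors):
    # Assumes the CLS token is the first token.
    return vectors[0]

def last_token_pool(vectors):
    return vectors[-1]

token_vectors = [
    [1.0, 4.0],  # position 0, treated as the CLS token
    [3.0, 0.0],
    [2.0, 2.0],
]
print(mean_pool(token_vectors))        # [2.0, 2.0]
print(max_pool(token_vectors))         # [3.0, 4.0]
print(cls_pool(token_vectors))         # [1.0, 4.0]
print(last_token_pool(token_vectors))  # [2.0, 2.0]
```

Whatever the number of tokens, each strategy returns a vector of the same length as a single token vector, which is what makes the result storable as one fixed-length embedding.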

Step 5: Storing vectors in VectorDB

This embedding is now stored in a vector database. To find similarity between two words or sentences, we compute cosine similarity:

\[\text{similarity} = \frac{A \cdot B}{\|A\| \|B\|}\]

Where:

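  • A⋅B = dot product of the two vectors

  • ||A|| = magnitude (length) of vector A

  • ||B|| = magnitude (length) of vector B

Cosine similarity ranges from -1 to 1. If the similarity is approximately 1, the two vectors point in nearly the same direction, which suggests the texts mean nearly the same thing.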
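
The formula translates directly into code. This is a minimal sketch in plain Python so it runs without dependencies; the `cosine_similarity` name and the sample vectors are illustrative, and in practice a NumPy-based version would be the usual choice for large vectors.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two magnitudes.
    # Raises ZeroDivisionError if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```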

Popular vector databases include:

  • Chroma

  • Redis Vector / RedisAI

  • Pinecone

Storing the content of a PDF for semantic search/retrieval

  • Extract text from the PDF using pypdf.

  • Split into meaningful sections (e.g., 500–1000 tokens per chunk).

  • Generate embeddings with an embedding model (text-embedding-3-small from OpenAI or all-MiniLM-L6-v2 from Sentence Transformers).

  • Store the embeddings in a vector database; we will use chromadb.

  • Retrieve + Query

PDF → Text → Chunk → Embeddings → Local Vector DB → Query.

For this project, we will require a few Python packages:

  • langchain_text_splitters – RecursiveCharacterTextSplitter for splitting text into chunks.

  • langchain_community – PyPDFLoader for reading PDF data.

  • sentence_transformers – for generating embeddings.

  • chromadb – for storing vectors in ChromaDB.

  • pypdf – required by PyPDFLoader to read PDF files.

pip install langchain_text_splitters langchain_community sentence_transformers chromadb pypdf

Importing Libraries

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.utils import embedding_functions
import numpy as np

Loading PDF File

loader = PyPDFLoader("/Users/admin/Documents/text-algorithms.pdf")
docs = loader.load()

docs is a list of Document objects, one per page, each with page_content and metadata:

[Document(page_content="Text from page 1...", metadata={'source': '...'}),
Document(page_content="Text from page 2...", metadata={'source': '...'})]

Split into chunks and take page_content

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)
chunks = [chunk.page_content for chunk in chunks]
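
To make chunk_size and chunk_overlap concrete, here is a simplified sketch of a sliding-window splitter. The `chunk_text` function is hypothetical and is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and newlines). It only illustrates how consecutive chunks share an overlap. Note that the splitter measures chunk_size in characters by default, not tokens.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    # Each new chunk starts (chunk_size - chunk_overlap) characters after the
    # previous one, so neighbouring chunks share chunk_overlap characters.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return pieces

print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both chunks, which tends to help retrieval.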

Creating embeddings using the all-MiniLM-L6-v2 model

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)

Output of embeddings (one row per chunk; all-MiniLM-L6-v2 produces 384-dimensional vectors):

array([[-0.08156922,  0.04208222, -0.00247928, ...,  0.0319585 ,
         0.0847574 , -0.05243403],
       [-0.05187003,  0.05612822, -0.01785211, ...,  0.05183639,
         0.11590089, -0.02695952],
       [-0.07564815,  0.08808449,  0.04245563, ...,  0.06281551,
         0.05972335, -0.02641685],
       ...,
       [-0.12733817,  0.05517265,  0.10413621, ...,  0.09523505,
         0.05419088, -0.00297189],
       [-0.08528817,  0.05107336,  0.06153389, ..., -0.0302441 ,
         0.07908767,  0.05990531],
       [-0.01405465,  0.08532862, -0.05380322, ...,  0.00176415,
         0.08464316,  0.07506391]], dtype=float32)

Storing data in chromadb

client = chromadb.Client()
collection = client.create_collection("pdf_collection_test")

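for i, chunk in enumerate(chunks):
    collection.add(
        documents=[chunk],
        # Pass the vector computed above; without it, Chroma re-embeds
        # the document with its own default embedding function.
        embeddings=[embeddings[i].tolist()],
        metadatas=[{"source": "text-algorithms.pdf"}],
        ids=[str(i)]
    )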

Retrieve data

query = "What is the main topic?"
query_embedding = model.encode(query).tolist()  # one vector for the query, as a plain list

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3
)

for doc in results['documents'][0]:
    print('-->', doc)

Output of query:

--> efficient data structure for factors of the text has to be used. But the algorithm, or more exactly
the factorization defined for the purpose, is related to data compressions methods based on
elimination of repetitions of factors. The algorithm shows another deep application of data
structures developed in Chapters 5 and 6 for storing all factors of a text.
--> pattern. Generally we may want to find a string called a pattern of length m inside a text of
length n where n is greater than m. The pattern can be described in a more complex way to
denote a set of strings and not only a single word. In many cases m is very large. In genetics
the pattern can correspond to a genome that can be very long, in image processing the digitized
images sent serially take millions of characters each. The string-matching problem is the basic
question considered in the book, together with its variations. String-matching is also the basic
subproblem in very other algorithmic problems on texts. Below there is a (not exclusive) list of
basic groups of problems discussed in the book :
variations on the string-matching problem,  problems related to the structure of the
segments of a text, data compression, approximation problems, finding regularities, extensions
to two-dimensional images, extensions to trees, optimal time-space implementations, optimal
--> Addison-Wesley, Reading, Mass., 1991, second edition.
M. LOTHAIRE, Combinatorics on words, Addison-Wesley, Reading, Mass., 1983.
R. S EDGEWICK, Algorithms, Addison-Wesley, Reading, Mass., 1988, second
edition.
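
Conceptually, the query step ranks stored vectors by how close they are to the query vector and returns the top matches. The sketch below shows that idea as a brute-force scan using cosine similarity. The `top_k` helper, the sample documents, and their 2-dimensional vectors are made up for illustration; Chroma itself uses an approximate nearest-neighbor index rather than a full scan, and its default distance metric may differ from cosine similarity.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k=3):
    # stored is a list of (document, vector) pairs.
    # Score every document, then keep the k highest scores.
    scored = [(cosine_similarity(query_vector, vector), doc) for doc, vector in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

stored = [
    ("doc about cats", [0.9, 0.1]),
    ("doc about dogs", [0.8, 0.3]),
    ("doc about cars", [0.1, 0.9]),
]
for score, doc in top_k([1.0, 0.0], stored, k=2):
    print(f"{score:.3f}", doc)
```

A full scan like this costs time proportional to the number of stored vectors, which is why vector databases rely on index structures once collections grow large.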