Understanding VectorDB and Vector Embedding

A vectorDB is a special kind of database that can store and manage unstructured data in the form of vectors.
Vectors are mathematical representations of data points in an n-dimensional space. Each vector is an array of numbers that captures the essential features of the data. Converting data into a meaningful vector form is called vector embedding.
Example:
Text: "Hello world"
Vector: [0.1, 0.3, 0.5, 0.7]
In short, a vectorDB stores the vector embeddings of data, which enables efficient similarity search and retrieval.

Several embedding models are available to convert data into vector form. Popular ones include:
BERT
FastText
Sentence Transformers
OpenAI Embeddings
How is text converted into vector form to be stored in VectorDB?
Step 1: Tokenization
The text is first broken down into smaller units called tokens. For example, the sentence "hello, My name is vinayak" can be tokenized into ["hello", ",", "My", "name", "is", "vinayak"].
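As a rough illustration, the split above can be reproduced with a regular expression that separates words from punctuation. This is only a sketch: the `simple_tokenize` helper is hypothetical, and real embedding models typically use subword tokenizers (for example WordPiece in BERT), which may break a rare word such as "vinayak" into several pieces.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens (toy tokenizer)."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("hello, My name is vinayak")
print(tokens)
```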
Step 2: Embedding
For each token, the embedding model generates a vector representation, or base vector.
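Conceptually, this step is a table lookup: every token in the model's vocabulary has its own row in an embedding table. The sketch below uses a made-up three-dimensional table. The names `toy_embeddings` and `embed_tokens` and all the values are assumptions for illustration; real models learn these values during training and use hundreds of dimensions.

```python
# Hypothetical embedding table: token -> base vector (values are made up)
toy_embeddings = {
    "hello": [0.2, 0.1, 0.7],
    "world": [0.9, 0.3, 0.1],
    "[UNK]": [0.0, 0.0, 0.0],  # fallback for tokens that are not in the vocabulary
}

def embed_tokens(tokens: list[str]) -> list[list[float]]:
    """Return one base vector per token, using the [UNK] vector for unknown tokens."""
    return [toy_embeddings.get(tok, toy_embeddings["[UNK]"]) for tok in tokens]

base_vectors = embed_tokens(["hello", "world", "vinayak"])
print(base_vectors)
```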
Step 3: Contextualization
The model processes these vectors and captures context and relationships between words.
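One common way transformer models capture context is self-attention, where each token vector is replaced by a weighted average of all token vectors in the sentence. The following is a heavily simplified sketch under that assumption: it uses the token vectors directly as queries, keys, and values, whereas real models apply learned projections, multiple attention heads, and many layers. The input values are made up.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors: list[list[float]]) -> list[list[float]]:
    """Replace each token vector with a weighted average of all token vectors."""
    dim = len(vectors[0])
    contextual = []
    for query in vectors:
        # Scaled dot product between this token and every token (including itself)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Weighted sum of all token vectors, one dimension at a time
        contextual.append(
            [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
        )
    return contextual

base_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual_vectors = self_attention(base_vectors)
print(contextual_vectors)
```

Each output vector now mixes in information from the other tokens, weighted by how strongly the tokens match each other.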
Step 4: Sentence-level Vector using Pooling
After contextualization, the model combines the token vectors into a single embedding for the whole sentence. This step is called pooling: it turns a variable number of token embeddings into one fixed-length vector.
| Pooling Type | Description | Effect |
| --- | --- | --- |
| Mean Pooling (Averaging) | Take the mean of all token vectors | Averages over tokens; captures overall sentence meaning |
| Max Pooling | Take the maximum value per dimension | Keeps the most activated feature per dimension |
| CLS Pooling (BERT) | Use only the [CLS] token's embedding | Uses the special summary token |
| Last Token Pooling (GPT-like) | Use the embedding of the last token | Works for autoregressive models |
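The four pooling strategies in the table can be sketched in a few lines of plain Python. The `token_vectors` values are made up, and the first row is assumed to play the role of the [CLS] token. In practice the embedding library performs this step internally.

```python
# Hypothetical contextual token vectors for a 3-token sentence (values are made up)
token_vectors = [
    [0.1, 0.8, 0.3, 0.0],  # first token (plays the role of [CLS] here)
    [0.5, 0.2, 0.9, 0.4],
    [0.3, 0.5, 0.0, 0.8],  # last token
]

def mean_pool(vectors):
    """Average each dimension over all tokens."""
    return [sum(column) / len(column) for column in zip(*vectors)]

def max_pool(vectors):
    """Keep the largest value in each dimension."""
    return [max(column) for column in zip(*vectors)]

def cls_pool(vectors):
    """Use the first token's vector (the [CLS] position in BERT-style models)."""
    return vectors[0]

def last_token_pool(vectors):
    """Use the last token's vector."""
    return vectors[-1]

for name, pool in [("mean", mean_pool), ("max", max_pool),
                   ("cls", cls_pool), ("last", last_token_pool)]:
    print(name, pool(token_vectors))
```

Whatever the sentence length, every strategy returns a vector with the same number of dimensions as a single token vector.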
Step 5: Storing vectors in VectorDB
This embedding is now stored in a vector database. To find similarity between two words or sentences, we compute cosine similarity:
\[\text{similarity} = \frac{A \cdot B}{\|A\| \|B\|}\]
Where:
A · B = dot product of the two vectors
||A|| = magnitude (length) of vector A
||B|| = magnitude (length) of vector B
If the similarity is approximately 1, the texts mean nearly the same thing.
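The formula translates directly into code. This is a minimal sketch that assumes both vectors have the same length and neither is all zeros (otherwise the division fails). The example vectors are made up.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return (A . B) / (||A|| * ||B||) for two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # a scaled copy of a
orthogonal = [3.0, 0.0, -1.0]      # dot product with a is zero

print(cosine_similarity(a, same_direction))  # close to 1
print(cosine_similarity(a, orthogonal))      # 0
```

Because the result depends only on the angle between the vectors, scaling a vector does not change its similarity to another vector.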
Popular Vector Databases
Chroma
Redis Vector / RedisAI
Pinecone
Storing the content of a PDF for semantic search/retrieval
Extract text from the PDF using pypdf.
Split into meaningful sections (e.g., 500–1000 tokens per chunk).
Use an embedding model (text-embedding-3-small from OpenAI or all-MiniLM-L6-v2 from Sentence Transformers).
Store the vectors in a vector database; we will use chromadb.
Query the database to retrieve the most relevant chunks.
PDF → Text → Chunk → Embeddings → Local Vector DB → Query.
For this project, we will need a few Python packages:
langchain_text_splitters: RecursiveCharacterTextSplitter for splitting text into chunks.
langchain_community: PyPDFLoader for reading PDF data.
sentence_transformers: for generating embeddings.
chromadb: for storing vectors in ChromaDB.
pypdf: required by PyPDFLoader to read PDF files.
pip install langchain_text_splitters langchain_community sentence_transformers chromadb pypdf
Importing Libraries
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.utils import embedding_functions
import numpy as np
Loading PDF File
loader = PyPDFLoader("/Users/admin/Documents/text-algorithms.pdf")
docs = loader.load()
docs is a list of Document objects, one per page, each with page_content and metadata:
[Document(page_content="Text from page 1...", metadata={'source': '...'}),
Document(page_content="Text from page 2...", metadata={'source': '...'})]
Split into chunks and take page_content
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)
chunks = [chunk.page_content for chunk in chunks]
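To show what chunk_size and chunk_overlap mean, here is a simplified sliding-window splitter. It is not the algorithm RecursiveCharacterTextSplitter uses (that class tries to split on separators such as paragraph and line breaks first). The hypothetical `sliding_window_chunks` helper only illustrates how consecutive chunks share some characters, with sizes measured in characters.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for piece in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(piece)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.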
Creating embeddings using the all-MiniLM-L6-v2 model
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)
output of embeddings:
array([[-0.08156922, 0.04208222, -0.00247928, ..., 0.0319585 ,
0.0847574 , -0.05243403],
[-0.05187003, 0.05612822, -0.01785211, ..., 0.05183639,
0.11590089, -0.02695952],
[-0.07564815, 0.08808449, 0.04245563, ..., 0.06281551,
0.05972335, -0.02641685],
...,
[-0.12733817, 0.05517265, 0.10413621, ..., 0.09523505,
0.05419088, -0.00297189],
[-0.08528817, 0.05107336, 0.06153389, ..., -0.0302441 ,
0.07908767, 0.05990531],
[-0.01405465, 0.08532862, -0.05380322, ..., 0.00176415,
0.08464316, 0.07506391]], dtype=float32)
Storing data in chromadb
client = chromadb.Client()
collection = client.create_collection("pdf_collection_test")
for i, chunk in enumerate(chunks):
    collection.add(
        documents=[chunk],
        embeddings=[embeddings[i].tolist()],  # reuse the vector computed above instead of re-embedding
        metadatas=[{"source": "text-algorithms.pdf"}],
        ids=[str(i)],
    )
Retrieve data
query = "What is the main topic?"
query_embedding = model.encode([query])[0].tolist()  # embed the query with the same model as the chunks
results = collection.query(
    query_embeddings=[query_embedding],  # a list with one vector per query
    n_results=3,
)
for doc in results['documents'][0]:
    print('-->', doc)
output of query:
--> efficient data structure for factors of the text has to be used. But the algorithm, or more exactly
the factorization defined for the purpose, is related to data compressions methods based on
elimination of repetitions of factors. The algorithm shows another deep application of data
structures developed in Chapters 5 and 6 for storing all factors of a text.
--> pattern. Generally we may want to find a string called a pattern of length m inside a text of
length n where n is greater than m. The pattern can be described in a more complex way to
denote a set of strings and not only a single word. In many cases m is very large. In genetics
the pattern can correspond to a genome that can be very long, in image processing the digitized
images sent serially take millions of characters each. The string-matching problem is the basic
question considered in the book, together with its variations. String-matching is also the basic
subproblem in very other algorithmic problems on texts. Below there is a (not exclusive) list of
basic groups of problems discussed in the book :
variations on the string-matching problem, problems related to the structure of the
segments of a text, data compression, approximation problems, finding regularities, extensions
to two-dimensional images, extensions to trees, optimal time-space implementations, optimal
--> Addison-Wesley, Reading, Mass., 1991, second edition.
M. LOTHAIRE, Combinatorics on words, Addison-Wesley, Reading, Mass., 1983.
R. S EDGEWICK, Algorithms, Addison-Wesley, Reading, Mass., 1988, second
edition.
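To make the retrieval step concrete, the sketch below ranks a handful of stored vectors against a query vector by cosine similarity and returns the best matches. It is a brute-force illustration with made-up two-dimensional vectors and a hypothetical `top_k` helper. Vector databases typically avoid scanning every vector by using an approximate nearest-neighbour index, and the distance metric they use depends on configuration.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vector, stored_vectors, k):
    """Return (index, similarity) pairs for the k stored vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vector, vec))
              for i, vec in enumerate(stored_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]

# Hypothetical stored embeddings and query embedding
stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query_vector = [0.9, 0.1]

for index, score in top_k(query_vector, stored, k=2):
    print(index, round(score, 3))
```

The returned indices play the same role as the chunk ids in the collection: they identify which stored texts to show for the query.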





