Text splitting, embeddings, and vector storage

Retrieval Augmented Generation (RAG) with LangChain

Meri Nova

Machine Learning Engineer

Preparing data for retrieval

  • Documents are loaded.
  • Documents are split into chunks.
  • Document chunks are embedded.
  • Document chunks are stored.

The splitting step is highlighted in the RAG development workflow.

chunk_size

  • Chunks too large: slow retrieval and difficulties with interpretation
  • Chunks too small: insufficient context
  • The ideal chunk size is somewhere in the middle

chunk_overlap

  • Include information beyond the boundary

Two chunks with an area highlighted that overlaps both.
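
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap, written in plain Python. The helper `sliding_window_chunks` is hypothetical and is not how LangChain's splitters work internally; it only illustrates how `chunk_size` caps the length of each chunk and how `chunk_overlap` repeats the end of one chunk at the start of the next.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_overlap characters before the previous one ended.
    Hypothetical helper for illustration only.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, so information that straddles a boundary is present in both chunks.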

CharacterTextSplitter

from langchain_text_splitters import CharacterTextSplitter

text = """Machine learning is a fascinating field.\n\nIt involves algorithms and models that can learn from data. These models can then make predictions or decisions without being explicitly programmed to perform the task.\nThis capability is increasingly valuable in various industries, from finance to healthcare.\n\nThere are many types of machine learning, including supervised, unsupervised, and reinforcement learning.\nEach type has its own strengths and applications."""
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=100,
    chunk_overlap=10
)

chunks = text_splitter.split_text(text)

print(chunks)
print([len(chunk) for chunk in chunks])
['Machine learning is a fascinating field.',
 'It involves algorithms and models that can learn from data. These models can...',
 'There are many types of machine learning, including supervised, unsupervised...']

[40, 260, 155]
  • Chunks may lack context
  • Chunks may be larger than chunk_size
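
One way to see why chunks can exceed chunk_size is to split the same text on the separator by hand. The sketch below uses only `str.split` and is not the `CharacterTextSplitter` implementation; it shows that when the only allowed cut point is "\n\n", a paragraph longer than the limit has nowhere to be divided.

```python
text = (
    "Machine learning is a fascinating field.\n\n"
    "It involves algorithms and models that can learn from data. "
    "These models can then make predictions or decisions without being explicitly programmed to perform the task.\n"
    "This capability is increasingly valuable in various industries, from finance to healthcare.\n\n"
    "There are many types of machine learning, including supervised, unsupervised, and reinforcement learning.\n"
    "Each type has its own strengths and applications."
)
chunk_size = 100

# The only permitted cut point is the paragraph break.
paragraphs = text.split("\n\n")
lengths = [len(p) for p in paragraphs]
oversized = [n for n in lengths if n > chunk_size]

print(lengths)    # [40, 260, 155]
print(oversized)  # [260, 155]
```

The second and third paragraphs contain no "\n\n", so they stay whole even though they are longer than 100 characters.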

RecursiveCharacterTextSplitter

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=100,
    chunk_overlap=10
)

chunks = splitter.split_text(text)

print(chunks)
print([len(chunk) for chunk in chunks])
['Machine learning is a fascinating field.',
 'It involves algorithms and models that can learn from data. These models ...',
 'or decisions without being explicitly programmed to perform the task.',
 'This capability is increasingly valuable in various industries, from ...',
 'There are many types of machine learning, including supervised, ...',
 'learning.',
 'Each type has its own strengths and applications.']
[40, 98, 69, 91, 95, 9, 49]
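
The idea behind the separator list can be sketched in plain Python: try the first separator, merge neighbouring pieces while they still fit, and fall back to the next separator only for pieces that are still too long. The function `recursive_split` below is a simplified, hypothetical illustration (it ignores chunk_overlap) and should not be read as LangChain's actual implementation.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Simplified recursive splitting without overlap (illustration only)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long for this separator: try the finer ones.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


paragraph = (
    "It involves algorithms and models that can learn from data. "
    "These models can then make predictions or decisions without being explicitly programmed to perform the task.\n"
    "This capability is increasingly valuable in various industries, from finance to healthcare."
)
chunks = recursive_split(paragraph, ["\n\n", "\n", " ", ""], chunk_size=100)
print([len(chunk) for chunk in chunks])
# [98, 69, 91]
```

Here the 260-character paragraph is first cut at the single newline, and the half that is still too long is then cut at a space, so every chunk stays within the limit.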

Splitting documents

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("research_paper.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

chunks = splitter.split_documents(documents)

print(chunks)

print([len(chunk.page_content) for chunk in chunks])
[Document(metadata={'source': 'research_paper.pdf', 'page': 0}, page_content='...'),
 Document(metadata={'source': 'research_paper.pdf', 'page': 0}, page_content='...'),
 Document(metadata={'source': 'research_paper.pdf', 'page': 0}, page_content='...')]

[928, 946, 921,...]
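
The output above shows that every chunk keeps the metadata of the page it came from. A minimal sketch of that behaviour, assuming a hypothetical `Doc` dataclass and a hypothetical `split_docs` helper in place of LangChain's `Document` and `split_documents`, could look like this:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs: list[Doc], chunk_size: int) -> list[Doc]:
    """Cut each document into fixed-size pieces, copying its metadata onto every piece."""
    chunks = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            piece = doc.page_content[start:start + chunk_size]
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks


# "example.pdf" is a placeholder file name, not a real document.
pages = [
    Doc("a" * 25, {"source": "example.pdf", "page": 0}),
    Doc("b" * 10, {"source": "example.pdf", "page": 1}),
]
doc_chunks = split_docs(pages, chunk_size=10)
print([(c.metadata["page"], len(c.page_content)) for c in doc_chunks])
# [(0, 10), (0, 10), (0, 5), (1, 10)]
```

Keeping the source and page on each chunk is what later allows a retrieved chunk to be traced back to its origin.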

Embedding and storage

The embedding and storage steps are highlighted.

What are embeddings?

A sentence is passed into an embedding model.

The embedding model embeds the text as a vector of numerical values.
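
As a rough illustration of "text in, vector of numbers out", the sketch below builds a toy embedding from hashed word counts. The function `toy_embed` and its 16-dimension default are assumptions made for this example; a real embedding model is learned from data and captures meaning, which this toy does not.

```python
import math
import re
import zlib


def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Hashed bag-of-words vector scaled to unit length (toy stand-in for a learned model)."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z]+", text.lower()):
        # crc32 gives a stable bucket for each word across runs.
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


short_vec = toy_embed("Machine learning is a fascinating field.")
long_vec = toy_embed(
    "There are many types of machine learning, including supervised, "
    "unsupervised, and reinforcement learning."
)
print(len(short_vec), len(long_vec))
# 16 16
```

Whatever the length of the input text, the result has the same number of values, which is what makes vectors from different chunks comparable.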

Embedding and storing the chunks

  • Embed and store with: OpenAI and ChromaDB
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embedding_model = OpenAIEmbeddings(
    api_key=openai_api_key,
    model="text-embedding-3-small"
)


vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model
)
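
Conceptually, a vector store keeps each chunk next to its embedding and ranks chunks by how close their vectors are to the query vector. The sketch below is a tiny in-memory version in plain Python; `TinyVectorStore`, `count_embed`, and the `VOCAB` list are hypothetical names invented for this illustration and are unrelated to how Chroma is implemented.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Hypothetical in-memory store: (vector, text) pairs ranked by cosine similarity."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []

    def add(self, texts):
        for text in texts:
            self.rows.append((self.embed(text), text))

    def search(self, query, k=1):
        query_vec = self.embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(query_vec, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


# Toy embedding: counts of a few hand-picked words (assumption for this example).
VOCAB = ["learning", "finance", "healthcare", "models", "data"]


def count_embed(text: str) -> list[float]:
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return [float(words.count(word)) for word in VOCAB]


store = TinyVectorStore(count_embed)
store.add([
    "It involves algorithms and models that can learn from data.",
    "This capability is increasingly valuable in various industries, from finance to healthcare.",
])
results = store.search("Which industries, such as finance, benefit?", k=1)
print(results)
# ['This capability is increasingly valuable in various industries, from finance to healthcare.']
```

A production vector store adds persistence and fast approximate search, but the store-then-rank pattern is the same.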

Let's practice!
