Splitting external data for retrieval

Developing LLM Applications with LangChain

Jonathan Bennion

AI Engineer & LangChain Contributor

RAG development steps

The general RAG workflow: a document loader, a document splitter, and the storage and retrieval process.

  • Document splitting: split documents into chunks
  • Break documents up so that each chunk fits within an LLM's context window
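As a rough illustration of the idea (not LangChain's implementation), the simplest possible splitter just slices text into fixed-size character windows:

```python
# Naive fixed-size chunking: slice text into windows of chunk_size
# characters so each chunk fits within a model's context window.
def naive_chunk(text, chunk_size):
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

print(naive_chunk("Break documents up to fit the context window.", 20))
```

This ignores word and sentence boundaries, which is exactly the problem the splitters introduced below address.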

Thinking about splitting...

The first paragraph from the introduction of the Attention is All You Need paper.

Line 1:

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks

Line 2:

in particular, have been firmly established as state of the art approaches in sequence modeling and
1 https://arxiv.org/abs/1706.03762

Chunk overlap

The first paragraph from the introduction of the Attention is All You Need paper split into two chunks with a chunk overlap.
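Chunk overlap can be sketched in plain Python (a simplified stand-in for LangChain's splitters, which also respect separators): consecutive chunks share chunk_overlap characters, so context is not lost at chunk boundaries.

```python
# Overlapping character windows: each chunk starts chunk_size - chunk_overlap
# characters after the previous one, so neighbouring chunks share context.
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

for chunk in chunk_with_overlap("Recurrent neural networks have been established", 24, 4):
    print(chunk)
```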


What is the best document splitting strategy?

The word "context" chunked into individual letters.

  1. CharacterTextSplitter
  2. RecursiveCharacterTextSplitter
  3. Many others
1 Wikipedia Commons
quote = '''One machine can do the work of fifty ordinary humans.\nNo machine can do the work of one extraordinary human.'''

len(quote)

108

chunk_size = 24
chunk_overlap = 3
1 Elbert Hubbard
from langchain_text_splitters import CharacterTextSplitter


ct_splitter = CharacterTextSplitter(
    separator='.',
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

docs = ct_splitter.split_text(quote)
print(docs)
print([len(doc) for doc in docs])

['One machine can do the work of fifty ordinary humans',
 'No machine can do the work of one extraordinary human']

[52, 53]

  • Splits on the separator so that chunks stay below chunk_size, but this may not always succeed: here both chunks exceed chunk_size=24!
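To see why it can fail, here is a separator-only split sketched without LangChain (a hypothetical helper, not the library's code): any piece between separators that is longer than chunk_size simply stays oversized, because the splitter has no finer boundary to fall back on.

```python
# Separator-only splitting: pieces between separators become chunks,
# even when a piece is longer than chunk_size.
def split_on_separator(text, separator, chunk_size):
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    oversized = [p for p in pieces if len(p) > chunk_size]
    return pieces, oversized

quote = "One machine can do the work of fifty ordinary humans."
pieces, oversized = split_on_separator(quote, ".", 24)
print(oversized)  # the 52-character piece cannot be shrunk further
```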
from langchain_text_splitters import RecursiveCharacterTextSplitter


rc_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

docs = rc_splitter.split_text(quote)
print(docs)

RecursiveCharacterTextSplitter

  • separators=["\n\n", "\n", " ", ""]
['One machine can do the',
 'work of fifty ordinary',
 'humans.',
 'No machine can do the',
 'work of one',
 'extraordinary human.']
  1. Try splitting by paragraph: "\n\n"
  2. Then try splitting by line: "\n"
  3. Then try splitting by word: " "
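The fallback order above can be sketched as a small recursive function. This is a simplification: LangChain's real splitter also merges adjacent small pieces back up to chunk_size and applies the overlap, which this sketch omits.

```python
# Recursive fallback: try each separator in order; recurse on any piece
# that is still longer than chunk_size.
def recursive_split(text, separators, chunk_size):
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep) if sep else list(text)
    chunks = []
    for part in parts:
        if len(part) > chunk_size:
            chunks.extend(recursive_split(part, rest, chunk_size))
        elif part:
            chunks.append(part)
    return chunks

print(recursive_split("One machine can do the work", ["\n\n", "\n", " ", ""], 24))
```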

RecursiveCharacterTextSplitter with HTML

from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


loader = UnstructuredHTMLLoader("white_house_executive_order_nov_2023.html")
data = loader.load()

rc_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=['.']
)

docs = rc_splitter.split_documents(data)
print(docs[0])
Document(page_content="To search this site, enter a search term [...]
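Note that split_documents differs from split_text in that each chunk stays wrapped in a Document and keeps the source metadata. A rough sketch of that behaviour, using a stand-in Document class rather than LangChain's, with naive fixed-size slicing for illustration:

```python
# Stand-in for LangChain's Document: text plus metadata.
class Document:
    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}

def split_documents(documents, chunk_size):
    # Split each document's text and copy its metadata onto every chunk.
    chunks = []
    for doc in documents:
        text = doc.page_content
        for i in range(0, len(text), chunk_size):
            chunks.append(Document(text[i:i + chunk_size], dict(doc.metadata)))
    return chunks

docs = split_documents([Document("a" * 50, {"source": "order.html"})], 24)
print(len(docs), docs[0].metadata)
```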

Let's practice!

