Splitting external data for retrieval

Developing LLM Applications with LangChain

Jonathan Bennion

AI Engineer & LangChain Contributor

RAG development steps

The general RAG workflow: a document loader, a document splitter, and the storage and retrieval process.

  • Document splitting: breaking a document into chunks
  • Split so that each chunk fits within an LLM's context window
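The three stages above can be sketched end to end in plain Python. This is a toy illustration, not LangChain code: `load_document`, `split_document`, and `retrieve` are hypothetical helper names, the sample text is made up, and retrieval is a naive word-overlap match rather than a vector store.

```python
def load_document() -> str:
    # Stand-in for a document loader; the text is made up for this sketch.
    return (
        "Cats sleep most of the day. Dogs enjoy long walks in the park. "
        "Parrots can mimic human speech."
    )


def split_document(text: str, chunk_size: int) -> list[str]:
    # Greedily pack whole words into chunks of up to chunk_size characters
    # (a single word longer than chunk_size is kept whole).
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def retrieve(chunks: list[str], query: str) -> str:
    # Toy retrieval: return the chunk sharing the most words with the query.
    query_words = set(query.lower().split())
    return max(chunks, key=lambda c: len(query_words & set(c.lower().split())))


chunks = split_document(load_document(), chunk_size=40)
best = retrieve(chunks, "which animals mimic speech")
print(chunks)
print(best)
```

Only the best-matching chunk would be passed to the LLM, which is why each chunk has to be small enough to fit in the context window.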

Thinking about splitting...

The first paragraph from the introduction of the paper Attention Is All You Need.

Line 1:

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks

Line 2:

in particular, have been firmly established as state of the art approaches in sequence modeling and
1 https://arxiv.org/abs/1706.03762

Chunk overlap

The first paragraph from the introduction of the paper Attention Is All You Need, split into two chunks that overlap.
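A minimal sketch of what overlap means in practice, assuming fixed-size character windows. The function `chunk_with_overlap` is a hypothetical helper written for this illustration; it is not how LangChain's splitters are implemented.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = (
    "Recurrent neural networks, long short-term memory [13] and gated "
    "recurrent [7] neural networks in particular, have been firmly established"
)
chunks = chunk_with_overlap(text, chunk_size=40, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

The last 10 characters of each chunk reappear at the start of the next one, so context that straddles a boundary is not lost entirely.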


What is the best splitting strategy?

The word "context" split into individual letters.

  1. CharacterTextSplitter
  2. RecursiveCharacterTextSplitter
  3. Many others
1 Wikipedia Commons
quote = '''One machine can do the work of fifty ordinary humans.\nNo machine can do the work of one extraordinary human.'''
len(quote)
108
chunk_size = 24
chunk_overlap = 3
1 Elbert Hubbard
from langchain_text_splitters import CharacterTextSplitter

# Split on periods; chunk_size and chunk_overlap are measured in characters
ct_splitter = CharacterTextSplitter(
    separator='.',
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

docs = ct_splitter.split_text(quote)
print(docs)
print([len(doc) for doc in docs])
['One machine can do the work of fifty ordinary humans',
 'No machine can do the work of one extraordinary human']

[52, 53]
  • Splits on the separator so that each chunk is < chunk_size, but this does not always succeed!
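Why the size limit can fail is easy to check by hand. The sketch below is a simplified stand-in for separator-based splitting (it only cuts at '.', strips whitespace, and drops empty pieces); it does not call LangChain and ignores the merging of small pieces and the overlap.

```python
quote = (
    "One machine can do the work of fifty ordinary humans.\n"
    "No machine can do the work of one extraordinary human."
)
chunk_size = 24

# Cut only at the separator: no piece can be made smaller than a full sentence.
pieces = [p.strip() for p in quote.split(".") if p.strip()]
lengths = [len(p) for p in pieces]
oversized = [p for p in pieces if len(p) > chunk_size]

print(lengths)         # [52, 53]
print(len(oversized))  # 2
```

With a single separator there is nowhere else to cut, so both sentences exceed chunk_size and are returned as they are.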
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Separators are tried in order, from coarsest to finest
rc_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

docs = rc_splitter.split_text(quote)
print(docs)

RecursiveCharacterTextSplitter

  • separators=["\n\n", "\n", " ", ""]
['One machine can do the',
 'work of fifty ordinary',
 'humans.',
 'No machine can do the',
 'work of one',
 'extraordinary human.']
  1. Try splitting by paragraph: "\n\n"
  2. Try splitting by sentence: "\n"
  3. Try splitting by word: " "
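The fallback order above can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's implementation: `recursive_split` is a hypothetical helper, and chunk overlap is left out to keep the logic short.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Try separators in order, recursing only on pieces that are still too long."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # Piece is too long on its own: flush and fall back to finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the current chunk
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


quote = (
    "One machine can do the work of fifty ordinary humans.\n"
    "No machine can do the work of one extraordinary human."
)
docs = recursive_split(quote, ["\n\n", "\n", " ", ""], chunk_size=24)
print(docs)
```

For this quote there is no "\n\n", so the text is first cut at "\n" into two sentences; both are still longer than 24 characters, so each is cut again at spaces and the words are packed greedily into chunks.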

RecursiveCharacterTextSplitter with HTML

from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the HTML file as a list of Document objects
loader = UnstructuredHTMLLoader("white_house_executive_order_nov_2023.html")
data = loader.load()

rc_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=['.']
)

# split_documents takes Document objects, not raw strings
docs = rc_splitter.split_documents(data)
print(docs[0])
Document(page_content="To search this site, enter a search term [...]

Let's practice!
