Advanced splitting methods

Retrieval Augmented Generation (RAG) with LangChain

Meri Nova

Machine Learning Engineer

Limitations of current splitting strategies

 

  1. 🤦 Splits are naive (context-unaware)

    • They ignore the context of the surrounding text
  2. 🖇 Character-based vs. token-based splits

    • Models process tokens, not characters
    • Risk of exceeding the model's context window
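To make the first limitation concrete, here is a minimal, hypothetical character splitter (a sketch for illustration, not the LangChain implementation; the function name `naive_char_split` and the example sentence are invented): it cuts purely by character count, so boundaries can land mid-word.

```python
# Naive fixed-size character splitter (illustrative sketch only):
# it counts characters, so it happily cuts words and sentences in half.
def naive_char_split(text: str, chunk_size: int) -> list[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = naive_char_split(
    "Retrieval Augmented Generation grounds answers in documents.", 20
)
print(chunks)
# The word "answers" is split across the last two chunks: the splitter
# has no notion of words, sentences, or topics.
```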

 

SemanticChunker

 

TokenTextSplitter

Token splitting

A character splitter dividing text into chunks by character count.

Token splitting

A token splitter dividing text into chunks by token count.

Token splitting

The tokens are highlighted to show how they align with chunk_size and chunk_overlap.

Token splitting

import tiktoken
from langchain_text_splitters import TokenTextSplitter

example_string = "Mary had a little lamb, it's fleece was white as snow."

# Look up the tokenizer used by the target model
encoding = tiktoken.encoding_for_model('gpt-4o-mini')
splitter = TokenTextSplitter(encoding_name=encoding.name,
                             chunk_size=10,
                             chunk_overlap=2)

chunks = splitter.split_text(example_string)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

Token splitting

Chunk 1:
Mary had a little lamb, it's fleece was

Chunk 2:
 fleece was white as snow.

Token splitting

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\nNo. tokens: {len(encoding.encode(chunk))}\n{chunk}\n")
Chunk 1:
No. tokens: 10
Mary had a little lamb, it's fleece was

Chunk 2:
No. tokens: 6
 fleece was white as snow.
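The repeated " fleece was" illustrates chunk_overlap: each new chunk restarts chunk_overlap tokens before the previous one ended. A generic sliding-window sketch over a list of token IDs (pure Python, not TokenTextSplitter itself; the function name `window_tokens` is invented) shows the mechanic:

```python
def window_tokens(tokens: list, chunk_size: int, chunk_overlap: int) -> list[list]:
    # Each window starts chunk_size - chunk_overlap tokens after the previous
    # one, so the last chunk_overlap tokens reappear at the next window's start.
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = list(range(16))  # stand-in for 16 token IDs
chunks = window_tokens(tokens, chunk_size=10, chunk_overlap=2)
print(chunks)  # tokens 8 and 9 appear at the end of chunk 1 and the start of chunk 2
```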

Semantic splitting

A paragraph with one sentence about RAG applications and another about dogs.

Semantic splitting

The paragraph was split by characters or tokens and lost context at the break.

Semantic splitting

A semantic splitter separated the paragraph where the topic shifts from RAG to dogs.
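One way to picture the idea: embed consecutive sentences and break wherever the similarity between neighbors drops sharply. A toy sketch with hand-made 2-D "embeddings" (the real SemanticChunker uses model embeddings and statistical breakpoint rules; the `cosine` helper, the vectors, and the 0.5 threshold here are invented for illustration):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made 2-D "embeddings": the two RAG sentences point one way,
# the dog sentence another.
sentences = ["RAG retrieves documents.",
             "Retrieved context grounds answers.",
             "My dog loves fetch."]
embeddings = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9)]

# Break wherever similarity between neighboring sentences falls below a threshold.
threshold = 0.5
chunks, current = [], [sentences[0]]
for i in range(1, len(sentences)):
    if cosine(embeddings[i - 1], embeddings[i]) < threshold:
        chunks.append(" ".join(current))
        current = []
    current.append(sentences[i])
chunks.append(" ".join(current))
print(chunks)
# The break lands exactly where the topic shifts from RAG to dogs.
```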

Semantic splitting

from langchain_openai import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

embeddings = OpenAIEmbeddings(api_key="...", model='text-embedding-3-small')
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="gradient",
    breakpoint_threshold_amount=0.8
)
1 https://api.python.langchain.com/en/latest/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html

Semantic splitting

chunks = semantic_splitter.split_documents(data)
print(chunks[0])
page_content='Retrieval-Augmented Generation for\nKnowledge-Intensive NLP Tasks\nPatrick Lewis,
Ethan Perez,\nAleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich
Küttler,\nMike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela\nFacebook AI
Research; University College London;New York University;\[email protected]\nAbstract\nLarge
pre-trained language models have been shown to store factual knowledge\nin their parameters,
and achieve state-of-the-art results when fine-tuned on down-\nstream NLP tasks. However, their
ability to access and precisely manipulate knowl-\nedge is still limited, and hence on
knowledge-intensive tasks, their performance\nlags behind task-specific architectures.'
metadata={'source': 'rag_paper.pdf', 'page': 0}

Let's practice!
