Text splitting, embeddings, and vector storage

Retrieval Augmented Generation (RAG) with LangChain

Meri Nova

Machine Learning Engineer

Preparing data for retrieval

Documents are loaded.

Documents are split into chunks.

Document chunks are embedded.

Document chunks are stored.

The splitting step is highlighted in the RAG development workflow.

chunk_size

An arrow shows that the ideal chunk size lies in the middle: large chunks are slow to search and hard to interpret, while small chunks carry too little context.

chunk_overlap

  • Includes information from beyond the chunk boundary

Two chunks with a highlighted region that overlaps both.
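
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. The function name chunk_with_overlap and the sample string are hypothetical, and this is not how LangChain's splitters are implemented; it only illustrates that each chunk repeats the last chunk_overlap characters of the previous one.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    (Hypothetical helper for illustration only.)
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
windows = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(windows)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The shared characters ("hij", "opq", "vwx") are what carries context across a boundary.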

CharacterTextSplitter

from langchain_text_splitters import CharacterTextSplitter

text = """Machine learning is a fascinating field.\n\nIt involves algorithms and models that can learn from data. These models can then make predictions or decisions without being explicitly programmed to perform the task.\nThis capability is increasingly valuable in various industries, from finance to healthcare.\n\nThere are many types of machine learning, including supervised, unsupervised, and reinforcement learning.\nEach type has its own strengths and applications."""
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=100,
    chunk_overlap=10
)

chunks = text_splitter.split_text(text)

print(chunks)
print([len(chunk) for chunk in chunks])
['Machine learning is a fascinating field.',
 'It involves algorithms and models that can learn from data. These models can...',
 'There are many types of machine learning, including supervised, unsupervised...']

[40, 260, 155]
  • Chunks may lose context
  • Chunks may be larger than chunk_size
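
One way to see why a single-separator splitter can exceed chunk_size is the simplified sketch below. It is an assumption-laden stand-in written in plain Python, not LangChain's code: the helper split_on_separator and the sample paragraphs are hypothetical. Pieces are merged while they fit, but a piece that is already too long has nowhere else to be cut.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator and merge neighbouring pieces while they fit.

    A piece longer than chunk_size is kept whole, because this sketch has
    no smaller separator to fall back on. (Hypothetical helper.)
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may be longer than chunk_size
    if current:
        chunks.append(current)
    return chunks


paragraphs = ["Short one.", "x" * 50, "Tail."]
single_sep_chunks = split_on_separator("\n\n".join(paragraphs), "\n\n", chunk_size=20)
print([len(chunk) for chunk in single_sep_chunks])
# [10, 50, 5]
```

The 50-character paragraph contains no blank line, so it survives as one oversized chunk.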

RecursiveCharacterTextSplitter

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=100,
    chunk_overlap=10
)

chunks = splitter.split_text(text)

print(chunks)
print([len(chunk) for chunk in chunks])
['Machine learning is a fascinating field.',
 'It involves algorithms and models that can learn from data. These models ...',
 'or decisions without being explicitly programmed to perform the task.',
 'This capability is increasingly valuable in various industries, from ...',
 'There are many types of machine learning, including supervised, ...',
 'learning.',
 'Each type has its own strengths and applications.']
[40, 98, 69, 91, 95, 9, 49]
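
The idea of falling back through a list of separators can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function recursive_split is hypothetical, it ignores chunk_overlap, and its merging rule is a plain greedy one.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator; retry oversized pieces with the next ones.

    Pieces are greedily merged back together while they fit in chunk_size.
    An empty separator (or none left) means a hard cut every chunk_size
    characters. (Hypothetical helper; overlap is not handled.)
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The merged candidate is too long: flush what has been collected.
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: try again with the finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample_text = "one two three\n\nfour five six seven eight\n\nnine"
recursive_chunks = recursive_split(sample_text, ["\n\n", "\n", " ", ""], chunk_size=15)
print(recursive_chunks)
# ['one two three', 'four five six', 'seven eight', 'nine']
```

Unlike the single-separator case, every chunk here stays within chunk_size, because an oversized paragraph is split again on spaces.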

Splitting documents

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("research_paper.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

chunks = splitter.split_documents(documents)

print(chunks)

print([len(chunk.page_content) for chunk in chunks])
[Document(metadata={'source': 'Rag Paper.pdf', 'page': 0}, page_content='...'),
 Document(metadata={'source': 'Rag Paper.pdf', 'page': 0}, page_content='...'),
 Document(metadata={'source': 'Rag Paper.pdf', 'page': 0}, page_content='...')]

[928, 946, 921,...]
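
The output above shows each chunk keeping the source and page of the document it came from. A minimal sketch of that behaviour, assuming a hypothetical Doc dataclass as a stand-in for LangChain's Document and a naive fixed-size cut instead of a real splitter:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs: list[Doc], chunk_size: int) -> list[Doc]:
    """Cut each document into fixed-size pieces, copying its metadata to every piece."""
    chunks = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            piece = doc.page_content[start:start + chunk_size]
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks


pages = [
    Doc("a" * 25, {"source": "example.pdf", "page": 0}),
    Doc("b" * 10, {"source": "example.pdf", "page": 1}),
]
doc_chunks = split_docs(pages, chunk_size=10)
print([(len(c.page_content), c.metadata["page"]) for c in doc_chunks])
# [(10, 0), (10, 0), (5, 0), (10, 1)]
```

Keeping the metadata on every chunk is what later lets a retrieved chunk be traced back to its page.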

Embedding and storage

The embedding and storage steps are highlighted.

What are embeddings?

A sentence is sent to an embedding model.

The embedding model transforms the text into a vector of numeric values.
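
As a rough illustration of "text in, vector out", the sketch below uses a toy word-count embedding. This is an assumption for demonstration only: the vocabulary and the function toy_embed are hypothetical, and a real embedding model produces learned, dense vectors with far more dimensions.

```python
# Hypothetical toy vocabulary; each word is one dimension of the vector.
VOCABULARY = ["cat", "dog", "pet", "stock", "market"]


def toy_embed(text: str) -> list[float]:
    """Map text to a fixed-length vector by counting vocabulary words.

    Toy stand-in for an embedding model, not a learned representation.
    """
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCABULARY]


cat_vector = toy_embed("My cat is a pet")
finance_vector = toy_embed("The stock market fell")
print(cat_vector)      # [1.0, 0.0, 1.0, 0.0, 0.0]
print(finance_vector)  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Whatever the length of the input, the output always has the same number of dimensions, which is what makes vectors comparable.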

Embedding and storing the chunks

  • Embed and store with: OpenAI and ChromaDB
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embedding_model = OpenAIEmbeddings(
    api_key=openai_api_key,
    model="text-embedding-3-small"
)

vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model
)
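
Conceptually, a vector store keeps each chunk next to its vector and returns the chunks whose vectors are closest to a query vector. The sketch below is a toy in-memory version under stated assumptions: the class ToyVectorStore, the hand-written two-dimensional vectors, and the use of cosine similarity with a linear scan are all illustrative choices, not a description of how Chroma works internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: a list of (vector, text) pairs."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._items.append((vector, text))

    def search(self, query_vector: list[float], k: int = 1) -> list[str]:
        """Return the k texts whose vectors are most similar to the query."""
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add([1.0, 0.0], "chunk about splitting")
store.add([0.0, 1.0], "chunk about embeddings")
store.add([0.7, 0.7], "chunk about both")

top_two = store.search([0.9, 0.1], k=2)
print(top_two)
# ['chunk about splitting', 'chunk about both']
```

In a real pipeline the vectors would come from the embedding model rather than being written by hand.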

Let's practice!
