Text splitting, embeddings, and vector storage

Retrieval Augmented Generation (RAG) with LangChain

Meri Nova

Machine Learning Engineer

Preparing data for retrieval

  • Documents are loaded.
  • Documents are split into chunks.
  • The chunks are embedded.
  • The chunks are stored.

The splitting step is highlighted in the RAG development workflow.

chunk_size

An arrow shows that the ideal size lies in the middle: large chunks can be slow to process and hard to interpret, while small chunks may carry too little context.

chunk_overlap

  • Includes information from beyond the chunk boundary

Two chunks with a highlighted region that overlaps both.
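
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is an illustration only, not how LangChain's splitters work internally; the function name chunk_with_overlap and the sample string are assumptions made for this example.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical sample input, chosen so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one, so text that straddles a boundary appears in both chunks.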

CharacterTextSplitter

from langchain_text_splitters import CharacterTextSplitter

text = """Machine learning is a fascinating field.\n\nIt involves algorithms and models that can learn from data. These models can then make predictions or decisions without being explicitly programmed to perform the task.\nThis capability is increasingly valuable in various industries, from finance to healthcare.\n\nThere are many types of machine learning, including supervised, unsupervised, and reinforcement learning.\nEach type has its own strengths and applications."""
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=100,
    chunk_overlap=10
)

chunks = text_splitter.split_text(text)

print(chunks)
print([len(chunk) for chunk in chunks])
['Machine learning is a fascinating field.',
 'It involves algorithms and models that can learn from data. These models can...',
 'There are many types of machine learning, including supervised, unsupervised...']

[40, 260, 155]
  • Chunks may lack context
  • Chunks may exceed chunk_size
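
One way to see why a chunk can exceed chunk_size: when the text is cut only at a single separator, a paragraph that contains no separator cannot be broken further. The sketch below is a simplified model of that behavior under this assumption. It ignores chunk_overlap and is not LangChain's actual implementation; split_on_separator and the sample text are hypothetical.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily merge neighbours up to chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
        else:
            if current:
                chunks.append(current)
            current = piece  # may itself be longer than chunk_size
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample: a long middle paragraph with no blank line inside it
sample = (
    "Short intro.\n\n"
    + "A long paragraph with no blank lines inside it. " * 3
    + "\n\nShort outro."
)
chunks = split_on_separator(sample, separator="\n\n", chunk_size=40)
print([len(chunk) for chunk in chunks])
```

The middle paragraph comes back as one chunk well above 40 characters, because there is no "\n\n" inside it to split on.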

RecursiveCharacterTextSplitter

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=100,
    chunk_overlap=10
)

chunks = splitter.split_text(text)

print(chunks)
print([len(chunk) for chunk in chunks])
['Machine learning is a fascinating field.',
 'It involves algorithms and models that can learn from data. These models ...',
 'or decisions without being explicitly programmed to perform the task.',
 'This capability is increasingly valuable in various industries, from ...',
 'There are many types of machine learning, including supervised, ...',
 'learning.',
 'Each type has its own strengths and applications.']
[40, 98, 69, 91, 95, 9, 49]
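
The idea behind trying separators in order can be sketched in plain Python: split on the coarsest separator first, and fall back to the next one only for pieces that are still too long. This is a simplified illustration that omits chunk_overlap and will not reproduce LangChain's output exactly; the function recursive_split and the sample text are assumptions for this example.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split with the first separator; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, finer, chunk_size)
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small neighbours back together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: try the next, finer separator on this piece only
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text and chunk size
sample = (
    "Machine learning is a fascinating field.\n\n"
    "It involves algorithms and models that can learn from data. "
    "These models can then make predictions or decisions."
)
chunks = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=60)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Unlike the single-separator sketch, every chunk produced here stays within chunk_size, because the final "" separator allows a hard character cut when nothing else works.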

Splitting documents

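from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF; PyPDFLoader returns one Document per page
loader = PyPDFLoader("research_paper.pdf")
documents = loader.load()

# Split each page into chunks of up to 1000 characters with a 200-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

chunks = splitter.split_documents(documents)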

print(chunks)

print([len(chunk.page_content) for chunk in chunks])
[Document(metadata={'source': 'Rag Paper.pdf', 'page': 0}, page_content='...'),
 Document(metadata={'source': 'Rag Paper.pdf', 'page': 0}, page_content='...'),
 Document(metadata={'source': 'Rag Paper.pdf', 'page': 0}, page_content='...')]

[928, 946, 921,...]
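
In the output above, every chunk keeps the source and page of the document it came from. A rough sketch of that idea, using a hypothetical Doc dataclass as a stand-in for LangChain's Document and a plain fixed-size cut instead of the real splitter, might look like this. The file name example.pdf is made up.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs: list[Doc], chunk_size: int) -> list[Doc]:
    """Cut each document into fixed-size pieces, copying its metadata to each piece."""
    chunks = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            piece = doc.page_content[start:start + chunk_size]
            chunks.append(Doc(page_content=piece, metadata=dict(doc.metadata)))
    return chunks


# Two made-up "pages" of 25 and 12 characters
pages = [
    Doc("a" * 25, {"source": "example.pdf", "page": 0}),
    Doc("b" * 12, {"source": "example.pdf", "page": 1}),
]
page_chunks = split_docs(pages, chunk_size=10)
print([(c.metadata["page"], len(c.page_content)) for c in page_chunks])
# [(0, 10), (0, 10), (0, 5), (1, 10), (1, 2)]
```

Carrying the metadata along is what later lets a retrieved chunk be traced back to its file and page.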

Embedding and storage

The embedding and storage steps are highlighted.

What are embeddings?

A sentence is sent to an embedding model.

The model converts the text into a vector of numeric values.
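
Once text is a vector, two texts can be compared numerically. The slides do not say which measure is used; cosine similarity is one common choice, sketched here with made-up 3-dimensional vectors (real embedding models return far more dimensions). The variable names and values are purely illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three words
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(round(cosine_similarity(cat, kitten), 3))
print(round(cosine_similarity(cat, invoice), 3))
```

With these toy values the first score is much higher than the second, which is the property retrieval relies on: vectors of related texts point in similar directions.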

Embedding and storing the chunks

  • Embed and store with OpenAI and ChromaDB

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# openai_api_key is assumed to be defined earlier (for example, read from an environment variable)
embedding_model = OpenAIEmbeddings(
    api_key=openai_api_key,
    model="text-embedding-3-small"
)

# Embed every chunk and store the vectors in a Chroma collection
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model
)
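
Conceptually, a vector store keeps each text next to its vector and returns the texts whose vectors are closest to a query vector. The sketch below shows that idea with an in-memory list. It is not how Chroma is implemented, and toy_embed (letter counts) is a deliberately crude, hypothetical stand-in for a real embedding model; ToyVectorStore and its method names are also assumptions for this example.

```python
import math
import string


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: counts of the letters a-z."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in string.ascii_lowercase]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (vector, text) pairs in memory and ranks them by cosine similarity."""

    def __init__(self, embed):
        self.embed = embed
        self.rows: list[tuple[list[float], str]] = []

    def add_texts(self, texts: list[str]) -> None:
        for text in texts:
            self.rows.append((self.embed(text), text))

    def similarity_search(self, query: str, k: int = 1) -> list[str]:
        query_vector = self.embed(query)
        ranked = sorted(
            self.rows,
            key=lambda row: cosine(query_vector, row[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts(["supervised learning", "vector databases", "healthcare finance"])
print(store.similarity_search("supervised learning", k=1))
# ['supervised learning']
```

A real vector store adds persistence and faster indexing, but the stored pair of vector and text and the nearest-neighbor lookup are the core of it.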

Let's practice!
