Text splitting, embeddings, and vector storage

Retrieval Augmented Generation (RAG) with LangChain

Meri Nova

Machine Learning Engineer

Preparing data for retrieval

Documents are loaded.

Documents are split into chunks.

Document chunks are embedded.

Document chunks are stored.

The splitting step is highlighted in the RAG development workflow.

chunk_size

An arrow shows the ideal chunk size sitting in the middle: large chunks are slow to retrieve and hard to interpret, while small chunks lack context.

chunk_overlap

  • Include information from beyond the chunk boundary

Two chunks with the overlapping region highlighted.
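
To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch that slices a string into fixed-size character windows. The helper name `chunk_with_overlap` is hypothetical and this is not how LangChain's splitters work internally (they split on separators first); it only illustrates that consecutive chunks share `chunk_overlap` characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration only.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still partly visible in both chunks.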

CharacterTextSplitter

from langchain_text_splitters import CharacterTextSplitter

text = """Machine learning is a fascinating field.\n\nIt involves algorithms and models that can learn from data. These models can then make predictions or decisions without being explicitly programmed to perform the task.\nThis capability is increasingly valuable in various industries, from finance to healthcare.\n\nThere are many types of machine learning, including supervised, unsupervised, and reinforcement learning.\nEach type has its own strengths and applications."""
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=100,
    chunk_overlap=10
)

chunks = text_splitter.split_text(text)

print(chunks)
print([len(chunk) for chunk in chunks])
['Machine learning is a fascinating field.',
 'It involves algorithms and models that can learn from data. These models can...',
 'There are many types of machine learning, including supervised, unsupervised...']

[40, 260, 155]
  • Chunks can lack context
  • Chunks can be larger than chunk_size when the separator does not occur often enough
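
The oversized second chunk above comes from splitting on a single separator only. The following is a simplified sketch of that behavior, not LangChain's actual implementation: the function name `split_on_separator` is hypothetical, and chunk_overlap is ignored for brevity. Pieces are merged greedily up to chunk_size, but a piece that contains no separator is never cut.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily merge pieces up to chunk_size.

    Simplified, hypothetical sketch: a piece longer than chunk_size
    is kept whole because there is nowhere else to split it.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may already exceed chunk_size
    if current:
        chunks.append(current)
    return chunks


sample = "Short paragraph.\n\n" + "x" * 150 + "\n\nAnother short one."
chunks = split_on_separator(sample, separator="\n\n", chunk_size=100)
lengths = [len(c) for c in chunks]
print(lengths)
# [16, 150, 18]
```

The middle paragraph has no "\n\n" inside it, so it stays at 150 characters even though chunk_size is 100.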

RecursiveCharacterTextSplitter

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=100,
    chunk_overlap=10
)

chunks = splitter.split_text(text)

print(chunks)
print([len(chunk) for chunk in chunks])
['Machine learning is a fascinating field.',
 'It involves algorithms and models that can learn from data. These models ...',
 'or decisions without being explicitly programmed to perform the task.',
 'This capability is increasingly valuable in various industries, from ...',
 'There are many types of machine learning, including supervised, ...',
 'learning.',
 'Each type has its own strengths and applications.']
[40, 98, 69, 91, 95, 9, 49]
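
Every chunk now fits within chunk_size because the splitter falls back to the next separator whenever a piece is still too long. A rough sketch of that idea is shown below. It is an assumption-laden simplification, not LangChain's code: the function name `recursive_split` is hypothetical, chunk_overlap is ignored, and the merging rules are cruder than the library's.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Try separators in order; recurse on any piece that is still too long.

    Hypothetical, simplified sketch that ignores chunk_overlap.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at character level.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph.\n\n"
    "Second paragraph has one long line that keeps going past the limit.\n"
    "Short line."
)
chunks = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=40)
print(chunks)
print([len(c) for c in chunks])
```

In this sketch the long line is first isolated by "\n\n" and "\n", then broken at spaces, so no chunk exceeds 40 characters.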

Splitting documents

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("research_paper.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

chunks = splitter.split_documents(documents)

print(chunks)

print([len(chunk.page_content) for chunk in chunks])
[Document(metadata={'source': 'Rag Paper.pdf', 'page': 0}, page_content='...'),
 Document(metadata={'source': 'Rag Paper.pdf', 'page': 0}, page_content='...'),
 Document(metadata={'source': 'Rag Paper.pdf', 'page': 0}, page_content='...')]

[928, 946, 921,...]
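
Note that every chunk in the output keeps the source and page metadata of the page it came from. The sketch below mimics that with a hypothetical `Doc` dataclass standing in for LangChain's Document and a naive fixed-size split. The file name and sizes are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs: list[Doc], chunk_size: int) -> list[Doc]:
    """Cut each document into fixed-size pieces, copying its metadata to every piece."""
    chunks = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            piece = doc.page_content[start:start + chunk_size]
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks


pages = [
    Doc("a" * 25, {"source": "example.pdf", "page": 0}),
    Doc("b" * 12, {"source": "example.pdf", "page": 1}),
]
chunks = split_docs(pages, chunk_size=10)
summary = [(c.metadata["page"], len(c.page_content)) for c in chunks]
print(summary)
# [(0, 10), (0, 10), (0, 5), (1, 10), (1, 2)]
```

Keeping the metadata on each chunk is what later lets a retrieved passage be traced back to its page.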

Embedding and storage

The embedding and storage steps are highlighted.

What are embeddings?

A sentence is fed into an embedding model.

The embedding model maps text to a vector of numeric values.
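
Once text is a vector, similarity between two texts can be measured numerically, commonly with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely as an assumption for illustration. Real embedding models return far longer vectors, and these numbers do not come from any model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, NOT real model output.
toy_vectors = {
    "I love machine learning": [0.9, 0.1, 0.3],
    "Machine learning is great": [0.8, 0.2, 0.4],
    "The weather is rainy today": [0.1, 0.9, 0.2],
}

sim_related = cosine_similarity(
    toy_vectors["I love machine learning"],
    toy_vectors["Machine learning is great"],
)
sim_unrelated = cosine_similarity(
    toy_vectors["I love machine learning"],
    toy_vectors["The weather is rainy today"],
)
print(round(sim_related, 3), round(sim_unrelated, 3))
```

With these toy values the two machine learning sentences score much closer to 1.0 than the unrelated pair, which is the property retrieval relies on.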

Embedding and storing chunks

  • Embed and store with OpenAI and ChromaDB
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# openai_api_key is assumed to be defined earlier, e.g. read from an environment variable
embedding_model = OpenAIEmbeddings(
    api_key=openai_api_key,
    model="text-embedding-3-small"
)

vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model
)
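
Conceptually, a vector store keeps each chunk next to its embedding and returns the chunks whose vectors are closest to a query vector. The sketch below is a toy in-memory version to show the idea. It is not Chroma's implementation: `ToyVectorStore` and `toy_embed` are hypothetical names, and word counts stand in for a real embedding model.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and ranks them by similarity."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, Counter]] = []

    def add_texts(self, texts: list[str]) -> None:
        for text in texts:
            self._rows.append((text, toy_embed(text)))

    def similarity_search(self, query: str, k: int = 1) -> list[str]:
        query_vector = toy_embed(query)
        ranked = sorted(
            self._rows,
            key=lambda row: cosine(query_vector, row[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "supervised learning uses labeled data",
    "chroma stores embedding vectors",
    "finance and healthcare use machine learning",
])
results = store.similarity_search("where are embedding vectors stored", k=1)
print(results)
# ['chroma stores embedding vectors']
```

A production store such as Chroma adds persistence and indexing on top of this basic store-then-rank idea.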

Let's practice!