Codebestanden laden en splitsen

Retrieval Augmented Generation (RAG) met LangChain

Meri Nova

Machine Learning Engineer

Meer documentloaders...

Een reeks verschillende bestandsindelingen.

Retrieval Augmented Generation (RAG) met LangChain

De ruwe markdown voor het README-bestand in de GitHub-repo van LangChain.

Retrieval Augmented Generation (RAG) met LangChain

Weergegeven markdown uit het README.md-bestand op LangChains GitHub.

Retrieval Augmented Generation (RAG) met LangChain

Markdown-bestanden (.md) laden

from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("README.md")
markdown_content = loader.load() print(markdown_content[0])
Document(page_content='# Discord Text Classification ![Python Version](https...'
         metadata={'source': 'README.md'})
Retrieval Augmented Generation (RAG) met LangChain

Python-bestanden (.py) laden

from abc import ABC, abstractmethod

class LLM(ABC):
  @abstractmethod
  def complete_sentence(self, prompt):
    pass

...
  • Ingezet in RAG-apps voor code schrijven/fixen, docs maken, enz.
  • Imports, classes, functies, enz.
from langchain_community.document_loaders \
    import PythonLoader

loader = PythonLoader('chatbot.py')

python_data = loader.load() print(python_data[0])
Document(page_content='from abc import ABC, ...

class LLM(ABC):
  @abstractmethod
...',
metadata={'source': 'chatbot.py'})
Retrieval Augmented Generation (RAG) met LangChain

Codebestanden splitsen

python_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150, chunk_overlap=10
)

chunks = python_splitter.split_documents(python_data) for i, chunk in enumerate(chunks[:3]): print(f"Chunk {i+1}:\n{chunk.page_content}\n")
Retrieval Augmented Generation (RAG) met LangChain
Chunk 1:
from abc import ABC, abstractmethod

class LLM(ABC):
  @abstractmethod
  def complete_sentence(self, prompt):
    pass

Chunk 2:
class OpenAI(LLM):
  def complete_sentence(self, prompt):
    return prompt + " ... OpenAI end of sentence."

class Anthropic(LLM):

Chunk 3:
def complete_sentence(self, prompt):
    return prompt + " ... Anthropic end of sentence."

Retrieval Augmented Generation (RAG) met LangChain

Splitsen per taal

  • separators
    • ["\n\n", "\n", " ", ""]
    • ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", " ", ""]
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

python_splitter = RecursiveCharacterTextSplitter.from_language(

language=Language.PYTHON, chunk_size=150, chunk_overlap=10
)
chunks = python_splitter.split_documents(data)
for i, chunk in enumerate(chunks[:3]): print(f"Chunk {i+1}:\n{chunk.page_content}\n")
Retrieval Augmented Generation (RAG) met LangChain
Chunk 1:
from abc import ABC, abstractmethod

Chunk 2:
class LLM(ABC):
  @abstractmethod
  def complete_sentence(self, prompt):
    pass

Chunk 3:
class OpenAI(LLM):
  def complete_sentence(self, prompt):
Retrieval Augmented Generation (RAG) met LangChain

Laten we oefenen!

Retrieval Augmented Generation (RAG) met LangChain

Preparing Video For Download...