Loading and splitting code files

Retrieval Augmented Generation (RAG) with LangChain

Meri Nova

Machine Learning Engineer

More document loaders...

A selection of different file formats.


The raw Markdown used to create the README file in the LangChain GitHub repository.


Rendered Markdown of the README.md file from the LangChain GitHub repository.


Loading Markdown files (.md)

from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("README.md")
markdown_content = loader.load()
print(markdown_content[0])

Document(page_content='# Discord Text Classification ![Python Version](https...',
         metadata={'source': 'README.md'})

Loading Python files (.py)

from abc import ABC, abstractmethod

class LLM(ABC):
  @abstractmethod
  def complete_sentence(self, prompt):
    pass

...
  • Integrated into RAG apps to write or fix code, create docs, etc.
  • Imports, classes, functions, etc.
from langchain_community.document_loaders \
    import PythonLoader

loader = PythonLoader('chatbot.py')

python_data = loader.load()
print(python_data[0])

Document(page_content='from abc import ABC, ...

class LLM(ABC):
  @abstractmethod
...',
metadata={'source': 'chatbot.py'})
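
To make the shape of the loader output concrete, here is a minimal sketch of what loading a source file amounts to: read the text and wrap it in an object carrying `page_content` and a `source` entry in `metadata`. `SimpleDocument` and `load_text_file` are hypothetical names used only for illustration; this is not LangChain's implementation, and the temporary `chatbot.py` is made up for the example.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document (not LangChain's class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path):
    """Read a file as UTF-8 text and return it as a one-element list."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Write a small, made-up Python file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "chatbot.py"
    file_path.write_text("from abc import ABC, abstractmethod\n", encoding="utf-8")
    docs = load_text_file(file_path)

print(docs[0].metadata["source"])
print(docs[0].page_content)
```

The whole file ends up in a single document, which is why a splitting step is needed before the code can be indexed in chunks.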

Splitting code files

from langchain_text_splitters import RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150, chunk_overlap=10
)

chunks = python_splitter.split_documents(python_data)
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")
Chunk 1:
from abc import ABC, abstractmethod

class LLM(ABC):
  @abstractmethod
  def complete_sentence(self, prompt):
    pass

Chunk 2:
class OpenAI(LLM):
  def complete_sentence(self, prompt):
    return prompt + " ... OpenAI end of sentence."

class Anthropic(LLM):

Chunk 3:
def complete_sentence(self, prompt):
    return prompt + " ... Anthropic end of sentence."
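
One rough way to see the problem with the chunks above is to check whether each chunk still parses as Python on its own. The sketch below uses the standard library's `ast.parse` for that. Treating "parses on its own" as a quality signal is an assumption made for this illustration, not something the splitter itself checks, and `is_valid_python` is a hypothetical helper name.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the text parses as Python source on its own."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# Chunk 1 and Chunk 2 as printed above.
chunk_1 = '''from abc import ABC, abstractmethod

class LLM(ABC):
  @abstractmethod
  def complete_sentence(self, prompt):
    pass'''

chunk_2 = '''class OpenAI(LLM):
  def complete_sentence(self, prompt):
    return prompt + " ... OpenAI end of sentence."

class Anthropic(LLM):'''

for name, chunk in [("Chunk 1", chunk_1), ("Chunk 2", chunk_2)]:
    print(name, "parses on its own:", is_valid_python(chunk))
```

Chunk 2 fails because it ends with the header of `class Anthropic(LLM):` while the class body has been pushed into Chunk 3, so the method there has lost its class.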


Splitting by language

  • separators
    • Default: ["\n\n", "\n", " ", ""]
    • Python: ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=150, chunk_overlap=10
)
chunks = python_splitter.split_documents(python_data)
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")

Chunk 1:
from abc import ABC, abstractmethod

Chunk 2:
class LLM(ABC):
  @abstractmethod
  def complete_sentence(self, prompt):
    pass

Chunk 3:
class OpenAI(LLM):
  def complete_sentence(self, prompt):
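
To show why the order of separators matters, here is a simplified sketch of recursive splitting: try the first separator that occurs in the text, pack the resulting pieces greedily up to `chunk_size`, and fall back to finer separators only for pieces that are still too long. It is an illustration under stated assumptions, not LangChain's implementation: it ignores `chunk_overlap`, the function name `split_recursively` is hypothetical, and the sample text (two classes with no blank line between them) is made up so that the difference is visible. Its output will not necessarily match the chunks printed above.

```python
def split_recursively(text, separators, chunk_size):
    """Simplified recursive splitter: greedy packing, no overlap."""
    stripped = text.strip()
    if len(text) <= chunk_size:
        return [stripped] if stripped else []

    # Use the first separator that occurs in the text ("" always matches).
    for index, separator in enumerate(separators):
        if separator == "" or separator in text:
            break
    else:
        return [stripped] if stripped else []  # nothing left to split on

    if separator == "":
        # Last resort: cut into fixed-size character windows.
        pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        return [p.strip() for p in pieces if p.strip()]

    # Keep each separator attached to the start of the piece that follows it.
    first, *rest = text.split(separator)
    pieces = [first] + [separator + part for part in rest]

    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Flush the buffer, then retry the long piece with finer separators.
            if current.strip():
                chunks.append(current.strip())
            current = ""
            chunks.extend(
                split_recursively(piece, separators[index + 1:], chunk_size)
            )
        elif len(current) + len(piece) <= chunk_size:
            current += piece
        else:
            if current.strip():
                chunks.append(current.strip())
            current = piece
    if current.strip():
        chunks.append(current.strip())
    return chunks


# Made-up sample: two classes with no blank line between them.
SAMPLE = (
    "class OpenAI(LLM):\n"
    "  def complete_sentence(self, prompt):\n"
    "    return prompt + ' ... OpenAI end of sentence.'\n"
    "class Anthropic(LLM):\n"
    "  def complete_sentence(self, prompt):\n"
    "    return prompt + ' ... Anthropic end of sentence.'\n"
)

GENERIC_SEPARATORS = ["\n\n", "\n", " ", ""]
# Python-oriented list: class and def boundaries first, then the generic ones.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]

generic_chunks = split_recursively(SAMPLE, GENERIC_SEPARATORS, chunk_size=150)
python_chunks = split_recursively(SAMPLE, PYTHON_SEPARATORS, chunk_size=150)

for label, result in [("generic", generic_chunks), ("python", python_chunks)]:
    for i, chunk in enumerate(result):
        print(f"[{label}] Chunk {i + 1}:\n{chunk}\n")
```

With the generic list, the text is cut at line breaks wherever the size limit is reached, so the `class Anthropic(LLM):` header lands at the end of the first chunk. With the Python-oriented list, the cut falls on the class boundary and each class stays in one chunk.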

Let's practice!
