Loading Documents for RAG with LangChain

Retrieval Augmented Generation (RAG) with LangChain

Meri Nova

Machine Learning Engineer

Meet your instructor...

 

Meri Nova

 

  • Founder at Break Into Data
  • Machine Learning Engineer
  • Content Creator on LinkedIn and YouTube

Retrieval Augmented Generation (RAG)

 

  • LLM limitation: knowledge is constrained to the training data

 

RAG: Integrating external data with LLMs

A person handing an LLM more information in the form of books.

1 Generated with DALL·E 3

The standard RAG workflow

  • A user submits a query.
  • The query is sent to a vector database.
  • Relevant documents are retrieved from the vector database.
  • The retrieved documents are added to the model prompt.
  • The prompt is sent to the LLM and the output is returned to the user.
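
The workflow above can be sketched in plain Python. This is a toy illustration, not LangChain code: the in-memory list stands in for a vector database, word overlap stands in for embedding similarity, and `fake_llm` is a hypothetical placeholder for a real model call.

```python
# Toy stand-in for a vector database: a list of short text chunks.
KNOWLEDGE_BASE = [
    "LangChain provides document loaders for CSV, PDF, and HTML files.",
    "RAG adds retrieved documents to the prompt before calling the LLM.",
    "Vector databases store embedded document chunks for retrieval.",
]


def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k chunks sharing the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    return sorted(KNOWLEDGE_BASE, key=overlap, reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Add the retrieved chunks to the model prompt."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def fake_llm(prompt: str) -> str:
    """Hypothetical placeholder: a real system would call an LLM here."""
    return f"(model answer based on a {len(prompt)}-character prompt)"


query = "Which document loaders does LangChain provide?"
retrieved = retrieve(query)               # retrieval from the "database"
prompt = build_prompt(query, retrieved)   # augmentation of the prompt
answer = fake_llm(prompt)                 # generation
print(prompt)
print(answer)
```

In a real pipeline, retrieval would compare embedding vectors rather than raw words, but the three stages stay the same.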

Preparing data for retrieval

  • Documents are loaded.
  • Documents are split into chunks.
  • Document chunks are embedded.
  • Document chunks are stored.
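
The four preparation steps can be sketched end to end in plain Python. Everything here is an assumption made for illustration: the fixed-size character splitter, the vowel-count "embedding", the list used as a store, and the file name `notes.txt`. A real pipeline would use a proper text splitter, an embedding model, and a vector database.

```python
def load(raw_text: str, source: str) -> dict:
    """Load: wrap raw text with metadata about where it came from."""
    return {"page_content": raw_text, "metadata": {"source": source}}


def split(document: dict, chunk_size: int = 40) -> list[dict]:
    """Split: cut the text into fixed-size character chunks."""
    text = document["page_content"]
    return [
        {"page_content": text[i:i + chunk_size], "metadata": document["metadata"]}
        for i in range(0, len(text), chunk_size)
    ]


def embed(text: str) -> list[int]:
    """Embed: toy vector of vowel counts (a real system uses an embedding model)."""
    return [text.lower().count(vowel) for vowel in "aeiou"]


# Store: a plain list stands in for a vector database.
vector_store = []

# "notes.txt" is a hypothetical source name.
document = load("RAG integrates external data with LLMs. " * 3, source="notes.txt")
for chunk in split(document):
    vector_store.append((embed(chunk["page_content"]), chunk))

print(len(vector_store), "chunks stored")
```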

Document loaders

 

  • Integrate documents with AI systems
  • Support for many common file formats
  • Third-party document loaders

 

  • CSVLoader
  • PyPDFLoader
  • UnstructuredHTMLLoader

Loading CSV Files

from langchain_community.document_loaders.csv_loader import CSVLoader

csv_loader = CSVLoader(file_path='path/to/your/file.csv')

documents = csv_loader.load()
print(documents)
[Document(page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98',
          metadata={'source': 'path/to/your/file.csv', 'row': 0}),
 Document(page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97',
          metadata={'source': 'path/to/your/file.csv', 'row': 1}),
 Document(page_content='Team: Yankees\n"Payroll (millions)": 197.96\n"Wins": 95',
          metadata={'source': 'path/to/your/file.csv', 'row': 2})]
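
To see how a CSV row turns into `page_content` and `metadata`, here is a standard-library approximation of the output format shown above. It is not CSVLoader's actual implementation; the function name `rows_to_documents`, the file name `teams.csv`, and the in-memory sample are hypothetical.

```python
import csv
import io

# In-memory sample mirroring the rows shown above (hypothetical file name).
SOURCE = "teams.csv"
CSV_TEXT = "Team,Payroll (millions),Wins\nNationals,81.34,98\nReds,82.20,97\n"


def rows_to_documents(csv_text: str, source: str) -> list[dict]:
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # One "column: value" line per field, joined with newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


documents = rows_to_documents(CSV_TEXT, SOURCE)
print(documents[0]["page_content"])
```

Each row becomes its own document, so a retriever can later return a single team's record instead of the whole file.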

Loading PDF Files

from langchain_community.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader('rag_paper.pdf')
documents = pdf_loader.load()
print(documents)
[Document(page_content='Retrieval-Augmented Generation for\nKnowledge-Intensive...',
          metadata={'source': 'rag_paper.pdf', 'page': 0})]
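
The output above suggests one Document per page, with a zero-based page index in the metadata. A minimal sketch of working with that shape follows; the `Document` dataclass is a stand-in for LangChain's class, and the sample pages and the `get_page` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for LangChain's Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Hypothetical loader output: one Document per page, zero-indexed.
documents = [
    Document("Title and abstract text", {"source": "rag_paper.pdf", "page": 0}),
    Document("Introduction text", {"source": "rag_paper.pdf", "page": 1}),
]


def get_page(docs: list[Document], page: int) -> Document:
    """Return the Document whose metadata matches the page index."""
    return next(doc for doc in docs if doc.metadata["page"] == page)


first_page = get_page(documents, 0)
print(first_page.page_content)
print(first_page.metadata)
```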

Loading HTML Files

from langchain_community.document_loaders import UnstructuredHTMLLoader

html_loader = UnstructuredHTMLLoader(file_path='path/to/your/file.html')

documents = html_loader.load()
first_document = documents[0]
print("Content:", first_document.page_content)
print("Metadata:", first_document.metadata)
Content: Welcome to Our Website
Metadata: {'source': 'path/to/your/file.html', 'section': 0}
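
For intuition about what an HTML loader has to do, the sketch below extracts visible text with Python's built-in `html.parser`. This is a swapped-in technique for illustration, not how UnstructuredHTMLLoader works internally; the `TextExtractor` class, the sample markup, and the `file.html` source name are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


# Hypothetical page content for illustration.
HTML = (
    "<html><head><style>h1 {color: red;}</style></head>"
    "<body><h1>Welcome to Our Website</h1></body></html>"
)

extractor = TextExtractor()
extractor.feed(HTML)
extractor.close()

document = {
    "page_content": "\n".join(extractor.parts),
    "metadata": {"source": "file.html"},
}
print("Content:", document["page_content"])
```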

Let's practice!
