Loading Documents for RAG with LangChain

Retrieval Augmented Generation (RAG) with LangChain

Meri Nova

Machine Learning Engineer

Meet your instructor...

Meri Nova

  • Founder at Break Into Data
  • Machine Learning Engineer
  • Content Creator on LinkedIn and YouTube

Photo of Meri.

Retrieval Augmented Generation (RAG)

  • LLM limitation: knowledge is constrained to the training data

RAG: Integrating external data with LLMs

A person handing an LLM more information in the form of books.

1 Generated with DALL·E 3

The standard RAG workflow

A single user.

A user query being sent to a vector database.

Relevant documents being retrieved from the vector database.

The retrieved documents are added to the model prompt.

The prompt is sent to the LLM and the output is returned to the user.
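
The workflow above (query, retrieve, add to prompt, send to the LLM) can be sketched in plain Python. This is only a toy illustration under stated assumptions: word overlap stands in for vector similarity, the `DOCS` list stands in for a vector database, and `fake_llm` is a hypothetical placeholder for a real model call. None of these names come from LangChain.

```python
import re

# Toy corpus standing in for a vector database (hypothetical data)
DOCS = [
    "LangChain document loaders read files into Document objects.",
    "Vector databases store embeddings for similarity search.",
    "The Yankees play baseball in New York.",
]

def tokenize(text):
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, docs, k=2):
    """Rank docs by word overlap with the query (a stand-in for vector similarity)."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    """Add the retrieved documents to the model prompt."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def fake_llm(prompt):
    """Hypothetical placeholder for a real LLM call."""
    return f"(model response to a {len(prompt)}-character prompt)"

query = "How do vector databases store embeddings?"
retrieved = retrieve(query, DOCS, k=1)   # retrieval step
prompt = build_prompt(query, retrieved)  # augmentation step
answer = fake_llm(prompt)                # generation step
print(prompt)
```

In a real pipeline, `retrieve` would query a vector store and `fake_llm` would be replaced by a chat model; the three-step shape stays the same.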

Preparing data for retrieval

Documents being loaded.

Documents being split.

Document chunks are embedded.

Document chunks are stored.
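
The four preparation steps (load, split, embed, store) can be sketched as follows. This is a minimal illustration, not how LangChain implements them: `split_text` is a hypothetical fixed-size character splitter, `embed` uses word counts where a real system would call an embedding model, and a plain list stands in for the vector database.

```python
from collections import Counter

def split_text(text, chunk_size=40, overlap=10):
    """Split text into fixed-size character chunks that overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(chunk):
    """Toy embedding: word counts (a real system would call an embedding model)."""
    return Counter(chunk.lower().split())

# 1. Load (a hard-coded string stands in for a loaded document)
text = "Retrieval Augmented Generation integrates external data with LLMs. " * 2

# 2. Split
chunks = split_text(text)

# 3. Embed and 4. store (a list stands in for a vector database)
store = [{"chunk": c, "embedding": embed(c)} for c in chunks]
print(len(store), "chunks stored")
```

The overlap between neighbouring chunks is a common design choice: it reduces the chance that a sentence cut at a chunk boundary loses its context.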

Document loaders

  • Integrate documents with AI systems
  • Support for many common file formats
  • Third-party document loaders

Loaders used in this lesson:

  • CSVLoader
  • PyPDFLoader
  • UnstructuredHTMLLoader

Documents being loaded.

Loading CSV Files

from langchain_community.document_loaders.csv_loader import CSVLoader

# Each row of the CSV file becomes one Document
csv_loader = CSVLoader(file_path='path/to/your/file.csv')

documents = csv_loader.load()
print(documents)
[Document(page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98',
          metadata={'source': 'path/to/your/file.csv', 'row': 0}),
 Document(page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97',
          metadata={'source': 'path/to/your/file.csv', 'row': 1}),
 Document(page_content='Team: Yankees\n"Payroll (millions)": 197.96\n"Wins": 95',
          metadata={'source': 'path/to/your/file.csv', 'row': 2})]
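
To build intuition for the row-per-document pattern in the output above, here is a rough standard-library analogy. It is not CSVLoader's implementation: `load_csv_rows` and the `inline.csv` source name are hypothetical, plain dicts stand in for Document objects, and the exact `page_content` formatting may differ from what the library produces.

```python
import csv
import io

# Inline CSV text reusing the rows from the output above
csv_text = """Team,Payroll (millions),Wins
Nationals,81.34,98
Reds,82.20,97
Yankees,197.96,95
"""

def load_csv_rows(file_obj, source):
    """Turn each CSV row into a dict with page_content and metadata."""
    docs = []
    for i, row in enumerate(csv.DictReader(file_obj)):
        # Join "column: value" pairs, one per line
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"page_content": content,
                     "metadata": {"source": source, "row": i}})
    return docs

docs = load_csv_rows(io.StringIO(csv_text), source="inline.csv")
for doc in docs:
    print(doc["metadata"], repr(doc["page_content"]))
```

Keeping the source and row index in the metadata makes it possible to trace a retrieved chunk back to the record it came from.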

Loading PDF Files

from langchain_community.document_loaders import PyPDFLoader

# Requires the pypdf package; returns one Document per PDF page
pdf_loader = PyPDFLoader('rag_paper.pdf')
documents = pdf_loader.load()
print(documents)
[Document(page_content='Retrieval-Augmented Generation for\nKnowledge-Intensive...',
          metadata={'source': 'rag_paper.pdf', 'page': 0})]

Loading HTML Files

from langchain_community.document_loaders import UnstructuredHTMLLoader

# Requires the unstructured package
html_loader = UnstructuredHTMLLoader(file_path='path/to/your/file.html')

documents = html_loader.load()
first_document = documents[0]

# Inspect the extracted text and its metadata
print("Content:", first_document.page_content)
print("Metadata:", first_document.metadata)
Content: Welcome to Our Website
Metadata: {'source': 'path/to/your/file.html', 'section': 0}
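
As a rough picture of what extracting text from HTML involves, here is a standard-library sketch. It is an assumption-laden analogy, not how UnstructuredHTMLLoader works internally: `TextExtractor` and the `inline.html` source name are hypothetical, and a plain dict stands in for a Document.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style blocks
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

html = ("<html><head><style>h1 {color: red}</style></head>"
        "<body><h1>Welcome to Our Website</h1></body></html>")

parser = TextExtractor()
parser.feed(html)
parser.close()

document = {"page_content": "\n".join(parser.parts),
            "metadata": {"source": "inline.html"}}
print("Content:", document["page_content"])
```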

Let's practice!
