Integrating document loaders

Developing LLM Applications with LangChain

Jonathan Bennion

AI Engineer & LangChain Contributor

Retrieval Augmented Generation (RAG)

  • Use embeddings to retrieve relevant information to integrate into the prompt

A typical RAG workflow.
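
The retrieval idea above can be sketched without any external service. This is a toy illustration only: the `embed` function is a bag-of-words stand-in for a real embedding model, and the sample `documents`, `retrieve`, and `build_prompt` names are hypothetical, not LangChain APIs.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up sample texts, standing in for stored document chunks.
documents = [
    "The transformer architecture relies on attention mechanisms.",
    "CSV files store tabular data as comma separated rows.",
    "HTML pages are rendered by web browsers.",
]


def retrieve(question, docs, k=1):
    """Return the k documents most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(query_vector, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, docs):
    """Integrate the retrieved text into the prompt sent to the LLM."""
    context = "\n".join(retrieve(question, docs))
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"


question = "What does the transformer architecture rely on?"
prompt = build_prompt(question, documents)
print(prompt)
```

A real application would swap `embed` for a learned embedding model and keep the vectors in a vector store; the ranking-then-prompting shape stays the same.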


RAG development steps

The general RAG workflow: a document loader, a document splitter, and the storage and retrieval process.
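
As a rough sketch of the splitting and storage steps, assume the loader has already produced a plain string. The `split_text` and `find_chunks` helpers, the chunk sizes, and the dictionary used as a store are illustrative assumptions; a real pipeline would typically store embeddings rather than raw text.

```python
def split_text(text, chunk_size=40, overlap=10):
    """Split text into fixed-size character chunks that overlap."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Text as it might come out of a document loader (made-up sample).
loaded_text = (
    "Document loaders read raw files. Splitters cut the text into chunks. "
    "A store keeps the chunks so they can be retrieved later."
)

chunks = split_text(loaded_text)

# Hypothetical in-memory store: chunk index -> chunk text.
store = {index: chunk for index, chunk in enumerate(chunks)}


def find_chunks(keyword):
    """Naive retrieval: return stored chunks that contain the keyword."""
    return [chunk for chunk in store.values() if keyword.lower() in chunk.lower()]


print(len(chunks), "chunks stored")
print(find_chunks("loaders"))
```

The overlap means text near a chunk boundary appears in both neighbouring chunks, which reduces the chance that a relevant sentence is cut in half.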


LangChain document loaders

  • Classes that load data from a source and return it as documents an application can work with
  • Document loaders for common file types: .pdf, .csv
  • Third-party loaders: S3, .ipynb, .wav
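
To make the loader pattern concrete, here is a minimal sketch of the shape these classes share: a constructor that takes a source and a `load()` method that returns a list of documents. The `Document` dataclass below is a simplified stand-in for LangChain's own class, and `PlainTextLoader` is hypothetical, not part of LangChain.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Simplified stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class PlainTextLoader:
    """Hypothetical loader that reads one text file into one Document."""

    def __init__(self, file_path):
        self.file_path = str(file_path)

    def load(self):
        text = Path(self.file_path).read_text(encoding="utf-8")
        return [Document(page_content=text, metadata={"source": self.file_path})]


# Write a small sample file so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "example.txt"
    sample_path.write_text("Loaders turn files into documents.", encoding="utf-8")
    loader = PlainTextLoader(sample_path)
    data = loader.load()

print(data[0].page_content)
print(data[0].metadata)
```

The same two-step usage (`loader = ...`, then `data = loader.load()`) is what the loaders on the following slides follow.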


1 https://python.langchain.com/docs/integrations/document_loaders

PDF document loader

  • Requires installation of the pypdf package: pip install pypdf
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("path/to/file/attention_is_all_you_need.pdf")
data = loader.load()
print(data[0])
Document(page_content='Provided proper attribution is provided, Google hereby grants 
permission to\nreproduce the tables and figures in this paper solely for use in [...]

CSV document loader

from langchain_community.document_loaders.csv_loader import CSVLoader


loader = CSVLoader('fifa_countries_audience.csv')
data = loader.load()
print(data[0])
Document(page_content='country: United States\nconfederation: CONCACAF\npopulation_share: [...]
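
The output above suggests each CSV row becomes one document whose text is a series of `column: value` lines. The standard-library sketch below approximates that formatting; it is not CSVLoader's actual implementation, the `rows_to_documents` helper is hypothetical, and the two-row sample is made up for illustration rather than taken from the FIFA file.

```python
import csv
import io

# Made-up two-row sample; not the contents of fifa_countries_audience.csv.
csv_text = "country,confederation\nUnited States,CONCACAF\nJapan,AFC\n"


def rows_to_documents(csv_file, source):
    """Turn each CSV row into a dict with text content and metadata."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(csv_file)):
        # One "column: value" line per field, joined with newlines.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


docs = rows_to_documents(io.StringIO(csv_text), source="sample.csv")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

Keeping the column names inside the text gives a language model the context it needs to interpret each value.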

HTML document loader

  • Requires installation of the unstructured package: pip install unstructured
from langchain_community.document_loaders import UnstructuredHTMLLoader


loader = UnstructuredHTMLLoader("white_house_executive_order_nov_2023.html")
data = loader.load()
print(data[0])
print(data[0].metadata)
page_content="To search this site, enter a search term\n\nSearch\n\nExecutive Order on the Safe, Secure,
and Trustworthy Development and Use of Artificial Intelligence\n\nHome\n\nBriefing Room\n\nPresidential
Actions\n\nBy the authority vested in me as President by the Constitution and the laws of the United
States of America, it is hereby ordered as follows: ..."

{'source': 'white_house_executive_order_nov_2023.html'}
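
To show roughly what "HTML in, text plus metadata out" involves, here is a standard-library sketch. It is only an approximation of the idea and not how the unstructured package works; the `TextExtractor` class, the sample markup, and the `example.html` source name are all hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


# Made-up markup standing in for a downloaded page.
html_text = (
    "<html><head><title>Example page</title><style>p {color: red}</style></head>"
    "<body><h1>Home</h1><p>Briefing Room</p></body></html>"
)

parser = TextExtractor()
parser.feed(html_text)
parser.close()

# Same two-part shape as the loader output: text content plus source metadata.
document = {
    "page_content": "\n\n".join(parser.parts),
    "metadata": {"source": "example.html"},
}
print(document["page_content"])
print(document["metadata"])
```

As in the loader output above, navigation text such as "Home" ends up in the content alongside the body text, so some cleaning is often needed before splitting.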

Let's practice!
