Creating vector databases with ChromaDB

Introduction to Embeddings with the OpenAI API

Emmanuel Pire

Senior Software Engineer, DataCamp

Installing ChromaDB

  • ChromaDB is a simple yet powerful vector database
  • Two flavors:
    • Local: great for development and prototyping
    • Client/Server: made for production

[Figure: ChromaDB in local mode alongside ChromaDB in client/server mode. In client/server mode, the client and the server run as separate instances.]


Connecting to the database

import chromadb

client = chromadb.PersistentClient(path="/path/to/save/to")
  • Data will be persisted to disk

Creating a collection

  • Collections are analogous to tables
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

collection = client.create_collection(
    name="my_collection",
    embedding_function=OpenAIEmbeddingFunction(
        model_name="text-embedding-3-small",
        api_key="..."
    )
)
  • Collections are able to create embeddings automatically
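
To make the idea of automatic embedding concrete, here is a minimal conceptual sketch in plain Python. It is not ChromaDB's implementation: ToyCollection and toy_embedding_function are hypothetical names invented for illustration, and the "embedding" is just two numbers derived from the text so the example runs without an API key. It only shows the pattern of attaching an embedding function to a collection and letting add() call it.

```python
from typing import Callable, Dict, List


def toy_embedding_function(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for a real embedding model.

    Maps each text to [character count, word count] so no API call is needed.
    """
    return [[float(len(t)), float(len(t.split()))] for t in texts]


class ToyCollection:
    """Illustrative only; this is NOT how ChromaDB is implemented."""

    def __init__(
        self,
        name: str,
        embedding_function: Callable[[List[str]], List[List[float]]],
    ) -> None:
        self.name = name
        self._embed = embedding_function
        self._documents: Dict[str, str] = {}
        self._embeddings: Dict[str, List[float]] = {}

    def add(self, ids: List[str], documents: List[str]) -> None:
        if len(ids) != len(documents):
            raise ValueError("ids and documents must have the same length")
        # The collection, not the caller, turns documents into vectors.
        vectors = self._embed(documents)
        for doc_id, doc, vec in zip(ids, documents, vectors):
            self._documents[doc_id] = doc
            self._embeddings[doc_id] = vec

    def count(self) -> int:
        return len(self._documents)


collection_sketch = ToyCollection("my_collection", toy_embedding_function)
collection_sketch.add(ids=["my-doc"], documents=["This is the source text"])
print(collection_sketch.count())
```

The caller passes only IDs and raw text; the vectors are produced inside add(), which mirrors the division of labor described on this slide.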

Inspecting collections

client.list_collections()
[Collection(name=my_collection)]

Inserting embeddings

Single document

collection.add(ids=["my-doc"], documents=["This is the source text"])
  • IDs must be provided
  • Embeddings will be created by the collection!

Multiple documents

collection.add(
  ids=["my-doc-1", "my-doc-2"], 
  documents=["This is document 1", "This is document 2"]
)

Inspecting a collection

Counting documents in a collection

collection.count()
3

Inspecting a collection

Peeking at the first 10 items

collection.peek()
{'ids': ['my-doc', 'my-doc-1', 'my-doc-2'],
 'embeddings': [[...], [...], [...]],
 'documents': ['This is the source text',
  'This is document 1',
  'This is document 2'],
 'metadatas': [None, None, None]}

Retrieving items

collection.get(ids=["s59"])
{'ids': ['s59'],
 'embeddings': None,
 'metadatas': [None],
 'documents': ['Title: Naruto Shippûden the Movie: The Will of Fire (Movie)\nDescription: When ...'],
 'uris': None,
 'data': None}

Netflix dataset

 

Title: Kota Factory (TV Show)
Description: In a city of coaching centers known to train India's finest...
Categories: International TV Shows, Romantic TV Shows, TV Comedies

Title: The Last Letter From Your Lover (Movie)
Description: After finding a trove of love letters from 1965, a reporter sets...
Categories: Dramas, Romantic Movies
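
One way documents in this shape might be assembled before insertion is sketched below. The field names (show_id, title, type, description, categories) and the example IDs are assumptions made for illustration, not taken from the dataset; the descriptions are kept truncated exactly as shown above.

```python
from typing import Dict, List, Tuple


def build_documents(records: List[Dict[str, str]]) -> Tuple[List[str], List[str]]:
    """Turn raw records into parallel lists of IDs and document texts."""
    ids: List[str] = []
    documents: List[str] = []
    for record in records:
        ids.append(record["show_id"])
        documents.append(
            f"Title: {record['title']} ({record['type']})\n"
            f"Description: {record['description']}\n"
            f"Categories: {record['categories']}"
        )
    return ids, documents


# Sample records; the IDs are placeholders, not real dataset IDs.
records = [
    {
        "show_id": "example-1",
        "title": "Kota Factory",
        "type": "TV Show",
        "description": "In a city of coaching centers known to train India's finest...",
        "categories": "International TV Shows, Romantic TV Shows, TV Comedies",
    },
    {
        "show_id": "example-2",
        "title": "The Last Letter From Your Lover",
        "type": "Movie",
        "description": "After finding a trove of love letters from 1965, a reporter sets...",
        "categories": "Dramas, Romantic Movies",
    },
]

ids, documents = build_documents(records)
print(documents[0])
```

The resulting ids and documents lists have the same length and order, so they could be passed together as collection.add(ids=ids, documents=documents).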

Estimating embedding cost

  • Embedding model (text-embedding-3-small) costs $0.00002/1k tokens
cost = 0.00002 * len(tokens)/1000
  • Count tokens with the tiktoken library
    • pip install tiktoken
Pricing source: https://openai.com/pricing

Estimating embedding cost

import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-3-small")

total_tokens = sum(len(enc.encode(text)) for text in documents)
cost_per_1k_tokens = 0.00002
print('Total tokens:', total_tokens)
print('Cost:', cost_per_1k_tokens * total_tokens/1000)
Total tokens: 444463
Cost: 0.00888926
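
The calculation above could be wrapped in a small reusable helper, sketched here under a few assumptions: estimate_embedding_cost is a hypothetical name, the price defaults to the $0.00002 per 1k tokens quoted earlier, and the tokenizer is passed in as a function so the example runs without tiktoken installed. The str.split call is only a crude stand-in and will not match real token counts.

```python
from typing import Callable, Iterable, Sequence, Tuple


def estimate_embedding_cost(
    documents: Iterable[str],
    encode: Callable[[str], Sequence],
    price_per_1k_tokens: float = 0.00002,
) -> Tuple[int, float]:
    """Return (total token count, estimated cost in dollars)."""
    total_tokens = sum(len(encode(text)) for text in documents)
    return total_tokens, price_per_1k_tokens * total_tokens / 1000


# Whitespace splitting stands in for a real tokenizer here.
sample_documents = ["This is document 1", "This is document 2"]
total, cost = estimate_embedding_cost(sample_documents, str.split)
print("Total tokens:", total)
print("Cost:", cost)
```

With tiktoken available, passing enc.encode as the encode argument should reproduce the token counting shown above.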

Let's practice!

