Creating vector databases with ChromaDB

Introduction to Embeddings with the OpenAI API

Emmanuel Pire

Senior Software Engineer, DataCamp

Installing ChromaDB

ChromaDB is a simple yet powerful vector database
Two flavors:
- Local: great for development and prototyping
- Client/Server: made for production

[Figure: ChromaDB in local mode alongside ChromaDB in client/server mode, where the client and the server run as separate instances.]

Connecting to the database

import chromadb

client = chromadb.PersistentClient(path="/path/to/save/to")

Data will be persisted to disk

Creating a collection

Collections are analogous to tables

from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

collection = client.create_collection(
    name="my_collection",
    embedding_function=OpenAIEmbeddingFunction(
        model_name="text-embedding-3-small",
        api_key="..."
    )
)

Collections can create embeddings automatically with their embedding function
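
The `api_key="..."` placeholder is left elided on the slide. One common way to supply a key without hardcoding it is to read it from an environment variable. The sketch below assumes a variable named `OPENAI_API_KEY` and a hypothetical helper `load_api_key`; neither comes from the slides or from ChromaDB.

```python
import os


def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    `load_api_key` and the default variable name are illustrative
    assumptions, not part of ChromaDB.
    """
    # Fall back to the real process environment when no mapping is given
    env = os.environ if env is None else env
    key = env.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

The returned string could then be passed as `api_key=load_api_key()` when building the embedding function.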

Inspecting collections

client.list_collections()

[Collection(name=my_collection)]

Inserting embeddings

Single document

collection.add(ids=["my-doc"], documents=["This is the source text"])

IDs must be provided
Embeddings will be created by the collection!

Multiple documents

collection.add(
    ids=["my-doc-1", "my-doc-2"],
    documents=["This is document 1", "This is document 2"]
)
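
Because IDs must be provided, it can help to generate them alongside the documents. A minimal sketch, assuming a hypothetical helper `make_ids` and the `my-doc-N` naming used above:

```python
def make_ids(documents, prefix="my-doc"):
    """Build one ID per document, numbered from 1 (hypothetical helper)."""
    return [f"{prefix}-{i}" for i, _ in enumerate(documents, start=1)]


documents = ["This is document 1", "This is document 2"]
ids = make_ids(documents)

# The two lists must line up one-to-one, and no ID should repeat
assert len(ids) == len(documents)
assert len(set(ids)) == len(ids)

print(ids)  # ['my-doc-1', 'my-doc-2']
```

The resulting `ids` and `documents` lists have the shape expected by `collection.add(ids=..., documents=...)`.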

Inspecting a collection

Counting documents in a collection

collection.count()

Peeking at the first 10 items

collection.peek()

{'ids': ['my-doc', 'my-doc-1', 'my-doc-2'],
 'embeddings': [[...], [...], [...]],
 'documents': ['This is the source text',
  'This is document 1',
  'This is document 2'],
 'metadatas': [None, None, None]}
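
The result is a dictionary of parallel lists: position i in each list describes the same item. A short sketch of walking those lists together, using a hand-written dictionary that copies the output above (embeddings left out for brevity) rather than a live collection:

```python
# Hand-copied from the peek() output; not fetched from a real collection
result = {
    "ids": ["my-doc", "my-doc-1", "my-doc-2"],
    "documents": [
        "This is the source text",
        "This is document 1",
        "This is document 2",
    ],
    "metadatas": [None, None, None],
}

# zip() pairs up the entries that share the same position
rows = list(zip(result["ids"], result["documents"], result["metadatas"]))

for doc_id, text, metadata in rows:
    print(f"{doc_id}: {text!r} (metadata={metadata})")
```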

Retrieving items

collection.get(ids=["s59"])

{'ids': ['s59'],
 'embeddings': None,
 'metadatas': [None],
 'documents': ['Title: Naruto Shippûden the Movie: The Will of Fire (Movie)\nDescription: When ...'],
 'uris': None,
 'data': None}

Netflix dataset

Title: Kota Factory (TV Show)
Description: In a city of coaching centers known to train India's finest...
Categories: International TV Shows, Romantic TV Shows, TV Comedies

Title: The Last Letter From Your Lover (Movie)
Description: After finding a trove of love letters from 1965, a reporter sets...
Categories: Dramas, Romantic Movies
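
Each document is a single string that combines several fields. As a rough sketch of how such a string might be assembled from a record, assuming hypothetical dictionary keys (`title`, `type`, `description`, `categories`), since the slides do not give the dataset's actual column names:

```python
def format_title(record):
    """Join one record's fields into the text layout shown above.

    The dictionary keys are assumptions made for illustration.
    """
    return (
        f"Title: {record['title']} ({record['type']})\n"
        f"Description: {record['description']}\n"
        f"Categories: {record['categories']}"
    )


record = {
    "title": "Kota Factory",
    "type": "TV Show",
    "description": "In a city of coaching centers known to train India's finest...",
    "categories": "International TV Shows, Romantic TV Shows, TV Comedies",
}

text = format_title(record)
print(text)
```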

Estimating embedding cost

The embedding model (text-embedding-3-small) costs $0.00002 per 1k tokens¹

cost = 0.00002 * len(tokens)/1000

Count tokens with the tiktoken library
- pip install tiktoken

¹ https://openai.com/pricing
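
The cost formula can be wrapped in a small function so the price is easy to change. This is a minimal sketch; `estimate_cost` is a hypothetical name, and the default price is the one quoted above, which may differ from current pricing.

```python
def estimate_cost(num_tokens, price_per_1k_tokens=0.00002):
    """Estimate the embedding cost in USD for a given token count."""
    # Prices are quoted per 1,000 tokens, so scale the count down first
    return price_per_1k_tokens * num_tokens / 1000


# One million tokens at the quoted price
print(f"${estimate_cost(1_000_000):.2f}")
```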

import tiktoken

# Tokenizer that matches the embedding model
enc = tiktoken.encoding_for_model("text-embedding-3-small")

# documents: the list of text strings to embed
total_tokens = sum(len(enc.encode(text)) for text in documents)

# Price in USD per 1,000 tokens
cost_per_1k_tokens = 0.00002

print('Total tokens:', total_tokens)
print('Cost:', cost_per_1k_tokens * total_tokens/1000)

Total tokens: 444463
Cost: 0.00888926

Let's practice!
