Introduction to Embeddings with the OpenAI API
Emmanuel Pire
Senior Software Engineer, DataCamp
import chromadb
client = chromadb.PersistentClient(path="/path/to/save/to")
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
collection = client.create_collection(
    name="my_collection",
    embedding_function=OpenAIEmbeddingFunction(
        model_name="text-embedding-3-small",
        api_key="..."
    )
)
client.list_collections()
[Collection(name=my_collection)]
Single document
collection.add(ids=["my-doc"], documents=["This is the source text"])
Multiple documents
collection.add(
    ids=["my-doc-1", "my-doc-2"],
    documents=["This is document 1", "This is document 2"]
)
Counting documents in a collection
collection.count()
3
Peeking at the first 10 items
collection.peek()
{'ids': ['my-doc', 'my-doc-1', 'my-doc-2'],
'embeddings': [[...], [...], [...]],
'documents': ['This is the source text',
'This is document 1',
'This is document 2'],
'metadatas': [None, None, None]}
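The result of peek() stores ids, documents, and metadatas as parallel lists, so entries at the same index belong to the same item. A minimal sketch of pairing them up, assuming a result shaped like the output above (embeddings are left out here, and pair_items is a hypothetical helper name, not part of Chroma):

```python
def pair_items(result):
    """Return (id, document, metadata) tuples from a peek()/get()-style dict."""
    ids = result["ids"]
    # Fields that were not returned may be None, so fall back to placeholders.
    documents = result.get("documents") or [None] * len(ids)
    metadatas = result.get("metadatas") or [None] * len(ids)
    return list(zip(ids, documents, metadatas))


# Dictionary copied from the peek() output above, without the embeddings.
peek_result = {
    "ids": ["my-doc", "my-doc-1", "my-doc-2"],
    "documents": [
        "This is the source text",
        "This is document 1",
        "This is document 2",
    ],
    "metadatas": [None, None, None],
}

items = pair_items(peek_result)
for doc_id, text, _metadata in items:
    print(f"{doc_id}: {text}")
```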
collection.get(ids=["s59"])
{'ids': ['s59'],
'embeddings': None,
'metadatas': [None],
'documents': ['Title: Naruto Shippûden the Movie: The Will of Fire (Movie)\nDescription: When ...'],
'uris': None,
'data': None}
Title: Kota Factory (TV Show)
Description: In a city of coaching centers known to train India's finest...
Categories: International TV Shows, Romantic TV Shows, TV Comedies
Title: The Last Letter From Your Lover (Movie)
Description: After finding a trove of love letters from 1965, a reporter sets...
Categories: Dramas, Romantic Movies
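Documents in this Title/Description/Categories layout could be assembled from structured records before being added to the collection. A minimal sketch, assuming each record is a dict with the keys shown below; the key names, the build_document helper, and the ids are hypothetical, and the descriptions are kept truncated exactly as they appear above:

```python
def build_document(record):
    """Format one record as the three-line text shown above."""
    return (
        f"Title: {record['title']} ({record['type']})\n"
        f"Description: {record['description']}\n"
        f"Categories: {record['categories']}"
    )


# Hypothetical records; the ids are placeholders for illustration only.
records = [
    {
        "id": "example-1",
        "title": "Kota Factory",
        "type": "TV Show",
        "description": "In a city of coaching centers known to train India's finest...",
        "categories": "International TV Shows, Romantic TV Shows, TV Comedies",
    },
    {
        "id": "example-2",
        "title": "The Last Letter From Your Lover",
        "type": "Movie",
        "description": "After finding a trove of love letters from 1965, a reporter sets...",
        "categories": "Dramas, Romantic Movies",
    },
]

ids = [record["id"] for record in records]
documents = [build_document(record) for record in records]
print(documents[0])
```

The resulting ids and documents lists have the same shape as the arguments passed to collection.add() earlier.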
Embedding model (text-embedding-3-small) costs $0.00002/1k tokens
cost = 0.00002 * len(tokens)/1000
Counting tokens with the tiktoken library
pip install tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("text-embedding-3-small")
total_tokens = sum(len(enc.encode(text)) for text in documents)
cost_per_1k_tokens = 0.00002
print('Total tokens:', total_tokens)
print('Cost:', cost_per_1k_tokens * total_tokens/1000)
Total tokens: 444463
Cost: 0.00888926
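The cost arithmetic can be wrapped in a small function so it can be reused for other token counts. A minimal sketch, assuming the $0.00002 per 1k tokens price quoted above; estimate_embedding_cost is a hypothetical helper name, and the token count is the total printed above:

```python
def estimate_embedding_cost(total_tokens, cost_per_1k_tokens=0.00002):
    """Return the estimated cost in dollars for embedding total_tokens tokens."""
    return cost_per_1k_tokens * total_tokens / 1000


# Token total taken from the output above.
cost = estimate_embedding_cost(444463)
print(f"Cost: ${cost:.8f}")
```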