Semantic search with Pinecone

Vector Databases for Embeddings with Pinecone

James Chapman

Curriculum Manager, DataCamp

Semantic search engines

  1. Embed the documents and ingest them into a Pinecone index
  2. Embed the user's query
  3. Query the index with the embedded query
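
The three steps above can be sketched without any external service. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical stand-in for an embedding model (it just counts a few fixed words, so it captures word overlap rather than meaning), and a plain dictionary named `toy_index` plays the role of the Pinecone index.

```python
import math

# Hypothetical stand-in for an embedding model: counts of a few fixed words.
VOCAB = ["school", "catholic", "engineering", "college", "music", "album"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Step 1: embed the documents and store them (a dict stands in for the index)
documents = {
    "doc-1": "the school has a catholic character",
    "doc-2": "the college of engineering was established",
    "doc-3": "her solo album changed pop music",
}
toy_index = {doc_id: toy_embed(text) for doc_id, text in documents.items()}

# Step 2: embed the user's query with the same function
query_vector = toy_embed("which school is catholic")

# Step 3: rank the stored vectors by similarity to the query vector
def search(vector, top_k=2):
    scored = [(cosine(vector, v), doc_id) for doc_id, v in toy_index.items()]
    return sorted(scored, reverse=True)[:top_k]

results = search(query_vector)
print(results[0][1])  # ID of the best-matching document
```

In the real pipeline, the embedding model replaces `toy_embed` and the vector database replaces the dictionary and the `search` function.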

[Figure: semantic search]


Setting up Pinecone and OpenAI for semantic search

from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec

client = OpenAI(api_key="OPENAI_API_KEY")
pc = Pinecone(api_key="PINECONE_API_KEY")


# dimension=1536 matches the output size of text-embedding-3-small
pc.create_index(
    name="semantic-search-datacamp",
    dimension=1536,
    spec=ServerlessSpec(cloud='aws', region='us-east-1')
)
index = pc.Index("semantic-search-datacamp")

Ingesting documents into the Pinecone index

import pandas as pd
import numpy as np
from uuid import uuid4

df = pd.read_csv('squad_dataset.csv')
| id | text                                              | title             |
|----|---------------------------------------------------|-------------------|
| 1  | Architecturally, the school has a Catholic cha... | University of ... |
| 2  | The College of Engineering was established in.... | University of ... |
| 3  | Following the disbandment of Destiny's Child in.. | Beyonce           |
| 4  | Architecturally, the school has a Catholic cha... | University of ... |

Ingesting documents into the Pinecone index

batch_limit = 100


# Round the number of batches up so that no batch exceeds batch_limit
for batch in np.array_split(df, int(np.ceil(len(df) / batch_limit))):
    # Metadata stored alongside each vector
    metadatas = [{"text_id": row['id'], "text": row['text'], "title": row['title']}
                 for _, row in batch.iterrows()]
    texts = batch['text'].tolist()
    # One random unique ID per document
    ids = [str(uuid4()) for _ in range(len(texts))]
    # Embed the whole batch in a single API call
    response = client.embeddings.create(input=texts, model="text-embedding-3-small")
    embeds = [np.array(x.embedding) for x in response.data]
    index.upsert(vectors=zip(ids, embeds, metadatas), namespace="squad_dataset")
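
An alternative way to batch, shown here only as a sketch, is to slice by position so the size limit is explicit. `iter_batches` and the sample `rows` list are hypothetical names used for illustration and are not part of the course code.

```python
def iter_batches(items, batch_limit):
    """Yield consecutive slices of `items`, each at most `batch_limit` long."""
    if batch_limit <= 0:
        raise ValueError("batch_limit must be positive")
    for start in range(0, len(items), batch_limit):
        yield items[start:start + batch_limit]

# Made-up records standing in for the rows of the dataset
rows = [{"id": i, "text": f"passage {i}"} for i in range(250)]
batch_sizes = [len(batch) for batch in iter_batches(rows, 100)]
print(batch_sizes)  # [100, 100, 50]
```

Only the last batch can be smaller than the limit. For a DataFrame, `df.iloc[start:start + batch_limit]` gives the equivalent slice.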

Ingesting documents into the Pinecone index

index.describe_index_stats()
{'dimension': 1536, 'index_fullness': 0.02,
 'namespaces': {'squad_dataset': {'vector_count': 2000}},
 'total_vector_count': 2000}
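
A quick sanity check on these statistics can be sketched as follows. The `stats` dictionary is copied by hand from the output above rather than fetched from the index, and `expected_rows` is an assumed value for the number of ingested documents.

```python
# Hand-copied from the describe_index_stats() output above
stats = {
    'dimension': 1536,
    'index_fullness': 0.02,
    'namespaces': {'squad_dataset': {'vector_count': 2000}},
    'total_vector_count': 2000,
}

expected_rows = 2000  # assumption: number of rows that were upserted

# Sum the per-namespace counts and compare them with the reported total
namespace_total = sum(ns['vector_count'] for ns in stats['namespaces'].values())
counts_agree = namespace_total == stats['total_vector_count'] == expected_rows
print(counts_agree)  # True
```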

Querying with Pinecone

query = "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"

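# Embed the query with the same model used for the documents
query_response = client.embeddings.create(input=query, model="text-embedding-3-small")
query_emb = query_response.data[0].embedding

# Search the namespace the documents were upserted into
retrieved_docs = index.query(vector=query_emb, top_k=3, namespace="squad_dataset", include_metadata=True)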

Querying with Pinecone

for result in retrieved_docs['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")
0.41: Architecturally, the school has a Catholic character. Atop the Main Building
gold dome is a golden statue of the Virgin Mary...

0.3: Because of its Catholic identity, a number of religious buildings stand on 
campus. The Old College building has become one of two seminaries...

0.29: Within the white inescutcheon, the five quinas (small blue shields) with 
their five white bezants representing the five wounds...
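
Since every match carries a score, one possible follow-up is to keep only matches above a minimum score. This is a sketch, not part of the course code: `sample_response` is a hand-written dictionary that mimics the `matches` structure used above, and both `filter_matches` and the 0.35 threshold are hypothetical choices for illustration.

```python
# Hand-written stand-in for the query response, using the scores printed above
sample_response = {
    "matches": [
        {"score": 0.41, "metadata": {"text": "Architecturally, the school has a Catholic character..."}},
        {"score": 0.30, "metadata": {"text": "Because of its Catholic identity..."}},
        {"score": 0.29, "metadata": {"text": "Within the white inescutcheon..."}},
    ]
}

def filter_matches(response, min_score):
    """Return only the matches whose score is at least `min_score`."""
    return [m for m in response["matches"] if m["score"] >= min_score]

strong = filter_matches(sample_response, min_score=0.35)
for match in strong:
    print(f"{match['score']}: {match['metadata']['text']}")
```

A suitable threshold depends on the embedding model and the data, so it would need to be tuned rather than copied from this sketch.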

Time to build!
