Embeddings for classification tasks

Introduction to Embeddings with the OpenAI API

Emmanuel Pire

Senior Software Engineer, DataCamp

Classification tasks

Assigning labels to items

  • Categorization
    • Example: headlines into topics
  • Sentiment analysis
    • Example: classifying reviews as positive or negative

Embeddings capture semantic meaning

A table of topics that we could use to categorize articles.

A table containing a smiley and a sad face, which could be used to categorize by sentiment.

Classification with embeddings

  • Zero-shot classification:
    • Not using labeled data

Process:

  1. Embed class descriptions
  2. Embed the item to classify
  3. Compute cosine distances
  4. Assign the most similar label

Embedded class descriptions in the vector space, with an unknown vector assigned the Tech label.
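
Steps 3 and 4 of the process can be written compactly. The notation here is an assumption made for illustration and does not come from the slides: q is the embedding of the item to classify and c_k is the embedding of the k-th class description.

```latex
d(\mathbf{q}, \mathbf{c}_k)
  = 1 - \frac{\mathbf{q} \cdot \mathbf{c}_k}
             {\lVert \mathbf{q} \rVert \, \lVert \mathbf{c}_k \rVert},
\qquad
\hat{k} = \arg\min_{k} \; d(\mathbf{q}, \mathbf{c}_k)
```

The item receives the label of class \hat{k}, the class whose description lies at the smallest cosine distance.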

Embedding class descriptions

topics = [
  {'label': 'Tech'},
  {'label': 'Science'},
  {'label': 'Sport'},
  {'label': 'Business'},
]

class_descriptions = [topic['label'] for topic in topics]
class_embeddings = create_embeddings(class_descriptions)
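
The slide calls create_embeddings without showing its definition. The following is a minimal sketch of what such a helper might look like, not the course's actual implementation. The optional client parameter, the default model name "text-embedding-3-small", and the stand-in client used for the offline check are all assumptions added for illustration.

```python
from types import SimpleNamespace


def create_embeddings(texts, client=None, model="text-embedding-3-small"):
    """Return one embedding (a list of floats) per input text.

    Hypothetical helper: the `client` and `model` parameters are assumptions.
    """
    if client is None:
        # Imported lazily so the function can be exercised with a stub client.
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


# Offline check with a stand-in client that mimics the response shape.
def fake_create(model, input):
    texts = [input] if isinstance(input, str) else input
    return SimpleNamespace(
        data=[SimpleNamespace(embedding=[float(len(t)), 1.0]) for t in texts]
    )


fake_client = SimpleNamespace(embeddings=SimpleNamespace(create=fake_create))
vectors = create_embeddings(["Tech", "Science"], client=fake_client)
print(vectors)
```

Because the helper always returns a list, passing a single string still yields a one-element list, which is why the slides index the result with [0] when embedding one item.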

Embedding item to classify

article = {"headline": "How NVIDIA GPUs Could Decide Who Wins the AI Race",
           "keywords": ["ai", "business", "computers"]}

def create_article_text(article):
  return f"""Headline: {article['headline']}
Keywords: {', '.join(article['keywords'])}"""

article_text = create_article_text(article)
article_embeddings = create_embeddings(article_text)[0]

Compute cosine distances

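from scipy.spatial import distance

def find_closest(query_vector, embeddings):
  # Cosine distance from the query to every candidate embedding
  distances = []
  for index, embedding in enumerate(embeddings):
    dist = distance.cosine(query_vector, embedding)
    distances.append({"distance": dist, "index": index})
  # The smallest distance marks the most similar embedding
  return min(distances, key=lambda x: x["distance"])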

closest = find_closest(article_embeddings, class_embeddings)
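
To try the lookup without calling the API, the sketch below runs the same logic on made-up 3-dimensional vectors. The vectors, the toy_ names, and the pure-Python cosine_distance helper (standing in for SciPy's distance.cosine so the example has no dependencies) are assumptions for illustration only.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def find_closest(query_vector, embeddings):
    # Score every candidate, then keep the one with the smallest distance.
    distances = [
        {"distance": cosine_distance(query_vector, emb), "index": i}
        for i, emb in enumerate(embeddings)
    ]
    return min(distances, key=lambda x: x["distance"])


# Made-up vectors, not real embeddings.
labels = ["Tech", "Science", "Sport", "Business"]
toy_class_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
toy_query = [0.9, 0.1, 0.0]

closest = find_closest(toy_query, toy_class_embeddings)
print(labels[closest["index"]], round(closest["distance"], 3))
```

Cosine distance depends only on direction, so the toy query, which points almost along the first axis, lands on the first class regardless of vector length.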

Extract the most similar label

label = topics[closest['index']]['label']

print(label)
Business
article = {"headline": "How NVIDIA GPUs Could Decide Who Wins the AI Race",
           "keywords": ["ai", "business", "computers"]}

Limitation:

  • The class descriptions were bare labels, which lacked sufficient detail

More detailed descriptions

topics = [
  {'label': 'Tech', 'description': 'A news article about technology'},
  {'label': 'Science', 'description': 'A news article about science'},
  {'label': 'Sport', 'description': 'A news article about sports'},
  {'label': 'Business', 'description': 'A news article about business'},
]

class_descriptions = [topic['description'] for topic in topics]
class_embeddings = create_embeddings(class_descriptions)
[...]
label = topics[closest['index']]['label']
print(label)
Tech
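
Since the predicted label changed from Business to Tech once the descriptions were reworded, it can help to look at all distances rather than only the minimum. The sketch below ranks every class and reports the gap between the top two. The rank_labels helper and the 2-dimensional vectors are hypothetical and chosen only to illustrate a close call; they are not results from the course.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_labels(query_vector, embeddings, labels):
    """Return (label, distance) pairs sorted from most to least similar."""
    scored = [
        (label, cosine_distance(query_vector, emb))
        for label, emb in zip(labels, embeddings)
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Made-up vectors, not real embeddings.
labels = ["Tech", "Science", "Sport", "Business"]
toy_class_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.8, 0.6]]
toy_query = [0.95, 0.3]

ranking = rank_labels(toy_query, toy_class_embeddings, labels)
margin = ranking[1][1] - ranking[0][1]  # gap between best and runner-up

for label, dist in ranking:
    print(f"{label:<9}{dist:.3f}")
print(f"margin: {margin:.3f}")
```

A small margin suggests the two leading classes are hard to separate for this item, which is one signal that the class descriptions may need more detail.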

Let's practice!
