Content moderation

Multi-Modal Systems with the OpenAI API

James Chapman

Curriculum Manager, DataCamp

Moderation

Identifying inappropriate content

Traditionally,

Moderators flag content by hand
- ❌ Time-consuming
Keyword pattern matching
- ❌ Lacks nuance and understanding of context
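
To illustrate why keyword matching lacks nuance, here is a minimal sketch of a naive keyword filter. The keyword list and the helper name keyword_flag are hypothetical, chosen only for this example; the test sentence is the one used later in the moderation request.

```python
import re

# Hypothetical block list, for illustration only
BLOCKED_KEYWORDS = frozenset({"kill", "attack"})

def keyword_flag(text, keywords=BLOCKED_KEYWORDS):
    """Return True if any blocked keyword appears as a word in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return bool(words & keywords)

# A harmless figure of speech is flagged because context is ignored
print(keyword_flag("I could kill for a hamburger."))
```

The filter flags the sentence even though it is a harmless idiom, which is the kind of false positive a context-aware moderation model is meant to avoid.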

Speech icons depicting malicious content.

Violation categories

Identify violations of terms of use
Differentiate violation type by category
- Violence
- Hate speech

¹ https://openai.com/policies/usage-policies
² https://platform.openai.com/docs/guides/moderation/overview

Creating a moderations request

from openai import OpenAI

# Replace the placeholder with your own API key
client = OpenAI(api_key="ENTER API KEY")

# Send the text to the moderation endpoint
response = client.moderations.create(
  model="text-moderation-latest",
  input="I could kill for a hamburger."
)

Interpreting the results

categories
- true/false indicator of category violation
category_scores
- Confidence of a violation
flagged
- true/false indicator of a violation

response.model_dump()
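
As a rough sketch of reading these three fields, the snippet below works on a hand-written dictionary shaped like the output of response.model_dump(), so it runs without an API key. The scores are copied from the example shown later in these slides; the flagged and categories values, the exact key names, and the helper name summarize_result are assumptions made for illustration.

```python
# Hand-written stand-in for response.model_dump(); not real API output.
# Scores come from the slide example; the booleans are assumed.
sample_dump = {
    "results": [
        {
            "flagged": False,
            "categories": {
                "harassment": False,
                "hate": False,
                "violence": False,
            },
            "category_scores": {
                "harassment": 2.775943e-05,
                "hate": 2.733528e-07,
                "violence": 0.0500854030251503,
            },
        }
    ]
}

def summarize_result(dump):
    """Return (flagged, names of violated categories) for the first result."""
    result = dump["results"][0]
    violated = [name for name, hit in result["categories"].items() if hit]
    return result["flagged"], violated

flagged, violated = summarize_result(sample_dump)
print("Flagged:", flagged)
print("Violated categories:", violated)
```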

Response output

Interpreting the category scores

Extracting the category_scores from the response

Larger numbers → greater certainty of violation
Numbers $\neq$ probabilities
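
One way to make the scores easier to read is to rank the categories from highest to lowest. The sketch below assumes the scores have already been converted to a plain dictionary (for example via model_dump()); the values are the ones shown in these slides, and top_categories is a hypothetical helper name.

```python
# Scores copied from the slide example; a plain dict is assumed here
scores = {
    "harassment": 2.775943e-05,
    "harassment_threatening": 1.3526056e-06,
    "hate": 2.733528e-07,
    "hate_threatening": 4.930576e-08,
    "violence": 0.0500854030251503,
}

def top_categories(category_scores, n=3):
    """Return the n highest-scoring (category, score) pairs, largest first."""
    ranked = sorted(category_scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

for name, score in top_categories(scores):
    print(f"{name:<25}{score:.8f}")
```

The ranking only orders the categories by relative confidence; because the scores are not probabilities, the absolute values should not be read as a percentage chance of violation.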

category_scores with violence highlighted


Considerations for implementing moderation

CategoryScores(harassment=2.775943e-05,
               harassment_threatening=1.3526056e-06,
               hate=2.733528e-07,
               hate_threatening=4.930576e-08,
               ...,
               violence=0.0500854030251503,
               ...)

Tune thresholds for each use case
Stricter thresholds may result in fewer false negatives, but more false positives
More lenient thresholds may result in fewer false positives, but more false negatives
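
A minimal sketch of per-category thresholds is shown below, applied to the scores from this slide. The threshold values, the default of 0.5, and the helper name flag_categories are all hypothetical; suitable values depend on the use case and would need to be tuned on real data.

```python
# Scores copied from the slide example
scores = {
    "harassment": 2.775943e-05,
    "harassment_threatening": 1.3526056e-06,
    "hate": 2.733528e-07,
    "hate_threatening": 4.930576e-08,
    "violence": 0.0500854030251503,
}

# Hypothetical settings: a strict (low) and a lenient (high) cutoff for violence
strict = {"violence": 0.01}
lenient = {"violence": 0.5}

def flag_categories(category_scores, thresholds, default=0.5):
    """Return the categories whose score meets or exceeds their threshold."""
    return [
        name
        for name, score in category_scores.items()
        if score >= thresholds.get(name, default)
    ]

print("Strict: ", flag_categories(scores, strict))
print("Lenient:", flag_categories(scores, lenient))
```

With the strict setting, the hamburger sentence is flagged for violence, which is a false positive; with the lenient setting it passes, at the risk of missing borderline content.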

Let's practice!
