Moderation

Developing AI Systems with the OpenAI API

Francesca Donadoni

Curriculum Manager, DataCamp

Understanding moderation in the OpenAI API

Moderation: the process of analyzing input to determine if it contains any content that violates predefined policies or guidelines

Understanding moderation in the OpenAI API

A diagram showing a user message being read by the OpenAI moderation API, which returns a response listing the harmful content categories it checks

Moderating content

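from openai import OpenAI

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

# Send an excerpt of the game rules to the moderation endpoint
moderation_response = client.moderations.create(input="""
...until someone draws an Exploding Kitten.
When that happens, that person explodes. They are now dead.
This process continues until...
""")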

print(moderation_response.results[0].categories.violence)

True

¹ https://ek.explodingkittens.com/how-to-play/exploding-kittens

Moderation in context

moderation_response = client.moderations.create(input="""
In the deck of cards are some Exploding Kittens. You play the game by putting the deck face down and taking turns drawing cards until someone draws an Exploding Kitten.
When that happens, that person explodes. They are now dead.
This process continues until there’s only 1 player left, who wins the game.
The more cards you draw, the greater your chances of drawing an Exploding Kitten.
""") 

print(moderation_response.results[0].categories.violence)

False
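The True/False flags in the two examples above are derived from per-category scores, which the moderation response also exposes under category_scores. The sketch below shows one way to apply a custom threshold to those scores. The helper names flagged_categories and moderate, the 0.5 threshold, and the example scores are assumptions made for illustration; they do not come from the course, and the scores are not real API output.

```python
def flagged_categories(scores, threshold=0.5):
    """Return the names of categories whose score is at or above the threshold."""
    return sorted(
        name for name, score in scores.items()
        if score is not None and score >= threshold
    )


def moderate(text, threshold=0.5):
    """Call the moderation endpoint and apply a custom threshold (needs an API key)."""
    from openai import OpenAI  # imported here so the helper above works offline

    client = OpenAI()
    result = client.moderations.create(input=text).results[0]
    # category_scores is a pydantic model; model_dump() turns it into a plain dict
    return flagged_categories(result.category_scores.model_dump(), threshold)


# Made-up scores, for illustration only (not real API output)
example_scores = {"violence": 0.82, "harassment": 0.03, "hate": 0.001}
print(flagged_categories(example_scores, threshold=0.5))
```

A lower threshold makes the check stricter, which may suit applications aimed at younger audiences; a higher one reduces false positives such as the card game excerpt.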

Prompt injection

A woman using a chatbot with a malicious prompt being injected

Prompt injection

Ways to reduce the risk of prompt injection:

Limiting the amount of text in prompts
Limiting the number of output tokens generated
Using pre-selected content as validated input and output
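The three measures listed above can be sketched as a small request builder that runs before any API call. Everything named here (ALLOWED_QUESTIONS, MAX_PROMPT_CHARS, MAX_OUTPUT_TOKENS, build_request) and the specific limits are hypothetical choices for illustration, not values from the course.

```python
# Hypothetical pre-selected questions the application is willing to send
ALLOWED_QUESTIONS = {
    "rules": "Explain the basic rules of chess.",
    "openings": "Describe three common chess openings.",
}
MAX_PROMPT_CHARS = 500   # assumed limit on free-text prompt length
MAX_OUTPUT_TOKENS = 150  # assumed cap on generated tokens


def build_request(choice=None, free_text=None):
    """Build keyword arguments for client.chat.completions.create."""
    if choice is not None:
        # Pre-selected content: only known, validated prompts are accepted
        if choice not in ALLOWED_QUESTIONS:
            raise ValueError(f"Unknown question: {choice!r}")
        prompt = ALLOWED_QUESTIONS[choice]
    else:
        # Free text: reject empty or overly long prompts
        prompt = (free_text or "").strip()
        if not prompt or len(prompt) > MAX_PROMPT_CHARS:
            raise ValueError("Prompt is empty or too long")
    return {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
        # Caps the output length; older SDK versions use max_tokens instead
        "max_completion_tokens": MAX_OUTPUT_TOKENS,
    }


request = build_request(choice="rules")
print(request["messages"][0]["content"])
```

The returned dictionary can be passed on with client.chat.completions.create(**request). Rejecting long prompts outright, rather than truncating them, avoids sending a partial message whose meaning has changed.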

Adding guardrails

user_request = """
In the deck of cards are some Exploding Kittens. You play the game by putting the
deck face down and taking turns drawing cards until someone draws an Exploding
Kitten. When that happens, that person explodes. They are now dead.
This process continues until there’s only 1 player left, who wins the game.
The more cards you draw, the greater your chances of drawing an Exploding Kitten.
"""

# Adjacent string literals are joined into a single system prompt
messages = [
    {"role": "system",
     "content": ("Your role is to assess whether the user question is "
                 "allowed or not. The allowed topics are games of chess only. If "
                 "the topic is allowed, reply with an answer as normal, otherwise "
                 "say 'Apologies, but the topic is not allowed.'")},
    {"role": "user", "content": user_request},
]

Adding guardrails

response = client.chat.completions.create(
    model="gpt-4o-mini", 
    messages=messages
)

print(response.choices[0].message.content)

Apologies, but the topic is not allowed.
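Moderation and guardrails can also be combined, so that both the user input and the model reply are checked before anything is shown. The sketch below is one possible arrangement, not the course's implementation: answer_with_guardrails, is_flagged and ask_model are hypothetical names, and the two checks are passed in as functions so the flow can be tried without an API key.

```python
REFUSAL = "Apologies, but the topic is not allowed."


def answer_with_guardrails(user_text, is_flagged, ask_model):
    """Moderate the input, query the model, then moderate the reply."""
    if is_flagged(user_text):
        return REFUSAL          # blocked before the model is ever called
    reply = ask_model(user_text)
    if is_flagged(reply):
        return REFUSAL          # blocked after generation
    return reply


def openai_is_flagged(text):
    """Real check using the moderation endpoint (needs an API key)."""
    from openai import OpenAI
    return OpenAI().moderations.create(input=text).results[0].flagged


# Offline demonstration with stand-in functions (assumptions, not API output)
def demo_is_flagged(text):
    return "explodes" in text


def demo_ask_model(text):
    return "A common first move in chess is e4."


print(answer_with_guardrails("How should I open a chess game?",
                             demo_is_flagged, demo_ask_model))
print(answer_with_guardrails("That person explodes.",
                             demo_is_flagged, demo_ask_model))
```

In a real application, openai_is_flagged would replace the stand-in check, and ask_model would wrap the client.chat.completions.create call with the system message shown earlier.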

Let's practice!
