Data concerns and considerations

Large Language Models (LLMs) Concepts

Vidhi Chugh

AI strategist and ethicist

Data considerations

  • Data volume and compute power
  • Data quality
  • Labeling
  • Bias
  • Privacy

Data volume and compute power

  • LLMs need a lot of training data
    • Similar to a child learning to talk through repeated exposure to language
    • For example, 570 GB of text, roughly 1.3 million books

 

[Image: child learning to speak. Credit: Freepik]
  • Training requires extensive computing power, with correspondingly high energy consumption
  • Training can cost millions of dollars
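
As a rough sanity check on the "570 GB, ~1.3 million books" comparison above, the implied average book size can be worked out directly. This is only a back-of-the-envelope sketch: it assumes decimal gigabytes and plain-text storage, and the function name `bytes_per_book` is made up for illustration.

```python
# Back-of-the-envelope check of the "570 GB ~ 1.3 million books" comparison.
# Assumptions (not from the slides): decimal gigabytes, plain-text storage.
DATASET_BYTES = 570 * 10**9   # 570 GB
BOOK_COUNT = 1_300_000        # ~1.3 million books


def bytes_per_book(dataset_bytes: int, book_count: int) -> float:
    """Return the average number of bytes each book would occupy."""
    return dataset_bytes / book_count


avg = bytes_per_book(DATASET_BYTES, BOOK_COUNT)
print(f"Implied average book size: {avg / 1000:.0f} KB")
# prints "Implied average book size: 438 KB"
```

At roughly one byte per character, that is about 438,000 characters per book, which is plausible for a book-length text.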

[Image: man working on a computer plugged into a large server]


Data quality

  • Quality data is essential
  • Accurate data -> better learning -> improved response quality -> increased trust
  • Like a child learning to talk: gibberish in -> gibberish out
    • Training LLMs on data full of mistakes or poor grammar leads to low-quality outputs
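
One way to picture "gibberish in -> gibberish out" is a simple filter that drops obviously poor text before training. The sketch below is illustrative only: the thresholds (`min_words`, `min_alpha_ratio`) and the function names are assumptions, and real pipelines use far more elaborate checks.

```python
def looks_clean(text: str, min_words: int = 5, min_alpha_ratio: float = 0.7) -> bool:
    """Heuristic check: enough words, and mostly alphabetic characters."""
    words = text.split()
    if len(words) < min_words:
        return False
    visible = [ch for ch in text if not ch.isspace()]
    letters = sum(ch.isalpha() for ch in visible)
    return letters / len(visible) >= min_alpha_ratio


def filter_corpus(docs):
    """Keep documents that pass the heuristic and are not duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        key = " ".join(doc.lower().split())  # normalize case and whitespace
        if key in seen or not looks_clean(doc):
            continue
        seen.add(key)
        kept.append(doc)
    return kept


sample = [
    "The cat sat on the mat and watched the rain.",
    "the cat sat on the mat and   watched the rain.",  # duplicate after normalization
    "@@## 1234 !!! ???? 5678 $$$$ %%%%",               # mostly symbols
    "Too short.",
]
print(filter_corpus(sample))
# prints ['The cat sat on the mat and watched the rain.']
```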


Labeled data

  • Correct labels enable accurate learning, generalization of patterns, and accurate responses
  • Labeling is labor-intensive, e.g., assigning the correct label to each article
  • Incorrect labels degrade model performance
  • Address errors: identify -> analyze -> iterate

[Image: team working on computers to label data]
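
The "identify" step of the error-handling loop can be sketched by comparing labels from several annotators and flagging items where they disagree. This is a minimal illustration, not a prescribed method: the `review_labels` function, the 0.6 agreement threshold, and the sample articles are all hypothetical.

```python
from collections import Counter


def review_labels(annotations, min_agreement=0.6):
    """Split items into a consensus label map and a list needing review.

    annotations: dict mapping item id -> list of labels from annotators.
    """
    consensus, needs_review = {}, []
    for item_id, labels in annotations.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        if agreement >= min_agreement:
            consensus[item_id] = top_label
        else:
            # Keep the full label breakdown so a person can analyze it.
            needs_review.append((item_id, dict(Counter(labels))))
    return consensus, needs_review


annotations = {
    "article_1": ["sports", "sports", "sports"],
    "article_2": ["politics", "sports", "politics"],
    "article_3": ["tech", "health", "sports"],
}
consensus, needs_review = review_labels(annotations)
print(consensus)      # items with enough agreement
print(needs_review)   # items to analyze and relabel
```

Items in `needs_review` would then go back to annotators, and the check is repeated after relabeling (the "iterate" step).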

Data bias

  • Influenced by societal stereotypes
  • Lack of diversity in training data
  • Can lead to discrimination and unfair outcomes
  • Spot and deal with biased data
    • Evaluate data imbalances
    • Promote diversity
    • Apply bias mitigation techniques, such as adding more diverse examples
  • Example: given "The nurse said that...", a biased model tends to continue with "she" or "her"
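
Evaluating data imbalances can start with something as simple as counting which pronouns appear alongside an occupation word in the training text. The sketch below assumes a tiny made-up corpus and a hypothetical `pronoun_counts` helper; a real bias audit would need much larger samples and more careful linguistics.

```python
import re
from collections import Counter

FEMALE = {"she", "her", "hers"}
MALE = {"he", "him", "his"}


def pronoun_counts(sentences, occupation):
    """Count gendered pronouns in sentences that mention the occupation."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        if occupation not in tokens:
            continue
        counts["female"] += sum(t in FEMALE for t in tokens)
        counts["male"] += sum(t in MALE for t in tokens)
    return counts


# Made-up sentences for illustration only.
corpus = [
    "The nurse said that she would be late.",
    "The nurse finished her shift.",
    "The nurse said he was tired.",
    "The engineer said he fixed the bug.",
]
counts = pronoun_counts(corpus, "nurse")
print(counts["female"], counts["male"])
# prints "2 1"
```

A strong skew in such counts suggests the data may push a model toward stereotyped completions, which is where adding more diverse examples can help.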

Data privacy

  • Comply with data protection and privacy regulations
  • Privacy is a concern
    • Training on data without permission can lead to a breach
    • A breach can cause legal, financial, and reputational harm
  • Take particular care with sensitive or personally identifiable information (PII)
  • Get permission before using data
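
To make the PII point concrete, here is a minimal sketch of masking email addresses and phone numbers in text before it is used. The regular expressions are deliberately simple assumptions and will miss many real-world formats; they are not a substitute for a proper privacy review or for obtaining permission.

```python
import re

# Simplified patterns, assumed for illustration; real PII detection is harder.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


redacted = redact_pii("Contact jane.doe@example.com or call +1 555-010-9999.")
print(redacted)
# prints "Contact [EMAIL] or call [PHONE]."
```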


Let's practice!
