Generalized overview of NLP

Large Language Models (LLMs) Concepts

Vidhi Chugh

AI strategist and ethicist

Where are we?

Progress chart showing the first step, i.e., text pre-processing

Text pre-processing

  • The steps are independent of one another, so they can be applied in a different order

Three most common steps for text pre-processing

Tokenization

  • Splits text into individual words, or tokens

 

  • Text:

    • "Working with natural language processing techniques is tricky."

     

  • Tokenization:

    • ["Working", "with", "natural", "language", "processing", "techniques", "is", "tricky", "."]
    • Converts the text into a list of tokens
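
The split shown above can be sketched with Python's standard `re` module. The `tokenize` helper and its regular expression are assumptions made for illustration; in practice a library tokenizer would usually be used instead.

```python
import re

def tokenize(text):
    # \w+ matches a run of letters or digits (a word);
    # [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Working with natural language processing techniques is tricky.")
print(tokens)
```

Keeping punctuation as its own token mirrors the slide, where the final "." appears as a separate list element.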

Stop word removal

  • Stop words, such as "with" and "is", add little meaning
  • They are eliminated through stop word removal

 

  • Before stop word removal:
    • ["Working", "with", "natural", "language", "processing", "techniques", "is", "challenging", "."]

 

  • After stop word removal:
    • ["Working", "natural", "language", "processing", "techniques", "challenging", "."]
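
A minimal sketch of the filtering step above. The `STOP_WORDS` set is a hypothetical, deliberately tiny list chosen for this example; real stop-word lists are much longer and vary by library and language.

```python
# Hypothetical stop-word list for illustration only
STOP_WORDS = {"with", "is", "the", "a", "an", "of", "and"}

def remove_stop_words(tokens):
    # Compare in lowercase so "With" and "with" are treated alike
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["Working", "with", "natural", "language", "processing",
          "techniques", "is", "challenging", "."]
filtered = remove_stop_words(tokens)
print(filtered)
```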

Lemmatization

 

  • Groups slightly different forms of a word that share a meaning

  • Reduces words to their base form

  • Each form is mapped to its root word

 

  • Talking -> Talk

  • Talked -> Talk

  • Talk -> Talk
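
The mapping above can be sketched as a dictionary lookup. The `LEMMAS` table is hypothetical and covers only the three examples on this slide; real lemmatizers rely on a full vocabulary and morphological analysis rather than a hand-written table.

```python
# Hypothetical lookup table covering only the slide's examples
LEMMAS = {"talking": "talk", "talked": "talk", "talk": "talk"}

def lemmatize(word):
    # Look the word up in lowercase; unknown words are returned lowercased
    key = word.lower()
    return LEMMAS.get(key, key)

for word in ["Talking", "Talked", "Talk"]:
    print(word, "->", lemmatize(word))
```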

Text representation

Progress chart showing we have reached the text representation stage

Text representation

 

  • Converts text data into numerical form

 

  • Bag-of-words
  • Word embeddings

Image depicting speech as numbers

Bag-of-words

 

  • Converts text into a matrix of word counts

A matrix with a bag of words representation

  • 0 represents the absence of a word
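
A minimal sketch of building such a count matrix with the standard library. The `bag_of_words` helper and the three example sentences are assumptions for illustration; they are not the data behind the matrix pictured on the slide.

```python
from collections import Counter

def bag_of_words(documents):
    # Naive whitespace tokenization, lowercased
    tokenized = [doc.lower().split() for doc in documents]
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({word for doc in tokenized for word in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # Counter returns 0 for words that do not occur in this document
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["the cat sat", "the dog sat down", "the cat chased the dog"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary word, so a 0 marks a word that is absent from that document.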

Limitations of bag-of-words

  • Does not capture the order or context

    • Can lead to incorrect interpretations
    • Similar sentences but opposite meaning
      • "The cat chased the mouse swiftly."
      • "The mouse chased the cat swiftly."
  • Does not capture semantic relationships between words

    • Treats related words, like "cat" and "mouse", as independent
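
To illustrate the word-order limitation, here is a small sketch that counts words in two sentences built from exactly the same words. The `word_counts` helper is an assumption for this example.

```python
from collections import Counter

def word_counts(sentence):
    # Lowercase, drop the trailing period, then split on whitespace
    return Counter(sentence.lower().rstrip(".").split())

a = word_counts("The cat chased the mouse swiftly.")
b = word_counts("The mouse chased the cat swiftly.")

# The counts are identical even though the meanings are opposite
same = (a == b)
print(same)  # True
```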

Word embeddings

  • Capture the semantic meaning of words as numbers

 

             Cat    Mouse
Plant       -0.9    -0.8
Furry        0.9     0.7
Carnivore    0.9    -0.8

 

  • Cat: [-0.9, 0.9, 0.9]
  • Mouse: [-0.8, 0.7, -0.8]
  • Predator-prey relationship:

Predator-prey word embeddings
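
The table above can be read as one vector per word, with the dimensions ordered Plant, Furry, Carnivore. The sketch below compares the two vectors with cosine similarity; that measure is an assumption here, since the slides do not name one, but it is a common way to compare embeddings.

```python
import math

# Values taken from the table: [Plant, Furry, Carnivore]
cat = [-0.9, 0.9, 0.9]
mouse = [-0.8, 0.7, -0.8]

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

similarity = cosine_similarity(cat, mouse)
print(round(similarity, 2))
```

The two words agree on the Plant and Furry dimensions but have opposite signs on Carnivore, which pulls the similarity well below 1.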

Machine-readable form

 

  • Start with text pre-processing

Data preparation workflow

Machine-readable form

 

  • Convert pre-processed text to numerical format

Data preparation workflow with text representation steps
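
The workflow can be sketched end to end as pre-processing followed by a numerical representation. Everything named here (`STOP_WORDS`, `LEMMAS`, `preprocess`, `to_counts`) is hypothetical and kept tiny for illustration; word counts stand in for the representation step.

```python
import re
from collections import Counter

STOP_WORDS = {"with", "is", "the"}  # hypothetical, tiny list
LEMMAS = {"working": "work", "techniques": "technique"}  # hypothetical lookup table

def preprocess(text):
    tokens = re.findall(r"\w+", text.lower())            # tokenize, dropping punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [LEMMAS.get(t, t) for t in tokens]            # lemmatize via lookup

def to_counts(tokens, vocabulary):
    # One count per vocabulary word, 0 if the word is absent
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

tokens = preprocess("Working with natural language processing techniques is tricky.")
vocabulary = sorted(set(tokens))
vector = to_counts(tokens, vocabulary)
print(tokens)
print(vector)
```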

Let's practice!
