Generalized overview of NLP

Large Language Models (LLMs) Concepts

Vidhi Chugh

AI strategist and ethicist

Where are we?

Progress chart showing the first step, i.e., text pre-processing

Text pre-processing

  • The pre-processing steps are independent, so they can be applied in any order

Three most common steps for text pre-processing

Tokenization

  • Splits text into individual words, or tokens

 

  • Text:

    • "Working with natural language processing techniques is tricky."

     

  • Tokenization:

    • ["Working", "with", "natural", "language", "processing", "techniques", "is", "tricky", "."]
    • Converts the text into a list of tokens
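
A minimal sketch of the tokenization step, assuming a simple regular-expression rule (words and punctuation marks become separate tokens). The `tokenize` helper is a hypothetical name used for illustration; real tokenizers handle many more cases such as contractions and hyphenated words.

```python
import re

def tokenize(text):
    # \w+ matches a run of letters or digits (a word);
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Working with natural language processing techniques is tricky.")
print(tokens)
```

The printed list matches the tokenization shown above, with the final period kept as its own token.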

Stop word removal

  • Stop words (such as "with" and "is") add little meaning to the text
  • They are eliminated through stop word removal

 

  • Before stop word removal:
    • ["Working", "with", "natural", "language", "processing", "techniques", "is", "challenging", "."]

 

  • After stop word removal:
    • ["Working", "natural", "language", "processing", "techniques", "challenging", "."]
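
One way this could look in code, assuming a small hand-picked stop word list. `STOP_WORDS` and `remove_stop_words` are hypothetical names for illustration; NLP libraries ship much longer, language-specific lists.

```python
# Hypothetical, deliberately tiny stop word list (an assumption, not a standard list).
STOP_WORDS = {"with", "is", "the", "a", "an", "of", "and", "to", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so "With" and "with" are treated the same.
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["Working", "with", "natural", "language", "processing",
          "techniques", "is", "challenging", "."]
filtered = remove_stop_words(tokens)
print(filtered)
```

Only "with" and "is" are dropped here; punctuation is kept because it is not in the list.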

Lemmatization

 

  • Groups slightly different words with similar meanings

 

  • Reduces words to their base form

 

  • Each word is mapped to its root word

 

  • Talking -> Talk

  • Talked -> Talk

  • Talk -> Talk
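
A toy sketch of the mapping above, assuming a hand-written lookup table. `LEMMAS` and `lemmatize` are hypothetical names; real lemmatizers rely on large dictionaries and part-of-speech information rather than a three-entry table. This sketch also lowercases its input, which is a simplification.

```python
# Hypothetical lookup table covering only the forms of "talk" used above.
LEMMAS = {"talking": "talk", "talked": "talk", "talks": "talk"}

def lemmatize(token, lemmas=LEMMAS):
    word = token.lower()
    # Words without an entry are assumed to already be in base form.
    return lemmas.get(word, word)

base_forms = [lemmatize(word) for word in ["Talking", "Talked", "Talk"]]
print(base_forms)
```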

Text representation

Progress chart showing we have reached the text representation stage

Text representation

 

  • Converts text data into numerical form

 

  • Bag-of-words
  • Word embeddings

Image depicting speech as numbers

Bag-of-words

 

  • Converts text into a matrix of word counts

A matrix with a bag of words representation

  • 0 represents the absence of a word
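
A minimal sketch of how such a matrix could be built, using only the standard library. The two example documents and the `bag_of_words` helper are assumptions made for illustration; each row is a document, each column a vocabulary word, and each cell a count.

```python
from collections import Counter

def bag_of_words(documents):
    # Naive whitespace tokenization, enough for this illustration.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word, sorted for a stable column order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that are absent from the document.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["the cat sat", "the dog sat on the mat"]  # hypothetical documents
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

In the first row, the columns for "dog", "mat", and "on" are 0 because those words do not appear in the first document.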

Limitations of bag-of-words

  • Does not capture word order or context

    • Can lead to incorrect interpretations
    • Sentences built from similar words can have opposite meanings
      • "The cat chased the mouse swiftly."
      • "The mouse chased the cat."
  • Does not capture the semantic relationships between words

    • Treats related words as independent
    • For example, "cat" and "mouse"
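
The word-order limitation can be illustrated with a short sketch. As an assumption for this example, "swiftly" is left out of the first sentence so that both sentences contain exactly the same words; `word_counts` is a hypothetical helper.

```python
from collections import Counter

def word_counts(sentence):
    # Lowercase, drop the trailing period, and count each word.
    return Counter(sentence.lower().rstrip(".").split())

counts_a = word_counts("The cat chased the mouse.")
counts_b = word_counts("The mouse chased the cat.")

# The counts are identical even though the meanings are opposite.
same_representation = counts_a == counts_b
print(same_representation)
```

Because only counts are stored, a model that sees these representations has no way to tell who chased whom.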

Word embeddings

  • Capture the semantic meaning of words as numbers

 

             Cat    Mouse
Plant       -0.9    -0.8
Furry        0.9     0.7
Carnivore    0.9    -0.8

 

  • Cat [-0.9, 0.9, 0.9]
  • Predator-prey relationship:

Predator-prey word embeddings
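
As a rough sketch, the vectors from the table can be compared numerically. Cosine similarity is used here as one common way to compare embeddings; it is an assumption for this example and is not named in the slides. Values near 1 mean the vectors point in a similar direction, and values near -1 mean they point in opposite directions.

```python
import math

# Values taken from the table above, in the order (plant, furry, carnivore).
embeddings = {
    "cat": [-0.9, 0.9, 0.9],
    "mouse": [-0.8, 0.7, -0.8],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similarity = cosine_similarity(embeddings["cat"], embeddings["mouse"])
print(round(similarity, 2))
```

The two words agree on the "plant" and "furry" dimensions but differ in sign on "carnivore", which pulls the similarity down and reflects the predator-prey contrast.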

Machine-readable form

 

  • Start with text pre-processing

Data preparation workflow

Machine-readable form

 

  • Convert pre-processed text to numerical format

Data preparation workflow with text representation steps

Let's practice!
