Generalized overview of NLP

Large Language Models (LLMs) Concepts

Vidhi Chugh

AI strategist and ethicist

Where are we?

Progress chart showing the first step, i.e., text pre-processing

Text pre-processing

Three most common steps for text pre-processing

The steps are independent, so they can be applied in a different order

Tokenization

Splits text into individual words, or tokens

Text:
- "Working with natural language processing techniques is tricky."
Tokenization:
- ["Working", "with", "natural", "language", "processing", "techniques", "is", "tricky", "."]
- Converts the text into a list of tokens
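
As a rough illustration of the tokenization step, the sketch below splits text with a regular expression. The function name `tokenize` and the pattern are assumptions made for this example; dedicated NLP libraries apply more elaborate rules.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Working with natural language processing techniques is tricky.")
print(tokens)
```

The result is a plain Python list, so later steps can filter or transform it token by token.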

Stop word removal

Stop words (such as "with" and "is") add little meaning
They are eliminated through stop word removal

Before stop word removal:
- ["Working", "with", "natural", "language", "processing", "techniques", "is", "challenging", "."]

After stop word removal:
- ["Working", "natural", "language", "processing", "techniques", "challenging", "."]
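
Stop word removal can be sketched as a simple filter. The `STOP_WORDS` set below is a small hypothetical list chosen for this example, not a standard one; real stop word lists are much longer and vary by library.

```python
# Hypothetical stop word list for illustration only
STOP_WORDS = {"a", "an", "the", "is", "with", "of", "and"}

def remove_stop_words(tokens):
    """Keep only tokens whose lowercase form is not in STOP_WORDS."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = ["Working", "with", "natural", "language", "processing",
          "techniques", "is", "challenging", "."]
filtered = remove_stop_words(tokens)
print(filtered)
```

Comparing in lowercase means capitalized stop words such as "The" are removed as well.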

Lemmatization

Groups slightly different words with similar meaning

Reduces words to their base form

Maps each word to its root word

Talking -> Talk
Talked -> Talk
Talk -> Talk
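
The mapping above can be sketched as a dictionary lookup. This is a toy stand-in: the `LEMMAS` table is hypothetical and covers only the words on this slide, whereas real lemmatizers rely on a full vocabulary and grammatical analysis.

```python
# Hypothetical lookup table covering only the example words
LEMMAS = {"talking": "talk", "talked": "talk", "talk": "talk"}

def lemmatize(word):
    """Return the base form of a word, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMAS.get(key, key)

base_forms = [lemmatize(word) for word in ["Talking", "Talked", "Talk"]]
print(base_forms)
```

All three inputs collapse to the same base form, which is what lets later steps treat them as one word.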

Text representation

Progress chart showing we have reached the text representation stage

Text representation

Converts text data into numerical form

Bag-of-words
Word embeddings

Image depicting speech as numbers

Bag-of-words

Converts text into a matrix of word counts

A matrix with a bag of words representation

0 represents the absence of a word
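
A minimal sketch of how such a matrix could be built with the Python standard library. The two sentences are made up for this example, and `bag_of_words` is a hypothetical helper rather than a library function.

```python
import re
from collections import Counter

def bag_of_words(texts):
    """Return a sorted vocabulary and one row of word counts per text."""
    counts = [Counter(re.findall(r"\w+", text.lower())) for text in texts]
    # The vocabulary is every distinct word seen in any text
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for a missing word, which marks its absence
    matrix = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, matrix

# Made-up sentences for illustration
texts = ["The cat chased the mouse", "The mouse ate the cheese"]
vocabulary, matrix = bag_of_words(texts)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row corresponds to one text and each column to one vocabulary word.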

Limitations of bag-of-words

Does not capture word order or context
- Can lead to incorrect interpretations
- Similar sentences but opposite meaning
  - "The cat chased the mouse swiftly."
  - "The mouse chased the cat."
Does not capture the semantic relationships between words
- Treats related words as independent
- For example, "cat" and "mouse"
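
The loss of word order can be checked directly. In the sketch below, two sentences with opposite meanings produce identical word counts. It assumes the first sentence is shortened by dropping "swiftly" so that both contain exactly the same words; `word_counts` is a hypothetical helper.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase word tokens, ignoring punctuation."""
    return Counter(re.findall(r"\w+", text.lower()))

first = word_counts("The cat chased the mouse.")
second = word_counts("The mouse chased the cat.")

# Both sentences contain the same words the same number of times
same_representation = first == second
print(same_representation)
```

Because only the counts are kept, a model that sees this representation cannot tell which animal did the chasing.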

Word embeddings

Capture semantic meaning as numbers

	Cat	Mouse
Plant	-0.9	-0.8
Furry	0.9	0.7
Carnivore	0.9	-0.8

Cat = [-0.9, 0.9, 0.9]

Predator-prey relationship:

Predator-prey word embeddings
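
One common way to compare two embeddings is cosine similarity. The slides do not name a specific measure, so its use here is an assumption. The sketch takes the Cat and Mouse columns from the table above as vectors in the order Plant, Furry, Carnivore.

```python
import math

# Columns of the table above: [Plant, Furry, Carnivore]
cat = [-0.9, 0.9, 0.9]
mouse = [-0.8, 0.7, -0.8]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similarity = cosine_similarity(cat, mouse)
print(round(similarity, 2))
```

The two vectors agree on the Plant and Furry dimensions and differ in sign on Carnivore, so the similarity is positive but well below 1.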

Machine-readable form

Start with text pre-processing

Data preparation workflow

Machine-readable form

Convert pre-processed text to numerical format

Data preparation workflow with text representation steps
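
The whole workflow can be sketched end to end: pre-process each text, then convert the result to word counts. The stop word list, lemma table, sentences, and function names below are hypothetical and kept tiny for illustration.

```python
import re
from collections import Counter

STOP_WORDS = {"with", "is", "the", "a", "an"}   # hypothetical list
LEMMAS = {"talking": "talk", "talked": "talk"}  # hypothetical table

def preprocess(text):
    """Tokenize, remove stop words, then map each word to its base form."""
    tokens = re.findall(r"\w+", text.lower())
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return [LEMMAS.get(token, token) for token in tokens]

def represent(texts):
    """Turn pre-processed texts into a vocabulary and a word count matrix."""
    counts = [Counter(preprocess(text)) for text in texts]
    vocabulary = sorted(set().union(*counts))
    return vocabulary, [[count[word] for word in vocabulary] for count in counts]

# Made-up sentences for illustration
vocabulary, matrix = represent(["She talked with the cat.", "The cat is talking."])
print(vocabulary)
print(matrix)
```

After lemmatization, "talked" and "talking" share a single column, which keeps the vocabulary smaller than it would be on raw text.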

Let's practice!

