Auto Models and Tokenizers

Working with Hugging Face

Jacob H. Marquez

Lead Data Engineer

Pipelines: fast and simple

from transformers import pipeline  

my_pipeline = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english")

print(my_pipeline("Wi-Fi is slower than a snail today!"))
[{'label': 'NEGATIVE', 'score': 0.99}]

Two ways to use Hugging Face models

Screenshot of Hugging Face showing Use in Transformers button

1 https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english

Two ways to use Hugging Face models

Screenshot of Hugging Face showing instructions for transformers

1 https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english

Auto Classes: flexible and powerful

  • Auto classes: Flexible access to models and tokenizers
  • More control over model behavior and outputs
  • Perfect for advanced tasks

  • Pipelines = quick; Auto classes = flexible

Three slider bars with toggles and a hand adjusting one of them. Representing more control.

AutoModels

  • Choose an AutoModel class to directly download a model

from transformers import AutoModelForSequenceClassification

# Download a pre-trained text classification model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

AutoTokenizers

  • Prepare text input data
  • Recommended to use the tokenizer paired with the model

from transformers import AutoTokenizer


# Retrieve the tokenizer paired with the model
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

Tokenizing text with AutoTokenizer

  • Tokenizers clean input and split text into tokens

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize input text
tokens = tokenizer.tokenize("AI: Helping robots think and humans overthink:)")
print(tokens)
['ai', ':', 'helping', 'robots', 'think', 'and', 
 'humans', 'over', '##thi', '##nk', ':', ')']
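
To make the '##' pieces in the output above less mysterious, here is a minimal sketch of greedy longest-match-first subword splitting, the general idea behind WordPiece-style tokenizers. It is a toy illustration, not the library's implementation: `TOY_VOCAB`, `wordpiece`, and `toy_tokenize` are hypothetical names, and the tiny vocabulary is invented so that "overthink" splits the same way as on the slide.

```python
import re

# Hypothetical toy vocabulary; a real tokenizer ships a much larger learned one.
TOY_VOCAB = {"humans", "think", "over", "##thi", "##nk", ":", ")"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces, always taking the longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word are marked with '##'
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text, vocab):
    """Lowercase, split into words and punctuation, then apply wordpiece."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = []
    for word in words:
        tokens.extend(wordpiece(word, vocab))
    return tokens


print(toy_tokenize("humans overthink:)", TOY_VOCAB))
# ['humans', 'over', '##thi', '##nk', ':', ')']
```

Because "overthink" is not in the toy vocabulary, the sketch falls back to the longest known prefix ("over") and then to continuation pieces. A different vocabulary would produce a different split.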

Different models, different tokenizers

  • Our model (distilbert-base-uncased):

    ['ai', ':', 'helping', 'robots', 'think', 'and', 'humans', 'over', '##thi',
    '##nk', ':', ')']
    
  • BERT-Base-Cased Tokenizer:

    ['AI', ':', 'Help', '##ing', 'robots', 'think', 'and', 'humans', 'over',
    '##thin', '##k', ':', ')']
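
One way to see why the tokenizer should be the one paired with the model: a model only ever sees token IDs, and each vocabulary assigns its own IDs. The sketch below uses two invented vocabularies (`UNCASED_VOCAB` and `CASED_VOCAB` are hypothetical, with made-up IDs) to show how IDs produced by one tokenizer mean something else under another.

```python
# Hypothetical token -> id mappings; real vocabularies are far larger.
UNCASED_VOCAB = {"[UNK]": 0, "ai": 1, "helping": 2, "robots": 3}
CASED_VOCAB = {"[UNK]": 0, "AI": 1, "Help": 2, "##ing": 3, "robots": 4}


def encode(tokens, vocab):
    """Map tokens to ids, falling back to the unknown-token id."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


def decode(ids, vocab):
    """Map ids back to tokens using the given vocabulary."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return [id_to_token.get(idx, "[UNK]") for idx in ids]


ids = encode(["ai", "helping", "robots"], UNCASED_VOCAB)
print(ids)                         # [1, 2, 3]
print(decode(ids, UNCASED_VOCAB))  # ['ai', 'helping', 'robots']
print(decode(ids, CASED_VOCAB))    # ['AI', 'Help', '##ing']
```

The same IDs read back as different tokens under the second vocabulary, which is the kind of silent mismatch that loading the paired tokenizer avoids.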
    

Building a Pipeline with Auto Classes

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    pipeline,
)

# Download the model and tokenizer
my_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english")
my_tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english")

# Create the custom pipeline
my_pipeline = pipeline(
    task="sentiment-analysis", model=my_model, tokenizer=my_tokenizer)

Use Cases for AutoModels and AutoTokenizers

  • 🔧 Use for more control and customization

  • 📝 Text Preprocessing: Clean and tokenize for specific use cases

  • 🏆 Thresholding: Prioritize key categories in classification tasks
  • 🚀 Complex Workflows: Control multi-stage processing and integration
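
The thresholding use case above can be sketched in plain Python. The idea, under stated assumptions: turn raw model scores (logits) into probabilities, then return a prioritized label whenever its probability clears a chosen threshold, even if it is not the top score. The function names, the logits, and the 0.3 threshold are all hypothetical; in practice the logits would come from the model's output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classify_with_threshold(logits, labels, priority_label, threshold):
    """Return priority_label if its probability reaches the threshold,
    otherwise the label with the highest probability."""
    scores = dict(zip(labels, softmax(logits)))
    if scores[priority_label] >= threshold:
        return priority_label, scores
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Made-up logits for illustration; POSITIVE has the higher raw score.
labels = ["NEGATIVE", "POSITIVE"]
logits = [0.0, 0.4]

label, scores = classify_with_threshold(logits, labels, "NEGATIVE", 0.3)
print(label)  # NEGATIVE

label_strict, _ = classify_with_threshold(logits, labels, "NEGATIVE", 0.5)
print(label_strict)  # POSITIVE
```

Lowering the threshold makes the prioritized category easier to trigger, at the cost of more false alarms. A plain pipeline call returns only the top label, which is why this kind of rule needs direct access to the scores.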

Let's practice!
