Best practices for training spaCy models

Advanced NLP with spaCy

Ines Montani

spaCy core developer

Problem 1: Models can "forget" things

  • An existing model can overfit on new data
    • e.g.: if you only update it with WEBSITE examples, it can "unlearn" what a PERSON is
  • Also known as the "catastrophic forgetting" problem

Solution 1: Mix in previously correct predictions

  • For example, if you're training WEBSITE, also include examples of PERSON
  • Run existing spaCy model over data and extract all other relevant entities

BAD:

TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})
]

GOOD:

TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),
    ('Obama is a person', {'entities': [(0, 5, 'PERSON')]})
]
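
The second bullet above (run the existing model over the data and keep its other predictions) could be sketched roughly as follows. This is a minimal sketch, not the course's own code: the function name `merge_with_model_predictions` and the `keep_labels` parameter are hypothetical, and `nlp` is assumed to be a loaded spaCy pipeline whose `doc.ents` expose `start_char`, `end_char` and `label_`.

```python
def merge_with_model_predictions(nlp, examples, keep_labels=('PERSON',)):
    """Add the existing model's predictions to hand-labelled examples.

    `examples` uses the same (text, {'entities': [...]}) format as TRAINING_DATA.
    Hypothetical helper for illustration only.
    """
    merged = []
    for text, annotations in examples:
        # Copy so the original annotations are left untouched
        entities = list(annotations['entities'])
        doc = nlp(text)
        for ent in doc.ents:
            # Only mix in labels the model previously predicted well
            if ent.label_ not in keep_labels:
                continue
            # Skip predictions that overlap a span we annotated by hand
            overlaps = any(
                ent.start_char < end and start < ent.end_char
                for start, end, _ in entities
            )
            if not overlaps:
                entities.append((ent.start_char, ent.end_char, ent.label_))
        merged.append((text, {'entities': sorted(entities)}))
    return merged
```

With a real pipeline this might be called as `merge_with_model_predictions(spacy.load('en_core_web_sm'), TRAINING_DATA)`, assuming that model is installed. The model's predictions are not guaranteed to be correct, so the merged examples are worth reviewing before training on them.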

Problem 2: Models can't learn everything

  • spaCy's models make predictions based on local context
  • A model can struggle to learn a label if the decision is difficult to make from that context
  • Label scheme needs to be consistent and not too specific
    • For example: CLOTHING is better than ADULT_CLOTHING and CHILDRENS_CLOTHING

Solution 2: Plan your label scheme carefully

  • Pick categories that are reflected in local context
  • More generic is better than too specific
  • Use rules to go from generic labels to specific categories

BAD:

LABELS = ['ADULT_SHOES', 'CHILDRENS_SHOES', 'BANDS_I_LIKE']

GOOD:

LABELS = ['CLOTHING', 'BAND']
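
The third bullet above (use rules to go from generic labels to specific categories) might look something like the sketch below. It assumes the model has already predicted a generic CLOTHING span, and then applies a simple keyword rule to the surrounding words. The function name `refine_clothing`, the `CHILDREN_HINTS` keyword set, the window size and the fallback to ADULT_CLOTHING are all assumptions made for illustration.

```python
# Hypothetical keyword list: words near a CLOTHING span that suggest children's items
CHILDREN_HINTS = {'kids', 'children', "children's", 'childrens', 'toddler', 'baby'}

def refine_clothing(text, start, end, window=3):
    """Map a generic CLOTHING span to a more specific category with a keyword rule.

    `start` and `end` are the character offsets of the predicted CLOTHING entity.
    """
    # Look at a few whitespace-separated words on either side of the entity
    before = text[:start].lower().split()[-window:]
    after = text[end:].lower().split()[:window]
    # Strip surrounding punctuation so "kids." still matches "kids"
    context = {word.strip('.,!?"') for word in before + after}
    if context & CHILDREN_HINTS:
        return 'CHILDRENS_CLOTHING'
    # Assumed default when no children-related hint is found
    return 'ADULT_CLOTHING'
```

For example, given the text `'I bought sneakers for my kids'` and the offsets of `sneakers`, the rule finds `kids` in the context window and returns `'CHILDRENS_CLOTHING'`. The statistical model only has to learn the generic label, while the fine-grained distinction lives in rules that are easy to inspect and change.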

Let's practice!
