spaCy basics

Natural Language Processing with spaCy

Azadeh Mobasher

Principal Data Scientist

spaCy NLP pipeline

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Here's my spaCy pipeline.")
  • Import spaCy
  • Use spacy.load() to load a model and return nlp, an instance of the Language class
    • The Language object is the text processing pipeline
  • Apply nlp() to any text to get a Doc container (a quick type check follows below)
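Continuing the snippet above, a quick check of these types (the classes shown are what spaCy v3 reports for the small English model):

print(type(nlp))  # <class 'spacy.lang.en.English'>, a Language subclass
print(type(doc))  # <class 'spacy.tokens.doc.Doc'>, the Doc container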

spaCy NLP pipeline

spaCy applies a series of processing steps using its Language class:

(Diagram: spaCy Language pipeline)
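Which steps run depends on the loaded model; a minimal way to inspect them, reusing the en_core_web_sm pipeline (the exact component list varies by model and version):

import spacy

nlp = spacy.load("en_core_web_sm")

# Components the Language object applies, in order
# (the tokenizer always runs first but is stored separately, on nlp.tokenizer)
print(nlp.pipe_names)
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']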


Container objects in spaCy

  • There are multiple data structures to represent text data in spaCy:

Name  | Description
Doc   | A container for accessing linguistic annotations of text
Span  | A slice from a Doc object
Token | An individual token, i.e. a word, punctuation, whitespace, etc.
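A minimal sketch of how the three containers relate, using the same small English model (the example sentence here is our own):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy has three core container objects.")

span = doc[0:3]  # slicing a Doc returns a Span
token = doc[0]   # indexing a Doc returns a Token

print(type(span).__name__, "->", span.text)
print(type(token).__name__, "->", token.text)
Span -> spaCy has three
Token -> spaCy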

Pipeline components

  • The spaCy language processing pipeline depends on the loaded model and its capabilities.

Component        | Name       | Description
Tokenizer        | Tokenizer  | Segment text into tokens and create the Doc object
Tagger           | Tagger     | Assign part-of-speech tags
Lemmatizer       | Lemmatizer | Reduce words to their root forms
EntityRecognizer | NER        | Detect and label named entities
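A short sketch of the annotations these components add, assuming the en_core_web_sm model (the sentence is our own, and entity labels can vary across model versions):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sara lives in Berlin.")

# Tagger and Lemmatizer annotate each Token
print([(token.text, token.pos_, token.lemma_) for token in doc])

# EntityRecognizer fills doc.ents with labeled Span objects
print([(ent.text, ent.label_) for ent in doc.ents])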

Pipeline components

  • Each component has unique features to process text; other components include (see the Sentencizer sketch below):
    • Language
    • DependencyParser
    • Sentencizer
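For instance, a Sentencizer can be added to a blank pipeline with add_pipe(); a minimal sketch:

import spacy

# A blank English pipeline starts with no components
nlp = spacy.blank("en")

# The Sentencizer sets sentence boundaries using punctuation rules
nlp.add_pipe("sentencizer")

doc = nlp("We are learning NLP. This course introduces spaCy.")
print([sent.text for sent in doc.sents])
['We are learning NLP.', 'This course introduces spaCy.']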

Tokenization

  • Always the first operation
  • All other operations require tokens
  • Tokens can be words, numbers, or punctuation (see the flags demo after the example)
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("Tokenization splits a sentence into its tokens.")

print([token.text for token in doc])
['Tokenization', 'splits', 'a', 'sentence', 'into', 'its', 'tokens', '.']
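Each Token also carries boolean flags that show which kind of item it is; a small sketch with our own example sentence:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy splits 2 sentences, doesn't it?")

# is_alpha: alphabetic word, like_num: number-like, is_punct: punctuation
print([(t.text, t.is_alpha, t.like_num, t.is_punct) for t in doc])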

Sentence segmentation

  • More complex than tokenization
  • Part of the DependencyParser component
import spacy
nlp = spacy.load("en_core_web_sm")

text = "We are learning NLP. This course introduces spaCy."
doc = nlp(text)

for sent in doc.sents:
    print(sent.text)
We are learning NLP.
This course introduces spaCy.

Lemmatization

  • A lemma is the base form of a token
  • The lemma of "eats" and "ate" is "eat"
  • Improves the accuracy of language models
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("We are seeing her after one year.")

print([(token.text, token.lemma_) for token in doc])
[('We', 'we'), ('are', 'be'), ('seeing', 'see'), ('her', 'she'), 
('after', 'after'), ('one', 'one'), ('year', 'year'), ('.', '.')]

Let's practice!
