Data Structures: Vocab, Lexemes and StringStore

Advanced NLP with spaCy

Ines Montani

spaCy core developer

Shared vocab and string store (1)

  • Vocab: stores data shared across multiple documents
  • To save memory, spaCy encodes all strings to hash values
  • Strings are only stored once in the StringStore via nlp.vocab.strings
  • String store: lookup table in both directions
coffee_hash = nlp.vocab.strings['coffee']
coffee_string = nlp.vocab.strings[coffee_hash]
  • Hashes can't be reversed: a hash only resolves back to its string if that string is in the string store, which is why we need to provide the shared vocab
# Raises an error if the string behind this hash was never added to the string store
string = nlp.vocab.strings[3197928453018144401]
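The two-way lookup described above can be modeled in a few lines of plain Python. This is a toy sketch under stated assumptions, not spaCy's implementation: `ToyStringStore` and `toy_hash` are hypothetical names, and `hashlib.blake2b` is only a stand-in hash function, so the numbers it produces will not match spaCy's hash values.

```python
import hashlib


def toy_hash(text: str) -> int:
    """Map a string to a 64-bit integer (a stand-in, not spaCy's hash function)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyStringStore:
    """Hypothetical two-way lookup table: string -> hash and hash -> string."""

    def __init__(self):
        # Each distinct string is stored once, keyed by its hash.
        self._by_hash = {}

    def add(self, text: str) -> int:
        key = toy_hash(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash can always be computed from the string itself.
            return toy_hash(key)
        # Hash -> string only works for strings that were added earlier;
        # an unknown hash raises KeyError.
        return self._by_hash[key]


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])              # coffee
print(store["coffee"] == coffee_hash)  # True

# A store that never saw "coffee" cannot turn the hash back into a string.
empty_store = ToyStringStore()
try:
    empty_store[coffee_hash]
except KeyError:
    print("unknown hash: the string was never added to this store")
```

The design point is that the hash alone carries no text. Resolving it requires a table that already holds the original string, which in this sketch is the `_by_hash` dictionary.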

Shared vocab and string store (2)

  • Look up the string and hash in nlp.vocab.strings
doc = nlp("I love coffee")
print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee
  • The doc also exposes the vocab and strings
doc = nlp("I love coffee")
print('hash value:', doc.vocab.strings['coffee'])
hash value: 3197928453018144401

Lexemes: entries in the vocabulary

  • A Lexeme object is an entry in the vocabulary
doc = nlp("I love coffee")
lexeme = nlp.vocab['coffee']

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)
coffee 3197928453018144401 True
  • Contains the context-independent information about a word
    • Word text: lexeme.text and lexeme.orth (the hash)
    • Lexical attributes like lexeme.is_alpha
    • Not context-dependent part-of-speech tags, dependencies or entity labels
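The split between context-independent and context-dependent information can be sketched in plain Python. This is a toy model, not spaCy's implementation: `ToyLexeme`, `ToyToken`, `ToyVocab` and `toy_hash` are hypothetical names, the hash is a stand-in, and the part-of-speech tags are written in by hand rather than predicted by a model.

```python
import hashlib
from dataclasses import dataclass


def toy_hash(text: str) -> int:
    # Stand-in 64-bit hash; spaCy's real hash values will differ.
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


@dataclass(frozen=True)
class ToyLexeme:
    """Context-independent entry: one object serves every occurrence of a word."""
    text: str
    orth: int
    is_alpha: bool


@dataclass
class ToyToken:
    """A word in context: points at a shared lexeme and adds a context-dependent tag."""
    lexeme: ToyLexeme
    pos: str


class ToyVocab:
    def __init__(self):
        self._lexemes = {}

    def __getitem__(self, text: str) -> ToyLexeme:
        # Create the entry on first access, then reuse it.
        key = toy_hash(text)
        if key not in self._lexemes:
            self._lexemes[key] = ToyLexeme(text, key, text.isalpha())
        return self._lexemes[key]


vocab = ToyVocab()
# The tags below are supplied by hand for illustration only.
love_verb = ToyToken(vocab["love"], pos="VERB")  # as in "I love coffee"
love_noun = ToyToken(vocab["love"], pos="NOUN")  # as in "Love is blind"

print(love_verb.lexeme is love_noun.lexeme)  # True
print(love_verb.pos, love_noun.pos)          # VERB NOUN
```

Both tokens share a single lexeme, so attributes such as `is_alpha` are stored once, while the tag lives on the token because it depends on the sentence.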

Vocab, hashes and lexemes

Illustration of the words "I", "love" and "coffee" across the Doc, Vocab and StringStore


Let's practice!
