Named entity recognition on transcribed text

Spoken Language Processing in Python

Daniel Bourke

Machine Learning Engineer/YouTube Creator

Installing spaCy

# Install spaCy
$ pip install spacy

# Download spaCy language model
$ python -m spacy download en_core_web_sm

Using spaCy

import spacy


# Load a spaCy language model
nlp = spacy.load("en_core_web_sm")

# Create a spaCy doc (the same transcript that the outputs on the following slides are based on)
doc = nlp("I'd like to talk about a smartphone I ordered on July 31st from your "
          "Sydney store, my order number is 4093829. I spoke to one of your "
          "customer service team, Georgia, yesterday.")

spaCy tokens

# Show each token and its character position in the text
for token in doc:
    print(token.text, token.idx)

I 0
'd 1
like 4
to 9
talk 12
about 17
a 23
smartphone 25...
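
The number printed next to each token is `token.idx`, the character offset at which the token starts in the original text. A minimal sketch to check this, assuming spaCy is installed; it uses `spacy.blank("en")`, a tokenizer-only pipeline chosen here so that no model download is needed, rather than the `en_core_web_sm` model from the slides:

```python
import spacy

# Tokenizer-only English pipeline (assumption: used instead of en_core_web_sm)
nlp = spacy.blank("en")
text = "I'd like to talk about a smartphone"
doc = nlp(text)

# token.idx is the character offset where the token starts in the original text
offsets = [(token.text, token.idx) for token in doc]

# Slicing the original string at each offset recovers the token text
recovered = [text[idx:idx + len(tok)] for tok, idx in offsets]
print(recovered == [tok for tok, _ in offsets])
```

Because the offsets point into the raw string, they can be used to highlight or cut out tokens in the original transcript.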

spaCy sentences

# Show sentences in doc
for sentence in doc.sents:
    print(sentence)

I'd like to talk about a smartphone I ordered on July 31st from your Sydney store, 
my order number is 4093829.
I spoke to one of your customer service team, Georgia, yesterday.

spaCy named entities

Some of spaCy's built-in named entities:

PERSON    People, including fictional.
ORG       Companies, agencies, institutions, etc.
GPE       Countries, cities, states.
PRODUCT   Objects, vehicles, foods, etc. (Not services.)
DATE      Absolute or relative dates or periods.
TIME      Times smaller than a day.
MONEY     Monetary values, including unit.
CARDINAL  Numerals that do not fall under another type.
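
spaCy also ships a small glossary for these labels. As a quick sketch, assuming spaCy is installed, `spacy.explain` returns a short description for a label string and needs no language model. The `labels` list below is simply the set of labels named on this slide:

```python
import spacy

# Labels taken from the slide above
labels = ["PERSON", "ORG", "GPE", "PRODUCT", "DATE", "TIME", "MONEY", "CARDINAL"]

# spacy.explain looks up a label in spaCy's built-in glossary
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, "-", description)
```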

spaCy named entities

# Find named entities in doc
for entity in doc.ents:
    print(entity.text, entity.label_)

July 31st DATE
Sydney GPE
4093829 CARDINAL
one CARDINAL
Georgia GPE
yesterday DATE
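
Once entities are extracted, it is often handy to group them by label, for example to pull every date out of a transcript. A minimal sketch using only the standard library: the `(text, label)` pairs are copied from the output above and stand in for `(entity.text, entity.label_)` taken from `doc.ents`, and `group_by_label` is a hypothetical helper name, not part of spaCy:

```python
from collections import defaultdict

# (text, label) pairs copied from the output above
entities = [
    ("July 31st", "DATE"),
    ("Sydney", "GPE"),
    ("4093829", "CARDINAL"),
    ("one", "CARDINAL"),
    ("Georgia", "GPE"),
    ("yesterday", "DATE"),
]

def group_by_label(pairs):
    """Collect entity texts under their label, keeping document order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
print(by_label["DATE"])
```

Note that the model tags Georgia as GPE even though the transcript refers to a person, so grouped results still deserve a sanity check.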

Custom named entities

# Import EntityRuler class
from spacy.pipeline import EntityRuler

# Check spaCy pipeline
print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger at 0x1c3aa8a470>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x1c3bb60588>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1c3bb605e8>)]

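Changing the pipeline

# Create an EntityRuler instance (spaCy v2 style, matching the pipeline output shown here)
ruler = EntityRuler(nlp)

# Add a pattern to the ruler
ruler.add_patterns([{"label": "PRODUCT", "pattern": "smartphone"}])

# Add the new rule to the pipeline before ner
# (in spaCy v3 this call takes the component name: nlp.add_pipe("entity_ruler", before="ner"))
nlp.add_pipe(ruler, before="ner")

# Check the updated pipeline
print(nlp.pipeline)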

Changing the pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x1c1f9c9b38>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x1c3c9cba08>),
 ('entity_ruler', <spacy.pipeline.entityruler.EntityRuler at 0x1c1d834b70>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1c3c9cba68>)]
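
The pipeline output on these slides comes from spaCy v2. In spaCy v3, `nlp.add_pipe` takes the registered component name as a string and returns the component, so the same rule could be sketched as below. This is a hedged sketch rather than the course's code: it assumes spaCy v3 and uses `spacy.blank("en")` so that it runs without a downloaded model. With `en_core_web_sm` loaded, you would also pass `before="ner"` as on the slide.

```python
import spacy

# Tokenizer-only English pipeline (assumption: avoids downloading a model)
nlp = spacy.blank("en")

# In spaCy v3, add_pipe takes the component name and returns the component
ruler = nlp.add_pipe("entity_ruler")

# Same pattern as on the slide: tag the exact text "smartphone" as PRODUCT
ruler.add_patterns([{"label": "PRODUCT", "pattern": "smartphone"}])

doc = nlp("I'd like to talk about a smartphone I ordered on July 31st.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

Since the blank pipeline has no statistical `ner` component, only the rule-based match is reported here; dates and places would additionally appear when the rule is added to a trained pipeline.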

Testing the new pipeline

# Re-run the pipeline so the new rule is applied, then test it
doc = nlp(doc.text)
for entity in doc.ents:
    print(entity.text, entity.label_)

smartphone PRODUCT
July 31st DATE
Sydney GPE
4093829 CARDINAL
one CARDINAL
Georgia GPE
yesterday DATE

Let's practice!
