Scaling and performance

Advanced NLP with spaCy

Ines Montani

spaCy core developer

Processing large volumes of text

Use the nlp.pipe method
Processes texts as a stream, yields Doc objects
Much faster than calling nlp on each text

BAD:

docs = [nlp(text) for text in LOTS_OF_TEXTS]

GOOD:

docs = list(nlp.pipe(LOTS_OF_TEXTS))
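
A minimal sketch of streaming with nlp.pipe, for cases where the full list of Doc objects does not need to sit in memory at once. The blank English pipeline, the `texts` list (a stand-in for LOTS_OF_TEXTS) and the `batch_size` value are assumptions chosen for illustration, not part of the original slides.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no trained model is needed
nlp = spacy.blank("en")

# Hypothetical stand-in for LOTS_OF_TEXTS
texts = ["This is a text", "And another text"] * 1000

total_tokens = 0
# nlp.pipe is a generator: Doc objects are produced batch by batch as you iterate
for doc in nlp.pipe(texts, batch_size=256):
    total_tokens += len(doc)

print(total_tokens)
```

Wrapping the call in list(...) as on the slide is fine when you need all the docs afterwards; iterating directly keeps memory use lower when you only need an aggregate.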

Passing in context (1)

Setting as_tuples=True on nlp.pipe lets you pass in (text, context) tuples
Yields (doc, context) tuples
Useful for associating metadata with the doc

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])

This is a text 15
And another text 16

Passing in context (2)

from spacy.tokens import Doc

Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']
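
Once the metadata lives on the doc, it can be read back anywhere the doc travels. The sketch below repeats the setup so it runs on its own and then filters docs by page number. The blank pipeline, the `docs` and `page_16` names, and `force=True` (used only so the snippet can be re-run without an "extension already exists" error) are assumptions for illustration.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline is enough to demonstrate the pattern
nlp = spacy.blank("en")

# force=True overwrites an existing extension, which makes the snippet re-runnable
Doc.set_extension("id", default=None, force=True)
Doc.set_extension("page_number", default=None, force=True)

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

docs = []
for doc, context in nlp.pipe(data, as_tuples=True):
    # Copy the context onto the doc so it stays attached to it
    doc._.id = context["id"]
    doc._.page_number = context["page_number"]
    docs.append(doc)

# Later on, the metadata can be used to select docs
page_16 = [doc.text for doc in docs if doc._.page_number == 16]
print(page_16)
```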

Using only the tokenizer

Illustration of the spaCy pipeline

Don't run the whole pipeline!

Using only the tokenizer (2)

Use nlp.make_doc to turn a text into a Doc object

BAD:

doc = nlp("Hello world!")

GOOD:

doc = nlp.make_doc("Hello world!")
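
A small sketch of what nlp.make_doc returns: a tokenized Doc and nothing more. It assumes a blank English pipeline, where nlp(text) and nlp.make_doc(text) happen to do the same work; the saving only shows up with a loaded pipeline that has components such as a tagger or parser. The `texts` list is a hypothetical example.

```python
import spacy

# Assumption: blank English pipeline; with a trained pipeline the call is identical
nlp = spacy.blank("en")

# Only the tokenizer runs, no pipeline components are applied
doc = nlp.make_doc("Hello world!")
tokens = [token.text for token in doc]
print(tokens)

# The same idea for many texts, e.g. when you only need token counts
texts = ["Hello world!", "This is a text"]
lengths = [len(nlp.make_doc(text)) for text in texts]
print(lengths)
```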

Disabling pipeline components

Use nlp.disable_pipes to temporarily disable one or more pipes

# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)

Restores them after the with block
Only runs the remaining components
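
The restore-after-the-block behavior can be observed through nlp.pipe_names. This is a sketch that assumes spaCy v3, where components are added by string name and nlp.select_pipes(disable=...) is, to my understanding, the replacement for the nlp.disable_pipes call shown above. The blank pipeline with a single "sentencizer" component is a stand-in for a trained pipeline with a tagger and parser.

```python
import spacy

# Assumption: spaCy v3 API (string names in add_pipe, select_pipes)
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # stand-in for components like 'tagger' or 'parser'

before = list(nlp.pipe_names)

with nlp.select_pipes(disable=["sentencizer"]):
    # Inside the block only the remaining (enabled) components are listed and run
    inside = list(nlp.pipe_names)
    doc = nlp("Hello world! How are you?")

# After the block the disabled components are restored
after = list(nlp.pipe_names)

print(before, inside, after)
```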

Let's practice!
