Scaling and performance

Advanced NLP with spaCy

Ines Montani

spaCy core developer

Processing large volumes of text

Use the nlp.pipe method
Processes texts as a stream and yields Doc objects
Much faster than calling nlp on each text individually

BAD:

docs = [nlp(text) for text in LOTS_OF_TEXTS]

GOOD:

docs = list(nlp.pipe(LOTS_OF_TEXTS))
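
A minimal sketch of the streaming behaviour, assuming spaCy is installed. It uses a blank English pipeline (spacy.blank("en")) so that no trained model needs to be downloaded; the TEXTS list and the batch_size value are illustrative placeholders, not part of the original slide.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for a loaded model
nlp = spacy.blank("en")

# Hypothetical sample data standing in for LOTS_OF_TEXTS
TEXTS = ["This is a text", "And another text"] * 100

# nlp.pipe returns a generator, so Docs are produced lazily as it is consumed;
# batch_size controls how many texts are buffered per batch
doc_stream = nlp.pipe(TEXTS, batch_size=50)

# Docs are yielded in the same order as the input texts
docs = list(doc_stream)
print(len(docs), docs[0].text)
```

Wrapping the stream in list() is only needed when all Docs must be held in memory at once; iterating over nlp.pipe directly keeps memory use lower.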

Passing in context (1)

Set as_tuples=True on nlp.pipe to pass in (text, context) tuples
Yields (doc, context) tuples
Useful for associating metadata with the doc

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])

This is a text 15
And another text 16

Passing in context (2)

from spacy.tokens import Doc

Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']
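
Once the metadata is stored on the custom attributes, it travels with each Doc. The sketch below shows one way to read it back, assuming a blank English pipeline in place of a loaded model; the docs list and the page_16 filter are illustrative additions, and force=True is used only so the snippet can be re-run without an "already registered" error.

```python
import spacy
from spacy.tokens import Doc

# Register the custom attributes; force=True overwrites an existing registration
Doc.set_extension("id", default=None, force=True)
Doc.set_extension("page_number", default=None, force=True)

# Assumption: a blank English pipeline stands in for a loaded model
nlp = spacy.blank("en")

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

docs = []
for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context["id"]
    doc._.page_number = context["page_number"]
    docs.append(doc)

# Filter the processed Docs by the metadata attached to them
page_16 = [doc.text for doc in docs if doc._.page_number == 16]
print(page_16)
```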

Using only the tokenizer

Illustration of the spaCy pipeline

Don't run the whole pipeline!

Using only the tokenizer (2)

Use nlp.make_doc to turn a text into a Doc object

BAD:

doc = nlp("Hello world!")

GOOD:

doc = nlp.make_doc("Hello world!")
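
A small sketch of what nlp.make_doc returns, assuming a blank English pipeline so that no trained model is required. The call runs only the tokenizer and skips every other pipeline component; the tokens variable is an illustrative addition.

```python
import spacy

# Assumption: a blank English pipeline stands in for a loaded model
nlp = spacy.blank("en")

# make_doc only tokenizes the text; no pipeline components are applied
doc = nlp.make_doc("Hello world!")

# The resulting Doc still gives full access to the tokens
tokens = [token.text for token in doc]
print(tokens)
```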

Disabling pipeline components

Use nlp.disable_pipes to temporarily disable one or more pipes

# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)

Re-enables them after the with block
Only runs the remaining components
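
The restore behaviour can be checked by inspecting nlp.pipe_names. The sketch below assumes spaCy v3, where nlp.select_pipes(disable=[...]) is the current name for this context manager and nlp.disable_pipes is kept as a deprecated alias. It also assumes a blank English pipeline with a rule-based sentencizer added, so that no trained model is needed; the inside and after variables are illustrative.

```python
import spacy

# Assumption: spaCy v3 with a blank pipeline plus one rule-based component
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Temporarily disable the component inside the with block
with nlp.select_pipes(disable=["sentencizer"]):
    inside = list(nlp.pipe_names)  # active components while disabled
    doc = nlp("Hello world! This is a text.")

# The component is restored automatically after the block
after = list(nlp.pipe_names)
print(inside, after)
```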

Let's practice!
