Handling stop words and punctuation

Natural Language Processing (NLP) in Python

Fouad Trad

Machine Learning Engineer

Stop words

Very frequent, yet they give the machine little context
In many NLP tasks they add little value
Removing them helps models focus on the key words

Image showing several stop words such as a, an, the, in, of, that, for, by, etc.
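
To make the "frequent but uninformative" point concrete, a quick word count over a short passage is enough. The sketch below uses only the standard library; the sample text and the small `common_words` set are made up for illustration and are not NLTK's list.

```python
from collections import Counter

# Hypothetical sample text, made up for this illustration.
text = (
    "the battery of the phone lasts for a day and the screen of the phone "
    "is bright in the sun"
)

# A tiny hand-picked set of stop words (an assumption, not NLTK's full list).
common_words = {"the", "of", "for", "a", "and", "is", "in"}

# Count how often each whitespace-separated token occurs.
counts = Counter(text.split())

# The single most frequent token and its count.
most_frequent = counts.most_common(1)[0]
print(most_frequent)

# Share of all tokens that belong to the stop word set.
stop_count = sum(n for word, n in counts.items() if word in common_words)
stop_share = stop_count / sum(counts.values())
print(f"{stop_share:.0%} of tokens are stop words")
```

In this toy passage the stop words account for 12 of the 20 tokens, while the words that say what the text is about (battery, phone, screen) appear only once or twice.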

Stop word removal

Useful for

Understanding what a text is about

Image showing product reviews in a mobile app

Not useful for

Tasks that need every word in the text

Image showing a text translated from English (Good morning) to French (Bonjour).
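
Translation is one such task; negation is another. The sketch below is a minimal illustration, assuming a small hand-written stop word subset (`stop_subset`) and a hypothetical helper `remove_stop_words`. The word "not" is part of NLTK's English stop word list, so filtering it out can erase the difference between two opposite statements.

```python
# A small subset of English stop words, written by hand for this example.
# "not" is included because NLTK's English list contains it as well.
stop_subset = {"this", "is", "not", "a", "the"}


def remove_stop_words(sentence, stop_words):
    """Drop every whitespace-separated token whose lowercase form is a stop word."""
    return [word for word in sentence.split() if word.lower() not in stop_words]


positive = remove_stop_words("This phone is good", stop_subset)
negative = remove_stop_words("This phone is not good", stop_subset)

print(positive)
print(negative)

# Both sentences collapse to the same tokens once "not" is removed.
print(positive == negative)
```

When a task depends on words like "not", one option is to remove them from the stop word list before filtering.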

Accessing stop words

NLTK provides stop word lists for multiple languages

import nltk
from nltk.corpus import stopwords

# Download the stop word lists (only needed once)
nltk.download('stopwords')


stop_words = stopwords.words('english')

print(stop_words[:10])

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']

Stop word removal

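# word_tokenize relies on NLTK's Punkt tokenizer data, which must be downloaded once
# (the resource is named 'punkt', or 'punkt_tab' in recent NLTK releases)
from nltk.tokenize import word_tokenize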


text = "This is an example to demonstrate removing stop words."

tokens = word_tokenize(text)

# Lowercase each token before the lookup: the stop word list is all lowercase
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print(filtered_tokens)

['example', 'demonstrate', 'removing', 'stop', 'words', '.']
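
The effect of `.lower()` is easier to see side by side. The sketch below is an illustration only: it hard-codes the tokens from the sentence above and a hand-written `stop_subset` instead of calling NLTK, so it runs without any downloads. It also stores the stop words in a `set`, a common choice because membership tests on a set do not scan the whole collection the way they do on a list.

```python
# Tokens of the example sentence, hard-coded so the sketch needs no NLTK data.
tokens = ["This", "is", "an", "example", "to", "demonstrate",
          "removing", "stop", "words", "."]

# Hand-written subset of stop words (an assumption for this example).
# With NLTK, set(stopwords.words('english')) would play this role.
stop_subset = {"this", "is", "an", "to"}

# Case-sensitive lookup: "This" does not match "this" and slips through.
case_sensitive = [t for t in tokens if t not in stop_subset]

# Case-insensitive lookup: each token is lowercased before the test.
case_insensitive = [t for t in tokens if t.lower() not in stop_subset]

print(case_sensitive)
print(case_insensitive)
```

Only the comparison is lowercased; the surviving tokens keep their original casing.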

Punctuation

Punctuation marks structure language for human readers
In many NLP tasks they carry little information

Image showing punctuation marks and special characters.

Punctuation removal

Useful for

Tasks that look for common or important words across documents

Image showing multiple files and documents to process.

Not useful for

Tasks that need the sentence structure preserved for clarity

Image showing a stack of books and a generated summary.
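
One way to see what is lost: sentence boundaries are marked by punctuation, so a summarizer that works sentence by sentence has nothing left to split on once the marks are stripped. The sketch below is a rough illustration using the standard library only; the sample text and the simple regular expression are assumptions, not a production sentence splitter.

```python
import re
import string

# Hypothetical sample text, made up for this illustration.
text = "The plot was slow. Did it get better? Yes, much better!"

# Naive sentence split: break on whitespace that follows ., ? or !
boundary = r"(?<=[.?!])\s+"
sentences = re.split(boundary, text)
print(sentences)

# Strip every punctuation character, then try the same split again.
stripped = text.translate(str.maketrans("", "", string.punctuation))
sentences_after = re.split(boundary, stripped)
print(sentences_after)
```

With punctuation intact the text splits into three sentences; after stripping, the same split returns the whole text as a single piece.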

Accessing and removing punctuation

import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

text = "This is an example to demonstrate removing stop words."
tokens = word_tokenize(text)
# Remove stop words first (stop_words comes from the earlier NLTK example)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Then drop the tokens that are punctuation marks
clean_tokens = [word for word in filtered_tokens if word not in string.punctuation]

print(clean_tokens)

['example', 'demonstrate', 'removing', 'stop', 'words']
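
One caveat about the check above: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test. It works for one-character tokens such as `.`, but a multi-character token like `...` or `--` is not a substring of it and would be kept. The sketch below shows a character-wise alternative; the token list is hard-coded and made up for illustration, and this is one possible variant rather than the approach used in the slides.

```python
import string

# Hypothetical token list, including multi-character punctuation tokens.
tokens = ["Wait", "...", "really", "?", "--", "state-of-the-art", "!"]

# Set of individual punctuation characters for fast membership tests.
punct_chars = set(string.punctuation)

# Substring test: removes "?" and "!", but "..." and "--" survive.
substring_filter = [t for t in tokens if t not in string.punctuation]

# Character-wise test: drop a token only if every character is punctuation.
charwise_filter = [t for t in tokens if not all(ch in punct_chars for ch in t)]

print(substring_filter)
print(charwise_filter)
```

The character-wise version keeps hyphenated words such as `state-of-the-art`, because they contain letters as well as punctuation.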

Let's practice!
