Topic modeling sulla frode

Rilevamento delle frodi in Python

Charlotte Werger

Data Scientist

Topic modeling: schemi nascosti nei testi

Scoprire argomenti nei testi
"Di cosa parla il testo"
Concettualmente simile al clustering
Confronta gli argomenti di casi di frode vs non frode e usali come feature o flag
Oppure… c’è un argomento che sembra indicare la frode?

Latent Dirichlet Allocation (LDA)

Con LDA ottieni:

Modello "argomenti per testo" (cioè probabilità)
Modello "parole per argomento"

Creare un tuo topic model:

Pulisci i dati
Crea bag-of-words con dizionario e corpus
Dai in input dizionario e corpus al modello LDA

Latent Dirichlet Allocation (LDA)

Bag-of-words: dizionario e corpus

from gensim import corpora

# Create dictionary number of times a word appears
dictionary = corpora.Dictionary(cleaned_emails)

# Filter out (non)frequent words 
dictionary.filter_extremes(no_below=5, keep_n=50000)

# Create corpus
corpus = [dictionary.doc2bow(text) for text in cleaned_emails]

Latent Dirichlet Allocation (LDA) con gensim

import gensim
# Define the LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 3, 
id2word=dictionary, passes=15)

# Print the three topics from the model with top words
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, 0.029*"email" + 0.016*"send" + 0.016*"results" + 0.016*"invoice")
(1, 0.026*"price" + 0.026*"work" + 0.026*"management" + 0.026*"sell")
(2, 0.029*"distribute" + 0.029*"contact" + 0.016*"supply" + 0.016*"fast")

Passiamo alla pratica !

Rilevamento delle frodi in Python