Topic modeling on fraud

Fraud Detection in Python

Charlotte Werger

Data Scientist

Topic modeling: discover hidden patterns in text data

Discovering topics in text data
"What is the text about"
Conceptually similar to clustering data
Compare topics of fraud cases to non-fraud cases and use as a feature or flag
Or.. is there a particular topic in the data that seems to point to fraud?

Latent Dirichlet Allocation (LDA)

With LDA you obtain:

"topics per text item" model (i.e. probabilities)
"words per topic" model

Creating your own topic model:

Clean your data
Create a bag of words with dictionary and corpus
Feed dictionary and corpus into the LDA model

Latent Dirichlet Allocation (LDA)

Bag of words: dictionary and corpus

from gensim import corpora

# Create dictionary number of times a word appears
dictionary = corpora.Dictionary(cleaned_emails)

# Filter out (non)frequent words 
dictionary.filter_extremes(no_below=5, keep_n=50000)

# Create corpus
corpus = [dictionary.doc2bow(text) for text in cleaned_emails]

Latent Dirichlet Allocation (LDA) with gensim

import gensim
# Define the LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 3, 
id2word=dictionary, passes=15)

# Print the three topics from the model with top words
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, 0.029*"email" + 0.016*"send" + 0.016*"results" + 0.016*"invoice")
(1, 0.026*"price" + 0.026*"work" + 0.026*"management" + 0.026*"sell")
(2, 0.029*"distribute" + 0.029*"contact" + 0.016*"supply" + 0.016*"fast")

Let's practice!

Fraud Detection in Python