Topic modeling on fraud

Fraud Detection in Python

Charlotte Werger

Data Scientist

Topic modeling: discover hidden patterns in text data

  1. Discovers topics in text data
  2. Answers the question "What is the text about?"
  3. Conceptually similar to clustering data
  4. Compare the topics of fraud cases to those of non-fraud cases and use them as a feature or flag
  5. Or: is there a particular topic in the data that seems to point to fraud?
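
As a rough illustration of point 4, the sketch below compares how often each topic dominates among fraud and non-fraud cases. It assumes a dominant topic has already been assigned to each email; the function name `topic_share_by_label` and the data are hypothetical and not part of the course material.

```python
from collections import Counter, defaultdict

def topic_share_by_label(dominant_topics, labels):
    """Return {label: {topic_id: share of that label's documents}}."""
    counts = defaultdict(Counter)
    for topic, label in zip(dominant_topics, labels):
        counts[label][topic] += 1
    return {
        label: {topic: n / sum(c.values()) for topic, n in c.items()}
        for label, c in counts.items()
    }

# Hypothetical data: dominant topic per email and a fraud flag (1 = fraud)
dominant_topics = [0, 2, 2, 1, 2, 0, 1, 2]
labels = [0, 1, 1, 0, 1, 0, 0, 0]

shares = topic_share_by_label(dominant_topics, labels)
for label, dist in sorted(shares.items()):
    print(label, dist)
```

A topic that takes a much larger share among fraud cases than among non-fraud cases is a candidate feature or flag.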

Latent Dirichlet Allocation (LDA)

With LDA you obtain:

  1. A "topics per text item" model: the probability of each topic for each text item
  2. A "words per topic" model: the probability of each word within each topic

Creating your own topic model:

  1. Clean your data
  2. Create a bag of words with dictionary and corpus
  3. Feed dictionary and corpus into the LDA model
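
Step 1 is not shown on the slides. A minimal cleaning sketch, using only the standard library, might look like the following. The stopword list, the `clean_text` helper and the sample emails are all illustrative assumptions; only the name `cleaned_emails` comes from the gensim code in this deck.

```python
import re

# Tiny illustrative stopword list; a real project would use a fuller one
STOPWORDS = {"the", "a", "an", "and", "to", "of", "is", "for", "in", "please"}

def clean_text(text, stopwords=STOPWORDS, min_len=3):
    """Lowercase, keep alphabetic tokens, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

# Hypothetical raw emails
emails = [
    "Please send the invoice to management.",
    "Results of the price review: sell 100 units!",
]

# One list of tokens per email, the shape the later steps expect
cleaned_emails = [clean_text(e) for e in emails]
print(cleaned_emails)
```

Real pipelines often add lemmatization or stemming as well; this sketch leaves those out to stay dependency-free.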

Bag of words: dictionary and corpus

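from gensim import corpora
# Build a dictionary that maps each word to an id and tracks how often it appears
dictionary = corpora.Dictionary(cleaned_emails)
# Drop words found in fewer than 5 emails or in more than half of them
# (the default no_above=0.5), then keep at most the 50,000 most frequent
dictionary.filter_extremes(no_below=5, keep_n=50000)
# Convert each email to a bag of words: a list of (word id, count) pairs
corpus = [dictionary.doc2bow(text) for text in cleaned_emails]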
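
To make the bag-of-words idea concrete, here is a pure-Python sketch of the same representation: each email becomes a list of (word id, count) pairs. This is an illustration only, not gensim's implementation; `build_vocab` and `to_bow` are hypothetical helpers, and the ids gensim assigns may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for tokens present in the vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Hypothetical tokenized emails
docs = [["send", "invoice", "send"], ["invoice", "price"]]

vocab = build_vocab(docs)
bow_corpus = [to_bow(d, vocab) for d in docs]
print(vocab)
print(bow_corpus)
```

Word order is discarded; only which words occur, and how often, is kept.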

Latent Dirichlet Allocation (LDA) with gensim

import gensim
# Define the LDA model: 3 topics, 15 training passes over the corpus
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3,
                                           id2word=dictionary, passes=15)
# Print each of the three topics with its four highest-weighted words
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)
(0, 0.029*"email" + 0.016*"send" + 0.016*"results" + 0.016*"invoice")
(1, 0.026*"price" + 0.026*"work" + 0.026*"management" + 0.026*"sell")
(2, 0.029*"distribute" + 0.029*"contact" + 0.016*"supply" + 0.016*"fast")
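
The "topics per text item" side of the model can be turned into features. With gensim, `ldamodel.get_document_topics(bow)` returns a list of (topic id, probability) pairs for one document and omits topics below a minimum probability. The sketch below assumes that shape and uses made-up probabilities; `topic_vector` and `dominant_topic` are hypothetical helpers.

```python
def topic_vector(doc_topics, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list."""
    vec = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        vec[topic_id] = prob
    return vec

def dominant_topic(doc_topics):
    """Return the id of the topic with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])[0]

# Hypothetical result for one email, in the shape gensim returns;
# topic 1 is missing because it fell below the probability cutoff
doc_topics = [(0, 0.08), (2, 0.91)]

features = topic_vector(doc_topics, num_topics=3)
top = dominant_topic(doc_topics)
print(features, top)
```

The fixed-length vector can be appended to other features in a fraud model, and the dominant topic can serve as a simple flag.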

Let's practice!
