Topic modeling on fraud

Fraud Detection in Python

Charlotte Werger

Data Scientist

Topic modeling: discover hidden patterns in text

Discovers the topics in text data
"What is this text about?"
Conceptually similar to clustering data
Compare the topics of fraud cases with those of non-fraud cases and use them as features or flags
Or: are there particular topics in the data that point to fraud?

Latent Dirichlet Allocation (LDA)

With LDA you obtain:

A "topics per text item" model (a probability for each topic in each text)
A "words per topic" model (a weight for each word in each topic)

Building your topic model:

Clean your data
Create a bag-of-words with a dictionary and a corpus
Feed the dictionary and corpus into the LDA model
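The cleaning step is not spelled out on the slides. As a minimal sketch, assuming cleaning means lowercasing, tokenizing, and dropping stopwords and very short tokens, it could look like the following. The names `clean_email`, `raw_emails` and the tiny `STOPWORDS` set are hypothetical and only serve the illustration.

```python
import re

# Tiny illustrative stopword list (an assumption; a real pipeline would use a fuller list)
STOPWORDS = {"the", "a", "an", "and", "to", "of", "for", "is", "in", "on", "please"}


def clean_email(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords and tokens of 1-2 letters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS and len(tok) > 2]


# Hypothetical raw input: one string per email
raw_emails = [
    "Please send the invoice to management.",
    "The price of the supply contract is fixed!",
]

# One list of tokens per email, the shape the later slides expect for cleaned_emails
cleaned_emails = [clean_email(email) for email in raw_emails]
print(cleaned_emails)
```

The result is a list of token lists, which is the input format the dictionary and corpus steps work on.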

Latent Dirichlet Allocation (LDA)

Bag-of-words: dictionary and corpus

from gensim import corpora

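# Create the dictionary: maps each word to an integer id and tracks word frequencies
dictionary = corpora.Dictionary(cleaned_emails)

# Filter out rare and very frequent words:
# no_below=5 drops words that appear in fewer than 5 emails,
# keep_n=50000 keeps at most the 50,000 most frequent remaining words
dictionary.filter_extremes(no_below=5, keep_n=50000)

# Create the corpus: one bag-of-words, a list of (word_id, count) pairs, per email
corpus = [dictionary.doc2bow(text) for text in cleaned_emails]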
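To see what the dictionary and corpus contain, here is a plain-Python sketch of the same idea on a toy input. It is not gensim's implementation: the names `token2id` and `to_bow` are hypothetical, and the ids gensim assigns may differ from the ones below.

```python
from collections import Counter

# Toy stand-in for cleaned_emails: one list of tokens per email
cleaned_emails = [
    ["send", "invoice", "send", "results"],
    ["price", "invoice", "sell"],
]

# "Dictionary": map each distinct token to an integer id (order of first appearance)
token2id = {}
for doc in cleaned_emails:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Turn a list of tokens into a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


# "Corpus": one bag-of-words per email
corpus = [to_bow(doc) for doc in cleaned_emails]

print(token2id)
print(corpus)
```

Word order is discarded: each email is reduced to which words occur and how often, which is all the LDA model needs.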

Latent Dirichlet Allocation (LDA) with gensim

import gensim

# Define the LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3,
                                           id2word=dictionary, passes=15)

# Print the three topics from the model with top words
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.029*"email" + 0.016*"send" + 0.016*"results" + 0.016*"invoice"')
(1, '0.026*"price" + 0.026*"work" + 0.026*"management" + 0.026*"sell"')
(2, '0.029*"distribute" + 0.029*"contact" + 0.016*"supply" + 0.016*"fast"')
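The earlier idea of comparing topics between fraud and non-fraud cases can be sketched as follows. This is a minimal illustration under assumptions: the per-email topic probabilities are hard-coded here, whereas with a fitted model they could come from something like `ldamodel.get_document_topics(bow, minimum_probability=0.0)`. The labels in `is_fraud`, the helper names, and all numbers are hypothetical.

```python
def to_feature_row(doc_topics, num_topics):
    """Expand sparse (topic_id, probability) pairs into a dense row of length num_topics."""
    row = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        row[topic_id] = prob
    return row


def mean_topic_profile(rows):
    """Average the topic probabilities column by column."""
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical topic probabilities for four emails and three topics
doc_topics_per_email = [
    [(0, 0.75), (1, 0.25)],
    [(1, 1.0)],
    [(0, 0.25), (2, 0.75)],
    [(2, 1.0)],
]
is_fraud = [0, 0, 1, 1]  # hypothetical labels: 1 = fraud, 0 = non-fraud
num_topics = 3

# Dense rows like these could be appended to other features of a fraud model
features = [to_feature_row(d, num_topics) for d in doc_topics_per_email]

fraud_profile = mean_topic_profile([r for r, y in zip(features, is_fraud) if y == 1])
normal_profile = mean_topic_profile([r for r, y in zip(features, is_fraud) if y == 0])

print("fraud:    ", fraud_profile)
print("non-fraud:", normal_profile)
```

A topic whose average probability is much higher among the fraud cases than among the rest is a candidate flag, to be checked on more data before it is relied on.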

Let's practice!
