Text-based similarities

Building Recommendation Engines in Python

Rob O'Callaghan

Director of Data

Working without clear attributes

Example of an item description from Amazon.


Term frequency–inverse document frequency

$$ \Large{\text{TF-IDF} = \frac{\text{Count of word occurrences}}{\text{Total words in document}} \times \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing word}}\right)} $$
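
To make the definition concrete, here is a minimal sketch that computes TF-IDF by hand in the common textbook form (term frequency multiplied by the log of total documents over documents containing the word). The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are assumptions made up for illustration. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the brave general fought the battle",
    "the hobbit lived in the shire",
    "the general met three witches",
]
tokenized = [doc.split() for doc in docs]


def tf(term, tokens):
    """Share of the document's words that are `term`."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Log of (total documents / documents containing `term`).

    Assumes `term` occurs in at least one document of `corpus`.
    """
    n_containing = sum(term in tokens for tokens in corpus)
    return math.log(len(corpus) / n_containing)


def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


# Scores for three words in the first document
score_the = tf_idf("the", tokenized[0], tokenized)          # in every document
score_general = tf_idf("general", tokenized[0], tokenized)  # in two documents
score_battle = tf_idf("battle", tokenized[0], tokenized)    # in one document
```

A word that appears in every document gets a score of zero, while a word unique to one document scores highest, which is why TF-IDF highlights the terms that distinguish an item.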


Our data

book_summary_df:

| Book              | Descriptions                                                                      |
|-------------------|-----------------------------------------------------------------------------------|
| The Hobbit        | "Bilbo Baggins lives a simple life with his fellow hobbits in the Shire..."       |
| The Great Gatsby  | "Set in Jazz Age New York, the novel tells the tragic story of Jay ..."           |
| A Game of Thrones | "15 years have passed since Robert's rebellion, with a nine-year war ..."         |
| Macbeth           | "A brave Scottish general receives a prophecy from three witches ..."             |
| ...               | ...                                                                               |

Instantiate the vectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
# Default settings; filtering arguments can be passed to the constructor
tfidfvec = TfidfVectorizer()

Filtering the data

from sklearn.feature_extraction.text import TfidfVectorizer
# min_df=2: keep only terms that appear in at least two documents
tfidfvec = TfidfVectorizer(min_df=2)

Filtering the data

from sklearn.feature_extraction.text import TfidfVectorizer
# max_df=0.7: also drop terms that appear in more than 70% of documents
tfidfvec = TfidfVectorizer(min_df=2, max_df=0.7)
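
As a rough illustration of what the two filters do, the sketch below fits one vectorizer with default settings and one with `min_df=2, max_df=0.7` on a made-up four-sentence corpus and compares the resulting vocabularies. The corpus `toy_docs` and the variable names are assumptions for this example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" is in every document,
# "general", "fought" and "hobbit" are each in exactly two.
toy_docs = [
    "the brave general fought the battle",
    "the hobbit fought the dragon",
    "the general met the witches",
    "the hobbit left the shire",
]

# No filtering: every token of two or more characters is kept
unfiltered = TfidfVectorizer()
unfiltered.fit(toy_docs)

# Keep terms found in at least 2 documents and in at most 70% of them
filtered = TfidfVectorizer(min_df=2, max_df=0.7)
filtered.fit(toy_docs)

unfiltered_vocab = sorted(unfiltered.vocabulary_)
filtered_vocab = sorted(filtered.vocabulary_)

print(len(unfiltered_vocab), "terms without filtering")
print(filtered_vocab)
```

Here `min_df=2` removes the words that occur in a single description, and `max_df=0.7` removes "the", which occurs in all four. What remains are terms shared by some, but not all, items, which are the ones most useful for comparing them.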

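Vectorizing the data

vectorized_data = tfidfvec.fit_transform(book_summary_df['Descriptions'])

# get_feature_names_out() requires scikit-learn 1.0 or later
print(tfidfvec.get_feature_names_out())
['age', 'ancient', 'angry', 'brave', 'battle', 'fellow', 'game', 'general', ...]
# fit_transform returns a sparse matrix; toarray() converts it to a dense array
print(vectorized_data.toarray())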
[[0.21,      0.53,    0.41,    0.64,     0.01,     0.02,     ...
 [0.31,      0.00,    0.42,    0.03,     0.00,     0.73,     ...
 [...,        ...,     ...,     ...,      ...,      ...,     ...

Formatting the data

import pandas as pd

# One row per book, one column per term
tfidf_df = pd.DataFrame(vectorized_data.toarray(),
                        columns=tfidfvec.get_feature_names_out())

tfidf_df.index = book_summary_df['Book']
print(tfidf_df)
                   | 'age'| 'ancient'| 'angry'| 'brave'| 'battle'| 'fellow'|...
|------------------|------|----------|--------|--------|---------|---------|...
| The Hobbit       |  0.21|      0.53|    0.41|    0.64|     0.01|     0.02|...
| The Great Gatsby |  0.31|      0.00|    0.42|    0.03|     0.00|     0.73|...
| A Game of Thrones|  0.61|      0.42|    0.77|    0.31|     0.83|     0.03|...
|               ...|   ...|       ...|     ...|     ...|      ...|      ...|...

Cosine similarity

Cosine similarity: $$\cos(\theta)=\frac{A \cdot B}{\|A\| \cdot \|B\|}$$
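
The formula can be checked by hand with NumPy. This is a minimal sketch, assuming two non-zero vectors of equal length; the helper name `cosine_sim` and the example vectors are invented for illustration and are not taken from the book data.

```python
import numpy as np


def cosine_sim(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical TF-IDF rows
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, twice as long
vec_c = [0.0, 0.0, 3.0]  # shares no non-zero term with vec_a

sim_ab = cosine_sim(vec_a, vec_b)
sim_ac = cosine_sim(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and vectors with no terms in common score 0. Because TF-IDF values are never negative, the similarity between two items stays between 0 and 1.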


Cosine similarity

from sklearn.metrics.pairwise import cosine_similarity

# Find similarity between all items
cosine_similarity_array = cosine_similarity(tfidf_df)
# Find similarity between two items
cosine_similarity(tfidf_df.loc['The Hobbit'].values.reshape(1, -1),
                  tfidf_df.loc['Macbeth'].values.reshape(1, -1))

Let's practice!
