Text-based similarities

Building Recommendation Engines in Python

Rob O'Callaghan

Director of Data

Working without clear attributes

Example item description from Amazon.

Term frequency–inverse document frequency

$$ \Large{\text{TF-IDF} = \frac{\text{Count of word occurrences}}{\text{Total words in document}} \times \log\left(\frac{\text{Total number of docs}}{\text{Number of docs word is in}}\right)} $$
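To make the definition concrete, here is a minimal hand-rolled sketch that multiplies a term's frequency within one document by the log of its inverse document frequency. The corpus and the helper names (`tf`, `idf`, `tf_idf`) are hypothetical and exist only for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math

# Hypothetical toy corpus: each document is a list of lowercase tokens.
docs = [
    ["brave", "general", "battle"],
    ["brave", "hobbit", "journey", "journey"],
    ["jazz", "age", "party"],
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (total documents / documents containing `term`).

    Assumes `term` appears in at least one document.
    """
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """Term frequency weighted by inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

# "brave" appears in two of three documents, "journey" in only one.
brave_score = tf_idf("brave", docs[1], docs)
journey_score = tf_idf("journey", docs[1], docs)
print(brave_score, journey_score)
```

The rarer, repeated word ("journey") ends up with the higher score, which is the behavior TF-IDF is meant to capture.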

Our data

book_summary_df:

Book	Descriptions
The Hobbit	"Bilbo Baggins lives a simple life with his fellow hobbits in the Shire..."
The Great Gatsby	"Set in Jazz Age New York, the novel tells the tragic story of Jay ..."
A Game of Thrones	"15 years have passed since Robert's rebellion, with a nine-year-long war ..."
Macbeth	"A brave Scottish general receives a prophecy from three witches ..."
...	...

Instantiate the vectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()

Filter the data

from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=2: keep only terms that appear in at least two documents
tfidfvec = TfidfVectorizer(min_df=2)

Filter the data

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer(min_df=2, max_df=0.7)
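A small sketch of what the two filters do, assuming a made-up four-sentence corpus (`toy_docs` is hypothetical, not the course data). `min_df=2` drops terms found in fewer than two documents, and `max_df=0.7` drops terms found in more than 70% of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for illustration.
toy_docs = [
    "the brave general fights a battle",
    "the brave hobbit leaves the shire",
    "the hobbit finds a ring",
    "the party lasts all night",
]

# No filtering: every token of two or more characters is kept.
unfiltered = TfidfVectorizer()
unfiltered.fit(toy_docs)

# Filtering: drop terms that are too rare (min_df) or too common (max_df).
filtered = TfidfVectorizer(min_df=2, max_df=0.7)
filtered.fit(toy_docs)

print(sorted(unfiltered.vocabulary_))
print(sorted(filtered.vocabulary_))  # ['brave', 'hobbit']
```

Here "the" is removed because it occurs in every document, and words such as "ring" or "party" are removed because they occur in only one.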

Vectorize the data

vectorized_data = tfidfvec.fit_transform(book_summary_df['Descriptions'])

# scikit-learn >= 1.0; older releases used get_feature_names()
print(tfidfvec.get_feature_names_out())

['age', 'ancient', 'angry', 'brave', 'battle', 'fellow', 'game', 'general', ...]

print(vectorized_data.toarray())

[[0.21,      0.53,    0.41,    0.64,     0.01,     0.02,     ...
 [0.31,      0.00,    0.42,    0.03,     0.00,     0.73,     ...
 [...,        ...,     ...,     ...,      ...,      ...,     ...

Format the data

import pandas as pd

tfidf_df = pd.DataFrame(vectorized_data.toarray(),
                        columns=tfidfvec.get_feature_names_out())

tfidf_df.index = book_summary_df['Book']

print(tfidf_df)

                   | 'age'| 'ancient'| 'angry'| 'brave'| 'battle'| 'fellow'|...
|------------------|------|----------|--------|--------|---------|---------|...
| The Hobbit       |  0.21|      0.53|    0.41|    0.64|     0.01|     0.02|...
| The Great Gatsby |  0.31|      0.00|    0.42|    0.03|     0.00|     0.73|...
| A Game of Thrones|  0.61|      0.42|    0.77|    0.31|     0.83|     0.03|...
|               ...|   ...|       ...|     ...|     ...|      ...|      ...|...

Cosine similarity

Cosine similarity: $$\cos(\theta)=\frac{A \cdot B}{\|A\| \cdot \|B\|}$$
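As a quick check of the formula, a minimal NumPy sketch computes the dot product divided by the product of the vector lengths. The function name `cosine_sim` and the example vectors are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.0])
same_direction = np.array([2.0, 4.0, 0.0])  # a scaled by 2
orthogonal = np.array([0.0, 0.0, 3.0])      # shares no nonzero component with a

sim_same = cosine_sim(a, same_direction)    # approximately 1.0
sim_orth = cosine_sim(a, orthogonal)        # 0.0
print(sim_same, sim_orth)
```

Vectors pointing the same way score close to 1 regardless of their length, while vectors with no terms in common score 0.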

Cosine similarity

from sklearn.metrics.pairwise import cosine_similarity

# Find similarity between all items
cosine_similarity_array = cosine_similarity(tfidf_df)

# Find similarity between two items
cosine_similarity(tfidf_df.loc['The Hobbit'].values.reshape(1, -1),
                  tfidf_df.loc['Macbeth'].values.reshape(1, -1))
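One optional way to make the all-pairs result easier to read is to wrap the array in a DataFrame labeled by title. The sketch below assumes a small invented table of TF-IDF scores (`toy_tfidf_df` is hypothetical) and shows that the pairwise lookup agrees with the two-item call.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF scores, invented for illustration.
toy_tfidf_df = pd.DataFrame(
    [[0.8, 0.6, 0.0],
     [0.6, 0.8, 0.0],
     [0.0, 0.0, 1.0]],
    index=["Book A", "Book B", "Book C"],
    columns=["brave", "battle", "jazz"],
)

# Similarity between every pair of rows (books).
similarity = cosine_similarity(toy_tfidf_df)

# Label rows and columns with the book titles for easy lookup.
similarity_df = pd.DataFrame(similarity,
                             index=toy_tfidf_df.index,
                             columns=toy_tfidf_df.index)
print(similarity_df.round(2))

# The two-item call returns a 1x1 array holding the same value.
pair = cosine_similarity(
    toy_tfidf_df.loc["Book A"].values.reshape(1, -1),
    toy_tfidf_df.loc["Book B"].values.reshape(1, -1),
)
print(pair[0, 0])
```

The `reshape(1, -1)` calls are needed because `cosine_similarity` expects 2-D inputs, with one row per item.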

Let's practice!
