TfIdf: More ways to transform text

Sentiment Analysis in Python

Violeta Misheva

Data Scientist

What are the components of TfIdf?

  • TF (term frequency): how often a given word appears within a document in the corpus

  • IDF (inverse document frequency): log-ratio between the total number of documents and the number of documents that contain a specific word

    • Used to give a higher weight to words that occur in few documents
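
The two components can be sketched by hand on a toy corpus. This is a minimal illustration using the plain textbook definitions given above; the three example documents and the helper names `term_frequency` and `inverse_document_frequency` are made up for this sketch. scikit-learn applies a smoothed IDF and normalization by default, so its numbers will differ.

```python
import math

# Hypothetical toy corpus: three tokenized "tweets"
docs = [
    "united flight delayed".split(),
    "united check in slow".split(),
    "united crew friendly".split(),
]

def term_frequency(word, doc):
    # Share of the document's tokens that are this word
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    # Log-ratio of total documents to documents containing the word
    # (assumes the word occurs in at least one document)
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_containing)

tf_delayed = term_frequency("delayed", docs[0])
idf_united = inverse_document_frequency("united", docs)    # in every document
idf_delayed = inverse_document_frequency("delayed", docs)  # in one document

print(tf_delayed, idf_united, idf_delayed)
```

A word that appears in every document gets an IDF of zero under this definition, while a word found in a single document gets the largest IDF.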

TfIdf score of a word

  • TfIdf score:
TfIdf = term frequency * inverse document frequency
  • BOW does not account for the length of a document; TfIdf does.
  • TfIdf is likely to capture words that are common within a document but rare across documents.
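
The point about document length can be checked with a small experiment. The sketch below assumes scikit-learn's default settings, where `TfidfVectorizer` scales each row to unit length (L2 norm); the two example documents are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical documents: same words, but the second is three times longer
docs = [
    "late flight",
    "late flight late flight late flight",
]

# Bag-of-words: raw counts grow with document length
bow = CountVectorizer().fit_transform(docs).toarray()

# TfIdf with default settings: each row is normalized to unit length
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

print(bow)
print(tfidf)

same_bow = np.array_equal(bow[0], bow[1])
same_tfidf = np.allclose(tfidf[0], tfidf[1])
```

Under these assumptions the BOW rows differ, while the TfIdf rows come out the same, since repeating the text changes only the length of the document and not the mix of words.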

How is TfIdf useful?

Twitter airline sentiment
  • Low TfIdf scores: United, Virgin America (airline names that occur in many tweets)
  • High TfIdf scores: check-in process (if rare across documents)
More on TfIdf
  • Because it penalizes words that are frequent across documents, there is less need to deal with stop words explicitly.
  • Useful in search and information retrieval for ranking returned results by relevance.
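
One common way to use TfIdf for ranking is to vectorize the query with the same vocabulary as the documents and sort by cosine similarity. The sketch below is an illustration under that assumption, not a full retrieval system; the documents and the query are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical document collection and search query
documents = [
    "the flight was late",
    "the check in process was slow",
    "the crew was friendly",
]
query = "check in process"

# Learn the vocabulary and IDF weights from the documents only
vect = TfidfVectorizer().fit(documents)
doc_vectors = vect.transform(documents)
query_vector = vect.transform([query])

# Similarity between the query and every document
scores = cosine_similarity(query_vector, doc_vectors).ravel()

# Document indices sorted from most to least relevant
ranking = scores.argsort()[::-1]
best = documents[ranking[0]]
print(best, scores)
```

Documents that share no terms with the query score zero, so only the check-in document is returned as relevant here.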

TfIdf in Python

# Import the TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
  • Arguments of TfidfVectorizer: max_features, ngram_range, stop_words, token_pattern, max_df, min_df
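# Fit the vectorizer on the tweet text, keeping the 100 most frequent terms
vect = TfidfVectorizer(max_features=100).fit(tweets.text)
# Transform the text into a sparse TfIdf matrix
X = vect.transform(tweets.text)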

TfidfVectorizer

X
<14640x100 sparse matrix of type '<class 'numpy.float64'>'
    with 119182 stored elements in Compressed Sparse Row format>
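import pandas as pd
# Convert the sparse matrix to a dense DataFrame with one column per term
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
X_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
X_df.head()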

[Figure: top 5 rows of the data created with TfIdf]


Let's practice!
