TfIdf: More ways to transform text

Sentiment Analysis in Python

Violeta Misheva

Data Scientist

What are the components of TfIdf?

TF (term frequency): How often a given word appears within a document in the corpus
IDF (inverse document frequency): Log-ratio between the total number of documents and the number of documents that contain a specific word
- Gives a higher weight to words that occur in few documents

TfIdf = term frequency * inverse document frequency
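To make the product above concrete, here is a minimal sketch in plain Python that computes it by hand. The three-sentence corpus and the helper names are made up for illustration. Note that scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each row by default, so its numbers will differ from this textbook version.

```python
import math
from collections import Counter

# Hypothetical toy corpus, made up for illustration.
corpus = [
    "the flight was late",
    "the flight was great",
    "the crew was great great",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

def term_frequency(term, doc):
    """Share of the tokens in `doc` that are equal to `term`."""
    return Counter(doc)[term] / len(doc)

def inverse_document_frequency(term):
    """Log-ratio of the number of documents to the number containing `term`.

    Assumes `term` occurs in at least one document (no smoothing).
    """
    n_containing = sum(term in doc for doc in docs)
    return math.log(n_docs / n_containing)

def tfidf(term, doc):
    """Textbook TfIdf: term frequency * inverse document frequency."""
    return term_frequency(term, doc) * inverse_document_frequency(term)

# "the" appears in every document, so its idf is log(1) = 0.
print(tfidf("the", docs[0]))
# "late" appears in a single document, so it gets a positive weight.
print(tfidf("late", docs[0]))
```

A word that shows up everywhere ends up with weight zero, while a word specific to one document keeps a positive weight.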

BOW does not account for the length of a document; TfIdf does.
TfIdf is likely to highlight words that are common within a document but not across documents.
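The point about document length can be checked with a small sketch. It assumes scikit-learn's defaults, where TfidfVectorizer scales each row to unit L2 norm; the three toy documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy documents: the second repeats the first five times.
short_doc = "great flight"
long_doc = "great flight " * 5
other_doc = "late crew"
toy_docs = [short_doc, long_doc, other_doc]

bow = CountVectorizer().fit_transform(toy_docs).toarray()
tfidf_rows = TfidfVectorizer().fit_transform(toy_docs).toarray()

# BOW counts grow with the length of the document.
print(bow[0], bow[1])
# With the default L2 normalization, the short and long documents
# get the same TfIdf representation.
print(np.allclose(tfidf_rows[0], tfidf_rows[1]))
```

With BOW the longer document looks five times "stronger" on the same words, whereas the normalized TfIdf rows are identical.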

# Import the TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Arguments of TfidfVectorizer: max_features, ngram_range, stop_words, token_pattern, max_df, min_df

vect = TfidfVectorizer(max_features=100).fit(tweets.text)
X = vect.transform(tweets.text)

X
<14640x100 sparse matrix of type '<class 'numpy.float64'>'
    with 119182 stored elements in Compressed Sparse Row format>

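# Convert the sparse matrix X to a DataFrame (get_feature_names() was removed in scikit-learn 1.2)
import pandas as pd
X_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
X_df.head()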

top 5 rows of data created with TfIdf
