Basiskenmerken extraheren

Feature Engineering voor NLP in Python

Rounak Banik

Data Scientist

Aantal tekens

"I don't know." # 13 characters
# Compute the number of characters 
text = "I don't know."
num_char = len(text)

# Print the number of characters
print(num_char)
13
# Create a 'num_chars' feature
df['num_chars'] = df['review'].apply(len)
Feature Engineering voor NLP in Python

Aantal woorden

# Split the string into words
text = "Mary had a little lamb."
words = text.split()

# Print the list containing words
print(words)
['Mary', 'had', 'a', 'little', 'lamb.']
# Print number of words
print(len(words))
5
Feature Engineering voor NLP in Python

Aantal woorden

# Function that returns number of words in string
def word_count(string):
    # Split the string into words
    words = string.split()

    # Return length of words list
    return len(words)
# Create num_words feature in df
df['num_words'] = df['review'].apply(word_count)

Feature Engineering voor NLP in Python

Gemiddelde woordlengte

#Function that returns average word length
def avg_word_length(x):

# Split the string into words words = x.split()
# Compute length of each word and store in a separate list word_lengths = [len(word) for word in words]
# Compute average word length avg_word_length = sum(word_lengths)/len(words)
# Return average word length return(avg_word_length)
Feature Engineering voor NLP in Python

Gemiddelde woordlengte

# Create a new feature avg_word_length
df['avg_word_length'] = df['review'].apply(doc_density)
Feature Engineering voor NLP in Python

Speciale kenmerken

 

Datacamp Tweet

Feature Engineering voor NLP in Python

Hashtags en mentions

# Function that returns number of hashtags
def hashtag_count(string):

# Split the string into words words = string.split()
# Create a list of hashtags hashtags = [word for word in words if word.startswith('#')]
# Return number of hashtags return len(hashtags)
hashtag_count("@janedoe This is my first tweet! #FirstTweet #Happy")
2
Feature Engineering voor NLP in Python

Andere kenmerken

  • Aantal zinnen
  • Aantal alinea's
  • Woorden die met een hoofdletter beginnen
  • Woorden in hoofdletters
  • Numerieke waarden
Feature Engineering voor NLP in Python

Laten we oefenen!

Feature Engineering voor NLP in Python

Preparing Video For Download...