Introduction to NLP feature engineering

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

Numerical data

Iris dataset

sepal length  sepal width  petal length  petal width  class
6.3           2.9          5.6           1.8          Iris-virginica
4.9           3.0          1.4           0.2          Iris-setosa
5.6           2.9          3.6           1.3          Iris-versicolor
6.0           2.7          5.1           1.6          Iris-versicolor
7.2           3.6          6.1           2.5          Iris-virginica

One-hot encoding

sex
female
male
female
male
female
...

One-hot encoding

sex     sex_female  sex_male
female  1           0
male    0           1
female  1           0
male    0           1
female  1           0
...     ...         ...

One-hot encoding with pandas

# Import the pandas library
import pandas as pd

# Perform one-hot encoding on the 'sex' feature of df
df = pd.get_dummies(df, columns=['sex'])
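
The snippet above assumes a DataFrame named df already exists. A minimal runnable sketch follows, using a hypothetical five-row df that mirrors the sex column shown earlier. The dtype=int argument is an addition here so the output shows 1/0, as in the table, rather than True/False.

```python
import pandas as pd

# Hypothetical sample data mirroring the 'sex' column from the slide
df = pd.DataFrame({"sex": ["female", "male", "female", "male", "female"]})

# Replace 'sex' with one indicator column per category.
# dtype=int is an assumption made here to get 1/0 instead of booleans.
encoded = pd.get_dummies(df, columns=["sex"], dtype=int)

print(encoded)
```

Each distinct value in the column becomes its own feature, named with the original column as a prefix (sex_female, sex_male).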

Textual data

Movie Review Dataset

review                                             class
This movie is for dog lovers. A very poignant...   positive
The movie is forgettable. The plot lacked...       negative
A truly amazing movie about dogs. A gripping...    positive

Text pre-processing

  • Converting to lowercase
    • Example: "Reduction" to "reduction"
  • Converting to base form
    • Example: "reduction" to "reduce"
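
The two steps above can be sketched in a few lines. Lowercasing uses Python's built-in str.lower. The base-form step is only illustrated with a hypothetical lookup table (BASE_FORMS) holding the single pair from the example; in practice an NLP library would supply this mapping.

```python
# Hypothetical lookup table standing in for a real base-form reducer;
# it contains only the pair used in the example above.
BASE_FORMS = {"reduction": "reduce"}


def preprocess(text):
    """Lowercase the text, then map each word to its base form if known."""
    words = text.lower().split()
    # Words missing from the table are returned unchanged
    return [BASE_FORMS.get(word, word) for word in words]


tokens = preprocess("Reduction")
print(tokens)
```

Lowercasing comes first so that "Reduction" and "reduction" hit the same table entry.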

Vectorization

review                                             class
This movie is for dog lovers. A very poignant...   positive
The movie is forgettable. The plot lacked...       negative
A truly amazing movie about dogs. A gripping...    positive

Vectorization

0     1     2     ...  n     class
0.03  0.71  0.00  ...  0.22  positive
0.45  0.00  0.03  ...  0.19  negative
0.14  0.18  0.00  ...  0.45  positive
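
The slides do not say which weighting produced the numbers above. As a rough illustration of the idea only, the sketch below turns each review into a vector of term frequencies (word count divided by review length). This is a simpler stand-in, not necessarily the scheme behind the table. The helper names tokenize and vectorize are hypothetical, and only the opening sentences visible in the dataset are used.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and strip common punctuation from each word."""
    stripped = (word.strip(".,!?").lower() for word in text.split())
    return [word for word in stripped if word]


def vectorize(documents):
    """Return a sorted vocabulary and one term-frequency vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[word] / total for word in vocabulary])
    return vocabulary, vectors


# Only the first sentence of each review, as far as the dataset shows it
reviews = [
    "This movie is for dog lovers.",
    "The movie is forgettable.",
]
vocabulary, vectors = vectorize(reviews)

print(vocabulary)
for vector in vectors:
    print([round(value, 2) for value in vector])
```

Every review maps to a vector of the same length, one dimension per vocabulary word, which is what lets a model consume text as numerical features.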

Basic features

  • Number of words
  • Number of characters
  • Average length of words
  • Tweets

Silverado Records Tweet
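
The first three features in the list can be computed with plain Python. A minimal sketch, assuming words are separated by whitespace and that the character count includes spaces; the function name basic_features is hypothetical.

```python
def basic_features(text):
    """Compute word count, character count, and average word length."""
    words = text.split()  # assumption: whitespace-separated words
    num_words = len(words)
    num_chars = len(text)  # counts spaces and punctuation too
    # Guard against empty input to avoid dividing by zero
    avg_word_length = sum(len(w) for w in words) / num_words if words else 0.0
    return {
        "num_words": num_words,
        "num_chars": num_chars,
        "avg_word_length": avg_word_length,
    }


features = basic_features("I have a dog")
print(features)
# {'num_words': 4, 'num_chars': 12, 'avg_word_length': 2.25}
```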


POS tagging

Word POS
I Pronoun
have Verb
a Article
dog Noun

Named Entity Recognition

  • Does a noun refer to a person, an organization, or a country?

A person, a country's flag and the logo of TED

Noun NER
Brian Person
DataCamp Organization

Concepts covered

  • Text Preprocessing
  • Basic Features
  • Word Features
  • Vectorization

Let's practice!
