Feature Engineering for NLP in Python
Rounak Banik
Data Scientist
Iris dataset
sepal length | sepal width | petal length | petal width | class |
---|---|---|---|---|
6.3 | 2.9 | 5.6 | 1.8 | Iris-virginica |
4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
5.6 | 2.9 | 3.6 | 1.3 | Iris-versicolor |
6.0 | 2.7 | 5.1 | 1.6 | Iris-versicolor |
7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |
sex |
---|
female |
male |
female |
male |
female |
... |
sex | one-hot encoding | sex_female | sex_male |
---|---|---|---|
female | → | 1 | 0 |
male | → | 0 | 1 |
female | → | 1 | 0 |
male | → | 0 | 1 |
female | → | 1 | 0 |
... | ... | ... | ... |
# Import the pandas library
import pandas as pd
# Perform one-hot encoding on the 'sex' feature of df
df = pd.get_dummies(df, columns=['sex'])
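To make the transformation in the table concrete, here is a minimal plain-Python sketch of one-hot encoding. It is not how pandas implements `get_dummies`; the helper name `one_hot` and its `prefix` parameter are assumptions made for illustration. Note that, depending on the pandas version, `get_dummies` may emit True/False columns rather than 1/0.

```python
def one_hot(values, prefix):
    """Return {column_name: [0 or 1, ...]} with one column per distinct category.

    Hypothetical helper for illustration; not part of pandas.
    """
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


# The 'sex' column from the table above (the elided "..." rows are left out).
sex = ["female", "male", "female", "male", "female"]

encoded = one_hot(sex, prefix="sex")
for name, column in encoded.items():
    print(name, column)
```

Each distinct category becomes its own column, and every row has exactly one 1 across those columns.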
Movie Review Dataset
review | class |
---|---|
This movie is for dog lovers. A very poignant... | positive |
The movie is forgettable. The plot lacked... | negative |
A truly amazing movie about dogs. A gripping... | positive |
Reduction → reduction (converting to lowercase)
reduction → reduce (converting to base form)
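The two steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `BASE_FORMS` lookup table and the `preprocess` function are hypothetical stand-ins for a real lemmatizer, which would cover a full vocabulary.

```python
import re

# Hypothetical lookup table standing in for a real lemmatizer.
# Only "reduction" -> "reduce" comes from the example above; the
# other entry is an assumption added for illustration.
BASE_FORMS = {
    "reduction": "reduce",
    "dogs": "dog",
}


def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens, map each to its base form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [BASE_FORMS.get(token, token) for token in tokens]


print(preprocess("Reduction"))
print(preprocess("A truly amazing movie about dogs."))
```

Words missing from the lookup table pass through unchanged, so the sketch degrades gracefully on unseen text.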
0 | 1 | 2 | ... | n | class |
---|---|---|---|---|---|
0.03 | 0.71 | 0.00 | ... | 0.22 | positive |
0.45 | 0.00 | 0.03 | ... | 0.19 | negative |
0.14 | 0.18 | 0.00 | ... | 0.45 | positive |
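One way to turn each review into a row of numbers like the table above is to count words against a shared vocabulary. The sketch below uses relative term frequency, which is an assumption: the slide does not say which weighting produced its values, so the numbers here will not match the table. Only the visible opening sentence of each review is used, and the names `tokenize` and `vectorize` are illustrative.

```python
import re
from collections import Counter

# Visible opening sentence of each review; the elided remainder is not guessed.
reviews = [
    "This movie is for dog lovers.",
    "The movie is forgettable.",
    "A truly amazing movie about dogs.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(review) for review in reviews]

# One column per distinct word across all reviews.
vocabulary = sorted({token for doc in tokenized for token in doc})


def vectorize(tokens):
    """Relative frequency of each vocabulary term in one document."""
    counts = Counter(tokens)
    return [counts[term] / len(tokens) for term in vocabulary]


vectors = [vectorize(doc) for doc in tokenized]

for vector in vectors:
    print([round(weight, 2) for weight in vector])
```

Every review becomes a vector of the same length, which is what lets a standard classifier consume text alongside the `class` label.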
Word | POS |
---|---|
I | Pronoun |
have | Verb |
a | Article |
dog | Noun |
Noun | NER |
---|---|
Brian | Person |
DataCamp | Organization |