Feature Engineering for NLP in Python
Rounak Banik
Data Scientist
Iris dataset
| sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|
| 6.3 | 2.9 | 5.6 | 1.8 | Iris-virginica |
| 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 5.6 | 2.9 | 3.6 | 1.3 | Iris-versicolor |
| 6.0 | 2.7 | 5.1 | 1.6 | Iris-versicolor |
| 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |
| sex |
|---|
| female |
| male |
| female |
| male |
| female |
| ... |
| sex | one-hot encoding | sex_female | sex_male |
|---|---|---|---|
| female | → | 1 | 0 |
| male | → | 0 | 1 |
| female | → | 1 | 0 |
| male | → | 0 | 1 |
| female | → | 1 | 0 |
| ... | ... | ... | ... |

    # Import the pandas library
    import pandas as pd

    # Perform one-hot encoding on the 'sex' feature of df
    df = pd.get_dummies(df, columns=['sex'])

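
The snippet above assumes a DataFrame named `df` already exists. A minimal self-contained sketch follows, with a toy frame built from the five `sex` values shown in the table. The name `encoded` and the `dtype=int` argument are choices made for this illustration: recent pandas versions return boolean dummy columns by default, and `dtype=int` reproduces the 0/1 values shown on the slide.

```python
import pandas as pd

# Toy frame mirroring the 'sex' column from the table above (illustrative data only).
df = pd.DataFrame({"sex": ["female", "male", "female", "male", "female"]})

# One-hot encode 'sex'; dtype=int yields 0/1 instead of True/False.
encoded = pd.get_dummies(df, columns=["sex"], dtype=int)

# The original 'sex' column is replaced by one column per category.
print(encoded.columns.tolist())
print(encoded)
```

Each category becomes its own column, and every row has exactly one of those columns set to 1.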
Movie Review Dataset
| review | class |
|---|---|
| This movie is for dog lovers. A very poignant... | positive |
| The movie is forgettable. The plot lacked... | negative |
| A truly amazing movie about dogs. A gripping... | positive |
Lowercasing: "Reduction" to "reduction"

Base form: "reduction" to "reduce"

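
Text pre-processing steps such as lowercasing ("Reduction" to "reduction") and reducing a word to its base form ("reduction" to "reduce") can be sketched roughly as below. The `BASE_FORMS` dictionary and the `preprocess` function are hypothetical stand-ins invented for this illustration; a real pipeline would normally rely on an NLP library's lemmatizer or stemmer rather than a hand-written lookup table.

```python
# Hypothetical lookup table standing in for a real lemmatizer or stemmer.
BASE_FORMS = {"reduction": "reduce"}


def preprocess(word: str) -> str:
    """Lowercase a word, then map it to a base form if one is known."""
    lowered = word.strip().lower()
    # Words missing from the table are returned in lowercase, otherwise unchanged.
    return BASE_FORMS.get(lowered, lowered)


print(preprocess("Reduction"))
```

The point of both steps is that surface variants of the same word end up as a single feature instead of several.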
| 0 | 1 | 2 | ... | n | class |
|---|---|---|---|---|---|
| 0.03 | 0.71 | 0.00 | ... | 0.22 | positive |
| 0.45 | 0.00 | 0.03 | ... | 0.19 | negative |
| 0.14 | 0.18 | 0.00 | ... | 0.45 | positive |
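
The table above turns each review into a row of numbers, one column per vocabulary term. The slides do not say which weighting scheme produced those values, so the sketch below uses plain relative term frequency as an assumption and will not reproduce the numbers shown. The review strings are only the visible portions of the truncated reviews, and `tokenize` and `vectorize` are illustrative helper names.

```python
import re
from collections import Counter

# Visible portions of the three reviews (the originals are truncated in the table).
reviews = [
    "This movie is for dog lovers.",
    "The movie is forgettable.",
    "A truly amazing movie about dogs.",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(review) for review in reviews]

# One column per distinct term, in alphabetical order.
vocabulary = sorted({token for doc in tokenized for token in doc})


def vectorize(tokens):
    """Return the relative frequency of every vocabulary term in one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[term] / total for term in vocabulary]


vectors = [vectorize(doc) for doc in tokenized]

for vector in vectors:
    print([round(value, 2) for value in vector])
```

Every review maps to a vector of the same length, which is what lets a standard classifier consume text alongside the `class` label.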

| Word | POS |
|---|---|
| I | Pronoun |
| have | Verb |
| a | Article |
| dog | Noun |

| Noun | NER |
|---|---|
| Brian | Person |
| DataCamp | Organization |