Classification modeling

Introduction to Natural Language Processing in R

Kasey Jones

Research Data Scientist

Recap of the steps

1. Clean/prepare data
   - Filter to Boxer/Napoleon sentences
   - Create cleaned tokens of the words
   - Create a document-term matrix with TF-IDF weighting
2. Create training and testing datasets
3. Train a model on the training dataset
4. Report accuracy on the testing dataset
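
To make the TF-IDF weighting in step 1 concrete, here is a minimal base-R sketch on a toy corpus. The three sentences and the names docs, vocab, counts, tf, idf and tfidf are hypothetical and not from the course data. A natural-log IDF and a length-normalised term frequency are assumed. The course builds its matrix with its own tooling, so treat this only as an illustration of the weighting idea.

```r
# Toy corpus: hypothetical sentences, not taken from the course data.
docs <- c(boxer1    = "boxer works harder",
          boxer2    = "boxer works",
          napoleon1 = "napoleon is always right")

# Tokenise: lower-case and split on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary and raw term counts (documents in rows, terms in columns).
vocab  <- sort(unique(unlist(tokens)))
counts <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# Term frequency: each count divided by the length of its document.
tf <- counts / rowSums(counts)

# Inverse document frequency (assumed form): ln(n_docs / n_docs_with_term).
idf <- log(nrow(counts) / colSums(counts > 0))

# TF-IDF: scale each term column of tf by that term's idf.
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

Terms that appear in fewer documents (such as "harder") receive a larger weight than terms shared across documents (such as "boxer"), which is what lets the classifier focus on distinguishing words.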

Step 2: split the data

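# Fix the random seed so the split is reproducible
set.seed(1111)
# Use 80% of the rows for training
sample_size <- floor(0.80 * nrow(animal_matrix))
train_ind <- sample(nrow(animal_matrix), size = sample_size)

# Rows in train_ind form the training set; all remaining rows form the test set
train <- animal_matrix[train_ind, ]
test <- animal_matrix[-train_ind, ]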
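
The split above can be sanity-checked on a stand-in matrix. In this sketch, toy_matrix and test_ind are hypothetical names, and the 150-row size is an assumption chosen to match the 120 training and 30 test sentences that appear in the confusion matrices on the later slides.

```r
# Hypothetical stand-in for animal_matrix: 150 rows, 5 columns.
set.seed(1111)
toy_matrix <- matrix(runif(150 * 5), nrow = 150)

# Same 80/20 split logic as on the slide.
sample_size <- floor(0.80 * nrow(toy_matrix))
train_ind <- sample(nrow(toy_matrix), size = sample_size)

train <- toy_matrix[train_ind, ]
test <- toy_matrix[-train_ind, ]

# Row indices that ended up in the test set.
test_ind <- setdiff(seq_len(nrow(toy_matrix)), train_ind)

nrow(train)                             # 120
nrow(test)                              # 30
length(intersect(train_ind, test_ind))  # 0: no row is in both sets
```

Because sample() draws without replacement by default, negative indexing with -train_ind returns exactly the rows that were not sampled.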

Random forest models

Machine Learning with Tree-Based Models in R

Classification example

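library(randomForest)
# Fit a 50-tree forest; y must be a factor for classification
rfc <- randomForest(x = as.data.frame(as.matrix(train)), 
                    y = animal_sentences$Name[train_ind], ntree = 50)
# Print the out-of-bag (OOB) error estimate and confusion matrix
rfc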

Call:
 randomForest(...
        OOB estimate of  error rate: 23.33%
Confusion matrix:
         boxer napoleon class.error
boxer       37       20   0.3508772
napoleon     8       55   0.1269841

The confusion matrix

Call:
 randomForest(...
        OOB estimate of  error rate: 23.33%
Confusion matrix:
         boxer napoleon class.error
boxer       37       20   0.3508772
napoleon     8       55   0.1269841

Accuracy: (37 + 55) / (37 + 20 + 8 + 55) = 92 / 120 ≈ 76.67% (that is, 100% minus the 23.33% OOB error rate)
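
The accuracy arithmetic can be reproduced in R by retyping the printed confusion matrix. This is a small sketch: oob_cm, accuracy and error_rate are illustrative names, and the counts are copied by hand from the output above rather than extracted from the fitted rfc object.

```r
# Out-of-bag confusion matrix copied from the printed output
# (rows = actual name, columns = predicted name).
oob_cm <- matrix(c(37, 20,
                    8, 55),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(actual    = c("boxer", "napoleon"),
                                 predicted = c("boxer", "napoleon")))

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(oob_cm)) / sum(oob_cm)
error_rate <- 1 - accuracy

round(100 * accuracy, 2)    # 76.67
round(100 * error_rate, 2)  # 23.33, matching the OOB estimate
```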

Test set predictions

y_pred <- predict(rfc, newdata = as.data.frame(as.matrix(test)))

table(animal_sentences[-train_ind, ]$Name, y_pred)

          y_pred
           boxer napoleon
  boxer       14        4
  napoleon     2       10

Accuracy for boxer: 14/18 ≈ 78%
Accuracy for napoleon: 10/12 ≈ 83%
Overall accuracy: 24/30 = 80%
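
The per-class and overall figures can be computed the same way from the test-set table. In this sketch, test_cm, per_class and overall are illustrative names, and the counts are retyped from the table above (rows = actual name, columns = predicted name).

```r
# Test-set confusion matrix copied from the table() output.
test_cm <- matrix(c(14,  4,
                     2, 10),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(actual    = c("boxer", "napoleon"),
                                  predicted = c("boxer", "napoleon")))

# Per-class accuracy: correct predictions divided by each row total.
per_class <- diag(test_cm) / rowSums(test_cm)

# Overall accuracy: all correct predictions divided by all test sentences.
overall <- sum(diag(test_cm)) / sum(test_cm)

round(per_class, 3)  # boxer 0.778, napoleon 0.833
overall              # 0.8
```

With the objects from the slides, mean(y_pred == animal_sentences$Name[-train_ind]) should give the same overall figure, assuming Name is a factor with the same levels as y_pred (or a character vector).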

Classification practice
