Introduction to Natural Language Processing in R
Kasey Jones
Research Data Scientist
Clean/prepare data
Create training and testing datasets
set.seed(1111)
sample_size <- floor(0.80 * nrow(animal_matrix))
train_ind <- sample(nrow(animal_matrix), size = sample_size)
train <- animal_matrix[train_ind, ]
test <- animal_matrix[-train_ind, ]
library(randomForest)
# Fit a random forest with 50 trees (the argument is `ntree`, all lowercase)
rfc <- randomForest(x = as.data.frame(as.matrix(train)),
                    y = animal_sentences$Name[train_ind],
                    ntree = 50)
rfc
Call:
randomForest(...
OOB estimate of error rate: 23.33%
Confusion matrix:
         boxer napoleon class.error
boxer       37       20   0.3508772
napoleon     8       55   0.1269841
Accuracy: (37 + 55) / (37 + 20 + 8 + 55) = 92 / 120 ≈ 76.7%
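The accuracy and the class.error column can be reproduced from the confusion counts alone. The sketch below is a minimal illustration that assumes the counts are typed in by hand from the printed output; the names oob_counts, oob_accuracy, and class_error are made up for this example and are not part of the course code.

```r
# OOB confusion counts copied from the printed output
# (rows = actual class, columns = predicted class).
# `oob_counts` is a hypothetical name used only for this sketch.
oob_counts <- matrix(c(37, 20,
                        8, 55),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(actual    = c("boxer", "napoleon"),
                                     predicted = c("boxer", "napoleon")))

# Accuracy = correctly classified sentences (diagonal) / all sentences
oob_accuracy <- sum(diag(oob_counts)) / sum(oob_counts)

# Per-class error = 1 - (correct for the class / row total)
class_error <- 1 - diag(oob_counts) / rowSums(oob_counts)

print(round(oob_accuracy, 4))
print(round(class_error, 7))
```

The accuracy is one minus the OOB error rate, which is why 23.33% error corresponds to roughly 76.7% accuracy.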
y_pred <- predict(rfc, newdata = as.data.frame(as.matrix(test)))
table(animal_sentences[-train_ind, ]$Name, y_pred)
          y_pred
           boxer napoleon
  boxer       14        4
  napoleon     2       10
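Test-set accuracy can be read off this table in the same way: (14 + 10) / 30 = 80%. The sketch below is one possible way to compute it directly from label vectors. It assumes hypothetical vectors actual and predicted, rebuilt here only to match the counts shown; in the real workflow these would be animal_sentences[-train_ind, ]$Name and y_pred.

```r
# Hypothetical label vectors rebuilt to match the printed test-set counts:
# 18 boxer sentences (14 right, 4 wrong), 12 napoleon sentences (2 wrong, 10 right).
actual    <- rep(c("boxer", "napoleon"), times = c(18, 12))
predicted <- c(rep(c("boxer", "napoleon"), times = c(14, 4)),
               rep(c("boxer", "napoleon"), times = c(2, 10)))

# Cross-tabulate actual vs. predicted labels
test_table <- table(actual, y_pred = predicted)
print(test_table)

# Accuracy = share of sentences whose predicted label matches the actual label
test_accuracy <- mean(actual == predicted)
print(test_accuracy)
```

With only 30 test sentences, a single misclassification shifts this estimate by more than three percentage points, so the gap between the test accuracy and the OOB estimate should be read with caution.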