Preparing text for modeling

Introduction to Natural Language Processing in R

Kasey Jones

Research Data Scientist

Supervised learning in R: classification

  • Supervised Learning in R: Classification

Classification modeling

  • a supervised learning approach
  • classifies observations into categories
    • win/loss
    • dangerous, friendly, or indifferent
  • can use a variety of techniques:
    • logistic regression
    • decision trees/random forest/xgboost
    • neural networks
    • etc.

Basic modeling steps

  1. Clean/prepare the data
  2. Create training and test sets
  3. Train a model on the training set
  4. Report accuracy on the test set
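
The four steps above can be sketched in base R. This is only an illustration: the built-in iris data stands in for the course data, and the 80/20 split, the seed, and the single predictor are assumptions chosen for the example, not choices made in the course.

```r
set.seed(42)  # assumed seed, only to make the split reproducible

# Step 1: prepare the data (keep two classes so this is a binary problem)
dat <- droplevels(subset(iris, Species != "setosa"))

# Step 2: create training and test sets (assumed 80/20 split)
train_idx <- sample(nrow(dat), size = round(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Step 3: train a logistic regression on the training set
fit <- glm(Species ~ Petal.Width, data = train, family = binomial)

# Step 4: report accuracy on the test set
# glm() models the probability of the second factor level
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, levels(dat$Species)[2], levels(dat$Species)[1])
accuracy <- mean(pred == test$Species)
print(accuracy)
```

Accuracy is measured only on rows the model never saw during training, which is the point of step 2.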

Character recognition

Napoleon

Boxer

1 https://comicvine.gamespot.com/napoleon/4005-141035/
2 https://hero.fandom.com/wiki/Boxer_(Animal_Farm)

Animal sentences

library(dplyr); library(tidytext)
# Create sentences (unnest_tokens() lowercases the text by default)
sentences <- animal_farm %>%
  unnest_tokens(output = "sentence", token = "sentences", input = text_column)
# Label sentences by animal
sentences$boxer <- grepl('boxer', sentences$sentence)
sentences$napoleon <- grepl('napoleon', sentences$sentence)
# Replace the animal name so the model cannot simply read the label
sentences$sentence <- gsub("boxer", "animal X", sentences$sentence)
sentences$sentence <- gsub("napoleon", "animal X", sentences$sentence)
# Keep sentences that mention exactly one of the two animals
animal_sentences <- sentences[sentences$boxer + sentences$napoleon == 1, ]
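
To see what the labeling and masking steps do, here is a small base R sketch on four made-up sentences. The sentences and the `toy` data frame are hypothetical and not taken from the book.

```r
# Hypothetical, already lowercased sentences (not from the actual text)
toy <- data.frame(
  sentence = c("boxer worked harder than ever.",
               "napoleon gave the orders.",
               "boxer saluted napoleon.",
               "the windmill was finished."),
  stringsAsFactors = FALSE
)

# Flag which animal each sentence mentions
toy$boxer    <- grepl("boxer", toy$sentence)
toy$napoleon <- grepl("napoleon", toy$sentence)

# Mask both names with the same placeholder
toy$sentence <- gsub("boxer|napoleon", "animal X", toy$sentence)

# TRUE + TRUE is 2 and FALSE + FALSE is 0, so == 1 keeps sentences
# that mention exactly one of the two animals
kept <- toy[toy$boxer + toy$napoleon == 1, ]
print(kept)
```

The sentence that mentions both animals and the sentence that mentions neither are dropped, so every remaining row has one unambiguous label.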

Sentences (continued)

animal_sentences$Name <-
    as.factor(ifelse(animal_sentences$boxer, "boxer", "napoleon"))
# Keep the first 75 sentences for each animal so the classes are balanced
animal_sentences <-
  rbind(animal_sentences[animal_sentences$Name == "boxer", ][1:75, ],
        animal_sentences[animal_sentences$Name == "napoleon", ][1:75, ])
# Number the sentences 1..n
animal_sentences$sentence_id <- seq_len(nrow(animal_sentences))
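
Taking a fixed range of rows per animal assumes each animal has at least that many sentences. The base R sketch below, on a hypothetical `toy` data frame, shows what happens when that assumption fails and one way to guard against it.

```r
# Hypothetical data: only two "boxer" rows are available
toy <- data.frame(
  Name = c("boxer", "boxer", "napoleon", "napoleon", "napoleon"),
  sentence_id = 1:5,
  stringsAsFactors = FALSE
)
boxer_rows <- toy[toy$Name == "boxer", ]

# Asking for rows 1:3 of a two-row data frame pads the result with an NA row
padded <- boxer_rows[1:3, ]

# head() returns at most n rows and never pads
safe <- head(boxer_rows, 3)

print(padded)
print(safe)
```

If the counts might fall short, `head()` avoids silently feeding all-NA rows into the model.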

Prepare the data

library(tm); library(tidytext)
library(dplyr); library(SnowballC)
# Tokenize into words, drop stop words, and stem what remains
animal_tokens <- animal_sentences %>%
  unnest_tokens(output = "word", token = "words", input = sentence) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(word = wordStem(word))
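
As a rough picture of what the tokenizing and stop-word steps do, here is a base R sketch. It is not how tidytext works internally: the regular expression, the four-word stop list, and the example sentence are all assumptions for illustration, and stemming is left out because it needs SnowballC.

```r
# Hypothetical input sentence
sentence <- "Animal X worked harder than the others."

# Split on anything that is not a letter or apostrophe, then lowercase
words <- tolower(unlist(strsplit(sentence, "[^A-Za-z']+")))
words <- words[nzchar(words)]  # drop empty strings, if any

# A tiny, made-up stop list (tidytext's stop_words is far larger)
toy_stop_words <- c("the", "than", "a", "of")

# Equivalent in spirit to anti_join(): keep words not in the stop list
tokens <- words[!words %in% toy_stop_words]
print(tokens)
```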

Preparation (continued)

animal_matrix <- animal_tokens %>%
  count(sentence_id, word) %>%
  cast_dtm(document = sentence_id, term = word,
           value = n, weighting = tm::weightTfIdf)
animal_matrix
<<DocumentTermMatrix (documents: 150, terms: 694)>>
Non-/sparse entries: 1235/102865
Sparsity           : 99%
Maximal term length: 17
Weighting          : term frequency - inverse document frequency
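
The weighting line in the output refers to TF-IDF. The base R sketch below computes it by hand on a made-up count matrix, assuming the common definition: term frequency normalized by document length, multiplied by log2(number of documents / documents containing the term). This is believed to be close to the default behavior of tm::weightTfIdf, but treat that as an assumption and check the tm documentation for the exact formula.

```r
# Hypothetical counts: rows are documents, columns are terms
counts <- matrix(c(2, 0, 1,
                   1, 1, 1,
                   0, 0, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(doc = c("1", "2", "3"),
                                 term = c("work", "order", "farm")))

# Term frequency: each count divided by its document's total count
tf <- counts / rowSums(counts)

# Inverse document frequency: rarer terms get larger weights
doc_freq <- colSums(counts > 0)
idf <- log2(nrow(counts) / doc_freq)

# Multiply every column of tf by that term's idf
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```

A term that appears in every document ("farm" here) has an idf of zero, so it carries no weight in any row.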

Removing sparse terms

  • Non-sparse entries (1,235) + sparse entries (102,865)
  • Matrix dimensions: 150 × 694
  • Sparsity: 102,865 / 104,100 (99%)

Solution: removeSparseTerms()
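
The sparsity figure can be checked with a few lines of arithmetic. The numbers are the ones reported for the matrix; the variable names are only for illustration.

```r
n_docs     <- 150
n_terms    <- 694
non_sparse <- 1235

# Every document/term pair is one cell of the matrix
total_cells <- n_docs * n_terms

# Cells that are zero (the term does not occur in that sentence)
sparse_cells <- total_cells - non_sparse

# Share of empty cells, as a percentage
sparsity_pct <- 100 * sparse_cells / total_cells
print(c(total = total_cells, sparse = sparse_cells))
print(round(sparsity_pct))
```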


How sparse is too sparse?

removeSparseTerms(animal_matrix, sparse = .90)
<<DocumentTermMatrix (documents: 150, terms: 4)>>
Non-/sparse entries: 207/393
Sparsity           : 66%
removeSparseTerms(animal_matrix, sparse = .99)
<<DocumentTermMatrix (documents: 150, terms: 172)>>
Non-/sparse entries: 713/25087
Sparsity           : 97%
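
To build intuition for the `sparse` argument, here is a base R sketch of the idea on a made-up matrix. It assumes the rule described in the tm documentation, namely that a term is kept only if the share of documents in which it does not occur is below `sparse`. The helper `keep_terms()` is hypothetical and is not part of tm.

```r
# Hypothetical 10-document matrix: term "a" occurs in 5 documents,
# "b" in 2, and "c" in 1
m <- cbind(
  a = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
  b = c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  c = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
)

# Per-term sparsity: fraction of documents where the term is absent
term_sparsity <- colMeans(m == 0)
print(term_sparsity)

# Hypothetical helper: keep terms whose sparsity is below the threshold
keep_terms <- function(m, sparse) {
  m[, colMeans(m == 0) < sparse, drop = FALSE]
}

print(colnames(keep_terms(m, 0.95)))
print(colnames(keep_terms(m, 0.85)))
print(colnames(keep_terms(m, 0.60)))
```

Under that reading, with 150 sentences `sparse = .90` keeps only terms found in more than roughly 15 sentences, while `sparse = .99` keeps any term found in at least 2, which is consistent with the jump from 4 to 172 terms.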

Let's practice!
