Handling Missing Data with Imputations in R
Michal Oleszak
Machine Learning Engineer

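Most statistical models estimate the conditional distribution of the response variable given the predictors:
$p(y|X)$
To make a single prediction, this conditional distribution is summarized into one value; a logistic regression, for instance, typically returns the class with the highest predicted probability.
Alternatively, we can draw from these distributions, which increases the variability of the imputed values.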
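The difference between summarizing and drawing can be sketched for a continuous response. This is a minimal illustration on simulated data, not part of the original slides: the variables `x` and `y`, the linear model, and the normal-error assumption are all hypothetical choices made for the example, and the draw ignores uncertainty in the estimated coefficients.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 1.5)

fit <- lm(y ~ x)

# 1000 new observations that all share the same predictor value
new_x <- data.frame(x = rep(1, 1000))

# Summarizing: every prediction is the estimated conditional mean
point_pred <- predict(fit, newdata = new_x)

# Drawing: sample from a normal distribution centered at the conditional
# mean, using the residual standard error as the spread
sigma_hat <- summary(fit)$sigma
drawn_pred <- rnorm(length(point_pred), mean = point_pred, sd = sigma_hat)

sd(point_pred)  # no variability: all predictions are identical
sd(drawn_pred)  # variability close to the residual standard error
```

The point predictions are all equal, so imputing with them would understate the spread of the data; the drawn values vary roughly as much as the residuals do.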


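Task: impute PhysActive in the nhanes data using logistic regression.
library(VIM)  # provides hotdeck()
# Initialize missing values with hot-deck imputation so the model sees complete data
nhanes_imp <- hotdeck(nhanes)
# Remember which PhysActive values were originally missing
missing_physactive <- is.na(nhanes$PhysActive)
logreg_model <- glm(PhysActive ~ Age + Weight + Pulse,
                    data = nhanes_imp, family = binomial)
# Predicted probabilities of PhysActive == 1 for every row
preds <- predict(logreg_model, type = "response")
# Summarize each probability into the most likely class
preds <- ifelse(preds >= 0.5, 1, 0)
# Overwrite only the originally missing values with the model predictions
nhanes_imp[missing_physactive, "PhysActive"] <- preds[missing_physactive]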
Variability of the imputed data:
table(preds[missing_physactive])
 1
26
Variability of the observed PhysActive data:
table(nhanes$PhysActive)
  0   1
181 610
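All 26 imputed values are 1, although the observed data contain both classes. One plausible explanation is that, when one class dominates and the predictors are only weakly related to the response, almost every predicted probability stays above 0.5. The sketch below reproduces this effect on simulated data; the data frame `sim`, its columns, and the coefficients are hypothetical and are not taken from nhanes.

```r
# Simulated data (hypothetical): a binary response with a majority class
# of roughly three quarters and one weak predictor
set.seed(1)
n <- 1000
sim <- data.frame(age = rnorm(n, mean = 45, sd = 15))
sim$active <- rbinom(n, size = 1,
                     prob = plogis(1.2 + 0.01 * (sim$age - 45)))

sim_model <- glm(active ~ age, data = sim, family = binomial)
sim_probs <- predict(sim_model, type = "response")

# The predicted probabilities cluster around the overall share of 1s
summary(sim_probs)

# Thresholding at 0.5 therefore assigns (nearly) every row to class 1
sim_classes <- ifelse(sim_probs >= 0.5, 1, 0)
table(sim_classes)

# The observed response, by contrast, contains a substantial share of 0s
table(sim$active)
```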
Drawing from the conditional distribution instead of thresholding:
nhanes_imp <- hotdeck(nhanes)
missing_physactive <- is.na(nhanes$PhysActive)
logreg_model <- glm(PhysActive ~ Age + Weight + Pulse,
                    data = nhanes_imp, family = binomial)
preds <- predict(logreg_model, type = "response")
# Draw each class from a Bernoulli distribution with the predicted probability
preds <- rbinom(length(preds), size = 1, prob = preds)
nhanes_imp[missing_physactive, "PhysActive"] <- preds[missing_physactive]
Variability of the imputed data:
table(preds[missing_physactive])
 0  1
 5 21
Variability of the observed PhysActive data:
table(nhanes$PhysActive)
  0   1
181 610