Comparing sampling and bootstrap distributions

Sampling in R

Richie Cotton

Data Evangelist at DataCamp

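Coffee focused subset

library(dplyr)   # select(), slice_sample(), summarize(), pull(), glimpse(), %>%
library(tibble)  # rowid_to_column(), tibble()

# coffee_ratings is assumed to be loaded already
set.seed(19790801)
coffee_sample <- coffee_ratings %>%
  select(variety, country_of_origin, flavor) %>%
  rowid_to_column() %>%
  slice_sample(n = 500)
glimpse(coffee_sample)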
Rows: 500
Columns: 4
$ rowid             <int> 10, 278, 458, 622, 131, 385, 1292, 47, 904, 1020, 5...
$ variety           <chr> "Other", "Bourbon", NA, "Caturra", "Caturra", "Yell...
$ country_of_origin <chr> "Ethiopia", "Guatemala", "Colombia", "Thailand", "C...
$ flavor            <dbl> 8.58, 7.75, 7.75, 7.50, 8.00, 7.83, 7.17, 8.08, 7.3...

The bootstrap of mean coffee flavors

mean_flavors_1000 <- replicate(
  n = 1000,
  expr = coffee_sample %>%
    slice_sample(prop = 1, replace = TRUE) %>%
    summarize(mean_flavor = mean(flavor, na.rm = TRUE)) %>%
    pull(mean_flavor)
)
bootstrap_distn <- tibble(
  resample_mean = mean_flavors_1000
)

Mean flavor bootstrap distribution

library(ggplot2)
ggplot(bootstrap_distn, aes(resample_mean)) +
  geom_histogram(binwidth = 0.0025)

[Figure: histogram of the bootstrap distribution.]

Sample, bootstrap distribution, and population means

Sample mean

coffee_sample %>% 
  summarize(mean_flavor = mean(flavor)) %>% 
  pull(mean_flavor)
7.5163

Estimated population mean

bootstrap_distn %>% 
  summarize(mean_mean_flavor = mean(resample_mean)) %>% 
  pull(mean_mean_flavor)
7.5167

True population mean

coffee_ratings %>% 
  summarize(mean_flavor = mean(flavor)) %>% 
  pull(mean_flavor)
7.5260

Interpreting the means

  • The mean of the bootstrap distribution is usually almost identical to the sample mean.
  • It may not be a good estimate of the population mean.
  • Bootstrapping cannot correct biases caused by differences between the sample and the population.
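
The point about bias can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is simulated with `rnorm()` rather than taken from `coffee_ratings`, and the sample is deliberately biased by drawing only from values above the population median. All object names here are hypothetical.

```r
# Simulated stand-in for a population of flavor scores (an assumption, not the coffee data)
set.seed(2024)
population <- rnorm(100000, mean = 7.5, sd = 0.35)

# Deliberately biased sample: only values above the population median are eligible
biased_sample <- sample(population[population > median(population)], size = 500)

# Bootstrap distribution of the mean, resampling the biased sample with replacement
boot_means <- replicate(
  1000,
  mean(sample(biased_sample, size = length(biased_sample), replace = TRUE))
)

sample_mean     <- mean(biased_sample)
bootstrap_mean  <- mean(boot_means)
population_mean <- mean(population)

# The bootstrap mean tracks the sample mean, not the population mean
abs(bootstrap_mean - sample_mean)      # very small
abs(bootstrap_mean - population_mean)  # much larger: the bias is still there
```

Resampling only reuses the values already in the sample, so whatever made the sample unrepresentative carries straight through to the bootstrap distribution.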

Sample sd vs. bootstrap distribution sd

Sample standard deviation

coffee_sample %>% 
  summarize(sd_flavor = sd(flavor)) %>% 
  pull(sd_flavor)
0.3525

Estimated population standard deviation?

bootstrap_distn %>% 
  summarize(sd_mean_flavor = sd(resample_mean)) %>% 
  pull(sd_mean_flavor)
0.01572

Sample, bootstrap distribution, and population standard deviations

Sample standard deviation

coffee_sample %>% 
  summarize(sd_flavor = sd(flavor)) %>% 
  pull(sd_flavor)
0.3525

Estimated population standard deviation

standard_error <- bootstrap_distn %>%
  summarize(sd_mean_flavor = sd(resample_mean)) %>% 
  pull(sd_mean_flavor)
standard_error * sqrt(500)
0.3515

True standard deviation

coffee_ratings %>%
  summarize(sd_flavor = sd(flavor)) %>%
  pull(sd_flavor)
0.3414

The standard error is the standard deviation of the statistic of interest.

For the sample mean, the standard error times the square root of the sample size estimates the population standard deviation.

Interpreting the standard errors

  • The estimated standard error is the standard deviation of the bootstrap distribution for a sample statistic.
  • For the sample mean, the bootstrap standard error times the square root of the sample size estimates the standard deviation in the population.
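
The scaling by the square root of the sample size follows from the usual formula for the standard error of a sample mean. The block below is a short sketch of that reasoning; the symbols $\sigma$ (population standard deviation), $n$ (sample size), and $\mathrm{SE}_{\text{boot}}$ (standard deviation of the bootstrap means) are notation introduced here, not taken from the slides.

```latex
\[
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
\sigma \approx \mathrm{SE}_{\text{boot}} \times \sqrt{n}
\]
% With the values from the slides (n = 500):
\[
0.01572 \times \sqrt{500} \approx 0.3515
\]
```

This relationship is specific to the mean; other statistics, such as the median, have standard errors that scale differently.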

Let's practice!
