Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
| Data | Example values | 
|---|---|
| Marriage status | unmarried,married | 
| Household income category | 0-20K,20-40K, ... | 
| T-shirt size | S,M,L,XL | 
In a factor, each category is stored as a number and has a corresponding label.

| Data | Labels | Numeric representation | 
|---|---|---|
| Marriage status | unmarried,married | 1,2 | 
| Household income category | 0-20K,20-40K, ... | 1,2, ... | 
| T-shirt size | S,M,L,XL | 1,2,3,4 | 
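
The label-to-number mapping in the table can be inspected directly in R. The sketch below is only an illustration: the vector `raw_sizes` and the name `size_factor` are hypothetical and not part of the course data.

```r
# Hypothetical sample of sizes (not the course data)
raw_sizes <- c("L", "XL", "XL", "L", "M", "S")

# The levels argument fixes the allowed categories and their order
size_factor <- factor(raw_sizes, levels = c("S", "M", "L", "XL"))

levels(size_factor)      # the labels: "S" "M" "L" "XL"
as.integer(size_factor)  # the underlying codes: 3 4 4 3 2 1
```

Each code is the position of the value's label in `levels()`, which is why `"L"` maps to 3 and `"S"` maps to 1.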
tshirt_size
L  XL XL L  M  M  M  L  XL L  S  M  M  S  S  M  XL S  L  S ... 
Levels: S M L XL
levels(tshirt_size)
"S"  "M"  "L"  "XL"
Factors cannot have values that fall outside of the predefined levels.

| Data | Levels | Not allowed | 
|---|---|---|
| Marriage status | unmarried,married | divorced | 
| Household income category | 0-20K,20-40K, ... | 10-30K | 
| T-shirt size | S,M,L,XL | S/M | 
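
A small sketch of what R does when a value outside the predefined levels is introduced. The variable names are hypothetical, chosen only for this illustration.

```r
# Hypothetical factor with the four allowed sizes as levels
size_factor <- factor(c("S", "M", "L"), levels = c("S", "M", "L", "XL"))

# Assigning a value that is not a level raises a warning and stores NA
size_factor[2] <- "S/M"
is.na(size_factor[2])    # TRUE

# Building a factor from a vector with an unknown value gives NA without a warning
from_vector <- factor(c("S", "S/M"), levels = c("S", "M", "L", "XL"))
is.na(from_vector)       # FALSE TRUE
```

In both cases the invalid value is lost rather than kept, so membership problems are easier to diagnose while the column is still a character vector.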



study_data
      name   birthday blood_type
1     Beth 2019-10-20         B-
2 Ignatius 2020-07-08         A-
3     Paul 2019-08-12         O+
4    Helen 2019-03-17         O-
5 Jennifer 2019-12-17         Z+
6  Kennedy 2020-04-27         A+
7    Keith 2019-04-19        AB+
blood_types
  blood_type
1         O-
2         O+
3         A-
4         A+
5         B+
6         B-
7        AB+
8        AB-

study_data %>%
  anti_join(blood_types, by = "blood_type")
      name   birthday blood_type
1 Jennifer 2019-12-17         Z+

study_data %>%
  semi_join(blood_types, by = "blood_type")
      name   birthday blood_type
1     Beth 2019-10-20         B-
2 Ignatius 2020-07-08         A-
3     Paul 2019-08-12         O+
4    Helen 2019-03-17         O-
5  Kennedy 2020-04-27         A+
6    Keith 2019-04-19        AB+
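
To reproduce the two joins above end to end, the data frames can be rebuilt from the printed output. This is a minimal sketch: the column types are assumptions (in particular, storing `birthday` as a `Date`), and the names `invalid_rows` and `valid_rows` are introduced here for illustration.

```r
library(dplyr)

# Rebuilt from the printed output; column types are assumed
study_data <- data.frame(
  name = c("Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"),
  birthday = as.Date(c("2019-10-20", "2020-07-08", "2019-08-12", "2019-03-17",
                       "2019-12-17", "2020-04-27", "2019-04-19")),
  blood_type = c("B-", "A-", "O+", "O-", "Z+", "A+", "AB+"),
  stringsAsFactors = FALSE
)

# Reference table of allowed values
blood_types <- data.frame(
  blood_type = c("O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"),
  stringsAsFactors = FALSE
)

# Rows whose blood_type is NOT in the reference table
invalid_rows <- study_data %>%
  anti_join(blood_types, by = "blood_type")

# Rows whose blood_type IS in the reference table
valid_rows <- study_data %>%
  semi_join(blood_types, by = "blood_type")

nrow(invalid_rows)  # 1
nrow(valid_rows)    # 6
```

Because every row either matches the reference table or does not, the two results split `study_data` without overlap, so their row counts add up to the original seven.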