Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
| Data | Example values |
|---|---|
| Marriage status | unmarried, married |
| Household income category | 0-20K, 20-40K, ... |
| T-shirt size | S, M, L, XL |
In a factor, each category is stored as a number and has a corresponding label.

| Data | Labels | Numeric representation |
|---|---|---|
| Marriage status | unmarried, married | 1, 2 |
| Household income category | 0-20K, 20-40K, ... | 1, 2, ... |
| T-shirt size | S, M, L, XL | 1, 2, 3, 4 |
tshirt_size
L XL XL L M M M L XL L S M M S S M XL S L S ...
Levels: S M L XL
levels(tshirt_size)
"S" "M" "L" "XL"
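As a minimal sketch of how such a factor could be built, the snippet below creates a short `tshirt_size` vector and shows both its labels and its underlying integer codes. The six sample values are an assumption made for illustration, not the course's full data.

```r
# Hypothetical sample of sizes; the full tshirt_size vector is not reproduced here
sizes <- c("L", "XL", "XL", "L", "M", "S")

# Declare the allowed categories explicitly so the level order is S, M, L, XL
tshirt_size <- factor(sizes, levels = c("S", "M", "L", "XL"))

levels(tshirt_size)      # the labels: "S" "M" "L" "XL"
as.integer(tshirt_size)  # the stored codes: 3 4 4 3 2 1
```

Passing `levels` explicitly matters here: without it, R would sort the labels alphabetically (L, M, S, XL) and the codes would no longer follow the size order.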
Factors cannot have values that fall outside of the predefined ones.

| Data | Levels | Not allowed |
|---|---|---|
| Marriage status | unmarried, married | divorced |
| Household income category | 0-20K, 20-40K, ... | 10-30K |
| T-shirt size | S, M, L, XL | S/M |
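To see what this restriction looks like in practice, here is a small sketch using the T-shirt levels from the table. The sample values, including `"XXL"`, are made up for illustration.

```r
allowed <- c("S", "M", "L", "XL")

# Values outside the allowed levels become NA when the factor is created
sizes <- factor(c("S", "S/M", "XL"), levels = allowed)
is.na(sizes)   # FALSE TRUE FALSE

# Assigning an undefined level later also yields NA, with a warning
# ("invalid factor level, NA generated")
sizes[1] <- "XXL"
is.na(sizes)   # TRUE TRUE FALSE
```

Because the out-of-range value is replaced by `NA` rather than raising an error, it is usually worth checking raw character data against the allowed categories before converting it to a factor.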



study_data
name birthday blood_type
1 Beth 2019-10-20 B-
2 Ignatius 2020-07-08 A-
3 Paul 2019-08-12 O+
4 Helen 2019-03-17 O-
5 Jennifer 2019-12-17 Z+ <--
6 Kennedy 2020-04-27 A+
7 Keith 2019-04-19 AB+
blood_types
blood_type
1 O-
2 O+
3 A-
4 A+
5 B+
6 B-
7 AB+
8 AB-

library(dplyr)
# Rows of study_data whose blood_type has no match in blood_types
study_data %>%
  anti_join(blood_types, by = "blood_type")
name birthday blood_type
1 Jennifer 2019-12-17 Z+
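For readers who want to reproduce this result, the following sketch rebuilds both data frames from the printed tables and runs the anti-join. It assumes the dplyr package is installed and that `birthday` is stored as a `Date`. The variable name `invalid` is introduced here only for illustration.

```r
library(dplyr)

# Data copied from the tables shown above
study_data <- data.frame(
  name = c("Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"),
  birthday = as.Date(c("2019-10-20", "2020-07-08", "2019-08-12", "2019-03-17",
                       "2019-12-17", "2020-04-27", "2019-04-19")),
  blood_type = c("B-", "A-", "O+", "O-", "Z+", "A+", "AB+")
)
blood_types <- data.frame(
  blood_type = c("O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-")
)

# Rows of study_data with no match in blood_types, i.e. the invalid values
invalid <- study_data %>%
  anti_join(blood_types, by = "blood_type")

invalid$name  # "Jennifer"
```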

# Rows of study_data whose blood_type does have a match in blood_types
study_data %>%
  semi_join(blood_types, by = "blood_type")
name birthday blood_type
1 Beth 2019-10-20 B-
2 Ignatius 2020-07-08 A-
3 Paul 2019-08-12 O+
4 Helen 2019-03-17 O-
5 Kennedy 2020-04-27 A+
6 Keith 2019-04-19 AB+
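The same membership check can also be written without dplyr. The sketch below is a base R alternative built on `%in%`, not the approach shown in the slides. It omits the `birthday` column for brevity, and `valid_types`, `is_valid`, and `clean_data` are names introduced here for illustration.

```r
study_data <- data.frame(
  name = c("Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"),
  blood_type = c("B-", "A-", "O+", "O-", "Z+", "A+", "AB+")
)
valid_types <- c("O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-")

# TRUE where the row's blood type is one of the valid categories
is_valid <- study_data$blood_type %in% valid_types

study_data[!is_valid, ]               # the invalid row (Jennifer, Z+), as with anti_join
clean_data <- study_data[is_valid, ]  # the valid rows, as with semi_join
nrow(clean_data)                      # 6
```

One visible difference is that base R subsetting keeps the original row names (1, 2, 3, 4, 6, 7), whereas the dplyr output above is renumbered from 1.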