Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
Data | Example values |
---|---|
Marriage status | unmarried , married |
Household income category | 0-20K , 20-40K , ... |
T-shirt size | S , M , L , XL |
In a `factor`, each category is stored as a number and has a corresponding label.

Data | Labels | Numeric representation |
---|---|---|
Marriage status | unmarried , married | 1 , 2 |
Household income category | 0-20K , 20-40K , ... | 1 , 2 , ... |
T-shirt size | S , M , L , XL | 1 , 2 , 3 , 4 |
tshirt_size
L XL XL L M M M L XL L S M M S S M XL S L S ...
Levels: S M L XL
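A factor like the one printed above can be built with base R's `factor()`. This is a minimal sketch: the `sizes` vector is a hypothetical sample, not the course data, and the level order is passed explicitly so that it follows size order rather than alphabetical order.

```r
# Hypothetical sample of T-shirt sizes (not the course data)
sizes <- c("L", "XL", "XL", "L", "M", "S")

# Set the levels explicitly so they follow size order, not alphabetical order
tshirt_size <- factor(sizes, levels = c("S", "M", "L", "XL"))

# The labels, in level order
size_labels <- levels(tshirt_size)

# The underlying numeric codes: position of each value within the levels
size_codes <- as.integer(tshirt_size)
size_codes  # 3 4 4 3 2 1
```

Each code is simply the index of the value's label in `levels(tshirt_size)`, so `L` is stored as 3 and `S` as 1.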
levels(tshirt_size)
"S" "M" "L" "XL"
`factor`s cannot have values that fall outside of the predefined ones.

Data | Levels | Not allowed |
---|---|---|
Marriage status | unmarried , married | divorced |
Household income category | 0-20K , 20-40K , ... | 10-30K |
T-shirt size | S , M , L , XL | S/M |
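To see what this restriction means in practice, here is a small base R sketch. The `sizes` and `allowed` vectors are hypothetical. When a factor is created with explicit levels, any value outside those levels becomes `NA`, so it can help to check the raw values first.

```r
# Hypothetical values; "S/M" is not one of the allowed levels
sizes <- c("S", "M", "S/M", "XL")
allowed <- c("S", "M", "L", "XL")

tshirt_size <- factor(sizes, levels = allowed)

# Values outside the levels became NA when the factor was created
is.na(tshirt_size)  # FALSE FALSE TRUE FALSE

# Find the offending raw values before they are lost
unexpected <- setdiff(sizes, allowed)
unexpected  # "S/M"
```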
study_data
name birthday blood_type
1 Beth 2019-10-20 B-
2 Ignatius 2020-07-08 A-
3 Paul 2019-08-12 O+
4 Helen 2019-03-17 O-
5 Jennifer 2019-12-17 Z+ <--
6 Kennedy 2020-04-27 A+
7 Keith 2019-04-19 AB+
blood_types
blood_type
1 O-
2 O+
3 A-
4 A+
5 B+
6 B-
7 AB+
8 AB-
library(dplyr)
# Rows of study_data whose blood_type is NOT found in blood_types
study_data %>%
  anti_join(blood_types, by = "blood_type")
name birthday blood_type
1 Jennifer 2019-12-17 Z+
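The anti-join above can be reproduced end to end. This is a minimal sketch, assuming `dplyr` is installed. Both data frames are re-created from the printed output, and the column types (a `Date` for `birthday`, character for the rest) are an assumption.

```r
library(dplyr)

# Data re-created from the printed output; column types are assumed
study_data <- data.frame(
  name = c("Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"),
  birthday = as.Date(c("2019-10-20", "2020-07-08", "2019-08-12", "2019-03-17",
                       "2019-12-17", "2020-04-27", "2019-04-19")),
  blood_type = c("B-", "A-", "O+", "O-", "Z+", "A+", "AB+"),
  stringsAsFactors = FALSE
)
blood_types <- data.frame(
  blood_type = c("O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose blood_type has no match in blood_types
invalid_rows <- study_data %>%
  anti_join(blood_types, by = "blood_type")

invalid_rows$name  # "Jennifer"
```

Storing the result in a name such as `invalid_rows` (hypothetical) makes it easy to inspect the problem rows before deciding whether to drop or correct them.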
# Rows of study_data whose blood_type IS found in blood_types
study_data %>%
  semi_join(blood_types, by = "blood_type")
name birthday blood_type
1 Beth 2019-10-20 B-
2 Ignatius 2020-07-08 A-
3 Paul 2019-08-12 O+
4 Helen 2019-03-17 O-
5 Kennedy 2020-04-27 A+
6 Keith 2019-04-19 AB+
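The two filtering joins split the data into rows with a valid category and rows without one. The same split can be sketched in base R with `%in%`. This is an alternative to the `dplyr` joins, not the approach used above, and the three-row data frame is a hypothetical subset.

```r
# Hypothetical subset of the study data, enough to show the idea
study_data <- data.frame(
  name = c("Beth", "Jennifer", "Keith"),
  blood_type = c("B-", "Z+", "AB+"),
  stringsAsFactors = FALSE
)
valid_types <- c("O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-")

# TRUE where the blood type is one of the allowed values
is_valid <- study_data$blood_type %in% valid_types

valid_rows   <- study_data[is_valid, ]   # like semi_join(): rows with a match
invalid_rows <- study_data[!is_valid, ]  # like anti_join(): rows without a match

# Every row lands in exactly one of the two groups
nrow(valid_rows) + nrow(invalid_rows) == nrow(study_data)
```

Unlike the `dplyr` joins, base R subsetting keeps the original row names, so the printed row numbers may differ.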