Checking membership

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

Categorical data

  • Categorical variables have a fixed and known set of possible values
Data Example values
Marriage status unmarried, married
Household income category 0-20K, 20-40K, ...
T-shirt size S, M, L, XL
Cleaning Data in R

Factors

  • In a factor, each category is stored as a number number and has a corresponding label
Data Labels Numeric representation
Marriage status unmarried, married 1, 2
Household income category 0-20K, 20-40K, ... 1, 2, ...
T-shirt size S, M, L, XL 1, 2, 3, 4
Cleaning Data in R

Factor levels

tshirt_size
L  XL XL L  M  M  M  L  XL L  S  M  M  S  S  M  XL S  L  S ... 
Levels: S M L XL
levels(tshirt_size)
"S"  "M"  "L"  "XL"
Cleaning Data in R

Values that don't belong

  • factors cannot have values that fall outside of the predefined ones
Data Levels Not allowed
Marriage status unmarried, married divorced
Household income category 0-20K, 20-40K, ... 10-30K
T-shirt size S, M, L, XL S/M
Cleaning Data in R

How do we end up with these values?

 

On the left, Data Entry Errors represented by a text box with a cursor for free text, and a drop down menu representing dropdown entry. On the right, parsing errors represented by a database.

Cleaning Data in R

Filtering joins: a quick review

  • Keeps or removes observations from the first table without adding columns

Title: Semi-join. Subtitle: What observations of X are also in Y? On the left, a one-column table named X with values a, b, and c. On the right, a one-column table named Y with values a, c, and d. The a's and c's in each table are connected by lines. In X, the cells for a and c are highlighted.

Cleaning Data in R

Filtering joins: a quick review

  • Keeps or removes observations from the first table without adding columns

Title: Anti-join. Subtitle: What observations of X are not in Y? On the left, a one-column table named X with values a, b, and c. On the right, a one-column table named Y with values a, c, and d. The a's and c's in each table are connected by lines. In X, the cell for b is highlighted.

Cleaning Data in R

Blood type example

study_data
      name   birthday blood_type
1     Beth 2019-10-20         B-
2 Ignatius 2020-07-08         A-
3     Paul 2019-08-12         O+
4    Helen 2019-03-17         O-
5 Jennifer 2019-12-17         Z+
6  Kennedy 2020-04-27         A+
7    Keith 2019-04-19        AB+
blood_types
  blood_type
1         O-
2         O+
3         A-
4         A+
5         B+
6         B-
7        AB+
8        AB-
Cleaning Data in R

Blood type example

study_data
      name   birthday blood_type
1     Beth 2019-10-20         B-
2 Ignatius 2020-07-08         A-
3     Paul 2019-08-12         O+
4    Helen 2019-03-17         O-
5 Jennifer 2019-12-17         Z+  <--
6  Kennedy 2020-04-27         A+
7    Keith 2019-04-19        AB+
blood_types
  blood_type
1         O-
2         O+
3         A-
4         A+
5         B+
6         B-
7        AB+
8        AB-
Cleaning Data in R

Finding non-members

Venn diagram. Left circle represents study_data and right circle represents blood types. On the left side is Z+. In the middle portion is A-, O-, AB+, A+, O+, and B-. On the right is B+ and AB-. The Z+ on the left is in red.

Cleaning Data in R

Anti-join

study_data %>%
  anti_join(blood_types, by = "blood_type")
      name   birthday blood_type
1 Jennifer 2019-12-17         Z+
Cleaning Data in R

Removing non-members

Venn diagram. Left circle represents study_data and right circle represents blood types. On the left side is Z+. In the middle portion is A-, O-, AB+, A+, O+, and B-. On the right is B+ and AB-. The blood types in the middle are in blue.

Cleaning Data in R

Semi-join

study_data %>%
  semi_join(blood_types, by = "blood_type")
      name   birthday blood_type
1     Beth 2019-10-20         B-
2 Ignatius 2020-07-08         A-
3     Paul 2019-08-12         O+
4    Helen 2019-03-17         O-
5  Kennedy 2020-04-27         A+
6    Keith 2019-04-19        AB+
Cleaning Data in R

Let's practice!

Cleaning Data in R

Preparing Video For Download...