Uniformity

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

Uniformity

  • Different units or formats
    • Temperature: °C vs. °F
    • Weight: kg vs. g vs. lb
    • Money: USD $ vs. GBP £ vs. JPY ¥
    • Date: DD-MM-YYYY vs. MM-DD-YYYY vs. YYYY-MM-DD
Cleaning Data in R

Where do uniformity issues come from?

On the left, three databases with arrows pointing to one data table to represent multiple data sources. On the right, a free text box with a cursor and a computer with a keyboard to represent data entry errors.

Cleaning Data in R

Finding uniformity issues

head(nyc_temps)
         date temp
1  2019-04-01  4.2
2  2019-04-02  7.5
3  2019-04-03 12.2
4  2019-04-04 11.1
5  2019-04-05 41.5
6  2019-04-06 11.9
Cleaning Data in R

Finding uniformity issues

library(ggplot2)
ggplot(nyc_temps, aes(x = date, y = temp)) +
  geom_point()

Plot generated by code. There are three temperatures that are clearly much higher than the rest.

Cleaning Data in R

What to do?

  • There's no one best option. It depends on your dataset!
  • Do your research to understand where your data comes from

Plot generated by code. There are three temperatures that are clearly much higher than the rest.

  • Data from Apr 7, 16, and 23 is from an external source that measured temps in °F
Cleaning Data in R

Unit conversion

$$\text{C} = (\text{F} - 32) \times \frac{5}{9}$$

ifelse(condition, value_if_true, value_if_false)

nyc_temps %>%
  mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp))
         date temp   temp_c
1  2019-04-01  4.2  4.20000
...
7  2019-04-07 58.5 14.72222
...
Cleaning Data in R

Unit conversion

nyc_temps %>%
  mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp)) %>%
  ggplot(aes(x = date, y = temp_c)) +
    geom_point()

Plot generated by code, but this time there are no clear outliers since they have been converted.

Cleaning Data in R

Date uniformity

nyc_temps
             date temp_c
1      2019-11-23   5.12
2        01/15/19  -0.67
3  April 24, 2019  17.46
4        08/30/19  26.46
5 October 3, 2019  14.63
6      2019-03-17   3.47

 

Date string Date format
"2019-11-23" "%Y-%m-%d"
"01/15/19" "%m/%d/%y"
"April 24, 2019" "%B %d, %Y"

 

?strptime in R console

Cleaning Data in R

Parsing multiple formats

library(lubridate)
parse_date_time(nyc_temps$date,
                orders = c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y"))
"2019-11-23 UTC" "2019-01-15 UTC" "2019-04-24 UTC" "2019-08-30 UTC"
"2019-10-03 UTC" "2019-03-17 UTC"
parse_date_time("Monday, January 3",
                orders = c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y"))
NA
Cleaning Data in R

Ambiguous dates

Is 02/04/2019 in February or April?

  • Depends on your data!

 

Options include:

  • Treat as missing
  • If your data comes from multiple sources, infer based on source
  • Infer based on other data in the dataset
Cleaning Data in R

Let's practice!

Cleaning Data in R

Preparing Video For Download...