Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
°C vs. °F
kg vs. g vs. lb
$ vs. GBP £ vs. JPY ¥
DD-MM-YYYY vs. MM-DD-YYYY vs. YYYY-MM-DD
head(nyc_temps)
         date temp
1  2019-04-01  4.2
2  2019-04-02  7.5
3  2019-04-03 12.2
4  2019-04-04 11.1
5  2019-04-05 41.5
6  2019-04-06 11.9
library(ggplot2)
ggplot(nyc_temps, aes(x = date, y = temp)) +
  geom_point()


$$\text{C} = (\text{F} - 32) \times \frac{5}{9}$$
ifelse(condition, value_if_true, value_if_false)
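A minimal sketch of how the conversion formula and `ifelse()` fit together on a plain vector. The helper name `fahrenheit_to_celsius` and the vector `temps` are hypothetical, introduced only for illustration; the values are taken from the `temp` column shown in these slides.

```r
# Hypothetical helper implementing C = (F - 32) * 5 / 9
fahrenheit_to_celsius <- function(f) (f - 32) * 5 / 9

# Illustrative vector: two Celsius readings and one Fahrenheit reading
temps <- c(4.2, 58.5, 12.2)

# ifelse() is vectorized: the condition is checked element by element,
# and only the elements above 50 are replaced by their converted value
converted <- ifelse(temps > 50, fahrenheit_to_celsius(temps), temps)
converted
```

Only the second element changes here; the others are returned untouched because they fail the `temps > 50` condition.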
library(dplyr)
# Treat readings above 50 as Fahrenheit and convert them to Celsius
nyc_temps %>%
  mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp))
         date temp   temp_c
1  2019-04-01  4.2  4.20000
...
7  2019-04-07 58.5 14.72222
...
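One way to sanity-check the conversion is to confirm that no value above the cutoff remains afterwards. This is a sketch under stated assumptions: `nyc_temps_demo` is a hypothetical two-row stand-in for `nyc_temps`, and the cutoff of 50 mirrors the condition used in the `mutate()` call.

```r
library(dplyr)

# Hypothetical stand-in for nyc_temps, using two rows shown in the slides
nyc_temps_demo <- data.frame(
  date = as.Date(c("2019-04-01", "2019-04-07")),
  temp = c(4.2, 58.5)
)

converted <- nyc_temps_demo %>%
  mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp))

# After conversion, no value should still exceed the cutoff
stopifnot(all(converted$temp_c <= 50))
converted
```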
nyc_temps %>%
  mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp)) %>%
  ggplot(aes(x = date, y = temp_c)) +
    geom_point()

nyc_temps
             date temp_c
1      2019-11-23   5.12
2        01/15/19  -0.67
3  April 24, 2019  17.46
4        08/30/19  26.46
5 October 3, 2019  14.63
6      2019-03-17   3.47
| Date string | Date format |
|---|---|
| "2019-11-23" | "%Y-%m-%d" | 
| "01/15/19" | "%m/%d/%y" | 
| "April 24, 2019" | "%B %d, %Y" | 
Run ?strptime in the R console for the full list of format codes.
library(lubridate)
parse_date_time(nyc_temps$date,
                orders = c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y"))
"2019-11-23 UTC" "2019-01-15 UTC" "2019-04-24 UTC" "2019-08-30 UTC"
"2019-10-03 UTC" "2019-03-17 UTC"
parse_date_time("Monday, January 3",
                orders = c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y"))
NA
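Because unparseable strings come back as `NA`, it can help to list which inputs failed before deciding how to handle them. A minimal sketch, assuming a hypothetical vector `dates` built from strings that appear in these slides:

```r
library(lubridate)

# Hypothetical input: two parseable strings and one that matches no order
dates <- c("2019-11-23", "01/15/19", "Monday, January 3")

# parse_date_time() may warn that some strings failed to parse
parsed <- parse_date_time(dates,
                          orders = c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y"))

# Show the original strings that could not be parsed
dates[is.na(parsed)]
```

Inspecting the failures this way shows whether another format needs to be added to `orders` or whether the values are genuinely unusable.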
Is 02/04/2019 in February or April?
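The ambiguity can be seen by parsing the same string with two different formats. This sketch uses base R's `as.Date()` with `strptime`-style format codes; the variable names are hypothetical.

```r
ambiguous <- "02/04/2019"

# Read as month/day/year
as_mdy <- as.Date(ambiguous, format = "%m/%d/%Y")

# Read as day/month/year
as_dmy <- as.Date(ambiguous, format = "%d/%m/%Y")

as_mdy  # 2019-02-04
as_dmy  # 2019-04-02
```

Both calls succeed without a warning, so nothing in the string itself reveals which reading is correct.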
Options include: