Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
°C
vs. °F
kg
vs. g
vs. lb
$
vs. GBP £
vs. JPY ¥
DD-MM-YYYY
vs. MM-DD-YYYY
vs. YYYY-MM-DD
head(nyc_temps)
date temp
1 2019-04-01 4.2
2 2019-04-02 7.5
3 2019-04-03 12.2
4 2019-04-04 11.1
5 2019-04-05 41.5
6 2019-04-06 11.9
library(ggplot2)
ggplot(nyc_temps, aes(x = date, y = temp)) +
geom_point()
$$\text{C} = (\text{F} - 32) \times \frac{5}{9}$$
ifelse(condition, value_if_true, value_if_false)
nyc_temps %>%
mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp))
date temp temp_c
1 2019-04-01 4.2 4.20000
...
7 2019-04-07 58.5 14.72222
...
nyc_temps %>%
mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp)) %>%
ggplot(aes(x = date, y = temp_c)) +
geom_point()
nyc_temps
date temp_c
1 2019-11-23 5.12
2 01/15/19 -0.67
3 April 24, 2019 17.46
4 08/30/19 26.46
5 October 3, 2019 14.63
6 2019-03-17 3.47
Date string | Date format |
---|---|
"2019-11-23" |
"%Y-%m-%d" |
"01/15/19" |
"%m/%d/%y" |
"April 24, 2019" |
"%B %d, %Y" |
?strptime
in R console
library(lubridate)
parse_date_time(nyc_temps$date,
orders = c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y"))
"2019-11-23 UTC" "2019-01-15 UTC" "2019-04-24 UTC" "2019-08-30 UTC"
"2019-10-03 UTC" "2019-03-17 UTC"
parse_date_time("Monday, January 3",
orders = c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y"))
NA
Is 02/04/2019 in February or April?
Options include:
Cleaning Data in R