Cleaning Data in Python
Adel Nehme
Content Developer @ DataCamp
Column | Unit |
---|---|
Temperature | 32°C is also 89.6°F |
Weight | 70 Kg is also 11 st. |
Date | 26-11-2019 is also 26, November, 2019 |
Money | 100$ is also 10763.90¥ |
import pandas as pd
temperatures = pd.read_csv('temperature.csv')
temperatures.head()
Date Temperature
0 03.03.19 14.0
1 04.03.19 15.0
2 05.03.19 18.0
3 06.03.19 16.0
4 07.03.19 62.6 <--
# Import matplotlib
import matplotlib.pyplot as plt
# Create scatter plot
plt.scatter(x = 'Date', y = 'Temperature', data = temperatures)
# Create title, xlabel and ylabel
plt.title('Temperature in Celsius March 2019 - NYC')
plt.xlabel('Dates')
plt.ylabel('Temperature in Celsius')
# Show plot
plt.show()
$$C = (F - 32) \times \frac{5}{9}$$
temp_fah = temperatures.loc[temperatures['Temperature'] > 40, 'Temperature']
temp_cels = (temp_fah - 32) * (5/9)
temperatures.loc[temperatures['Temperature'] > 40, 'Temperature'] = temp_cels
# Assert conversion is correct
assert temperatures['Temperature'].max() < 40
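As a worked check of the formula, the sketch below rebuilds the five rows shown by `temperatures.head()` in memory (an assumption, since `temperature.csv` itself is not available here) and converts only the flagged row. The threshold of 40 is the same heuristic used above; the name `is_fahrenheit` is introduced only for this illustration.

```python
import pandas as pd

# Sample rebuilt from the head() output above (not the real CSV)
temperatures = pd.DataFrame({
    "Date": ["03.03.19", "04.03.19", "05.03.19", "06.03.19", "07.03.19"],
    "Temperature": [14.0, 15.0, 18.0, 16.0, 62.6],
})

# Assumption: any reading above 40 was recorded in Fahrenheit
is_fahrenheit = temperatures["Temperature"] > 40

# Apply C = (F - 32) * 5/9 to the flagged rows only
temperatures.loc[is_fahrenheit, "Temperature"] = (
    temperatures.loc[is_fahrenheit, "Temperature"] - 32
) * 5 / 9

# 62.6 F becomes (62.6 - 32) * 5/9 = 17.0 C
assert temperatures["Temperature"].max() < 40
```

Building the boolean mask once and reusing it keeps the selection and the assignment aligned on the same rows.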
birthdays.head()
Birthday First name Last name
0 27/27/19 Rowan Nunez
1 03-29-19 Brynn Yang
2 March 3rd, 2019 Sophia Reilly
3 24-03-19 Deacon Prince
4 06-03-19 Griffith Neal
datetime is useful for representing dates
Date | datetime format |
---|---|
25-12-2019 | %d-%m-%Y |
December 25th 2019 | %B %dth %Y |
12-25-2019 | %m-%d-%Y |
... | ... |
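The format codes in the table can be tried directly with the standard library. This is a minimal sketch using `datetime.strptime` (string to date) and `strftime` (date to string); the variable names are illustrative only.

```python
from datetime import datetime

# Parse the same calendar day written in two different layouts
day_first = datetime.strptime("25-12-2019", "%d-%m-%Y")
month_first = datetime.strptime("12-25-2019", "%m-%d-%Y")

# Once parsed, both strings describe the same date
assert day_first == month_first

# strftime goes the other way: datetime to string in a chosen layout
print(day_first.strftime("%d-%m-%Y"))
print(day_first.strftime("%m-%d-%Y"))
```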
pandas.to_datetime()
# Converts to datetime - but won't work!
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'])
ValueError: month must be in 1..12
# Will work!
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'],
                                       # Attempt to infer format of each date
                                       infer_datetime_format=True,
                                       # Return NA for rows where conversion failed
                                       errors = 'coerce')
birthdays.head()
Birthday First name Last name
0 NaT Rowan Nunez
1 2019-03-29 Brynn Yang
2 2019-03-03 Sophia Reilly
3 2019-03-24 Deacon Prince
4 2019-06-03 Griffith Neal
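When the layout of a column is known, an explicit `format` is a possible alternative to inference. The sketch below is an illustration under stated assumptions, not the approach used above: the three strings are borrowed from the birthdays example and are all assumed to be day-first (`%d-%m-%y`), and `raw` and `parsed` are hypothetical names. Note also that `infer_datetime_format` is deprecated in pandas 2.0 and later, so an explicit format may be the more portable choice.

```python
import pandas as pd

# Values borrowed from the birthdays example; all assumed to be day-first
raw = pd.Series(["27/27/19", "24-03-19", "06-03-19"])

# Rows that do not match the format become NaT instead of raising
parsed = pd.to_datetime(raw, format="%d-%m-%y", errors="coerce")

print(parsed)
```

Under the day-first assumption, `06-03-19` is read as 6 March 2019, whereas the inferred result above reads it as 3 June 2019. Stating the format makes that choice visible.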
birthdays['Birthday'] = birthdays['Birthday'].dt.strftime("%d-%m-%Y")
birthdays.head()
Birthday First name Last name
0 NaT Rowan Nunez
1 29-03-2019 Brynn Yang
2 03-03-2019 Sophia Reilly
3 24-03-2019 Deacon Prince
4 03-06-2019 Griffith Neal
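One point worth keeping in mind: `.dt.strftime()` returns strings, not datetimes, so it is best applied as a final display step. A small sketch with two made-up dates (`dates` and `formatted` are illustrative names):

```python
import pandas as pd

# Two datetime values in a Series
dates = pd.Series(pd.to_datetime(["2019-03-29", "2019-03-03"]))

# Formatting turns each datetime into a plain string
formatted = dates.dt.strftime("%d-%m-%Y")

print(formatted.tolist())  # ['29-03-2019', '03-03-2019']
```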
Is 2019-03-08 in August or March?
Convert ambiguous dates to NA and treat accordingly
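The ambiguity can be made concrete by parsing the same string under both readings. This is a rough sketch using only the standard library; `readings` and `is_ambiguous` are hypothetical helpers written for this illustration, and only the two layouts `%Y-%m-%d` and `%Y-%d-%m` are considered.

```python
from datetime import datetime

def readings(text):
    """Return the distinct dates a string could mean under both layouts."""
    found = set()
    for fmt in ("%Y-%m-%d", "%Y-%d-%m"):
        try:
            found.add(datetime.strptime(text, fmt).date())
        except ValueError:
            # This layout does not fit the string (e.g. month > 12)
            pass
    return found

def is_ambiguous(text):
    # Ambiguous only if the two layouts give different valid dates
    return len(readings(text)) > 1

print(sorted(readings("2019-03-08")))  # 8 March and 3 August are both valid
print(is_ambiguous("2019-03-08"))      # True
print(is_ambiguous("2019-03-24"))      # False: 24 cannot be a month
```

Rows flagged this way could then be set to NA, or resolved by checking the data source or neighbouring rows.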