Cleaning Data in Python
Adel Nehme
VP of AI Curriculum, DataCamp

| Column | Same value in different units |
|---|---|
| Temperature | 32°C is also 89.6°F |
| Weight | 70 Kg is also 11 st. |
| Date | 26-11-2019 is also 26, November, 2019 |
| Money | 100$ is also 10763.90¥ |
temperatures = pd.read_csv('temperature.csv')
temperatures.head()
Date Temperature
0 03.03.19 14.0
1 04.03.19 15.0
2 05.03.19 18.0
3 06.03.19 16.0
4 07.03.19 62.6 <--
# Import matplotlib
import matplotlib.pyplot as plt
# Create scatter plot
plt.scatter(x = 'Date', y = 'Temperature', data = temperatures)
# Create title, xlabel and ylabel
plt.title('Temperature in Celsius March 2019 - NYC')
plt.xlabel('Dates')
plt.ylabel('Temperature in Celsius')
# Show plot
plt.show()


$$C = (F - 32) \times \frac{5}{9}$$
temp_fah = temperatures.loc[temperatures['Temperature'] > 40, 'Temperature']
temp_cels = (temp_fah - 32) * (5/9)
temperatures.loc[temperatures['Temperature'] > 40, 'Temperature'] = temp_cels
# Assert conversion is correct
assert temperatures['Temperature'].max() < 40
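As a self-contained check of the conversion above, the following sketch rebuilds the five rows from the `head()` output as a small DataFrame (the full `temperature.csv` is not reproduced here) and applies the same threshold of 40. The name `is_fahrenheit` is introduced only for illustration, and the assumption that any reading above 40 is in Fahrenheit comes from the slide's own filter.

```python
import pandas as pd

# Toy data copied from the head() output above; the full CSV is assumed, not shown
temperatures = pd.DataFrame({
    "Date": ["03.03.19", "04.03.19", "05.03.19", "06.03.19", "07.03.19"],
    "Temperature": [14.0, 15.0, 18.0, 16.0, 62.6],
})

# Hypothetical helper mask: readings above 40 are assumed to be Fahrenheit
is_fahrenheit = temperatures["Temperature"] > 40

# Convert only the flagged rows with C = (F - 32) * 5/9
temperatures.loc[is_fahrenheit, "Temperature"] = (
    temperatures.loc[is_fahrenheit, "Temperature"] - 32
) * (5 / 9)

# Same sanity check as above
assert temperatures["Temperature"].max() < 40
```

Computing the mask once, before the assignment, means the same rows are selected for both reading and writing.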
birthdays.head()
Birthday First name Last name
0 27/27/19 Rowan Nunez
1 03-29-19 Brynn Yang
2 March 3rd, 2019 Sophia Reilly
3 24-03-19 Deacon Prince
4 06-03-19 Griffith Neal

datetime is useful for representing dates
| Date | datetime format |
|---|---|
| 25-12-2019 | %d-%m-%Y |
| December 25th 2019 | %B %dth %Y |
| 12-25-2019 | %m-%d-%Y |
| ... | ... |
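To make the format codes concrete, here is a minimal sketch using the standard library's `datetime.strptime` with the two numeric formats from the table. The variable names are illustrative only.

```python
from datetime import datetime

# Day-first string parsed with %d-%m-%Y
day_first = datetime.strptime("25-12-2019", "%d-%m-%Y")

# Month-first string parsed with %m-%d-%Y
month_first = datetime.strptime("12-25-2019", "%m-%d-%Y")

# Both strings describe the same calendar day
assert day_first == month_first == datetime(2019, 12, 25)

# A format that does not match the string raises ValueError
try:
    datetime.strptime("25-12-2019", "%m-%d-%Y")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The last step shows why the format has to match the data: 25 is not a valid month, so the month-first format rejects the day-first string.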
pandas.to_datetime()
# Converts to datetime - but won't work!
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'])
ValueError: month must be in 1..12
# Will work!
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'],
                                       # Return NA for rows where conversion failed
                                       errors = 'coerce')
birthdays.head()
Birthday First name Last name
0 NaT Rowan Nunez
1 2019-03-29 Brynn Yang
2 2019-03-03 Sophia Reilly
3 2019-03-24 Deacon Prince
4 2019-06-03 Griffith Neal
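Because `errors = 'coerce'` silently turns unparseable values into `NaT`, it can help to keep the original strings and list the rows that failed. The sketch below uses a toy Series rather than the real `birthdays` data, and the explicit `format="%d-%m-%y"` is an assumption made for illustration; it is not the call used above.

```python
import pandas as pd

# Toy strings in the spirit of the Birthday column (not the real dataset)
raw = pd.Series(["27/27/19", "24-03-19", "March 3rd, 2019", "06-03-19"])

# Assumed format: day-month-two-digit-year; anything else becomes NaT
parsed = pd.to_datetime(raw, format="%d-%m-%y", errors="coerce")

# Rows where parsing failed, shown with their original text
failed = raw[parsed.isna()]
print(failed.tolist())
```

Inspecting `failed` before discarding or imputing those rows shows whether the problem is a bad value (such as a month of 27) or simply a format the parser was not told about.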
birthdays['Birthday'] = birthdays['Birthday'].dt.strftime("%d-%m-%Y")
birthdays.head()
Birthday First name Last name
0 NaT Rowan Nunez
1 29-03-2019 Brynn Yang
2 03-03-2019 Sophia Reilly
3 24-03-2019 Deacon Prince
4 03-06-2019 Griffith Neal
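Note that `.dt.strftime()` returns text, so overwriting the column gives up its datetime behaviour. One possible alternative, sketched here on toy data, is to store the formatted text in a separate column; `Birthday_text` is a hypothetical name, not part of the original dataset.

```python
import pandas as pd

# Toy datetime column with one missing value (NaT)
birthdays = pd.DataFrame({
    "Birthday": pd.to_datetime(["2019-03-29", None, "2019-06-03"]),
})

# Store the formatted text in a separate, hypothetical column
birthdays["Birthday_text"] = birthdays["Birthday"].dt.strftime("%d-%m-%Y")

# The original column still supports datetime operations
months = birthdays["Birthday"].dt.month
```

Keeping both columns lets the datetime version drive sorting and filtering while the text version is used for display.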
Is 2019-03-08 in August or March?
Convert to NA and treat accordingly
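The ambiguity in 2019-03-08 can be made concrete with a short standard-library sketch: the same string parses without error under two different formats and yields two different dates. The variable names are illustrative only.

```python
from datetime import datetime

ambiguous = "2019-03-08"

# Reading the string as year-month-day gives 8 March
as_march = datetime.strptime(ambiguous, "%Y-%m-%d")

# Reading it as year-day-month gives 3 August
as_august = datetime.strptime(ambiguous, "%Y-%d-%m")

print(as_march.strftime("%d %B %Y"))
print(as_august.strftime("%d %B %Y"))
```

Since neither parse fails, the parser alone cannot settle which reading is right; that decision has to come from knowledge of the data source.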