Cleaning Data in Python
Adel Nehme
Content Developer @ DataCamp
Type of data | Example values |
---|---|
Names | Alex , Sara ... |
Phone numbers | +96171679912 ... |
Emails | `[email protected]`.. |
Passwords | ... |
Common text data problems
1) Data inconsistency:
+96171679912
or 0096171679912
or ..?
2) Fixed length violations:
Passwords needs to be at least 8 characters
3) Typos:
+961.71.679912
phones = pd.read_csv('phones.csv')
print(phones)
Full name Phone number
0 Noelani A. Gray 001-702-397-5143
1 Myles Z. Gomez 001-329-485-0540
2 Gil B. Silva 001-195-492-2338
3 Prescott D. Hardin +1-297-996-4904
4 Benedict G. Valdez 001-969-820-3536
5 Reece M. Andrews 4138
6 Hayfa E. Keith 001-536-175-8444
7 Hedley I. Logan 001-681-552-1823
8 Jack W. Carrillo 001-910-323-5265
9 Lionel M. Davis 001-143-119-9210
phones = pd.read_csv('phones.csv')
print(phones)
Full name Phone number
0 Noelani A. Gray 001-702-397-5143
1 Myles Z. Gomez 001-329-485-0540
2 Gil B. Silva 001-195-492-2338
3 Prescott D. Hardin +1-297-996-4904 <-- Inconsistent data format
4 Benedict G. Valdez 001-969-820-3536
5 Reece M. Andrews 4138 <-- Length violation
6 Hayfa E. Keith 001-536-175-8444
7 Hedley I. Logan 001-681-552-1823
8 Jack W. Carrillo 001-910-323-5265
9 Lionel M. Davis 001-143-119-9210
phones = pd.read_csv('phones.csv')
print(phones)
Full name Phone number
0 Noelani A. Gray 0017023975143
1 Myles Z. Gomez 0013294850540
2 Gil B. Silva 0011954922338
3 Prescott D. Hardin 0012979964904
4 Benedict G. Valdez 0019698203536
5 Reece M. Andrews NaN
6 Hayfa E. Keith 0015361758444
7 Hedley I. Logan 0016815521823
8 Jack W. Carrillo 0019103235265
9 Lionel M. Davis 0011431199210
# Replace "+" with "00"
phones["Phone number"] = phones["Phone number"].str.replace("+", "00")
phones
Full name Phone number
0 Noelani A. Gray 001-702-397-5143
1 Myles Z. Gomez 001-329-485-0540
2 Gil B. Silva 001-195-492-2338
3 Prescott D. Hardin 001-297-996-4904
4 Benedict G. Valdez 001-969-820-3536
5 Reece M. Andrews 4138
6 Hayfa E. Keith 001-536-175-8444
7 Hedley I. Logan 001-681-552-1823
8 Jack W. Carrillo 001-910-323-5265
9 Lionel M. Davis 001-143-119-9210
# Replace "-" with nothing
phones["Phone number"] = phones["Phone number"].str.replace("-", "")
phones
Full name Phone number
0 Noelani A. Gray 0017023975143
1 Myles Z. Gomez 0013294850540
2 Gil B. Silva 0011954922338
3 Prescott D. Hardin 0012979964904
4 Benedict G. Valdez 0019698203536
5 Reece M. Andrews 4138
6 Hayfa E. Keith 0015361758444
7 Hedley I. Logan 0016815521823
8 Jack W. Carrillo 0019103235265
9 Lionel M. Davis 0011431199210
# Replace phone numbers with lower than 10 digits to NaN
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, "Phone number"] = np.nan
phones
Full name Phone number
0 Noelani A. Gray 0017023975143
1 Myles Z. Gomez 0013294850540
2 Gil B. Silva 0011954922338
3 Prescott D. Hardin 0012979964904
4 Benedict G. Valdez 0019698203536
5 Reece M. Andrews NaN
6 Hayfa E. Keith 0015361758444
7 Hedley I. Logan 0016815521823
8 Jack W. Carrillo 0019103235265
9 Lionel M. Davis 0011431199210
# Find length of each row in Phone number column
sanity_check = phone['Phone number'].str.len()
# Assert minmum phone number length is 10
assert sanity_check.min() >= 10
# Assert all numbers do not have "+" or "-"
assert phone['Phone number'].str.contains("+|-").any() == False
Remember, assert
returns nothing if the condition passes
phones.head()
Full name Phone number
0 Olga Robinson +(01706)-25891
1 Justina Kim +0500-571437
2 Tamekah Henson +0800-1111
3 Miranda Solis +07058-879063
4 Caldwell Gilliam +(016977)-8424
Supercharged control + F
# Replace letters with nothing
phones['Phone number'] = phones['Phone number'].str.replace(r'\D+', '')
phones.head()
Full name Phone number
0 Olga Robinson 0170625891
1 Justina Kim 0500571437
2 Tamekah Henson 08001111
3 Miranda Solis 07058879063
4 Caldwell Gilliam 0169778424
Cleaning Data in Python