Data opschonen in Python
Adel Nehme
VP of AI Curriculum, DataCamp
| Type gegevens | Voorbeelden |
|---|---|
| Namen | Alex, Sara ... |
| Telefoonnummers | +96171679912 ... |
| E-mails | `[email protected]`.. |
| Wachtwoorden | ... |
Veelvoorkomende tekstproblemen
1) Inconsistentie:
+96171679912 of 0096171679912 of ..?
2) Vaste-lengteovertreding:
Wachtwoorden moeten minstens 8 tekens zijn
3) Typfouten:
+961.71.679912
phones = pd.read_csv('phones.csv')
print(phones)
Full name Phone number
0 Noelani A. Gray 001-702-397-5143
1 Myles Z. Gomez 001-329-485-0540
2 Gil B. Silva 001-195-492-2338
3 Prescott D. Hardin +1-297-996-4904
4 Benedict G. Valdez 001-969-820-3536
5 Reece M. Andrews 4138
6 Hayfa E. Keith 001-536-175-8444
7 Hedley I. Logan 001-681-552-1823
8 Jack W. Carrillo 001-910-323-5265
9 Lionel M. Davis 001-143-119-9210
phones = pd.read_csv('phones.csv')
print(phones)
Full name Phone number
0 Noelani A. Gray 001-702-397-5143
1 Myles Z. Gomez 001-329-485-0540
2 Gil B. Silva 001-195-492-2338
3 Prescott D. Hardin +1-297-996-4904 <-- Inconsistent formaat
4 Benedict G. Valdez 001-969-820-3536
5 Reece M. Andrews 4138 <-- Lengte-overtreding
6 Hayfa E. Keith 001-536-175-8444
7 Hedley I. Logan 001-681-552-1823
8 Jack W. Carrillo 001-910-323-5265
9 Lionel M. Davis 001-143-119-9210
phones = pd.read_csv('phones.csv')
print(phones)
Full name Phone number
0 Noelani A. Gray 0017023975143
1 Myles Z. Gomez 0013294850540
2 Gil B. Silva 0011954922338
3 Prescott D. Hardin 0012979964904
4 Benedict G. Valdez 0019698203536
5 Reece M. Andrews NaN
6 Hayfa E. Keith 0015361758444
7 Hedley I. Logan 0016815521823
8 Jack W. Carrillo 0019103235265
9 Lionel M. Davis 0011431199210
# Vervang "+" door "00"
phones["Phone number"] = phones["Phone number"].str.replace("+", "00")
phones
Full name Phone number
0 Noelani A. Gray 001-702-397-5143
1 Myles Z. Gomez 001-329-485-0540
2 Gil B. Silva 001-195-492-2338
3 Prescott D. Hardin 001-297-996-4904
4 Benedict G. Valdez 001-969-820-3536
5 Reece M. Andrews 4138
6 Hayfa E. Keith 001-536-175-8444
7 Hedley I. Logan 001-681-552-1823
8 Jack W. Carrillo 001-910-323-5265
9 Lionel M. Davis 001-143-119-9210
# Vervang "-" door niks
phones["Phone number"] = phones["Phone number"].str.replace("-", "")
phones
Full name Phone number
0 Noelani A. Gray 0017023975143
1 Myles Z. Gomez 0013294850540
2 Gil B. Silva 0011954922338
3 Prescott D. Hardin 0012979964904
4 Benedict G. Valdez 0019698203536
5 Reece M. Andrews 4138
6 Hayfa E. Keith 0015361758444
7 Hedley I. Logan 0016815521823
8 Jack W. Carrillo 0019103235265
9 Lionel M. Davis 0011431199210
# Vervang nummers met minder dan 10 cijfers door NaN
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, "Phone number"] = np.nan
phones
Full name Phone number
0 Noelani A. Gray 0017023975143
1 Myles Z. Gomez 0013294850540
2 Gil B. Silva 0011954922338
3 Prescott D. Hardin 0012979964904
4 Benedict G. Valdez 0019698203536
5 Reece M. Andrews NaN
6 Hayfa E. Keith 0015361758444
7 Hedley I. Logan 0016815521823
8 Jack W. Carrillo 0019103235265
9 Lionel M. Davis 0011431199210
# Bepaal de lengte per rij in de kolom Phone number
sanity_check = phone['Phone number'].str.len()
# Controleer: minimale lengte van een telefoonnummer is 10
assert sanity_check.min() >= 10
# Controleer: geen "+" of "-" meer
assert phone['Phone number'].str.contains("+|-").any() == False
Onthoud: assert geeft niks terug als de voorwaarde klopt
phones.head()
Full name Phone number
0 Olga Robinson +(01706)-25891
1 Justina Kim +0500-571437
2 Tamekah Henson +0800-1111
3 Miranda Solis +07058-879063
4 Caldwell Gilliam +(016977)-8424
Supersnelle Ctrl + F
# Vervang letters met niks
phones['Phone number'] = phones['Phone number'].str.replace(r'\D+', '')
phones.head()
Full name Phone number
0 Olga Robinson 0170625891
1 Justina Kim 0500571437
2 Tamekah Henson 08001111
3 Miranda Solis 07058879063
4 Caldwell Gilliam 0169778424
Data opschonen in Python