Cleaning Data in Python
Adel Nehme
VP of AI Curriculum, DataCamp
| Data type | Example values |
|---|---|
| Names | Alex, Sara ... |
| Phone numbers | +96171679912 ... |
| Emails | `[email protected]` ... |
| Passwords | ... |
Common text data problems
1) Data inconsistency:
+96171679912 or 0096171679912 or ..?
2) Fixed-length violations:
Passwords must be at least 8 characters
3) Typos:
+961.71.679912
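The three problems above can each be detected with pandas string methods before any cleaning is done. The following is a minimal sketch, not the course's own code: the sample values, the variable names, and the 10-character threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample values, made up for illustration
numbers = pd.Series(["+96171679912", "0096171679912", "+961.71.679912", "4138"])

# 1) Inconsistency: some numbers use a "+" prefix, others use "00"
starts_with_plus = numbers.str.startswith("+")
starts_with_00 = numbers.str.startswith("00")

# 2) Length violation: fewer than 10 characters (assumed threshold)
too_short = numbers.str.len() < 10

# 3) Typos: any character that is neither a digit nor "+"
has_stray_chars = numbers.str.contains(r"[^\d+]", regex=True)

report = pd.DataFrame({
    "number": numbers,
    "plus_prefix": starts_with_plus,
    "00_prefix": starts_with_00,
    "too_short": too_short,
    "stray_chars": has_stray_chars,
})
print(report)
```

A report like this only flags suspicious rows; deciding how to repair or drop them is a separate step.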
import pandas as pd
phones = pd.read_csv('phones.csv')
print(phones)
Full name Phone number
0 Noelani A. Gray 001-702-397-5143
1 Myles Z. Gomez 001-329-485-0540
2 Gil B. Silva 001-195-492-2338
3 Prescott D. Hardin +1-297-996-4904
4 Benedict G. Valdez 001-969-820-3536
5 Reece M. Andrews 4138
6 Hayfa E. Keith 001-536-175-8444
7 Hedley I. Logan 001-681-552-1823
8 Jack W. Carrillo 001-910-323-5265
9 Lionel M. Davis 001-143-119-9210
phones = pd.read_csv('phones.csv')
print(phones)
Full name Phone number
0 Noelani A. Gray 001-702-397-5143
1 Myles Z. Gomez 001-329-485-0540
2 Gil B. Silva 001-195-492-2338
3 Prescott D. Hardin +1-297-996-4904 <-- Inconsistent data format
4 Benedict G. Valdez 001-969-820-3536
5 Reece M. Andrews 4138 <-- Length violation
6 Hayfa E. Keith 001-536-175-8444
7 Hedley I. Logan 001-681-552-1823
8 Jack W. Carrillo 001-910-323-5265
9 Lionel M. Davis 001-143-119-9210
phones = pd.read_csv('phones.csv')
print(phones)
Full name Phone number
0 Noelani A. Gray 0017023975143
1 Myles Z. Gomez 0013294850540
2 Gil B. Silva 0011954922338
3 Prescott D. Hardin 0012979964904
4 Benedict G. Valdez 0019698203536
5 Reece M. Andrews NaN
6 Hayfa E. Keith 0015361758444
7 Hedley I. Logan 0016815521823
8 Jack W. Carrillo 0019103235265
9 Lionel M. Davis 0011431199210
# Replace "+" with "00" (regex=False treats "+" as a literal character)
phones["Phone number"] = phones["Phone number"].str.replace("+", "00", regex=False)
phones
Full name Phone number
0 Noelani A. Gray 001-702-397-5143
1 Myles Z. Gomez 001-329-485-0540
2 Gil B. Silva 001-195-492-2338
3 Prescott D. Hardin 001-297-996-4904
4 Benedict G. Valdez 001-969-820-3536
5 Reece M. Andrews 4138
6 Hayfa E. Keith 001-536-175-8444
7 Hedley I. Logan 001-681-552-1823
8 Jack W. Carrillo 001-910-323-5265
9 Lionel M. Davis 001-143-119-9210
# Replace "-" with nothing
phones["Phone number"] = phones["Phone number"].str.replace("-", "", regex=False)
phones
Full name Phone number
0 Noelani A. Gray 0017023975143
1 Myles Z. Gomez 0013294850540
2 Gil B. Silva 0011954922338
3 Prescott D. Hardin 0012979964904
4 Benedict G. Valdez 0019698203536
5 Reece M. Andrews 4138
6 Hayfa E. Keith 0015361758444
7 Hedley I. Logan 0016815521823
8 Jack W. Carrillo 0019103235265
9 Lionel M. Davis 0011431199210
import numpy as np
# Replace phone numbers with fewer than 10 digits with NaN
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, "Phone number"] = np.nan
phones
Full name Phone number
0 Noelani A. Gray 0017023975143
1 Myles Z. Gomez 0013294850540
2 Gil B. Silva 0011954922338
3 Prescott D. Hardin 0012979964904
4 Benedict G. Valdez 0019698203536
5 Reece M. Andrews NaN
6 Hayfa E. Keith 0015361758444
7 Hedley I. Logan 0016815521823
8 Jack W. Carrillo 0019103235265
9 Lionel M. Davis 0011431199210
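The three cleaning steps above can be bundled into one reusable function. This is a sketch under assumptions: `clean_phone_numbers` and its `min_length` parameter are hypothetical names introduced here, and the sample rows are copied from the output shown earlier.

```python
import numpy as np
import pandas as pd


def clean_phone_numbers(numbers: pd.Series, min_length: int = 10) -> pd.Series:
    """Hypothetical helper combining the steps shown above."""
    # Step 1: replace the "+" prefix with "00" (literal match, not a regex)
    cleaned = numbers.str.replace("+", "00", regex=False)
    # Step 2: drop the "-" separators
    cleaned = cleaned.str.replace("-", "", regex=False)
    # Step 3: replace values shorter than min_length with NaN
    return cleaned.mask(cleaned.str.len() < min_length, np.nan)


sample = pd.Series(
    ["001-702-397-5143", "+1-297-996-4904", "4138"], name="Phone number"
)
result = clean_phone_numbers(sample)
print(result)
```

Returning a new Series instead of modifying the DataFrame in place keeps the original column available for comparison.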
# Find the length of each row in the Phone number column
sanity_check = phones['Phone number'].str.len()
# Assert that the minimum phone number length is 10
assert sanity_check.min() >= 10
# Assert that no phone number contains "+" or "-" ("+" must be escaped in a regex)
assert not phones['Phone number'].str.contains(r"\+|-", na=False).any()
Remember, assert returns nothing if the condition passes
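For contrast, a failing condition raises an `AssertionError`. The sketch below uses a small made-up Series in which one value is deliberately too short; the variable names and the error message are illustrative assumptions.

```python
import pandas as pd

# Hypothetical values: the second one is deliberately too short
numbers = pd.Series(["0017023975143", "4138"])

# Passes silently: every value consists of digits only
assert numbers.str.isdigit().all()

# Fails: the shortest value has only 4 characters
try:
    assert numbers.str.len().min() >= 10, "found a phone number shorter than 10 characters"
    failed = False
except AssertionError as error:
    failed = True
    message = str(error)
    print("Check failed:", message)
```

Adding a message after the condition makes a failed check easier to trace back to the rule that was violated.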
phones.head()
Full name Phone number
0 Olga Robinson +(01706)-25891
1 Justina Kim +0500-571437
2 Tamekah Henson +0800-1111
3 Miranda Solis +07058-879063
4 Caldwell Gilliam +(016977)-8424
Regular expressions: Control + F, supercharged
# Replace every non-digit character with nothing
phones['Phone number'] = phones['Phone number'].str.replace(r'\D+', '', regex=True)
phones.head()
Full name Phone number
0 Olga Robinson 0170625891
1 Justina Kim 0500571437
2 Tamekah Henson 08001111
3 Miranda Solis 07058879063
4 Caldwell Gilliam 0169778424
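The pattern `\D+` matches one or more consecutive characters that are not digits, so replacing every match with an empty string leaves only the digits. A minimal sketch using three of the values shown above, plus the standard library's `re.sub` for comparison; the variable names are assumptions.

```python
import re

import pandas as pd

# Values taken from the phones.head() output above
raw = pd.Series(["+(01706)-25891", "+0500-571437", "+0800-1111"])

# regex=True tells pandas to treat the pattern as a regular expression
digits_only = raw.str.replace(r"\D+", "", regex=True)
print(digits_only.tolist())

# The same substitution on a single string with the standard library
single = re.sub(r"\D+", "", "+(016977)-8424")
print(single)
```

Passing `regex=True` explicitly matters because recent pandas versions treat the pattern as a literal string by default.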