Cleaning text data

Cleaning Data in Python

Adel Nehme

Content Developer @ DataCamp

What is text data?

Type of data Example values
Names Alex, Sara ...
Phone numbers +96171679912 ...
Emails `[email protected]`..
Passwords ...

 

Common text data problems

1) Data inconsistency:  

+96171679912 or 0096171679912 or ..?

2) Fixed length violations:  

Passwords needs to be at least 8 characters

3) Typos:  

+961.71.679912

Cleaning Data in Python

Example

phones = pd.read_csv('phones.csv')
print(phones)
              Full name      Phone number
0       Noelani A. Gray  001-702-397-5143
1        Myles Z. Gomez  001-329-485-0540
2          Gil B. Silva  001-195-492-2338
3    Prescott D. Hardin   +1-297-996-4904
4    Benedict G. Valdez  001-969-820-3536
5      Reece M. Andrews              4138
6        Hayfa E. Keith  001-536-175-8444
7       Hedley I. Logan  001-681-552-1823
8      Jack W. Carrillo  001-910-323-5265
9       Lionel M. Davis  001-143-119-9210
Cleaning Data in Python

Example

phones = pd.read_csv('phones.csv')
print(phones)
              Full name      Phone number
0       Noelani A. Gray  001-702-397-5143
1        Myles Z. Gomez  001-329-485-0540
2          Gil B. Silva  001-195-492-2338
3    Prescott D. Hardin   +1-297-996-4904   <-- Inconsistent data format
4    Benedict G. Valdez  001-969-820-3536
5      Reece M. Andrews              4138   <-- Length violation
6        Hayfa E. Keith  001-536-175-8444
7       Hedley I. Logan  001-681-552-1823
8      Jack W. Carrillo  001-910-323-5265
9       Lionel M. Davis  001-143-119-9210
Cleaning Data in Python

Example

phones = pd.read_csv('phones.csv')
print(phones)
              Full name   Phone number
0       Noelani A. Gray  0017023975143
1        Myles Z. Gomez  0013294850540
2          Gil B. Silva  0011954922338
3    Prescott D. Hardin  0012979964904
4    Benedict G. Valdez  0019698203536
5      Reece M. Andrews            NaN
6        Hayfa E. Keith  0015361758444
7       Hedley I. Logan  0016815521823
8      Jack W. Carrillo  0019103235265
9       Lionel M. Davis  0011431199210
Cleaning Data in Python

Fixing the phone number column

# Replace "+" with "00"
phones["Phone number"] = phones["Phone number"].str.replace("+", "00")
phones
            Full name      Phone number
0     Noelani A. Gray  001-702-397-5143
1      Myles Z. Gomez  001-329-485-0540
2        Gil B. Silva  001-195-492-2338
3  Prescott D. Hardin  001-297-996-4904
4  Benedict G. Valdez  001-969-820-3536
5    Reece M. Andrews              4138
6      Hayfa E. Keith  001-536-175-8444
7     Hedley I. Logan  001-681-552-1823
8    Jack W. Carrillo  001-910-323-5265
9     Lionel M. Davis  001-143-119-9210
Cleaning Data in Python

Fixing the phone number column

# Replace "-" with nothing
phones["Phone number"] = phones["Phone number"].str.replace("-", "")
phones
            Full name   Phone number
0     Noelani A. Gray  0017023975143
1      Myles Z. Gomez  0013294850540
2        Gil B. Silva  0011954922338
3  Prescott D. Hardin  0012979964904
4  Benedict G. Valdez  0019698203536
5    Reece M. Andrews           4138
6      Hayfa E. Keith  0015361758444
7     Hedley I. Logan  0016815521823
8    Jack W. Carrillo  0019103235265
9     Lionel M. Davis  0011431199210
Cleaning Data in Python

Fixing the phone number column

# Replace phone numbers with lower than 10 digits to NaN
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, "Phone number"] = np.nan
phones
              Full name   Phone number
0       Noelani A. Gray  0017023975143
1        Myles Z. Gomez  0013294850540
2          Gil B. Silva  0011954922338
3    Prescott D. Hardin  0012979964904
4    Benedict G. Valdez  0019698203536
5      Reece M. Andrews            NaN
6        Hayfa E. Keith  0015361758444
7       Hedley I. Logan  0016815521823
8      Jack W. Carrillo  0019103235265
9       Lionel M. Davis  0011431199210
Cleaning Data in Python

Fixing the phone number column

# Find length of each row in Phone number column
sanity_check = phone['Phone number'].str.len()
# Assert minmum phone number length is 10
assert sanity_check.min() >= 10
# Assert all numbers do not have "+" or "-"
assert phone['Phone number'].str.contains("+|-").any() == False

                                                Remember, assert returns nothing if the condition passes

Cleaning Data in Python

But what about more complicated examples?

phones.head()
          Full name    Phone number
0     Olga Robinson  +(01706)-25891
1       Justina Kim    +0500-571437
2    Tamekah Henson      +0800-1111
3     Miranda Solis   +07058-879063
4  Caldwell Gilliam  +(016977)-8424

 

                                                                                        Supercharged control + F

Cleaning Data in Python

Regular expressions in action

# Replace letters with nothing
phones['Phone number'] = phones['Phone number'].str.replace(r'\D+', '')
phones.head()
          Full name Phone number
0     Olga Robinson   0170625891
1       Justina Kim   0500571437
2    Tamekah Henson     08001111
3     Miranda Solis  07058879063
4  Caldwell Gilliam   0169778424
Cleaning Data in Python

Let's practice!

Cleaning Data in Python

Preparing Video For Download...