Dealing with other data issues

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Bad characters

print(df['RawSalary'].dtype)
object

print(df['RawSalary'].head())
0          NaN
1    70,841.00
2          NaN
3    21,426.00
4    41,671.00
Name: RawSalary, dtype: object

Dealing with bad characters

# Strip the thousands separator, then convert the cleaned strings to floats
df['RawSalary'] = df['RawSalary'].str.replace(',', '', regex=False)
df['RawSalary'] = df['RawSalary'].astype('float')
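
A minimal, self-contained sketch of the same two steps. The small DataFrame here is a hypothetical stand-in that mirrors the values printed earlier, not the course dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the values shown on the earlier slide
df = pd.DataFrame(
    {'RawSalary': [np.nan, '70,841.00', np.nan, '21,426.00', '41,671.00']}
)

# Remove the comma as a literal string; missing values stay missing
cleaned = df['RawSalary'].str.replace(',', '', regex=False)

# With the commas gone, the strings can be parsed as floats
df['RawSalary'] = cleaned.astype('float')

print(df['RawSalary'].dtype)
```

Note that `astype('float')` raises an error if any other non-numeric character remains in the column, which is why the next step looks for stray characters.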

Finding other stray characters

coerced_vals = pd.to_numeric(df['RawSalary'], 
                             errors='coerce')

print(df.loc[coerced_vals.isna(), 'RawSalary'].head())
0           NaN
2           NaN
4     $51408.00
Name: RawSalary, dtype: object
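
Rows that were already missing also show up in the result above. One possible refinement, sketched here on hypothetical sample data, is to keep only the rows that held a value but failed to parse. The names `raw` and `failed` are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: commas already removed, one value still has a '$'
raw = pd.Series(
    [np.nan, '70841.00', np.nan, '21426.00', '$51408.00'],
    name='RawSalary',
)

# Values that cannot be parsed as numbers become NaN instead of raising
coerced_vals = pd.to_numeric(raw, errors='coerce')

# True only where a value was present originally but could not be converted
failed = coerced_vals.isna() & raw.notna()

print(raw[failed])
```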

Chaining methods

df['column_name'] = df['column_name'].method1()
df['column_name'] = df['column_name'].method2()
df['column_name'] = df['column_name'].method3()

This is equivalent to:

df['column_name'] = df['column_name']\
                     .method1().method2().method3()
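
As a concrete illustration of chaining, the sketch below applies it to the salary cleanup. The sample data is hypothetical, and wrapping the chain in parentheses is a stylistic alternative to the backslash continuation used above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data containing both kinds of bad characters
df = pd.DataFrame({'RawSalary': [np.nan, '70,841.00', '$51,408.00']})

# Each method returns a new Series, so the calls can be chained
df['RawSalary'] = (
    df['RawSalary']
    .str.replace(',', '', regex=False)   # drop thousands separators
    .str.replace('$', '', regex=False)   # drop the currency symbol
    .astype('float')                     # convert the cleaned strings
)

print(df['RawSalary'])
```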

Go ahead and fix bad characters!
