Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
Goal: predict the selling price of a house
Target column: SALESCLOSEPRICE
DataFrame.count() for row count
df.count()
5000
DataFrame.columns for a list of columns
df.columns
['No.', 'MLSID', 'StreetNumberNumeric', ... ]
len(DataFrame.columns) for the number of columns
len(df.columns)
74
DataFrame.dtypes
df.dtypes
[('No.', 'int'), ('MLSID', 'string'), ... ]