Defining A Problem

Feature Engineering with PySpark

John Hogue

Lead Data Scientist, General Mills

What’s Your Problem?

Predict the selling price of a house

  • Given: the listed price and features
    • $X$, independent 'known' variables
  • How much to buy the house for
    • $Y$, dependent 'unknown' variable
    • SALESCLOSEPRICE
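
The split into $X$ and $Y$ can be sketched in plain Python over a list of column names. `split_features_target` is a hypothetical helper, not part of PySpark; only the target name SALESCLOSEPRICE comes from the slide.

```python
def split_features_target(columns, target="SALESCLOSEPRICE"):
    """Return (feature_columns, target) from a list of column names."""
    if target not in columns:
        raise ValueError(f"target column {target!r} not found")
    # Every column except the target is a candidate feature (X).
    features = [c for c in columns if c != target]
    return features, target

# With a PySpark DataFrame `df`, this would typically be called as:
# feature_cols, label_col = split_features_target(df.columns)
```

Not every remaining column is a useful feature (row numbers and IDs, for example), so the returned list is a starting point rather than a final feature set.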

Houses for Sale

Context & Limitations of Our Real Estate Data

  • Homes sold in the St. Paul, MN area
    • Includes several suburbs
  • Real Estate Types
    • Residential-Single
    • Residential-Multi-Family
  • Full Year of Data
    • Impact of seasonality

St Paul, MN

What types of attributes are available?

  • Dates
    • Date Listed
    • Year Built
  • Location
    • City
    • School District
    • Address
  • Size
    • # Bedrooms & Bathrooms
    • Living Area
  • Price
    • List Price
    • Sales Closing Price
  • Amenities
    • Pool
    • Fireplace
    • Garage
  • Construction Materials
    • Siding
    • Roofing

Validating Your Data Load

  • DataFrame.count() for row count
df.count()
5000
  • DataFrame.columns for a list of columns
df.columns
['No.', 'MLSID', 'StreetNumberNumeric', ... ]
  • Length of DataFrame.columns for the number of columns
len(df.columns)
74
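
The row and column checks above could be wrapped into one reusable step. This is a minimal sketch: `validate_load` is a hypothetical helper, and it only assumes the object has a `count()` method and a `columns` attribute, as a PySpark DataFrame does.

```python
def validate_load(df, expected_rows, expected_cols):
    """Compare a DataFrame's shape to expected values.

    Returns a list of problem messages; an empty list means the load looks right.
    """
    problems = []
    n_rows = df.count()        # on a real PySpark DataFrame this runs a job
    n_cols = len(df.columns)   # column names come from the schema
    if n_rows != expected_rows:
        problems.append(f"row count {n_rows} != expected {expected_rows}")
    if n_cols != expected_cols:
        problems.append(f"column count {n_cols} != expected {expected_cols}")
    return problems

# Typical use with the values shown above:
# assert validate_load(df, expected_rows=5000, expected_cols=74) == []
```

Returning messages instead of raising immediately lets both checks run, so one report shows everything that is off.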

Checking Datatypes

DataFrame.dtypes

  • Returns a list of (column name, data type) tuples
df.dtypes
[('No.', 'int'), ('MLSID', 'string'), ... ]
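
Because `dtypes` is a list of tuples, it converts directly to a dictionary, which makes it easy to compare against the types you expect. The sketch below assumes only that the object exposes `dtypes` in that form; `find_dtype_mismatches` is a hypothetical helper name.

```python
def find_dtype_mismatches(df, expected):
    """Return {column: (expected_type, actual_type)} for columns that differ.

    actual_type is None when the column is missing from the DataFrame.
    """
    actual = dict(df.dtypes)  # [(name, type), ...] -> {name: type}
    mismatches = {}
    for col, want in expected.items():
        got = actual.get(col)
        if got != want:
            mismatches[col] = (want, got)
    return mismatches

# Example call on a PySpark DataFrame `df`:
# find_dtype_mismatches(df, {"MLSID": "string"})
```

An empty result means every listed column was loaded with the expected type.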

Let's Practice