Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
df.describe(['LISTPRICE']).show()
+-------+------------------+
|summary|         LISTPRICE|
+-------+------------------+
|  count|              5000|
|   mean|        263419.365|
| stddev|143944.10818036905|
|    min|            100000|
|    max|             99999|
+-------+------------------+
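Note that the reported min (100000) is larger than the reported max (99999). That is the pattern one would expect if LISTPRICE were stored as a string column, because strings are compared character by character rather than by numeric value. This is an assumption about the data, not something the slides state. A minimal pure-Python sketch of the effect, using made-up prices:

```python
# Hypothetical list prices held as text, e.g. after loading a CSV without a
# schema. The values are made up for illustration.
prices_as_text = ["250000", "99999", "100000", "415000"]

# String comparison is lexicographic: anything starting with "1" sorts
# before anything starting with "9", regardless of length.
text_min = min(prices_as_text)
text_max = max(prices_as_text)

# Converting to integers first restores numeric ordering.
prices = [int(p) for p in prices_as_text]
num_min = min(prices)
num_max = max(prices)

print("as text:   ", text_min, text_max)
print("as numbers:", num_min, num_max)
```

If this is what is happening, casting the column to a numeric type before calling describe(), for example with df.withColumn('LISTPRICE', df['LISTPRICE'].cast('double')), should give a numeric min and max.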
pyspark.sql.functions.mean(col)
pyspark.sql.functions.skewness(col)
pyspark.sql.functions.min(col)
cov(col1, col2)
corr(col1, col2)
mean(col)
df.agg({'SALESCLOSEPRICE': 'mean'}).collect()
[Row(avg(SALESCLOSEPRICE)=262804.4668)]
cov(col1, col2)
df.cov('SALESCLOSEPRICE', 'YEARBUILT')
1281910.3840634783
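A raw covariance such as the one above is hard to read on its own because its size depends on the units of both columns. DataFrame.cov returns the sample covariance, and corr rescales that quantity to the range -1 to 1. The following is a minimal pure-Python sketch of both calculations on made-up numbers; the helper names sample_cov and pearson_corr are hypothetical and are not part of PySpark.

```python
from math import sqrt

# Made-up paired observations (hypothetical, not from the housing data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def sample_cov(a, b):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def pearson_corr(a, b):
    """Pearson correlation: covariance divided by both standard deviations."""
    return sample_cov(a, b) / sqrt(sample_cov(a, a) * sample_cov(b, b))


cov_xy = sample_cov(x, y)
corr_xy = pearson_corr(x, y)
print(cov_xy, corr_xy)
```

Multiplying every value in x by 1000 would multiply the covariance by 1000 but leave the correlation unchanged, which is why corr is usually the easier number to compare across column pairs.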
Plotting PySpark DataFrames with standard libraries like Seaborn requires conversion to Pandas
WARNING: Sample PySpark DataFrames before converting to Pandas!
sample(withReplacement, fraction, seed=None)
withReplacement: allow repeats in sample
fraction: fraction of records to keep (0.0 to 1.0)
seed: random seed for reproducibility
# Sample 50% of the PySpark DataFrame and count rows
df.sample(False, 0.5, 42).count()
2504
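The count above is 2504 rather than exactly 2500 because sampling without replacement does not guarantee the exact fraction: each row is kept or dropped independently with the given probability. The following pure-Python sketch imitates that behaviour; bernoulli_sample is a hypothetical helper written for illustration, not Spark's implementation, and its counts will not match Spark's for the same seed.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


# 5000 stand-in rows, mirroring the row count shown by describe().
rows = list(range(5000))

sample_a = bernoulli_sample(rows, 0.5, seed=42)
sample_b = bernoulli_sample(rows, 0.5, seed=42)

# The size is close to 2500 but usually not exactly 2500.
print(len(sample_a))
# The same seed reproduces the same sample.
print(sample_a == sample_b)
```

Fixing the seed is what makes the sampled plots reproducible from run to run.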
Seaborn distplot()
seaborn.distplot(a)
a: Series, 1d-array, or list. Observed data.
# Import your favorite visualization library
import seaborn as sns
# Sample the dataframe
sample_df = df.select(['SALESCLOSEPRICE']).sample(False, 0.5, 42)
# Convert the sample to a Pandas DataFrame
pandas_df = sample_df.toPandas()
# Plot it
sns.distplot(pandas_df)
Seaborn lmplot()
seaborn.lmplot(x, y, data)
x, y: strings. Input variables; these should be column names in data.
data: Pandas DataFrame
# Import your favorite visualization library
import seaborn as sns
# Select columns
s_df = df.select(['SALESCLOSEPRICE', 'SQFTABOVEGROUND'])
# Sample dataframe
s_df = s_df.sample(False, 0.5, 42)
# Convert to Pandas DataFrame
pandas_df = s_df.toPandas()
# Plot it
sns.lmplot(x='SQFTABOVEGROUND', y='SALESCLOSEPRICE', data=pandas_df)