Predicting and evaluating

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

Applying a model to evaluation data

predicted = df_trained.transform(df_test)
  • prediction column: double
  • probability column: vector of length two
x = predicted.first()
print("Right!" if x.label == int(x.prediction) else "Wrong")
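
The single-row check above extends naturally to every row. This is a minimal sketch that runs without a Spark session: the `Row` namedtuple and the four sample rows are hypothetical stand-ins for what `predicted.select("label", "prediction").collect()` would return, and `accuracy` is an illustrative helper, not part of the course code.

```python
from collections import namedtuple

# Hypothetical stand-in for Spark Row objects; the values are made up.
Row = namedtuple("Row", ["label", "prediction"])

rows = [
    Row(label=1, prediction=1.0),
    Row(label=0, prediction=1.0),
    Row(label=0, prediction=0.0),
    Row(label=1, prediction=1.0),
]

def accuracy(rows):
    """Fraction of rows whose integer label matches the double prediction."""
    correct = sum(1 for r in rows if r.label == int(r.prediction))
    return correct / len(rows)

print("Accuracy: %.2f" % accuracy(rows))
```

The `int(...)` cast mirrors the slide: the prediction column is a double, while the label here is assumed to be an integer.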

Evaluating classification accuracy

model_stats = model.evaluate(df_eval)
type(model_stats)
pyspark.ml.classification.BinaryLogisticRegressionSummary
print("\nPerformance: %.2f" % model_stats.areaUnderROC)
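
To build intuition for what `areaUnderROC` measures, here is a minimal pure-Python sketch. It uses the pairwise definition of AUC: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counting half. This is for illustration only and is not how Spark computes the value; the function name, labels, and scores are made up.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Illustrative data: scores play the role of the positive-class probability.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

print("Performance: %.2f" % area_under_roc(labels, scores))
```

A value of 1.0 means every positive outranks every negative, and 0.5 is what random scoring would give on average.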

Example of classifying text

  • Positive labels:

    • ['her', 'him', 'he', 'she', 'them', 'us', 'they', 'himself', 'herself', 'we']
  • Number of examples: 5746 (2873 positive, 2873 negative)
  • Number of training examples: 4607
  • Number of test examples: 1139
  • training iterations: 21
  • Test AUC: 0.87

Predicting the endword

  • Positive label: 'it'

  • Number of examples: 438 (219 positive, 219 negative)
  • Number of training examples: 340
  • Number of test examples: 98
  • Test AUC: 0.85
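
The training and test counts on the two example slides can be sanity-checked with a little arithmetic. The sketch below only uses the counts shown on the slides; `split_fractions` is a hypothetical helper. The slides do not say how the split was made. If it came from a random split with weights near 0.8 and 0.2 (an assumption), the realized fractions would be approximate rather than exact, which is consistent with what the numbers show.

```python
def split_fractions(n_train, n_test):
    """Return (train fraction, test fraction) of the combined example count."""
    total = n_train + n_test
    return n_train / total, n_test / total

# Counts taken from the slides.
pronoun_train, pronoun_test = 4607, 1139   # text classification example
endword_train, endword_test = 340, 98      # endword example

for name, n_train, n_test in [
    ("pronouns", pronoun_train, pronoun_test),
    ("endword", endword_train, endword_test),
]:
    train_frac, test_frac = split_fractions(n_train, n_test)
    print("%s: total=%d train=%.2f test=%.2f"
          % (name, n_train + n_test, train_frac, test_frac))
```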

Let's practice!
