Text Classification

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

Endword Prediction

Sequence arrow

Endword

Endword bracket
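
As a plain-Python illustration of the idea on these slides, the sketch below splits a sentence into its preceding words and its final word. It assumes that "endword" means the last word of a word sequence; the helper name `split_endword` and the sample sentence are hypothetical and not part of the course material.

```python
def split_endword(sentence):
    """Return (preceding words, endword) for a whitespace-separated sentence."""
    words = sentence.lower().split()
    if not words:
        raise ValueError("sentence contains no words")
    # Everything before the last word is context; the last word is the endword.
    return words[:-1], words[-1]


context, endword = split_endword("I gave the book to her")
print(context, endword)
```

In this framing, the preceding words are what a model sees and the endword is what it tries to predict.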

Shuffle 1

Shuffle 2

Songs

Videos

Selecting the data

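from pyspark.sql.functions import lit

# Positive examples: the endword is one of the listed pronouns.
df_true = df.where("endword in ('she', 'he', 'hers', 'his', 'her', 'him')")\
            .withColumn('label', lit(1))

# Negative examples: any other endword.
df_false = df.where("endword not in ('she', 'he', 'hers', 'his', 'her', 'him')")\
           .withColumn('label', lit(0))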

Combining the positive and negative data

df_examples = df_true.union(df_false)
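
The labelling and combining steps can be mimicked without Spark. The following is a minimal plain-Python sketch, not the course's code: the sample `rows` are made up, and lists of dicts stand in for DataFrames.

```python
PRONOUNS = {"she", "he", "hers", "his", "her", "him"}

# Hypothetical sample rows standing in for the DataFrame `df`.
rows = [
    {"endword": "her"},
    {"endword": "book"},
    {"endword": "him"},
    {"endword": "door"},
]

# Positive examples get label 1, negative examples get label 0.
true_rows = [dict(r, label=1) for r in rows if r["endword"] in PRONOUNS]
false_rows = [dict(r, label=0) for r in rows if r["endword"] not in PRONOUNS]

# Concatenation plays the role of union(): every row is kept, duplicates included.
examples = true_rows + false_rows
print(examples)
```

Like this list concatenation, `DataFrame.union` keeps duplicate rows and matches columns by position, so both inputs need the same column order.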

Splitting the data into training and evaluation sets

df_train, df_eval = df_examples.randomSplit([0.60, 0.40], seed=42)
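
A seeded random split can be pictured as one random draw per row. The sketch below is a plain-Python illustration of that idea, not Spark's actual sampler; the function `random_split` and the 1000-row input are hypothetical.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to the training or evaluation set using a seeded draw."""
    rng = random.Random(seed)
    train, held_out = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            held_out.append(row)
    return train, held_out


rows = list(range(1000))
train, evaluation = random_split(rows, 0.60, 42)
print(len(train), len(evaluation))
```

Because rows are assigned independently, the split sizes are only approximately 60/40, while reusing the same seed reproduces the same split.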

Training

from pyspark.ml.classification import LogisticRegression

# By default, fit() reads the 'features' and 'label' columns of df_train.
logistic = LogisticRegression(maxIter=50, regParam=0.6, elasticNetParam=0.3)
model = logistic.fit(df_train)
print("Training iterations: ", model.summary.totalIterations)
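
To give a feel for what `regParam` and `elasticNetParam` control, the sketch below computes the elastic-net penalty as described in the Spark MLlib documentation for linear methods: `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2²)`. The function name and the example weight vector are hypothetical, and this is an illustration of the penalty term only, not of Spark's optimizer.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a mix of L1 and squared-L2 norms of the weights."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    # elastic_net_param = 1 gives pure L1 (lasso); 0 gives pure L2 (ridge).
    mix = elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2_squared
    return reg_param * mix


# Same hyperparameters as the slide, applied to a made-up weight vector.
penalty = elastic_net_penalty([1.0, -2.0], reg_param=0.6, elastic_net_param=0.3)
print(penalty)
```

With `elasticNetParam=0.3`, the penalty leans toward the L2 term while still including an L1 component that can push some coefficients to zero.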

Let's practice!
