Text Classification

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

Endword Prediction

Sequence arrow

Endword

Endword bracket

Shuffle 1

Shuffle 2

Songs

Videos

Selecting the data

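from pyspark.sql.functions import lit

# Positive examples: the endword is one of the six listed pronouns
df_true = df.where("endword in ('she', 'he', 'hers', 'his', 'her', 'him')")\
            .withColumn('label', lit(1))

# Negative examples: every other endword
df_false = df.where("endword not in ('she', 'he', 'hers', 'his', 'her', 'him')")\
             .withColumn('label', lit(0))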
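
The two filters above are complementary: a row whose endword is in the list gets label 1, and any other row gets label 0. One caveat is that a row with a null endword matches neither filter in Spark SQL, so it would be dropped from both DataFrames. The following is a minimal plain-Python sketch of the same labeling rule, useful for checking the logic without a Spark session. The function name `label_rows` and the sample rows are hypothetical and not part of the course material.

```python
# Plain-Python sketch of the labeling rule (no Spark required).
# PRONOUNS mirrors the list used in the where() filters; label_rows and
# sample_rows are hypothetical names introduced for illustration only.
PRONOUNS = {'she', 'he', 'hers', 'his', 'her', 'him'}

def label_rows(rows):
    """Attach label 1 when the endword is a listed pronoun, else 0."""
    return [(doc, endword, 1 if endword in PRONOUNS else 0)
            for doc, endword in rows]

sample_rows = [
    (['i', 'spoke', 'to', 'him'], 'him'),
    (['it', 'was', 'a', 'dark', 'night'], 'night'),
    (['the', 'book', 'is', 'hers'], 'hers'),
]

labeled = label_rows(sample_rows)
for doc, endword, label in labeled:
    print(endword, label)
```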

Combining the positive and negative data

df_examples = df_true.union(df_false)

Splitting the data into training and evaluation sets

df_train, df_eval = df_examples.randomSplit((0.60, 0.40), 42)
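
Because randomSplit assigns rows at random, the 60/40 weights give approximate proportions rather than exact counts, and the seed (42 here) makes the split repeatable. The sketch below imitates that behavior in plain Python to show the idea. It is an illustration under that assumption, not Spark's actual sampling algorithm, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, train_weight=0.60, seed=42):
    """Send each row to train with probability train_weight, else to eval.

    Hypothetical helper: mimics the idea of a seeded random split,
    not Spark's internal implementation.
    """
    rng = random.Random(seed)
    train, evaluation = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            evaluation.append(row)
    return train, evaluation

rows = list(range(1000))
train, evaluation = random_split(rows)

# Sizes land near 600 and 400; the exact counts depend on the random draws.
print(len(train), len(evaluation))
```

Running the function twice with the same seed returns the same two lists, which is why fixing the seed helps when comparing models across runs.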

Training

from pyspark.ml.classification import LogisticRegression

logistic = LogisticRegression(maxIter=50, regParam=0.6, elasticNetParam=0.3)

model = logistic.fit(df_train)

print("Training iterations: ", model.summary.totalIterations)
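
The regParam and elasticNetParam arguments control regularization. In the elastic-net form documented for Spark's linear methods, regParam scales the whole penalty and elasticNetParam mixes the L1 and L2 terms (0 gives pure L2, 1 gives pure L1). The sketch below computes that penalty for a hypothetical weight vector, assuming this documented form. The function `elastic_net_penalty` and the example weights are illustrative and do not come from the trained model.

```python
def elastic_net_penalty(weights, reg_param=0.6, elastic_net_param=0.3):
    """Penalty = reg_param * (alpha * L1 + (1 - alpha) / 2 * L2 squared).

    Defaults mirror the values passed to LogisticRegression above;
    alpha is elastic_net_param.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)

weights = [0.5, -1.0, 0.0]  # hypothetical coefficients, for illustration

print(elastic_net_penalty(weights))                         # mixed penalty
print(elastic_net_penalty(weights, elastic_net_param=0.0))  # pure L2
print(elastic_net_penalty(weights, elastic_net_param=1.0))  # pure L1
```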

Let's practice!
