Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data.
MLlib is a component of Apache Spark for machine learning
Various tools provided by MLlib include:
ML Algorithms: collaborative filtering, classification, and clustering
Featurization: feature extraction, transformation, dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
Scikit-learn is a popular Python library for data mining and machine learning
Scikit-learn algorithms work only on datasets that fit in memory on a single machine
Spark's MLlib algorithms are designed for parallel processing on a cluster
MLlib is available in Scala, Java, Python, and R
Provides a high-level API to build machine learning pipelines
Classification (Binary and Multiclass) and Regression: Linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes, linear least squares, Lasso, ridge regression, isotonic regression
Collaborative filtering: Alternating least squares (ALS)
Clustering: k-means, Gaussian mixture, bisecting k-means, and streaming k-means
Collaborative filtering (recommender engines): produces recommendations from users' past preferences
Classification: identifies to which of a set of categories a new observation belongs
Clustering: groups data based on similar characteristics
from pyspark.mllib.recommendation import ALS
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.clustering import KMeans