Overview of PySpark MLlib

Big Data Fundamentals with PySpark

Upendra Devisetty

Science Analyst, CyVerse

What is PySpark MLlib?

Machine learning is a scientific discipline that explores the construction and
study of algorithms that can learn from data.
  • MLlib is a component of Apache Spark for machine learning

  • Various tools provided by MLlib include:

    • ML Algorithms: collaborative filtering, classification, and clustering

    • Featurization: feature extraction, transformation, dimensionality reduction, and selection

    • Pipelines: tools for constructing, evaluating, and tuning ML Pipelines

1 https://en.wikipedia.org/wiki/Machine_learning

Why PySpark MLlib?

  • Scikit-learn is a popular Python library for data mining and machine learning

  • Scikit-learn algorithms run on a single machine, so they are practical only for datasets that fit on that machine

  • Spark's MLlib algorithms are designed for parallel processing on a cluster

  • Besides Python, MLlib supports other languages such as Scala, Java, and R

  • Provides a high-level API to build machine learning pipelines

PySpark MLlib Algorithms

  • Classification (Binary and Multiclass) and Regression: Linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes, linear least squares, Lasso, ridge regression, isotonic regression

  • Collaborative filtering: Alternating least squares (ALS)

  • Clustering: K-means, Gaussian mixture, bisecting K-means, and streaming K-means

The three C's of machine learning in PySpark MLlib

  • Collaborative filtering (recommender engines): Produces recommendations

  • Classification: Identifies which of a set of categories a new observation belongs to

  • Clustering: Groups data based on similar characteristics

PySpark MLlib imports

  • Collaborative filtering
from pyspark.mllib.recommendation import ALS
  • Classification
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
  • Clustering
from pyspark.mllib.clustering import KMeans

Let's practice
