Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data.
MLlib is a component of Apache Spark for machine learning
Various tools provided by MLlib include:
ML Algorithms: collaborative filtering, classification, and clustering
Featurization: feature extraction, transformation, dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
Scikit-learn is a popular Python library for data mining and machine learning
Scikit-learn algorithms work only on datasets that fit in memory on a single machine
Spark's MLlib algorithms are designed for parallel processing on a cluster
MLlib is available in Scala, Java, Python, and R
Provides a high-level API to build machine learning pipelines
Classification (Binary and Multiclass) and Regression: Linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes, linear least squares, Lasso, ridge regression, isotonic regression
Collaborative filtering: Alternating least squares (ALS)
Clustering: k-means, Gaussian mixture, bisecting k-means, and streaming k-means
Collaborative filtering (recommender engines): produces recommendations from users' past preferences
Classification: identifies to which of a set of categories a new observation belongs
Clustering: groups data based on similar characteristics
from pyspark.mllib.recommendation import ALS
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.clustering import KMeans