Data Visualization in PySpark using DataFrames

Big Data Fundamentals with PySpark

Upendra Devisetty

Science Analyst, CyVerse

What is data visualization?

Data visualization is the representation of data in graphs or charts
Open-source plotting tools for visualization in Python:
- Matplotlib, Seaborn, Bokeh, etc.
There are three ways to plot data held in PySpark DataFrames:
- pyspark_dist_explore library
- toPandas()
- HandySpark library

The pyspark_dist_explore library provides quick insights into PySpark DataFrames
It currently offers three functions: hist(), distplot(), and pandas_histogram()

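import matplotlib.pyplot as plt
from pyspark_dist_explore import hist

test_df = spark.read.csv("test.csv", header=True, inferSchema=True)

test_df_age = test_df.select('Age')

# hist() draws onto a Matplotlib axis, which is passed as its first argument
fig, ax = plt.subplots()
hist(ax, test_df_age, bins=20, color=["red"])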
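To picture why a histogram can be computed without pulling every row to the driver, here is a minimal plain-Python sketch. It is not the pyspark_dist_explore implementation: the partitions, bin edges, and function names (count_bins, merge_counts) are made up for illustration. Each partition counts its own values per bin, and only the small lists of counts are merged.

```python
from bisect import bisect_right


def count_bins(values, edges):
    """Count how many values fall into each bin defined by sorted edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # ignore values outside the histogram range
        # bisect_right gives the bin index; clamp so the last edge is inclusive
        idx = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts


def merge_counts(a, b):
    """Add two per-bin count lists element by element."""
    return [x + y for x, y in zip(a, b)]


# Hypothetical "partitions" of an Age column (illustrative values only)
partitions = [[22, 38, 26], [35, 54, 2], [27, 14, 58, 80]]
edges = [0, 20, 40, 60, 80]

# Step 1: each partition is summarized independently
per_partition = [count_bins(p, edges) for p in partitions]

# Step 2: only the small count lists are combined
total = per_partition[0]
for counts in per_partition[1:]:
    total = merge_counts(total, counts)

print(total)  # [2, 5, 2, 1]
```

The plotting step then only needs the merged counts and the bin edges, which stay small no matter how many rows the DataFrame holds.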

test_df = spark.read.csv("test.csv", header=True, inferSchema=True)

test_df_sample_pandas = test_df.toPandas()

test_df_sample_pandas.hist('Age')

Note: toPandas() collects the entire DataFrame into the driver's memory, so it isn't recommended for large volumes of data
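Because toPandas() brings every row to the driver, a common way to stay within memory is to shrink the data in Spark before converting it. The helper below is a sketch: the name sample_to_pandas and its default fraction are illustrative choices, not part of PySpark. It relies only on the DataFrame methods select(), sample(), and toPandas().

```python
def sample_to_pandas(spark_df, column, fraction=0.1, seed=42):
    """Return a Pandas DataFrame holding a random sample of one column.

    Selecting the column and sampling both happen in Spark, so only the
    reduced data is collected to the driver by toPandas().
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return (
        spark_df.select(column)
        .sample(fraction=fraction, seed=seed)
        .toPandas()
    )


# Usage, assuming test_df is the Spark DataFrame read from test.csv:
# test_df_sample_pandas = sample_to_pandas(test_df, "Age", fraction=0.1)
# test_df_sample_pandas.hist("Age")
```

The fraction is approximate: Spark samples each row independently, so the returned row count varies slightly from run to run unless the seed is fixed.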

Pandas DataFrames are in-memory, single-server structures, whereas operations on PySpark DataFrames run in parallel across a cluster
Pandas produces a result as soon as an operation is applied, whereas operations on PySpark DataFrames are lazily evaluated
Pandas DataFrames are mutable, whereas PySpark DataFrames are immutable
The Pandas API supports more operations than the PySpark DataFrame API
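The eager/lazy and mutable/immutable contrast can be illustrated without Spark. The toy LazyFrame class below is a made-up stand-in, not PySpark's implementation: each transformation returns a new object and only records the work, and nothing runs until an action (here collect()) is called. A plain Python list plays the role of the eager, mutable structure.

```python
class LazyFrame:
    """Toy model of a lazy, immutable dataset (illustrative only)."""

    def __init__(self, rows, steps=()):
        self._rows = tuple(rows)    # source data is never modified
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def filter(self, predicate):
        # Return a NEW LazyFrame; self stays unchanged (immutability)
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def map(self, func):
        # Same pattern: record the step on a new object, run nothing yet
        return LazyFrame(self._rows, self._steps + (("map", func),))

    def collect(self):
        # The action: only now are the recorded steps executed, in order
        rows = list(self._rows)
        for kind, fn in self._steps:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows


ages = [22, 38, 26, 35]

# Eager and mutable: the list changes immediately, in place
eager = list(ages)
eager.append(54)

# Lazy and immutable: transformations are recorded on new objects
base = LazyFrame(ages)
older_plus_one = base.filter(lambda a: a >= 26).map(lambda a: a + 1)

print(eager)                     # [22, 38, 26, 35, 54]
print(base.collect())            # [22, 38, 26, 35]
print(older_plus_one.collect())  # [39, 27, 36]
```

Note that base still yields the original rows after the filter and map were defined, which mirrors how a PySpark transformation produces a new DataFrame instead of altering the existing one.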

HandySpark is a package designed to improve the PySpark user experience
- Easy data fetching
- Distributed computation retained

# Importing handyspark adds the toHandy() method to Spark DataFrames
from handyspark import *

test_df = spark.read.csv('test.csv', header=True, inferSchema=True)

hdf = test_df.toHandy()

hdf.cols["Age"].hist()
