Data Visualization in PySpark using DataFrames

Big Data Fundamentals with PySpark

Upendra Devisetty

Science Analyst, CyVerse

What is Data visualization?

  • Data visualization is a way of representing your data in graphs or charts

  • Open source plotting tools to aid visualization in Python:

    • Matplotlib, Seaborn, Bokeh, etc.
  • PySpark DataFrames can be plotted using three methods

    • pyspark_dist_explore library

    • toPandas()

    • HandySpark library


Data Visualization using Pyspark_dist_explore

  • The pyspark_dist_explore library provides quick insights into DataFrames

  • Three functions are currently available: hist(), distplot(), and pandas_histogram()

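from pyspark_dist_explore import hist
import matplotlib.pyplot as plt
test_df = spark.read.csv("test.csv", header=True, inferSchema=True)
test_df_age = test_df.select('Age')
fig, ax = plt.subplots()  # hist() draws on a Matplotlib axis
hist(ax, test_df_age, bins=20, color=["red"])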

Using Pandas for plotting DataFrames

  • It's easy to create charts from pandas DataFrames
test_df = spark.read.csv("test.csv", header=True, inferSchema=True)
test_df_sample_pandas = test_df.toPandas()
test_df_sample_pandas.hist('Age')
  • Note: toPandas() collects the entire DataFrame into the driver's memory, so it isn't recommended for large volumes of data

Pandas DataFrame vs PySpark DataFrame

  • Pandas DataFrames are in-memory, single-machine structures, whereas operations on PySpark DataFrames run in parallel across a cluster

  • Pandas operations are evaluated eagerly, producing a result as soon as they are applied, whereas PySpark DataFrame operations are evaluated lazily

  • Pandas DataFrames are mutable, whereas PySpark DataFrames are immutable

  • The Pandas API supports more operations than the PySpark DataFrame API


HandySpark method of visualization

  • HandySpark is a package designed to improve the PySpark user experience
    • Easy data fetching
    • Distributed computation retained
from handyspark import *  # adds toHandy() to PySpark DataFrames
test_df = spark.read.csv('test.csv', header=True, inferSchema=True)
hdf = test_df.toHandy()
hdf.cols["Age"].hist()

Let's visualize DataFrames
