Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
Data visualization is a way of representing your data in graphs or charts
Open source plotting tools to aid visualization in Python:
PySpark DataFrames can be plotted using three methods:
pyspark_dist_explore library
toPandas()
HandySpark library
The pyspark_dist_explore library provides quick insights into DataFrames
Three functions are currently available: hist(), distplot(), and pandas_histogram()
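import matplotlib.pyplot as plt
from pyspark_dist_explore import hist
test_df = spark.read.csv("test.csv", header=True, inferSchema=True)
test_df_age = test_df.select('Age')
fig, ax = plt.subplots()  # hist() draws onto a matplotlib axis
hist(ax, test_df_age, bins=20, color=["red"])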
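The reason a distributed histogram scales can be sketched in plain Python: each partition counts its own values per bin, and only those small counters are merged on the driver. This is an illustrative sketch, not the pyspark_dist_explore implementation; the function names, the sample 'Age' values, and the bin edges are all assumptions made for the example.

```python
from collections import Counter

def bin_index(value, lo, hi, bins):
    """Return the index of the equal-width bin that holds value."""
    if value == hi:          # the right edge belongs to the last bin
        return bins - 1
    width = (hi - lo) / bins
    return int((value - lo) / width)

def partition_counts(partition, lo, hi, bins):
    """Count values per bin inside one partition (the per-worker step)."""
    return Counter(bin_index(v, lo, hi, bins) for v in partition)

def merge_counts(per_partition, bins):
    """Add up the small per-partition counters (the driver-side step)."""
    total = Counter()
    for counts in per_partition:
        total.update(counts)
    return [total.get(i, 0) for i in range(bins)]

# Hypothetical 'Age' values split across three partitions
partitions = [[22, 38, 26], [35, 35, 54], [2, 27, 14, 80]]
lo, hi, bins = 0, 80, 4      # bin edges: 0, 20, 40, 60, 80

histogram = merge_counts(
    [partition_counts(p, lo, hi, bins) for p in partitions], bins
)
print(histogram)  # [2, 6, 1, 1]
```

Only four integers per partition travel to the driver here, regardless of how many rows each partition holds, which is why this style of plotting avoids moving the full dataset.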
test_df = spark.read.csv("test.csv", header=True, inferSchema=True)
test_df_sample_pandas = test_df.toPandas()
test_df_sample_pandas.hist('Age')
toPandas() isn't recommended for large DataFrames, since it loads all the data into the driver's memory
Pandas DataFrames are in-memory, single-server structures, whereas operations on PySpark DataFrames run in parallel
Pandas produces a result as soon as an operation is applied, whereas operations on PySpark DataFrames are lazily evaluated
Pandas DataFrames are mutable, whereas PySpark DataFrames are immutable
The Pandas API supports more operations than the PySpark DataFrame API
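The lazy-evaluation and immutability points can be illustrated with a toy model in plain Python. The LazyFrame class below is hypothetical and is not part of PySpark or Pandas; it only mimics the idea that transformations record a plan and return a new object, while an action such as collect() triggers the actual work.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated, immutable DataFrame (not the PySpark API)."""

    def __init__(self, rows, plan=()):
        self._rows = tuple(rows)   # source data is never modified
        self._plan = tuple(plan)   # recorded transformations, not yet executed

    def filter(self, predicate):
        # A transformation returns a NEW frame with a longer plan; nothing runs yet.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def map(self, func):
        return LazyFrame(self._rows, self._plan + (("map", func),))

    def collect(self):
        # An action finally executes the recorded plan, step by step.
        rows = iter(self._rows)
        for kind, fn in self._plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)

ages = LazyFrame([22, 38, 26, 35, 54])
over_30 = ages.filter(lambda a: a > 30)    # no work done here
plus_one = over_30.map(lambda a: a + 1)    # still no work

result = plus_one.collect()    # work happens only now
original = ages.collect()      # the original frame is unchanged

print(result)    # [39, 36, 55]
print(original)  # [22, 38, 26, 35, 54]
```

In Pandas, by contrast, each operation would run immediately and an in-place assignment could change the original object.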
from handyspark import *  # adds toHandy() to PySpark DataFrames
test_df = spark.read.csv('test.csv', header=True, inferSchema=True)
hdf = test_df.toHandy()
hdf.cols["Age"].hist()