Introduction to PySpark
Benjamin Schmidt
Data Engineer
RDDs, or Resilient Distributed Datasets, support transformations like `map()` or `filter()`, actions like `collect()` to retrieve results, and `parallelize()` to create RDDs.

```python
# Initialize a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDDExample").getOrCreate()
# Create a DataFrame from a csv
census_df = spark.read.csv("/census.csv")

# Convert DataFrame to RDD
census_rdd = census_df.rdd

# Show the RDD's contents using collect()
census_rdd.collect()

# Collect the entire DataFrame into a local Python list of Row objects
data_collected = census_df.collect()

# Print the collected data
for row in data_collected:
    print(row)
```
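
As a minimal sketch of the other methods mentioned above, `parallelize()` creates an RDD directly from a local Python list, without going through a DataFrame (the sample numbers are illustrative, and `spark` is the session created above):

```python
# Create an RDD directly from a local Python list with parallelize()
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# filter() is a lazy transformation; collect() is the action that returns results
evens = numbers_rdd.filter(lambda x: x % 2 == 0)
print(evens.collect())  # [2, 4]
```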
`map()`: applies a function (including ones we write, such as a lambda function) across a dataset, e.g. `rdd.map(map_function)`
`collect()`: collects data from across the cluster, e.g. `rdd.collect()`
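
Putting the two together, a minimal sketch (reusing the `spark` session from above; the data and the lambda are illustrative):

```python
# Build a small RDD, square each element with map(), then collect() the results
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
squared = rdd.map(lambda x: x * x)
print(squared.collect())  # [1, 4, 9, 16]
```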