Introduction to PySpark
Benjamin Schmidt
Data Engineer

RDDs or Resilient Distributed Datasets:
RDDs support transformations like map() or filter(), with actions like collect() to retrieve results, and parallelize() to create RDDs:

```python
# Initialize a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDExample").getOrCreate()

# Create a DataFrame from a CSV file
census_df = spark.read.csv("/census.csv")

# Convert the DataFrame to an RDD
census_rdd = census_df.rdd

# Show the RDD's contents using collect()
census_rdd.collect()

# Collect the entire DataFrame into a local Python list of Row objects
data_collected = census_df.collect()

# Print the collected data
for row in data_collected:
    print(row)
```
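Since the slide mentions parallelize() and filter(), here is a minimal sketch of both (the numbers list and variable names are assumptions for illustration; it reuses the spark session created above):

```python
# Hypothetical example: create an RDD from a local Python list with parallelize()
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# filter() is a transformation: keep only the even numbers
evens_rdd = numbers_rdd.filter(lambda x: x % 2 == 0)

# collect() is an action: it brings the results back to the driver
print(evens_rdd.collect())  # [2, 4]
```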
map(): applies a function (including ones we write, like a lambda function) across a dataset:

```python
rdd.map(map_function)
```

collect(): collects data from across the cluster:

```python
rdd.collect()
```
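To make the map() pattern concrete, here is a small self-contained sketch with a lambda (the app name and numbers are assumptions for illustration):

```python
from pyspark.sql import SparkSession

# Hypothetical example: square each element of an RDD with map()
spark = SparkSession.builder.appName("MapCollectExample").getOrCreate()
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
squares_rdd = numbers_rdd.map(lambda x: x * x)

# collect() returns the transformed elements to the driver
print(squares_rdd.collect())  # [1, 4, 9, 16]
```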