Introduction to PySpark
Benjamin Schmidt
Data Engineer
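# Read files of common formats into DataFrames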
spark.read.csv("path/to/file.csv")
spark.read.json("path/to/file.json")
spark.read.parquet("path/to/file.parquet")
Spark can infer schemas from data with inferSchema=True
Manually define a schema for greater control; useful for fixed data structures
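A minimal sketch of schema inference, assuming a running SparkSession named spark and a hypothetical file path:
# header=True treats the first row as column names;
# inferSchema=True makes Spark scan the data to guess column types
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.printSchema()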
IntegerType: whole numbers, e.g. 1, 3478, -1890456
LongType: larger whole numbers, e.g. 922334775806
DoubleType: decimal numbers, e.g. 3.14159
StringType: text, e.g. "This is an example of a string."
# Import the necessary types as classes
from pyspark.sql.types import (StructType,
                               StructField, IntegerType,
                               StringType, ArrayType)

# Construct the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("scores", ArrayType(IntegerType()), True)
])
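For illustration, hypothetical rows that match this schema:
# Hypothetical sample data: (id, name, scores)
data = [(1, "Alice", [85, 90]), (2, "Bob", [72, 88])]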
# Create the DataFrame, applying the schema
df = spark.createDataFrame(data, schema=schema)
.select() to choose specific columns
.filter() or .where() to filter rows based on conditions
.sort() to order rows by one or more columns

# Select and show only the name and age columns
df.select("name", "age").show()
# Filter on age > 30
df.filter(df["age"] > 30).show()
# Use where() to match a specific value
df.where(df["age"] == 30).show()
# Use sort() to order by age, descending
df.sort("age", ascending=False).show()
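These methods each return a DataFrame, so they can be chained; a minimal sketch:
# Chain filter, select, and sort in one expression
df.filter(df["age"] > 30).select("name", "age").sort("age", ascending=False).show()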
.sort() or .orderBy() to order rows
.na.drop() to remove rows with null values

# Sort using the age column
df.sort("age", ascending=False).show()
# Drop missing values
df.na.drop().show()
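.na.drop() also accepts a subset argument; a sketch assuming the age column from above:
# Drop rows only when the age column is null
df.na.drop(subset=["age"]).show()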
spark.read.json(): Load data from JSON
spark.read.schema(): Define schemas explicitly
.na.drop(): Drop rows with missing values
.select(), .filter(), .sort(), .orderBy(): Basic data manipulation functions