Introduction to PySpark
Ben Schmidt
Data Engineer
Use .na.drop() to remove rows with null values:
# Drop rows with any nulls
df_cleaned = df.na.drop()
# Filter out nulls on a single column
df_cleaned = df.where(col("columnName").isNotNull())
Use .na.fill({"column": value}) to replace nulls with a specific value:
# Fill nulls in the age column with the value 0
df_filled = df.na.fill({"age": 0})
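A minimal, self-contained sketch of the three null-handling calls above, assuming an illustrative DataFrame with name, age, salary, and department columns (the data and SparkSession setup are assumptions, not part of the course material):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("null_handling_demo").getOrCreate()

# Illustrative data: 'age' and 'salary' contain nulls
df = spark.createDataFrame(
    [("Alice", 34, 80000, "HR"),
     ("Bob", None, 70000, "IT"),
     ("Cara", 29, None, "IT")],
    ["name", "age", "salary", "department"],
)

df.na.drop().show()                      # keeps only rows with no nulls at all
df.where(col("age").isNotNull()).show()  # drops only rows where 'age' is null
df.na.fill({"age": 0}).show()            # replaces null ages with 0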
Use .withColumn() to add a new column based on calculations or existing columns.
Syntax: .withColumn("new_col_name", column_expression)
# Create a new column 'age_plus_5'
df = df.withColumn("age_plus_5", df["age"] + 5)
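Continuing with the df from the sketch above, .withColumn() also accepts expressions built from pyspark.sql.functions; the is_senior column and the threshold of 40 are illustrative assumptions:
from pyspark.sql.functions import col, when

# Same derived column as above, written with col()
df = df.withColumn("age_plus_5", col("age") + 5)
# Conditional column: True when age is 40 or more, otherwise False
df = df.withColumn("is_senior", when(col("age") >= 40, True).otherwise(False))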
Use .withColumnRenamed() to rename columns.
Syntax: .withColumnRenamed("old_column_name", "new_column_name")
# Rename the 'age' column to 'years'
df = df.withColumnRenamed("age", "years")
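Renames can be chained when several columns need new names; a small sketch in which the new names are assumptions:
# Rename two columns in one chained expression
df_renamed = (df.withColumnRenamed("age", "years")
                .withColumnRenamed("salary", "annual_salary"))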
Use .drop() to remove unnecessary columns.
Syntax: .drop("column_name")
# Drop the 'department' column
df = df.drop("department")
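.drop() also accepts several column names at once; a small sketch with an illustrative choice of columns:
# Drop two columns in one call
df_slim = df.drop("department", "salary")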
Use .filter() to select rows based on specific conditions:
# Filter rows where salary is greater than 50000
filtered_df = df.filter(df["salary"] > 50000)
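Conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses; a sketch with assumed thresholds:
from pyspark.sql.functions import col

# Rows where salary is above 50000 and age is below 40
filtered_df = df.filter((col("salary") > 50000) & (col("age") < 40))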
Use .groupBy() and aggregate functions (e.g., .sum(), .avg()) to summarize data:
# Group by department and calculate the average salary
grouped_df = df.groupBy("department").avg("salary")
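For several aggregates in one pass, .groupBy() can be combined with .agg(); a sketch in which the alias names are assumptions:
from pyspark.sql.functions import avg, count, max as max_

# Average salary, head count, and top salary per department
summary_df = df.groupBy("department").agg(
    avg("salary").alias("avg_salary"),
    count("*").alias("employee_count"),
    max_("salary").alias("max_salary"),
)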
Filtering
+------+---+---------------+
|salary|age|     occupation|
+------+---+---------------+
| 60000| 45|Exec-managerial|
| 70000| 35| Prof-specialty|
+------+---+---------------+
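A self-contained sketch of how output like the table above can be produced with .show() (the example rows are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
people = spark.createDataFrame(
    [(60000, 45, "Exec-managerial"),
     (70000, 35, "Prof-specialty"),
     (30000, 22, "Other-service")],
    ["salary", "age", "occupation"],
)
# Prints only the rows with salary above 50000
people.filter(people["salary"] > 50000).show()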
GroupBy
+----------+-----------+
|department|avg(salary)|
+----------+-----------+
|        HR|    80000.0|
|        IT|    70000.0|
+----------+-----------+
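Similarly, a self-contained sketch of how grouped output like the table above can be produced (values are illustrative; row order may differ):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
employees = spark.createDataFrame(
    [("HR", 80000), ("IT", 60000), ("IT", 80000)],
    ["department", "salary"],
)
# Prints department and avg(salary) columns as in the table above
employees.groupBy("department").avg("salary").show()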