Data manipulation with DataFrames

Introduction to PySpark

Ben Schmidt

Data Engineer

Handling missing data

  • Use .na.drop() to remove rows with null values
# Drop rows with any nulls
df_cleaned = df.na.drop()

# Filter out nulls in a specific column
from pyspark.sql.functions import col
df_cleaned = df.where(col("columnName").isNotNull())
  • Use .na.fill({"column": value}) to replace nulls with a specific value
# Fill nulls in the age column with the value 0
df_filled = df.na.fill({"age": 0})
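
Putting the pieces together, here is a minimal runnable sketch; the SparkSession setup, the sample rows, and the 'name' column are illustrative assumptions, while the 'age' column matches the slide:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("nulls_demo").getOrCreate()

# Illustrative sample data containing nulls
df = spark.createDataFrame(
    [("Alice", 30), ("Bob", None), (None, 25)],
    ["name", "age"],
)

# Drop rows containing any null
df_cleaned = df.na.drop()

# Keep only rows where 'name' is not null
df_named = df.where(col("name").isNotNull())

# Replace null ages with 0
df_filled = df.na.fill({"age": 0})
df_filled.show()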

Column operations

  • Use .withColumn() to add a new column based on calculations or existing columns
# Create a new column 'age_plus_5'
df = df.withColumn("age_plus_5", df["age"] + 5)
  • Use .withColumnRenamed() to rename columns
# Rename the 'age' column to 'years'
df = df.withColumnRenamed("age", "years")
  • Use .drop() to remove unnecessary columns
# Drop the 'department' column
df = df.drop("department")
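
The three operations chain naturally; below is a minimal sketch, assuming an illustrative DataFrame with name, age, and department columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("column_ops_demo").getOrCreate()

# Illustrative sample data
df = spark.createDataFrame(
    [("Alice", 30, "Sales"), ("Bob", 45, "HR")],
    ["name", "age", "department"],
)

# Add a derived column, rename 'age', then drop 'department'
df = (
    df.withColumn("age_plus_5", df["age"] + 5)
      .withColumnRenamed("age", "years")
      .drop("department")
)
df.show()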

Row operations

  • Use .filter() to select rows based on specific conditions
# Filter rows where salary is greater than 50000
filtered_df = df.filter(df["salary"] > 50000)
  • Use .groupBy() and aggregate functions (e.g., .sum(), .avg()) to summarize data
# Group by department and calculate the average salary
grouped_df = df.groupBy("department").avg("salary")
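
A minimal runnable sketch combining both row operations; the sample rows and values are illustrative, so the output will not match the slide numbers exactly:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("row_ops_demo").getOrCreate()

# Illustrative sample data
df = spark.createDataFrame(
    [
        ("HR", 80000, 50, "Exec-managerial"),
        ("IT", 60000, 45, "Exec-managerial"),
        ("IT", 70000, 35, "Prof-specialty"),
    ],
    ["department", "salary", "age", "occupation"],
)

# Rows with salary greater than 50000
filtered_df = df.filter(df["salary"] > 50000)

# Average salary per department
grouped_df = df.groupBy("department").avg("salary")
grouped_df.show()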

Row operations outcomes

  • Filtering

    +------+---+-----------------+
    |salary|age|      occupation |
    +------+---+-----------------+
    | 60000| 45|Exec-managerial  |
    | 70000| 35|Prof-specialty   |
    +------+---+-----------------+
    
  • GroupBy

    +----------+-----------+
    |department|avg(salary)|
    +----------+-----------+
    |        HR|    80000.0|
    |        IT|    70000.0|
    +----------+-----------+


Cheat sheet

# Drop rows with any nulls
df_cleaned = df.na.drop()

# Drop nulls on a column
df_cleaned = df.where(col("columnName").isNotNull())

# Fill nulls in the age column with the value 0
df_filled = df.na.fill({"age": 0})
  • Use .withColumn() to add a new column based on calculations or existing columns. Syntax: .withColumn("new_col_name", transformation)

    # Create a new column 'age_plus_5'
    df = df.withColumn("age_plus_5", df["age"] + 5)
    
  • Use .withColumnRenamed() to rename columns. Syntax: .withColumnRenamed("old_col_name", "new_col_name")

# Rename the 'age' column to 'years'
df = df.withColumnRenamed("age", "years")
  • Use .drop() to remove unnecessary columns. Syntax: .drop("column_name")
# Drop the 'department' column
df = df.drop("department")
  • Use .filter() to select rows based on specific conditions
# Filter rows where salary is greater than 50000
filtered_df = df.filter(df["salary"] > 50000)
  • Use .groupBy() and aggregate functions (e.g., .sum(), .avg()) to summarize data
# Group by department and calculate the average salary
grouped_df = df.groupBy("department").avg("salary")
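
The same aggregation can also be written with .agg() and functions from pyspark.sql.functions; this variant is not covered on the slides but is standard PySpark, sketched here assuming a DataFrame df with department and salary columns:

from pyspark.sql import functions as F

# Equivalent to df.groupBy("department").avg("salary"), with explicit column names
summary_df = df.groupBy("department").agg(
    F.avg("salary").alias("avg_salary"),
    F.sum("salary").alias("total_salary"),
)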

Let's practice!
