Wrapping up

Data Transformation with Spark SQL in Databricks

Disha Mukherjee

Lead Data Engineer

Chapter 1

Databricks workspace overview

  • Loading and inspecting DataFrames
  • Shaping data with PySpark and Spark SQL (see the sketch below)

Chapter 2

from pyspark.sql import functions as F

# Basic data-quality checks: row counts, duplicates, and missing customer IDs
total_rows = df_enriched.count()
distinct_rows = df_enriched.distinct().count()
null_rate = (
    df_enriched.filter(F.col("Customer_ID").isNull()).count() / total_rows * 100
)

print(f"Total rows: {total_rows:,}")
print(f"Duplicates: {total_rows - distinct_rows:,}")
print(f"Null rate: {null_rate:.2f}%")
Total rows:   33,223
Duplicates:   0
Null rate:    0.00%
  • Cleaning nulls, duplicates, and invalid records
  • Aggregating and joining with broadcast optimization (see the sketch below)

Chapter 3

Lakeflow pipeline DAG

  • Window functions and streaming queries (window usage sketched below)
  • Production pipelines with Databricks Jobs and Lakeflow

Practice in a live Databricks workspace

Interactive table with sorting and searching

  • Transactional and online retail datasets
  • Interactive practice in a live Databricks workspace

Congratulations!
