Delta Lake table optimization

Transform and Analyze Data with Microsoft Fabric

Luis Silva

Solution Architect - Data & AI

What is Delta Lake?

  • Open-source storage layer for lakehouses
  • ACID transactions, scalable metadata handling, and versioning
  • Fabric uses the Delta Lake table format (built on Parquet) as its standard
  • Interoperability across Fabric experiences

Fabric architecture diagram showing data ingested and stored as Delta Lake tables and then queried using Power BI
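
A minimal PySpark sketch of the versioning (time travel) capability, assuming the notebook's built-in spark session and a placeholder table path:

# Read the current version of a Delta table
df = spark.read.format("delta").load("Tables/<table>")

# Time travel: read the table as of an earlier version recorded in the Delta log
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("Tables/<table>")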


Table maintenance

  • Keep Delta tables in good shape
  • Table maintenance operations:
    • Optimize
    • V-Order
    • Vacuum

Optimize

  • Consolidate multiple small Parquet files into fewer, larger files
    • Ideal file size is between 128 MB and 1 GB
    • Improves compression and data distribution, leading to more efficient reads
    • Recommended after loading large tables

Diagram representing a lakehouse table composed of a dozen small parquet files being optimized into two large parquet files. The original files remain inactive


Running the Optimize command from Lakehouse explorer

Screenshot of the Maintenance command in the Lakehouse explorer, highlighting the option Run OPTIMIZE command


Running the Optimize command in Spark SQL

%%sql
OPTIMIZE <lakehouse>.<table>;

Running the Optimize command in PySpark

from delta.tables import DeltaTable

# Reference the Delta table by its lakehouse path, then compact small files
dt = DeltaTable.forPath(spark, "Tables/<table>")
dt.optimize().executeCompaction()

V-Order

  • Write-time optimization applied to Parquet files, enabling faster reads across Fabric engines
  • Enabled by default

Applying V-Order from Lakehouse explorer

Screenshot of the Maintenance command in the Lakehouse explorer, highlighting the option Apply V-ORDER


Controlling V-Order writing in Apache Spark session

  • Enable V-Order for the session

    %%sql
    SET spark.sql.parquet.vorder.enabled=TRUE

  • Disable V-Order for the session

    %%sql
    SET spark.sql.parquet.vorder.enabled=FALSE
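
The same session setting can be toggled from PySpark; a minimal sketch assuming the notebook's built-in spark session:

# Enable V-Order writing for the current Spark session
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Disable V-Order writing for the current Spark session
spark.conf.set("spark.sql.parquet.vorder.enabled", "false")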
    

Applying V-Order when optimizing a table

%%sql 
OPTIMIZE <table|fileOrFolderPath> VORDER;
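
The same command can be issued from PySpark through spark.sql; a minimal sketch where "sales" is a hypothetical table name:

# Compact the table and rewrite the resulting files with V-Order applied
spark.sql("OPTIMIZE sales VORDER")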

Vacuum

  • Remove data files that are no longer referenced by the table and are older than the retention threshold (seven days by default)
  • Reduce cloud storage costs

Diagram representing a lakehouse table composed of two large parquet files marked as active and a dozen small parquet files that are no longer active but still in storage. After running vacuum, the older files are removed from storage
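
Vacuum can also be run in code; a minimal PySpark sketch with a placeholder table path:

from delta.tables import DeltaTable

# Remove unreferenced files older than 168 hours (7 days, the default retention threshold)
dt = DeltaTable.forPath(spark, "Tables/<table>")
dt.vacuum(168)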


Running Vacuum from Lakehouse explorer

Screenshot of the Maintenance command in the Lakehouse explorer, highlighting the option Run VACUUM


Let's practice!
