Delta Lake table optimization

Transform and Analyze Data with Microsoft Fabric

Luis Silva

Solution Architect - Data & AI

What is Delta Lake?

  • Open-source storage layer for lakehouses
  • ACID transactions, scalable metadata handling, and versioning
  • Fabric uses the Delta Lake table format (built on Parquet) as its standard
  • Interoperability across Fabric experiences

Fabric architecture diagram showing data ingested and stored as Delta Lake tables and then queried using Power BI
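
A minimal PySpark sketch of the versioning (time travel) capability, assuming the notebook's built-in spark session and a placeholder table path:

# Read the current version of a Delta table
df = spark.read.format("delta").load("Tables/<table>")

# Time travel: read the table as of an earlier version recorded in the Delta log
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("Tables/<table>")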


Table maintenance

  • Keep Delta tables in good shape
  • Table maintenance operations:
    • Optimize
    • V-Order
    • Vacuum

Optimize

  • Consolidate multiple small Parquet files into fewer, larger files
    • Ideal file size is between 128 MB and 1 GB
    • Improves compression and data distribution, leading to more efficient reads
    • Recommended after loading large tables

Diagram representing a lakehouse table composed of a dozen small parquet files being optimized into two large parquet files. The original files remain inactive


Running the Optimize command from Lakehouse explorer

Screenshot of the Maintenance command in the Lakehouse explorer, highlighting the option Run OPTIMIZE command


Running the Optimize command in Spark SQL

%%sql
OPTIMIZE <lakehouse>.<table>;

Running the Optimize command in PySpark

from delta.tables import DeltaTable

# Reference the Delta table by its lakehouse path, then compact small files
dt = DeltaTable.forPath(spark, "Tables/<table>")
dt.optimize().executeCompaction()

V-Order

  • Write-time optimization applied to Parquet files, enabling faster reads across Fabric engines
  • Enabled by default

Applying V-Order from Lakehouse explorer

Screenshot of the Maintenance command in the Lakehouse explorer, highlighting the option Apply V-ORDER


Controlling V-Order writing in Apache Spark session

  • Enable V-Order for the session

    %%sql
    SET spark.sql.parquet.vorder.enabled=TRUE

  • Disable V-Order for the session

    %%sql
    SET spark.sql.parquet.vorder.enabled=FALSE
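
The same session setting can be toggled from PySpark; a minimal sketch assuming the notebook's built-in spark session:

# Enable V-Order writing for the current Spark session
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Disable V-Order writing for the current Spark session
spark.conf.set("spark.sql.parquet.vorder.enabled", "false")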
    

Applying V-Order when optimizing a table

%%sql 
OPTIMIZE <table|fileOrFolderPath> VORDER;
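
The same command can be issued from PySpark through spark.sql; a minimal sketch where "sales" is a hypothetical table name:

# Compact the table and rewrite the resulting files with V-Order applied
spark.sql("OPTIMIZE sales VORDER")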

Vacuum

  • Remove data files that are no longer referenced by the table and are older than the retention threshold (seven days by default)
  • Reduce cloud storage costs

Diagram representing a lakehouse table composed of two large parquet files marked as active and a dozen small parquet files that are no longer active but still in storage. After running vacuum, the older files are removed from storage
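
Vacuum can also be run in code; a minimal PySpark sketch with a placeholder table path:

from delta.tables import DeltaTable

# Remove unreferenced files older than 168 hours (7 days, the default retention threshold)
dt = DeltaTable.forPath(spark, "Tables/<table>")
dt.vacuum(168)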


Running Vacuum from Lakehouse explorer

Screenshot of the Maintenance command in the Lakehouse explorer, highlighting the option Run VACUUM


Let's practice!
