Optimizing Delta Tables with the VACUUM Command

VivekR
3 min read · Apr 13, 2023


Delta Lake is an open-source storage layer that brings ACID transactional capabilities to Apache Spark and big data workloads. A Delta Table is stored as Parquet data files plus a transaction log organized in a directory tree, and it provides features like versioning, transactional consistency, schema enforcement, and metadata management. In the previous article, we discussed Delta Lake Versioning: Time Travel Feature. In this article, we discuss the VACUUM command.

As data is continuously ingested, updated, and deleted in a Delta Table, data files that are no longer referenced by the current table version accumulate alongside the active ones. These stale files waste storage space and increase cost. To address this issue, Delta Lake provides the VACUUM command to reclaim storage space and keep Delta Tables tidy.

What is the VACUUM command?

The VACUUM command is a Delta Lake command that removes data files that are no longer needed by the table. When data is deleted or updated in a Delta Table, Delta Lake doesn’t physically delete the old data files; instead, it marks them as removed in the transaction log and writes new data files containing the updated data. These removed files still consume storage space until they are cleaned up with the VACUUM command, and they are also what makes time travel to earlier table versions possible.
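
As a quick illustration, here is a minimal sketch of that behaviour. The table name customer_data and the id column are hypothetical placeholders; the point is that a DELETE creates a new table version without erasing the old files from storage.
%sql
-- Delete some rows; Delta writes new data files and marks the old ones
-- as removed in the transaction log, it does not erase them from storage
DELETE FROM customer_data WHERE id < 1000;

-- The table history shows the DELETE as a new version; the files backing
-- older versions remain on storage until VACUUM removes them
DESCRIBE HISTORY customer_data;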

The VACUUM command works by listing the files in the table directory, identifying the files that are no longer referenced by the transaction log and are older than a retention threshold (7 days by default), and deleting them. Note that VACUUM does not compact small files into larger ones; file compaction is handled by the separate OPTIMIZE command. Also keep in mind that once files are vacuumed, time travel to the table versions that depended on them is no longer possible.
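
Before deleting anything, you can preview which files VACUUM would remove. The sketch below uses the DRY RUN option on the hypothetical customer_data table, which only lists candidate files:
%sql
-- List (up to 1000 of) the files that would be removed, without deleting anything
VACUUM customer_data DRY RUN;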

When to use the VACUUM command?

The VACUUM command should be used when a Delta Table has accumulated a large number of unreferenced files, which is typical after heavy DELETE, UPDATE, MERGE, or overwrite activity. Running it frees up storage space and reduces storage cost.

Delta Lake does not run VACUUM automatically; it has to be run manually or on a schedule, for example from a Databricks job. What Delta Lake does track is a retention threshold: only files that have been unreferenced for longer than the retention period are eligible for removal, and that period defaults to 7 days. The threshold can be changed with the delta.deletedFileRetentionDuration table property, but setting it lower than 7 days is risky because it limits time travel and can delete files that long-running readers or writers still need.

Here are some examples of using the VACUUM command in Databricks.

  1. Manually running the VACUUM command:
%sql
VACUUM delta.`/mnt/data/customer_data`

In this example, the VACUUM command is run manually on the Delta Table located at /mnt/data/customer_data. It removes the files that are no longer referenced by the table and are older than the default 7-day retention threshold. A variation with an explicit retention window is sketched just below.
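
If you need a shorter window than the default, VACUUM also accepts an explicit RETAIN clause. The sketch below is only an illustration: retaining less than 7 days requires disabling Delta’s safety check, which you should do only if you are certain no readers, writers, or time travel queries depend on the older files.
%sql
-- Disable the safety check that blocks retention periods shorter than the
-- table's configured retention (risky; only if no one needs the older files)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Remove unreferenced files older than 24 hours
VACUUM delta.`/mnt/data/customer_data` RETAIN 24 HOURS;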

  2. Configuring the retention period:
%sql
ALTER TABLE customer_data SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 1 day')

In this example, the retention period for the Delta Table customer_data is set to 1 day, so a file becomes eligible for removal once it has been unreferenced for more than a day. VACUUM itself still has to be executed manually or from a scheduled job; the table property only controls which files it is allowed to delete. Keep in mind that a 1-day retention also means you can only time travel back about one day.
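
You can confirm that the property took effect by inspecting the table properties (again using the hypothetical customer_data table):
%sql
-- delta.deletedFileRetentionDuration should now show 'interval 1 day'
SHOW TBLPROPERTIES customer_data;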

The VACUUM command is an important part of maintaining Delta Lake tables and allows for efficient storage space management. By removing data files that are no longer referenced, it keeps storage usage and cost under control. It is important to understand the retention period, how it interacts with time travel, and how to schedule VACUUM so that Delta Tables remain well-maintained and efficient.

If you found the article to be helpful, you can buy me a coffee here:
Buy Me A Coffee.
