Optimizing Performance with Delta Tables: A Guide to Merge and Copy Into Commands

VivekR
4 min readApr 15, 2023

--

Merge and Copy Into Commands Source

Delta Lake is a powerful storage layer that enables scalable, reliable, and performant data pipelines on top of Apache Spark. In the previous article, we discussed INSERT OVERWRITE vs INSERT INTO for Efficient Table Insertion
One of the key features of Delta Lake is the ability to perform atomic, scalable, and high-performance updates and merges of data through its Merge and Copy Into commands. In this article, we will explore these two commands and their performance implications on Delta Tables.

Merge Command

The Merge command allows you to merge data from one table or data frame into another table based on a join condition. The Merge command supports three types of operations: insert, update, and delete.

Here is an example of a Merge command that updates a Delta Table called sales with new data from another Delta Table called new_sales:

MERGE INTO sales
USING new_sales
ON sales.sale_id = new_sales.sale_id
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *

In this example, the MERGE INTO clause specifies the target Delta Table (sales) and the USING clause specifies the source Delta Table (new_sales). The ON clause specifies the join condition between the two tables. The WHEN MATCHED clause specifies the update operation to be performed when a match is found between the two tables, and the WHEN NOT MATCHED clause specifies the insert operation to be performed when a match is not found.

The UPDATE SET * clause updates all columns in the sales table with the corresponding values from the new_sales table. The INSERT * clause inserts all columns from the new_sales table into the sales table.

The Merge command can also perform delete operations by using the WHEN MATCHED DELETE clause.

Copy Into Command

The Copy Into command allows you to load data from external sources into a Delta Table. The Copy Into command supports loading data from CSV, JSON, Parquet, ORC, and Avro files.

Here is an example of a Copy Into command that loads data from a CSV file into a Delta Table called sales:

COPY INTO sales
FROM 'dbfs:/mnt/data/sales.csv'
FORMAT CSV

In this example, the COPY INTO clause specifies the target Delta Table (sales) and the FROM clause specifies the location of the CSV file to be loaded. The FORMAT clause specifies the format of the file to be loaded.

The Copy Into command can also perform schema merging by using the MERGE SCHEMA clause.

Performance Implications

The Merge and Copy Into commands provide powerful capabilities for updating and loading data in Delta Tables. However, these commands can have performance implications if not used properly.

The Merge command can be slower than a traditional insert or update operation because it involves a join operation between the source and target tables. The performance of the Merge command can be improved by using the Delta Lake Auto Optimize feature, which optimizes the layout and organization of Delta Tables to minimize the amount of data scanned during queries.

The Copy Into command can be slow if the data source is large or if the data needs to be converted to the Delta format. The performance of the Copy Into command can be improved by using the Delta Lake Optimize command, which optimizes the layout and organization of Delta Tables to minimize the amount of data scanned during queries. The Optimize command can also perform data compaction to reduce the storage size of Delta Tables.

Here is an example of how to use the Delta Lake Auto Optimize and Optimize commands to improve the performance of Merge and Copy Into commands:

-- Enable Delta Lake Auto Optimize
ALTER TABLE sales SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');

-- Merge data into the sales table
MERGE INTO sales
USING new_sales
ON sales.sale_id = new_sales.sale_id
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *

-- Optimize the layout and organization of the sales table
OPTIMIZE sales;

-- Load data from a CSV file into the sales table
COPY INTO sales
FROM 'dbfs:/mnt/data/sales.csv'
FORMAT CSV

-- Optimize the layout and organization of the sales table
OPTIMIZE sales;

In this example, the Delta Lake Auto Optimize feature is enabled for the sales table, which optimizes the layout and organization of the table during write operations. The Merge command is then used to merge data from the new_sales table into the sales table, followed by an Optimize command to optimize the layout and organization of the table.

The Copy Into command is then used to load data from a CSV file into the sales table, followed by another Optimize command to optimize the layout and organization of the table.

In conclusion, the Merge and Copy Into commands provide powerful capabilities for updating and loading data in Delta Tables. However, it’s important to be mindful of the performance and data consistency considerations when using these commands. By using the Delta Lake Auto Optimize and Optimize commands, you can optimize the layout and organization of Delta Tables to improve the performance of Merge and Copy Into commands.

If you found the article to be helpful, you can buy me a coffee here:
Buy Me A Coffee.

--

--

VivekR

Data Engineer, Big Data Enthusiast and Automation using Python