Delta Tables: Deep Copy vs Shallow Copy — Differences, Use Cases and Performance Implications
Delta tables provide ACID transactions, data versioning, and schema enforcement capabilities, making it easier to build high-quality data lakes.
In the previous article, we discussed the MERGE and COPY INTO commands. In this article, we look at the two copying methods that Delta tables support: deep copy and shallow copy, which Databricks calls deep clone and shallow clone.
What is a Deep Copy?
A deep copy of a Delta table creates a new, completely independent copy of the table. In Databricks, a deep copy is created with the CLONE command and the DEEP keyword, not with COPY INTO, which loads files from a storage location into a table. If neither DEEP nor SHALLOW is given, CLONE performs a deep copy. The copy reflects the whole source table as of its latest version or a specified earlier version; CLONE does not accept a filter on rows. Here’s an example of how to create a deep copy of a Delta table using the CLONE command:
CREATE TABLE <destination_table> DEEP CLONE <source_table>
When you create a deep copy of a Delta table, the data and metadata are copied from the source table to the destination table. This includes the schema, partitioning, and data files. The transaction history is not carried over: the copy starts its own history, so time travel on the copy does not reach back into the source table’s earlier versions. Because the copy has its own data files, deep copy is a safe method for creating a completely independent table, and it is useful when you want to perform data transformations on the new copy without affecting the original table.
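The independence of a deep copy can be sketched as follows. This is a minimal illustration, assuming a hypothetical source table `sales` with `region` and `amount` columns; none of these names come from a real schema.

```sql
-- Hypothetical names: `sales`, `sales_copy`, `region`, `amount`.
-- Create an independent copy with its own data files.
CREATE TABLE IF NOT EXISTS sales_copy DEEP CLONE sales;

-- Transform the copy only; the source table's files are not touched.
UPDATE sales_copy SET amount = amount * 1.1 WHERE region = 'EU';

-- The copy keeps its own history, beginning with the clone operation.
DESCRIBE HISTORY sales_copy;

-- The source's history should show no entry for the UPDATE above.
DESCRIBE HISTORY sales;
```

Comparing the two histories is a quick way to confirm that work on the copy never reached the original table.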
What is a Shallow Copy?
A shallow copy of a Delta table creates a new table that shares data files with the original table. In Databricks, a shallow copy is created with the CLONE command and the SHALLOW keyword. The copy reflects the whole source table as of its latest version or a specified earlier version; CLONE does not accept a filter on rows. Here’s an example of how to create a shallow copy of a Delta table using the CLONE command:
CREATE TABLE <destination_table> SHALLOW CLONE <source_table>
When you create a shallow copy of a Delta table, the data files are shared between the source and destination tables. The destination table only contains metadata that points to the same data files as the source table. Shallow copy is a faster method for creating a copy of a Delta table because it doesn’t copy the data files. Shallow copy is useful when you want to create a new table that shares data with an existing table and you don’t want to incur the overhead of copying data.
Performance Implications
Deep copy and shallow copy have different performance implications. Deep copy takes longer to execute and uses additional storage because it copies all the data files from the source table to the destination table. Shallow copy is faster because it only writes metadata and doesn’t copy data files. However, shallow copy has a limitation that follows from the shared files. Writes to the shallow copy create new files that belong to the copy and leave the source table untouched, but the copy still depends on the source table’s files: if VACUUM on the source removes data files that the shallow copy still references, queries against the shallow copy can fail.
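Because a shallow copy depends on the source table’s data files, one way to lower the chance of a VACUUM on the source breaking it is to keep removed files around longer. The sketch below assumes a hypothetical source table `sales`. `delta.deletedFileRetentionDuration` is a Delta table property, the 30-day interval is only an illustration, and longer retention means more storage.

```sql
-- Hypothetical table name `sales`; the interval is illustrative.
-- Keep removed data files longer so that VACUUM on the source is less
-- likely to delete files that a shallow copy still references.
ALTER TABLE sales
  SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days');

-- Preview which files VACUUM would delete, without deleting anything.
VACUUM sales DRY RUN;
```

This only postpones the problem. A shallow copy that must outlive the retention window is usually better replaced by a deep copy.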
When to Use Deep Copy and Shallow Copy
Deep copy and shallow copy have different use cases. Deep copy is the right choice when you need a completely independent copy of a Delta table, for example to perform data transformations on the new copy without affecting the original table, or to create a table backup.
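For the backup case, here is a sketch assuming hypothetical tables `sales` and `sales_backup`. On Databricks, re-running a deep clone into an existing target is incremental, so later runs copy only the files that changed since the previous run.

```sql
-- Hypothetical names: `sales` (source) and `sales_backup` (backup target).
-- First run: copies all data files.
-- Later runs: bring the backup to the current state of the source,
-- copying only new or changed files.
CREATE OR REPLACE TABLE sales_backup DEEP CLONE sales;

-- Optionally back up a specific earlier version of the source instead.
-- The version number 5 is illustrative.
CREATE OR REPLACE TABLE sales_backup_v5 DEEP CLONE sales VERSION AS OF 5;
```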
Shallow copy is the right choice when you want a new table quickly and can accept that it shares data files with an existing table, for example to run short-lived experiments on the new table without affecting the original table and without incurring the overhead of copying data.
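A short-lived experiment on a shallow copy might look like the sketch below. The table names `sales` and `sales_experiment` and the `amount` column are hypothetical.

```sql
-- Hypothetical names: `sales`, `sales_experiment`, `amount`.
-- Create the copy; only metadata is written, so this is fast.
CREATE TABLE sales_experiment SHALLOW CLONE sales;

-- Changes land in new files owned by the copy; `sales` is unchanged.
DELETE FROM sales_experiment WHERE amount < 0;

-- Compare row counts between the source and the experiment.
SELECT
  (SELECT COUNT(*) FROM sales) AS source_rows,
  (SELECT COUNT(*) FROM sales_experiment) AS experiment_rows;

-- Clean up when the experiment is done.
DROP TABLE sales_experiment;
```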
In conclusion, Delta Lake provides two copying methods: deep copy and shallow copy. Understanding the differences between them and their appropriate use cases helps you choose between full independence, at the cost of copy time and storage, and a fast, lightweight copy that depends on the source table’s files.