I decided to re-read this impactful 2021 paper about the data lakehouse, which argues for a “lakehouse” architecture within data warehousing. Below are my notes on it.
Novelty
- Data warehousing can/should be built on direct access to files in open data formats, preferably on the cloud
- This will help with reliability and cost (cheap storage on the cloud), and offer good performance even for ML workloads
- Reliability, because there’s one less ETL step compared to the two-tier data lake plus warehouse setup
- Performance will be achieved through indexing layers on top of the basic storage
My summary of the paper
- Existing data warehouse architectures will get phased out in favour of the “lakehouse”. Lakehouses will provide: i) open, direct-access data formats ii) first-class support for ML and data science workloads iii) state-of-the-art performance
- First generation of warehouses: systems like Teradata. They are schema-on-write, they couple compute and storage, and they struggle with unstructured data.
- Second generation: data lakes (Hadoop + HDFS/S3). They offer cheap, flexible storage in open formats, but the schema-on-read approach leads to issues with reliability, data quality and governance. Data has to be ETL-ed into the data lake first, and then again into warehouses like Teradata. The growth of cloud object storage made the storage part cheap and reliable. Still, a big downside is the two ETL steps needed to get data into a warehouse.
- Also, neither of these generations was designed for advanced analytics and ML workloads
- The lakehouse architecture will be based on open file formats, give reliable, performant access to the data, and also support ML-type workloads. It will avoid the accidental complexity of the data-lake-style architecture. Today, even tools like Hive allow you to access external tables, but they give up ACID transactional support at that point.
- Implementing a lakehouse
- Store data in a low-cost object store on the cloud, but in an open file format
- Have a metadata layer (similar to a write-ahead log) on top to help with transactions (especially tracking which files make up which version of a table). This can also help with access control and governance.
- SQL optimization is a huge question. Some techniques for this: Caching, auxiliary data structures, and data layout optimizations. Note that these are independent of the actual underlying format.
- For ML and analytics workloads: declarative DataFrame APIs that enable lower-level optimizations during data access
- More details about each of these bits:
- Metadata layer: store which objects are part of which table, in a transaction-log format. Enables: transactions, zero-copy cloning, and time travel to past versions of a table. Governance can also be implemented through this metadata layer. (I’ve sketched a toy version of this right after this summary.)
- SQL performance (leaving out details because they’re self-explanatory):
- Caching
- Auxiliary data, like min and max values of different columns in a file (see the file-skipping sketch after this summary)
- Data layout optimizations
- ML and analytics workloads: declarative DataFrame APIs that also leverage the transaction log, caching, auxiliary data and data layout optimizations already described above (a tiny sketch of this idea also follows after this summary)
- Possible future work
- Come up with completely new storage formats that enable more data layout optimizations or are better suited to modern hardware
- More possibilities in caching, storing auxiliary data, and data layout optimizations
- Where and how to use serverless computing systems
- “Factorized ML” frameworks that push ML logic into SQL joins
- Can “feature stores” also benefit from the table versioning that Lakehouses offer?
- Other ways to satisfy Lakehouse goals?
- Massively parallel serving layer (such as Hive LLAP). But if this approach doesn’t read directly from the warehouse format, performance will suffer. Also, orgs may want to keep data in open format for compliance and future-proofing
- Impact of Lakehouse on other trends (a great section)
- No need for “polystores” -> basically systems that query multiple storage engines and combine the results (Presto?)
- Data integration and cleaning can happen on the lakehouse
- HTAP systems could be built as bolt-on layers in front of a lakehouse (strong claim!)
- Data management for ML could be built on top of a lakehouse
- Cloud-native DBMS engines may need to integrate with metadata layers of Lakehouses
- Different teams can access the data from the object store without onboarding to the same compute engine
- Related work
- Snowflake, BigQuery and Hive. But in this context they are typically not storing the data themselves; they end up reading already-stored data via “external tables” or connectors. Hence they can’t provide ACID support over that data.
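Before getting to my own comments, here are a few toy sketches I wrote to make the implementation bullets above concrete. First, the metadata layer: a minimal transaction log, loosely inspired by what logs like Delta Lake’s do but heavily simplified (a real system also needs atomic commits, schemas, statistics, and so on). Each commit is a JSON file recording which data files were added or removed; replaying the log gives the current snapshot, and stopping the replay early gives time travel.

```python
import json
import tempfile
from pathlib import Path

class TransactionLog:
    """Toy metadata layer: one JSON file per commit, listing data files
    added to / removed from the table. Purely illustrative."""

    def __init__(self, table_dir):
        self.log_dir = Path(table_dir) / "_txn_log"
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def _versions(self):
        return sorted(int(p.stem) for p in self.log_dir.glob("*.json"))

    def commit(self, add=(), remove=()):
        """Record a new table version that adds/removes the given data files."""
        versions = self._versions()
        version = versions[-1] + 1 if versions else 0
        entry = {"add": list(add), "remove": list(remove)}
        (self.log_dir / f"{version}.json").write_text(json.dumps(entry))
        return version

    def snapshot(self, as_of=None):
        """Replay the log to find the live data files at a version (time travel)."""
        files = set()
        for v in self._versions():
            if as_of is not None and v > as_of:
                break
            entry = json.loads((self.log_dir / f"{v}.json").read_text())
            files |= set(entry["add"])
            files -= set(entry["remove"])
        return files

log = TransactionLog(tempfile.mkdtemp())
log.commit(add=["part-000.parquet", "part-001.parquet"])
log.commit(add=["part-002.parquet"], remove=["part-000.parquet"])
print(log.snapshot())         # latest version: part-001 and part-002
print(log.snapshot(as_of=0))  # time travel to v0: part-000 and part-001
```

In this toy model, zero-copy cloning falls out of the same design: a clone is just a new log whose first commit adds an existing snapshot’s file list, with no data files copied.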
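Second, the “auxiliary data” point: per-file min/max column statistics let the planner skip whole files whose value ranges cannot match a predicate. The file names and statistics below are invented for the example.

```python
# Per-file (min, max) of a date column, i.e. a zone-map-style index.
file_stats = {
    "part-001.parquet": ("2021-01-01", "2021-03-31"),
    "part-002.parquet": ("2021-04-01", "2021-06-30"),
    "part-003.parquet": ("2021-07-01", "2021-09-30"),
}

def files_to_scan(lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi];
    every other file is skipped without being read at all."""
    return [path for path, (col_min, col_max) in file_stats.items()
            if col_max >= lo and col_min <= hi]

# e.g. WHERE order_date BETWEEN '2021-05-15' AND '2021-08-01'
print(files_to_scan("2021-05-15", "2021-08-01"))
# -> ['part-002.parquet', 'part-003.parquet']  (part-001 is never read)
```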
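Third, the declarative DataFrame idea: operations only build up a logical plan, and at collect() time the engine can push filters down into the same file-skipping step, so ML-style access paths benefit from the optimizations above too. This isn’t any real engine’s API, just the shape of the idea.

```python
# Hypothetical per-file (min, max) stats for the filter column, as above.
FILE_STATS = {
    "part-001.parquet": ("2021-01-01", "2021-03-31"),
    "part-002.parquet": ("2021-04-01", "2021-06-30"),
}

class LazyFrame:
    """Toy declarative DataFrame: methods only record a plan; nothing is read
    until collect(), at which point filters are pushed into file skipping."""

    def __init__(self, files, plan=()):
        self.files, self.plan = list(files), tuple(plan)

    def filter_between(self, lo, hi):
        # Declarative: record the intent, don't touch any data yet.
        return LazyFrame(self.files, self.plan + (("filter", lo, hi),))

    def collect(self):
        files = self.files
        for op, lo, hi in self.plan:
            if op == "filter":
                # Push the filter down: use the auxiliary stats to skip files.
                files = [f for f in files
                         if FILE_STATS[f][1] >= lo and FILE_STATS[f][0] <= hi]
        return files  # a real engine would now scan only these files

df = LazyFrame(FILE_STATS)
print(df.filter_between("2021-05-01", "2021-05-31").collect())
# -> ['part-002.parquet']
```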
My analysis and comments
The fact that you no longer have to copy data from a lakehouse into a separate warehouse is a big architectural change. A key problem with data lakes was that data had to be ETL-ed out of them into warehouses like BigQuery or Snowflake. Avoiding that failure mode implies you shouldn’t have to ETL data out of lakehouses either. But are there already use cases today that force such exports? For example, how do lakehouses fit in with vector databases and LLMs? On a related note, you can’t make too much of the fact that lakehouses can handle unstructured data, because tomorrow new types of “unstructured” data may arise.
I think in general, performance and data staleness can’t have permanent solutions, because every new architecture will face these problems over time.
It is not clear whether the paper is just describing an industry trend (withering away of warehouses), or advocating specific technical choices (adopting an open data file format), or both.
An open data file format is not useful on its own. Firstly, any file format is already “open” within the context of whichever company or governance structure has full access to it. If you’re advocating for openness outside these existing contexts, then what you’re actually saying is that lots of new applications and optimizations could be built on top, if only we standardized data access APIs. While that may happen, there is always an inherent tension between what proprietary lakehouse products will themselves offer and what the rest of the ecosystem can offer.
A lakehouse gets close to violating the relational database approach of exposing only relations and not the underlying storage structures. This design choice needs more justification, especially since performance optimization may get hampered in the long run.
The “object store” part should have been explained more, given that it is a vital part of the architecture. It is treated as a black box and taken for granted. Are object stores a good starting point for a lakehouse?
The other papers and articles provided as references in this paper are very good! Depending on your previous background and curiosity, you should check out some of them.
Questions
- The paper mentions innovations like metadata layers, auxiliary data structures and caching. I don’t know whether all of these are novel, or whether existing DBMSes already use them internally.
- The paper doesn’t say what it would take to modify an existing OLTP DBMS to become a lakehouse. Their example implementation (Delta Lake) starts with a data lake.
- Do lakehouses now support transactions spanning multiple tables?
Presentation aspects
The architecture (section 3) is described in very abstract terms; an example of what is meant by the metadata layer would be nice. The future directions and alternative designs for all the components could have been consolidated into one section towards the end for ease of reading, rather than being given separately within each section.