Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics

This paper argues that the shared-nothing architecture, or the data lake + warehouse setup, is less attractive than a unified “lakehouse” model. It is true that Snowflake and Databricks are quite usable and cost-effective.

A few parts of the paper are debatable. The first is the assertion that object storage is cheaper than block storage:

  • Amazon S3 may be 10x more expensive than bare-metal storage (both block and object).
  • Object storage is not necessarily more efficient than block storage until you factor in average/peak IOPS and replication/erasure coding. Certain high-frequency access patterns are impossible or more expensive with object storage.
  • The cloud makes development convenient, but notable cost reductions are possible by moving off the cloud while still using something like Kubernetes.
  • In the cloud, networking is expensive. You can expect something like 10 MB/s/core, but you can get 100 Gbps with certain protocols on your own hardware.
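
To put the two networking figures side by side, here is a rough back-of-the-envelope calculation. It uses only the numbers quoted above and assumes decimal units (1 Gbps = 1000 Mbps, 8 bits per byte); the function names are made up for illustration.

```python
# Back-of-the-envelope comparison using only the two figures quoted above.
CLOUD_MB_PER_S_PER_CORE = 10   # "10 MB/s/core" from the text
OWN_HARDWARE_GBPS = 100        # "100 Gbps" from the text


def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert gigabits/s to megabytes/s (decimal units, 8 bits per byte)."""
    return gbps * 1000 / 8


def cores_to_match(gbps: float, mb_per_s_per_core: float) -> float:
    """Cloud cores needed to match a link speed at the quoted per-core rate."""
    return gbps_to_mb_per_s(gbps) / mb_per_s_per_core


link_mb_per_s = gbps_to_mb_per_s(OWN_HARDWARE_GBPS)
cores = cores_to_match(OWN_HARDWARE_GBPS, CLOUD_MB_PER_S_PER_CORE)
print(f"{OWN_HARDWARE_GBPS} Gbps is {link_mb_per_s:.0f} MB/s, "
      f"roughly {cores:.0f} cores at {CLOUD_MB_PER_S_PER_CORE} MB/s/core")
```

Under these assumptions, a single 100 Gbps link carries as much as about 1250 cloud cores at the quoted per-core rate.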

At the same time, the idea of designing systems around tiered persistent storage does make sense, given that object storage will be cheaper than block storage for [...]. However, shared storage should not always be the “source of truth”, as that could lead to poor performance (especially latency-wise).
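
A minimal sketch of the tiering idea, assuming a write-through local tier in front of a slow shared store. The classes `ObjectStore` and `TieredStore` are hypothetical stand-ins, not any real system's API, and the latency is simulated with a sleep.

```python
import time
from typing import Dict


class ObjectStore:
    """Hypothetical stand-in for a slow shared store (the source of truth)."""

    def __init__(self, latency_s: float = 0.001) -> None:
        self._blobs: Dict[str, bytes] = {}
        self.latency_s = latency_s
        self.reads = 0  # counts how often a read had to go remote

    def put(self, key: str, data: bytes) -> None:
        time.sleep(self.latency_s)  # simulated network round trip
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        time.sleep(self.latency_s)
        self.reads += 1
        return self._blobs[key]


class TieredStore:
    """Write-through local tier in front of the shared store."""

    def __init__(self, remote: ObjectStore) -> None:
        self.remote = remote
        self.local: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.remote.put(key, data)  # every write pays the remote latency
        self.local[key] = data

    def get(self, key: str) -> bytes:
        if key in self.local:       # fast path: no remote round trip
            return self.local[key]
        data = self.remote.get(key)
        self.local[key] = data
        return data


remote = ObjectStore()
store = TieredStore(remote)
store.put("t/part-0", b"rows")
first = store.get("t/part-0")    # served from the local tier
store.local.clear()              # simulate losing the local tier
second = store.get("t/part-0")   # falls back to the shared store
```

In this design the shared store stays authoritative, so every write waits on it. That is the latency cost mentioned above; a system that treats the local tier as the source of truth would trade that cost for a harder durability story.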

The paper also argues for some things I am less knowledgeable about:

  • Supporting unstructured data (images, etc.)
  • ACID transactions. This feature has slowly been added to all data warehouses over time.
  • Metadata management. With metadata stores, you can optimize your data layout, track “tables”, etc. You also need metadata management to manage transactions.
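
To illustrate why transactions depend on metadata, here is a toy in-memory commit log with optimistic concurrency. It is a sketch under my own assumptions, not the protocol of Delta Lake or any other real system; `TableLog`, `Commit`, and `CommitConflict` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Commit:
    added: FrozenSet[str]     # data files this commit adds to the table
    removed: FrozenSet[str]   # data files this commit removes


class CommitConflict(Exception):
    """Raised when a writer commits against a stale table version."""


class TableLog:
    """Toy metadata log: the table is whatever files the log says it is."""

    def __init__(self) -> None:
        self._commits: List[Commit] = []

    @property
    def version(self) -> int:
        return len(self._commits)

    def snapshot(self, version: Optional[int] = None) -> Set[str]:
        """Replay the log up to `version` to get the live set of files."""
        upto = self.version if version is None else version
        files: Set[str] = set()
        for commit in self._commits[:upto]:
            files -= commit.removed
            files |= commit.added
        return files

    def commit(self, read_version: int, added: Iterable[str] = (),
               removed: Iterable[str] = ()) -> int:
        """Append atomically, but only if nobody committed since we read."""
        if read_version != self.version:
            raise CommitConflict(
                f"read v{read_version}, table is at v{self.version}")
        self._commits.append(Commit(frozenset(added), frozenset(removed)))
        return self.version


log = TableLog()
v1 = log.commit(0, added=["part-0.parquet"])
v2 = log.commit(v1, added=["part-1.parquet"])
# Layout optimization: compact two small files into one, in one commit.
v3 = log.commit(v2, added=["part-compact.parquet"],
                removed=["part-0.parquet", "part-1.parquet"])

try:
    log.commit(v1, added=["late.parquet"])  # writer with a stale view
    conflict = False
except CommitConflict:
    conflict = True
```

The same log covers both points in the bullet: it tracks which files make up the “table” (so compaction is just another commit), and it gives writers a single place to detect conflicts.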

I am excited about the many optimizations and features they have identified as possible. I hope they use the engineering bandwidth at Databricks to make them happen.

The Composable Data Management System Manifesto

The authors at Meta notice that databases are, or can be seen as, sets of composable libraries. Previously, most databases were built as monoliths.

They argue that you can have one repo that handles storage execution, one for SQL parsing, one for SQL optimizing, one for data layout, etc. Then, people can choose and compose these libraries into working DBMSs.
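
A toy sketch of what that composition could look like, assuming each component sits behind a small interface. The interfaces, the `Plan` type, and the three-word query language are all invented for illustration; they are not taken from the paper or from any existing project.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol

Row = Dict[str, int]


@dataclass
class Plan:
    """Shared representation passed between components."""
    table: str
    column: str
    min_value: int  # keep rows where column > min_value


class Parser(Protocol):
    def parse(self, query: str) -> Plan: ...


class Optimizer(Protocol):
    def optimize(self, plan: Plan) -> Plan: ...


class Executor(Protocol):
    def execute(self, plan: Plan) -> List[Row]: ...


class ToyParser:
    """Parses 'table column min_value', a made-up mini language, not SQL."""

    def parse(self, query: str) -> Plan:
        table, column, min_value = query.split()
        return Plan(table, column, int(min_value))


class NoOpOptimizer:
    """Placeholder: a real optimizer would rewrite the plan here."""

    def optimize(self, plan: Plan) -> Plan:
        return plan


class InMemoryExecutor:
    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self.tables = tables

    def execute(self, plan: Plan) -> List[Row]:
        rows = self.tables[plan.table]
        return [r for r in rows if r[plan.column] > plan.min_value]


@dataclass
class Engine:
    """A 'DBMS' assembled from independently swappable parts."""
    parser: Parser
    optimizer: Optimizer
    executor: Executor

    def run(self, query: str) -> List[Row]:
        plan = self.optimizer.optimize(self.parser.parse(query))
        return self.executor.execute(plan)


engine = Engine(ToyParser(), NoOpOptimizer(),
                InMemoryExecutor({"t": [{"x": 1}, {"x": 5}]}))
result = engine.run("t x 2")
```

The part that makes this work is the shared `Plan` type: any parser, optimizer, or executor that agrees on it can be swapped in without touching the others.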

They mention a few existing projects around each area. This is quite useful for people who are creating DBMSs but do not want to rewrite everything from scratch.