My first few paper summaries will be on industry papers. The first one will be on a product that I used at work, and a product behind a company worth ~$40B.
https://event.cwi.nl/lsde/papers/p215-dageville-snowflake.pdf
Bolded terms are terms I am unfamiliar with, and may review later on.
1 Introduction
Snowflake is a data warehouse. As discussed later, the user interface is SQL, used through a database driver or through the UI.
If the reader is unfamiliar with data warehousing, they may find the related work section relevant. Snowflake’s competition at the time included Redshift, BigQuery, and Hadoop.
The authors’ mention the following as key features:
- Pure SaaS experience.
- Relational: ANSI SQL and ACID
- Semi-structured
- Elastic
- Highly available
- Durable
- Cost-efficient
- Secure
2 Storage Versus Compute
The authors provide an argument for separation of storage and compute.
When you consider cloud vs on-premise pricing, this makes sense.
3 Architecture
Snowflake has 3 layers:
-
Data Storage: (S3 Object Storage)
- Tables are partitioned into S3 files.
- Data format: PAX, hybrid columnar.
- Header contains metadata, offsets.
- You partially read a file (just the metadata) and can “seek()” to avoid reading all the data in a file.
- Also used as a cache when local disk is full.
- Metadata is stored in a transactional KV store.
-
Virtual Warehouses
-
A VW is a cluster of EC2 instances.
-
Create/destroy warehouse does not affect the data.
-
Queries are run on a single warehouse. EC2 instances are not shared between warehouses.
-
Architecture immediately provides elasticity and performance isolation.
-
Local disk caching:
- Consistent hashing is used. Future queries that access a table file will use the same EC2 Instance.
-
Skew handling: A node that is faster than others can steal files from a peer. Nodes talk to one another.
-
Execution engine:
-
Columnar storage and execution
-
Vectorized, pipelined processing
-
Push-based: to improve cache efficiency
- Vs pull-based, Volcano.
-
No buffer pool (in-memory caching).
-
-
-
Cloud Services: A multi-tenant, set of HTTP services.
-
Access control, query optimizer, transaction manager, etc.
-
Replicated, scale-out.
-
Query optimizer:
- Cascades-style
- No indices
-
Concurrency Control
- ACID transations w/ snapshot isolation
- MVCC: natural due to immutable table files in S3.
-
Pruning
- Min-max based pruning, also known as small materialized aggregates [38], zone maps [29], and data skipping [40]
- Rely on pruning over metadata.
- Both static and dynamic pruning supported.
-
4 Features:
-
Web UI
-
JDBC/ODBC, etc.
-
No failure modes, tuning nobs, storage grooming tasks.
-
Fault tolerance:
- Metadata store is replicated across AZs.
- AZ failures are tolerated
-
Online upgrades:
- Users progressively switched to new version.
- Cache and metadata storage shared.
- Upgrades once per week.
-
Semi-structured data:
- Variant and array types.
- Allows snowflake to support ELT instead of ETL.
- Post-relational operations: extraction, flattening, and aggregation
- Schema inference. Store frequent columns separately as columns.
- Bloom filters.
- Optimizstic conversion, to improve query speed.
-
Time travel
- Retain old copies of files in S3
-
Security
- 2 factor authentication, encrypted data
- Key Hierarchy AWS Cloud HSM. 4 levels: root, account, table, file keys.
- Key rotation.
- Rekeying: re-encrypt old data with new keys.
5 Related work
- AWS Redshift/ParAccel
- BigQuery
- Microsoft SQL Data warehouse
6 Lessons Learned
- SQL on Hadoop. Snowflake replaces Hadoop.
- Performance not an issue to users.
- Issues: multitenancy, node failres, network failures. Security as an issue.