https://crss.us/media/pubs/1f0c405f9fa2cc9de23a45710fa85b9e7330a958.pdf
Pure Storage is another company with an industry paper. Pure Storage has a market cap of ~$20B USD.
Abstract
- The paper describes the foundation of an all-flash enterprise storage system (Purity).
- Purity uses SSDs, then adds deduplication and data compression on top of it.
1 Introduction
- Flash (SSDs) is becoming cheaper. However, enterprise users have higher resiliency, availability, and cost requirements.
- Purity is an enterprise storage system.
- Data is stored in “mediums”, a virtual/logical mapping abstraction.
- System is designed around fast random reads and sequential writes. Note this is different from HDDs, which prefer both sequential reads and sequential writes.
- The authors claim they have cost parity with disk.
- Purity uses log-structured indexes and data layouts.
- They also use striping and Reed-Solomon erasure encoding.
- Controller high availability
- 5.4x data reduction
2 Background
SSDs:
- SSDs are faster than mechanical disk.
- SSDs are a set of flash chips + an ASIC.
- SSD firmware is complex.
- ASIC + flash chips are called the flash translation layer (FTL)
- Flash chips contain dies that can run in parallel (support parallel reads).
- Dies contain erase blocks of 2-16MiB. Blocks contain pages of 512-4096 bytes.
- Erases clear all data in an erase block; pages must be erased before they can be rewritten.
- SSDs store 1-4 bits per cell. Storing more decreases endurance.
- SLC SSDs store 1 bit, support 100,000 P/E cycles.
- MLC SSDs store 2 bits, support only 3000-5000 P/E cycles.
- Purity uses MLC SSDs
Disk-based enterprise storage:
- Traditional architecture = a few controllers with multiple disks behind.
- This allows applications to support >10k random IOPS when a single disk can only support <1k.
- Also called block storage, or arrays.
- Arrays contain RAID-protected (striped) disks, battery-backed RAM, and a large (in-memory) data cache.
Scale-out storage:
- Pure Storage can also be compared to scale-out storage such as S3, Spanner, and DynamoDB.
- They argue that scale-up flash is better than scale-out storage in many cases.
3 Design Principles
- Extra reads to save space
- Flash capacity is more expensive than disk.
- Techniques: compression, error correction
- Logical monotonicity
- Data is stored as immutable facts (tuples), associated with sequence numbers.
- CPU core parallelism can be leveraged better.
- Inserts and deletes are idempotent and commutative.
- Immutability supports atomic bulk deletion. Delete all that match a predicate.
- Approach to inserts is similar to CALM theorem: eventual consistency can be built using logical monotonicity.
- Sequence numbers used to implement stronger semantics (linearizable updates, etc.)
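The immutable-facts approach above can be sketched in Python. This is a toy model, not Purity's actual API; `FactStore` and its methods are hypothetical names. It shows the properties the notes list: re-inserting a fact is a no-op (idempotent), facts can arrive in any order because the highest sequence number wins on read (commutative), and bulk deletion records deletion facts matching a predicate rather than mutating data.

```python
class FactStore:
    """Toy model of immutable (key, seq, value) facts."""

    def __init__(self):
        self.facts = set()  # grow-only set of immutable tuples

    def insert(self, key, value, seq):
        # Idempotent: adding the same fact twice is a no-op.
        self.facts.add((key, seq, value))

    def read(self, key):
        # Commutative: arrival order is irrelevant; highest seq wins.
        matches = [(sq, v) for (k, sq, v) in self.facts if k == key]
        return max(matches)[1] if matches else None

    def bulk_delete(self, predicate, seq):
        # Atomic bulk delete: record deletion facts for all matching
        # keys instead of mutating or removing existing facts.
        for (k, sq, v) in list(self.facts):
            if predicate(k) and sq < seq:
                self.insert(k, None, seq)
```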
- Writes are careful
- The FTL is optimized for consumer workloads, which differ from enterprise workloads.
- So Purity assumes that the FTL does not support random writes well.
- Log-structured layouts and data reduction are used to turn small random writes into large sequential writes.
- Approximate materialized aggregates kept to reduce writes.
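The log-structured write path can be sketched as follows (a simplified model with hypothetical names, not Purity's actual on-disk format): every logical write, random or not, becomes a sequential append to a log, and an index maps each logical address to its latest log position.

```python
class LogStructuredStore:
    """Toy log-structured store: random writes become sequential appends."""

    def __init__(self):
        self.log = []    # append-only "device": sequential writes only
        self.index = {}  # logical address -> position in the log

    def write(self, addr, data):
        # An overwrite does not modify old data in place; it appends a
        # new entry and repoints the index (old entry becomes garbage).
        self.index[addr] = len(self.log)
        self.log.append((addr, data))

    def read(self, addr):
        pos = self.index.get(addr)
        return None if pos is None else self.log[pos][1]
```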
- Virtualization
- Abstractions used to enable compression, deduplication, and data layout selection.
4 Implementation
Hardware
- An array contains:
- 2 controllers (dual-socket x86)
- Network links: 12×16 Gb/s Fibre Channel, 12×10 Gb/s Ethernet (iSCSI)
- Shelves containing 11-24 MLC SSDs.
- SSDs are consumer-grade SATA
- Shelves contain NVRAM devices.
- Physical storage layout
- Data is partitioned into segments. Segments are striped across SSDs. Tolerates up to 2 SSD failures by default.
- Segment data on a single drive is called an AU (allocation unit). Each AU is 8MB.
- Segments are written in 1MB write units.
- Segments are flushed from memory to SSDs when a segio is “full”.
- A segio contains log records on one end and data on the other end.
- Writes are committed to NVRAM and then written to the SSDs’ segios.
- Recovery
- Check all segment headers. Takes 12 seconds. Clients have 30 second timeout. This operation happens both on startup and on primary controller failure.
- Primary controller asynchronously warms the secondary controller’s cache, reducing I/Os for scanning all segment headers.
- Boot region contains allocator state (for AUs) and location of the relations
- The boot region contains a frontier set. The frontier set lists AUs that are ready to be reused.
- On failover, scanning for log records means only scanning segments in the frontier set.
- Frontier sets reduce startup scan time from 12s to 0.1s.
- Recovery can handle corrupted pages and missing drives, since log records are stored in segments, which are replicated.
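The frontier-set optimization above can be sketched like this (hypothetical names and structures; the real boot region and segment format are more involved): on failover, only the segments listed in the frontier set are scanned for log records, instead of every segment header in the array.

```python
def recover(all_segments, frontier_set):
    """Collect log records on failover by scanning only the segments
    named in the boot region's frontier set, not all segments."""
    records = []
    for seg_id in frontier_set:  # small subset of all_segments
        seg = all_segments[seg_id]
        records.extend(seg.get("log_records", []))
    return records
```

The 12s-to-0.1s improvement in the notes comes from the frontier set being tiny relative to the full set of segments.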
- I/O scheduling
- Event driven C++ with one thread per core.
- Event handlers run w/ affinity to NUMA groups
- Requests have bounded latency.
- Event handlers are lock-free.
- Use Reed-Solomon to reconstruct data when a request is slow (>95th percentile latency)
- SSD hardware and firmware are the leading cause of high app latency.
- Slow SSD reads tend to happen when SSDs are writing. So, SSDs that are writing are treated as failed (unavailable to process reads).
- 7+2 Reed-Solomon encoding used.
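The read-around policy above (avoid SSDs that are busy writing, rebuild the data from other shards instead) can be sketched with single-parity XOR as a simplified stand-in for the paper's 7+2 Reed-Solomon code; XOR parity tolerates one missing shard, whereas the real two-parity code tolerates two. All names here are hypothetical.

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(shards):
    """Append one XOR parity shard to equal-length data shards."""
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return shards + [parity]

def read_shard(all_shards, want, busy):
    """Read shard `want`; if its drive is busy writing, reconstruct it
    by XOR-ing the remaining shards instead of waiting on a slow read."""
    if want not in busy:
        return all_shards[want]
    rebuilt = None
    for i, s in enumerate(all_shards):
        if i == want:
            continue
        rebuilt = s if rebuilt is None else xor_bytes(rebuilt, s)
    return rebuilt
```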
- Mediums
- UI = volumes (e.g. /var/mnt/…)
- Client-visible data address = (volume, offset). Purity internally handles data using a logical address (medium, offset).
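The address translation can be sketched as two lookups (hypothetical structures and names; the real mapping goes through the Pyramid indexes): a client (volume, offset) resolves to a (medium, offset), which resolves to a physical (segment, offset).

```python
# Toy translation tables; entries are illustrative only.
volume_to_medium = {"vol1": "m7"}            # volume -> backing medium
medium_map = {("m7", 4096): ("seg42", 128)}  # (medium, off) -> (segment, off)

def resolve(volume, offset):
    """Translate a client-visible (volume, offset) to a physical
    (segment, offset) via the medium abstraction."""
    medium = volume_to_medium[volume]
    return medium_map[(medium, offset)]
```

The indirection through mediums is what lets Purity relocate, deduplicate, or recompress data without changing client-visible addresses.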
- Cblocks
- Minimum block size in storage protocols is 512B.
- Application data is stored in compressed blocks (cblocks).
- Each medium contains a set of cblocks.
- Cblocks are dynamically increased in size to match application writes.
- Deduplication
- Blocks are deduplicated:
- Only 1 in 8 block hashes is recorded.
- Most duplicate sequences of 8 blocks are detected.
- When duplicate detected, mapping of logical address to (segment, offset) is recorded.
- Inline deduplication is used as well.
- Log-structured indexes
- Metadata stored in LSM trees called Pyramids.
- Pyramids have multiple levels, like standard LSM trees.
- Merge and flatten operate on “patches”
- Merge and flatten are idempotent.
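The idempotence of merge can be shown with a toy patch merge (a simplified representation; real Pyramid patches are columnar on-disk structures): entries from the newer patch shadow the older patch per key, and merging the result with itself changes nothing.

```python
def merge(newer, older):
    """Merge two key->value patches; the newer patch wins per key.
    Running merge again on its own output is a no-op (idempotent)."""
    out = dict(older)
    out.update(newer)
    return dict(sorted(out.items()))
```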
- Metadata layout
- Metadata stored in columnar format.
- Run-length encoding used.
- Range encoding also used.
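Run-length encoding on a metadata column can be sketched generically (the paper's exact encoding details are not reproduced here; this is a textbook RLE): repeated values in a column collapse to (value, count) pairs, which is effective because columnar metadata tends to have long runs of identical values.

```python
def rle_encode(values):
    """Collapse runs of equal values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original column."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out
```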
- Elision
- Use “elide rules” instead of tombstones.
- Supports lock-free operations.
- Garbage collector uses the elide tables when doing LSM Tree merges.
- Elide records are stored as ranges. Ranges are merged.
- Elide tables do not grow unboundedly.
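The range-merging behavior above can be sketched as follows (hypothetical representation: elide records modeled as inclusive integer ranges over keys). Because overlapping and adjacent ranges collapse into one, the elide table stays bounded even as deletions accumulate.

```python
def add_range(ranges, new):
    """Insert an inclusive (lo, hi) range into a sorted list of
    disjoint ranges, merging any that overlap or touch."""
    lo, hi = new
    out = []
    for r_lo, r_hi in ranges:
        if r_hi < lo - 1 or r_lo > hi + 1:
            out.append((r_lo, r_hi))  # disjoint: keep as-is
        else:
            lo, hi = min(lo, r_lo), max(hi, r_hi)  # absorb into new range
    out.append((lo, hi))
    return sorted(out)
```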
5 Customer Deployments
- Reliability
- Telemetry “phoned home” (to AWS)
- Performance degradation noticed in telemetry is fixed proactively.
- Similar to SaaS model.
- Database deployments
- Lower latencies lead to lower rollback rates, which improves database performance more than expected.
- 10x speedups can be created when only 2x speedup is expected based on CPU and IO utilization.
- 5-minute rule
- 5 minute rule = “data accessed every 5 minutes or less is in RAM, other data on disk”
- Warm data can be evicted from RAM to flash sooner, since flash read latency is low.
- Purity allows for a 10-minute rule instead.
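The five-minute rule comes from Gray and Putzolu's break-even formula, sketched below with entirely hypothetical prices and rates: a page is worth caching in RAM if it is re-accessed at least once per break-even interval. Flash delivers far more IOPS per dollar than disk, which shrinks the interval and shifts where the RAM/device boundary should sit.

```python
def break_even_seconds(pages_per_mb_ram, ios_per_sec, price_per_device,
                       price_per_mb_ram):
    """Gray & Putzolu break-even interval: cache a page in RAM if it is
    re-accessed at least this often; otherwise leave it on the device."""
    return (pages_per_mb_ram / ios_per_sec) * (price_per_device / price_per_mb_ram)

# Illustrative numbers only (not from the paper): a flash device with
# 100x the IOPS of a disk at the same price has a 100x shorter
# break-even interval, so far less data needs to live in RAM.
disk_interval = break_even_seconds(128, 100, 300, 0.01)
flash_interval = break_even_seconds(128, 10000, 300, 0.01)
```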
- Virtualization environments
- Purity also used to run many VMs at once
- Deduplication ratios of 5-10x are common.
- Cloud service providers
- Purity also used by CSPs (AWS, Microsoft, etc.)
- An 8-rack Purity deployment is comparable to 160 racks of OpenStack storage nodes.
6 Related Work
- LSM trees
- FAWN scale-out storage
- KV workload studies
- Sequence numbers similar to Corfu
- Flash hardware limitation studies
- Deduplication studies