https://crss.us/media/pubs/1f0c405f9fa2cc9de23a45710fa85b9e7330a958.pdf
Pure Storage is another company with an industry paper. Pure Storage has a market cap of ~$20B USD.
Abstract
- The paper describes the foundation of an all-flash enterprise storage system (Purity).
- Purity uses SSDs, then adds deduplication and data compression on top of it.
1 Introduction
- Flash (SSDs) is becoming cheaper. However, enterprise users have higher resiliency, availability, and cost requirements.
- Purity is an enterprise storage system.
- Data is stored in “mediums”, a virtual/logical mapping abstraction.
- System is designed around fast random reads and sequential writes. Note this is different from HDDs, which prefer both sequential reads and sequential writes.
- The authors claim they have cost parity with disk.
- Purity uses log-structured indexes and data layouts.
- They also use striping and Reed-Solomon erasure encoding.
- Controller high availability
- 5.4x data reduction
2 Background
SSDs:
- SSDs are faster than mechanical disk.
- SSDs are a set of flash chips + an ASIC.
- SSD firmware is complex.
- ASIC + flash chips are called the flash translation layer (FTL)
- Flash chips contain dies that can run in parallel (support parallel reads).
- Dies contain erase blocks of 2-16MiB. Blocks contain pages of 512-4096 bytes.
- Erases clear all data in an erase block; pages must be erased before they can be rewritten.
- SSDs store 1-4 bits per cell. Storing more decreases endurance.
- SLC SSDs store 1 bit, support 100,000 P/E cycles.
- MLC SSDs store 2 bits, support only 3000-5000 P/E cycles.
- Purity uses MLC SSDs
Disk-based enterprise storage:
- Traditional architecture = a few controllers with multiple disks behind.
- This allows applications to support >10k random IOPS when a single disk can only support <1k.
- Also called block storage, or arrays.
- Arrays contain RAID-protected (striped) disks, battery-backed RAM, and a large (in-memory) data cache.
Scale-out storage:
- Pure Storage can also be compared to scale-out storage such as S3, Spanner, and DynamoDB.
- They argue that scale-up flash is better than scale-out storage in many cases.
3 Design Principles
- Extra reads to save space
- Flash capacity is more expensive than disk.
- Techniques: compression, error correction
- Logical monotonicity
- Data is stored as immutable facts (tuples), associated with sequence numbers.
- CPU core parallelism can be leveraged better.
- Inserts and deletes are idempotent and commutative.
- Immutability supports atomic bulk deletion. Delete all that match a predicate.
- Approach to inserts is similar to CALM theorem: eventual consistency can be built using logical monotonicity.
- Sequence numbers used to implement stronger semantics (linearizable updates, etc.)
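The immutable-facts approach above can be sketched in Python. This is a toy model, not Purity's actual API; `FactStore` and its methods are hypothetical names. It shows the properties the notes list: re-inserting a fact is a no-op (idempotent), facts can arrive in any order because the highest sequence number wins on read (commutative), and bulk deletion records deletion facts matching a predicate rather than mutating data.

```python
class FactStore:
    """Toy model of immutable (key, seq, value) facts."""

    def __init__(self):
        self.facts = set()  # grow-only set of immutable tuples

    def insert(self, key, value, seq):
        # Idempotent: adding the same fact twice is a no-op.
        self.facts.add((key, seq, value))

    def read(self, key):
        # Commutative: arrival order is irrelevant; highest seq wins.
        matches = [(sq, v) for (k, sq, v) in self.facts if k == key]
        return max(matches)[1] if matches else None

    def bulk_delete(self, predicate, seq):
        # Atomic bulk delete: record deletion facts for all matching
        # keys instead of mutating or removing existing facts.
        for (k, sq, v) in list(self.facts):
            if predicate(k) and sq < seq:
                self.insert(k, None, seq)
```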
- Writes are careful
- The FTL is optimized for consumer workloads, which differ from enterprise workloads.
- So Purity assumes that the FTL does not support random writes well.
- Log-structured layouts and data reduction are used to turn small random writes into large sequential writes.
- Approximate materialized aggregates kept to reduce writes.
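The log-structured write path can be sketched as follows (a simplified model with hypothetical names, not Purity's actual on-disk format): every logical write, random or not, becomes a sequential append to a log, and an index maps each logical address to its latest log position.

```python
class LogStructuredStore:
    """Toy log-structured store: random writes become sequential appends."""

    def __init__(self):
        self.log = []    # append-only "device": sequential writes only
        self.index = {}  # logical address -> position in the log

    def write(self, addr, data):
        # An overwrite does not modify old data in place; it appends a
        # new entry and repoints the index (old entry becomes garbage).
        self.index[addr] = len(self.log)
        self.log.append((addr, data))

    def read(self, addr):
        pos = self.index.get(addr)
        return None if pos is None else self.log[pos][1]
```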
- Virtualization
- Abstractions used to enable compression, deduplication, and data layout selection.
4 Implementation
Hardware
- An array contains:
- 2 controllers (dual-socket x86)
- Network links: 12×16 Gb/s Fibre Channel, 12×10 Gb/s Ethernet (iSCSI)
- Shelves containing 11-24 MLC SSDs.
- SSDs are consumer-grade SATA
- Shelves contain NVRAM devices.
- Physical storage layout
- Data is partitioned into segments. Segments are striped across SSDs. Tolerates up to 2 SSD failures by default.
- Segment data on a single drive is called an AU (allocation unit). Each AU is 8MB.
- Segments are written in 1MB write units.
- Segments are flushed from memory to SSDs when a segio is “full”.
- A segio contains log records on one end and data on the other end.
- Writes are committed to NVRAM and then written to the SSDs’ segios.
- Recovery
- Check all segment headers. Takes 12 seconds. Clients have 30 second timeout. This operation happens both on startup and on primary controller failure.
- Primary controller asynchronously warms the secondary controller’s cache, reducing I/Os for scanning all segment headers.
- Boot region contains allocator state (for AUs) and location of the relations
- The boot region contains a frontier set. The frontier set lists AUs that are ready to be reused.
- On failover, scanning for log records means only scanning segments in the frontier set.
- Frontier sets reduce startup scan time from 12s to 0.1s.
- Recovery can handle corrupted pages and missing drives, since log records are stored in segments, which are replicated.
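The frontier-set optimization above can be sketched like this (hypothetical names and structures; the real boot region and segment format are more involved): on failover, only the segments listed in the frontier set are scanned for log records, instead of every segment header in the array.

```python
def recover(all_segments, frontier_set):
    """Collect log records on failover by scanning only the segments
    named in the boot region's frontier set, not all segments."""
    records = []
    for seg_id in frontier_set:  # small subset of all_segments
        seg = all_segments[seg_id]
        records.extend(seg.get("log_records", []))
    return records
```

The 12s-to-0.1s improvement in the notes comes from the frontier set being tiny relative to the full set of segments.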
- I/O scheduling
- Event driven C++ with one thread per core.
- Event handlers run w/ affinity to NUMA groups
- Requests have bounded latency.
- Event handlers are lock-free.
- Use Reed-Solomon to reconstruct data when a request is slow (>95th percentile latency)
- SSD hardware and firmware are the leading cause of high app latency.
- Slow SSD reads tend to happen when SSDs are writing. So, SSDs that are writing are treated as failed (unavailable to process reads).
- 7+2 Reed-Solomon encoding used.
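The read-around policy above (avoid SSDs that are busy writing, rebuild the data from other shards instead) can be sketched with single-parity XOR as a simplified stand-in for the paper's 7+2 Reed-Solomon code; XOR parity tolerates one missing shard, whereas the real two-parity code tolerates two. All names here are hypothetical.

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(shards):
    """Append one XOR parity shard to equal-length data shards."""
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return shards + [parity]

def read_shard(all_shards, want, busy):
    """Read shard `want`; if its drive is busy writing, reconstruct it
    by XOR-ing the remaining shards instead of waiting on a slow read."""
    if want not in busy:
        return all_shards[want]
    rebuilt = None
    for i, s in enumerate(all_shards):
        if i == want:
            continue
        rebuilt = s if rebuilt is None else xor_bytes(rebuilt, s)
    return rebuilt
```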
- Mediums
- UI = volumes (e.g. /var/mnt/…)
- Client-visible data address = (volume, offset). Purity internally handles data using a logical address (medium, offset).
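The address translation can be sketched as two lookups (hypothetical structures and names; the real mapping goes through the Pyramid indexes): a client (volume, offset) resolves to a (medium, offset), which resolves to a physical (segment, offset).

```python
# Toy translation tables; entries are illustrative only.
volume_to_medium = {"vol1": "m7"}            # volume -> backing medium
medium_map = {("m7", 4096): ("seg42", 128)}  # (medium, off) -> (segment, off)

def resolve(volume, offset):
    """Translate a client-visible (volume, offset) to a physical
    (segment, offset) via the medium abstraction."""
    medium = volume_to_medium[volume]
    return medium_map[(medium, offset)]
```

The indirection through mediums is what lets Purity relocate, deduplicate, or recompress data without changing client-visible addresses.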
- Cblocks
- Minimum block size in storage protocols is 512B.
- Application data is stored in compressed blocks (cblocks).
- Each medium contains a set of cblocks.
- Cblocks are dynamically increased in size to match application writes.
- Deduplication
- Blocks are deduplicated:
- Only 1 in 8 block hashes is recorded.
- Most duplicate sequences of 8 blocks are detected.
- When duplicate detected, mapping of logical address to (segment, offset) is recorded.
- Inline deduplication is used as well.
- Log-structured indexes
- Metadata stored in LSM trees called Pyramids.
- Pyramids have multiple levels, like standard LSM trees.
- Merge and flatten operate on “patches”
- Merge and flatten are idempotent.
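The idempotence of merge can be shown with a toy patch merge (a simplified representation; real Pyramid patches are columnar on-disk structures): entries from the newer patch shadow the older patch per key, and merging the result with itself changes nothing.

```python
def merge(newer, older):
    """Merge two key->value patches; the newer patch wins per key.
    Running merge again on its own output is a no-op (idempotent)."""
    out = dict(older)
    out.update(newer)
    return dict(sorted(out.items()))
```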
- Metadata layout
- Metadata stored in columnar format.
- Run-length encoding used.
- Range encoding also used.
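Run-length encoding on a metadata column can be sketched generically (the paper's exact encoding details are not reproduced here; this is a textbook RLE): repeated values in a column collapse to (value, count) pairs, which is effective because columnar metadata tends to have long runs of identical values.

```python
def rle_encode(values):
    """Collapse runs of equal values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original column."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out
```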
- Elision
- Use “elide rules” instead of tombstones.
- Supports lock-free operations.
- Garbage collector uses the elide tables when doing LSM Tree merges.
- Elide records are stored as ranges. Ranges are merged.
- Elide tables do not grow unboundedly.
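The range-merging behavior above can be sketched as follows (hypothetical representation: elide records modeled as inclusive integer ranges over keys). Because overlapping and adjacent ranges collapse into one, the elide table stays bounded even as deletions accumulate.

```python
def add_range(ranges, new):
    """Insert an inclusive (lo, hi) range into a sorted list of
    disjoint ranges, merging any that overlap or touch."""
    lo, hi = new
    out = []
    for r_lo, r_hi in ranges:
        if r_hi < lo - 1 or r_lo > hi + 1:
            out.append((r_lo, r_hi))  # disjoint: keep as-is
        else:
            lo, hi = min(lo, r_lo), max(hi, r_hi)  # absorb into new range
    out.append((lo, hi))
    return sorted(out)
```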
5 Customer Deployments
- Reliability
- Telemetry “phoned home” (to AWS)
- Performance degradation noticed in telemetry is fixed proactively.
- Similar to SaaS model.
- Database deployments
- Lower latencies lead to lower rollback rates, which improves database performance more than expected.
- 10x speedups can be created when only 2x speedup is expected based on CPU and IO utilization.
- 5-minute rule
- 5 minute rule = “data accessed every 5 minutes or less is in RAM, other data on disk”
- Warm data can be evicted from RAM to flash sooner, since flash read latency is low.
- Purity allows for a 10-minute rule instead.
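The five-minute rule comes from Gray and Putzolu's break-even formula, sketched below with entirely hypothetical prices and rates: a page is worth caching in RAM if it is re-accessed at least once per break-even interval. Flash delivers far more IOPS per dollar than disk, which shrinks the interval and shifts where the RAM/device boundary should sit.

```python
def break_even_seconds(pages_per_mb_ram, ios_per_sec, price_per_device,
                       price_per_mb_ram):
    """Gray & Putzolu break-even interval: cache a page in RAM if it is
    re-accessed at least this often; otherwise leave it on the device."""
    return (pages_per_mb_ram / ios_per_sec) * (price_per_device / price_per_mb_ram)

# Illustrative numbers only (not from the paper): a flash device with
# 100x the IOPS of a disk at the same price has a 100x shorter
# break-even interval, so far less data needs to live in RAM.
disk_interval = break_even_seconds(128, 100, 300, 0.01)
flash_interval = break_even_seconds(128, 10000, 300, 0.01)
```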
- Virtualization environments
- Purity also used to run many VMs at once
- Deduplication ratios of 5-10x are common.
- Cloud service providers
- Purity also used by CSPs (AWS, Microsoft, etc.)
- An 8-rack Purity deployment is comparable to 160 racks of OpenStack storage nodes.
6 Related Work
- LSM trees
- FAWN scale-out storage
- KV workload studies
- Sequence numbers similar to Corfu
- Flash hardware limitation studies
- Deduplication studies