The TiDB paper is an industry paper by authors from PingCAP, a Chinese company. Unlike Snowflake, the company behind TiDB has not gone public.
https://www.vldb.org/pvldb/vol13/p3072-huang.pdf
1 Introduction
TiDB is an HTAP database, so it can handle both transactional queries and analytical queries.
TiDB resembles a SQL RDBMS: users interact with it through SQL (use a SQL driver, write SQL DML, etc.).
The authors mention the following contributions:
- They built a Raft-based HTAP database that provides high availability, consistency, scalability, data freshness, and isolation. That is, the system performs well on availability, consistency, and so on.
- Raft learner role
- multi-Raft storage system
- A new SQL engine. This SQL engine can choose between a column-store table scan and a row-store scan/index scan.
- Benchmarks using CH-benCHmark
2 Raft-Based HTAP
Data is stored in multiple Raft groups.
The learner role asynchronously replicates data.
- Question: how can asynchronous replication be strongly consistent?
The section mentions a few engineering challenges:
- Raft performance: supporting concurrent reads and writes
- Learner role performance
- Transactional and analytical query performance
3 Architecture
3 layers: a distributed storage layer, the Placement Driver (PD), and a computation engine layer
Distributed Storage Layer:
- Row store (TiKV) and column store (TiFlash).
- Key: {table{tableID} record{rowID}}
- Value: {col0, col1, col2, col3}
- Region abstraction: data associated with a contiguous range of keys.
- A Region is replicated across multiple nodes.
- Raft is used to maintain consistency.
- Data is asynchronously replicated from TiKV to TiFlash.
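The key/value layout above can be sketched as a toy encoder (the byte layout and the `encode_key` name are illustrative assumptions; real TiKV uses a memcomparable encoding):

```python
def encode_key(table_id: int, row_id: int) -> bytes:
    # Illustrative layout: "t" + table ID, then "_r" + row ID,
    # both big-endian so byte-wise ordering matches numeric ordering.
    return b"t" + table_id.to_bytes(8, "big") + b"_r" + row_id.to_bytes(8, "big")

# Keys for rows of the same table are contiguous, so a Region
# (a contiguous key range) naturally covers a slice of one table.
k1 = encode_key(1, 1)
k2 = encode_key(1, 2)
k3 = encode_key(2, 1)
assert k1 < k2 < k3
```

Because all keys of one table share a prefix and sort numerically, range scans over a table map to contiguous key ranges (Regions).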
Placement Driver (PD)
- Manages Regions (e.g., rebalancing Regions).
- Manages global timestamps. Timestamps act as sequence IDs, aka logical clocks.
- PD is stateless and can have replicas.
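PD's timestamp role can be sketched as a toy oracle (hypothetical class; real PD allocates timestamps in batches and persists a high-water mark for recovery):

```python
import threading

class TimestampOracle:
    """Monotonically increasing logical clock, as PD provides (sketch)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._ts = 0

    def get_ts(self) -> int:
        with self._lock:
            self._ts += 1
            return self._ts

oracle = TimestampOracle()
start_ts = oracle.get_ts()
commit_ts = oracle.get_ts()
assert commit_ts > start_ts  # later timestamps always order after earlier ones
```

Monotonicity is what lets timestamps act as sequence IDs / logical clocks for MVCC reads and 2PC commits.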
Computation Engine
- Also stateless and can scale out.
- Implements 2PC based on Percolator.
- Can read from both TiKV and TiFlash
TiSpark:
- Can use Spark (e.g., SparkSQL) to query data inside of TiDB.
4 Multi-Raft Storage
- TiKV details:
  - Region size: 96MB.
  - RocksDB under the hood, used to store both data and metadata.
  - Raft read/write request optimization:
    - Parallelism: append and send log entries at the same time.
-
TiKV read:
-
Linearlizable reads
-
Optimizations:
- Read index: Send heartbeat to follower nodes. If log’s applied index >= query’s read index, then return value without sending more requests to followers.
- Lease read: Raft leader and follower agree on lease period (don’t change eleader lection during a period of time.
- Follower read: Follower return result if follower’s applied index is equal to or greater than the (transaction’s?) read index.
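The applied-index check behind these read optimizations can be sketched as a toy replica (illustrative model, not TiKV's API):

```python
class Replica:
    """Toy model of read-index-based reads on a Raft replica."""
    def __init__(self):
        self.log = []          # committed entries: (key, value)
        self.state = {}        # applied state machine
        self.applied_index = 0

    def append(self, key, value):
        self.log.append((key, value))

    def apply_up_to(self, index):
        while self.applied_index < index:
            key, value = self.log[self.applied_index]
            self.state[key] = value
            self.applied_index += 1

    def read(self, key, read_index):
        # Serve the read only once we have applied past read_index;
        # otherwise the caller must wait (here we just signal that).
        if self.applied_index < read_index:
            return None  # not safe to serve yet
        return self.state.get(key)

r = Replica()
r.append("a", 1)
r.append("a", 2)
read_index = len(r.log)          # leader's commit index at read time
assert r.read("a", read_index) is None   # not applied yet
r.apply_up_to(read_index)
assert r.read("a", read_index) == 2
```

The same check works for the leader (read index), leaseholders, and followers; they differ only in how the read index is obtained.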
- Region splitting and merging (PD + TiKV):
  - Used to rebalance workloads.
  - Each Region must have at least 3 replicas.
  - Dynamic Region split and merge:
    - Split: update the Region's range and epoch, creating new Regions.
    - Merge: replicas are moved to the same servers and merged locally, using 2PC and stopping requests to one of the Regions.
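The split step can be sketched as follows (toy model; the `Region` fields and the epoch-bump rule are assumptions based on the notes above):

```python
from dataclasses import dataclass

@dataclass
class Region:
    start_key: bytes
    end_key: bytes
    epoch: int

def split(region: Region, split_key: bytes) -> tuple[Region, Region]:
    # Splitting divides the key range at split_key and bumps the epoch,
    # so routers holding stale Region metadata can detect it.
    assert region.start_key < split_key < region.end_key
    new_epoch = region.epoch + 1
    left = Region(region.start_key, split_key, new_epoch)
    right = Region(split_key, region.end_key, new_epoch)
    return left, right

left, right = split(Region(b"a", b"z", epoch=1), b"m")
assert left.end_key == right.start_key == b"m"
assert left.epoch == right.epoch == 2
```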
Column-based Storage (TiFlash details):
- Learner nodes mentioned again.
- Learner nodes receive logs from Raft leaders and "replay" them:
  - A log compaction process deletes invalid pre-written logs.
  - Tuples are generated and stored in an in-memory "row buffer".
  - After a time/size threshold is reached, the row buffer is converted into columnar data and written to the data pool.
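The row-buffer conversion can be sketched as (illustrative; the threshold value and names are assumptions):

```python
ROW_LIMIT = 3  # illustrative size threshold

row_buffer = []    # rows arrive from replayed Raft logs
column_pool = []   # "data pool" of columnar chunks

def flush():
    # Transpose buffered rows into one list-per-column chunk.
    chunk = {col: [r[col] for r in row_buffer] for col in row_buffer[0]}
    column_pool.append(chunk)
    row_buffer.clear()

def append_row(row: dict):
    row_buffer.append(row)
    if len(row_buffer) >= ROW_LIMIT:
        flush()

for i in range(3):
    append_row({"id": i, "val": i * 10})

assert row_buffer == []
assert column_pool == [{"id": [0, 1, 2], "val": [0, 10, 20]}]
```

Writes stay cheap (row append), while reads eventually see compressed columnar chunks.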
- Raft log entry format: "operation type[transaction status][@start ts][#commit ts]operation data"
- Schema changes:
  - Learner nodes (TiFlash) maintain a schema cache and occasionally ask TiKV for schemas.
  - Both periodic syncs and syncs after errors (column mismatch, etc.) are performed.
- Columnar Delta Tree (storage engine):
  - Delta updates are appended immediately.
  - Updates are later "merged": deltas are turned into a columnar, non-delta format.
  - The stable format is similar to Parquet: columnar, with compression.
  - Note: this is similar to Delta Lake and Hudi.
  - Deltas are stored in small files, so they need to be periodically compacted to improve read I/O cost.
  - A B+ tree is used to reduce query cost: some types of queries can access the delta data files less frequently.
  - Cons: worse write amplification.
- Reads provide the snapshot isolation level.
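The delta-append-then-merge idea can be sketched as (toy model; a real Delta Tree organizes deltas under a B+ tree and handles deletes and MVCC versions):

```python
stable = {"id": [1, 2, 3], "val": [10, 20, 30]}  # columnar stable layer
delta = []  # append-only delta layer: (row_id, new_val)

def write(row_id: int, new_val: int):
    delta.append((row_id, new_val))  # appends are cheap

def merge():
    # Fold deltas into the columnar stable layer, then clear them.
    # This is the step that causes extra write amplification.
    for row_id, new_val in delta:
        i = stable["id"].index(row_id)
        stable["val"][i] = new_val
    delta.clear()

write(2, 99)
merge()
assert stable["val"] == [10, 99, 30]
assert delta == []
```

Until a merge runs, readers must consult both layers; that is the read cost the B+ tree over deltas helps reduce.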
5 HTAP Engines
- What is the Percolator model?
- Transactions:
  - Supports both Snapshot Isolation and Repeatable Read.
  - Both optimistic and pessimistic locking are supported.
  - Based on MVCC.
- Optimistic transactions:
  - On begin, the SQL engine fetches a timestamp from PD (the start timestamp).
  - TiKV reads data with commit timestamp < start timestamp.
  - On commit, the SQL engine starts the 2PC process, locks keys, and sends pre-writes.
  - The SQL engine asks PD for another timestamp (the commit timestamp) and sends the commit to TiKV.
  - On success, the SQL engine returns success to the client.
  - Secondary keys and locks are cleared asynchronously.
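The optimistic commit flow can be sketched as a toy Percolator-style 2PC (the dict-based storage and counter oracle are stand-ins, not TiDB's API):

```python
next_ts = 0
def get_ts():
    # Stand-in for PD's timestamp oracle.
    global next_ts
    next_ts += 1
    return next_ts

store = {}   # key -> list of (commit_ts, value), oldest first
locks = {}   # key -> primary key of the locking transaction

def read(key, start_ts):
    # Snapshot read: newest version committed before our start timestamp.
    versions = [v for ts, v in store.get(key, []) if ts < start_ts]
    return versions[-1] if versions else None

def commit(writes):
    start_ts = get_ts()
    keys = list(writes)
    # Phase 1 (prewrite): abort on a lock or write conflict, then lock all keys.
    if any(k in locks or any(ts > start_ts for ts, _ in store.get(k, []))
           for k in keys):
        return False
    for k in keys:
        locks[k] = keys[0]  # the first key acts as the "primary"
    # Phase 2 (commit): write versions at a fresh commit_ts, release locks.
    commit_ts = get_ts()
    for k in keys:
        store.setdefault(k, []).append((commit_ts, writes[k]))
        del locks[k]  # in TiDB, secondary locks are cleared asynchronously
    return True

assert commit({"a": 1, "b": 2})
assert read("a", get_ts()) == 1
```

Note there is no lock server: as the notes below say, locks live alongside the data in the store itself.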
- Pessimistic transactions:
  - Locks are acquired when DML statements are executed (before pre-write).
  - Pessimistic transactions can be retried with the help of for_update_ts, a timestamp acquired when locking keys.
  - The read committed isolation level is also supported.
- Timestamp generation is currently not a performance bottleneck.
- There is no centralized lock manager; locks are stored in TiKV.
- Query optimization:
  - Both rule-based and cost-based optimization.
    - Rules: column pruning, projection elimination, predicate pushdown, predicate derivation, constant folding, subquery unnesting.
  - Indexes:
    - Built and dropped asynchronously.
  - Two phases of optimization: rule-based, then cost-based, choosing among three access paths.
  - Binary search is used to find the Regions containing index data. Skyline pruning is used to ignore irrelevant indexes.
  - Pull-based iterator model.
  - Computation at the storage layer. Example: evaluating filters in the coprocessor, with vectorization.
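Pushing a filter down to the storage layer can be sketched as (toy coprocessor; the names and data are invented for illustration):

```python
# Toy "coprocessor": evaluate the filter where the data lives and
# ship back only matching rows, instead of shipping the whole Region.
region_data = [{"id": i, "val": i * 3} for i in range(100)]

def coprocessor_scan(predicate):
    # Vectorized-style evaluation: apply the predicate over the whole batch.
    return [row for row in region_data if predicate(row)]

rows = coprocessor_scan(lambda r: r["val"] > 270)
assert [r["id"] for r in rows] == [91, 92, 93, 94, 95, 96, 97, 98, 99]
```

The win is network I/O: 9 rows cross the wire instead of 100.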
- TiSpark:
  - Uses TiKV and TiFlash as the storage layer.
  - Can push down computation to the storage layer.
  - Can use index data.
  - Also supports bulk loading of data.
- Isolation and Coordination:
  - Resource isolation is improved since analytical and transactional queries go to different engine servers.
  - TiKV and TiFlash are expected to run on different servers.
  - 3 access paths: row scan, index scan, column scan.
  - The query engine selects the optimal access path based on a cost estimate.
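Access-path selection can be sketched with a toy cost model (the cost formulas here are invented for illustration, not TiDB's estimator):

```python
def choose_access_path(selectivity, has_index, cols_needed, total_cols):
    # Illustrative cost model:
    # - row scan: baseline full-row scan cost
    # - index scan: cheap when the predicate is selective
    # - column scan: cheap when only a few of many columns are read
    costs = {"row scan": 1.0,
             "column scan": 1.5 * cols_needed / total_cols}
    if has_index:
        costs["index scan"] = 2.0 * selectivity
    return min(costs, key=costs.get)

assert choose_access_path(0.001, True, 10, 10) == "index scan"   # selective lookup
assert choose_access_path(0.9, True, 1, 10) == "column scan"     # wide scan, few columns
assert choose_access_path(0.9, False, 9, 10) == "row scan"       # no index, most columns
```

The point is that one optimizer can route the same query to TiKV (row/index) or TiFlash (column) purely on estimated cost.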
6 Experiments
- 6 servers.
- 10Gbps Ethernet.
- CH-benCHmark.
- Throughput and average latency measured.
  - Throughput increases with # of clients.
- …
- TiFlash and TiKV are faster than SparkSQL?
- TiSpark is comparable to SparkSQL, but not better.
- HTAP:
  - Analytical processing degrades TP by at most 10%.
- Log replication delay:
  - <100ms; more data means larger latency.
7 Related Work
- Evolving from an existing database: Oracle and SQL Server introduced multiple storage engines.
- SAP HANA.
- Transforming open-source systems:
  - Apache Spark: Wildfire HTAP engine.
  - SnappyData: Spark + an in-memory engine.
- New databases:
  - MemSQL
  - HyPer
  - BatchDB
  - Lineage-based data store
  - Peloton
  - CockroachDB