The TiDB paper is an industry paper by authors from PingCAP, a Chinese company. Unlike Snowflake, the company behind TiDB has not gone public.
https://www.vldb.org/pvldb/vol13/p3072-huang.pdf
1 Introduction
TiDB is an HTAP database, so it can handle both transactional queries and analytical queries.
TiDB resembles a SQL RDBMS: users interact with it through SQL (use a SQL driver, write SQL DML, etc.).
The authors mention the following contributions:
- They built a Raft-based HTAP database that provides high availability, consistency, scalability, data freshness, and isolation. That is, the system performs well on availability, consistency, and so on.
- Raft learner role
- multi-Raft storage system
- A new SQL engine. This SQL engine can choose between a column-store table scan and a row-store scan/index scan.
- Benchmarks using CH-benCHmark
2 Raft-Based HTAP
Data is stored in multiple Raft groups.
The learner role asynchronously replicates data.
- Question: how can asynchronous replication be strongly consistent?
The section mentions a few engineering challenges:
- Raft performance: supporting concurrent reads and writes
- Learner role performance
- Transactional and analytical query performance
3 Architecture
3 layers: a distributed storage layer, the Placement Driver (PD), and a computation engine layer
Distributed Storage Layer:
- Row store (TiKV) and column store (TiFlash).
- Key: {table{tableID} record{rowID}}
- Value: {col0, col1, col2, col3}
- Region abstraction: data associated with a contiguous range of keys.
- A Region is replicated across multiple nodes.
- Raft is used to maintain consistency.
- Data is asynchronously replicated from TiKV to TiFlash.
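The key/value layout above can be sketched as a toy encoder (the byte layout and the `encode_key` name are illustrative assumptions; real TiKV uses a memcomparable encoding):

```python
def encode_key(table_id: int, row_id: int) -> bytes:
    # Illustrative layout: "t" + table ID, then "_r" + row ID,
    # both big-endian so byte-wise ordering matches numeric ordering.
    return b"t" + table_id.to_bytes(8, "big") + b"_r" + row_id.to_bytes(8, "big")

# Keys for rows of the same table are contiguous, so a Region
# (a contiguous key range) naturally covers a slice of one table.
k1 = encode_key(1, 1)
k2 = encode_key(1, 2)
k3 = encode_key(2, 1)
assert k1 < k2 < k3
```

Because all keys of one table share a prefix and sort numerically, range scans over a table map to contiguous key ranges (Regions).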
Placement Driver (PD)
- Manages Regions (e.g., rebalancing Regions).
- Manages global timestamps. Timestamps act as sequence IDs, aka logical clocks.
- PD is stateless and can have replicas.
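PD's timestamp role can be sketched as a toy oracle (hypothetical class; real PD allocates timestamps in batches and persists a high-water mark for recovery):

```python
import threading

class TimestampOracle:
    """Monotonically increasing logical clock, as PD provides (sketch)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._ts = 0

    def get_ts(self) -> int:
        with self._lock:
            self._ts += 1
            return self._ts

oracle = TimestampOracle()
start_ts = oracle.get_ts()
commit_ts = oracle.get_ts()
assert commit_ts > start_ts  # later timestamps always order after earlier ones
```

Monotonicity is what lets timestamps act as sequence IDs / logical clocks for MVCC reads and 2PC commits.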
Computation Engine
- Also stateless and can scale out.
- Implements 2PC based on Percolator.
- Can read from both TiKV and TiFlash
TiSpark:
- Can use Spark (e.g., SparkSQL) to query data inside of TiDB.
4 Multi-Raft Storage
- TiKV details:
  - Region size: 96MB.
  - RocksDB under the hood, used to store both data and metadata.
  - Raft read/write request optimization:
    - Parallelism: append and send log entries at the same time.
-
TiKV read:
-
Linearlizable reads
-
Optimizations:
- Read index: Send heartbeat to follower nodes. If log’s applied index >= query’s read index, then return value without sending more requests to followers.
- Lease read: Raft leader and follower agree on lease period (don’t change eleader lection during a period of time.
- Follower read: Follower return result if follower’s applied index is equal to or greater than the (transaction’s?) read index.
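The applied-index check behind these read optimizations can be sketched as a toy replica (illustrative model, not TiKV's API):

```python
class Replica:
    """Toy model of read-index-based reads on a Raft replica."""
    def __init__(self):
        self.log = []          # committed entries: (key, value)
        self.state = {}        # applied state machine
        self.applied_index = 0

    def append(self, key, value):
        self.log.append((key, value))

    def apply_up_to(self, index):
        while self.applied_index < index:
            key, value = self.log[self.applied_index]
            self.state[key] = value
            self.applied_index += 1

    def read(self, key, read_index):
        # Serve the read only once we have applied past read_index;
        # otherwise the caller must wait (here we just signal that).
        if self.applied_index < read_index:
            return None  # not safe to serve yet
        return self.state.get(key)

r = Replica()
r.append("a", 1)
r.append("a", 2)
read_index = len(r.log)          # leader's commit index at read time
assert r.read("a", read_index) is None   # not applied yet
r.apply_up_to(read_index)
assert r.read("a", read_index) == 2
```

The same check works for the leader (read index), leaseholders, and followers; they differ only in how the read index is obtained.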
- Region splitting and merging (PD + TiKV):
  - Used to rebalance workloads.
  - Each Region must have at least 3 replicas.
  - Dynamic Region split and merge:
    - Split: update the Region's range and epoch, creating new Regions.
    - Merge: replicas are moved to the same servers and merged locally, using 2PC and stopping requests to one of the Regions.
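The split step can be sketched as follows (toy model; the `Region` fields and the epoch-bump rule are assumptions based on the notes above):

```python
from dataclasses import dataclass

@dataclass
class Region:
    start_key: bytes
    end_key: bytes
    epoch: int

def split(region: Region, split_key: bytes) -> tuple[Region, Region]:
    # Splitting divides the key range at split_key and bumps the epoch,
    # so routers holding stale Region metadata can detect it.
    assert region.start_key < split_key < region.end_key
    new_epoch = region.epoch + 1
    left = Region(region.start_key, split_key, new_epoch)
    right = Region(split_key, region.end_key, new_epoch)
    return left, right

left, right = split(Region(b"a", b"z", epoch=1), b"m")
assert left.end_key == right.start_key == b"m"
assert left.epoch == right.epoch == 2
```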
Column-based Storage (TiFlash details):
- Learner nodes mentioned again.
- Learner nodes receive logs from Raft leaders and "replay" them:
  - A log compaction process deletes invalid pre-written logs.
  - Tuples are generated and stored in an in-memory "row buffer".
  - After a time/size threshold is reached, the row buffer is converted into columnar data and written to the data pool.
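The row-buffer conversion can be sketched as (illustrative; the threshold value and names are assumptions):

```python
ROW_LIMIT = 3  # illustrative size threshold

row_buffer = []    # rows arrive from replayed Raft logs
column_pool = []   # "data pool" of columnar chunks

def flush():
    # Transpose buffered rows into one list-per-column chunk.
    chunk = {col: [r[col] for r in row_buffer] for col in row_buffer[0]}
    column_pool.append(chunk)
    row_buffer.clear()

def append_row(row: dict):
    row_buffer.append(row)
    if len(row_buffer) >= ROW_LIMIT:
        flush()

for i in range(3):
    append_row({"id": i, "val": i * 10})

assert row_buffer == []
assert column_pool == [{"id": [0, 1, 2], "val": [0, 10, 20]}]
```

Writes stay cheap (row append), while reads eventually see compressed columnar chunks.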
- Raft log entry format: "operation type[transaction status][@start ts][#commit ts]operation data"
- Schema changes:
  - Learner nodes (TiFlash) maintain a schema cache and occasionally ask TiKV for schemas.
  - Both periodic syncs and syncs after errors (column mismatch, etc.) are performed.
- Columnar Delta Tree (storage engine):
  - Delta updates are appended immediately.
  - Updates are later "merged": deltas are turned into a columnar, non-delta format.
  - The stable format is similar to Parquet: columnar, with compression.
  - Note: this is similar to Delta Lake and Hudi.
  - Deltas are stored in small files, so they need to be periodically compacted to improve read I/O cost.
  - A B+ tree is used to reduce query cost: some types of queries can access the delta data files less frequently.
  - Cons: worse write amplification.
- Reads provide the snapshot isolation level.
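The delta-append-then-merge idea can be sketched as (toy model; a real Delta Tree organizes deltas under a B+ tree and handles deletes and MVCC versions):

```python
stable = {"id": [1, 2, 3], "val": [10, 20, 30]}  # columnar stable layer
delta = []  # append-only delta layer: (row_id, new_val)

def write(row_id: int, new_val: int):
    delta.append((row_id, new_val))  # appends are cheap

def merge():
    # Fold deltas into the columnar stable layer, then clear them.
    # This is the step that causes extra write amplification.
    for row_id, new_val in delta:
        i = stable["id"].index(row_id)
        stable["val"][i] = new_val
    delta.clear()

write(2, 99)
merge()
assert stable["val"] == [10, 99, 30]
assert delta == []
```

Until a merge runs, readers must consult both layers; that is the read cost the B+ tree over deltas helps reduce.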
5 HTAP Engines
- What is the Percolator model?
- Transactions:
  - Supports both Snapshot Isolation and Repeatable Read.
  - Both optimistic and pessimistic locking are supported.
  - Based on MVCC.
- Optimistic transactions:
  - On begin, the SQL engine fetches a timestamp from PD (the start timestamp).
  - TiKV reads data with commit timestamp < start timestamp.
  - On commit, the SQL engine starts the 2PC process, locks keys, and sends pre-writes.
  - The SQL engine asks PD for another timestamp (the commit timestamp) and sends the commit to TiKV.
  - On success, the SQL engine returns success to the client.
  - Secondary keys and locks are cleared asynchronously.
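The optimistic commit flow can be sketched as a toy Percolator-style 2PC (the dict-based storage and counter oracle are stand-ins, not TiDB's API):

```python
next_ts = 0
def get_ts():
    # Stand-in for PD's timestamp oracle.
    global next_ts
    next_ts += 1
    return next_ts

store = {}   # key -> list of (commit_ts, value), oldest first
locks = {}   # key -> primary key of the locking transaction

def read(key, start_ts):
    # Snapshot read: newest version committed before our start timestamp.
    versions = [v for ts, v in store.get(key, []) if ts < start_ts]
    return versions[-1] if versions else None

def commit(writes):
    start_ts = get_ts()
    keys = list(writes)
    # Phase 1 (prewrite): abort on a lock or write conflict, then lock all keys.
    if any(k in locks or any(ts > start_ts for ts, _ in store.get(k, []))
           for k in keys):
        return False
    for k in keys:
        locks[k] = keys[0]  # the first key acts as the "primary"
    # Phase 2 (commit): write versions at a fresh commit_ts, release locks.
    commit_ts = get_ts()
    for k in keys:
        store.setdefault(k, []).append((commit_ts, writes[k]))
        del locks[k]  # in TiDB, secondary locks are cleared asynchronously
    return True

assert commit({"a": 1, "b": 2})
assert read("a", get_ts()) == 1
```

Note there is no lock server: as the notes below say, locks live alongside the data in the store itself.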
- Pessimistic transactions:
  - Locks are acquired when DML statements are executed (before pre-write).
  - Pessimistic transactions can be retried with the help of for_update_ts, a timestamp acquired when locking keys.
  - The read committed isolation level is also supported.
- Timestamp generation is currently not a performance bottleneck.
- There is no centralized lock manager; locks are stored in TiKV.
- Query optimization:
  - Both rule-based and cost-based optimization.
    - Rules: column pruning, projection elimination, predicate pushdown, predicate derivation, constant folding, subquery unnesting.
  - Indexes:
    - Built and dropped asynchronously.
  - Two phases of optimization: rule-based, then cost-based, choosing among three access paths.
  - Binary search is used to find the Regions containing index data. Skyline pruning is used to ignore irrelevant indexes.
  - Pull-based iterator model.
  - Computation at the storage layer. Example: evaluating filters in the coprocessor, with vectorization.
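Pushing a filter down to the storage layer can be sketched as (toy coprocessor; the names and data are invented for illustration):

```python
# Toy "coprocessor": evaluate the filter where the data lives and
# ship back only matching rows, instead of shipping the whole Region.
region_data = [{"id": i, "val": i * 3} for i in range(100)]

def coprocessor_scan(predicate):
    # Vectorized-style evaluation: apply the predicate over the whole batch.
    return [row for row in region_data if predicate(row)]

rows = coprocessor_scan(lambda r: r["val"] > 270)
assert [r["id"] for r in rows] == [91, 92, 93, 94, 95, 96, 97, 98, 99]
```

The win is network I/O: 9 rows cross the wire instead of 100.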
- TiSpark:
  - Uses TiKV and TiFlash as the storage layer.
  - Can push down computation to the storage layer.
  - Can use index data.
  - Also supports bulk loading of data.
- Isolation and Coordination:
  - Resource isolation is improved since analytical and transactional queries go to different engine servers.
  - TiKV and TiFlash are expected to run on different servers.
  - 3 access paths: row scan, index scan, column scan.
  - The query engine selects the optimal access path based on a cost estimate.
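Access-path selection can be sketched with a toy cost model (the cost formulas here are invented for illustration, not TiDB's estimator):

```python
def choose_access_path(selectivity, has_index, cols_needed, total_cols):
    # Illustrative cost model:
    # - row scan: baseline full-row scan cost
    # - index scan: cheap when the predicate is selective
    # - column scan: cheap when only a few of many columns are read
    costs = {"row scan": 1.0,
             "column scan": 1.5 * cols_needed / total_cols}
    if has_index:
        costs["index scan"] = 2.0 * selectivity
    return min(costs, key=costs.get)

assert choose_access_path(0.001, True, 10, 10) == "index scan"   # selective lookup
assert choose_access_path(0.9, True, 1, 10) == "column scan"     # wide scan, few columns
assert choose_access_path(0.9, False, 9, 10) == "row scan"       # no index, most columns
```

The point is that one optimizer can route the same query to TiKV (row/index) or TiFlash (column) purely on estimated cost.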
6 Experiments
- 6 servers.
- 10Gbps Ethernet.
- CH-benCHmark.
- Throughput and average latency measured.
  - Throughput increases with # of clients.
- …
- TiFlash and TiKV are faster than SparkSQL?
- TiSpark is comparable to SparkSQL, but not better.
- HTAP:
  - Analytical processing degrades TP by at most 10%.
- Log replication delay:
  - <100ms; more data means larger latency.
7 Related Work
- Evolving from an existing database: Oracle and SQL Server introduced multiple storage engines.
- SAP HANA.
- Transforming open-source systems:
  - Apache Spark: Wildfire HTAP engine.
  - SnappyData: Spark + an in-memory engine.
- New databases:
  - MemSQL
  - HyPer
  - BatchDB
  - Lineage-based data store
  - Peloton
  - CockroachDB