The TiDB Paper is an industry paper with authors from a Chinese background. Unlike Snowflake, the company behind TiDB has not gone public.

https://www.vldb.org/pvldb/vol13/p3072-huang.pdf

1 Introduction

TiDB is an HTAP database, so it can handle both transactional queries and analytical queries.

TiDB looks like a SQL RDBMS to users: they connect with a SQL driver, write SQL DML, and so on.

The authors mention the following contributions:

  • They built a Raft-based HTAP database that provides high availability, consistency, scalability, data freshness, and isolation; that is, the system performs well on all of these properties.
  • Raft learner role
  • multi-Raft storage system
  • A new SQL engine that can choose between a column-store table scan and a row-store scan / index scan.
  • Benchmarks using CH-benCHmark

2 Raft-Based HTAP

Data is stored in multiple Raft groups.

The learner role asynchronously replicates data.

  • Question: how can asynchronous replication be strongly consistent?

The section mentions a few engineering challenges:

  • Raft performance: supporting concurrent reads and writes
  • Learner role performance
  • Transactional and analytical query performance

3 Architecture

3 layers: the distributed storage layer, the Placement Driver (PD), and the computation engine layer.

Distributed Storage Layer:

  • Row store (TiKV) and column store (TiFlash)

    • Key:{table{tableID} record{rowID}}
    • Value: {col0, col1, col2, col3}
  • Region abstraction: Data associated with a contiguous range of keys.

    • A Region is replicated across multiple nodes.
  • Raft used to maintain consistency.

  • Data asynchronously replicated from TiKV to TiFlash.
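
The row-store key/value layout above can be sketched in a few lines. This is a toy illustration of the idea only: the exact byte encoding (`encode_record_key` and its fixed-width big-endian fields) is my assumption, not TiKV's real format, but the essential property is the one shown: keys of the same table form a contiguous, sorted range, which is what makes the Region abstraction possible.

```python
# Sketch of the paper's row-store key layout:
#   Key:   table{tableID}_record{rowID}
#   Value: the encoded row (col0, col1, col2, ...)
# Fixed-width big-endian integers keep byte order == numeric order,
# so all rows of one table sort into one contiguous key range.

def encode_record_key(table_id: int, row_id: int) -> bytes:
    return b"t" + table_id.to_bytes(8, "big") + b"_r" + row_id.to_bytes(8, "big")

k1 = encode_record_key(42, 1)
k2 = encode_record_key(42, 2)
k_other = encode_record_key(43, 0)
assert k1 < k2 < k_other  # same table sorts together, ordered by rowID
```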

Placement Driver (PD)

  • Manages Regions (eg rebalancing Regions).
  • Manages global timestamps. Timestamps act as sequence IDs, aka logical clocks.
  • PD is stateless and can have replicas.
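
The "timestamps as sequence IDs / logical clocks" point can be illustrated with a minimal timestamp oracle. This is a toy stand-in for PD's interface, not its implementation (real PD allocates timestamps in batches and persists a high-water mark); the class and method names are assumptions.

```python
import threading

# Minimal sketch of a timestamp oracle: a counter that hands out
# strictly increasing timestamps, acting as a logical clock for
# ordering transactions cluster-wide.

class TimestampOracle:
    def __init__(self):
        self._ts = 0
        self._lock = threading.Lock()

    def get_ts(self) -> int:
        with self._lock:
            self._ts += 1
            return self._ts

pd = TimestampOracle()
start_ts = pd.get_ts()
commit_ts = pd.get_ts()
assert commit_ts > start_ts  # a commit ts always orders after the start ts
```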

Computation Engine

  • Also stateless and can scale out.
  • Implements 2PC based on Percolator.
  • Can read from both TiKV and TiFlash

TiSpark:

  • Can use Spark (eg. SparkSQL) to query data inside of TiDB.

4 Multi-Raft Storage

  • TiKV details:

    • Default Region size: 96 MB.

    • RocksDB under the hood. Used to store both data and metadata.

    • Raft read/write request optimization:

      • Parallelism: log entries are appended locally and sent to followers at the same time.
    • TiKV read:

      • Linearizable reads

      • Optimizations:

        • Read index: send a heartbeat to the follower nodes to confirm leadership. If the log’s applied index >= the query’s read index, the value can be returned without sending more requests to followers.
        • Lease read: the Raft leader and followers agree on a lease period (no leader election happens during that period), so the leader can serve reads without heartbeats.
        • Follower read: a follower returns the result once its applied index is equal to or greater than the (transaction’s?) read index.
    • Region splitting and merging (PD+TiKV)

      • Rebalancing workloads.

      • Each Region is kept on at least 3 replicas.

      • Dynamic Region Split and Merge

        • Apply split: update the Region’s range and epoch, creating new Regions.
        • Merge: replicas of the two Regions are moved to the same servers, then merged locally; a 2PC-style process stops requests to one of the Regions during the merge.
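
The common invariant behind the three read optimizations is: a replica may answer a linearizable read once its applied index has caught up to the read index (the leader's commit index at query time). A toy sketch, with illustrative names that are not TiKV's actual API:

```python
# A replica holds committed log entries and applies them in order;
# a read taken at read_index is safe once applied_index >= read_index.

class Replica:
    def __init__(self):
        self.log = []           # committed entries
        self.applied_index = 0  # how far the state machine has applied

    def append(self, entry):
        self.log.append(entry)

    def apply_next(self):
        self.applied_index += 1

    def try_read(self, read_index: int) -> bool:
        # Read-index / follower-read rule
        return self.applied_index >= read_index

follower = Replica()
for e in ["put x=1", "put x=2", "put x=3"]:
    follower.append(e)

read_index = 3                            # leader's commit index at query time
assert not follower.try_read(read_index)  # not yet applied: must wait
for _ in range(3):
    follower.apply_next()
assert follower.try_read(read_index)      # now safe to serve locally
```
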
  • Column-based Storage (TiFlash details):

    • Learner nodes mentioned again

    • Learner nodes receive logs from Raft leaders, “replay” logs

      • Log compaction process deletes invalid pre-written logs
      • Tuples are generated and stored in an in-memory “row buffer”.
      • After a time/size threshold is reached, the row buffer is converted into columnar data, written to data pool.
    • Raft log entries: “operation type[transaction status][@start ts][#commit ts]operation data”

    • Schema changes:

      • Learner nodes (TiFlash) maintain schema cache, occasionally ask TiKV for schemas.
      • Do both periodic sync and sync after error (column mismatch, etc.)
    • Columnar Delta Tree (storage engine)

      • Append delta updates immediately.
      • Later the updates are “merged”: deltas are converted into the columnar, non-delta format.
      • Normal format similar to Parquet. Columnar, with compression.
      • Note: this is similar to Delta Lake and Hudi.
      • Deltas stored in small files, so need to be periodically compacted to improve read I/O cost.
      • B+ tree used to reduce cost of queries. This allows for some types of queries to access delta data files less frequently.
      • Cons = worse write amplification.
    • Reads provide the snapshot isolation level.
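
The delta-plus-stable read path above can be sketched as a merge: a stable columnar chunk plus an append-only delta of recent updates, where a snapshot read applies only the deltas committed before the read timestamp. The data layout and names here are illustrative assumptions, not the Delta Tree's real structures.

```python
# stable: {row_id: value} (the merged columnar data)
# deltas: [(commit_ts, row_id, value)] (append-only recent updates)

def snapshot_read(stable: dict, deltas: list, read_ts: int) -> dict:
    result = dict(stable)
    for commit_ts, row_id, value in sorted(deltas):
        if commit_ts <= read_ts:    # snapshot isolation: ignore writes
            result[row_id] = value  # committed after the read timestamp
    return result

stable = {1: "a", 2: "b"}
deltas = [(5, 2, "b2"), (9, 3, "c")]
assert snapshot_read(stable, deltas, read_ts=6) == {1: "a", 2: "b2"}
assert snapshot_read(stable, deltas, read_ts=10) == {1: "a", 2: "b2", 3: "c"}
```

The B+ tree over deltas mentioned above exists to make exactly this merge cheaper: it avoids scanning every small delta file per read.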

5 HTAP Engines

  • What is the Percolator model? (Google’s scheme for multi-row transactions on Bigtable: client-driven 2PC, with locks stored alongside the data and one key chosen as the primary lock.)

  • Transactions:

    • Supports both Snapshot Isolation and Repeatable Read.
    • Both optimistic and pessimistic locking are supported.
    • Based on MVCC.
  • Optimistic transaction:

    • Begin: the SQL engine fetches a timestamp from PD (the start timestamp).
    • TiKV reads data with commit timestamp < start timestamp.
    • On commit, the SQL engine starts the 2PC process: it locks keys and sends pre-writes.
    • The SQL engine asks PD for another timestamp (the commit ts) and sends the commit to TiKV.
    • If success, SQL engine returns success to client.
    • Secondary keys and locks are cleared asynchronously.
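
The optimistic flow above can be condensed into a toy Percolator-style store: MVCC reads at `start_ts`, a prewrite phase that takes locks (aborting on conflict), and a commit phase at `commit_ts`. The in-memory classes are illustrative stand-ins for TiKV, not its API.

```python
class TinyStore:
    def __init__(self):
        self.versions = {}  # key -> [(commit_ts, value)]
        self.locks = {}     # key -> start_ts of the locking txn

    def read(self, key, start_ts):
        # MVCC read: newest version with commit_ts < start_ts
        for ts, v in sorted(self.versions.get(key, []), reverse=True):
            if ts < start_ts:
                return v
        return None

    def prewrite(self, key, start_ts) -> bool:
        if key in self.locks:
            return False    # write-write conflict: abort and retry
        self.locks[key] = start_ts
        return True

    def commit(self, key, value, commit_ts):
        self.versions.setdefault(key, []).append((commit_ts, value))
        del self.locks[key]

store = TinyStore()
start_ts = 10
assert store.read("x", start_ts) is None
assert store.prewrite("x", start_ts)         # phase 1: lock + prewrite
store.commit("x", "hello", commit_ts=12)     # phase 2: commit at commit_ts
assert store.read("x", start_ts=13) == "hello"
assert store.read("x", start_ts=11) is None  # older snapshot cannot see it
```
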
  • Pessimistic transaction:

    • Locks acquired when DMLs are executed (before pre-write).
    • Pessimistic transactions can be retried with the help of for_update_ts, a timestamp acquired when locking keys.
    • The read committed isolation level is also supported.
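
The `for_update_ts` retry idea can be sketched as a per-statement retry loop: each attempt to take a pessimistic lock uses a fresh `for_update_ts`, so only the statement (not the whole transaction) restarts on conflict. The function names and retry policy here are assumptions for illustration.

```python
def lock_with_retry(oracle, try_lock, max_retries=3):
    """oracle() yields fresh timestamps; try_lock(ts) attempts the lock."""
    for _ in range(max_retries):
        for_update_ts = oracle()     # fresh timestamp per attempt
        if try_lock(for_update_ts):
            return for_update_ts     # lock held; the DML may proceed
    raise RuntimeError("could not acquire pessimistic lock")

ts = iter(range(100, 200))
attempts = []
def try_lock(for_update_ts):
    attempts.append(for_update_ts)
    return len(attempts) == 2        # simulate: succeed on second attempt

got = lock_with_retry(lambda: next(ts), try_lock)
assert got == 101 and attempts == [100, 101]
```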
  • Timestamp generation is not a performance bottleneck currently.

  • No centralized lock manager. Locks are stored in TiKV.

  • Query optimization:

    • Rule-based and cost-based optimization.

      • Column pruning, projection elimination, predicate pushdown, predicate derivation, constant folding, and subquery unnesting.
    • Indexes

      • Asynchronously built and dropped
    • 2 phases of optimization: rule-based, then cost-based. The cost-based phase chooses among 3 access paths (row scan, index scan, column scan).

    • Binary search is used to find the Regions containing the relevant index data. Skyline pruning is used to discard irrelevant indexes.

    • Pull-based iterator model.

    • Computation is pushed to the storage layer, e.g. filters are evaluated in the coprocessor, with vectorized execution.

    • TiSpark

      • TiKV and TiFlash as storage layer.
      • It can push down computation to storage layer.
      • It can use index data.
      • Also supports bulk loading data.
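
The binary search over Regions mentioned above works because Regions cover sorted, contiguous key ranges: locating the Region for a key is a bisect over Region start keys. The Region boundaries below are made-up examples.

```python
import bisect

# Each Region covers [its start key, the next Region's start key).
region_starts = [b"a", b"g", b"p"]
region_ids = ["r1", "r2", "r3"]

def region_for_key(key: bytes) -> str:
    # Rightmost Region whose start key is <= key.
    i = bisect.bisect_right(region_starts, key) - 1
    return region_ids[max(i, 0)]

assert region_for_key(b"cat") == "r1"
assert region_for_key(b"moon") == "r2"
assert region_for_key(b"zebra") == "r3"
```
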
  • Isolation and Coordination

    • Resource isolation is improved since analytical and transactional queries go to different engine servers.
    • TiKV and TiFlash are expected to be run on different servers.
    • 3 access paths: row scan, index scan, column scan.
    • Query engine selects optimal access path based on cost estimate
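
The access-path selection above amounts to comparing cost estimates and taking the minimum. The cost formulas below are invented placeholders, not TiDB's cost model; only the shape of the decision is the point.

```python
def choose_access_path(row_count, selectivity, cols_needed, total_cols):
    # Toy cost estimates for the 3 access paths (coefficients are made up):
    costs = {
        "row scan": row_count * 1.0,
        "index scan": row_count * selectivity * 2.0,
        "column scan": row_count * (cols_needed / total_cols) * 0.5,
    }
    return min(costs, key=costs.get)

# A highly selective lookup favors the index scan...
assert choose_access_path(1_000_000, 0.00001, 10, 10) == "index scan"
# ...while scanning one column of a wide table favors the column scan.
assert choose_access_path(1_000_000, 0.5, 1, 10) == "column scan"
```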

6 Experiments

  • 6 servers

  • 10Gbps Ethernet

  • CH-benCHmark

  • Throughput and average latency.

    • Throughput increases with the number of clients.
  • TiFlash and TiKV are faster than SparkSQL?

  • TiSpark is comparable to SparkSQL, but not better.

  • HTAP:

    • Analytical processing degrades TP by at most 10%.
  • Log replication delay:

    • <100 ms; more data means higher replication latency.
7 Related Work

  • Evolving from an existing database: Oracle and SQL Server introduce multiple storage engines.

  • SAP HANA

  • Transforming open-source systems:

    • Apache Spark, Wildfire HTAP engine
    • SnappyData: Spark + an in-memory engine
  • New databases:

    • MemSQL
    • HyPer
    • BatchDB
    • Lineage-based data store
    • Peloton
    • CockroachDB