Author(s): CalvinNeo
TiFlash Proxy used to be a fork of TiKV that replicates data from TiKV to TiFlash. However, since TiKV evolves rapidly, this brings lots of trouble for the old version of Proxy:
- Proxy can't cherry-pick TiKV's bugfixes in time.
- It is hard to take Proxy into account when developing TiKV.
Generally speaking, there are two storage components in TiKV for maintaining the multi-raft replicated state machine (RSM): `RaftEngine` and `KvEngine`.
- `KvEngine` is mainly used for applying raft commands and providing key-value services. Multiple modifications to region data/meta/apply-state are encapsulated into one `Write Batch` and written into `KvEngine` atomically.
- `RaftEngine` will parse its own committed raft log into corresponding normal/admin raft commands, which will be handled by the apply process.
One option is to wrap a self-defined `KvEngine` with TiKV's `Engine Traits`. This new `KvEngine` holds an original TiKV `RocksEngine` and does the following filtering:
- For metadata like `RaftApplyState`, we store it in `RocksEngine`.
- For KV data in the `write`/`lock`/`default` CFs, we forward it to TiFlash and no longer write it to `RocksEngine`.
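The filtering can be sketched as a thin wrapper that routes each write by column family. This is a simplified illustration, not the actual `engine_tiflash` code: `RocksEngineStub`, `EngineStoreSink`, `TiFlashEngine` and `RecordingSink` are hypothetical names, and only a `put_cf`-style write path is shown.

```rust
use std::collections::HashMap;

// TiKV's column family names.
const CF_DEFAULT: &str = "default";
const CF_LOCK: &str = "lock";
const CF_WRITE: &str = "write";
const CF_RAFT: &str = "raft";

/// Hypothetical in-memory stand-in for the wrapped RocksEngine: CF -> (key -> value).
#[derive(Default)]
struct RocksEngineStub {
    cfs: HashMap<String, HashMap<Vec<u8>, Vec<u8>>>,
}

/// Hypothetical channel through which KV data is handed to TiFlash.
trait EngineStoreSink {
    fn forward(&mut self, cf: &str, key: &[u8], value: &[u8]);
}

/// Self-defined KvEngine: keeps metadata in RocksEngine, forwards KV data to TiFlash.
struct TiFlashEngine<S: EngineStoreSink> {
    rocks: RocksEngineStub,
    sink: S,
}

impl<S: EngineStoreSink> TiFlashEngine<S> {
    fn put_cf(&mut self, cf: &str, key: &[u8], value: &[u8]) {
        match cf {
            // KV data: forwarded to TiFlash, never written to RocksEngine.
            CF_DEFAULT | CF_LOCK | CF_WRITE => self.sink.forward(cf, key, value),
            // Everything else (e.g. RaftApplyState in the raft CF) stays in RocksEngine.
            _ => {
                self.rocks
                    .cfs
                    .entry(cf.to_string())
                    .or_default()
                    .insert(key.to_vec(), value.to_vec());
            }
        }
    }
}

/// Test double that records which (cf, key) pairs were forwarded.
#[derive(Default)]
struct RecordingSink {
    forwarded: Vec<(String, Vec<u8>)>,
}

impl EngineStoreSink for RecordingSink {
    fn forward(&mut self, cf: &str, key: &[u8], _value: &[u8]) {
        self.forwarded.push((cf.to_string(), key.to_vec()));
    }
}
```

Because the decision is made per column family, the apply state (kept in the `raft` CF, here `CF_RAFT`) still benefits from RocksDB's atomic writes, while bulk KV data bypasses RocksDB entirely.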
However, such a replacement may be costly:
- For other storage systems, it's not easy to guarantee atomicity while writing/reading dynamic key-value pairs (such as meta/apply-state) and patterned (strong-schema) data together.
- A few modules and components (like importer or Lightning) rely on the SST format of `KvEngine` in TiKV. For example, those SST files would have to be transformed to fit a columnar storage engine.
- A flush to the storage layer may be more expensive in other storage engines than in TiKV.
Apart from `Engine Traits`, we also need `coprocessor`s to observe and control TiKV's applying procedures:
- An observer before the execution of raft commands, which can:
  - Filter commands incompatible with TiFlash.
  - Notify TiFlash to do some preliminary work.
- An observer after the execution of raft commands, which can suggest a persistence.
- An observer when receiving tasks for applying snapshots, which allows TiFlash to pre-handle multiple snapshots in parallel.
- An observer after snapshots are applied, which informs TiFlash to do an immediate persistence.
- An observer when a peer is destroyed.
- An observer for empty raft entries.
  - Empty raft entries can't be ignored; otherwise, wait index requests may time out.
- An observer to fetch the used/total size of storage.
  - TiFlash supports multiple disks, so we need to report correct storage information.
- An observer that controls whether to persist before calling `finish_for` and `commit`.
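The hooks above can be pictured as a trait with default no-op methods that TiKV calls during apply and that the Proxy overrides. The sketch below is hypothetical: `ApplyObserver`, its method names, `Cmd`, `AdminCmd` and `StoreSize` are illustrative and do not reproduce TiKV's coprocessor API. The TiFlash-side policy of skipping `ComputeHash`/`VerifyHash` admin commands is likewise an assumed example of filtering incompatible commands.

```rust
/// Hypothetical admin command kinds (a small subset, for illustration only).
#[derive(Debug, PartialEq)]
enum AdminCmd {
    Split,
    ComputeHash,
    VerifyHash,
}

/// A decoded raft command: normal (put/delete) or admin.
#[derive(Debug, PartialEq)]
enum Cmd {
    Normal,
    Admin(AdminCmd),
}

/// Used/total bytes reported back to TiKV.
#[derive(Debug, Default, PartialEq)]
struct StoreSize {
    used: u64,
    total: u64,
}

/// Hypothetical observer trait mirroring the hooks listed above.
/// Every method has a default, so an implementor overrides only what it needs.
trait ApplyObserver {
    /// Before execution: return `false` to filter out the command.
    fn pre_exec(&mut self, _region_id: u64, _cmd: &Cmd) -> bool {
        true
    }
    /// After execution: return `true` to suggest a persistence.
    fn post_exec(&mut self, _region_id: u64, _cmd: &Cmd) -> bool {
        false
    }
    /// A snapshot task arrived; the store may start pre-handling it in parallel.
    fn pre_handle_snapshot(&mut self, _region_id: u64, _snap_key: &str) {}
    /// A snapshot has been applied; the store should persist immediately.
    fn post_apply_snapshot(&mut self, _region_id: u64) {}
    /// A peer of this region was destroyed.
    fn on_peer_destroy(&mut self, _region_id: u64) {}
    /// An empty entry was applied; it must still advance the applied index.
    fn on_empty_cmd(&mut self, _region_id: u64, _index: u64) {}
    /// Used/total storage size.
    fn store_size(&self) -> StoreSize {
        StoreSize::default()
    }
    /// Whether to persist before `finish_for` / `commit`.
    fn pre_persist(&mut self, _region_id: u64, _is_finished: bool) -> bool {
        true
    }
}

/// TiFlash-side implementation (hypothetical).
#[derive(Default)]
struct TiFlashObserver {
    /// (region_id, index) of empty entries seen, standing in for advancing the applied index.
    empty_entries: Vec<(u64, u64)>,
    /// Per-disk statistics; TiFlash may use several disks.
    disks: Vec<StoreSize>,
}

impl ApplyObserver for TiFlashObserver {
    fn pre_exec(&mut self, _region_id: u64, cmd: &Cmd) -> bool {
        // Assumed policy: consistency-check admin commands are skipped.
        !matches!(cmd, Cmd::Admin(AdminCmd::ComputeHash | AdminCmd::VerifyHash))
    }

    fn on_empty_cmd(&mut self, region_id: u64, index: u64) {
        self.empty_entries.push((region_id, index));
    }

    fn store_size(&self) -> StoreSize {
        // Sum statistics across all disks.
        self.disks.iter().fold(StoreSize::default(), |acc, d| StoreSize {
            used: acc.used + d.used,
            total: acc.total + d.total,
        })
    }
}
```

Default methods let TiKV call every hook unconditionally, so an observer overrides only what it needs, and TiKV itself stays unaware of TiFlash-specific logic.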
The whole work can be divided into two parts:
- TiKV side: TiKV provides new `Engine Traits` interfaces and observers.
- TiFlash (Proxy) side: By implementing these new interfaces and observers, TiFlash can receive data from TiKV.
As described in tikv#12849, we provide a mechanism that enables external modules, including but not limited to Proxy, to obtain data through the raft apply state machine.
Unlike TiCDC, which captures KV change logs, Proxy uses regions instead of tables as its granularity, which is finer. Proxy also retains the raft state machine, which allows us to control the apply process more precisely.
As described in tiflash#5170, after refactoring, the Proxy can be divided into several crates/modules:
- `proxy_server`: This is a replacement for `components/server`.
- `mock-engine-store`: This is a replacement for `components/test_raftstore` and the old `mock-engine-store`.
- `engine_store_ffi`: This is decoupled from `components/raftstore`. The observers are also implemented in this crate.
- `engine_tiflash`: This is the self-defined `KvEngine`.
- `raftstore-proxy`: As before, this crate serves as the entry point for the proxy. The definition of FFI interfaces is also located here.
- `gen-proxy-ffi`: As before, this crate generates interface code into `engine_store_ffi`.