From 363bea050579bd1bb3a6edce3efd4d3febd9ede4 Mon Sep 17 00:00:00 2001 From: kennytm Date: Fri, 6 Dec 2019 13:39:32 +0800 Subject: [PATCH] rfc: add dumpling, a data exporting tool --- rfc/2019-12-06-dumpling.md | 264 +++++++++++++++++++++++++++++++++++++ 1 file changed, 264 insertions(+) create mode 100644 rfc/2019-12-06-dumpling.md diff --git a/rfc/2019-12-06-dumpling.md b/rfc/2019-12-06-dumpling.md new file mode 100644 index 00000000..1e8dd387 --- /dev/null +++ b/rfc/2019-12-06-dumpling.md @@ -0,0 +1,264 @@ +# Proposal: Dumpling 🥟: a data exporting library for the TiDB ecosystem + +- Author(s): ET-Team (kennytm) +- Last updated: Dec 06, 2019 +- Discussion at: https://github.com/pingcap/community/issues/122, https://github.com/pingcap/community/pull/123 + +## Abstract + +We propose introduce a library to replace Mydumper, code named **Dumpling**, +optimized for TiDB Lightning and to be usable as a library/plugin inside DM and +TiDB, as well as be an independent program. + +## Background + +### Optimize output format for machine reading + +[Mydumper] is a tool to dump MySQL databases into local filesystem as SQL dump. + +[TiDB Lightning] (a data importing tool to TiDB) and [DM] (a platform for +migration data from MySQL to TiDB) currently relies on Mydumper to extract data +from an upstream database. + +[TiDB Lightning]: https://pingcap.com/docs/dev/reference/tools/tidb-lightning/overview/ +[DM]: https://pingcap.com/docs/dev/reference/tools/data-migration/overview/ +[Mydumper]: https://github.com/pingcap/mydumper/ + +Currently Lightning needs a [relatively complex tokenizer][t] to parse the SQL +dump. This slows down the import speed. If we could make the data export tool to +produce a very easy-to-parse binary output, the processing speed could be +improved. + +[t]: https://github.com/pingcap/tidb-lightning/blob/master/lightning/mydump/parser.rl + +However, we do not want to heavily modify Mydumper because of the next reason. + +### License and maintenance + +Mydumper is licensed in GPLv3. This is incompatible with the license of all +other PingCAP projects (Apache 2.0). This forces mydumper to only be usable as +an external program in DM, instead of a library (neither static nor dynamic). + +Licensing aside, Mydumper is written in C and GLib, and does not expose itself +as a library (almost the entire program is included in a single file). This +further hinders us from tightly integrating it into DM and TiDB, which would +have expected Go modules. + +Finally, the [official Mydumper repository] is sparsely updated, and we prefer +not to create too much divergence in our fork. These prevent us from investing +into the Mydumper project, and drive us to create a brand new project. + +[official Mydumper repository]: https://github.com/maxbube/mydumper + +### New features + +As a new project designed for the TiDB ecosystem, we can include further +features we've desired for long, e.g. writing directly to cloud storage. + +## Proposal + +We propose introducing a new project **Dumpling**, which is + +* Primarily a library written in Go. +* Support writing SQL dumps from MySQL-compatible database, + with speed and concurrency matching Mydumper. +* Support every feature of Mydumper needed by DM and Lightning, but no more. +* Support writing in SQL, CSV, and a custom binary format for quick consumption. +* Support writing to cloud storage besides the local filesystem. + +## Rationale + +### Name + +The initial motivation of this tool is to supplement Lightning. +We call the new tool "Dumpling" as a portmanteau of "dump" + "Lightning". + +### Programming language + +We'd like to embed Dumpling into TiDB (as an `EXPORT` statement) and DM +(supporting the dumper unit). This forces us to write Dumpling in Go, unless +we'd like to wrap Dumpling through a Cgo layer or still want to it as a +subprocess. + +Using Go also gives us an easy to use test and code coverage framework. + +### Essential features + +Dumpling must support parallel export of a single (unique-indexed) table before +we could release a GA version. TiDB's hidden `_tidb_rowid column` is considered +a unique index. Even in parallel export mode, one should still be possible to +split into multiple files by size. + +PingCAP's documentation referred to mydumper in several places. +Dumpling must support these use cases: + +1. https://pingcap.com/docs/dev/how-to/migrate/from-mysql/#use-the-mydumper-loader-tool-to-export-and-import-all-the-data + + * -h host + * -P port + * -u user + * -t number_of_threads + * -B schema_name + * -T table,names + * --skip-tz-utc + +2. https://pingcap.com/docs/dev/how-to/maintain/backup-and-restore/#full-backup-and-restoration-using-mydumper-loader + + * -F file_size + +3. DM also has test cases using the following features: + + * --no-locks + * --statement-size + * --regex (but we are likely going to use "[black-white-list]" instead.) + +[black-white-list]: https://pingcap.com/docs/stable/reference/tools/tidb-lightning/table-filter/#filtering-databases + +### Source and target databases + +Dumpling should, at minimum, support exporting from the following databases: + +* MySQL 5.7 and 8.0 +* MariaDB 10.3 and 10.4 +* TiDB 2.1 and above + +The SQL dump should be consumable by these database systems as well. Outside +TiDB, we are going to support the last two stable major versions only. We are +not going to support obsoleted database system like Drizzle. + +In the far future, we'd also like to support reading (but not writing): + +* Oracle 12c, 19c +* IBM DB2? Microsoft SQL Server? PostgreSQL? + +### Output format + +Dumpling should be able the output files in these formats: + +* SQL dump +* CSV +* A binary output format which allows quick reading by Lightning. + +#### SQL format + +The output should be MySQL-compatible without changing the `SQL_MODE`. +This means: + +* Identifiers should always be printed `` `backquoted` `` +* Strings should always be printed `'single-quoted'` + +Unfortunately, the `NO_BACKSLASH_ESCAPES` mode cannot be conveniently ignored. +This means we still need to expose this configuration. + +#### CSV format + +The output should be compatible with [RFC 4180] and MySQL output, which is also +the default setting of Lightning. The first row should be the column names. +`NULL`s are written as `\N`. Normal backslashes are escaped. + +[RFC 4180]: https://tools.ietf.org/html/rfc4180 + +#### Binary format + +The new binary output format should have the following features: + +* The values should be compatible with `types.Datum`. +* Prefer speed over storage size (e.g. avoid LEB-128 encoding from protobuf) + +The details are to be decided in the future. + +### Output file name + +The file names should be compatible with the mydumper structure, i.e. + +* `{db}-schema-create.sql` +* `{db}.{table}-schema.sql` +* `{db}.{table}.{NNN}.sql` + +However, mydumper won't handle cases when the database and table names contain +special characters not allowed in the file system or conflicts with the naming +scheme. In Dumpling we are going to escape these characters using percent +encoding: + +* `< > : " / \ | ? *` +* `%` +* `.` +* All characters between U+0000 to U+001F inclusive. + +For instance, if the table name is `foo.bar` the file name will be encoded as `db.foo%2Ebar.123.sql`. + +### Compression + +Each output file should be directly compressible as `*.gz`. + +In the future, we could also support compressing as `*.lz4`, `*.zst` and `*.xz`, +and the compression level should be configurable. + +### Output location + +Besides the local file system, Dumpling should be able to upload the dump to +remote storage (cloud) via S3 and GCS protocol at minimum. + +### Integration with TiDB + +We expect Dumpling could be integrated as part of the TiDB Toolkit library (or +plugin) into TiDB. The whole database could be logically backed up using an +`EXPORT` statement. This detailed design will not be elaborated here. + +## Alternatives + +Instead of creating a new project, there are several existing clones of Mydumper +written in Go, but we do not consider them suitable to fit our primary purposes. + +* https://github.com/morgo/tidump + * Originally an experiment +* https://github.com/xelabs/go-mydumper + * GPLv3, not suitable for us + +## Compatibility + +## Implementation + +Phase 1: Essential features as mydumper replacement for DM and Lightning + +- [ ] Support dumping as SQL files to a local filesystem correctly + - [ ] Resulting data are sorted by primary key + - [ ] SQL files are split into size close to the given configuration + - [ ] Single tables can be dumped in parallel, if a primary key or unique + btree key exists + - [ ] Consistency: dumping a snapshot instead of live data (either acquire + a read lock or ignore new updates) + - [ ] Error handling and recovery + - [ ] Support all data types allowed by TiDB + - [ ] Support edge cases (e.g. [naughty strings], character sets, timestamp; + special characters in table name, table partitions, views, generated + columns, MariaDB sequences, etc.) +- [ ] Decide the CLI parameters +- [ ] Implement black-white-list filtering +- [ ] Has adequate logging and Prometheus metrics +- [ ] Testing + - [ ] Benchmark against Mydumper, for a large database + - [ ] Unit test coverage ≥ 75% + - [ ] E2E test (MySQL/TiDB → Dumpling → Loader → MySQL/TiDB) + +[naughty strings]: https://github.com/minimaxir/big-list-of-naughty-strings + +Phase 2: Further features which should be simple to implement + +- [ ] Decide the binary output specification + - [ ] Implement the encoder *and* decoder +- [ ] Support writing to external storage (S3, GCS, …), reusing library from BR +- [ ] Support dumping as CSV +- [ ] Support output compression +- [ ] `--where` clause support + +Phase 3: Advanced features + +- [ ] Checkpoints +- [ ] Integrate into TiDB + - [ ] Decide the syntax of the `EXPORT` statement + - [ ] Implement it +- [ ] Support reading from Oracle database (?) + +## Open issues (if applicable) +