pingcap · IANTHEREAL · Jan 30, 2020 · Dec 6, 2019 · Jan 30, 2020 · gregwebs
diff --git a/rfc/2019-12-06-dumpling.md b/rfc/2019-12-06-dumpling.md
@@ -0,0 +1,264 @@
+# Proposal: Dumpling 🥟: a data exporting library for the TiDB ecosystem
+
+- Author(s): ET-Team (kennytm)
+- Last updated:  Dec 06, 2019
+- Discussion at: https://github.com/pingcap/community/issues/122, https://github.com/pingcap/community/pull/123
+
+## Abstract
+
+We propose introduce a library to replace Mydumper, code named **Dumpling**,
+optimized for TiDB Lightning and to be usable as a library/plugin inside DM and
+TiDB, as well as be an independent program.
+
+## Background
+
+### Optimize output format for machine reading
+
+[Mydumper] is a tool to dump MySQL databases into local filesystem as SQL dump.
+
+[TiDB Lightning] (a data importing tool to TiDB) and [DM] (a platform for
+migration data from MySQL to TiDB) currently relies on Mydumper to extract data
+from an upstream database.
+
+[TiDB Lightning]: https://pingcap.com/docs/dev/reference/tools/tidb-lightning/overview/
+[DM]: https://pingcap.com/docs/dev/reference/tools/data-migration/overview/
+[Mydumper]: https://github.com/pingcap/mydumper/
+
+Currently Lightning needs a [relatively complex tokenizer][t] to parse the SQL
+dump. This slows down the import speed. If we could make the data export tool to
+produce a very easy-to-parse binary output, the processing speed could be
+improved.
+
+[t]: https://github.com/pingcap/tidb-lightning/blob/master/lightning/mydump/parser.rl
+
+However, we do not want to heavily modify Mydumper because of the next reason.
+
+### License and maintenance
+
+Mydumper is licensed in GPLv3. This is incompatible with the license of all
+other PingCAP projects (Apache 2.0). This forces mydumper to only be usable as
+an external program in DM, instead of a library (neither static nor dynamic).
+
+Licensing aside, Mydumper is written in C and GLib, and does not expose itself
+as a library (almost the entire program is included in a single file). This
+further hinders us from tightly integrating it into DM and TiDB, which would
+have expected Go modules.
+
+Finally, the [official Mydumper repository] is sparsely updated, and we prefer
+not to create too much divergence in our fork. These prevent us from investing
+into the Mydumper project, and drive us to create a brand new project.
+
+[official Mydumper repository]: https://github.com/maxbube/mydumper
+
+### New features
+
+As a new project designed for the TiDB ecosystem, we can include further
+features we've desired for long, e.g. writing directly to cloud storage.
+
+## Proposal
+
+We propose introducing a new project **Dumpling**, which is
+
+* Primarily a library written in Go.
+* Support writing SQL dumps from MySQL-compatible database,
+    with speed and concurrency matching Mydumper.
+* Support every feature of Mydumper needed by DM and Lightning, but no more.
+* Support writing in SQL, CSV, and a custom binary format for quick consumption.
+* Support writing to cloud storage besides the local filesystem.
+
+## Rationale
+
+### Name
+
+The initial motivation of this tool is to supplement Lightning.
+We call the new tool "Dumpling" as a portmanteau of "dump" + "Lightning".
+
+### Programming language
+
+We'd like to embed Dumpling into TiDB (as an `EXPORT` statement) and DM
+(supporting the dumper unit). This forces us to write Dumpling in Go, unless
+we'd like to wrap Dumpling through a Cgo layer or still want to it as a
+subprocess.
+
+Using Go also gives us an easy to use test and code coverage framework.
+
+### Essential features
+
+Dumpling must support parallel export of a single (unique-indexed) table before
+we could release a GA version. TiDB's hidden `_tidb_rowid column` is considered
+a unique index. Even in parallel export mode, one should still be possible to
+split into multiple files by size.
+
+PingCAP's documentation referred to mydumper in several places.
+Dumpling must support these use cases:
+
+1. https://pingcap.com/docs/dev/how-to/migrate/from-mysql/#use-the-mydumper-loader-tool-to-export-and-import-all-the-data
+
+    * -h host
+    * -P port
+    * -u user
+    * -t number_of_threads
+    * -B schema_name
+    * -T table,names
+    * --skip-tz-utc
+
+2. https://pingcap.com/docs/dev/how-to/maintain/backup-and-restore/#full-backup-and-restoration-using-mydumper-loader
+
+    * -F file_size
+
+3. DM also has test cases using the following features:
+
+    * --no-locks
+    * --statement-size
+    * --regex (but we are likely going to use "[black-white-list]" instead.)
+
+[black-white-list]: https://pingcap.com/docs/stable/reference/tools/tidb-lightning/table-filter/#filtering-databases
+
+### Source and target databases
+
+Dumpling should, at minimum, support exporting from the following databases:
+
+* MySQL 5.7 and 8.0
+* MariaDB 10.3 and 10.4
+* TiDB 2.1 and above
+
+The SQL dump should be consumable by these database systems as well. Outside
+TiDB, we are going to support the last two stable major versions only. We are
+not going to support obsoleted database system like Drizzle.
+
+In the far future, we'd also like to support reading (but not writing):
+
+* Oracle 12c, 19c
+* IBM DB2? Microsoft SQL Server? PostgreSQL?
+
+### Output format
+
+Dumpling should be able the output files in these formats:
+
+* SQL dump
+* CSV
+* A binary output format which allows quick reading by Lightning.
+
+#### SQL format
+
+The output should be MySQL-compatible without changing the `SQL_MODE`.
+This means:
+
+* Identifiers should always be printed `` `backquoted` ``
+* Strings should always be printed `'single-quoted'`
+
+Unfortunately, the `NO_BACKSLASH_ESCAPES` mode cannot be conveniently ignored.
+This means we still need to expose this configuration.
+
+#### CSV format
+
+The output should be compatible with [RFC 4180] and MySQL output, which is also
+the default setting of Lightning. The first row should be the column names.
+`NULL`s are written as `\N`. Normal backslashes are escaped.
+
+[RFC 4180]: https://tools.ietf.org/html/rfc4180
+
+#### Binary format
+
+The new binary output format should have the following features:
+
+* The values should be compatible with `types.Datum`.
+* Prefer speed over storage size (e.g. avoid LEB-128 encoding from protobuf)
+
+The details are to be decided in the future.
+
+### Output file name
+
+The file names should be compatible with the mydumper structure, i.e.
+
+* `{db}-schema-create.sql`
+* `{db}.{table}-schema.sql`
+* `{db}.{table}.{NNN}.sql`
+
+However, mydumper won't handle cases when the database and table names contain
+special characters not allowed in the file system or conflicts with the naming
+scheme. In Dumpling we are going to escape these characters using percent
+encoding:
+
+* `< > : " / \ | ? *`
+* `%`
+* `.`
+* All characters between U+0000 to U+001F inclusive.
+
+For instance, if the table name is `foo.bar` the file name will be encoded as `db.foo%2Ebar.123.sql`.
+
+### Compression
+
+Each output file should be directly compressible as `*.gz`.
+
+In the future, we could also support compressing as `*.lz4`, `*.zst` and `*.xz`,
+and the compression level should be configurable.
+
+### Output location
+
+Besides the local file system, Dumpling should be able to upload the dump to
+remote storage (cloud) via S3 and GCS protocol at minimum.
+
+### Integration with TiDB
+
+We expect Dumpling could be integrated as part of the TiDB Toolkit library (or
+plugin) into TiDB. The whole database could be logically backed up using an
+`EXPORT` statement. This detailed design will not be elaborated here.
+
+## Alternatives
+
+Instead of creating a new project, there are several existing clones of Mydumper
+written in Go, but we do not consider them suitable to fit our primary purposes.
+
+* https://github.com/morgo/tidump
+    * Originally an experiment
+* https://github.com/xelabs/go-mydumper
+    * GPLv3, not suitable for us
+
+## Compatibility
+
+## Implementation
+
+Phase 1: Essential features as mydumper replacement for DM and Lightning
+
+- [ ] Support dumping as SQL files to a local filesystem correctly
+    - [ ] Resulting data are sorted by primary key
+    - [ ] SQL files are split into size close to the given configuration
+    - [ ] Single tables can be dumped in parallel, if a primary key or unique
+        btree key exists
+    - [ ] Consistency: dumping a snapshot instead of live data (either acquire
+        a read lock or ignore new updates)
+    - [ ] Error handling and recovery
+    - [ ] Support all data types allowed by TiDB
+    - [ ] Support edge cases (e.g. [naughty strings], character sets, timestamp;
+        special characters in table name, table partitions, views, generated
+        columns, MariaDB sequences, etc.)
+- [ ] Decide the CLI parameters
+- [ ] Implement black-white-list filtering
+- [ ] Has adequate logging and Prometheus metrics
+- [ ] Testing
+    - [ ] Benchmark against Mydumper, for a large database
+    - [ ] Unit test coverage ≥ 75%
+    - [ ] E2E test (MySQL/TiDB → Dumpling → Loader → MySQL/TiDB)
+
+[naughty strings]: https://github.com/minimaxir/big-list-of-naughty-strings
+
+Phase 2: Further features which should be simple to implement
+
+- [ ] Decide the binary output specification
+    - [ ] Implement the encoder *and* decoder
+- [ ] Support writing to external storage (S3, GCS, …), reusing library from BR
+- [ ] Support dumping as CSV
+- [ ] Support output compression
+- [ ] `--where` clause support
+
+Phase 3: Advanced features
+
+- [ ] Checkpoints
+- [ ] Integrate into TiDB
+    - [ ] Decide the syntax of the `EXPORT` statement
+    - [ ] Implement it
+- [ ] Support reading from Oracle database (?)
+
+## Open issues (if applicable)
+