-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfc: add dumpling, a data exporting tool #123
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,264 @@ | ||
# Proposal: Dumpling 🥟: a data exporting library for the TiDB ecosystem | ||
|
||
- Author(s): ET-Team (kennytm) | ||
- Last updated: Dec 06, 2019 | ||
- Discussion at: https://github.com/pingcap/community/issues/122, https://github.com/pingcap/community/pull/123 | ||
|
||
## Abstract | ||
|
||
We propose introduce a library to replace Mydumper, code named **Dumpling**, | ||
optimized for TiDB Lightning and to be usable as a library/plugin inside DM and | ||
TiDB, as well as be an independent program. | ||
|
||
## Background | ||
|
||
### Optimize output format for machine reading | ||
|
||
[Mydumper] is a tool to dump MySQL databases into local filesystem as SQL dump. | ||
|
||
[TiDB Lightning] (a data importing tool to TiDB) and [DM] (a platform for | ||
migration data from MySQL to TiDB) currently relies on Mydumper to extract data | ||
from an upstream database. | ||
|
||
[TiDB Lightning]: https://pingcap.com/docs/dev/reference/tools/tidb-lightning/overview/ | ||
[DM]: https://pingcap.com/docs/dev/reference/tools/data-migration/overview/ | ||
[Mydumper]: https://github.com/pingcap/mydumper/ | ||
|
||
Currently Lightning needs a [relatively complex tokenizer][t] to parse the SQL | ||
dump. This slows down the import speed. If we could make the data export tool to | ||
produce a very easy-to-parse binary output, the processing speed could be | ||
improved. | ||
|
||
[t]: https://github.com/pingcap/tidb-lightning/blob/master/lightning/mydump/parser.rl | ||
|
||
However, we do not want to heavily modify Mydumper because of the next reason. | ||
|
||
### License and maintenance | ||
|
||
Mydumper is licensed in GPLv3. This is incompatible with the license of all | ||
other PingCAP projects (Apache 2.0). This forces mydumper to only be usable as | ||
an external program in DM, instead of a library (neither static nor dynamic). | ||
|
||
Licensing aside, Mydumper is written in C and GLib, and does not expose itself | ||
as a library (almost the entire program is included in a single file). This | ||
further hinders us from tightly integrating it into DM and TiDB, which would | ||
have expected Go modules. | ||
|
||
Finally, the [official Mydumper repository] is sparsely updated, and we prefer | ||
not to create too much divergence in our fork. These prevent us from investing | ||
into the Mydumper project, and drive us to create a brand new project. | ||
|
||
[official Mydumper repository]: https://github.com/maxbube/mydumper | ||
|
||
### New features | ||
|
||
As a new project designed for the TiDB ecosystem, we can include further | ||
features we've desired for long, e.g. writing directly to cloud storage. | ||
|
||
## Proposal | ||
|
||
We propose introducing a new project **Dumpling**, which is | ||
|
||
* Primarily a library written in Go. | ||
* Support writing SQL dumps from MySQL-compatible database, | ||
with speed and concurrency matching Mydumper. | ||
* Support every feature of Mydumper needed by DM and Lightning, but no more. | ||
* Support writing in SQL, CSV, and a custom binary format for quick consumption. | ||
* Support writing to cloud storage besides the local filesystem. | ||
|
||
## Rationale | ||
|
||
### Name | ||
|
||
The initial motivation of this tool is to supplement Lightning. | ||
We call the new tool "Dumpling" as a portmanteau of "dump" + "Lightning". | ||
|
||
### Programming language | ||
|
||
We'd like to embed Dumpling into TiDB (as an `EXPORT` statement) and DM | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Once the new physical backup is released, it seems that the only use case for using this against TiDB is to export data to another database (MySQL). In the case of an immediate restore into MySQL the export from TiDB probably won't add convenience because one already has to run an external tool to load the data into MySQL. The TiDB that runs this might essentially need to be considered offline if the backup process is using up all its resources. So then the value proposition would then be that it is easier to deploy an additional TiDB than to deploy a new tool. This will only be the case if the resource requirements of backup are the same as TiDB. If the resource requirements are bigger, then this won't work. If the resource requirements are smaller and the TiDB node will still serve requests, wrapping as subprocess could still be a good idea to better isolate the backup workload. In contrast, the new physical BR tool ran from inside TiDB would only perform meta operations from TiDB with most of the actual backup work done from TiKV: this should leave most resources still available for the TiDB node. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This paragraph explains why we choose Go not other languages. Even if we don't want And given that we're going to have The There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Import with lightning will have some similarly deployment issues since it will use a great deal of CPU. However one of the main use cases is to import when a cluster is first created and no useful queries can be run untill import is complete. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. True. The Anyway these are getting off-topic. |
||
(supporting the dumper unit). This forces us to write Dumpling in Go, unless | ||
we'd like to wrap Dumpling through a Cgo layer or still want to it as a | ||
subprocess. | ||
|
||
Using Go also gives us an easy to use test and code coverage framework. | ||
|
||
### Essential features | ||
|
||
Dumpling must support parallel export of a single (unique-indexed) table before | ||
we could release a GA version. TiDB's hidden `_tidb_rowid column` is considered | ||
a unique index. Even in parallel export mode, one should still be possible to | ||
split into multiple files by size. | ||
|
||
PingCAP's documentation referred to mydumper in several places. | ||
Dumpling must support these use cases: | ||
|
||
1. https://pingcap.com/docs/dev/how-to/migrate/from-mysql/#use-the-mydumper-loader-tool-to-export-and-import-all-the-data | ||
|
||
* -h host | ||
* -P port | ||
* -u user | ||
* -t number_of_threads | ||
* -B schema_name | ||
* -T table,names | ||
* --skip-tz-utc | ||
|
||
2. https://pingcap.com/docs/dev/how-to/maintain/backup-and-restore/#full-backup-and-restoration-using-mydumper-loader | ||
|
||
* -F file_size | ||
|
||
3. DM also has test cases using the following features: | ||
|
||
* --no-locks | ||
* --statement-size | ||
* --regex (but we are likely going to use "[black-white-list]" instead.) | ||
|
||
[black-white-list]: https://pingcap.com/docs/stable/reference/tools/tidb-lightning/table-filter/#filtering-databases | ||
|
||
### Source and target databases | ||
|
||
Dumpling should, at minimum, support exporting from the following databases: | ||
|
||
* MySQL 5.7 and 8.0 | ||
* MariaDB 10.3 and 10.4 | ||
* TiDB 2.1 and above | ||
|
||
The SQL dump should be consumable by these database systems as well. Outside | ||
TiDB, we are going to support the last two stable major versions only. We are | ||
not going to support obsoleted database system like Drizzle. | ||
|
||
In the far future, we'd also like to support reading (but not writing): | ||
|
||
* Oracle 12c, 19c | ||
* IBM DB2? Microsoft SQL Server? PostgreSQL? | ||
|
||
### Output format | ||
|
||
Dumpling should be able the output files in these formats: | ||
|
||
* SQL dump | ||
* CSV | ||
* A binary output format which allows quick reading by Lightning. | ||
|
||
#### SQL format | ||
|
||
The output should be MySQL-compatible without changing the `SQL_MODE`. | ||
This means: | ||
|
||
* Identifiers should always be printed `` `backquoted` `` | ||
* Strings should always be printed `'single-quoted'` | ||
|
||
Unfortunately, the `NO_BACKSLASH_ESCAPES` mode cannot be conveniently ignored. | ||
This means we still need to expose this configuration. | ||
|
||
#### CSV format | ||
|
||
The output should be compatible with [RFC 4180] and MySQL output, which is also | ||
the default setting of Lightning. The first row should be the column names. | ||
`NULL`s are written as `\N`. Normal backslashes are escaped. | ||
|
||
[RFC 4180]: https://tools.ietf.org/html/rfc4180 | ||
|
||
#### Binary format | ||
|
||
The new binary output format should have the following features: | ||
|
||
* The values should be compatible with `types.Datum`. | ||
* Prefer speed over storage size (e.g. avoid LEB-128 encoding from protobuf) | ||
|
||
The details are to be decided in the future. | ||
|
||
### Output file name | ||
|
||
The file names should be compatible with the mydumper structure, i.e. | ||
|
||
* `{db}-schema-create.sql` | ||
* `{db}.{table}-schema.sql` | ||
* `{db}.{table}.{NNN}.sql` | ||
|
||
However, mydumper won't handle cases when the database and table names contain | ||
special characters not allowed in the file system or conflicts with the naming | ||
scheme. In Dumpling we are going to escape these characters using percent | ||
encoding: | ||
|
||
* `< > : " / \ | ? *` | ||
* `%` | ||
* `.` | ||
* All characters between U+0000 to U+001F inclusive. | ||
|
||
For instance, if the table name is `foo.bar` the file name will be encoded as `db.foo%2Ebar.123.sql`. | ||
|
||
### Compression | ||
|
||
Each output file should be directly compressible as `*.gz`. | ||
|
||
In the future, we could also support compressing as `*.lz4`, `*.zst` and `*.xz`, | ||
and the compression level should be configurable. | ||
|
||
### Output location | ||
|
||
Besides the local file system, Dumpling should be able to upload the dump to | ||
remote storage (cloud) via S3 and GCS protocol at minimum. | ||
|
||
### Integration with TiDB | ||
|
||
We expect Dumpling could be integrated as part of the TiDB Toolkit library (or | ||
plugin) into TiDB. The whole database could be logically backed up using an | ||
`EXPORT` statement. This detailed design will not be elaborated here. | ||
|
||
## Alternatives | ||
|
||
Instead of creating a new project, there are several existing clones of Mydumper | ||
written in Go, but we do not consider them suitable to fit our primary purposes. | ||
|
||
* https://github.com/morgo/tidump | ||
* Originally an experiment | ||
* https://github.com/xelabs/go-mydumper | ||
* GPLv3, not suitable for us | ||
|
||
## Compatibility | ||
|
||
## Implementation | ||
|
||
Phase 1: Essential features as mydumper replacement for DM and Lightning | ||
|
||
- [ ] Support dumping as SQL files to a local filesystem correctly | ||
- [ ] Resulting data are sorted by primary key | ||
- [ ] SQL files are split into size close to the given configuration | ||
- [ ] Single tables can be dumped in parallel, if a primary key or unique | ||
btree key exists | ||
- [ ] Consistency: dumping a snapshot instead of live data (either acquire | ||
a read lock or ignore new updates) | ||
- [ ] Error handling and recovery | ||
- [ ] Support all data types allowed by TiDB | ||
- [ ] Support edge cases (e.g. [naughty strings], character sets, timestamp; | ||
special characters in table name, table partitions, views, generated | ||
columns, MariaDB sequences, etc.) | ||
- [ ] Decide the CLI parameters | ||
- [ ] Implement black-white-list filtering | ||
- [ ] Has adequate logging and Prometheus metrics | ||
- [ ] Testing | ||
- [ ] Benchmark against Mydumper, for a large database | ||
- [ ] Unit test coverage ≥ 75% | ||
- [ ] E2E test (MySQL/TiDB → Dumpling → Loader → MySQL/TiDB) | ||
|
||
[naughty strings]: https://github.com/minimaxir/big-list-of-naughty-strings | ||
|
||
Phase 2: Further features which should be simple to implement | ||
|
||
- [ ] Decide the binary output specification | ||
- [ ] Implement the encoder *and* decoder | ||
- [ ] Support writing to external storage (S3, GCS, …), reusing library from BR | ||
- [ ] Support dumping as CSV | ||
- [ ] Support output compression | ||
- [ ] `--where` clause support | ||
|
||
Phase 3: Advanced features | ||
|
||
- [ ] Checkpoints | ||
- [ ] Integrate into TiDB | ||
- [ ] Decide the syntax of the `EXPORT` statement | ||
- [ ] Implement it | ||
- [ ] Support reading from Oracle database (?) | ||
|
||
## Open issues (if applicable) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TiDumpling might help with both searchability and understanding. There are already open source projects named dumpling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming is hard 🙃. We could also specify the official name as TiDB Dumpling (like TiDB Lightning and TiDB Binlog).