rfc: add dumpling, a data exporting tool #123

Merged
merged 2 commits into from
Jan 30, 2020
264 changes: 264 additions & 0 deletions rfc/2019-12-06-dumpling.md
@@ -0,0 +1,264 @@
# Proposal: Dumpling 🥟: a data exporting library for the TiDB ecosystem

- Author(s): ET-Team (kennytm)
- Last updated: Dec 06, 2019
- Discussion at: https://github.com/pingcap/community/issues/122, https://github.com/pingcap/community/pull/123

## Abstract

We propose introducing a library to replace Mydumper, code-named **Dumpling**,
optimized for TiDB Lightning and usable as a library/plugin inside DM and
TiDB, as well as an independent program.

## Background

### Optimize output format for machine reading

[Mydumper] is a tool to dump MySQL databases into the local filesystem as SQL dumps.

[TiDB Lightning] (a data importing tool for TiDB) and [DM] (a platform for
migrating data from MySQL to TiDB) currently rely on Mydumper to extract data
from an upstream database.

[TiDB Lightning]: https://pingcap.com/docs/dev/reference/tools/tidb-lightning/overview/
[DM]: https://pingcap.com/docs/dev/reference/tools/data-migration/overview/
[Mydumper]: https://github.com/pingcap/mydumper/

Currently Lightning needs a [relatively complex tokenizer][t] to parse the SQL
dump. This slows down the import. If we could make the data export tool
produce a very easy-to-parse binary output, the processing speed could be
improved.

[t]: https://github.com/pingcap/tidb-lightning/blob/master/lightning/mydump/parser.rl

However, we do not want to heavily modify Mydumper, for the reasons given in the next section.

### License and maintenance

Mydumper is licensed under GPLv3, which is incompatible with the license of all
other PingCAP projects (Apache 2.0). This forces Mydumper to be usable only as
an external program in DM, not as a library (whether static or dynamic).

Licensing aside, Mydumper is written in C and GLib, and does not expose itself
as a library (almost the entire program is included in a single file). This
further hinders us from tightly integrating it into DM and TiDB, which
expect Go modules.

Finally, the [official Mydumper repository] is sparsely updated, and we prefer
not to create too much divergence in our fork. These factors prevent us from
investing in the Mydumper project, and drive us to create a brand new project.

[official Mydumper repository]: https://github.com/maxbube/mydumper

### New features

Since this is a new project designed for the TiDB ecosystem, we can include
features we have long wanted, e.g. writing directly to cloud storage.

## Proposal

We propose introducing a new project, **Dumpling**, which:

* Is primarily a library written in Go.
* Supports writing SQL dumps from MySQL-compatible databases,
  with speed and concurrency matching Mydumper.
* Supports every feature of Mydumper needed by DM and Lightning, but no more.
* Supports writing in SQL, CSV, and a custom binary format for quick consumption.
* Supports writing to cloud storage besides the local filesystem.

## Rationale

### Name

The initial motivation of this tool is to supplement Lightning.
We call the new tool "Dumpling" as a portmanteau of "dump" + "Lightning".

TiDumpling might help with both searchability and understanding. There are already open source projects named dumpling.

Contributor Author

Naming is hard 🙃. We could also specify the official name as TiDB Dumpling (like TiDB Lightning and TiDB Binlog).


### Programming language

We'd like to embed Dumpling into TiDB (as an `EXPORT` statement) and DM

Once the new physical backup is released, it seems that the only use case for using this against TiDB is to export data to another database (MySQL). In the case of an immediate restore into MySQL the export from TiDB probably won't add convenience because one already has to run an external tool to load the data into MySQL.

The TiDB node that runs this might essentially need to be considered offline if the backup process uses up all its resources. The value proposition would then be that it is easier to deploy an additional TiDB node than to deploy a new tool. This will only be the case if the resource requirements of backup are the same as those of TiDB. If they are bigger, this won't work. If they are smaller and the TiDB node will still serve requests, wrapping the tool as a subprocess could still be a good idea to better isolate the backup workload.
Unintelligent load balancing between TiDB nodes could easily overwork the node doing the backup.

In contrast, the new physical BR tool run from inside TiDB would only perform meta operations from TiDB, with most of the actual backup work done by TiKV: this should leave most resources still available for the TiDB node.

Contributor Author

This paragraph explains why we chose Go over other languages. Even if we don't want EXPORT, we still need integration with DM.

And given that we're going to have IMPORT with Lightning, it is natural to support EXPORT as well.

The EXPORT statement is not meant to replace BR. BR will be given its own BACKUP and RESTORE statements.

Import with Lightning will have similar deployment issues since it will use a great deal of CPU. However, one of the main use cases is to import when a cluster is first created and no useful queries can be run until the import is complete.

Contributor Author

True. The IMPORT and EXPORT statements allow DBAs to manage logical backups via the SQL interface for familiarity. The individual executables are still available though.

Anyway these are getting off-topic.

(supporting the dumper unit). This forces us to write Dumpling in Go, unless
we'd like to wrap Dumpling through a Cgo layer or still want to run it as a
subprocess.

Using Go also gives us an easy-to-use test and code coverage framework.

### Essential features

Dumpling must support parallel export of a single (unique-indexed) table before
we can release a GA version. TiDB's hidden `_tidb_rowid` column is considered
a unique index. Even in parallel export mode, it should still be possible to
split the output into multiple files by size.
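
As a rough illustration of parallel export over a unique integer key, the sketch below splits a key range into contiguous half-open chunks and builds one query per chunk. The names `splitRange` and `chunkQuery` are hypothetical, and the sketch assumes the key is an integer whose minimum and maximum have already been fetched; it is not the proposed implementation.

```go
package main

import "fmt"

// chunk is a half-open key range [Lo, Hi) handled by one worker.
type chunk struct{ Lo, Hi int64 }

// splitRange divides the inclusive key range [min, max] into at most n
// contiguous chunks of near-equal width.
// Assumption: max-min+1 and max+1 both fit in int64.
func splitRange(min, max int64, n int) []chunk {
	if n < 1 || max < min {
		return nil
	}
	total := max - min + 1
	if int64(n) > total {
		n = int(total)
	}
	step, rem := total/int64(n), total%int64(n)
	chunks := make([]chunk, 0, n)
	lo := min
	for i := 0; i < n; i++ {
		width := step
		if int64(i) < rem {
			width++ // spread the remainder over the first chunks
		}
		chunks = append(chunks, chunk{Lo: lo, Hi: lo + width})
		lo += width
	}
	return chunks
}

// chunkQuery builds the SELECT for one chunk.
// table and key must already be quoted identifiers.
func chunkQuery(table, key string, c chunk) string {
	return fmt.Sprintf("SELECT * FROM %s WHERE %s >= %d AND %s < %d ORDER BY %s",
		table, key, c.Lo, key, c.Hi, key)
}

func main() {
	for _, c := range splitRange(1, 10, 3) {
		fmt.Println(chunkQuery("`db`.`t`", "`_tidb_rowid`", c))
	}
}
```

Splitting by key value rather than by row count keeps the chunks independent, at the cost of uneven chunk sizes when the key space is sparse.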

PingCAP's documentation refers to mydumper in several places.
Dumpling must support these use cases:

1. https://pingcap.com/docs/dev/how-to/migrate/from-mysql/#use-the-mydumper-loader-tool-to-export-and-import-all-the-data

* -h host
* -P port
* -u user
* -t number_of_threads
* -B schema_name
* -T table,names
* --skip-tz-utc

2. https://pingcap.com/docs/dev/how-to/maintain/backup-and-restore/#full-backup-and-restoration-using-mydumper-loader

* -F file_size

3. DM also has test cases using the following features:

* --no-locks
* --statement-size
* --regex (but we are likely going to use "[black-white-list]" instead.)

[black-white-list]: https://pingcap.com/docs/stable/reference/tools/tidb-lightning/table-filter/#filtering-databases

### Source and target databases

Dumpling should, at minimum, support exporting from the following databases:

* MySQL 5.7 and 8.0
* MariaDB 10.3 and 10.4
* TiDB 2.1 and above

The SQL dump should be consumable by these database systems as well. Outside
TiDB, we are going to support the last two stable major versions only. We are
not going to support obsolete database systems such as Drizzle.

In the far future, we'd also like to support reading (but not writing):

* Oracle 12c, 19c
* IBM DB2? Microsoft SQL Server? PostgreSQL?

### Output format

Dumpling should be able to output files in these formats:

* SQL dump
* CSV
* A binary output format which allows quick reading by Lightning.

#### SQL format

The output should be MySQL-compatible without changing the `SQL_MODE`.
This means:

* Identifiers should always be printed `` `backquoted` ``
* Strings should always be printed `'single-quoted'`

Unfortunately, the `NO_BACKSLASH_ESCAPES` mode cannot be conveniently ignored.
This means we still need to expose this configuration.
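
A minimal sketch of these quoting rules, assuming hypothetical helpers `quoteIdent` and `quoteString`: identifiers are backquoted with embedded backquotes doubled, and string literals are single-quoted. The `noBackslashEscapes` parameter stands in for the configuration mentioned above; the set of characters escaped in the default mode is an illustrative choice, not a decision made by this proposal.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent wraps an identifier in backquotes, doubling embedded backquotes.
func quoteIdent(name string) string {
	return "`" + strings.ReplaceAll(name, "`", "``") + "`"
}

// quoteString renders a single-quoted SQL string literal.
// With NO_BACKSLASH_ESCAPES the backslash is an ordinary character,
// so only the single quote is escaped (by doubling it).
func quoteString(s string, noBackslashEscapes bool) string {
	if noBackslashEscapes {
		return "'" + strings.ReplaceAll(s, "'", "''") + "'"
	}
	r := strings.NewReplacer(
		`\`, `\\`,
		`'`, `''`,
		"\x00", `\0`,
		"\n", `\n`,
		"\r", `\r`,
		"\x1a", `\Z`,
	)
	return "'" + r.Replace(s) + "'"
}

func main() {
	fmt.Println(quoteIdent("a`b"))
	fmt.Println(quoteString(`it's a\b`, false))
	fmt.Println(quoteString(`it's a\b`, true))
}
```

The two modes produce different text for any value containing a backslash, which is why the setting cannot simply be ignored by the exporter.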

#### CSV format

The output should be compatible with [RFC 4180] and MySQL output, which is also
the default setting of Lightning. The first row should be the column names.
`NULL`s are written as `\N`. Normal backslashes are escaped.

[RFC 4180]: https://tools.ietf.org/html/rfc4180
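
The following sketch shows one way the CSV rules could fit together: `NULL` becomes an unquoted `\N`, backslashes are escaped, and double quotes are doubled as in RFC 4180. The helpers `csvField` and `csvRow` are hypothetical, and quoting every non-NULL field is an assumption made here for simplicity rather than something the proposal specifies.

```go
package main

import (
	"fmt"
	"strings"
)

// csvField encodes one value. A nil pointer stands for SQL NULL.
func csvField(v *string) string {
	if v == nil {
		return `\N` // unquoted, so it cannot be confused with a string value
	}
	// Escape backslashes MySQL-style and double the quotes per RFC 4180.
	r := strings.NewReplacer(`\`, `\\`, `"`, `""`)
	return `"` + r.Replace(*v) + `"`
}

// csvRow joins the encoded fields with commas and ends the record with CRLF.
func csvRow(fields []*string) string {
	out := make([]string, len(fields))
	for i, f := range fields {
		out[i] = csvField(f)
	}
	return strings.Join(out, ",") + "\r\n"
}

// str is a small helper to take the address of a string literal.
func str(s string) *string { return &s }

func main() {
	fmt.Print(csvRow([]*string{str("id"), str("note")})) // header row with column names
	fmt.Print(csvRow([]*string{str("1"), nil}))
	fmt.Print(csvRow([]*string{str("2"), str(`say "hi"\n`)}))
}
```

Because a literal backslash is always escaped inside a quoted field, the string value `\N` is written as `"\\N"` and stays distinguishable from `NULL`.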

#### Binary format

The new binary output format should have the following features:

* The values should be compatible with `types.Datum`.
* Prefer speed over storage size (e.g. avoid LEB-128 encoding from protobuf)

The details are to be decided in the future.

### Output file name

The file names should be compatible with the mydumper structure, i.e.

* `{db}-schema-create.sql`
* `{db}.{table}-schema.sql`
* `{db}.{table}.{NNN}.sql`

However, mydumper does not handle cases where the database and table names
contain special characters that are not allowed in the file system or that
conflict with the naming scheme. In Dumpling we are going to escape these
characters using percent encoding:

* `< > : " / \ | ? *`
* `%`
* `.`
* All characters from U+0000 to U+001F inclusive.

For instance, if the table name is `foo.bar` the file name will be encoded as `db.foo%2Ebar.123.sql`.
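
The escaping rule can be sketched as follows. `escapeName` and `dataFileName` are hypothetical names; the sketch assumes that the spaces in the character list above are separators rather than characters to escape, and that the chunk number is written without zero padding, as in the example.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeName percent-encodes the characters listed above:
// < > : " / \ | ? * % . and every control character U+0000..U+001F.
// All of them are single-byte ASCII, so scanning bytes is safe for UTF-8 names.
func escapeName(name string) string {
	const special = `<>:"/\|?*%.`
	var b strings.Builder
	for i := 0; i < len(name); i++ {
		c := name[i]
		if c <= 0x1F || strings.IndexByte(special, c) >= 0 {
			fmt.Fprintf(&b, "%%%02X", c)
		} else {
			b.WriteByte(c)
		}
	}
	return b.String()
}

// dataFileName builds a `{db}.{table}.{NNN}.sql` name from escaped parts.
func dataFileName(db, table string, index int) string {
	return fmt.Sprintf("%s.%s.%d.sql", escapeName(db), escapeName(table), index)
}

func main() {
	fmt.Println(dataFileName("db", "foo.bar", 123))
}
```

Escaping `%` itself keeps the encoding reversible, and escaping `.` keeps the database and table parts unambiguous when a consumer splits the file name.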

### Compression

Each output file should be directly compressible as `*.gz`.

In the future, we could also support compressing as `*.lz4`, `*.zst` and `*.xz`,
and the compression level should be configurable.

### Output location

Besides the local file system, Dumpling should be able to upload the dump to
remote (cloud) storage via at least the S3 and GCS protocols.

### Integration with TiDB

We expect that Dumpling can be integrated into TiDB as part of the TiDB Toolkit
library (or plugin). The whole database could then be logically backed up using an
`EXPORT` statement. The detailed design will not be elaborated here.

## Alternatives

Instead of creating a new project, we could adopt one of several existing clones
of Mydumper written in Go, but we do not consider them suitable for our primary purposes.

* https://github.com/morgo/tidump
    * Originally an experiment
* https://github.com/xelabs/go-mydumper
    * GPLv3, not suitable for us

## Compatibility

## Implementation

Phase 1: Essential features as mydumper replacement for DM and Lightning

- [ ] Support dumping as SQL files to a local filesystem correctly
- [ ] Resulting data are sorted by primary key
- [ ] SQL files are split into size close to the given configuration
- [ ] Single tables can be dumped in parallel, if a primary key or unique
btree key exists
- [ ] Consistency: dumping a snapshot instead of live data (either acquire
a read lock or ignore new updates)
- [ ] Error handling and recovery
- [ ] Support all data types allowed by TiDB
- [ ] Support edge cases (e.g. [naughty strings], character sets, timestamp;
special characters in table name, table partitions, views, generated
columns, MariaDB sequences, etc.)
- [ ] Decide the CLI parameters
- [ ] Implement black-white-list filtering
- [ ] Has adequate logging and Prometheus metrics
- [ ] Testing
    - [ ] Benchmark against Mydumper, for a large database
    - [ ] Unit test coverage ≥ 75%
    - [ ] E2E test (MySQL/TiDB → Dumpling → Loader → MySQL/TiDB)

[naughty strings]: https://github.com/minimaxir/big-list-of-naughty-strings

Phase 2: Further features which should be simple to implement

- [ ] Decide the binary output specification
    - [ ] Implement the encoder *and* decoder
- [ ] Support writing to external storage (S3, GCS, …), reusing library from BR
- [ ] Support dumping as CSV
- [ ] Support output compression
- [ ] `--where` clause support

Phase 3: Advanced features

- [ ] Checkpoints
- [ ] Integrate into TiDB
    - [ ] Decide the syntax of the `EXPORT` statement
    - [ ] Implement it
- [ ] Support reading from Oracle database (?)

## Open issues (if applicable)