
Support to specify a disk quota for intermediate files #446

Open
shuijing198799 opened this issue Nov 4, 2020 · 5 comments
Labels
feature-request (This issue is a feature request)

Comments

@shuijing198799

shuijing198799 commented Nov 4, 2020

Feature Request

Is your feature request related to a problem? Please describe:

Lightning needs a volume to store intermediate files, and it is hard to predict how large this disk must be, so we have to prepare as large a disk as possible to hold these temporary files. For example, if I need to restore 2 TB of data, we have to prepare a 2 TB volume for Lightning. This is a bad experience when running on the cloud.

Describe the feature you'd like:

We want to be able to specify the volume size, so that the intermediate (checkpoint) files will not exceed this size during the Lightning process.

Describe alternatives you've considered:

none

Teachability, Documentation, Adoption, Optimization:

@shuijing198799 added the feature-request label on Nov 4, 2020
@overvenus changed the title from "Hope to specify a disk size to check point" to "Support to specify a disk quote for intermediate files" on Nov 4, 2020
@kennytm
Collaborator

kennytm commented Nov 4, 2020

by "check point file" you mean those "SST files" in the local backend?

@shuijing198799
Author

shuijing198799 commented Nov 5, 2020

by "check point file" you mean those "SST files" in the local backend?

Yes. I will use "intermediate files" instead.

@kennytm
Collaborator

kennytm commented Nov 6, 2020

Seems we can use https://pkg.go.dev/github.com/cockroachdb/pebble#DB.EstimateDiskUsage to fetch the disk usage.
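
For reference, a minimal sketch of how an engine's on-disk size might be fetched through that API; the directory path and key range below are placeholders, not Lightning's actual layout:

```go
// Sketch: estimate an engine's on-disk size via Pebble.
package main

import (
	"fmt"

	"github.com/cockroachdb/pebble"
)

func estimateEngineSize(db *pebble.DB) (uint64, error) {
	// EstimateDiskUsage returns the approximate on-disk size of the
	// sstables overlapping the given key range.
	start := []byte{0x00}
	end := []byte{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff}
	return db.EstimateDiskUsage(start, end)
}

func main() {
	// "/tmp/engine-demo" is a placeholder path for illustration only.
	db, err := pebble.Open("/tmp/engine-demo", &pebble.Options{})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	size, err := estimateEngineSize(db)
	if err != nil {
		panic(err)
	}
	fmt.Printf("estimated disk usage: %d bytes\n", size)
}
```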

Abstract

Periodically, before a WriteRows, check every engine's total estimated disk usage. If the total disk usage exceeds the "(soft) disk quota", we block the write to the largest engines until the remaining total is less than the quota, and flush the blocked engines' content into TiKV. The engine UUIDs are reused.

This will cause subsequent imports to suffer from range overlapping, which we have to accept as a trade-off.
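
A rough sketch of the selection step described above, with hypothetical types and names: given each engine's estimated size, pick the largest engines to block and flush until the remaining total fits under the (soft) quota.

```go
// Sketch of choosing which engines to flush when the quota is exceeded.
package main

import (
	"fmt"
	"sort"
)

type engineSize struct {
	uuid string // stands in for the engine UUID
	size uint64 // estimated on-disk size in bytes
}

// pickEnginesToFlush returns the engines whose writes should be blocked
// and whose content should be ingested into TiKV.
func pickEnginesToFlush(engines []engineSize, quota uint64) []engineSize {
	var total uint64
	for _, e := range engines {
		total += e.size
	}
	if total <= quota {
		return nil // still within the soft quota, nothing to do
	}
	// Flush the largest engines first so that the fewest engines are blocked.
	sort.Slice(engines, func(i, j int) bool { return engines[i].size > engines[j].size })
	var toFlush []engineSize
	for _, e := range engines {
		if total <= quota {
			break
		}
		toFlush = append(toFlush, e)
		total -= e.size
	}
	return toFlush
}

func main() {
	engines := []engineSize{{"a", 700 << 20}, {"b", 300 << 20}, {"c", 100 << 20}}
	for _, e := range pickEnginesToFlush(engines, 512<<20) {
		fmt.Println("flush engine", e.uuid)
	}
}
```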

Checkpoint validity

The flushing design must be compatible with checkpoints, that is, no data will be lost if we Ctrl+C → resume in the middle of the process. Checkpoints may lag behind the actual progress, so some duplication of data (and of work) should be acceptable and ignored.

Now let's consider the flush process:

  1. ... parallel WriteRows ...
  2. detected quota overflow, start emergency ingest to TiKV
  3. CloseEngine()
    1. Flush()
    2. saveEngineMeta()
  4. ImportEngine()
    1. readAndSplitIntoRange()
    2. loop:
      • SplitAndScatterRegionByRanges()
      • WriteAndIngestByRanges()
  5. Reset engine to empty
  6. ... parallel WriteRows ...

Let's consider what happens regarding the place of interruption (I) and actual saved checkpoint (C):

Case I=3, C<3

Currently, with Local backend, a checkpoint is flushed only when the entire engine is written because Flush() is expensive (#326 (comment)). So the end of step 3 is a good point to save the checkpoint.

If step 3's checkpoint is not recorded, we will restart from the beginning while the engine already contains some incomplete data. This makes us hit step 2 sooner, and some "future" data will be ingested. But this is still fine, since those duplicated KVs from the future are ignored.

Case I=4, C<4

If step 4 is actually completed, all data will have been copied to TiKV. So both C=1 (restart from scratch) and C=3 (import again) should be fine in terms of data, just slower.

Case I=5, C<5

If step 5 is actually completed, the local data is cleaned up. Starting from C=1 should be fine. Starting from C=3 or C=4 will lead to importing an empty database, which is also fine because the data are already sent to TiKV.

Considering these, it should be fine to place a checkpoint immediately before flushing, importing and resetting the engine.
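
To make that ordering concrete, here is a minimal sketch of the flush sequence with checkpoints saved between the steps. All helpers (closeEngine, importEngine, resetEngine, saveCheckpoint) are hypothetical placeholders mirroring steps 3–5 above, not the real Lightning API.

```go
// Sketch of the emergency flush sequence with checkpoints in between.
package main

import "fmt"

type engine struct{ uuid string }

func closeEngine(e *engine) error  { fmt.Println("flush + close", e.uuid); return nil }
func importEngine(e *engine) error { fmt.Println("ingest into TiKV", e.uuid); return nil }
func resetEngine(e *engine) error  { fmt.Println("reset to empty", e.uuid); return nil }

// saveCheckpoint records the current step; resuming from an older checkpoint
// only repeats work, it never loses data (see the cases above).
func saveCheckpoint(e *engine, step string) { fmt.Println("checkpoint:", e.uuid, step) }

func emergencyFlush(e *engine) error {
	if err := closeEngine(e); err != nil { // step 3: Flush() + saveEngineMeta()
		return err
	}
	saveCheckpoint(e, "closed")
	if err := importEngine(e); err != nil { // step 4: split ranges, write + ingest
		return err
	}
	saveCheckpoint(e, "imported")
	if err := resetEngine(e); err != nil { // step 5: engine UUID is reused
		return err
	}
	saveCheckpoint(e, "reset")
	return nil
}

func main() {
	_ = emergencyFlush(&engine{uuid: "engine-0"})
}
```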

Implementation

  1. Every engine provides a StorageSize() uint64 method. The TiDB and Importer backends implement it by returning 0; the Local backend implements it by calculating the total occupied size (see the sketch after this list).

  2. Periodically (how?), compute StorageSize() for every engine and sort the results in ascending order. At the point where the cumulative size exceeds the "quota", mark the remaining (largest) engines for flushing.

    • The "Period" depends on how expensive it is to compute StorageSize().
  3. For every engine marked for flush,

    • Acquire a write lock from the engine's "flush" RWMutex.
    • Do the flush + ingest + clean, writing checkpoint in between
    • If the engine is a data-engine, perform a Flush() on the corresponding index-engine too.
    • Release the write lock
  4. For every deliveryLoop,

    • Before WriteRows(), try to acquire a read lock from the engine's "flush" RWMutex.
    • if the read lock is immediately acquired, do WriteRows() as usual, and continue.
    • otherwise, do the actual read lock acquisition.
    • after the read lock is acquired, immediately write the current file offset to the checkpoint.
    • do WriteRows() as usual.
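
A sketch of points 1, 3 and 4 above, using a hypothetical backend interface and a per-engine "flush" sync.RWMutex. TryRLock requires Go 1.18+, and the real coordination in Lightning may differ.

```go
// Sketch: StorageSize() interface plus flusher/writer coordination.
package main

import (
	"fmt"
	"sync"
)

// StorageSize is point 1: the TiDB/Importer backends would return 0,
// the Local backend would return the engine's occupied bytes.
type backend interface {
	StorageSize() uint64
}

type localEngine struct {
	flushLock sync.RWMutex // the per-engine "flush" RWMutex
	size      uint64
}

func (e *localEngine) StorageSize() uint64 { return e.size }

// flushEngine is point 3: take the write lock, then flush + ingest + clean,
// writing checkpoints in between (details elided).
func flushEngine(e *localEngine) {
	e.flushLock.Lock()
	defer e.flushLock.Unlock()
	fmt.Println("flush + ingest + reset, saving checkpoints in between")
	e.size = 0
}

// deliverRows is point 4: if the read lock is free, write as usual; if a
// flush is in progress, wait for it and then record the current file offset
// in the checkpoint before continuing, because the engine was reset.
func deliverRows(e *localEngine, offset int64, writeRows func()) {
	if !e.flushLock.TryRLock() {
		e.flushLock.RLock()
		fmt.Println("flushed while waiting; checkpoint file offset", offset)
	}
	defer e.flushLock.RUnlock()
	writeRows()
}

func main() {
	e := &localEngine{size: 1 << 30}
	var b backend = e
	fmt.Println("storage size:", b.StorageSize())
	flushEngine(e)
	deliverRows(e, 4096, func() { fmt.Println("WriteRows") })
}
```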

@overvenus
Member

StorageSize() seems to be fast if we calculate the size of the full range.

Also, I suggest we maintain an approximate size, computed as the last calculated storage size + written bytes. Here "written bytes" means the number of bytes we have written to the DB since the last storage-size calculation. This way, we can avoid overwriting accidentally.
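
A small sketch of that suggestion, with assumed names: an atomic counter of bytes written since the last full calculation, so the quota check stays cheap between recalculations.

```go
// Sketch: approximate engine size = last calculated size + bytes written since.
package main

import (
	"fmt"
	"sync/atomic"
)

type sizeTracker struct {
	lastCalculated uint64 // result of the last StorageSize() calculation
	writtenSince   uint64 // bytes written to the DB since that calculation
}

// recordWrite is called on every write with the number of bytes written.
func (t *sizeTracker) recordWrite(n uint64) {
	atomic.AddUint64(&t.writtenSince, n)
}

// approxSize is cheap enough to be checked before every WriteRows.
func (t *sizeTracker) approxSize() uint64 {
	return atomic.LoadUint64(&t.lastCalculated) + atomic.LoadUint64(&t.writtenSince)
}

// recalculate refreshes the baseline from the (expensive) real calculation
// and resets the written-bytes counter.
func (t *sizeTracker) recalculate(realSize uint64) {
	atomic.StoreUint64(&t.lastCalculated, realSize)
	atomic.StoreUint64(&t.writtenSince, 0)
}

func main() {
	var t sizeTracker
	t.recalculate(512 << 20)
	t.recordWrite(64 << 20)
	fmt.Printf("approximate engine size: %d bytes\n", t.approxSize())
}
```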

@kennytm
Collaborator

kennytm commented Nov 7, 2020

This way, we can avoid overwriting accidentally.

Could you elaborate on how this works?

@kennytm changed the title from "Support to specify a disk quote for intermediate files" to "Support to specify a disk quota for intermediate files" on Nov 16, 2020