This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

lightning/backend/local: use batch write instead of ingest on overlap #857

Closed
wants to merge 1 commit

Conversation

Collaborator

@kennytm kennytm commented Mar 11, 2021

What problem does this PR solve?

Local backend's local writer always uses Ingest SST to copy the content into Pebble. If the SST overlaps with existing data in Pebble, it will be placed at a higher level, eventually ending up with everything at L0. This greatly increases read amplification and makes Flush() (a prerequisite of the disk quota feature) extremely slow.

What is changed and how it works?

If the SST to be ingested overlaps with some existing data, we switch to a Batch Write instead so Pebble can better rearrange the data.

Replaces #843.

We enabled WAL to ensure a flushed Batch Write survives even if Lightning crashes.
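
To illustrate the approach described above, here is a minimal sketch in Go against pebble's public Batch/Ingest API (the `kvPair` type, the `writeOrIngest` name, and the precomputed `overlaps` flag are illustrative, not the actual Lightning code):

```go
package sketch

import "github.com/cockroachdb/pebble"

type kvPair struct {
	key, val []byte
}

// writeOrIngest: when the pre-built SST does not overlap existing keys,
// ingest it directly; otherwise fall back to a batch write so Pebble can
// merge the keys into the appropriate levels instead of stacking
// overlapping SSTs at L0.
func writeOrIngest(db *pebble.DB, kvs []kvPair, sstPath string, overlaps bool) error {
	if !overlaps {
		return db.Ingest([]string{sstPath})
	}
	batch := db.NewBatch()
	defer batch.Close()
	for _, kv := range kvs {
		if err := batch.Set(kv.key, kv.val, nil); err != nil {
			return err
		}
	}
	// Sync relies on the WAL (re-enabled by this PR) to persist the batch.
	return batch.Commit(pebble.Sync)
}
```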

Check List

Tests

  • Unit test

Code changes

Side effects

Related changes

  • Need to cherry-pick to the release branch

Release Note

  • No release note

@@ -608,7 +608,6 @@ func (local *local) openEngineDB(engineUUID uuid.UUID, readOnly bool) (*pebble.D
L0CompactionThreshold: local.maxOpenFiles / 2,
L0StopWritesThreshold: local.maxOpenFiles / 2,
MaxOpenFiles: local.maxOpenFiles,
- DisableWAL: true,
Collaborator


Why remove this?

Collaborator Author


"We enabled WAL to ensure a flushed Batch Write survives even if Lightning crashes."

it is in the PR description 😓

Collaborator


I think the WAL is to ensure a committed batch can be recovered even if it's not flushed. Without the WAL, a flushed batch should still survive even if Lightning crashes? 🤔

BTW I'd like to know the performance difference with the WAL on and off. If it's non-trivial, I doubt it's worth paying this price only to address the rare crash case.
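
For reference, a minimal sketch of the two knobs under discussion, using pebble's `Options.DisableWAL` field and the `Sync`/`NoSync` write options (the `commitWithWAL` function is illustrative, not Lightning code):

```go
package sketch

import "github.com/cockroachdb/pebble"

// With DisableWAL left at its default (false), a committed batch can be
// recovered after a crash even before the memtable is flushed to an SST.
// pebble.Sync additionally waits for the WAL write to reach disk, while
// pebble.NoSync skips that wait and trades durability for throughput,
// which is where any regression would show up.
func commitWithWAL(dir string, key, val []byte) error {
	db, err := pebble.Open(dir, &pebble.Options{
		// DisableWAL: true, // removed by this PR, so the WAL stays on
	})
	if err != nil {
		return err
	}
	defer db.Close()

	batch := db.NewBatch()
	defer batch.Close()
	if err := batch.Set(key, val, nil); err != nil {
		return err
	}
	return batch.Commit(pebble.Sync)
}
```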

@@ -1884,7 +1883,24 @@ func (w *LocalWriter) writeKVsOrIngest(desc localIngestDescription) error {
}
}

- // if write failed only because of unorderedness, we immediately ingest the memcache.
+ // if write failed because of unorderedness, we immediately ingest or write the memcache.
Collaborator


Since we have disabled implicit compaction, the main difference between the ingest and write-batch approaches is the memcache size: for ingest it's 128MB, and for write-batch it's 512MB. So the ingest approach will generate 3 times more SST files than write-batch. This is likely a temporary solution. Maybe a simpler way is to use ingest for data engine and write batch for index engine.

The write-batch approach has proved to be very slow, especially when there is only one table with a lot of indexes. So I don't think this change can address the performance problem. Shall we wait for #753 with its manual compaction?
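
To make the size ratio concrete with an illustrative figure: flushing, say, 64GB of engine data in 128MB memcache chunks produces roughly 512 SST files, while 512MB chunks produce roughly 128, so the ingest path ends up with about four times as many files.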

Collaborator Author


Maybe a simpler way is to use ingest for data engine and write batch for index engine.

so you mean just remove the overlap check?

So I don't think this change can address the performance problem.

this PR addresses the problem that (during Disk Quota) FlushAllEngines gets stuck because of having 17000 L0 files. I doubt #753 has tested that scenario.

Collaborator

@glorv glorv Mar 13, 2021


But the write-batch approach still has this problem; because its batch size is larger, in this scenario the number of L0 SST files should be about 17000 / 4, which is still very big. Without compaction (either manual or implicit) we can't fully address this problem. And the write batch is slow due to skip-list inserts.

We can accept this PR, but it is a temporary solution. #753 uses manual compaction to control the SST file count. With a 10TB source file size, it can ensure the total number of SST files is no more than 3k.

Collaborator


BTW, I'm afraid this approach will introduce a performance regression, because each write needs to first check the db iterator and then fall back to the write-batch path if the overlap is somewhat severe.

Collaborator


If we write all data via write batch in the index engine, we may be blocked by the flush thread, because all data must be flushed by only one thread even if a large amount of data is being written into the index engine.

Collaborator


Maybe we could split the source data into more than 4 index engines and just merge all of their results with a merge iterator?

@ti-chi-bot
Member

@kennytm: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@glorv
Collaborator

glorv commented Jul 14, 2021

Close outdated PR.

@glorv glorv closed this Jul 14, 2021