lightning/backend/local: use batch write instead of ingest on overlap #857
Conversation
```diff
@@ -608,7 +608,6 @@ func (local *local) openEngineDB(engineUUID uuid.UUID, readOnly bool) (*pebble.D
 		L0CompactionThreshold: local.maxOpenFiles / 2,
 		L0StopWritesThreshold: local.maxOpenFiles / 2,
 		MaxOpenFiles:          local.maxOpenFiles,
-		DisableWAL:            true,
```
Why remove this
"We enabled WAL to ensure a flushed Batch Write survives even if Lightning crashes."
it is in the PR description 😓
I think the WAL is to ensure a committed batch can be recovered even if it's not flushed. Without WAL, a flushed batch can still survive even if Lightning crashes? 🤔
BTW I'd like to know the performance regression with WAL on vs off. If it's non-trivial, I doubt it's worth paying this price only to address the rare crash case.
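For reference, here is a minimal, self-contained sketch (not Lightning's actual code) of the durability behaviour being discussed, using Pebble's public API: with the WAL enabled, a batch committed with `pebble.Sync` survives a crash even before it is flushed, while data already flushed to SSTs survives with or without a WAL.

```go
package main

import "github.com/cockroachdb/pebble"

func main() {
	// Open a throwaway DB with the WAL enabled (DisableWAL: false), which is
	// what this PR does for the engine DBs.
	db, err := pebble.Open("demo-db", &pebble.Options{DisableWAL: false})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Commit a batch. With the WAL enabled, pebble.Sync makes the commit
	// durable before Apply returns, even though nothing has been flushed yet.
	b := db.NewBatch()
	if err := b.Set([]byte("k"), []byte("v"), nil); err != nil {
		panic(err)
	}
	if err := db.Apply(b, pebble.Sync); err != nil {
		panic(err)
	}

	// An explicit flush writes the memtable out as an SST, making the data
	// durable regardless of whether a WAL is used.
	if err := db.Flush(); err != nil {
		panic(err)
	}
}
```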
```diff
@@ -1884,7 +1883,24 @@ func (w *LocalWriter) writeKVsOrIngest(desc localIngestDescription) error {
 		}
 	}
 
-	// if write failed only because of unorderedness, we immediately ingest the memcache.
+	// if write failed because of unorderedness, we immediately ingest or write the memcache.
```
Since we have disabled the implicit compaction, the main difference between the ingest and write-batch approaches is the memcache size: for ingest it's 128MB, and for write-batch it's 512MB. So the ingest approach will generate 3 times more SST files than write-batch. This is likely a temporary solution. Maybe a simpler way is to use ingest for the data engine and write batch for the index engine.
Also, the write-batch approach has proved to be very slow, especially when there is only one table with a lot of indexes, so I don't think this change can address the performance problem. Shall we wait for #753 with the manual compaction?
> Maybe a simpler way is to use ingest for data engine and write batch for index engine.

so you mean just remove the overlap check?

> So I don't think this change can address the performance problem.

this PR addresses the problem that (during Disk Quota) FlushAllEngines gets stuck because of having 17000 L0 files. I doubt #753 has tested that scenario.
But the write-batch approach still has this problem; because its batch size is larger, in this scenario the L0 SST file count should be about 17000 / 4, which is still very big. Without compaction (either manual or implicit) we can't fully address this problem. And the write-batch is slow due to the skip-list insert.
We can accept this PR, but this is a temp solution. #753 uses manual compaction to control the SST file number. With a 10TB source file size, it can ensure the total SST count is no more than 3k.
BTW, I'm afraid this approach will introduce a performance regression, because for each write it needs to first check the db iterator, and then fall back to the write-batch solution if the overlap is somewhat severe.
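For context, the overlap check referred to here can be done with a bounded Pebble iterator over the new data's key range. A rough sketch, not the PR's actual code, assuming an older Pebble API where `NewIter` returns only the iterator:

```go
package local // illustrative package name, not the actual Lightning package

import "github.com/cockroachdb/pebble"

// rangeOverlaps reports whether any existing key in db falls inside
// [first, last].
func rangeOverlaps(db *pebble.DB, first, last []byte) bool {
	// UpperBound is exclusive, so extend `last` by one byte to include it.
	upper := append(append([]byte{}, last...), 0)
	iter := db.NewIter(&pebble.IterOptions{
		LowerBound: first,
		UpperBound: upper,
	})
	defer iter.Close()
	// Any key yielded by the bounded iterator means the new data overlaps
	// existing data, in which case ingesting the SST would stack it in L0.
	return iter.First()
}
```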
If we write all data into the index engine by write batch, we may be blocked by the flush thread, because all data must be flushed by a single thread even when a lot of data is being written into the index engine.
Maybe we could split the source data into more than 4 index engines and just merge their results with a merge iterator?
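As a rough illustration of that suggestion, a k-way merge over the sorted output of several index engines could look like the sketch below; `kvStream` and the `kv` struct are hypothetical stand-ins rather than Lightning types, and the merge uses a plain heap instead of any particular Pebble iterator API.

```go
package mergesketch

import (
	"bytes"
	"container/heap"
)

// kv is a hypothetical key/value pair, not a Lightning type.
type kv struct{ key, val []byte }

// kvStream is a hypothetical sorted iterator over one index engine's output.
type kvStream interface {
	Next() (kv, bool) // next pair in key order, or ok=false when exhausted
}

type mergeItem struct {
	cur kv
	src kvStream
}

// mergeHeap orders the head element of each stream by key.
type mergeHeap []mergeItem

func (h mergeHeap) Len() int            { return len(h) }
func (h mergeHeap) Less(i, j int) bool  { return bytes.Compare(h[i].cur.key, h[j].cur.key) < 0 }
func (h mergeHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x interface{}) { *h = append(*h, x.(mergeItem)) }
func (h *mergeHeap) Pop() interface{} {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

// mergeStreams emits the union of all streams in globally sorted key order,
// which is what a merge iterator over several index engines would produce.
func mergeStreams(streams []kvStream, emit func(kv)) {
	h := &mergeHeap{}
	for _, s := range streams {
		if p, ok := s.Next(); ok {
			heap.Push(h, mergeItem{cur: p, src: s})
		}
	}
	for h.Len() > 0 {
		top := heap.Pop(h).(mergeItem)
		emit(top.cur)
		if p, ok := top.src.Next(); ok {
			heap.Push(h, mergeItem{cur: p, src: top.src})
		}
	}
}
```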
@kennytm: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Close outdated PR.
What problem does this PR solve?
Local backend's local writer always uses Ingest SST to copy the content into Pebble. If the SST overlaps with existing data in Pebble, it will be placed at a higher level, eventually placing everything at L0. This greatly increases read amplification and makes Flush() (a prerequisite of disk quota) extremely slow.
What is changed and how it works?
If the SST to be ingested overlaps with some existing data, we switch to using Batch Write instead so that Pebble can better rearrange the data.
Replaces #843.
We enabled WAL to ensure a flushed Batch Write survives even if Lightning crashes.
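For illustration only, a simplified sketch of this flow (not the actual LocalWriter code) might look as follows; `kvPair` and `rangeOverlaps` (the bounded-iterator probe sketched in the review thread above) are illustrative placeholders.

```go
package local // illustrative package, not the actual lightning backend package

import "github.com/cockroachdb/pebble"

// kvPair stands in for an entry of the writer's in-memory KV cache.
type kvPair struct{ key, val []byte }

// writeSortedRun ingests the prebuilt SST when its key range does not overlap
// existing data; otherwise it falls back to a write batch so Pebble can
// rearrange the keys instead of stacking SSTs in L0.
func writeSortedRun(db *pebble.DB, kvs []kvPair, sstPath string) error {
	if len(kvs) == 0 {
		return nil
	}
	first, last := kvs[0].key, kvs[len(kvs)-1].key
	if !rangeOverlaps(db, first, last) {
		// Fast path: no overlap, so the ingested SST can be placed at a low level.
		return db.Ingest([]string{sstPath})
	}
	// Overlap detected: commit the same KV pairs through a write batch.
	// With the WAL enabled, a synced batch survives a Lightning crash.
	b := db.NewBatch()
	for _, p := range kvs {
		if err := b.Set(p.key, p.val, nil); err != nil {
			return err
		}
	}
	return db.Apply(b, pebble.Sync)
}
```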
Check List
Tests
Code changes
Side effects
Related changes
Release Note