[BCFR-900] Add id column as PRIMARY KEY for evm.logs & evm.log_poller_blocks #14451
Conversation
Force-pushed from f2f440e to a7415ca
Force-pushed from 9cfbec6 to 94b9a15
Great improvements, left some minor comments
@@ -136,6 +136,18 @@ func (o *ObservedORM) DeleteLogsAndBlocksAfter(ctx context.Context, start int64)
	})
}

func (o *ObservedORM) DeleteLogsByRowID(ctx context.Context, rowIDs []uint64) (int64, error) {
❤️
CREATE UNIQUE INDEX idx_logs_chain_block_logindex ON evm.logs (evm_chain_id, block_number, log_index);
ALTER TABLE evm.log_poller_blocks DROP CONSTRAINT log_poller_blocks_pkey;
ALTER TABLE evm.log_poller_blocks ADD COLUMN id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY;
DROP INDEX IF EXISTS evm.idx_evm_log_poller_blocks_order_by_block;
Is it because we need to make it unique?
DROP INDEX IF EXISTS evm.idx_evm_log_poller_blocks_block_number_evm_chain_id;
CREATE UNIQUE INDEX idx_log_poller_blocks_chain_block ON evm.log_poller_blocks (evm_chain_id, block_number DESC);
DROP INDEX IF EXISTS evm.idx_evm_logs_ordered_by_block_and_created_at;
CREATE INDEX idx_logs_chain_address_event_block_logindex ON evm.logs (evm_chain_id, address, event_sig, block_number, log_index);
Where do we rely on that index?
I think most of the queries of the logs table should be able to use it. For example, the query for SelectLogs()
is:
query := logsQuery(`
WHERE evm_chain_id = :evm_chain_id
AND address = :address
AND event_sig = :event_sig
AND block_number >= :start_block
AND block_number <= :end_block
ORDER BY block_number, log_index`)
Like most of the logs queries, it filters on chain id, address, and event sig, and then sorts on block_number, log_index (the order in which logs appear on chain).
I got rid of a similar index we had for created_at, which was used at one point for ordering instead of block_number, log_index, but we no longer use created_at for anything since it doesn't reliably indicate the order in which the events occurred on chain.
Yeah, that makes sense, but isn't that already covered by one of the existing indexes?
create index evm_logs_idx
on evm.logs (evm_chain_id, block_number, address, event_sig);
create index evm_logs_by_timestamp
on evm.logs (evm_chain_id, address, event_sig, block_timestamp, block_number);
Usually, there are not many matching logs within a single block, so the gain from adding log_index to the index probably won't be that high, right? Postgres should be able to sort those small datasets efficiently in memory using log_index.
I know you are replacing the created_at index with the new one, so there should not be any impact on the db size because of that - there were some teams in the past very unhappy with us adding more indexes and growing the db :D I'm fine with having that exploration in a separate PR/ticket.
postgres should be able to sort those small datasets efficiently by using the log_index in memory.
That's true, I'd be fine with omitting the log_index column as it's not adding much. On the other hand, I wouldn't expect that including it would increase the size of that index much.
isn't that already covered by one of the existing indexes?
I figured it probably can't use either of those for this, but admittedly I haven't tested this.
My general understanding of how these BTREE indices work, based on what I've seen in previous plans, is that you have to have all of the columns you're filtering on first, and then the sort columns after. Assuming that is how BTREEs work (I'm not 100% sure about that), the evm_chain_id, block_number, address, event_sig
index will help for block range queries where you're looking for events of specific types within a particular block range. While this one would help with ordering the results for events of a specific type.
The one with block_timestamp in it seems more likely to work. If Postgres had a way of knowing for sure that block numbers and block timestamps will be ordered in the same way, then it should be able to use it... but I don't think it can know that, so my assumption would be that it probably won't use it.
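As a quick sanity check of the "filter columns first, sort columns after" intuition, here is a minimal sketch using SQLite rather than Postgres (so the planner differs, but the BTREE principle is the same); table and index names mirror the ones in this PR:

```python
import sqlite3

# Minimal sketch: an index with the equality-filter columns first and the
# sort columns last can satisfy both the WHERE clause and the ORDER BY.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logs (
        evm_chain_id INTEGER, address TEXT, event_sig TEXT,
        block_number INTEGER, log_index INTEGER, data BLOB
    );
    CREATE INDEX idx_logs_chain_address_event_block_logindex
        ON logs (evm_chain_id, address, event_sig, block_number, log_index);
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM logs
    WHERE evm_chain_id = 1 AND address = 'a' AND event_sig = 's'
      AND block_number >= 10 AND block_number <= 20
    ORDER BY block_number, log_index
""").fetchall()

# Each plan row's last element is a human-readable description of the step.
details = " | ".join(row[3] for row in plan)
print(details)
# The plan searches the composite index, and no "USE TEMP B-TREE FOR ORDER BY"
# step appears: the index order already covers the sort.
```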
Side thought: from a database perspective, block_timestamp
really ought to have been a column of the blocks
table rather than the logs
table. We could have easily joined those on block_number to search by block_timestamp without much cost. I guess the reason why it's in the logs
table is mostly so that it can be returned by the queries that return []Log
and to keep the logs
schema in line with that data struct. Not sure if this decision might be something to revisit in the future.
the evm_chain_id, block_number, address, event_sig index will help for block range queries where you're looking for events of specific types within a particular block range
After writing this I realized this is not right. If this were the only use case, then we'd be able to get rid of the evm_chain_id, block_number, address, event_sig index, and just use evm_chain_id, address, event_sig, block_number, log_index for everything... because the SELECT queries where we're looking for events in a particular block range also filter on event_sig and address... so it can just use the evm_chain_id, address, event_sig, block_number, log_index index.
Unfortunately, I just looked through the code and I do see one query that filters on block range but not address or event sig. It's where we delete all of the logs and blocks after a particular block due to a re-org:
DELETE FROM evm.logs
WHERE evm_chain_id = $1
AND block_number >= $2
AND block_number <= (SELECT MAX(block_number) FROM evm.logs WHERE evm_chain_id = $1)
So I think we can get rid of the address, event_sig columns at the end of this index to save some space.
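As a rough cross-check (again SQLite rather than Postgres, so planner behavior may differ), an index whose prefix is (evm_chain_id, block_number) can indeed serve a block-range DELETE that filters only on chain id and block number, with no address/event_sig columns needed:

```python
import sqlite3

# Sketch: a (chain id, block number, ...) index prefix is enough for the
# re-org DELETE above, which filters on chain id and a block-number range.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logs (evm_chain_id INTEGER, block_number INTEGER, log_index INTEGER);
    CREATE UNIQUE INDEX idx_logs_chain_block_logindex
        ON logs (evm_chain_id, block_number, log_index);
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    DELETE FROM logs WHERE evm_chain_id = 1 AND block_number >= 5
""").fetchall()

details = " | ".join(row[3] for row in plan)
print(details)  # the plan searches the unique index rather than scanning the table
```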
DROP INDEX IF EXISTS evm.idx_evm_log_poller_blocks_order_by_block;
DROP INDEX IF EXISTS evm.idx_evm_log_poller_blocks_block_number_evm_chain_id;
CREATE UNIQUE INDEX idx_log_poller_blocks_chain_block ON evm.log_poller_blocks (evm_chain_id, block_number DESC);
DROP INDEX IF EXISTS evm.idx_evm_logs_ordered_by_block_and_created_at;
I think it would be great to add a comment to every index creation/removal for future generations to easier identify the index with the relevant query ;)
ALTER TABLE evm.logs DROP CONSTRAINT logs_pkey;
ALTER TABLE evm.logs ADD COLUMN id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY;
CREATE UNIQUE INDEX idx_logs_chain_block_logindex ON evm.logs (evm_chain_id, block_number, log_index);
Do we need it to be an index? I think for most of the cases we query by at least (evm_chain_id, address, event_sig) so this index might not be needed, but you need to verify that.
If it's only about uniqueness then adding constraints will be more lightweight than indexes
That's a good point, I'll check on this.
After checking on this, I realized why I originally made it an index.
I don't think it's true that a unique constraint is more lightweight. Every unique constraint automatically generates a unique index... the only difference is that we pick the name here rather than it picking one automatically.
I mentioned in another comment that we need at least (chain id, block number) as an index, for the DeleteLogsAndBlocksAfter query. But if we add a unique constraint, that would generate an entire second index... taking up twice the disk space. By doing it this way instead, all we have to do is add log_index to the end and then there is no need to have 2 separate indexes... which is what we had before (due to the one generated by the primary key).
In the end I decided it seems best to leave everything as-is, but I added a lot of comments explaining what's going on and separated the tables out as you suggested.
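The point that a UNIQUE constraint is backed by an auto-generated unique index is easy to see in miniature with SQLite (Postgres behaves the same way, just naming the index after the constraint):

```python
import sqlite3

# Illustration: declaring a UNIQUE constraint makes the engine create a
# unique index behind the scenes, so the constraint is not "lighter" than
# an explicitly created unique index.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE blocks (
        evm_chain_id INTEGER,
        block_number INTEGER,
        UNIQUE (evm_chain_id, block_number)
    )
""")

# PRAGMA index_list reports the auto-generated index backing the constraint.
# Row format: (seq, name, unique, origin, partial); origin 'u' means the
# index was created by a UNIQUE constraint.
indexes = conn.execute("PRAGMA index_list('blocks')").fetchall()
for _, name, unique, origin, _ in indexes:
    print(name, unique, origin)
```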
// Remove up to limit blocks at a time, until we've reached the limit or removed everything eligible for deletion
var deleted, rows int64
for limitBlock += (limit - 1); deleted < limit; limitBlock += limit {
Hmm, why do you need to define limitBlock outside of the loop? Wouldn't that work with just
for limitBlock := (limit - 1); deleted < limit; limitBlock += limit {
It has to be outside the loop because it needs to be initialized as MIN(block_number) + (limit - 1), and the first part of that requires a one-time query. You may have missed that it's in this line above:
err = o.ds.GetContext(ctx, &limitBlock, `SELECT MIN(block_number) FROM evm.log_poller_blocks`)
Some additional context: originally, I was hoping we could just include the MIN(block_number) in the DELETE query that's inside the loop. But apparently aggregate functions like MIN are not allowed in DELETE queries.
Force-pushed from 5e3683a to dfbf614
Also:
- Add UNIQUE INDEXes to replace previous primary keys (still necessary, both for optimizing queries and for enforcing uniqueness constraints)
- Replace all SELECT *'s with helper functions for selecting all columns
- Refactor nestedBlockQuery into withConfs, and make a bit more use of it
Some of the columns in these indexes (such as created_at) are no longer used. Others were not optimized for the queries we need.
Previously it was using fromBlock > :block_number, which is inconsistent with how fromBlock is used in the other queries.
Also, adds a missing CREATE INDEX line that was left out of the Goose Down section:
CREATE INDEX idx_evm_log_poller_blocks_order_by_block ON evm_log_poller_blocks (evm_chain_id, block_number DESC);
And removes an extraneous DROP line that was in the Goose Up section:
DROP INDEX IF EXISTS evm.idx_logs_chain_address_event_block_logindex;
Force-pushed from dfbf614 to b100ded
Force-pushed from 6129520 to 77862b0
I see you updated files related to
One of the queries misses a chainID clause, but otherwise looks good. Great job!
core/chains/evm/logpoller/orm.go
Outdated
WHERE block_number <= $1 AND evm_chain_id = $2`, end, ubig.New(o.chainID))

var limitBlock int64
err = o.ds.GetContext(ctx, &limitBlock, `SELECT MIN(block_number) FROM evm.log_poller_blocks`)
- err = o.ds.GetContext(ctx, &limitBlock, `SELECT MIN(block_number) FROM evm.log_poller_blocks`)
+ err = o.ds.GetContext(ctx, &limitBlock, `SELECT MIN(block_number) FROM evm.log_poller_blocks WHERE evm_chain_id = $1`, ubig.New(o.chainID))
Thanks, nice catch!
WHERE block_number <= $1 AND evm_chain_id = $2`, end, ubig.New(o.chainID))

var limitBlock int64
err = o.ds.GetContext(ctx, &limitBlock, `SELECT MIN(block_number) FROM evm.log_poller_blocks`)
if err != nil {
	return 0, err
nit: Should we handle errNoRows here?
The query for pruning expired logs with a max limit set was taking longer than it should. This was due in part to needing to join on an awkward combination of columns, due to there being no single primary key.
- Add id column as PRIMARY KEY for evm.logs & evm.log_poller_blocks
- Join on id column instead of previous primary keys
- Replace all SELECT *'s with helper functions for selecting all columns
- Refactor nestedBlockQuery into withConfs, and make a bit more use of it

Motivation
While adding the id column, we can't just remove the old primary key because the index on it was helping to accelerate some queries.
Instead of just resurrecting it as-is, I took the opportunity to clean up several of the indices on the logs & blocks tables. Some indexed columns (e.g. created_at) were never actually being used, while others were not ordered in the most optimal way for accelerating the queries we have. Also, at least one of them was redundant with the primary key, just in a different order.
After testing the performance of the DeleteExpiredLogs query, it was found that using the id column makes a huge difference to the part of the query involving actual expired logs. But it makes no difference to the part which finds logs not matching any filter--which is the main thing keeping the query from being faster. To address this, the pruning of unmatched logs has been separated off into its own method PruneExpiredLogs, which only runs every 20th time PruneExpiredLogs succeeds. Both of the queries were faster on their own than together anyway, and there is no reason to prune logs corresponding to old jobs as frequently as expired logs for present jobs, since those are not actively growing and are only a long-term issue.

This also includes an update to the DeleteBlocksBefore query which drastically improves its performance. Instead of using a LIMIT clause, which complicates the DELETE query, it uses an iterative range filter by block number, starting at MIN(block_number) and continuing upwards in windows of limit blocks until it has either removed everything it can or hit the limit. (Hitting the limit on # of blocks deleted can take more than 1 iteration because we don't necessarily save every block.)

BCFR-900