
[BCFR-900] Add id column as PRIMARY KEY for evm.logs & evm.log_poller_blocks #14451

Merged: 17 commits merged into develop from BCFR-900-log-poller-id-columns, Oct 1, 2024

Conversation

@reductionista (Contributor) commented Sep 16, 2024

The query for pruning expired logs with a max limit set was taking longer than it should. This was due in part to needing to join on an awkward combination of columns, since there was no single primary key.

  • Add id column as PRIMARY KEY for evm.logs & evm.log_poller_blocks

  • Join on id column instead of previous primary keys

  • Replace all SELECT *'s with helper functions for selecting all columns

  • Refactor nestedBlockQuery into withConfs, and make a bit more use of it

Motivation

  • While adding the id column, we can't just remove the old primary key, because the index backing it was helping to accelerate some queries.
    Rather than resurrecting it as-is, I took the opportunity to clean up several of the indices on the logs & blocks tables. Some indexed columns (e.g. created_at) were never actually being used,
    others were not ordered in the most optimal way for accelerating the queries we have, and at least one index was redundant with the primary key, just in a different order.

  • After testing the performance of the DeleteExpiredLogs query, it was found that using the id column makes a huge difference to the part of the query involving actual expired logs. But it makes no difference to the part which finds logs not matching any filter, which is the main thing keeping the query from being faster. To address this, the pruning of unmatched logs has been separated off into its own method, which is run only every 20th time PruneExpiredLogs succeeds. Both queries were faster on their own than combined anyway, and there is no reason to prune logs corresponding to old jobs as frequently as expired logs for current jobs, since the former are not actively growing and are only a long-term concern.

  • This also includes an update to the DeleteBlocksBefore query which drastically improves its performance. Instead of using a LIMIT clause, which complicates the DELETE query, it uses an iterative range filter by block number, starting at MIN(block_number) and continuing upwards in windows of limit blocks until it has either removed everything it can or hit the limit; a sketch follows below. (Hitting the limit on the number of blocks deleted can take more than one iteration, because we don't necessarily save every block.)

BCFR-900
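
A minimal sketch of the iterative windowed delete described above, written against plain database/sql (the function name, signature, and exact SQL are illustrative, not the PR's actual code):

    import (
        "context"
        "database/sql"
    )

    // deleteBlocksBefore removes blocks with block_number <= end in windows of
    // `limit` block numbers, stopping once `limit` rows have been deleted or
    // nothing eligible remains.
    func deleteBlocksBefore(ctx context.Context, db *sql.DB, chainID, end, limit int64) (int64, error) {
        // Aggregates like MIN aren't allowed directly in a DELETE, so fetch the
        // starting point with a separate one-time query.
        var minBlock sql.NullInt64
        err := db.QueryRowContext(ctx,
            `SELECT MIN(block_number) FROM evm.log_poller_blocks WHERE evm_chain_id = $1`,
            chainID).Scan(&minBlock)
        if err != nil {
            return 0, err
        }
        if !minBlock.Valid {
            return 0, nil // no blocks saved for this chain yet
        }
        var deleted int64
        // Advance the window's upper bound by `limit` each pass. A single pass can
        // delete fewer than `limit` rows, since not every block number in the
        // window necessarily has a saved block.
        for limitBlock := minBlock.Int64 + limit - 1; deleted < limit; limitBlock += limit {
            upper := limitBlock
            if upper > end {
                upper = end
            }
            res, err := db.ExecContext(ctx,
                `DELETE FROM evm.log_poller_blocks WHERE evm_chain_id = $1 AND block_number <= $2`,
                chainID, upper)
            if err != nil {
                return deleted, err
            }
            rows, err := res.RowsAffected()
            if err != nil {
                return deleted, err
            }
            deleted += rows
            if upper >= end {
                break // removed everything eligible below `end`
            }
        }
        return deleted, nil
    }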

@reductionista requested review from a team as code owners September 16, 2024 21:47
@reductionista requested review from EasterTheBunny and removed request for a team September 16, 2024 21:47
@reductionista force-pushed the BCFR-900-log-poller-id-columns branch 2 times, most recently from f2f440e to a7415ca, September 24, 2024 02:16
@reductionista requested review from a team as code owners September 24, 2024 02:16
@reductionista force-pushed the BCFR-900-log-poller-id-columns branch 2 times, most recently from 9cfbec6 to 94b9a15, September 24, 2024 03:31
@reductionista changed the title from "Add id column as PRIMARY KEY for evm.logs & evm.log_poller_blocks" to "[BCFR-900] Add id column as PRIMARY KEY for evm.logs & evm.log_poller_blocks" Sep 24, 2024
@mateusz-sekara (Collaborator) left a comment

Great improvements, left some minor comments

core/chains/evm/logpoller/log_poller.go (resolved)
core/chains/evm/logpoller/log_poller.go (outdated, resolved)
@@ -136,6 +136,18 @@ func (o *ObservedORM) DeleteLogsAndBlocksAfter(ctx context.Context, start int64)
})
}

func (o *ObservedORM) DeleteLogsByRowID(ctx context.Context, rowIDs []uint64) (int64, error) {
Collaborator:

❤️

CREATE UNIQUE INDEX idx_logs_chain_block_logindex ON evm.logs (evm_chain_id, block_number, log_index);
ALTER TABLE evm.log_poller_blocks DROP CONSTRAINT log_poller_blocks_pkey;
ALTER TABLE evm.log_poller_blocks ADD COLUMN id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY;
DROP INDEX IF EXISTS evm.idx_evm_log_poller_blocks_order_by_block;
Collaborator:

Is it because we need to make it unique?

DROP INDEX IF EXISTS evm.idx_evm_log_poller_blocks_block_number_evm_chain_id;
CREATE UNIQUE INDEX idx_log_poller_blocks_chain_block ON evm.log_poller_blocks (evm_chain_id, block_number DESC);
DROP INDEX IF EXISTS evm.idx_evm_logs_ordered_by_block_and_created_at;
CREATE INDEX idx_logs_chain_address_event_block_logindex ON evm.logs (evm_chain_id, address, event_sig, block_number, log_index);
Collaborator:

Where do we rely on that index?

@reductionista (author):

I think most of the queries of the logs table should be able to use it. For example, the query for SelectLogs() is:

    query := logsQuery(`
        WHERE evm_chain_id = :evm_chain_id
        AND address = :address
        AND event_sig = :event_sig
        AND block_number >= :start_block
        AND block_number <= :end_block
        ORDER BY block_number, log_index`)

Like most of the logs queries, it filters on chain id, address, and event sig, and then sorts on block_number, log_index (the order in which the logs appear on chain).

@reductionista (author):

I got rid of a similar index we had for created_at, which was used at one point for ordering instead of block_number, log_index but we no longer use created_at for anything since it doesn't reliably indicate the order in which the events occurred on chain.

@mateusz-sekara (Collaborator), Sep 25, 2024:

Yeah, that makes sense, but isn't that already covered by one of the existing indexes?

create index evm_logs_idx
    on evm.logs (evm_chain_id, block_number, address, event_sig);

create index evm_logs_by_timestamp
    on evm.logs (evm_chain_id, address, event_sig, block_timestamp, block_number);

Usually there are not many matching logs within a single block, so the gain from adding log_index to the index probably won't be that high, right? Postgres should be able to sort those small result sets efficiently by log_index in memory.

I know you are replacing the created_at index with the new one, so there should not be any impact on the db size because of that - there were some teams in the past very unhappy with us adding more indexes and growing the db :D I'm fine with having that exploration in a separate PR/ticket.

@reductionista (author), Sep 25, 2024:

> Postgres should be able to sort those small datasets efficiently by using the log_index in memory.

That's true, I'd be fine with omitting the log_index column as it's not adding much. On the other hand, I wouldn't expect that including it would increase the size of that index much.

> isn't that already covered by one of the existing indexes?

I figured it probably can't use either of those for this, but admittedly I haven't tested this.

My general understanding of how these BTREE indices work, based on what I've seen in previous query plans, is that all of the columns you're filtering on have to come first, followed by the sort columns. Assuming that is how BTREEs work (I'm not 100% sure about it), the evm_chain_id, block_number, address, event_sig index will help for block range queries where you're looking for events of specific types within a particular block range, while this one helps with ordering the results for events of a specific type.

The one with block_timestamp in it seems more likely to work. If it had a way of knowing for sure that block numbers and block timestamps will be ordered in the same way, then it should be able to use it... but I don't think it can know that so my assumption would be that it probably won't use it.
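
One way to settle this without reasoning from first principles (a hypothetical psql session, not something run as part of this PR; the address and event_sig literals are dummies):

    -- Check which index the planner picks for a typical SelectLogs-shaped query.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM evm.logs
    WHERE evm_chain_id = 1
      AND address = '\x1111111111111111111111111111111111111111'
      AND event_sig = '\x2222222222222222222222222222222222222222222222222222222222222222'
      AND block_number BETWEEN 100 AND 200
    ORDER BY block_number, log_index;
    -- With (evm_chain_id, address, event_sig, block_number, log_index) available,
    -- the plan should be a single Index Scan with no separate Sort node; with only
    -- (evm_chain_id, block_number, address, event_sig), expect a Sort on top.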

@reductionista (author):

Side thought: from a database perspective, block_timestamp really ought to have been a column of the blocks table rather than the logs table. We could have easily joined those on block_number to search by block_timestamp without much cost. I guess the reason why it's in the logs table is mostly so that it can be returned by the queries that return []Log and to keep the logs schema in line with that data struct. Not sure if this decision might be something to revisit in the future.
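
For illustration, assuming block_timestamp lived only on the blocks table (a hypothetical schema, not the current one), the join would look something like:

    -- Recover block_timestamp via a join instead of denormalizing it onto every log row.
    SELECT l.*, b.block_timestamp
    FROM evm.logs l
    JOIN evm.log_poller_blocks b
      ON b.evm_chain_id = l.evm_chain_id
     AND b.block_number = l.block_number
    WHERE l.evm_chain_id = 1
      AND b.block_timestamp >= now() - interval '1 day';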

@reductionista (author), Sep 25, 2024:

> the evm_chain_id, block_number, address, event_sig index will help for block range queries where you're looking for events of specific types within a particular block range

After writing this I realized this is not right. If this were the only use case, then we'd be able to get rid of the evm_chain_id, block_number, address, event_sig index, and just use evm_chain_id, address, event_sig, block_number, log_index for everything... because the SELECT queries where we're looking for events in a particular block range also filter on event_sig and address... so it can just use the evm_chain_id, address, event_sig, block_number, log_index index.

Unfortunately, I just looked through the code and I do see one query that filters on block range but not address or event sig. It's where we delete all of the logs and blocks after a particular block due to a re-org:

    DELETE FROM evm.logs
      WHERE evm_chain_id = $1
        AND block_number >= $2
        AND block_number <= (SELECT MAX(block_number) FROM evm.logs WHERE evm_chain_id = $1)

So I think we can get rid of the address, event_sig columns at the end of this index to save some space.
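
As a sketch, that follow-up slimming might look like this (hypothetical, not part of this PR):

    -- The (evm_chain_id, block_number) lookups in DeleteLogsAndBlocksAfter can be
    -- served by the prefix of the new unique index, so the wide index could simply
    -- be dropped rather than rebuilt without its trailing columns:
    DROP INDEX evm.evm_logs_idx;
    -- (evm_chain_id, block_number) scans then fall through to idx_logs_chain_block_logindex.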

core/chains/evm/logpoller/parser.go (resolved)
DROP INDEX IF EXISTS evm.idx_evm_log_poller_blocks_order_by_block;
DROP INDEX IF EXISTS evm.idx_evm_log_poller_blocks_block_number_evm_chain_id;
CREATE UNIQUE INDEX idx_log_poller_blocks_chain_block ON evm.log_poller_blocks (evm_chain_id, block_number DESC);
DROP INDEX IF EXISTS evm.idx_evm_logs_ordered_by_block_and_created_at;
Collaborator:

I think it would be great to add a comment to every index creation/removal, so future generations can more easily match each index to the relevant query ;)


ALTER TABLE evm.logs DROP CONSTRAINT logs_pkey;
ALTER TABLE evm.logs ADD COLUMN id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY;
CREATE UNIQUE INDEX idx_logs_chain_block_logindex ON evm.logs (evm_chain_id, block_number, log_index);
@mateusz-sekara (Collaborator), Sep 24, 2024:

Do we need it to be an index? I think in most cases we query by at least (evm_chain_id, address, event_sig), so this index might not be needed, but you'd need to verify that.

If it's only about uniqueness, then adding a constraint would be more lightweight than an index.

@reductionista (author):

That's a good point, I'll check on this.

@reductionista (author), Sep 26, 2024:

After checking on this, I realized why I originally made it an index.

I don't think it's true that a unique constraint is more lightweight. Every unique constraint generates a unique index automatically... the only difference is that we pick the name here rather than having one picked automatically.

I mentioned in another comment that we need at least (chain id, block number) as an index for the DeleteLogsAndBlocksAfter query. But if we added a unique constraint on top of that, it would generate an entire second index... taking up twice the disk space. By doing it this way instead, all we have to do is add log_index to the end of the index we already need, and then there's no need for 2 separate indexes... which is what we had before (due to the one generated by the primary key).

In the end I decided it seems best to leave everything as-is, but I added a lot of comments explaining what's going on and separated the tables out as you suggested.
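
For reference, the two forms compared (illustrative SQL; the constraint name is made up):

    -- Adding a UNIQUE constraint creates a unique index behind the scenes:
    ALTER TABLE evm.logs
      ADD CONSTRAINT logs_chain_block_logindex_key
      UNIQUE (evm_chain_id, block_number, log_index);
    -- ...so it costs the same disk space as the explicit form used in this
    -- migration, which additionally lets us pick the name and, in general, use
    -- index features a constraint can't, such as DESC ordering or a partial
    -- WHERE predicate:
    CREATE UNIQUE INDEX idx_logs_chain_block_logindex
      ON evm.logs (evm_chain_id, block_number, log_index);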


// Remove up to limit blocks at a time, until we've reached the limit or removed everything eligible for deletion
var deleted, rows int64
for limitBlock += (limit - 1); deleted < limit; limitBlock += limit {
@mateusz-sekara (Collaborator), Sep 24, 2024:

Hmm, why do you need to define limitBlock outside of the loop? Wouldn't that work with just

	for limitBlock := (limit - 1); deleted < limit; limitBlock += limit {

@reductionista (author), Sep 24, 2024:

It has to be outside the loop because it needs to be initialized as MIN(block_number) + (limit - 1), and the first part of that requires a one-time query. You may have missed that it's in this line above:

err = o.ds.GetContext(ctx, &limitBlock, `SELECT MIN(block_number) FROM evm.log_poller_blocks`)

Some additional context: originally, I was hoping we could just include the MIN(block_number) in the DELETE query that's inside the loop. But apparently aggregate functions like MIN are not allowed in DELETE queries.
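
To illustrate the restriction (the scalar-subquery workaround is valid SQL, but wouldn't help here since the Go loop needs the value anyway):

    -- Not allowed: an aggregate directly in DELETE's WHERE clause.
    DELETE FROM evm.log_poller_blocks
     WHERE block_number < MIN(block_number) + 100;
    -- ERROR:  aggregate functions are not allowed in WHERE

    -- Allowed: compute the aggregate in a scalar subquery instead.
    DELETE FROM evm.log_poller_blocks
     WHERE block_number < (SELECT MIN(block_number) FROM evm.log_poller_blocks) + 100;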

@reductionista force-pushed the BCFR-900-log-poller-id-columns branch 5 times, most recently from 5e3683a to dfbf614, September 26, 2024 15:48
Also:
   - Add UNIQUE INDEXes to replace previous primary keys (still necessary, both for
     optimizing queries and for enforcing uniqueness constraints)
   - Replace all SELECT *'s with helper functions for selecting all columns
   - Refactor nestedBlockQuery into withConfs, and make a bit more use of it
Some of the columns in these indexes (such as created_at) are no longer used.
Others were not optimized for the queries we need.
Previously it was using fromBlock > :block_number, which was inconsistent with how fromBlock is used in the other queries.
Also, adds a missing CREATE INDEX line that was left out of the Goose Down section:
   CREATE INDEX idx_evm_log_poller_blocks_order_by_block ON evm_log_poller_blocks (evm_chain_id, block_number DESC);

And removes an extraneous DROP line that was in the Goose Up section:
  DROP INDEX IF EXISTS evm.idx_logs_chain_address_event_block_logindex;
Contributor:

I see you updated files related to core. Please run pnpm changeset in the root directory to add a changeset, and include at least one of the following tags in its text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.
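
For context, a changeset for this PR might look something like this (hypothetical file contents; the package name and bump level are assumptions, and pnpm changeset generates its own filename):

    ---
    "chainlink": patch
    ---

    #db_update Add id column as PRIMARY KEY for evm.logs & evm.log_poller_blocks,
    rework the related indexes, and speed up the log/block pruning queries.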

@dhaidashenko (Collaborator) left a comment

One of the queries is missing a chainID clause, but otherwise this looks good. Great job!

WHERE block_number <= $1 AND evm_chain_id = $2`, end, ubig.New(o.chainID))

var limitBlock int64
err = o.ds.GetContext(ctx, &limitBlock, `SELECT MIN(block_number) FROM evm.log_poller_blocks`)
Collaborator:

Suggested change
err = o.ds.GetContext(ctx, &limitBlock, `SELECT MIN(block_number) FROM evm.log_poller_blocks`)
err = o.ds.GetContext(ctx, &limitBlock, `SELECT MIN(block_number) FROM evm.log_poller_blocks WHERE evm_chain_id = $1`, ubig.New(o.chainID))

@reductionista (author):

Thanks, nice catch!

WHERE block_number <= $1 AND evm_chain_id = $2`, end, ubig.New(o.chainID))

var limitBlock int64
err = o.ds.GetContext(ctx, &limitBlock, `SELECT MIN(block_number) FROM evm.log_poller_blocks`)
if err != nil {
return 0, err
Collaborator:

nit: Should we handle errNoRows here?
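
For reference, one way to make the empty-table case explicit (note that SELECT MIN(...) always returns a row, NULL on an empty table, so sql.ErrNoRows would never fire here; scanning NULL into an int64 errors instead):

    var minBlock sql.NullInt64
    err = o.ds.GetContext(ctx, &minBlock,
        `SELECT MIN(block_number) FROM evm.log_poller_blocks WHERE evm_chain_id = $1`,
        ubig.New(o.chainID))
    if err != nil {
        return 0, err
    }
    if !minBlock.Valid {
        return 0, nil // no blocks saved yet; nothing to delete
    }
    limitBlock := minBlock.Int64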

core/chains/evm/logpoller/orm.go (resolved)
core/chains/evm/logpoller/log_poller.go (resolved)
@mateusz-sekara previously approved these changes Sep 27, 2024
@reductionista added this pull request to the merge queue Oct 1, 2024
Merged via the queue into develop with commit e96210b Oct 1, 2024
126 of 127 checks passed
@reductionista deleted the BCFR-900-log-poller-id-columns branch October 1, 2024 06:16