Full internal message extraction #978

Closed · 2 tasks · frrist opened this issue Jun 7, 2022 · 8 comments · Fixed by #1027
Labels: kind/feature (New feature request)

frrist (Member) commented Jun 7, 2022

Background

Internal messages (also known as execution traces, or VM messages) are messages sent by actors while executing (on-chain) messages. As the name suggests, internal messages are executed internally by the Filecoin VM and therefore do not appear on-chain. One example of an internal message send exists within the Multisig actor: messages may be staged in a multisig actor and sent once they are approved by the required number of signers, and the result of that send does not appear on chain. A second example happens within the miner actor's DisputeWindowedPoSt method: the message that rewards the dispute reporter is sent internally and thus doesn't appear on chain.

Collecting the information from internal message sends permits granular inspection of activity within the Filecoin network. In addition to the above examples, it allows for inspection of (see the query sketch after this list):

  • the funds being burned by a storage provider.
  • the block rewards received by a storage provider.
  • deals activated by a storage provider.
  • deal timeouts in the market actor.
  • funds burnt by the reward actor.
  • and more!
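
For illustration, the first bullet might look like the following query against the vm_messages table proposed below. This is only a sketch: f099 is the network's burnt-funds actor, while f01234 and the lowercase column names are placeholders.

```sql
-- Sketch: total attoFIL an example storage provider (placeholder
-- address 'f01234') sends to the burnt-funds actor f099, per epoch.
SELECT height, SUM(value) AS burned_attofil
FROM vm_messages
WHERE "from" = 'f01234'
  AND "to" = 'f099'
GROUP BY height
ORDER BY height DESC;
```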

Currently, Lily exports two models, internal_messages and internal_parsed_messages. These tables are poorly named, since they actually contain "implicit messages": messages which are implicitly applied for each block - AwardBlockReward and EpochTick. In a sense the messages tracked in these tables are internal, as they do not appear on-chain, but I am unsure whether the messages described in the first couple of paragraphs above belong in these tables.

Model Design

Create a new table with the following schema for tracking internal messages.

Table Name: vm_messages

Table Definition:

| Column | Type | Description |
| ------ | ---- | ----------- |
| Height (pk) | bigint | height the message was applied at |
| StateRoot (pk) | text | state root the message was applied to |
| CID (pk) | text | CID of the message |
| Source (pk) | text | CID of the on-chain message that triggered this message |
| From | text | sender of the message |
| To | text | receiver of the message |
| Value | bigint | value (attoFIL) contained in the message |
| Method | text | method the message will execute |
| Actor_code | text | CID of the receiver's actor code |
| Exit_code | bigint | exit code of the message execution |
| Gas_used | bigint | gas units spent executing the message |
| Params | jsonb | message params parsed as JSON |
| Return | jsonb | return value parsed as JSON |

Table Indexes

  • height (desc)
  • stateroot (hash)
  • CID (hash)
  • Source (hash)
  • from (hash)
  • to (hash)
  • method (hash)
  • actor_code (hash)

Acceptance Criteria

  • Implement internal message extraction logic in Lily
  • Implement the schema migration required for the model
@frrist frrist self-assigned this Jun 7, 2022
@frrist frrist added the kind/feature New feature request label Jun 7, 2022
f8-ptrk (Contributor) commented Jul 27, 2022

Love it.

Please leave comments in the issue describing the data you require (e.g. message parameters as JSON? internal message receipts? receipt parameters? etc.).

As much data as is available - I know this will possibly get out of hand fast, and people might not need all of it, so we might want to think about having some params to fine-tune the process for their needs!

We, for our part, especially against the FVM background, would love to see "everything" being parsed.

placer14 (Contributor) commented Aug 2, 2022

I would be careful with an index on every column. Each index created will need to be updated on insert, and having many indices will impact insertion speed. I suspect this will also end up being our heaviest table, and index updates will get slower as the table grows. Can we drop any of these indices?

(I don't know if you intend to have capital letters in the column names, but they should be snake_case only.)

frrist (Member, Author) commented Aug 2, 2022

> Can we drop any of these indices?

Absolutely - what do you think we should change? I have updated the issue slightly wrt indexes.

> I don't know if you intend to have capital letters in the column names

Not my intention, and FWIW column names are enforced via the ORM anyway.

kasteph (Contributor) commented Aug 2, 2022

The data engineering team has no concerns regarding the indices, since we'll be following the Lily -> CSV -> BigQuery pipeline to store the vm_messages.

I'd like to make a suggestion regarding the creation of indices. If at all possible, apply the proposed schema on a local DB first without any indices save for the composite primary key and one other index such as method. You can then run analyze <table> according to the docs to examine the distribution of values in the table and how those distributions are used by the existing indices.
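
A rough sketch of that workflow, assuming the table is already populated:

```sql
-- Refresh the planner's statistics for the table...
ANALYZE vm_messages;

-- ...then inspect the recorded value distribution of each column.
SELECT attname, n_distinct, null_frac, most_common_vals
FROM pg_stats
WHERE tablename = 'vm_messages';
```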

placer14 (Contributor) commented Aug 2, 2022

> If at all possible, apply the proposed schema on a local DB first without any indices save for the composite primary key and another index such as method.

Agreeing with Steph, I still think we should separate read schema concerns from write schema concerns. We can't have a good schema for both, which is why we're generally at odds wrt indices and normalized schemas. I have an idea for managing the read schema requirements separately from the ingestion schema requirements, which Lily should be managing in its migrations.

Without a good place for separating read/write schemas, we can accept the cost of insertion with as few indices as we can get away with.

> Absolutely, what do you think we should change? I have updated the issue slightly wrt indexes

I would drop the stateroot index, and combine the index on some of the columns, ordered by most-accessed column first (sketched as DDL after this list):

  • height,CID,stateroot (pk, btree, is_unique) (will be used for pk + range queries on height even if CID isn't also included in condition... but will also use CID (as well as stateroot) if it is.)
  • CID (hash) (lookups directly by CID, source, from, and to)
  • Source (hash)
  • from (hash)
  • to (hash)
  • (Maybe) actor_code,method (btree) (mostly for looking up actor code or actor code + method. Lookups for just method will be inefficient here unless we want to add another index. But only if we really think we'll need this.)
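
As DDL, the above would look roughly like this. It is a sketch only: it assumes the snake_case column names used later in this thread, the index names are placeholders, and Lily's migration tooling is omitted.

```sql
-- The composite primary key's btree serves pk lookups and range
-- queries on height alone, and on height + cid (+ state_root).
ALTER TABLE vm_messages
    ADD PRIMARY KEY (height, cid, state_root);

-- Hash indices for direct equality lookups.
CREATE INDEX vm_messages_cid_idx    ON vm_messages USING hash (cid);
CREATE INDEX vm_messages_source_idx ON vm_messages USING hash (source);
CREATE INDEX vm_messages_from_idx   ON vm_messages USING hash ("from");
CREATE INDEX vm_messages_to_idx     ON vm_messages USING hash ("to");

-- Optional: only if actor_code (+ method) lookups prove common.
CREATE INDEX vm_messages_actor_code_method_idx
    ON vm_messages (actor_code, method);
```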

kasteph (Contributor) commented Aug 2, 2022

I just ran some small, probably impractical-in-real-life queries against a schema that @frrist provided me with. I used an abridged version of what he originally sent me, to test the table with fewer indices. The vm_messages table had ~13k rows while receipts had ~7.7M. The queries are meant to exercise the available indices, except the last one (a self join on method).

**vm_messages schema**

```sql
create table vm_messages
(
    height     bigint  not null,
    state_root text    not null,
    cid        text    not null,
    source     text    not null,
    "from"     text    not null,
    "to"       text    not null,
    value      numeric not null,
    method     text    not null,
    actor_code text    not null,
    exit_code  bigint  not null,
    gas_used   bigint  not null,
    params     jsonb,
    returns    jsonb,
    primary key (height, state_root, cid, source)
);

alter table vm_messages
    owner to postgres;

create index vm_messages_height_idx
    on vm_messages (height desc);

create index vm_messages_state_root_idx
    on vm_messages using hash (state_root);

create index vm_messages_cid_idx
    on vm_messages using hash (cid);

create index vm_messages_source_idx
    on vm_messages using hash (source);
```
**self join on vm_messages**

```sql
explain analyze
select a.height, a.method
from vm_messages a
join vm_messages b
on a.height = b.height and a.state_root = b.state_root and a."from" = b."to"
where a.params->'DealIDs' is not null;
```

query plan:

```
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.42..4166.82 rows=31104 width=10) (actual time=5.844..194.945 rows=1610 loops=1)
   ->  Seq Scan on vm_messages a  (cost=0.00..1085.29 rows=13475 width=81) (actual time=0.045..11.095 rows=1713 loops=1)
         Filter: ((params -> 'DealIDs'::text) IS NOT NULL)
         Rows Removed by Filter: 11830
   ->  Memoize  (cost=0.42..1.31 rows=1 width=76) (actual time=0.060..0.107 rows=1 loops=1713)
         Cache Key: a.height, a.state_root, a."from"
         Cache Mode: logical
         Hits: 896  Misses: 817  Evictions: 0  Overflows: 0  Memory Usage: 206kB
         ->  Index Scan using vm_messages_pkey on vm_messages b  (cost=0.41..1.30 rows=1 width=76) (actual time=0.124..0.222 rows=1 loops=817)
               Index Cond: ((height = a.height) AND (state_root = a.state_root))
               Filter: (a."from" = "to")
               Rows Removed by Filter: 562
 Planning Time: 9.981 ms
 Execution Time: 195.101 ms
(14 rows)
```
**vm_messages join receipts**

```sql
explain analyze verbose
select v.height, v.cid, v.method
from vm_messages v
join receipts r
on v.cid = r.message
where v.params->'DealID' is not null;
```

```
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..407611.62 rows=13475 width=73) (actual time=2162.106..2164.971 rows=0 loops=1)
   Output: v.height, v.cid, v.method
   Workers Planned: 2
   Workers Launched: 2
   ->  Nested Loop  (cost=0.00..405264.12 rows=5615 width=73) (actual time=2157.425..2157.426 rows=0 loops=3)
         Output: v.height, v.cid, v.method
         Worker 0:  actual time=2155.520..2155.521 rows=0 loops=1
         Worker 1:  actual time=2155.593..2155.593 rows=0 loops=1
         ->  Parallel Seq Scan on public.receipts r  (cost=0.00..211130.97 rows=3207997 width=63) (actual time=0.050..330.858 rows=2566393 loops=3)
               Output: r.message, r.state_root, r.idx, r.exit_code, r.gas_used, r.height
               Worker 0:  actual time=0.053..330.871 rows=2580387 loops=1
               Worker 1:  actual time=0.057..329.548 rows=2483895 loops=1
         ->  Index Scan using vm_messages_cid_idx on public.vm_messages v  (cost=0.00..0.04 rows=2 width=73) (actual time=0.000..0.000 rows=0 loops=7699180)
               Output: v.height, v.state_root, v.cid, v.source, v."from", v."to", v.value, v.method, v.actor_code, v.exit_code, v.gas_used, v.params, v.returns
               Index Cond: (v.cid = r.message)
               Rows Removed by Index Recheck: 0
               Filter: ((v.params -> 'DealID'::text) IS NOT NULL)
               Worker 0:  actual time=0.000..0.000 rows=0 loops=2580387
               Worker 1:  actual time=0.000..0.000 rows=0 loops=2483895
 Planning Time: 1.046 ms
 Execution Time: 2165.080 ms
(21 rows)
```
**self join on method**

```sql
explain analyze verbose
select v.height, x.height, v.cid, v.method
from vm_messages v
join vm_messages x
on v.method = x.method;
```

```
                                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1220.72..330940.98 rows=27617278 width=81) (actual time=12.539..2718.985 rows=27617277 loops=1)
   Output: v.height, x.height, v.cid, v.method
   Hash Cond: (v.method = x.method)
   ->  Seq Scan on public.vm_messages v  (cost=0.00..1051.43 rows=13543 width=73) (actual time=0.008..2.932 rows=13543 loops=1)
         Output: v.height, v.state_root, v.cid, v.source, v."from", v."to", v.value, v.method, v.actor_code, v.exit_code, v.gas_used, v.params, v.returns
   ->  Hash  (cost=1051.43..1051.43 rows=13543 width=10) (actual time=12.502..12.503 rows=13543 loops=1)
         Output: x.height, x.method
         Buckets: 16384  Batches: 1  Memory Usage: 687kB
         ->  Seq Scan on public.vm_messages x  (cost=0.00..1051.43 rows=13543 width=10) (actual time=0.005..7.527 rows=13543 loops=1)
               Output: x.height, x.method
 Planning Time: 0.195 ms
 Execution Time: 3482.044 ms
(12 rows)
```

I ran multiple analyses on these queries and there was not much variance in the execution times of each. I think this demonstrates that height and cid are sufficient indices, even when joining a big table such as receipts. I'm not sure how much we'll be reading on method, but one can assume that other columns of a similar type (enums represented as strings) without an index, such as actor_code, will perform roughly the same. If there will be dashboards in the near future where we'd want to group messages by actor_code and/or method, then it might make sense to index those now (not as hash, but btree, if we index height as brin; sketched below).
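
Concretely, that variant might look like the following. The index names are placeholders, and the BRIN choice assumes rows are inserted in roughly height order:

```sql
-- BRIN on height: very compact, and effective if rows arrive in
-- roughly height order, so block ranges correlate with height.
CREATE INDEX vm_messages_height_brin
    ON vm_messages USING brin (height);

-- btree for dashboards that group or filter by actor_code and method.
CREATE INDEX vm_messages_actor_code_method_idx
    ON vm_messages (actor_code, method);
```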

f8-ptrk (Contributor) commented Aug 3, 2022

So the consensus is that the read schema needs to be different from the write schema for this to be useful? Does that apply to Lily in general?

@frrist frrist linked a pull request Aug 3, 2022 that will close this issue
frrist (Member, Author) commented Aug 3, 2022

@f8-ptrk Generally, we want a schema optimized for the ingestion of data, as bottlenecks in writing to the database will prevent Lily from keeping up with the chain. We aim to strike a balance between ingestion speed and query speed by including some indices in the schema, recognizing they may not be optimal for all use-cases. Users of Lily are free to introduce their own indices outside the provided schema (one sketch below) or, as @kasteph alludes to, write directly to CSV files and implement a separate ingestion process.
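
For example, a user-managed index could be added without blocking ingestion; the index name and column choice here are placeholders:

```sql
-- CONCURRENTLY builds the index without holding a write lock, so Lily
-- can keep inserting while the build runs.
CREATE INDEX CONCURRENTLY vm_messages_method_idx
    ON vm_messages (method);
```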

I have incorporated the feedback from this issue into #1027, which is ready for review @placer14 @kasteph (@f8-ptrk feel free to review as well). We can continue the discussion around index changes there.

frrist added a commit that referenced this issue Aug 10, 2022
* feat: implement vm message extraction

- closes #978
@frrist frrist added this to the Lily v0.12.0 milestone Aug 16, 2022