Full internal message extraction #978

Closed · 2 tasks · frrist opened this issue Jun 7, 2022 · 8 comments · Fixed by #1027
Labels: kind/feature (New feature request)

frrist (Member) commented Jun 7, 2022

Background

Internal messages (also known as execution traces, or VM messages) are messages sent by actors while executing (on-chain) messages. As the name suggests, internal messages are executed internally by the Filecoin VM and therefore do not appear on-chain. One example of an internal message send exists within the Multisig actor: messages may be staged in a multisig actor and sent once they are approved by the required number of signers, and the result of that send does not appear on chain. A second example happens within the miner actor's DisputeWindowedPoSt method: the message that rewards the dispute reporter is sent internally and thus doesn't appear on chain.

Collecting the information from internal message sends permits granular inspection of activity within the Filecoin network. In addition to the above examples, it allows for inspection of (see the query sketch after this list):

  • the funds being burned by a storage provider.
  • the block rewards received by a storage provider.
  • deals activated by a storage provider.
  • deal timeouts in the market actor.
  • funds burnt by the reward actor.
  • and more!
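
For illustration, the first bullet might look like the following query against the vm_messages table proposed below. This is only a sketch: f099 is the network's burnt-funds actor, while f01234 and the lowercase column names are placeholders.

```sql
-- Sketch: total attoFIL an example storage provider (placeholder
-- address 'f01234') sends to the burnt-funds actor f099, per epoch.
SELECT height, SUM(value) AS burned_attofil
FROM vm_messages
WHERE "from" = 'f01234'
  AND "to" = 'f099'
GROUP BY height
ORDER BY height DESC;
```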

Currently, Lily exports two models, internal_messages and internal_parsed_messages. These tables are poorly named, since they actually contain "implicit messages": messages which are implicitly applied for each block - AwardBlockReward and EpochTick. In a sense the messages tracked in these tables are internal, as they do not appear on-chain, but I am unsure whether the messages described in the first couple of paragraphs above belong in these tables.

Model Design

Create a new table with the following schema for tracking internal messages.

Table Name: vm_messages

Table Definition:

| Column | Type | Description |
| ------ | ---- | ----------- |
| Height (pk) | bigint | height the message was applied at |
| StateRoot (pk) | text | state root the message was applied to |
| CID (pk) | text | CID of the message |
| Source (pk) | text | CID of the on-chain message that triggered this message |
| From | text | sender of the message |
| To | text | receiver of the message |
| Value | bigint | value (attoFIL) contained in the message |
| Method | text | method the message will execute |
| Actor_code | text | CID of the receiver's actor code |
| Exit_code | bigint | exit code of the message execution |
| Gas_used | bigint | gas units spent executing the message |
| Params | jsonb | message params parsed as JSON |
| Return | jsonb | return value parsed as JSON |

Table Indexes

  • height (desc)
  • stateroot (hash)
  • CID (hash)
  • Source (hash)
  • from (hash)
  • to (hash)
  • method (hash)
  • actor_code (hash)

Acceptance Criteria

  • Implement internal message extraction logic in Lily
  • Implement the schema migration required for the model
@frrist frrist self-assigned this Jun 7, 2022
@frrist frrist added the kind/feature New feature request label Jun 7, 2022
f8-ptrk (Contributor) commented Jul 27, 2022

Love it.

Please leave comments in the issue describing the data you require (e.g. message parameters as JSON? internal message receipts? receipt parameters? etc.).

As much data as is available - I know this will possibly get out of hand fast, and people might not need all of it, so we might want to think about having some params to fine-tune the process for their needs!

We, for our part, especially against the FVM background, would love to see "everything" being parsed.

placer14 (Contributor) commented Aug 2, 2022

I would be careful with an index on every column. Each index created will need to be updated on insert, and having many indices will impact insertion speed. I suspect this will also end up being our heaviest table, and index updates will get slower as the table grows. Can we drop any of these indices?

(I don't know if you intend to have capital letters in the column names, but they should be snake_case only.)

frrist (Member, Author) commented Aug 2, 2022

> Can we drop any of these indices?

Absolutely - what do you think we should change? I have updated the issue slightly wrt indexes.

> I don't know if you intend to have capital letters in the column names

Not my intention, and FWIW column names are enforced via the ORM anyway.

kasteph (Contributor) commented Aug 2, 2022

The data engineering team has no concerns regarding the indices, since we'll be following the Lily -> CSV -> BigQuery pipeline to store the vm_messages.

I'd like to make a suggestion regarding the creation of indices. If at all possible, apply the proposed schema on a local DB first without any indices save for the composite primary key and one other index such as method. You can then run analyze <table> according to the docs to examine the distribution of values in the table and how those distributions are used by the existing indices.
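
A rough sketch of that workflow, assuming the table is already populated:

```sql
-- Refresh the planner's statistics for the table...
ANALYZE vm_messages;

-- ...then inspect the recorded value distribution of each column.
SELECT attname, n_distinct, null_frac, most_common_vals
FROM pg_stats
WHERE tablename = 'vm_messages';
```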

placer14 (Contributor) commented Aug 2, 2022

> If at all possible, apply the proposed schema on a local DB first without any indices save for the composite primary key and another index such as method.

Agreeing with Steph, I still think we should separate read schema concerns from write schema concerns. We can't have a good schema for both, which is why we're generally at odds wrt indices and normalized schemas. I have an idea for managing the read schema requirements separately from the ingestion schema requirements, which Lily should be managing in its migrations.

Without a good place for separating read/write schemas, we can accept the cost of insertion with as few indices as we can get away with.

> Absolutely, what do you think we should change? I have updated the issue slightly wrt indexes

I would drop the stateroot index, and combine the index on some of the columns, ordered by most-accessed column first (sketched as DDL after this list):

  • height,CID,stateroot (pk, btree, is_unique) (will be used for pk + range queries on height even if CID isn't also included in condition... but will also use CID (as well as stateroot) if it is.)
  • CID (hash) (lookups directly by CID, source, from, and to)
  • Source (hash)
  • from (hash)
  • to (hash)
  • (Maybe) actor_code,method (btree) (mostly for looking up actor code or actor code + method. Lookups for just method will be inefficient here unless we want to add another index. But only if we really think we'll need this.)
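
As DDL, the above would look roughly like this. It is a sketch only: it assumes the snake_case column names used later in this thread, the index names are placeholders, and Lily's migration tooling is omitted.

```sql
-- The composite primary key's btree serves pk lookups and range
-- queries on height alone, and on height + cid (+ state_root).
ALTER TABLE vm_messages
    ADD PRIMARY KEY (height, cid, state_root);

-- Hash indices for direct equality lookups.
CREATE INDEX vm_messages_cid_idx    ON vm_messages USING hash (cid);
CREATE INDEX vm_messages_source_idx ON vm_messages USING hash (source);
CREATE INDEX vm_messages_from_idx   ON vm_messages USING hash ("from");
CREATE INDEX vm_messages_to_idx     ON vm_messages USING hash ("to");

-- Optional: only if actor_code (+ method) lookups prove common.
CREATE INDEX vm_messages_actor_code_method_idx
    ON vm_messages (actor_code, method);
```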

kasteph (Contributor) commented Aug 2, 2022

I just ran some small, probably impractical-in-real-life queries against a schema that @frrist provided me with. I used an abridged version of what he originally sent me, to test the table with fewer indices. The vm_messages table had ~13k rows while receipts had ~7.7M. The queries are meant to exercise the available indices, except the last one (a self join on method).

**vm_messages schema**

```sql
create table vm_messages
(
    height     bigint  not null,
    state_root text    not null,
    cid        text    not null,
    source     text    not null,
    "from"     text    not null,
    "to"       text    not null,
    value      numeric not null,
    method     text    not null,
    actor_code text    not null,
    exit_code  bigint  not null,
    gas_used   bigint  not null,
    params     jsonb,
    returns    jsonb,
    primary key (height, state_root, cid, source)
);

alter table vm_messages
    owner to postgres;

create index vm_messages_height_idx
    on vm_messages (height desc);

create index vm_messages_state_root_idx
    on vm_messages using hash (state_root);

create index vm_messages_cid_idx
    on vm_messages using hash (cid);

create index vm_messages_source_idx
    on vm_messages using hash (source);
```
**self join on vm_messages**

```sql
explain analyze
select a.height, a.method
from vm_messages a
join vm_messages b
on a.height = b.height and a.state_root = b.state_root and a."from" = b."to"
where a.params->'DealIDs' is not null;
```

query plan:

```
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.42..4166.82 rows=31104 width=10) (actual time=5.844..194.945 rows=1610 loops=1)
   ->  Seq Scan on vm_messages a  (cost=0.00..1085.29 rows=13475 width=81) (actual time=0.045..11.095 rows=1713 loops=1)
         Filter: ((params -> 'DealIDs'::text) IS NOT NULL)
         Rows Removed by Filter: 11830
   ->  Memoize  (cost=0.42..1.31 rows=1 width=76) (actual time=0.060..0.107 rows=1 loops=1713)
         Cache Key: a.height, a.state_root, a."from"
         Cache Mode: logical
         Hits: 896  Misses: 817  Evictions: 0  Overflows: 0  Memory Usage: 206kB
         ->  Index Scan using vm_messages_pkey on vm_messages b  (cost=0.41..1.30 rows=1 width=76) (actual time=0.124..0.222 rows=1 loops=817)
               Index Cond: ((height = a.height) AND (state_root = a.state_root))
               Filter: (a."from" = "to")
               Rows Removed by Filter: 562
 Planning Time: 9.981 ms
 Execution Time: 195.101 ms
(14 rows)
```
**vm_messages join receipts**

```sql
explain analyze verbose
select v.height, v.cid, v.method
from vm_messages v
join receipts r
on v.cid = r.message
where v.params->'DealID' is not null;
```

```
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..407611.62 rows=13475 width=73) (actual time=2162.106..2164.971 rows=0 loops=1)
   Output: v.height, v.cid, v.method
   Workers Planned: 2
   Workers Launched: 2
   ->  Nested Loop  (cost=0.00..405264.12 rows=5615 width=73) (actual time=2157.425..2157.426 rows=0 loops=3)
         Output: v.height, v.cid, v.method
         Worker 0:  actual time=2155.520..2155.521 rows=0 loops=1
         Worker 1:  actual time=2155.593..2155.593 rows=0 loops=1
         ->  Parallel Seq Scan on public.receipts r  (cost=0.00..211130.97 rows=3207997 width=63) (actual time=0.050..330.858 rows=2566393 loops=3)
               Output: r.message, r.state_root, r.idx, r.exit_code, r.gas_used, r.height
               Worker 0:  actual time=0.053..330.871 rows=2580387 loops=1
               Worker 1:  actual time=0.057..329.548 rows=2483895 loops=1
         ->  Index Scan using vm_messages_cid_idx on public.vm_messages v  (cost=0.00..0.04 rows=2 width=73) (actual time=0.000..0.000 rows=0 loops=7699180)
               Output: v.height, v.state_root, v.cid, v.source, v."from", v."to", v.value, v.method, v.actor_code, v.exit_code, v.gas_used, v.params, v.returns
               Index Cond: (v.cid = r.message)
               Rows Removed by Index Recheck: 0
               Filter: ((v.params -> 'DealID'::text) IS NOT NULL)
               Worker 0:  actual time=0.000..0.000 rows=0 loops=2580387
               Worker 1:  actual time=0.000..0.000 rows=0 loops=2483895
 Planning Time: 1.046 ms
 Execution Time: 2165.080 ms
(21 rows)
```
**self join on method**

```sql
explain analyze verbose
select v.height, x.height, v.cid, v.method
from vm_messages v
join vm_messages x
on v.method = x.method;
```

```
                                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1220.72..330940.98 rows=27617278 width=81) (actual time=12.539..2718.985 rows=27617277 loops=1)
   Output: v.height, x.height, v.cid, v.method
   Hash Cond: (v.method = x.method)
   ->  Seq Scan on public.vm_messages v  (cost=0.00..1051.43 rows=13543 width=73) (actual time=0.008..2.932 rows=13543 loops=1)
         Output: v.height, v.state_root, v.cid, v.source, v."from", v."to", v.value, v.method, v.actor_code, v.exit_code, v.gas_used, v.params, v.returns
   ->  Hash  (cost=1051.43..1051.43 rows=13543 width=10) (actual time=12.502..12.503 rows=13543 loops=1)
         Output: x.height, x.method
         Buckets: 16384  Batches: 1  Memory Usage: 687kB
         ->  Seq Scan on public.vm_messages x  (cost=0.00..1051.43 rows=13543 width=10) (actual time=0.005..7.527 rows=13543 loops=1)
               Output: x.height, x.method
 Planning Time: 0.195 ms
 Execution Time: 3482.044 ms
(12 rows)
```

I ran multiple analyses on these queries and there was not much variance in the execution times of each. I think this demonstrates that height and cid are sufficient indices, even when joining a big table such as receipts. I'm not sure how much we'll be reading on method, but one can assume that other columns of a similar type (enums represented as strings) without an index, such as actor_code, will perform roughly the same. If there will be dashboards in the near future where we'd want to group messages by actor_code and/or method, then it might make sense to index those now (not as hash, but btree, if we index height as brin; sketched below).
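
Concretely, that variant might look like the following. The index names are placeholders, and the BRIN choice assumes rows are inserted in roughly height order:

```sql
-- BRIN on height: very compact, and effective if rows arrive in
-- roughly height order, so block ranges correlate with height.
CREATE INDEX vm_messages_height_brin
    ON vm_messages USING brin (height);

-- btree for dashboards that group or filter by actor_code and method.
CREATE INDEX vm_messages_actor_code_method_idx
    ON vm_messages (actor_code, method);
```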

f8-ptrk (Contributor) commented Aug 3, 2022

So the consensus is that the read schema needs to be different from the write schema for this to be useful? Does that apply to Lily in general?

@frrist frrist linked a pull request Aug 3, 2022 that will close this issue
frrist (Member, Author) commented Aug 3, 2022

@f8-ptrk Generally, we want a schema optimized for the ingestion of data, as bottlenecks in writing to the database will prevent Lily from keeping up with the chain. We aim to strike a balance between ingestion speed and query speed by including some indices in the schema, recognizing they may not be optimal for all use-cases. Users of Lily are free to introduce their own indices outside the provided schema (one sketch below) or, as @kasteph alludes to, write directly to CSV files and implement a separate ingestion process.
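
For example, a user-managed index could be added without blocking ingestion; the index name and column choice here are placeholders:

```sql
-- CONCURRENTLY builds the index without holding a write lock, so Lily
-- can keep inserting while the build runs.
CREATE INDEX CONCURRENTLY vm_messages_method_idx
    ON vm_messages (method);
```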

I have incorporated the feedback from this issue into #1027, which is ready for review @placer14 @kasteph (@f8-ptrk feel free to review as well). We can continue the discussion around index changes there.

frrist added a commit that referenced this issue Aug 10, 2022
* feat: implement vm message extraction

- closes #978
@frrist frrist added this to the Lily v0.12.0 milestone Aug 16, 2022