
New DB representation for data sources #3405

Closed
leoyvens opened this issue Mar 29, 2022 · 17 comments

@leoyvens
Collaborator

leoyvens commented Mar 29, 2022

The DB storage of dynamic data sources needs to change to reflect that graph node is now multichain and to fulfill the requirements of file data sources.

Dynamic data sources are currently stored in subgraphs.dynamic_ethereum_contract_data_source. This issue proposes that each subgraph gain a sgd*.data_source table under its own DB namespace. The proposed schema for this table is:

create table sgd*.data_sources$ (
      vid integer generated by default as identity primary key,
      block_range int4range not null,
      access_group integer generated by default as identity,
      manifest_idx integer not null,
      parent integer references sgd*.data_sources$,
      id bytea,
      param bytea,
      context jsonb
);

Each field is explained and justified below. This is a good time to bikeshed on all of their names.

vid: Just a primary key, has no particular meaning.

block_range: The use of a block range is meant to allow an efficient implementation of data source removal, which is not currently a feature but quickly becomes necessary for use cases such as replacing the IPFS file associated with a user profile. Since we anticipate that the query "which data sources are live at this block" will be relevant, a range makes sense compared to separate start and end block fields.
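As an illustration of the intended "live at this block" check (a minimal Python sketch, not graph-node code), block_range can be modeled as a half-open interval, mirroring how Postgres int4range stores `[start, end)` with an open upper bound for data sources that were never removed:

```python
# Illustrative sketch: a data source's block_range as a half-open
# interval [start, end), mirroring Postgres int4range semantics.
# An upper bound of None means the data source has not been removed.

def is_live(block_range, block):
    """Return True if `block` falls inside the half-open range."""
    start, end = block_range
    return start <= block and (end is None or block < end)

# A data source created at block 100 and removed at block 250:
ds_range = (100, 250)
assert is_live(ds_range, 100)        # live at its creation block
assert not is_live(ds_range, 250)    # upper bound is exclusive
assert is_live((100, None), 10_000)  # never removed: still live
```

In Postgres, the same check would use the range containment operator (`block_range @> $block`), which can use a GiST index over the range column.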

access_group: Works as described here in the 'data source independence' section. This will also be added to entity tables. We let the DB handle generating new access groups from a sequence when necessary, since these need not be deterministic.

manifest_idx: Serves the purpose that the name currently serves, which is to look up the definition in the manifest. This more efficient representation stores the position in the manifest rather than the name.

id: An optional user-given id, in case the user needs to refer to a specific data source after creation, such as for removal.

param: The stored representation of the params argument passed on creation. Examples of stored data: address for Ethereum, account id for NEAR, file CID for file data sources. A bytea representation is optimal for current uses, and we can always add a jsonb alternative if required in the future.
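To illustrate what the bytea column might hold for each kind of data source (hypothetical example values, not taken from any real subgraph):

```python
# Illustrative sketch: each data source kind encodes its creation
# parameter as raw bytes, matching the proposed `param bytea` column.
# All values below are made-up examples.

# Ethereum: a 20-byte contract address, stored as its raw bytes.
ethereum_address = bytes.fromhex("c0a47dfe034b400b47bdad5fecda2621de6c4d95")

# NEAR: an account id, which is naturally a UTF-8 string.
near_account = "alice.near".encode("utf-8")

# File data source: a CID, here stored as its string form's bytes
# (it could equally be stored as the decoded binary CID).
file_cid = "QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG".encode("utf-8")

assert len(ethereum_address) == 20  # Ethereum addresses are 20 bytes
```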

context: Optional data source context, as it exists today.

parent: The data source which triggered the creation of this data source. Tracking parent-child relationships will likely be required for features such as IPFS data sources created from other data sources. It's assumed in this design that each data source has a single parent.
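Since each row has at most one parent, the `parent` column forms a forest, and a data source's ancestry can be recovered by walking parent pointers. A minimal Python sketch (illustrative only, with made-up vids):

```python
# Illustrative sketch: the `parent` column maps each data source (by vid)
# to the vid of the data source that created it, or None for a static
# (manifest-declared) data source.

parents = {1: None, 2: 1, 3: 2}  # vid -> parent vid

def ancestors(vid):
    """Walk parent pointers from `vid` up to a root, nearest first."""
    chain = []
    cur = parents[vid]
    while cur is not None:
        chain.append(cur)
        cur = parents[cur]
    return chain

assert ancestors(3) == [2, 1]  # created by 2, which was created by 1
assert ancestors(1) == []      # a root has no ancestors
```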

Unblocks:
#3072
#3202

Ping @lutter @maoueh for review.

@leoyvens leoyvens self-assigned this Mar 29, 2022
@lutter
Collaborator

lutter commented Mar 29, 2022

This looks good. A few small comments:

  • the table should be called data_source$ since $ is legal in PG identifiers and can't clash with names coming from GraphQL (otherwise the table name could conflict with an entity type DataSource)
  • does access_group need to be an int8 or could we get away with an int4?
  • similarly, could manifest_idx be a int2?
  • will id be used to search for data sources? Could we get away with a bytea for it?

@leoyvens
Collaborator Author

The access_group space needs to be as large as the data source space. But I think we should make vid, access_group and parent all int4; data sources are all kept in memory, so it seems unrealistic to support billions of them, even if we're aiming to support millions. Making manifest_idx an int2 is only a small saving, and it's not unthinkable that someone would generate a manifest with over 32k data sources.
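For reference, the numeric bounds behind this tradeoff (standard Postgres integer type ranges, shown here as a small Python check):

```python
# Postgres integer bounds relevant to the column-width discussion:
# int2 (smallint) is 16-bit signed, int4 (integer) is 32-bit signed.
INT2_MAX = 2**15 - 1   # 32_767
INT4_MAX = 2**31 - 1   # 2_147_483_647

# A manifest with over 32k data source definitions would overflow int2,
# while int4 comfortably covers millions of dynamic data sources.
assert INT2_MAX == 32_767
assert INT4_MAX == 2_147_483_647
```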

About id, this is a DX question, since we're choosing between String or Bytes when we eventually have an AS interface for setting it. Using Bytes does seem like it may be the common case, for storing CIDs or contract addresses, and strings can always be converted to bytes, so I agree on making it bytea.

@maoueh
Contributor

maoueh commented Apr 8, 2022

That's pretty good, how will migrations be handled?

@leoyvens
Collaborator Author

leoyvens commented Apr 8, 2022

My current thinking is to not migrate existing subgraphs, so graph node will continue to support the existing schema.

@lutter
Collaborator

lutter commented Apr 12, 2022

Another option for migration would be to do it during start_subgraph_deployment: if the data_sources table does not exist yet, create it and populate it from the current table. It might be that not having to accommodate looking in two different storage schemes helps keep the code simpler.
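A minimal sketch of that idea (hypothetical names, not graph-node code): the create-and-copy step runs at most once per deployment, at a point where the subgraph is known not to be indexing, so repeated startups are no-ops:

```python
# Illustrative sketch of migrate-on-start, with a fake in-memory store.
# All class and method names here are hypothetical.

class FakeStore:
    def __init__(self):
        # Rows living in the legacy dynamic data source table.
        self.legacy_rows = [{"address": bytes.fromhex("c0ffee")}]
        self.tables = {}

    def start_subgraph_deployment(self):
        if "data_source$" not in self.tables:
            # Create the new table and populate it from the legacy one,
            # exactly once, while indexing is guaranteed to be stopped.
            self.tables["data_source$"] = list(self.legacy_rows)

store = FakeStore()
store.start_subgraph_deployment()
store.start_subgraph_deployment()  # idempotent: second call is a no-op
assert store.tables["data_source$"] == store.legacy_rows
```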

@leoyvens
Collaborator Author

leoyvens commented Apr 13, 2022

It would make the code cleaner, though it adds risk to the migration itself. One thing to consider is that when doing the migration we need to block all indexing that could be happening concurrently, and make sure the index nodes are updated before continuing. And the downtime this migration could impose on hosted, due to the number of subgraphs, might not be acceptable.

@lutter
Collaborator

lutter commented Apr 13, 2022

That's why I suggested doing it in start_subgraph_deployment, since you know indexing isn't happening at that point. Yes, it will slow down startup; you are right that it'll be hard to predict what the load on the system will be, and we'd have to figure out a way to quantify that. I don't think it'll introduce a lot of downtime, since the writes that need to happen are not that different from what a subgraph does when it writes 'normal' entities.

@leoyvens
Collaborator Author

Ah, I see now. Then a remaining issue would be whether we should have code to revert the migration ready, or whether we should rely on integration testing to give us enough confidence to migrate without the option of going back.

@leoyvens
Collaborator Author

My concern is the backwards compatibility with previous graph node versions. It's been a while since we've made a change that breaks the possibility of downgrading, and it's dangerous if downgrading corrupts the DB. The migration doesn't really need to be reverted since it's not destructive, it only copies data to a new table. One way to maintain backwards compatibility would be to write Ethereum data sources both to the legacy subgraphs.dynamic_ethereum_contract_data_source and the new table.

@azf20
Contributor

azf20 commented Apr 22, 2022

I think this data is only relevant during indexing, so could this be handled by apiVersioning? (file data sources will necessitate a new apiVersion)

@leoyvens
Collaborator Author

This change can be made without changing any existing behaviour, so it doesn't need a new api version.

@azf20
Contributor

azf20 commented Apr 22, 2022

Yes, but the functionality which requires it (file data sources) will require a new api version. If we want to maintain backwards compatibility, either the new Graph Node versions write to the legacy table, or versioning determines which table is written to for a given subgraph (older Graph Node versions won't be able to index the new apiVersion anyway, so we don't need to worry about rolling back)
I don't feel strongly, just felt like this was an alternative path. cc @dizzyd, who I was talking to yesterday about the various versions in Graph Node.

@leoyvens
Collaborator Author

That's possible, but makes it less likely that we'll ever remove the code path which supports the legacy table.

@leoyvens
Collaborator Author

After discussions with @lutter, the latest plan is:

  • Both schema versions will be supported in graph node, at least until all of hosted is migrated and the migration has been included in a release.
  • This means supporting both schemas for both indexing and grafting subgraphs.
  • The migration in hosted will be triggered on a per-shard configuration, so we have a chance to restore from a backup if something goes very wrong.

@air3ijai

This comment was marked as off-topic.

@leoyvens
Collaborator Author

leoyvens commented Aug 4, 2022

This is implemented for new deployments. The remaining work tracked by this issue would be to migrate existing deployments to the new schema.

@leoyvens
Collaborator Author

It's been a year and no particular need to actively move subgraphs to the new schema has been identified, so I'm closing.
