
New DB representation for data sources #3405

Closed
leoyvens opened this issue Mar 29, 2022 · 17 comments

@leoyvens
Collaborator

leoyvens commented Mar 29, 2022

The DB storage of dynamic data sources needs to change to reflect that graph node is now multichain and to fulfill the requirements of file data sources.

Dynamic data sources are currently stored in subgraphs.dynamic_ethereum_contract_data_source. This issue proposes that each subgraph gain a sgd*.data_source table under its own DB namespace. The proposed schema for this table is:

create table sgd*.data_sources$ (
      vid integer generated by default as identity primary key,
      block_range int4range not null,
      access_group integer generated by default as identity,
      manifest_idx integer not null,
      parent integer references sgd*.data_sources$,
      id bytea,
      param bytea,
      context jsonb
);

Each field is explained and justified below. This is a good time to bikeshed on all of their names.

vid: Just a primary key, has no particular meaning.

block_range: The use of a block range is meant to allow an efficient implementation of data source removal, which is not currently a feature but quickly becomes necessary for use cases such as replacing the IPFS file associated with a user profile. Since we anticipate that the query "which data sources are live at this block" will be relevant, a range makes sense compared to separate start and end block fields.
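As an illustration of the intended "live at this block" check (a minimal Python sketch, not graph-node code), block_range can be modeled as a half-open interval, mirroring how Postgres int4range stores `[start, end)` with an open upper bound for data sources that were never removed:

```python
# Illustrative sketch: a data source's block_range as a half-open
# interval [start, end), mirroring Postgres int4range semantics.
# An upper bound of None means the data source has not been removed.

def is_live(block_range, block):
    """Return True if `block` falls inside the half-open range."""
    start, end = block_range
    return start <= block and (end is None or block < end)

# A data source created at block 100 and removed at block 250:
ds_range = (100, 250)
assert is_live(ds_range, 100)        # live at its creation block
assert not is_live(ds_range, 250)    # upper bound is exclusive
assert is_live((100, None), 10_000)  # never removed: still live
```

In Postgres, the same check would use the range containment operator (`block_range @> $block`), which can use a GiST index over the range column.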

access_group: Works as described here in the 'data source independence' section. This will also be added to entity tables. We let the DB handle generating new access groups from a sequence when necessary, since these need not be deterministic.

manifest_idx: Serves the purpose that the name currently serves, which is to look up the definition in the manifest. This more efficient representation stores the position in the manifest rather than the name.

id: An optional user-given id, in case the user needs to refer to a specific data source after creation, such as for removal.

param: The stored representation of the params argument passed on creation. Examples of stored data: address for Ethereum, account id for NEAR, file CID for file data sources. A bytea representation is optimal for current uses, and we can always add a jsonb alternative if required in the future.
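To illustrate what the bytea column might hold for each kind of data source (hypothetical example values, not taken from any real subgraph):

```python
# Illustrative sketch: each data source kind encodes its creation
# parameter as raw bytes, matching the proposed `param bytea` column.
# All values below are made-up examples.

# Ethereum: a 20-byte contract address, stored as its raw bytes.
ethereum_address = bytes.fromhex("c0a47dfe034b400b47bdad5fecda2621de6c4d95")

# NEAR: an account id, which is naturally a UTF-8 string.
near_account = "alice.near".encode("utf-8")

# File data source: a CID, here stored as its string form's bytes
# (it could equally be stored as the decoded binary CID).
file_cid = "QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG".encode("utf-8")

assert len(ethereum_address) == 20  # Ethereum addresses are 20 bytes
```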

context: Optional data source context, as it exists today.

parent: The data source which triggered the creation of this data source. Tracking parent-child relationships will likely be required for features such as IPFS data sources created from other data sources. It's assumed in this design that each data source has a single parent.
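Since each row has at most one parent, the `parent` column forms a forest, and a data source's ancestry can be recovered by walking parent pointers. A minimal Python sketch (illustrative only, with made-up vids):

```python
# Illustrative sketch: the `parent` column maps each data source (by vid)
# to the vid of the data source that created it, or None for a static
# (manifest-declared) data source.

parents = {1: None, 2: 1, 3: 2}  # vid -> parent vid

def ancestors(vid):
    """Walk parent pointers from `vid` up to a root, nearest first."""
    chain = []
    cur = parents[vid]
    while cur is not None:
        chain.append(cur)
        cur = parents[cur]
    return chain

assert ancestors(3) == [2, 1]  # created by 2, which was created by 1
assert ancestors(1) == []      # a root has no ancestors
```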

Unblocks:
#3072
#3202

Ping @lutter @maoueh for review.

@leoyvens leoyvens self-assigned this Mar 29, 2022
@lutter
Collaborator

lutter commented Mar 29, 2022

This looks good. A few small comments:

  • the table should be called data_source$ since $ is legal in PG identifiers and can't clash with names coming from GraphQL (otherwise the table name could conflict with an entity type DataSource)
  • does access_group need to be an int8 or could we get away with an int4?
  • similarly, could manifest_idx be a int2?
  • will id be used to search for data sources? Could we get away with a bytea for it?

@leoyvens
Collaborator Author

The access_group space needs to be as large as the data source space. But I think we should make vid, access_group and parent all int4; data sources are all kept in memory, so it seems unrealistic to support billions of them, even if we're aiming to support millions. Making manifest_idx an int2 is only a small saving, and it's not unthinkable that someone would generate a manifest with over 32k data sources.
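For reference, the numeric bounds behind this tradeoff (standard Postgres integer type ranges, shown here as a small Python check):

```python
# Postgres integer bounds relevant to the column-width discussion:
# int2 (smallint) is 16-bit signed, int4 (integer) is 32-bit signed.
INT2_MAX = 2**15 - 1   # 32_767
INT4_MAX = 2**31 - 1   # 2_147_483_647

# A manifest with over 32k data source definitions would overflow int2,
# while int4 comfortably covers millions of dynamic data sources.
assert INT2_MAX == 32_767
assert INT4_MAX == 2_147_483_647
```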

About id, this is a DX question, since we're choosing between String or Bytes when we eventually have an AS interface for setting it. Using Bytes does seem like it may be the common case, for storing CIDs or contract addresses, and strings can always be converted to bytes, so I agree on making it bytea.

@maoueh
Contributor

maoueh commented Apr 8, 2022

That's pretty good, how will migrations be handled?

@leoyvens
Collaborator Author

leoyvens commented Apr 8, 2022

My current thinking is to not migrate existing subgraphs, so graph node will continue to support the existing schema.

@lutter
Collaborator

lutter commented Apr 12, 2022

Another option for migration would be to do it during start_subgraph_deployment: if the data_sources table does not exist yet, create it and populate it from the current table. It might be that not having to accommodate looking in two different storage schemes helps keep the code simpler.
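A minimal sketch of that idea (hypothetical names, not graph-node code): the create-and-copy step runs at most once per deployment, at a point where the subgraph is known not to be indexing, so repeated startups are no-ops:

```python
# Illustrative sketch of migrate-on-start, with a fake in-memory store.
# All class and method names here are hypothetical.

class FakeStore:
    def __init__(self):
        # Rows living in the legacy dynamic data source table.
        self.legacy_rows = [{"address": bytes.fromhex("c0ffee")}]
        self.tables = {}

    def start_subgraph_deployment(self):
        if "data_source$" not in self.tables:
            # Create the new table and populate it from the legacy one,
            # exactly once, while indexing is guaranteed to be stopped.
            self.tables["data_source$"] = list(self.legacy_rows)

store = FakeStore()
store.start_subgraph_deployment()
store.start_subgraph_deployment()  # idempotent: second call is a no-op
assert store.tables["data_source$"] == store.legacy_rows
```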

@leoyvens
Collaborator Author

leoyvens commented Apr 13, 2022

It would make the code cleaner, though it adds risk to the migration itself. One thing to consider is that when doing the migration we need to block all indexing that could be happening concurrently, and make sure the index nodes are updated before continuing. And the downtime this migration could impose on hosted, due to the number of subgraphs, might not be acceptable.

@lutter
Collaborator

lutter commented Apr 13, 2022

That's why I suggested doing it in start_subgraph_deployment, since you know indexing isn't happening at that point. Yes, it will slow down startup; you are right that it'll be hard to predict what the load on the system will be, and we'd have to figure out a way to quantify that. I don't think it'll introduce a lot of downtime, since the writes that need to happen are not that different from what a subgraph does when it writes 'normal' entities.

@leoyvens
Collaborator Author

Ah, I see now. Then a remaining issue would be whether we should have code to revert the migration ready, or whether we should rely on integration testing to give us enough confidence to migrate without the option of going back.

@leoyvens
Collaborator Author

My concern is the backwards compatibility with previous graph node versions. It's been a while since we've made a change that breaks the possibility of downgrading, and it's dangerous if downgrading corrupts the DB. The migration doesn't really need to be reverted since it's not destructive, it only copies data to a new table. One way to maintain backwards compatibility would be to write Ethereum data sources both to the legacy subgraphs.dynamic_ethereum_contract_data_source and the new table.

@azf20
Contributor

azf20 commented Apr 22, 2022

I think this data is only relevant during indexing, so could this be handled by apiVersioning? (file data sources will necessitate a new apiVersion)

@leoyvens
Collaborator Author

This change can be made without changing any existing behaviour, so it doesn't need a new api version.

@azf20
Contributor

azf20 commented Apr 22, 2022

Yes, but the functionality which requires it (file data sources) will require a new api version. If we want to maintain backwards compatibility, either the new Graph Node versions write to the legacy table, or versioning determines which table is written to for a given subgraph (older Graph Node versions won't be able to index the new apiVersion anyway, so we don't need to worry about rolling back)
I don't feel strongly, just felt like this was an alternative path. cc @dizzyd, who I was talking to yesterday about the various versions in Graph Node.

@leoyvens
Collaborator Author

That's possible, but makes it less likely that we'll ever remove the code path which supports the legacy table.

@leoyvens
Collaborator Author

After discussions with @lutter, the latest plan is:

  • Both schema versions will be supported in graph node, at least until all of hosted is migrated and the migration has been included in a release.
  • This means supporting both schemas for both indexing and grafting subgraphs.
  • The migration in hosted will be triggered on a per-shard configuration, so we have a chance to restore from a backup if something goes very wrong.

@air3ijai

This comment was marked as off-topic.

@leoyvens
Collaborator Author

leoyvens commented Aug 4, 2022

This is implemented for new deployments. The remaining work tracked by this issue would be to migrate existing deployments to the new schema.

@leoyvens
Collaborator Author

It's been a year and no particular need to actively move subgraphs to the new schema has been identified, so I'm closing.
