Implement indexing system #4

yourheropaul · 2019-08-02T10:49:10Z

The proposal is defined in Joystream/joystream#17.

Update 27/08/2019: We'll now use an on-chain approach. A purpose-built indexing module will provide storage for the schemata and some programmatic method of resolving nominated queries from consumers.

The indexing node takes the form of a separate application that's designed to be paired with at least one archive node. It will read the schema provided by the indexing module and query resolution code from the indexing module storage, and then present a query service to end users.

Assumptions

The query server will be compatible with GraphQL. No other system of data delivery will be accommodated in the first version.
The query resolution code will be WASM. This will be compiled by users of the module, and uploaded into the module storage for their particular runtime.
Basic storage queries will be supported for all modules, but the emphasis will be on bespoke queries that are useful to a consumer. With reference to the original proposal, a "list of great people" query is more useful than a "list of people" query that can be filtered by greatness (see original proposal for context). This will require custom code to be written for each bespoke query.
The test case will be the existing forum module. Ideally Pioneer could be updated to use the indexing node instead of the RPC API.
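To illustrate the difference between a basic storage query and a bespoke query, here is a minimal TypeScript sketch. Everything in it is hypothetical: the `Person` type, the `resolvers` table, the `resolve` helper and the greatness threshold are stand-ins for illustration, not part of the proposal, and real resolvers would be compiled to WASM rather than written as plain functions.

```typescript
// Hypothetical types and names throughout; none of this comes from the proposal.
interface Person {
  name: string;
  greatness: number;
}

// A resolver takes query arguments and returns a serialisable result.
type Resolver = (args: Record<string, unknown>) => unknown;

// Stand-in for data the indexing node would read through an archive node.
const people: Person[] = [
  { name: "Ada", greatness: 9 },
  { name: "Bob", greatness: 3 },
  { name: "Cleo", greatness: 8 },
];

// Assumed cut-off for "great"; in this sketch the query author owns it.
const GREATNESS_THRESHOLD = 7;

const resolvers: Record<string, Resolver> = {
  // Basic storage query: exposes raw records and leaves filtering to the consumer.
  people: (args) => {
    const min = typeof args.minGreatness === "number" ? args.minGreatness : 0;
    return people.filter((p) => p.greatness >= min);
  },
  // Bespoke query: answers the question the consumer actually has,
  // without exposing how greatness is stored or measured.
  greatPeople: () =>
    people.filter((p) => p.greatness >= GREATNESS_THRESHOLD).map((p) => p.name),
};

// Look up a resolver by query name and run it.
function resolve(queryName: string, args: Record<string, unknown> = {}): unknown {
  const resolver = resolvers[queryName];
  if (resolver === undefined) {
    throw new Error(`Unknown query: ${queryName}`);
  }
  return resolver(args);
}
```

In this sketch a consumer calling `resolve("greatPeople")` never sees the `greatness` field, so the storage representation could change without breaking their application.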

Expected users

API Consumer

The end user, who will make requests to the indexing node in order to build applications on top of the Joystream platform. We can typically expect consumers to want their applications to be reasonably long-lived, and ideally survive runtime upgrades without any code changes. This user will want the API to be as responsive and reliable as possible. They may also want to validate proofs for requests made to the indexing node API.

Query author

A mid-level user who writes bespoke queries for modules. They will want an intuitive way of constructing bespoke queries for compilation to WASM, and some tooling to deploy the new binary blobs to the indexing module storage.

Indexing node operator

Someone who runs and maintains an indexing node; the lowest-level user. They will need to be incentivised to run the node.

Design goals:

Decoupled, interface-level interactions

Consumers should not be required to understand the underlying storage formats of data in the chain, and applications should not break if the underlying storage changes. The indexing system should provide real-world bespoke queries that perform operations or return results at a level of abstraction that the consumer actually cares about.

For example, a consumer might want to fetch a list of the categories in the forum module. In the current implementation, the list of categories is stored as a map, for which the keyspace is not directly published; instead, the next available key can be retrieved from storage. It's possible to infer the list of available keys and thus iterate through the map using the RPC API, but the consumer shouldn't be required to know that or implement such a solution themselves. If they did, their application would become brittle as a result of the tight coupling to the underlying storage.

Instead, the indexing system should provide a list-of-categories query, which performs the underlying lookup procedure on behalf of the consumer.
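A list-of-categories query along these lines could be sketched as follows. The `ForumStorage` interface, its method names, and the assumption that category IDs start at 1 and increase by one are all hypothetical; a real storage read would also go through the RPC API and be asynchronous, which is omitted here for brevity.

```typescript
interface Category {
  id: number;
  title: string;
}

// Hypothetical view of the forum module's storage: a map of categories
// plus the next available key, as described above.
interface ForumStorage {
  nextCategoryId(): number;
  categoryById(id: number): Category | undefined;
}

// Infer the keyspace from the next available key and walk the map.
// firstId = 1 is an assumption about where the module starts counting.
function listCategories(storage: ForumStorage, firstId = 1): Category[] {
  const categories: Category[] = [];
  const next = storage.nextCategoryId();
  for (let id = firstId; id < next; id++) {
    const category = storage.categoryById(id);
    // Skip gaps in case an entry is missing for an ID.
    if (category !== undefined) {
      categories.push(category);
    }
  }
  return categories;
}

// In-memory stand-in for chain storage, for experimenting with the sketch.
function inMemoryForum(titles: string[]): ForumStorage {
  const map = new Map<number, Category>();
  titles.forEach((title, i) => map.set(i + 1, { id: i + 1, title }));
  return {
    nextCategoryId: () => titles.length + 1,
    categoryById: (id) => map.get(id),
  };
}
```

The consumer would only ever call the query; the key inference stays inside the indexing system and can change along with the storage layout.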

Module integration

The process of integrating a module with the indexing system (that is, the indexing module and the indexing node) should be as painless as possible. Bespoke query code should be written in a language with wide support, and there should be some sort of support API to minimise the amount of boilerplate and complexity.

Ideally, modules should be able to specify their own bespoke queries. A halfway measure for accomplishing this might be automatically parsing the module's Rust code and using it to generate at least some of the bespoke query code.

Performance

A deployment of the indexing system should be able to support a large number of concurrent queries; the exact target is to be determined. This may involve cache layers, or many-to-many relationships between indexing nodes and archive nodes.
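One possible shape for such a cache layer is sketched below, assuming results are keyed by block hash and query text (state at a given block does not change, so an entry never goes stale). The `QueryCache` class and its eviction policy are illustrative assumptions, not a design decision from the proposal.

```typescript
// Hypothetical cache layer sitting between the query service and archive nodes.
class QueryCache {
  private readonly entries = new Map<string, unknown>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  // Return the cached result for (blockHash, query), computing it on a miss.
  get<T>(blockHash: string, query: string, compute: () => T): T {
    const key = `${blockHash}:${query}`;
    if (this.entries.has(key)) {
      return this.entries.get(key) as T;
    }
    const value = compute();
    if (this.entries.size >= this.maxEntries) {
      // Map preserves insertion order, so the first key is the oldest entry.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) {
        this.entries.delete(oldest);
      }
    }
    this.entries.set(key, value);
    return value;
  }
}
```

Evicting the oldest inserted entry keeps the sketch short; a production cache would more likely evict by recency of use and bound memory rather than entry count.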

In-flight updates

When updated, the indexing module should signal that an update has occurred, and allow any consumers to update their client implementations. Additionally, the indexing node should detect changes to the module and update its schema appropriately.

Ideally the indexing node should update connected browsers seamlessly, without requiring them to refresh the web interface.
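The update flow described here can be sketched with a small publish/subscribe helper. The `Signal` class, the `SchemaUpdated` event shape and the `IndexingNode` fields are hypothetical; in practice the module event would arrive from the chain and the client notification would travel over something like a WebSocket.

```typescript
type Listener<T> = (payload: T) => void;

// Minimal publish/subscribe helper (hypothetical, for illustration only).
class Signal<T> {
  private listeners: Listener<T>[] = [];

  // Register a listener; the returned function removes it again.
  subscribe(listener: Listener<T>): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  emit(payload: T): void {
    // Iterate over a copy so listeners may unsubscribe while being notified.
    for (const listener of [...this.listeners]) {
      listener(payload);
    }
  }
}

// Assumed shape of the event the indexing module emits when it is updated.
interface SchemaUpdated {
  version: number;
  schema: string;
}

class IndexingNode {
  schemaVersion = 0;
  schema = "";
  // Connected clients subscribe here to learn about new schema versions.
  readonly clientNotifications = new Signal<number>();

  constructor(moduleEvents: Signal<SchemaUpdated>) {
    moduleEvents.subscribe((event) => this.applyUpdate(event));
  }

  private applyUpdate(event: SchemaUpdated): void {
    // Ignore stale or duplicate events.
    if (event.version <= this.schemaVersion) {
      return;
    }
    this.schemaVersion = event.version;
    this.schema = event.schema;
    // Tell connected clients so they can refetch without a page refresh.
    this.clientNotifications.emit(event.version);
  }
}
```

Pushing only the version number to clients keeps the notification small; each client can then decide when to fetch the new schema.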

Technical considerations:

Arrangement of the index node in relation to the network. It's currently assumed that we'll use a satellite pattern, wherein the indexing service (it's not exactly a node in itself) extracts data from a full node.
Query Language. GraphQL, which is also likely to be used in Geth. This will move the vast majority of the processing to the server side.
Implementation. @joystream/types currently depends on an old version of @polkadot/api, which is missing some of the modern APIs. This is now invalid, because the types will be specified on-chain.
Performance issues. Given that a goal is high performance, the architecture of the indexing node should be carefully considered. The current application stub is a node.js application.
Interface. There must be some easy way of executing queries and seeing their results. The proposal also calls for a JavaScript client library with a built-in light client to validate proofs.

Questions

Is it possible to use The Graph Protocol with a Joystream/Substrate datasource? See comment below for details. Answer: Maybe, but not in its current form. We would have to add a new data source for Substrate; see Can we use the a modified version of the existing Graph Protocol node? #5.
How to extract state from the chain? ~~The previous attempt hinted at what seems to be an ETL approach to data. The extraction logic was not defined, but it would have to have been via the API.~~ Answer: using the RPC, via its metadata and our internal TypeScript classes. See Extracting data from a (full) node #6.
How do we map concepts in storage to GraphQL? By using a combination of static types from @joystream/types and the metadata from the API. See Mapping chain concepts to GraphQL #7. This will now be accomplished by the indexing module.
How do we detect runtime changes in flight? Ideally we'd detect changes to the runtime, and update the query schema without having to restart any clients. This will be possible with the on-module approach using module events.
How do we validate proofs on the frontend? We'll need a light node.
Is there any way of inferring queries from existing modules (in the runtime or via Rust code)? Maybe. We can assume that all modules will be open source, so it may be possible to parse existing Rust code and somehow extract bespoke queries and resolvers from that code. It's also conceivable that the runtime itself may be interrogated in order to inform schema generation. The time invested in such automation may be better spent elsewhere.
Can we make use of change_logs, as suggested by Gavin Wood (and expanded)?
Application language. Could we use Go? It's more performant than node, there's an existing Substrate RPC client, and it has good GraphQL support. It might be possible to re-use some of the Geth implementation for their (history-only) attempt. Answer: TypeScript. It's a midway point between Go (which few people know) and plain node.js (which everyone knows).

Useful resources

polkadot runtime spec
Substrate RPC API
EthQL — we basically need this for Substrate
https://github.com/graphprotocol/graph-cli — The Graph Protocol's extractor

yourheropaul added the research label Aug 2, 2019

yourheropaul self-assigned this Aug 2, 2019

yourheropaul mentioned this issue Aug 21, 2019

[WIP] Implement indexing node #8

Merged

7 tasks

yourheropaul changed the title ~~Implement indexing node~~ Implement indexing system Aug 29, 2019
