
[WIP] [Blueprint] Support for multiple protocol versions #857

Open
cameel opened this issue Sep 28, 2018 · 4 comments
Labels
blueprint Draft of a specification


cameel commented Sep 28, 2018

Status of this blueprint

This blueprint is still a work in progress but the part describing changes needed in Concent's application code is done.

TODO: Diagram showing which components are shared between clusters

TODO: Deployment procedure for adding a cluster running a new protocol version

  • Assigning persistent disks and IPs
  • Scripts that automate creation and destruction of clusters
  • Updating config maps on the cluster that hosts the load balancer
  • Database migration
  • When to add a new cluster and when to update an existing one

TODO: Changes in the structure of concent-deployment-values

  • List of settings that can differ between clusters running different protocol versions
  • Ability to assign different amount of resources to different clusters

Motivation

The protocol used by Concent to communicate with Golem clients is defined by golem-messages.

Currently the library is not designed to provide backwards compatibility between major or minor releases. Golem is still under heavy development and the overhead of providing an API stable enough to keep compatible is considered too big. Instead, every new version of the protocol effectively creates a separate network and clients running different versions don't communicate with each other. Clients are expected to promptly switch to the latest version.

While users tend to quickly adopt new versions, there's always a period in which multiple versions coexist. Users of any of them may want to use Concent. The problem is that it's impossible to use multiple versions of the library in a single application.

The simplest solution is for Concent to support only clients capable of using the latest version of the protocol. That's actually our fallback solution. Concent is by design an optional service and access to it is not guaranteed. It's also the preferred solution for non-production environments.

That's not the solution we want to provide on the mainnet. Instead, there are multiple Concent clusters that share a load balancer and the database. Duplicating the whole cluster ensures that old clients can keep using Concent and finish any already started cases while the new version is already available to new clients. The load balancer can redirect traffic to the right cluster based on the protocol version specified in headers, so old and new clients can connect to the same address and do not need to be aware of multiple versions. Sharing the database ensures that all the clusters know which subtasks are being processed and won't allow clients to collect forced payments from the deposit for the same service multiple times. It also ensures that all the components that communicate with the blockchain can correctly keep track of the Ethereum transaction nonce.

Protocol versions

Protocol versions are simply versions of golem-messages. Note that this is not the version of Golem or Concent. Different Golem versions using the same version of golem-messages can communicate with the same Concent cluster without the mechanism outlined below.

Version strings must conform to semver 2.0 format.

Versions are considered compatible if they share the major and minor version numbers. E.g. 2.18.5 is compatible with 2.18.1 but not with 2.17.5 or 3.0.0.
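As a minimal sketch of this rule (the helper name is hypothetical, not taken from Concent's codebase):

```python
def is_compatible(version_a: str, version_b: str) -> bool:
    # Versions are compatible when their major and minor numbers match;
    # the patch number is ignored.
    major_a, minor_a = version_a.split(".")[:2]
    major_b, minor_b = version_b.split(".")[:2]
    return (major_a, minor_a) == (major_b, minor_b)

assert is_compatible("2.18.5", "2.18.1")
assert not is_compatible("2.18.5", "2.17.5")
assert not is_compatible("2.18.5", "3.0.0")
```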

Load balancer

Version based routing

The load balancer is an HTTP server that decides which cluster a request should be routed to. This is decided based on the protocol version specified in the X-Golem-Messages HTTP header. The rules are as follows:

  • If the version is not in the right format, an HTTP 400 response is returned.
  • If the version is compatible with the one used by any of the available clusters, the request is routed to that cluster.
  • If there's no matching cluster, an HTTP 404 response is returned.
  • If the header is missing, the request is routed to the cluster running the latest version.
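The rules above can be sketched as follows. This is not the actual nginx configuration; the function, the cluster table keyed by major.minor version, and the simplified version regex are all assumptions for illustration:

```python
import re

# Simplified format check; full semver 2.0 also allows pre-release
# and build-metadata suffixes.
SEMVER_RE = re.compile(r"^\d+\.\d+\.\d+$")

def route(header_version, clusters, latest):
    """Return (cluster_address, http_status).

    `clusters` maps 'major.minor' strings to cluster addresses;
    `latest` is the address of the cluster running the newest version.
    """
    if header_version is None:
        # Missing header: route to the cluster running the latest version.
        return latest, 200
    if not SEMVER_RE.match(header_version):
        # Malformed version string.
        return None, 400
    key = ".".join(header_version.split(".")[:2])
    if key in clusters:
        return clusters[key], 200
    # No cluster runs a compatible version.
    return None, 404
```

Example: with `clusters = {"2.17": "10.0.0.1", "2.18": "10.0.0.2"}`, a request declaring 2.18.5 goes to 10.0.0.2, while one declaring 3.0.0 gets a 404.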

Load balancer location

Concent's domain always points at the IP address of the cluster running the load balancer. The load balancer is an HTTP server (nginx) running on one of the clusters. The server is configured with the IP addresses of all the other clusters and can pass requests on to them.

All the clusters have static, public IP addresses and can also be reached directly, bypassing the load balancer.

Every cluster already has an nginx proxy and version-based routing is just one more responsibility handled by it. It would be possible to configure nginx on all the clusters to route all messages based on version even when reached by IP, but keeping the addresses up to date would require deploying updated configuration for all nginx instances on all clusters whenever a cluster is added or deleted. Since Golem will only be using the domain name anyway, doing it this way would be just extra maintenance for no real benefit.

The load balancer can be on any cluster but it's recommended to always keep it on the one running the latest Concent version. When it needs to be moved to a different cluster, we configure it on the target cluster and give it the IP of the previous cluster. The IP address the domain points at stays the same to avoid changing DNS records (which is not instantaneous).

SSL termination

SSL traffic terminates at the load balancer since it has to be able to see request headers. If it decides that the request is meant for the current cluster, it simply passes it on unencrypted to the endpoint that normally handles this type of request. If it's meant for a different cluster, the request must be re-encrypted and passed to that cluster.

All the clusters (including the one running the load balancer) have self-signed SSL certificates and the load balancer is configured to know their public keys. The certificates are issued for domains in the form gm<version with dashes>.<concent domain> (e.g. gm2-13-0.concent.golem.network). We may or may not actually create those domains. The names are mapped to IPs in the load balancer's /etc/hosts file so using real subdomains is not necessary as long as we only use them internally (and even then clients can add them to /etc/hosts too or simply accept the certificate without validating the domain name).
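The name mapping described above amounts to a simple string transformation (the function name below is hypothetical):

```python
def certificate_domain(version: str, concent_domain: str) -> str:
    # "2.13.0" on "concent.golem.network" -> "gm2-13-0.concent.golem.network"
    return "gm{}.{}".format(version.replace(".", "-"), concent_domain)

assert certificate_domain("2.13.0", "concent.golem.network") == \
    "gm2-13-0.concent.golem.network"
```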

Database

Since multiple Concent versions need to access the same database simultaneously and no single version will be able to decode all Golem messages stored in it, we need to store information about the version.

This information is stored in the protocol_version column in the StoredMessage model.

All messages associated with the same subtask must be compatible (but don't have to be exactly the same). This is enforced using validations and database constraints.

StoredMessage is meant to be the only place where Concent stores raw, serialized messages. All the other tables and APIs are designed to use simple values that do not depend on types defined in golem-messages.
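The compatibility constraint on stored messages can be sketched at the application level as follows. The helper name and error message are hypothetical; the actual enforcement in Concent is done through Django validations and database constraints:

```python
def validate_subtask_version(new_version: str, stored_versions: list) -> None:
    """Reject a message whose protocol_version is incompatible (different
    major or minor number) with any message already stored for the subtask."""
    new_key = tuple(new_version.split(".")[:2])
    for stored in stored_versions:
        if tuple(stored.split(".")[:2]) != new_key:
            raise ValueError(
                "protocol_version {} is incompatible with stored "
                "version {}".format(new_version, stored))

# Patch-level differences between messages of one subtask are allowed:
validate_subtask_version("2.18.5", ["2.18.1", "2.18.3"])
```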

The version number we store is the version of golem-messages used by Concent. It's not the version used by the client and submitted in the HTTP header because:

  • The information we need is "which version of golem-messages has created the binary blob we're trying to deserialize". Since Concent always decodes received messages and serializes them again before putting them in the database, the answer is always the version Concent uses.
  • The client may not even specify its version in some cases. E.g. if the header is missing, we want to use one of the available versions by default.
  • Mixed versions are not supported. All nested messages must be deserializable by the same version as the message that contains them.

HTTP endpoints and validations

Every endpoint checks the protocol version specified in the X-Golem-Messages header and responds with ServiceRefused if it's not compatible with Concent's version. If the header is missing, the version is not checked and Concent assumes that the client uses a compatible version. This allows the client to force communication when using a version that's deemed incompatible based on version numbers but actually works in practice.
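One way to express this check (as later suggested in the comments, it could be a decorator) is sketched below. This is hypothetical code, not Concent's implementation; real endpoints are Django views and ServiceRefused is a golem-messages message, not a string:

```python
from functools import wraps

def check_protocol_version(concent_version):
    """Refuse requests whose X-Golem-Messages header declares a version
    incompatible with Concent's own golem-messages version."""
    def decorator(view):
        @wraps(view)
        def wrapper(headers, *args, **kwargs):
            client_version = headers.get("X-Golem-Messages")
            if client_version is None:
                # Missing header: assume the client uses a compatible version.
                return view(headers, *args, **kwargs)
            if client_version.split(".")[:2] != concent_version.split(".")[:2]:
                return "ServiceRefused"
            return view(headers, *args, **kwargs)
        return wrapper
    return decorator
```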

send/ endpoint

Concent responds with ServiceRefused if the subtask referenced by the client is already present in the database and protocol_version of any of its messages is incompatible with golem-messages available in that instance.

receive/ endpoint

Since Concent creates responses on the fly, the client will always get a response containing a message created with Concent's version of golem-messages. The client can see this version in the Concent-Golem-Messages-Version HTTP header on the response.

If the response needs to contain a nested message taken from the database and that message is not compatible with Concent's version of golem-messages, Concent responds with ServiceRefused. The client is always required to stay on the same protocol version while communicating about the same subtask.

ClientAuthentication and FileTransferToken messages

Many endpoints accept additional messages for authentication in request headers or body. Concent simply decodes those messages using its own version of Golem messages. It has no way to discern an invalid message from a message that's valid but can't be decoded because the client has incorrectly declared a compatible version in X-Golem-Messages.

Signing Service

Every cluster has its own instance of Signing Service. This means that when Concent supports 5 versions, Golem must keep 5 copies of Signing Service running on its infrastructure. This is a consequence of Signing Service using golem-messages for communication.

That said, the service only needs a few message types and those messages are declared in its own code (rather than in golem-messages) so it should be possible to provide backwards-compatibility over a greater range of versions of golem-messages. By adding an ability for the service to stay connected to more than one Concent cluster we could in many cases support all the clusters with a single instance. This feature may be added at a later time if running multiple Signing Service instances proves to be too burdensome.

A potential obstacle to backwards-compatibility in the Signing Service and Middleman is that there are proposed extensions that will require passing the original messages submitted as a part of the use case (e.g. TaskToCompute, ReportComputedTask) to the Signing Service for inspection. This would tie it to a specific version of golem-messages and force us to have separate instances.

Load balancer and Signing Service

Since the Signing Service communicates over plain TCP (rather than HTTP), it has no way to specify the protocol version in headers. In fact it cannot specify anything, and nginx simply passes the TCP traffic where it's told to, without any modifications.

For that reason Signing Service must communicate with the right cluster directly. Every instance has to get the IP of the cluster running the same protocol version as a command line parameter.

This is not a problem because the service is under Golem's complete control and users are not expected to be running it.

Database and Signing Service

The code that handles communication with the Signing Service never stores or accesses Golem messages in the database so there's no risk of version incompatibility once the communication is established using the right versions.

Admin panel

Admin panel never needs to deal directly with Golem messages and connects to the same database no matter which cluster is serving it so there are no problems with protocol incompatibilities here. Even when Golem messages are stored in tables, they are never decoded by the panel.

The only thing of concern is how to access it. Browsers may not be happy about the certificate when a cluster is accessed directly via IP, so it's recommended to always use the panel on the cluster running the latest version, which is accessible via Concent's domain name.

@cameel cameel added the blueprint Draft of a specification label Sep 28, 2018
@cameel cameel self-assigned this Sep 28, 2018
rwrzesien commented:

Some implementation details proposals:

This information is stored in the protocol_version column in the StoredMessage model.

protocol_version column could be enum type.

Every endpoint checks protocol version specified in the X-Golem-Messages header and responds with ServiceRefused if it's not compatible with Concent's version.

Could be a new decorator.

@cameel

Many endpoints accept additional messages for authentication in request headers or body. Concent simply decodes those messages using its own version of Golem messages. It has no way to discern an invalid message from a message that's valid but can't be decoded because the client has incorrectly declared a compatible version in X-Golem-Messages.

What about gatekeeper? Is it ever possible that it will be on a different cluster than core?

Every cluster has its own instance of Signing Service. This means that when Concent supports 5 versions, Golem must keep 5 copies of Signing Service running on its infrastructure. This is a consequence of Signing Service using golem-messages for communication.

Actually now I don't see any reason for those payloads to be anything else than self-validating dictionaries. There are no nested structures there, and all types are simple. It shouldn't be difficult to change that at this point, maybe just a lot of places to touch.


cameel commented Oct 2, 2018

protocol_version column could be enum type.

I considered doing it like this but the problem is that enum represents a static, unchanging list of values and that's not what we have here. Adding a new version to the cluster adds a new possible value in the database and that's a new, unknown enum item from the perspective of the older version.

The older version may not be able to decode the message but it should at least be able to read and validate the version field. If we made it an enum, that might not be possible.

Could be a new decorator.

Could be. But it's an implementation detail so I'm not going to add it to the blueprint.

What about gatekeeper? Is it ever possible that it will be on a different cluster than core?

Gatekeeper too. It's in the section title ("ClientAuthentication and FileTransferToken messages").

Gatekeeper is already (conceptually) on a different cluster. It's running on the storage cluster and it's the only part of it that's version-dependent.

As stated in the text, Gatekeeper should try to decode the message using its own version. If decoding fails, the message must either be damaged or come from an incompatible version (we can't be sure which is true). Also, like in all other endpoints, there should be a check for X-Golem-Messages, so if the client does not lie about the version, a version mismatch will be detected before we even try to decode the message.

Actually now I don't see any reason for those payloads to be anything else than self-validating dictionaries. There are no nested structures there, and all types are simple. It shouldn't be difficult to change that at this point, maybe just a lot of places to touch.

The main reason was the ease of implementation. Golem messages had exactly what we needed (and much more) so I took it. Even the current, very simplified frame structure took a lot of time to implement so adding a custom format would delay implementation even further.

It's possible to replace it with something simpler and I think we'll do that at some point.

But please keep in mind that this will not let us stop using golem-messages in Signing Service and Middleman. There are proposed extensions to Signing Service that add sanity checks and we'll be implementing them soon after mainnet. These checks will involve Concent submitting more information (including some messages submitted during the use case) and Signing Service checking it according to some rules. Now that I think about it, it's even worse than I stated in the blueprint because we won't be able to make that work between versions. To decode those messages we'll have to have one instance of Signing Service per golem-messages version.


rwrzesien commented Jan 8, 2019

@cameel

Admin panel never needs to deal directly with Golem messages

Until we implement #1044.

send/ endpoint
Concent responds with ServiceRefused if the subtask referenced by the client is already present in the database and protocol_version of any of its messages is incompatible with golem-messages available in that instance.

That is not exactly how it is implemented. I remember @pawelkisielewicz had an idea that it is enough to check only task_to_compute as it comes with every message. I am not saying this assumption is wrong, I just wanted to point it out as a difference for further consideration.

When reviewing the code related to multiple golem-messages version support I have found some implementation issues that might be fixed with low priority:
  • is_given_golem_messages_version_supported_by_concent - make it a one-liner
  • StoredMessage.protocol_version.max_length - make it longer (currently 10)
  • StoredMessage.protocol_version.__str__ - use an f-string


kbeker commented Feb 18, 2019

@cameel The /receive endpoint now has 2 database queries to match messages for the client. Clients will only get messages which are compatible with both the Concent Service and the Golem Client. Any older messages waiting in the database as undelivered will stay there until the client contacts Concent with the appropriate golem-messages version.
