[WIP] [Blueprint] Support for multiple protocol versions #857
Comments
Some implementation detail proposals:
Could be a new decorator.
What about gatekeeper? Is it ever possible that it will be on a different cluster than core?
Actually, now I don't see any reason for those payloads to be anything other than self-validating dictionaries. There are no nested structures there, and all the types are simple. It shouldn't be difficult to change that at this point; there may just be a lot of places to touch.
I considered doing it like this but the problem is that an enum represents a static, unchanging list of values, and that's not what we have here. Adding a new version to the cluster adds a new possible value in the database, and that's a new, unknown enum item from the perspective of the older version. The older version may not be able to decode the message, but it should at least be able to read and validate the version field. If we make it an enum, that might not be possible.
Could be. But it's an implementation detail so I'm not going to add it to the blueprint.
Gatekeeper too. It's in the section title. Gatekeeper is already (conceptually) on a different cluster: it's running on the storage cluster and it's the only part of it that's version dependent. As stated in the text, Gatekeeper should try to decode the message using its own version. If decoding fails, the message must either be damaged or come from an incompatible version (we can't be sure which is true). Also, like in all other endpoints, there should be a check for
The main reason was the ease of implementation. golem-messages had exactly what we needed (and much more) so I took it. Even the current, very simplified frame structure took a lot of time to implement, so adding a custom format would delay implementation even further. It's possible to replace it with something simpler and I think we'll do that at some point. But please keep in mind that this will not let us stop using golem-messages in the Signing Service and Middleman. There are proposed extensions to the Signing Service that add sanity checks and we'll be implementing them soon after mainnet. These checks will involve Concent submitting more information (including some messages submitted during the use case) and the Signing Service checking it according to some rules. Now that I think about it, it's even worse than I stated in the blueprint, because we won't be able to make that work between versions. To decode those messages we'll have to have one instance of the Signing Service per golem-messages version.
Until we implement #1044.
That is not exactly how it is implemented. I remember @pawelkisielewicz had an idea that it is enough to check only
@cameel
Status of this blueprint
This blueprint is still a work in progress but the part describing changes needed in Concent's application code is done.
TODO: Diagram showing which components are shared between clusters
TODO: Deployment procedure for adding a cluster running a new protocol version
TODO: Changes in the structure of `concent-deployment-values`
Motivation
The protocol used by Concent to communicate with Golem clients is defined by golem-messages.
Currently the library is not designed to provide backwards-compatibility between major or minor releases. Golem is still under heavy development and the overhead needed to provide an API stable enough to keep it compatible is considered too big. Instead, every new version of the protocol effectively creates a separate network, and clients running different versions don't communicate with each other. Clients are expected to promptly switch to the latest version.
While users tend to quickly adopt new versions, there's always a period in which multiple versions coexist. Users of any of them may want to use Concent. The problem is that it's impossible to use multiple versions of the library in a single application.
The simplest solution is for Concent to only support clients capable of using the latest version of the protocol. That's actually our fallback solution. Concent is by design an optional service and access to it is not guaranteed. That's also the preferred solution for non-production environments.
That's not the solution we want to provide on the mainnet. Instead, there are multiple Concent clusters that share a load balancer and the database. Duplicating the whole cluster ensures that old clients can keep using Concent and finish any already started cases while the new version is already available to new clients. The load balancer can redirect traffic to the right cluster based on the protocol version specified in headers, so old and new clients can connect to the same address and do not need to be aware of multiple versions. Sharing the database ensures that all the clusters know which subtasks are being processed and won't allow clients to collect forced payments from the deposit for the same service multiple times, and also that all the components that communicate with the blockchain can correctly keep track of the Ethereum transaction nonce.
Protocol versions
Protocol versions are simply versions of golem-messages. Note that it's not a version of Golem or Concent. Different Golem versions using the same version of golem-messages can communicate with the same Concent cluster without the mechanism outlined below.
Version strings must conform to semver 2.0 format.
Versions are considered compatible if they share the major and minor version numbers. E.g. `2.18.5` is compatible with `2.18.1` but not with `2.17.5` or `3.0.0`.
Load balancer
Version based routing
The load balancer is an HTTP server that decides to which cluster a request should be routed. The decision is based on the protocol version specified in the `X-Golem-Messages` HTTP header. Rules are as follows:
Load balancer location
Concent's domain always points at the IP address of the cluster running the load balancer. The load balancer is an HTTP server (nginx) running on one of the clusters. The server is configured with the IP addresses of all the other clusters and can pass requests on to them.
All the clusters have static, public IP addresses and can also be reached directly, bypassing the load balancer.
Every cluster already has an nginx proxy, and version-based routing is just one more responsibility handled by it. It would be possible to configure nginx on all the clusters to route all messages based on version even when reached by IP, but keeping the addresses up to date would require deploying an updated configuration to all nginx instances on all clusters whenever a cluster is added or deleted. Since Golem will only be using the domain name anyway, doing it this way would be extra maintenance for no real benefit.
The load balancer can be on any cluster but it's recommended to always keep it on the one running the latest Concent version. When it needs to be moved to a different cluster, we configure it on the target cluster and give that cluster the IP of the previous one. The IP address the domain points at stays the same, to avoid changing DNS records (which is not instantaneous).
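Taken together, the compatibility rule and the header-based routing can be sketched in Python. This is illustrative only: the real balancer is nginx, and the cluster map, upstream addresses, and the fallback used when the header is missing are all assumptions, not part of the blueprint.

```python
# Illustrative sketch of version-based routing; the production load balancer
# is nginx, and all addresses and fallbacks below are assumed for the example.

def parse_version(version_string):
    """Parse a semver string like '2.18.5' into (major, minor, patch).

    Pre-release and build-metadata suffixes are not handled in this sketch.
    """
    major, minor, patch = version_string.split(".")[:3]
    return int(major), int(minor), int(patch)

def are_compatible(version_a, version_b):
    """Versions are compatible when their major and minor numbers match."""
    return parse_version(version_a)[:2] == parse_version(version_b)[:2]

# Hypothetical mapping from (major, minor) to a cluster's upstream address.
CLUSTERS = {
    (2, 18): "10.0.0.2",
    (3, 0): "10.0.0.3",
}
LATEST_CLUSTER = "10.0.0.3"

def select_cluster(headers):
    """Pick a cluster based on the X-Golem-Messages request header.

    Routing requests without the header to the latest cluster is an
    assumption; the blueprint only says the version is then not checked.
    """
    version = headers.get("X-Golem-Messages")
    if version is None:
        return LATEST_CLUSTER
    return CLUSTERS.get(parse_version(version)[:2], LATEST_CLUSTER)
```

In nginx terms the same decision would be expressed with a `map` over the header value and per-version `upstream` blocks; the Python form is only meant to make the rule explicit.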
SSL termination
SSL traffic terminates at the load balancer since it has to be able to see request headers. If it decides that the request is meant for the current cluster, it simply passes it on unencrypted to the endpoint that normally handles this type of request. If it's meant for a different cluster, the request must be re-encrypted and passed to that cluster.
All the clusters (including the one running the load balancer) have self-signed SSL certificates and the load balancer is configured to know their public keys. The certificates are issued for domains in the form `gm<version with dashes>.<concent domain>` (e.g. `gm2-13-0.concent.golem.network`). We may or may not actually create those domains. The names are mapped to IPs in the load balancer's `/etc/hosts` file, so using real subdomains is not necessary as long as we only use them internally (and even then clients can add them to `/etc/hosts` too or simply accept the certificate without validating the domain name).
Database
Since multiple Concent versions need to access the same database simultaneously and no single version will be able to decode all Golem messages stored in it, we need to store version information alongside the messages.
This information is stored in the `protocol_version` column in the `StoredMessage` model.
All messages associated with the same subtask must be compatible (but don't have to be exactly the same). This is enforced using validations and database constraints.
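The consistency rule above could be enforced along the following lines. This is a sketch only: the function name and calling convention are assumptions, and the real code enforces the rule with Django validators and database constraints rather than a standalone helper.

```python
# Sketch of the subtask-level compatibility validation described above.
# Names are illustrative assumptions, not Concent's actual code.

def minor_version(protocol_version):
    """Reduce a semver string like '2.18.5' to its (major, minor) pair."""
    major, minor = protocol_version.split(".")[:2]
    return int(major), int(minor)

def validate_subtask_versions(stored_versions, new_version):
    """Reject a message whose protocol_version is incompatible with the
    versions of messages already stored for the same subtask.

    Versions don't have to be identical, only compatible (same major.minor).
    """
    for existing in stored_versions:
        if minor_version(existing) != minor_version(new_version):
            raise ValueError(
                "protocol_version %s is incompatible with stored version %s"
                % (new_version, existing)
            )
```

For example, storing a `2.18.9` message next to `2.18.1` and `2.18.5` ones would pass, while `2.17.5` would be rejected.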
`StoredMessage` is meant to be the only place where Concent stores raw, serialized messages. All the other tables and APIs are designed to use simple values that do not depend on types defined in golem-messages.
The version number we store is the version of golem-messages used by Concent. It's not the version used by the client and submitted in an HTTP header because:
HTTP endpoints and validations
Every endpoint checks the protocol version specified in the `X-Golem-Messages` header and responds with `ServiceRefused` if it's not compatible with Concent's version. If the header is missing, the version is not checked and Concent assumes that the client uses a compatible version. This allows the client to force communication when using a version that's deemed incompatible based on version numbers but actually works in practice.
`send/` endpoint
Concent responds with `ServiceRefused` if the subtask referenced by the client is already present in the database and the `protocol_version` of any of its messages is incompatible with the golem-messages version available in that instance.
`receive/` endpoint
Since Concent creates responses on the fly, the client will always get a response containing a message created with Concent's version of golem-messages. The client can see this version in the `Concent-Golem-Messages-Version` HTTP header on the response.
If the response needs to contain a nested message taken from the database and that message is not compatible with Concent's version of golem-messages, Concent responds with `ServiceRefused`. The client is always required to stay on the same protocol version while communicating about the same subtask.
`ClientAuthentication` and `FileTransferToken` messages
Many endpoints accept additional messages for authentication in request headers or body. Concent simply decodes those messages using its own version of golem-messages. It has no way to discern an invalid message from a message that's valid but can't be decoded because the client has incorrectly declared a compatible version in `X-Golem-Messages`.
Signing Service
Every cluster has its own instance of Signing Service. This means that when Concent supports 5 versions, Golem must keep 5 copies of Signing Service running on its infrastructure. This is a consequence of Signing Service using golem-messages for communication.
That said, the service only needs a few message types and those messages are declared in its own code (rather than in golem-messages) so it should be possible to provide backwards-compatibility over a greater range of versions of golem-messages. By adding an ability for the service to stay connected to more than one Concent cluster we could in many cases support all the clusters with a single instance. This feature may be added at a later time if running multiple Signing Service instances proves to be too burdensome.
A potential obstacle to backwards-compatibility in the Signing Service and Middleman is that there are proposed extensions that will require passing the original messages submitted as a part of the use case (e.g. `TaskToCompute`, `ReportComputedTask`) to the Signing Service for inspection. This would tie it to a specific version of golem-messages and force us to have separate instances.
Load balancer and Signing Service
Since the Signing Service communicates over plain TCP (rather than HTTP), it has no way to specify the protocol version in headers. In fact, it cannot specify anything; nginx simply passes the TCP traffic where it's told to, without any modifications.
For that reason the Signing Service must communicate with the right cluster directly. Every instance has to receive the IP of the cluster running the same protocol version as a command-line parameter.
This is not a problem because the service is under Golem's complete control and users are not expected to be running it.
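Passing the cluster address on the command line might look like the sketch below. The option names, default port, and help text are illustrative assumptions, not the Signing Service's actual interface.

```python
import argparse

# Sketch only: option names and defaults are assumptions made for this
# example, not the real Signing Service command line.
def parse_arguments(argv):
    parser = argparse.ArgumentParser(description="Signing Service")
    parser.add_argument(
        "--concent-cluster-host",
        required=True,
        help="IP address of the Concent cluster running the same "
             "golem-messages version as this Signing Service instance.",
    )
    parser.add_argument(
        "--concent-cluster-port",
        type=int,
        default=9055,  # assumed default, for illustration only
        help="Port of the cluster's Middleman TCP endpoint.",
    )
    return parser.parse_args(argv)
```

Each running instance would then open a plain TCP connection to exactly the cluster given on its command line, which is what keeps the golem-messages versions on both ends in sync.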
Database and Signing Service
The code that handles communication with the Signing Service never stores or accesses Golem messages in the database so there's no risk of version incompatibility once the communication is established using the right versions.
Admin panel
The admin panel never needs to deal directly with Golem messages, and it connects to the same database no matter which cluster is serving it, so there are no problems with protocol incompatibilities here. Even when Golem messages are stored in tables, they are never decoded by the panel.
The only thing of concern is how to access it. Browsers may not be happy about the certificate when a cluster is accessed directly via IP, so it's recommended to always use the panel on the cluster running the latest version that's accessible via Concent's domain name.