gRPC Frontend Future and Roadmap
This ticket gives an overview of the strategy and roadmap for the gRPC frontend.
gRPC workflow events and gRPC admin commands should not block each other. It is better to change and properly finish one thing at a time. We can survive with the XRootD SSI admin commands a bit longer. The priority should be to get the workflow events working.
- Allow spawning of both XRootD and gRPC frontends simultaneously in CI (Done, see #1148 (closed))
- Test the gRPC workflow events in CI (without authentication) while sticking with the XRootD frontend for cta-admin commands.
- Implement token authentication for the gRPC frontend (CTA keycloak instance) (Done, see !901 (merged))
- Test the gRPC workflow events in CI (with authentication) while sticking with the XRootD frontend for cta-admin commands.
- Stress test the gRPC workflow events (first in CI, later in a production scenario)
Completing the steps above should be our first target: at that point we should be confident that the workflow events work, and we can start rolling them out in production in exactly the same way as we did in CI (spawn both frontends).
The upgrade procedure for both CI and production is then very clear:
- Deploy a gRPC frontend, while keeping the XRootD frontend
- Switch the EOS configuration to use the gRPC frontend
- Done.
We should only turn our attention to the admin commands once the workflow events work reliably (i.e. somewhere after step 4 in the list above).
Based on the discussion we had, we want to make the EOS configuration switch a bit easier. Right now we have:
protowfusegrpc <true or false>
protowfendpoint <xrd-endpoint or grpc-frontend>
protowfresource <resource>
This makes the switch between gRPC and XRootD a bit painful, as we also have to update the protowfendpoint field every time. It would be better to keep the two endpoints in separate fields:
protowfusegrpc <true or false>
protowfendpointgrpc <grpc-endpoint>
protowfendpoint <xrd-endpoint>
protowfresource <resource>
This way, the switch is just a matter of toggling the protowfusegrpc flag, which is easier and less error-prone in a production environment.
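To illustrate the proposed layout, here is a minimal Python sketch of how a client could pick the endpoint from the flag alone. It is not the EOS implementation: the function names, the whitespace-separated "key value" parsing, and the host names, ports and resource value in the example are all assumptions made for illustration.

```python
# Hypothetical example values; only the four key names come from the proposal.
EXAMPLE_CONFIG = """
protowfusegrpc true
protowfendpointgrpc grpc-frontend.example.org:17017
protowfendpoint xrd-frontend.example.org:10955
protowfresource example-resource
"""


def parse_protowf_config(text):
    """Parse 'key value' lines into a dict, skipping blank lines."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            raise ValueError(f"malformed config line: {line!r}")
        config[parts[0]] = parts[1].strip()
    return config


def select_endpoint(config):
    """Return (protocol, endpoint) based only on the protowfusegrpc flag."""
    use_grpc = config.get("protowfusegrpc", "false").lower() == "true"
    key = "protowfendpointgrpc" if use_grpc else "protowfendpoint"
    if key not in config:
        raise KeyError(f"{key} must be set when protowfusegrpc is {use_grpc}")
    return ("grpc" if use_grpc else "xrootd", config[key])


if __name__ == "__main__":
    cfg = parse_protowf_config(EXAMPLE_CONFIG)
    print(select_endpoint(cfg))
    cfg["protowfusegrpc"] = "false"  # the only change needed to switch back
    print(select_endpoint(cfg))
```

Because both endpoints stay configured at all times, rolling back to XRootD in this sketch only requires flipping the flag; neither endpoint field is touched.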
Finally, high availability should be considered and tested from day 1. We need a way for the clients to find out which frontends are available and to use this information when connecting. If a connection fails, the client can retry with one of the other frontends. This is again something that can be tested in CI first.
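The retry behaviour described above could look roughly like the following Python sketch. It assumes the list of available frontends has already been obtained somehow (the discovery mechanism is still open and not shown), and it uses a hypothetical send callable in place of a real gRPC call; a failed connection is represented by ConnectionError.

```python
def call_with_failover(frontends, send):
    """Try each frontend in order and return (endpoint, result) for the first
    one that succeeds. 'send' is a hypothetical callable that takes an endpoint
    and raises ConnectionError when that frontend cannot be reached."""
    errors = []
    for endpoint in frontends:
        try:
            return endpoint, send(endpoint)
        except ConnectionError as exc:
            # Remember the failure and move on to the next frontend.
            errors.append((endpoint, str(exc)))
    raise ConnectionError(f"all frontends failed: {errors}")


if __name__ == "__main__":
    # Hypothetical frontends; the first one is simulated as being down.
    available = ["frontend-1.example.org", "frontend-2.example.org"]

    def fake_send(endpoint):
        if endpoint == "frontend-1.example.org":
            raise ConnectionError("connection refused")
        return "event accepted"

    print(call_with_failover(available, fake_send))
```

A CI test for this could stop one of the two frontends and check that workflow events still go through the remaining one.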