Use async server-side implementation for WFE in gRPC Frontend
Test failures for JWT Authentication
When I first tested the changes for JWT Authentication in CI, the following tests failed:
- test_client
  - Reason: error thrown by the gRPC framework: "Server Threadpool Exhausted". The framework returns this error only for synchronous server RPC implementations, which can use at most a fixed (configurable) number of request-serving threads. Once all of these threads are in use, the framework rejects further requests with this error.
- test_client_gfal
  - Reason: Undetermined. These are the relevant errors seen in the test output, but from the logs it is not obvious what is going wrong:
ERROR with xrootd transfer for file 0/0004, full logs in /dev/shm/bfdf15c4-8a85-491e-98e7-99d2fdb0a74e/00004
ERROR with xrootd transfer for file 0/0000, full logs in /dev/shm/bfdf15c4-8a85-491e-98e7-99d2fdb0a74e/00000
ERROR with xrootd transfer for file 0/0009, full logs in /dev/shm/bfdf15c4-8a85-491e-98e7-99d2fdb0a74e/00009
ERROR with xrootd transfer for file 0/0012, full logs in /dev/shm/bfdf15c4-8a85-491e-98e7-99d2fdb0a74e/00012
ERROR with xrootd transfer for file 0/0002, full logs in /dev/shm/bfdf15c4-8a85-491e-98e7-99d2fdb0a74e/00002
ERROR with xrootd transfer for file 0/0005, full logs in /dev/shm/bfdf15c4-8a85-491e-98e7-99d2fdb0a74e/00005
ERROR with xrootd transfer for file 0/0007, full logs in /dev/shm/bfdf15c4-8a85-491e-98e7-99d2fdb0a74e/00007
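The "Server Threadpool Exhausted" failure in test_client can be pictured with a toy model. The sketch below is not gRPC code: `FixedPoolServer` and `burst` are hypothetical names, and the model assumes only what is stated above, namely that a sync server has a fixed cap on request-serving threads and rejects requests once all of them are busy.

```python
from __future__ import annotations


class FixedPoolServer:
    """Toy model of a sync server with a hard cap on request-serving threads."""

    def __init__(self, max_threads: int) -> None:
        self.max_threads = max_threads
        self.busy = 0  # threads currently serving a request

    def accept(self) -> str:
        """Try to start serving one request; reject it if no thread is free."""
        if self.busy >= self.max_threads:
            return "Server Threadpool Exhausted"
        self.busy += 1
        return "OK"

    def finish(self) -> None:
        """Mark one in-flight request as completed, freeing its thread."""
        if self.busy > 0:
            self.busy -= 1


def burst(server: FixedPoolServer, n_requests: int) -> list[str]:
    """Send n_requests back to back; none finishes before the last one arrives."""
    return [server.accept() for _ in range(n_requests)]


if __name__ == "__main__":
    server = FixedPoolServer(max_threads=4)
    results = burst(server, 6)
    print(results.count("OK"), "served,",
          results.count("Server Threadpool Exhausted"), "rejected")
```

In this model a larger `max_threads` lets the same burst through, but the limit is only moved, not removed.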
Failures also present in the main branch
I noticed that these tests were also failing in a similar way in our main branch, so the failures were not caused by the changes for JWT Authentication:
- an old pipeline from the main branch, https://gitlab.cern.ch/cta/CTA/-/pipelines/12290212
- and a newer one https://gitlab.cern.ch/cta/CTA/-/pipelines/12460922
- another one https://gitlab.cern.ch/cta/CTA/-/pipelines/12461897
To resolve at least the first problem, I had the idea of using the async callback API on the server side to implement the RPCs for the physics workflow events. With the async API, the framework controls the size of the thread pool and grows or shrinks it as needed, so this error cannot occur: looking at the gRPC source code, the "Server Threadpool Exhausted" error is raised only by the sync implementation.
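As a rough illustration of why a pool that resizes itself cannot produce this error, here is a minimal sketch. It models only the behaviour described above, not gRPC's actual scheduling; `ElasticPoolServer` and its shrink policy (drop idle threads immediately, down to a minimum) are assumptions made for the example.

```python
from __future__ import annotations


class ElasticPoolServer:
    """Toy model of a server whose thread pool grows and shrinks with load."""

    def __init__(self, min_threads: int = 1) -> None:
        self.min_threads = min_threads
        self.threads = min_threads  # current pool size
        self.busy = 0               # threads currently serving a request

    def accept(self) -> str:
        """Start serving one request, growing the pool if every thread is busy."""
        if self.busy == self.threads:
            self.threads += 1
        self.busy += 1
        return "OK"

    def finish(self) -> None:
        """Complete one request and release idle threads above the minimum."""
        if self.busy > 0:
            self.busy -= 1
        self.threads = max(self.min_threads, self.busy)


if __name__ == "__main__":
    server = ElasticPoolServer(min_threads=2)
    results = [server.accept() for _ in range(6)]
    print(results.count("OK"), "served with", server.threads, "threads")
```

There is no rejection branch in `accept`, which is the point of the model. A real system is still bounded by memory and operating system limits.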
Errors resolved by using async API, and stress test passes
Indeed this removes the errors for both tests on the main branch:
- https://gitlab.cern.ch/cta/CTA/-/pipelines/12435605
- test_client failure is due to 4 non-whitelisted errors and warnings, but the test completes otherwise
- stress test results:
Errors also resolved by better configuring the sync options, stress test also passes
- https://gitlab.cern.ch/cta/CTA/-/pipelines/12450696
- test_client failures: same as above, the test completes but fails due to 4 non-whitelisted errors/warnings
- stress test results:
A note on XRootD/SSI Configuration / number of threads
The XRootD/SSI framework has an async implementation for RPCs. The number of request-serving threads is managed by the framework, but there is a lower and an upper limit:
Defaults
xrd.sched mint 8 maxt 2048 avlt 512 idle 780
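To make the two limits explicit, the directive above can be split into its key/value pairs. The helper `parse_xrd_sched` is a hypothetical name, and reading `mint` and `maxt` as the minimum and maximum thread counts is my understanding of the XRootD `xrd.sched` directive rather than something stated here. The sketch handles only plain integer values like those on this line.

```python
from __future__ import annotations


def parse_xrd_sched(line: str) -> dict[str, int]:
    """Split an 'xrd.sched key value ...' directive into a dict of integers."""
    tokens = line.split()
    if not tokens or tokens[0] != "xrd.sched":
        raise ValueError("not an xrd.sched directive")
    args = tokens[1:]
    if len(args) % 2 != 0:
        raise ValueError("expected key/value pairs after xrd.sched")
    # Even positions are keys, odd positions are their values.
    return {key: int(value) for key, value in zip(args[0::2], args[1::2])}


if __name__ == "__main__":
    cfg = parse_xrd_sched("xrd.sched mint 8 maxt 2048 avlt 512 idle 780")
    print("lower limit:", cfg["mint"], "upper limit:", cfg["maxt"])
```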
Our configuration
Resume testing for JWT Authentication
Tested with 4096 threads and sync frontend:
- https://gitlab.cern.ch/cta/CTA/-/pipelines/12460219
- test_client_gfal2 fails
  - stress test triggered (but if test_client_gfal2 fails, there is not much hope for a successful stress test)
TODO: Test JWT Authentication with Async Frontend Implementation
More details in the linked codiMD.


