The CTA Frontend uses the protobuf definitions in xrootd-ssi-protobuf-interface/eos_cta/protobuf/. These were designed for EOS and are not sufficiently general for use with dCache. This issue is to propose the minimum changes to the protobuf definitions to allow CTA to work with dCache, without breaking compatibility with EOS.
```proto
message FileInfo {
  string fid = 1;                    // disk system unique file ID
  uint64 size = 2;                   // file size
  string storageClass = 3;           // tape system related storage class (file family)
  cta.common.ChecksumBlob csb = 4;   // set of known checksums for the given file
}
```
EOS file metadata
Defined in message Metadata
fid is uint64 (vs. string in dCache)
size, csb are the same
Storage class is sent in the extended attributes, map<string, string> xattr
To be checked: do we even use all of the rest of the fields in message Metadata? Perhaps some fields can be deprecated.
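For reference, a trimmed-down sketch of the EOS-side Metadata fields discussed above; the field numbers and the exact set of fields are illustrative, not copied from eos.proto:

```proto
// Sketch only: the real cta.eos.Metadata message has more fields than shown here.
message Metadata {
  uint64 fid = 1;                    // EOS file ID (uint64, vs. string fid in dCache)
  uint64 size = 2;                   // file size (same meaning as in dCache FileInfo)
  cta.common.ChecksumBlob csb = 3;   // known checksums (same as dCache FileInfo)
  map<string, string> xattr = 4;     // extended attributes; currently carries the storage class
}
```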
```proto
message ArchiveResponse {
  uint64 fid = 1;     // tape system unique file ID
  string reqId = 2;   // tape request scheduler ID, used to cancel the request
}

message RetrieveResponse {
  string reqId = 1;   // tape request scheduler ID, used to cancel the request
}
```
Note: empty response for Delete and CancelRetrieve events = no error reporting!
CTA XrdSsi Frontend response format
```proto
message Response {
  enum ResponseType {
    RSP_INVALID      = 0;   //< Response type was not set
    RSP_SUCCESS      = 1;   //< Request is valid and was accepted for processing
    RSP_ERR_PROTOBUF = 2;   //< Framework error caused by Google Protocol Buffers layer
    RSP_ERR_CTA      = 3;   //< Server error reported by CTA Frontend
    RSP_ERR_USER     = 4;   //< User request is invalid
  }
  ResponseType type = 1;                  //< Encode the type of this response
  map<string, string> xattr = 2;          //< xattribute map
  string message_txt = 3;                 //< Optional response message text
  cta.admin.HeaderType show_header = 4;   //< Type of header to display (for stream responses)
}
```
Archive file id/request ID is returned in the extended attributes.
Why are we adapting the existing protobuf definitions? We could instead develop a protobuf for the new frontend that is independent of the design of xrootd-ssi-protobuf-interface, and develop an independent version of cta-admin, etc. to work with the gRPC frontend.
Fair comment; we don't need to tie ourselves rigidly to the existing definitions. I suppose our minimum constraints are these:
The new definitions should be broad enough to include all the metadata from EOS and the metadata from dCache.
The new definitions should be able to be implemented in the EOS WFE with a minimum of effort.
We will be using both Frontends in parallel for some time, so we should not have to maintain two separate sets of protobufs when we change something. (This can be solved by importing the existing definitions into a new version of cta_frontend.proto where possible.)
The current EOSCTA implementation went for a "One protobuf to rule them all" approach, where all request/response types are encompassed in one protobuf which is then sent to a single dispatcher function in the Frontend.
The dCache approach is to have a separate protobuf and RPC call for each request type.
Changing this would make it difficult to share code between the SSI and gRPC implementations, so we should discuss if there is a strong case for doing so.
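To make the difference concrete, here is a hypothetical side-by-side sketch of the two models. The service names (OneProtobufFrontend, PerEventFrontend) and the request message names (Request, ArchiveRequest, RetrieveRequest, DeleteRequest) are made up for illustration; only the response message names quoted earlier in this issue are reused:

```proto
import "google/protobuf/empty.proto";

// EOSCTA model: a single RPC whose request wraps every event type;
// the Frontend dispatches internally (XrdSsiCtaRequestMessage in the SSI case).
service OneProtobufFrontend {
  rpc Process (Request) returns (Response);
}

// dCache model: one RPC and one request/response pair per event type;
// gRPC itself does the dispatching.
service PerEventFrontend {
  rpc Archive        (ArchiveRequest)        returns (ArchiveResponse);
  rpc Retrieve       (RetrieveRequest)       returns (RetrieveResponse);
  rpc Delete         (DeleteRequest)         returns (google.protobuf.Empty);
  rpc CancelRetrieve (CancelRetrieveRequest) returns (google.protobuf.Empty);
}
```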
Hi Michael. From my experience with gRPC, the dCache approach seems better. The gRPC library already handles the dispatching, so there is no need to do this ourselves; it is just extra code that needs to be maintained. The way I see it, most of the code would still be reused with the dCache approach: looking at the class XrdSsiCtaRequestMessage.hpp, this is basically your gRPC service, but with the added complexity of dispatching the request to the appropriate function and obtaining the options for each command. That part can be removed with the dCache approach, and doing it this way also lets the compiler check for type errors, which our getOptional and getRequired functions do not allow for.
From my point of view, at least for the cta-admin interface, we could just keep the current proto and add what gRPC needs. Most of the rest is already there.
Note: empty response for Delete and CancelRetrieve events = no error reporting!
This is not quite true. The frontend can still return an error code and an error message, but instead of a reply object, the gRPC transport is used:
```cpp
Status CtaRpcImpl::CancelRetrieve(::grpc::ServerContext* context,
                                  const ::cta::dcache::rpc::CancelRetrieveRequest* request,
                                  ::google::protobuf::Empty* response) {
  ...
  // check/validate request args
  if (request->instance().name().empty()) {
    lc.log(cta::log::WARNING, "CTA instance is not set");
    return ::grpc::Status(::grpc::StatusCode::INVALID_ARGUMENT, "CTA instance is not set.");
  }
  ...
  lc.log(cta::log::INFO, "retrieve request canceled.");
  return Status::OK;
}
```
Hi @timkrtch, @jchodak, I have finally got around to looking at this in detail, sorry for the delay!
As previously mentioned, I would like to refactor the existing CTA Frontend so that there is a maximum amount of code shared between the two implementations, and only the transport protocol is different. This will ensure that the two implementations are equivalent, we perform exactly the same checks, and if we make changes in future, they only have to be made in one place. This is the main purpose of the changes I am proposing below.
The main data structure that is passed around in the XRootD Frontend is cta.eos.Notification. I would like to retain this data structure in the gRPC Frontend as it will make the refactoring and code sharing much easier. (Sorry for the eos in the name; this is a namespace issue that we can make more generic down the line.)
There is no problem with having a separate gRPC event for Archive, Retrieve, etc. to replace the current "one protobuf to rule them all and a big dispatcher" model. But I will add the dispatcher as well, to give us an easy migration path from EOS; it can be removed in the future.
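Concretely, something along these lines (a sketch only; the service name CtaRpc and the rpc names are illustrative, and the actual definitions are in the MR linked below):

```proto
// Sketch only: per-event RPCs that all carry the shared cta.eos.Notification payload,
// plus a generic dispatcher RPC to ease migration from the EOS WFE.
// Assumes the cta.eos and Response definitions are imported.
service CtaRpc {
  rpc Archive        (cta.eos.Notification) returns (Response);
  rpc Retrieve       (cta.eos.Notification) returns (Response);
  rpc Delete         (cta.eos.Notification) returns (Response);
  rpc CancelRetrieve (cta.eos.Notification) returns (Response);

  // Generic entry point: dispatches on the workflow event carried in the Notification,
  // mirroring what the XRootD SSI Frontend does today.
  rpc Process        (cta.eos.Notification) returns (Response);
}
```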
I promoted archive_file_id and storage_class to first-class citizens as requested. The "schedule by creation time" issue (CTA#1279) should also be fine; I just need to check that we are passing the correct btime (birth time) from EOS and fix that if not.
A note about what you called the "request id" for the Archive and Retrieve requests: this id is the address of the object in the objectstore. It's a hack to get around the fact that queue requests in the objectstore cannot be indexed by archiveFileId. In future we will move to a PostgreSQL SchedulerDB and then this id will be redundant. I renamed it to objectstore_id to make this distinction clearer. (Elsewhere in the code we use the XRootD request ID, which we call reqId, so this is also to avoid confusion between those different IDs).
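As a rough illustration of those two points (the field names follow the text above, but the field numbers and exact placement are indicative only; see the MR linked below for the actual definitions):

```proto
// Sketch only: archive_file_id and storage_class promoted out of the xattr map
// into first-class fields of the file metadata.
message Metadata {
  // ... existing fields (fid, size, csb, xattr, btime, ...) ...
  uint64 archive_file_id = 10;   // tape system unique file ID
  string storage_class   = 11;   // tape system storage class (file family)
}

// Sketch only: the scheduler "request id" renamed to make clear that it is an
// objectstore address, not an XRootD request ID.
message RetrieveResponse {
  string objectstore_id = 1;     // address of the queued request in the objectstore
}
```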
The MR with all these changes is !249 (merged). This implies some minor changes to dCache code to fill the protocol buffer. Please review and let me know if it is OK for you. Thanks!