The CTA Frontend uses the protobuf definitions in xrootd-ssi-protobuf-interface/eos_cta/protobuf/. These were designed for EOS and are not sufficiently general for use with dCache. This issue is to propose the minimum changes to the protobuf definitions to allow CTA to work with dCache, without breaking compatibility with EOS.
```proto
message FileInfo {
  string fid = 1;                    // disk system unique file ID
  uint64 size = 2;                   // file size
  string storageClass = 3;           // tape system related storage class (file family)
  cta.common.ChecksumBlob csb = 4;   // set of known checksums for the given file
}
```
EOS file metadata
Defined in message Metadata
fid is uint64 (vs. string in dCache)
size, csb are the same
Storage class is sent in the extended attributes, map<string, string> xattr
To be checked: do we even use all of the rest of the fields in message Metadata? Perhaps some fields can be deprecated.
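For reference, a trimmed-down sketch of the EOS-side Metadata fields discussed above; the field numbers and the exact set of fields are illustrative, not copied from eos.proto:

```proto
// Sketch only: the real cta.eos.Metadata message has more fields than shown here.
message Metadata {
  uint64 fid = 1;                    // EOS file ID (uint64, vs. string fid in dCache)
  uint64 size = 2;                   // file size (same meaning as in dCache FileInfo)
  cta.common.ChecksumBlob csb = 3;   // known checksums (same as dCache FileInfo)
  map<string, string> xattr = 4;     // extended attributes; currently carries the storage class
}
```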
```proto
message ArchiveResponse {
  uint64 fid = 1;     // tape system unique file ID
  string reqId = 2;   // tape request scheduler ID, used to cancel the request
}

message RetrieveResponse {
  string reqId = 1;   // tape request scheduler ID, used to cancel the request
}
```
Note: empty response for Delete and CancelRetrieve events = no error reporting!
CTA XrdSsi Frontend response format
```proto
message Response {
  enum ResponseType {
    RSP_INVALID      = 0;   //< Response type was not set
    RSP_SUCCESS      = 1;   //< Request is valid and was accepted for processing
    RSP_ERR_PROTOBUF = 2;   //< Framework error caused by Google Protocol Buffers layer
    RSP_ERR_CTA      = 3;   //< Server error reported by CTA Frontend
    RSP_ERR_USER     = 4;   //< User request is invalid
  }
  ResponseType type = 1;                  //< Encode the type of this response
  map<string, string> xattr = 2;          //< xattribute map
  string message_txt = 3;                 //< Optional response message text
  cta.admin.HeaderType show_header = 4;   //< Type of header to display (for stream responses)
}
```
Archive file id/request ID is returned in the extended attributes.
Why are we adapting the existing protobuf definitions? We could instead develop a protobuf for the new frontend that is independent of the design of xrootd-ssi-protobuf-interface, and develop an independent version of cta-admin, etc. to work with the gRPC frontend.
Fair comment; we don't need to tie ourselves rigidly to the existing definitions. I suppose our minimum constraints are these:
The new definitions should be broad enough to include all the metadata from EOS and the metadata from dCache.
The new definitions should be able to be implemented in the EOS WFE with a minimum of effort.
We will be using both Frontends in parallel for some time, so we should not have to maintain two separate sets of protobufs when we change something. (This can be solved by importing the existing definitions into a new version of cta_frontend.proto where possible.)
The current EOSCTA implementation went for a "One protobuf to rule them all" approach, where all request/response types are encompassed in one protobuf which is then sent to a single dispatcher function in the Frontend.
The dCache approach is to have a separate protobuf and RPC call for each request type.
Changing this would make it difficult to share code between the SSI and gRPC implementations, so we should discuss if there is a strong case for doing so.
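To make the difference concrete, here is a hypothetical side-by-side sketch of the two models. The service names (OneProtobufFrontend, PerEventFrontend) and the request message names (Request, ArchiveRequest, RetrieveRequest, DeleteRequest) are made up for illustration; only the response message names quoted earlier in this issue are reused:

```proto
import "google/protobuf/empty.proto";

// EOSCTA model: a single RPC whose request wraps every event type;
// the Frontend dispatches internally (XrdSsiCtaRequestMessage in the SSI case).
service OneProtobufFrontend {
  rpc Process (Request) returns (Response);
}

// dCache model: one RPC and one request/response pair per event type;
// gRPC itself does the dispatching.
service PerEventFrontend {
  rpc Archive        (ArchiveRequest)        returns (ArchiveResponse);
  rpc Retrieve       (RetrieveRequest)       returns (RetrieveResponse);
  rpc Delete         (DeleteRequest)         returns (google.protobuf.Empty);
  rpc CancelRetrieve (CancelRetrieveRequest) returns (google.protobuf.Empty);
}
```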
Hi Michael. From my experience with gRPC, the dCache approach seems better. The gRPC library already handles the dispatching, so there is no need to do this ourselves; it is just extra code that needs to be maintained. The way I see it, most of the code would still be reused with the dCache approach: looking at the class XrdSsiCtaRequestMessage.hpp, this is basically your gRPC service, but with the added complexity of dispatching the request to the appropriate function and obtaining the options for each command. That part can be removed with the dCache approach, and doing it this way also lets the compiler check for type errors, which our getOptional and getRequired functions do not allow for.
From my point of view, at least for the cta-admin interface, we could just keep the current proto and add what gRPC needs. Most of the rest is already there.
Note: empty response for Delete and CancelRetrieve events = no error reporting!
This is not quite true. The frontend can still return an error code and an error message, but instead of a reply object, the gRPC transport is used:
```cpp
Status CtaRpcImpl::CancelRetrieve(::grpc::ServerContext* context,
                                  const ::cta::dcache::rpc::CancelRetrieveRequest* request,
                                  ::google::protobuf::Empty* response) {
  ...
  // check/validate request args
  if (request->instance().name().empty()) {
    lc.log(cta::log::WARNING, "CTA instance is not set");
    return ::grpc::Status(::grpc::StatusCode::INVALID_ARGUMENT, "CTA instance is not set.");
  }
  ...
  lc.log(cta::log::INFO, "retrieve request canceled.");
  return Status::OK;
}
```
Hi @timkrtch, @jchodak, I have finally got around to looking at this in detail, sorry for the delay!
As previously mentioned, I would like to refactor the existing CTA Frontend so that there is a maximum amount of code shared between the two implementations, and only the transport protocol is different. This will ensure that the two implementations are equivalent, we perform exactly the same checks, and if we make changes in future, they only have to be made in one place. This is the main purpose of the changes I am proposing below.
The main data structure that is passed around in the XRootD Frontend is cta.eos.Notification. I would like to retain this data structure in the gRPC Frontend as it will make the refactoring and code sharing much easier. (Sorry for the eos in the name; this is a namespace issue that we can make more generic down the line.)
There is no problem with having a separate gRPC event for Archive, Retrieve, etc. to replace the current "one protobuf to rule them all and a big dispatcher" model. But I will add the dispatcher as well, to give us an easy migration path from EOS; it can be removed in the future.
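Concretely, something along these lines (a sketch only; the service name CtaRpc and the rpc names are illustrative, and the actual definitions are in the MR linked below):

```proto
// Sketch only: per-event RPCs that all carry the shared cta.eos.Notification payload,
// plus a generic dispatcher RPC to ease migration from the EOS WFE.
// Assumes the cta.eos and Response definitions are imported.
service CtaRpc {
  rpc Archive        (cta.eos.Notification) returns (Response);
  rpc Retrieve       (cta.eos.Notification) returns (Response);
  rpc Delete         (cta.eos.Notification) returns (Response);
  rpc CancelRetrieve (cta.eos.Notification) returns (Response);

  // Generic entry point: dispatches on the workflow event carried in the Notification,
  // mirroring what the XRootD SSI Frontend does today.
  rpc Process        (cta.eos.Notification) returns (Response);
}
```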
I promoted archive_file_id and storage_class to first-class citizens as requested. The "schedule by creation time" issue (CTA#1279) should also be fine; I just need to check that we are passing the correct btime (birth time) from EOS and fix that if not.
A note about what you called the "request id" for the Archive and Retrieve requests: this id is the address of the object in the objectstore. It's a hack to get around the fact that queue requests in the objectstore cannot be indexed by archiveFileId. In future we will move to a PostgreSQL SchedulerDB and then this id will be redundant. I renamed it to objectstore_id to make this distinction clearer. (Elsewhere in the code we use the XRootD request ID, which we call reqId, so this is also to avoid confusion between those different IDs).
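As a rough illustration of those two points (the field names follow the text above, but the field numbers and exact placement are indicative only; see the MR linked below for the actual definitions):

```proto
// Sketch only: archive_file_id and storage_class promoted out of the xattr map
// into first-class fields of the file metadata.
message Metadata {
  // ... existing fields (fid, size, csb, xattr, btime, ...) ...
  uint64 archive_file_id = 10;   // tape system unique file ID
  string storage_class   = 11;   // tape system storage class (file family)
}

// Sketch only: the scheduler "request id" renamed to make clear that it is an
// objectstore address, not an XRootD request ID.
message RetrieveResponse {
  string objectstore_id = 1;     // address of the queued request in the objectstore
}
```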
The MR with all these changes is !249 (merged). This implies some minor changes to dCache code to fill the protocol buffer. Please review and let me know if it is OK for you. Thanks!