Looks like for repack some queue names are longer than the maximum file name length, resulting in "file not found" errors when trying to lock the file. The error raised when the file cannot be created goes unnoticed:
[ RUN ] OStoreDBPlusMockSchedulerTestVFS/SchedulerTest.expandRepackRequest/0
unknown file: Failure
C++ exception with description "In BackendVFS::lockHelper(): no such file /tmp/jobStoreVFSYcU1tM/RetrieveQueueToReportToRepackForSuccess-RepackRequest-OStoreDBFactory-runner-zzbcjzs-project-139306-concurrent-0-vq5asmhj-25444-20231003-09:21:58-180-3-OStoreDBFactory-runner-zzbcjzs-project-139306-concurrent-0-vq5asmhj-25444-20231003-09:21:58-180-40
/builds/cta/CTA/build_rpm/RPM/BUILD/cta-5-6282826git4eb1b56f/build/common/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x6b) [0x7ff677e5031f]
/builds/cta/CTA/build_rpm/RPM/BUILD/cta-5-6282826git4eb1b56f/build/common...
It is not the name of the queue itself that is too long. The lock file name, "." + $queueName + ".lock", exceeds the 255-character limit, so creating the lock fails. The code then deletes the queue file, logs the failure (the log is not reported in the CI build stage), and continues running the test until it fails to find the queue file.
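To make the failure mode concrete, here is a minimal sketch (not CTA code) of how creating a file with an over-long name fails at the system call level. The helper name tryCreateFile is hypothetical; the only assumption is a POSIX system where open() sets errno to ENAMETOOLONG when a path component exceeds the file system limit.

```cpp
#include <cerrno>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper, not part of CTA: tries to create dir/fileName.
// Returns 0 on success, or the errno value reported by open() on failure.
int tryCreateFile(const std::string& dir, const std::string& fileName) {
  const std::string path = dir + "/" + fileName;
  const int fd = ::open(path.c_str(), O_CREAT | O_WRONLY, 0600);
  if (fd < 0) {
    // For a file name longer than the file system limit this is ENAMETOOLONG.
    return errno;
  }
  ::close(fd);
  ::unlink(path.c_str());  // remove the probe file again
  return 0;
}
```

If the return value is checked and turned into an exception at the point of creation, the test would fail there, with the real cause, instead of later with "no such file".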
Full name of queue object is: RetrieveQueueToReportToRepackForSuccess-RepackRequest-OStoreDBFactory-runner-zzbcjzs-project-139306-concurrent-0-r3jj3boe-25469-20231004-15:42:37-180-3-OStoreDBFactory-runner-zzbcjzs-project-139306-concurrent-0-r3jj3boe-25469-20231004-15:42:37-180-3 where the RepackRequest-...-180-3 refers to the VID (?? I don't fully understand why this is stored in the VID field).
The full name of the lock file for RetrieveQueueToReportToRepackForSuccess exceeds the 255-character (byte) limit for file names. The object name has the following structure: <ObjectType>-<Identifier (VID in this case)>-<ProcUniqueId>-<SequenceNumber>,
and the lock file adds "." at the beginning and ".lock" at the end of that name.
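The arithmetic can be sketched as follows. The function names and the constant are hypothetical, chosen for illustration only, and the 255-byte limit is assumed to be the per-file-name limit of the underlying file system. Since "." and ".lock" add 6 bytes, an object name has to stay at or below 249 bytes for its lock file to be creatable.

```cpp
#include <cstddef>
#include <string>

// Assumption: 255 bytes is the file name limit of the underlying file system.
constexpr std::size_t kMaxFileNameLength = 255;

// Hypothetical helper: <ObjectType>-<Identifier>-<ProcUniqueId>-<SequenceNumber>
std::string makeObjectName(const std::string& objectType,
                           const std::string& identifier,
                           const std::string& procUniqueId,
                           unsigned long sequenceNumber) {
  return objectType + "-" + identifier + "-" + procUniqueId + "-" +
         std::to_string(sequenceNumber);
}

// The lock file name is "." + objectName + ".lock", i.e. 6 extra bytes.
std::string makeLockFileName(const std::string& objectName) {
  return "." + objectName + ".lock";
}

// True if the lock file for this object can be created within the limit.
bool lockFileNameFits(const std::string& objectName) {
  return makeLockFileName(objectName).size() <= kMaxFileNameLength;
}
```

A check like lockFileNameFits could be run when the object name is generated, so that the problem is reported before any file is touched.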
Limiting the size of the hostname used for the ObjectStore's object names fixes the issue (https://gitlab.cern.ch/cta/CTA/-/blob/6c2c96ec5a04ed3e20f425fdf632c4d8c11e088a/objectstore/AgentReference.cpp#L45). I only tried a size of 45, but it could be a little longer. This should be documented somewhere, especially for external sites. This is the most trivial solution to implement and enforce, but it could mean losing track of the creator of an object if two different hostnames are long enough and share a common prefix. Environments with cloud-like naming conventions can be affected by this (like the one that raised this issue).
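A minimal sketch of the truncation and of the collision risk it introduces. This is not the actual change in AgentReference.cpp; truncateHostName and collideAfterTruncation are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: keeps at most maxLength leading characters of hostName.
// Shorter host names are returned unchanged.
std::string truncateHostName(const std::string& hostName, std::size_t maxLength) {
  return hostName.substr(0, maxLength);
}

// True if two distinct host names become indistinguishable after truncation.
bool collideAfterTruncation(const std::string& a, const std::string& b,
                            std::size_t maxLength) {
  return a != b && truncateHostName(a, maxLength) == truncateHostName(b, maxLength);
}
```

Assuming the runner hostnames are the runner-...-r3jj3boe and runner-...-vq5asmhj strings visible in the object names above, they differ only in their last 8 characters. A limit of 45 still keeps them apart, while a limit of 40 would map both to the same truncated name.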
As mentioned in the last comment, the VID is not an actual VID. This "issue" also appeared when Vlado was dumping objects during the "Scheduler taking down all drives" incident:
The only thing I see in common between these cases is that they involve ToReport queues. Are the fields in those queues used in a different way in order to reuse some protobuf messages, and then handled differently in the code depending on the queue? Is this documented anywhere?
Unrelated, but I found it funny.
I also discovered that we have several implementations to get the hostname of a machine. Some people make it really complex:
//------------------------------------------------------------------------------
// getHostName
//------------------------------------------------------------------------------
std::string cta::System::getHostName(){
  // All this to get the hostname, thanks to C !
  int len = 64;
  char* hostname;
  hostname = (char*) calloc(len, 1);
  if (0 == hostname) {
    OutOfMemory ex;
    ex.getMessage() << "Could not allocate hostname with length " << len;
    throw ex;
  }
  if (gethostname(hostname, len) < 0) {
    // Test whether error is due to a name too long
    // The errno depends on the glibc version
    if (EINVAL != errno && ENAMETOOLONG != errno) {
      free(hostname);
      cta::exception::Errnum e(errno);
      e.getMessage() << "gethostname error";
      throw e;
    }
    // So the name was too long
    while (hostname[len - 1] != 0) {
      len *= 2;
      char *hostnameLonger = (char*) realloc(hostname, len);
      if (0 == hostnameLonger) {
        free(hostname);
        cta::exception::Errnum e(ENOMEM);
        e.getMessage() << "Could not allocate memory for hostname";
        throw e;
      }
      hostname = hostnameLonger;
      memset(hostname, 0, len);
      if (gethostname(hostname, len) < 0) {
        // Test whether error is due to a name too long
        // The errno depends on the glibc version
        if (EINVAL != errno && ENAMETOOLONG != errno) {
          free(hostname);
          cta::exception::Errnum e(errno);
          e.getMessage() << "Could not get hostname" << strerror(errno);
          throw e;
        }
      }
    }
  }
  std::string res(hostname); // copy the string
  free(hostname);
  return res;
}
Others prefer simpler solutions:
//------------------------------------------------------------------------------
// getHostName
//------------------------------------------------------------------------------
std::string cta::tape::daemon::TapeDaemon::getHostName() const {
  char nameBuf[HOST_NAME_MAX + 1];
  if(gethostname(nameBuf, sizeof(nameBuf))) throw cta::exception::Errnum("Failed to get host name");
  return nameBuf;
}