The repack systemtest is failing relatively frequently by timing out on the last step:
```
Waiting for repack request on tape V00101 to be complete: Seconds passed = 599
Timed out after 600 seconds waiting for tape V00101 to be repacked
Result of Repack ls
c.time           repackTime c.user    vid    tapepool   providedFiles totalFiles totalBytes selectedFiles filesToRetrieve filesToArchive failed status
2024-12-03 14:14 10m22s     ctaadmin2 V00101 ctasystest 0             2143       32.9M      6429          0               200            0      Running

DestinationVID NbFiles totalSize
V00102         2143    32916480
V00106         2143    32916480
V00108         1943    29844480
```
Note that which tapes/files are problematic appears to be somewhat random. For example, in the run below the third destination tape is not even present:
```
Waiting for repack request on tape V00101 to be complete: Seconds passed = 599
Timed out after 600 seconds waiting for tape V00101 to be repacked
Result of Repack ls
c.time           repackTime c.user    vid    tapepool   providedFiles totalFiles totalBytes selectedFiles filesToRetrieve filesToArchive failed status
2024-12-03 00:53 10m21s     ctaadmin2 V00101 ctasystest 0             2143       32.9M      6429          0               3025           0      Running

DestinationVID NbFiles totalSize
V00102         2143    32915339
V00106         1261    19368960
```
This might (or might not) be related to the "message too long" error, which seems to be caused by a really long buffer. Below is an excerpt from a test job:
While it may or may not be related, this looks like a bug regardless. This is a huge log entry that should probably not be inserted as a single message. The `stillOpenFileForThread` entry is repeated many times (194 times in this single send).
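If the oversized entries really are just the same `stillOpenFileForThread` param repeated hundreds of times, one option would be to collapse such runs before the message body is built. The following is only a sketch of the idea, using a hypothetical param type and a hypothetical `collapseRepeatedParams` helper rather than the actual `cta::log` API:

```cpp
#include <iostream>
#include <list>
#include <string>
#include <utility>

// Hypothetical param type: name/value pairs as they would be attached to a log line.
using Param = std::pair<std::string, std::string>;

// Collapse runs of identically named params (e.g. 194 copies of
// "stillOpenFileForThread") into a few entries plus a suppression marker,
// so a single send cannot grow without bound.
std::list<Param> collapseRepeatedParams(const std::list<Param>& params,
                                        std::size_t maxPerName = 3) {
  std::list<Param> out;
  std::string lastName;
  std::size_t runLength = 0;
  for (const auto& p : params) {
    runLength = (p.first == lastName) ? runLength + 1 : 1;
    lastName = p.first;
    if (runLength <= maxPerName) {
      out.push_back(p);                                   // keep the first few
    } else if (runLength == maxPerName + 1) {
      out.emplace_back(p.first + "Suppressed", "true");   // mark that more were dropped
    }
  }
  return out;
}

int main() {
  std::list<Param> params;
  for (int i = 0; i < 194; ++i) {
    params.emplace_back("stillOpenFileForThread", "thread-" + std::to_string(i));
  }
  const auto collapsed = collapseRepeatedParams(params);
  std::cout << "kept " << collapsed.size() << " of " << params.size()
            << " params\n";  // prints: kept 4 of 194 params
}
```

Whether something like this belongs in the logger itself or in the caller that accumulates the params is an open question.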
```
(gdb) info args
oss = @0x7f8e4adfc770: <incomplete type>
fp = @0x7f8e4adfc700: { m_value = <error: Cannot access memory at address 0x7f8e5704224400>}
(gdb) frame
#0  0x00007f8e516dc215 in cta::log::operator<< (oss=..., fp=...) at /usr/src/debug/cta-5-dev.el9.x86_64/common/log/Logger.cpp:202
202     in /usr/src/debug/cta-5-dev.el9.x86_64/common/log/Logger.cpp
(gdb) up
#1  0x00007f8e516dca66 in cta::log::Logger::createMsgBody (this=0x1efe320, logLevel="INFO", msg="In DriveHandlerProxy::setRefreshLoggerHandler(): Waiting for refresh logger signal.", params=std::__cxx11::list = {...}, pid=1382) at /usr/src/debug/cta-5-dev.el9.x86_64/common/log/Logger.cpp:276
276     in /usr/src/debug/cta-5-dev.el9.x86_64/common/log/Logger.cpp
(gdb) info args
this = 0x1efe320
logLevel = "INFO"
msg = "In DriveHandlerProxy::setRefreshLoggerHandler(): Waiting for refresh logger signal."
params = std::__cxx11::list = {
  [0] = {
    m_name = "SubprocessName",
    m_value = std::optional<std::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, long, unsigned long, float, double, bool>> containing std::variant<std::string, long, unsigned long, float, double, bool> [index 0] = { "drive:VDSTK01" }
  }
}
pid = 1382
```
The params are fine in createMsgBody at the point where operator<< is called, but they are no longer valid inside operator<< itself. Most likely a lifetime issue somewhere. Inspecting it further...
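For context, the symptom above (a param that is perfectly readable in createMsgBody but garbage one frame down in operator<<) is what a dangling reference or pointer into a destroyed params container would produce. A minimal illustration of that class of bug, with simplified stand-in types rather than the actual CTA code:

```cpp
#include <iostream>
#include <list>
#include <sstream>
#include <string>

// Simplified stand-in for a log param: just a name and a value.
struct Param {
  std::string m_name;
  std::string m_value;
};

std::ostream& operator<<(std::ostream& os, const Param& p) {
  return os << p.m_name << "=\"" << p.m_value << "\"";
}

int main() {
  const Param* dangling = nullptr;
  {
    std::list<Param> params{{"SubprocessName", "drive:VDSTK01"}};
    dangling = &params.front();  // pointer into the container
  }                              // params destroyed here
  std::ostringstream oss;
  oss << *dangling;              // undefined behaviour: reads freed memory,
                                 // like frame #0 in the gdb session above
  std::cout << oss.str() << '\n';
}
```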
It seems that somewhere in the repack tests, we do the following:
```sh
kubectl -n ${NAMESPACE} exec ${EOS_MGM_POD} -c eos-mgm -- eos cp ${pathOfFilesToInject[$i]} $bufferDirectory/`printf "%9d\n" $fseqFile | tr ' ' 0`
```
This copy is performed on the EOS MGM pod itself. Since any eos command there is executed as root, the resulting files will be owned by root as well. This would explain some of the odd "Operation not permitted" log messages showing up. I'll try running this cp from the client pod instead and see if that helps.
```
#7  0x00000000004d181f in std::__future_base::_State_baseV2::_S_check<std::__future_base::_State_baseV2> (__p=<error reading variable: Cannot access memory at address 0x8>) at /usr/include/c++/11/future:562
#8  0x000000000051ce25 in std::promise<void>::_M_state (this=0x0) at /usr/include/c++/11/future:1373
#9  0x00007fbde5226ab4 in std::promise<void>::set_exception (this=0x0, __p=...) at /usr/include/c++/11/future:1356
#10 0x00007fbde4a4b84d in cta::objectstore::Sorter::executeArchiveAlgorithm<cta::objectstore::ArchiveQueueToTransferForRepack> (this=0x7ffe39560bd0, tapePool="systest3_repack", queueAddress="", jobs=std::__cxx11::list = {...}, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/objectstore/Sorter.cpp:61
#11 0x00007fbde4a43c62 in cta::objectstore::Sorter::dispatchArchiveAlgorithm (this=0x7ffe39560bd0, tapePool="systest3_repack", jobQueueType=@0x1e89340: cta::common::dataStructures::JobQueueType::JobsToTransferForRepack, queueAddress="", jobs=std::__cxx11::list = {...}, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/objectstore/Sorter.cpp:78
#12 0x00007fbde4a448b4 in cta::objectstore::Sorter::queueArchiveRequests (this=0x7ffe39560bd0, tapePool="systest3_repack", jobQueueType=@0x1e89340: cta::common::dataStructures::JobQueueType::JobsToTransferForRepack, archiveJobsInfos=std::__cxx11::list = {...}, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/objectstore/Sorter.cpp:154
#13 0x00007fbde4a4472e in cta::objectstore::Sorter::flushOneArchive (this=0x7ffe39560bd0, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/objectstore/Sorter.cpp:138
#14 0x00007fbde4a46bc1 in cta::objectstore::Sorter::flushAll (this=0x7ffe39560bd0, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/objectstore/Sorter.cpp:379
#15 0x00007fbde51f4ee9 in cta::OStoreDB::RepackRetrieveSuccessesReportBatch::report (this=0x229b670, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/scheduler/OStoreDB/OStoreDB.cpp:3021
#16 0x00007fbde513f460 in cta::Scheduler::RepackReportBatch::report (this=0x7ffe39560ea8, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/scheduler/Scheduler.cpp:893
#17 0x00007fbde512802c in cta::RepackReportThread::run (this=0x7ffe39561120) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/scheduler/RepackReportThread.cpp:33
#18 0x00007fbde512bc2b in cta::RepackRequestManager::runOnePass (this=0x7ffe39561488, lc=..., repackMaxRequestsToExpand=2) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/scheduler/RepackRequestManager.cpp:66
#19 0x00000000004e0279 in cta::tape::daemon::MaintenanceHandler::exceptionThrowingRunChild (this=0x1934580) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/tapeserver/daemon/MaintenanceHandler.cpp:349
```
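Frames #7-#9 show std::promise<void>::set_exception() being entered with this == nullptr, and _S_check() then faulting on address 0x8, i.e. a member read through a null pointer. A null promise pointer is enough to reproduce that shape of crash; the sketch below is an illustration only and not the actual Sorter code path:

```cpp
#include <future>
#include <stdexcept>

int main() {
  // A promise slot that was never set up -- stands in for whatever ends up
  // null inside Sorter::executeArchiveAlgorithm().
  std::promise<void>* promise = nullptr;
  try {
    throw std::runtime_error("queueing failed");
  } catch (...) {
    // Undefined behaviour: set_exception() calls _M_state(), which reads a
    // member at a small offset from 'this' -- hence the "Cannot access
    // memory at address 0x8" seen in frame #7.
    promise->set_exception(std::current_exception());
  }
  return 0;
}
```

So the interesting question is probably how the code ends up with a null promise in the first place, rather than what set_exception does with it.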
The crash is accompanied by the following warning messages:
{"epoch_time":1738601109.056713118,"local_time":"2025-02-03T17:45:09+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"WARN","pid":317,"tid":317,"message":"In ContainerAlgorithms::referenceAndSwitchOwnership(): Encountered problems while requeuing a batch of elements","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","reportingType":"RetrieveSuccesses","C":"ArchiveQueueToTransferForRepack","tapepool":"systest3_repack","containerAddress":"ArchiveQueueToTransferForRepack-systest3_repack-Maintenance-cta-tpsrv01-0-317-20250203-17:34:51-0-11","queueJobsBefore":0,"queueBytesBefore":0,"queueJobsAfter":1207,"queueBytesAfter":18538810,"queueLockFetchTime":0.023534,"queueProcessAndCommitTime":0.012548,"asyncUpdateLaunchTime":0.142798,"asyncUpdateCompletionTime":2.449932,"requestsUpdatingTime":0.0225,"queueRecommitTime":0.048129,"queueUnlockTime":0.001352,"errorCount":293,"failedElementsAddresses":"RepackSubRequest-Maintenance-cta-tpsrv02-0-307-20250203-17:34:50-0-7529 RepackSubRequest-Maintenance-cta-tpsrv02-0-307-20250203-17:34:50-0-7535 RepackSubRequest-Maintenance-cta-tpsrv02-0-307-20250203-17:34:50-0-7539 ........ "}
This is related to other messages we get: `In OStoreDB::RetrieveJob::~RetrieveJob(): will leave the job owned after destruction.` To my understanding, there are two issues here:
1. The job should not be owned when the maintenance process collects it (this probably shouldn't happen).
2. Because this goes wrong, we hit a race condition in the reporting. While this is a bug, fixing it most likely won't fix the underlying issue.
{"epoch_time":1738601071.519504996,"local_time":"2025-02-03T17:44:31+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"ERROR","pid":317,"tid":317,"message":"In OStoreDB::RepackArchiveReportBatch::report(): async file not deleted.","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","reportingType":"ArchiveSuccesses","fileId":4294967334,"subrequestAddress":"RepackSubRequest-Maintenance-cta-tpsrv02-0-307-20250203-17:34:50-0-7504","fileBufferURL":"root://ctaeos//eos/ctaeos/repack/V00101/000000007","exceptionMsg":"In XRootdDiskFileRemover::remove(), fail to remove file. [ERROR] Server responded with an error: [3010] Unable to remove file /eos/ctaeos/repack/V00101/000000007 by tident=cta.317:490@[::ffff:10.244.0.192]; Operation not permitted\n code:400 errNo:3010 status:1"}
This might be related to leftovers of a previous test that were created with incorrect ownership settings. I am running a test where these files do not clash.
So it seems that fixing one of the previous tests to prevent clashes resolves this particular error, but repack is still getting stuck at some point; see e.g. https://gitlab.cern.ch/cta/CTA/-/jobs/50528231
However, we need more information to get to the root of the problem. The above was quickly extracted from a running test. I'll increase the timeout on the test and run it again to allow a bit more time to debug the problem.
We do have two crashes (and resulting core dumps) of the maintenance process in this particular job.
{"epoch_time":1738836109.903070039,"local_time":"2025-02-06T11:01:49+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"CRIT","pid":313,"tid":313,"message":"In BackendPopulator::~BackendPopulator(): error deleting agent (cta::exception::Exception). Backtrace follows.","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv01-0-313-20250206-10:54:46-0) still owns objects. Here's the first few: RepackSubRequest-Maintenance-cta-tpsrv02-0-319-20250206-10:54:46-0-1616 RepackSubRequest-Maintenance-cta-tpsrv02-0-319-20250206-10:54:46-0-1617 RepackSubRequest-Maintenance-cta-tpsrv02-0-319-20250206-10:54:46-0-1618 RepackSubRequest-Maintenance-cta-tpsrv02-0-319-20250206-10:54:46-0-1619 [... trimmed at 3 of 1553]"}
Which in turn might be related to:
{"epoch_time":1738836082.814181864,"local_time":"2025-02-06T11:01:22+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"INFO","pid":25166,"tid":26392,"message":"In OStoreDB::RetrieveJob::~RetrieveJob(): will leave the job owned after destruction.","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","agentObject":"DriveProcess-VDSTK01-cta-tpsrv01-0-25166-20250206-11:01:14-0","jobObject":"RepackSubRequest-Maintenance-cta-tpsrv02-0-319-20250206-10:54:46-0-3144"}