cta-taped/maintenance coredumping in production
Summary
While checking on the maintenance rc1 deployment on tpsrv436, I noticed several coredumps. Running gdb showed that the crash originates in the objectstore:
[root@tpsrv436 ~]# coredumpctl info
PID: 3672897 (cta-maintenance)
UID: 0 (root)
GID: 0 (root)
Signal: 11 (SEGV)
Timestamp: Fri 2025-10-17 09:34:43 CEST (3h 39min ago)
Command Line: /usr/bin/cta-maintenance --log-format=json --log-to-file=/var/log/cta/cta-maintenance-4.log --config=/etc/cta/cta-maintenance-4.conf
Executable: /usr/bin/cta-maintenance
Control Group: /system.slice/system-cta\x2dmaintenance.slice/cta-maintenance@4.service
Unit: cta-maintenance@4.service
Slice: system-cta\x2dmaintenance.slice
Boot ID: 41b729d2ee4b4c5f9e9581eb443d721a
Machine ID: 45f02d9ee92a44558a1923c0b960ceeb
Hostname: tpsrv436.cern.ch
Storage: /var/lib/systemd/coredump/core.cta-maintenance.0.41b729d2ee4b4c5f9e9581eb443d721a.3672897.1760686483000000.zst (present)
Size on Disk: 6.0M
Message: Process 3672897 (cta-maintenance) of user 0 dumped core.
Stack trace of thread 3672897:
#0 0x00007feeea28bedc __pthread_kill_implementation (libc.so.6 + 0x8bedc)
#1 0x00007feeea23eb46 raise (libc.so.6 + 0x3eb46)
#2 0x00007feef156ff94 skgesigOSCrash (libclntsh.so.23.1 + 0x376ff94)
#3 0x00007feef1d898c9 kpeDbgSignalHandler (libclntsh.so.23.1 + 0x3f898c9)
#4 0x00007feef1570327 skgesig_sigactionHandler (libclntsh.so.23.1 + 0x3770327)
#5 0x00007feeea23ebf0 __restore_rt (libc.so.6 + 0x3ebf0)
#6 0x00007feeed337298 _ZN3cta5utils8segfaultEv (libctacommon.so.0 + 0x337298)
#7 0x00007feeeb9cb3aa _ZN3cta11objectstore16BackendPopulatorD1Ev (libctaobjectstore.so.0 + 0x5cb3aa)
#8 0x00007feeeb9cb531 _ZN3cta11objectstore16BackendPopulatorD0Ev (libctaobjectstore.so.0 + 0x5cb531)
#9 0x00000000004e4af3 _ZNKSt14default_deleteIN3cta11objectstore16BackendPopulatorEEclEPS2_ (cta-maintenance + 0xe4af3)
#10 0x00000000004e20cb _ZNSt10unique_ptrIN3cta11objectstore16BackendPopulatorESt14default_deleteIS2_EED2Ev (cta-maintenance + 0xe20cb)
#11 0x00000000004e53a9 _ZN3cta12OStoreDBInitD1Ev (cta-maintenance + 0xe53a9)
#12 0x00000000004e53fd _ZNKSt14default_deleteIN3cta12OStoreDBInitEEclEPS1_ (cta-maintenance + 0xe53fd)
#13 0x00000000004e25e9 _ZNSt10unique_ptrIN3cta12OStoreDBInitESt14default_deleteIS1_EED1Ev (cta-maintenance + 0xe25e9)
#14 0x00000000004e08d1 _ZN3cta11maintenance11MaintenanceD1Ev (cta-maintenance + 0xe08d1)
#15 0x00000000004dd840 _ZN3cta11maintenanceL21exceptionThrowingMainENS_6common6ConfigERNS_3log6LoggerE (cta-maintenance + 0xdd840)
#16 0x00000000004ddf33 main (cta-maintenance + 0xddf33)
#17 0x00007feeea2295d0 __libc_start_call_main (libc.so.6 + 0x295d0)
#18 0x00007feeea229680 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x29680)
#19 0x00000000004dcff5 _start (cta-maintenance + 0xdcff5)
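The C++ frames in the trace above are mangled, which makes the call chain harder to read. The following is a minimal Python sketch, not part of CTA, that parses frame lines in the format shown above and passes the symbols through `c++filt` when that tool is installed. The names `parse_frame`, `demangle` and `SAMPLE_FRAMES` are hypothetical helpers introduced here for illustration, and the regular expression assumes the exact `coredumpctl info` frame layout seen in this report.

```python
import re
import shutil
import subprocess

# Matches frame lines as printed by `coredumpctl info`, for example:
#   #6 0x00007feeed337298 _ZN3cta5utils8segfaultEv (libctacommon.so.0 + 0x337298)
FRAME_RE = re.compile(
    r"^#(?P<index>\d+)\s+(?P<address>0x[0-9a-fA-F]+)\s+(?P<symbol>\S+)"
    r"\s+\((?P<module>\S+) \+ (?P<offset>0x[0-9a-fA-F]+)\)$"
)

# Three frames copied verbatim from the trace in this issue.
SAMPLE_FRAMES = [
    "#6 0x00007feeed337298 _ZN3cta5utils8segfaultEv (libctacommon.so.0 + 0x337298)",
    "#7 0x00007feeeb9cb3aa _ZN3cta11objectstore16BackendPopulatorD1Ev (libctaobjectstore.so.0 + 0x5cb3aa)",
    "#11 0x00000000004e53a9 _ZN3cta12OStoreDBInitD1Ev (cta-maintenance + 0xe53a9)",
]


def parse_frame(line):
    """Return a dict describing one stack frame, or None for non-frame lines."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    frame = match.groupdict()
    frame["index"] = int(frame["index"])
    return frame


def demangle(symbols):
    """Demangle symbols with c++filt if available; otherwise return them unchanged."""
    symbols = list(symbols)
    cxxfilt = shutil.which("c++filt")
    if cxxfilt is None or not symbols:
        return symbols
    # c++filt reads one name per line from stdin and echoes the demangled form.
    result = subprocess.run(
        [cxxfilt],
        input="\n".join(symbols) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    frames = [f for f in map(parse_frame, SAMPLE_FRAMES) if f is not None]
    names = demangle(f["symbol"] for f in frames)
    for frame, name in zip(frames, names):
        print(f"#{frame['index']:<3} {frame['module']:<24} {name}")
```

Under the Itanium C++ ABI, frame #6 demangles to `cta::utils::segfault()` and frame #7 to `cta::objectstore::BackendPopulator::~BackendPopulator()`, so the signal is raised from a call made inside the `BackendPopulator` destructor.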
Every process on this server crashed once, probably because each one picked up the same problematic object from the objectstore. Since the stack trace points to a problem in libctacommon, I checked whether other servers had the same crashes. Coredumps are present across many servers, and the crashes started before rc1 reached the production server. Root cause to be determined.
Edited by Pablo Oliver Cortes