cta-taped/maintenance coredumping in production

Summary

While checking whether the maintenance rc1 had been deployed on tpsrv436, I noticed some coredumps. Running gdb on the dumps showed that the crash comes from the objectstore:

[root@tpsrv436 ~]# coredumpctl info
           PID: 3672897 (cta-maintenance)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Fri 2025-10-17 09:34:43 CEST (3h 39min ago)
  Command Line: /usr/bin/cta-maintenance --log-format=json --log-to-file=/var/log/cta/cta-maintenance-4.log --config=/etc/cta/cta-maintenance-4.conf
    Executable: /usr/bin/cta-maintenance
 Control Group: /system.slice/system-cta\x2dmaintenance.slice/cta-maintenance@4.service
          Unit: cta-maintenance@4.service
         Slice: system-cta\x2dmaintenance.slice
       Boot ID: 41b729d2ee4b4c5f9e9581eb443d721a
    Machine ID: 45f02d9ee92a44558a1923c0b960ceeb
      Hostname: tpsrv436.cern.ch
       Storage: /var/lib/systemd/coredump/core.cta-maintenance.0.41b729d2ee4b4c5f9e9581eb443d721a.3672897.1760686483000000.zst (present)
  Size on Disk: 6.0M
       Message: Process 3672897 (cta-maintenance) of user 0 dumped core.
                
                Stack trace of thread 3672897:
                #0  0x00007feeea28bedc __pthread_kill_implementation (libc.so.6 + 0x8bedc)
                #1  0x00007feeea23eb46 raise (libc.so.6 + 0x3eb46)
                #2  0x00007feef156ff94 skgesigOSCrash (libclntsh.so.23.1 + 0x376ff94)
                #3  0x00007feef1d898c9 kpeDbgSignalHandler (libclntsh.so.23.1 + 0x3f898c9)
                #4  0x00007feef1570327 skgesig_sigactionHandler (libclntsh.so.23.1 + 0x3770327)
                #5  0x00007feeea23ebf0 __restore_rt (libc.so.6 + 0x3ebf0)
                #6  0x00007feeed337298 _ZN3cta5utils8segfaultEv (libctacommon.so.0 + 0x337298)
                #7  0x00007feeeb9cb3aa _ZN3cta11objectstore16BackendPopulatorD1Ev (libctaobjectstore.so.0 + 0x5cb3aa)
                #8  0x00007feeeb9cb531 _ZN3cta11objectstore16BackendPopulatorD0Ev (libctaobjectstore.so.0 + 0x5cb531)
                #9  0x00000000004e4af3 _ZNKSt14default_deleteIN3cta11objectstore16BackendPopulatorEEclEPS2_ (cta-maintenance + 0xe4af3)
                #10 0x00000000004e20cb _ZNSt10unique_ptrIN3cta11objectstore16BackendPopulatorESt14default_deleteIS2_EED2Ev (cta-maintenance + 0xe20cb)
                #11 0x00000000004e53a9 _ZN3cta12OStoreDBInitD1Ev (cta-maintenance + 0xe53a9)
                #12 0x00000000004e53fd _ZNKSt14default_deleteIN3cta12OStoreDBInitEEclEPS1_ (cta-maintenance + 0xe53fd)
                #13 0x00000000004e25e9 _ZNSt10unique_ptrIN3cta12OStoreDBInitESt14default_deleteIS1_EED1Ev (cta-maintenance + 0xe25e9)
                #14 0x00000000004e08d1 _ZN3cta11maintenance11MaintenanceD1Ev (cta-maintenance + 0xe08d1)
                #15 0x00000000004dd840 _ZN3cta11maintenanceL21exceptionThrowingMainENS_6common6ConfigERNS_3log6LoggerE (cta-maintenance + 0xdd840)
                #16 0x00000000004ddf33 main (cta-maintenance + 0xddf33)
                #17 0x00007feeea2295d0 __libc_start_call_main (libc.so.6 + 0x295d0)
                #18 0x00007feeea229680 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x29680)
                #19 0x00000000004dcff5 _start (cta-maintenance + 0xdcff5)
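
For reference, the same backtrace can be inspected by loading the stored dump into gdb through coredumpctl (a sketch, not the exact session; 3672897 is the PID from the info output above, and frame resolution depends on which debuginfo packages are installed):

[root@tpsrv436 ~]# coredumpctl gdb 3672897
(gdb) bt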

Every process on this server crashed once, probably because they all picked up the same problematic object from the objectstore. As the stack trace points to a problem in libctacommon, I checked whether any other servers had these crashes. I found that we have coredumps across many servers and that the crashes started before the rc1 made it into the production servers. Root cause to be determined.
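
A minimal sketch of the kind of cross-server check used (the attached list below was apparently produced with wassh, per the attachment name; this plain-ssh loop and the hostnames are placeholders, not the exact command):

# Hypothetical illustration: list cta-maintenance coredumps on a set of tape servers
for host in tpsrv436 tpsrv437; do
    echo "== ${host} =="
    ssh "${host}" 'coredumpctl list cta-maintenance --no-pager 2>/dev/null'
done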

Relevant logs and/or screenshots

wassh_coredump_list.txt
