scheduler lock by grpc-frontend never released

Summary

At DESY we run CTA 5.8.7 with the scheduler store on NetApp. From time to time we observe that cta-taped processes on other hosts get stuck:

# cta-admin dr ls
library         drive     host    ....                     activity          age reason
ctaltolib-rz1-1 rz1-2,3,1 tpm102  ....                                     - 3761 [STALE] -

Inspecting the cta-taped process shows that it is waiting for a lock:

[root@tpm102 ~]# cat /proc/40890/stack 
[<ffffffffc106873a>] nfs4_retry_setlk+0x26a/0x2c0 [nfsv4]
[<ffffffffc106b986>] nfs4_proc_lock+0x1f6/0x340 [nfsv4]
[<ffffffffc10205ef>] do_setlk+0xcf/0x120 [nfs]
[<ffffffffc1020871>] nfs_flock+0x81/0xe0 [nfs]
[<ffffffff974be4d9>] SyS_flock+0x139/0x1d0
[<ffffffff979c539a>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
[root@tpm102 ~]# lsof -p 40890 | grep lock
cta-tpd-r 40890  cta    8r      REG               0,42        0    70267 /cta-cache/ctaObjectStore/.root.lock (nfs-vol:/cta_nfs_queue)

The file is locked by the cta-frontend-grpc process (the W in the lsof FD column indicates a write lock on the entire file):

# lsof -p 24291 | grep  .root.lock
cta-front 24291  cta   39uW     REG               0,42        0    70267 /cta-cache/ctaObjectStore/.root.lock (nfs-vol:/cta_nfs_queue)
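
To check from any host whether the lock is still held, without getting stuck like cta-taped does, a non-blocking probe can be used. The following is only a minimal Python sketch, not CTA code: the helper name lock_is_free is hypothetical, and it assumes the waiter requests a shared lock, which is a guess based on the read-only descriptor (8r) in the lsof output above.

```python
import fcntl
import os


def lock_is_free(path: str) -> bool:
    """Return True if a shared flock on `path` can be taken right now.

    Hypothetical helper for diagnosis only; it is not part of CTA.
    """
    # Open read-only, mirroring the 8r descriptor seen for cta-taped.
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            # LOCK_NB makes flock() fail immediately instead of blocking.
            fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except BlockingIOError:
            # A conflicting (exclusive) lock is currently held by someone else.
            return False
        # The probe succeeded, so give the lock back straight away.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
```

Calling lock_is_free("/cta-cache/ctaObjectStore/.root.lock") on an affected host would be expected to return False for as long as the frontend keeps its lock.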

Possible causes

Our current assumption is that some code path forgets to unlock, or that the thread dies before the lock is released.
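
Both variants fit what we see. A flock lock belongs to the open file description, so if a thread exits while the process keeps the descriptor open, the kernel does not drop the lock. The usual defence is to tie the release to scope exit, so that it runs on every code path, including exceptions. The following is a minimal sketch of that pattern in Python, for illustration only. CTA itself is C++, where the equivalent would be an RAII guard. The name exclusive_lock is hypothetical and this is not the actual CTA locking code.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path: str):
    """Hold an exclusive flock on `path` for the duration of the with-block.

    Hypothetical illustration of scope-bound locking; not CTA code.
    """
    # Open read-write: over NFS an exclusive flock needs a writable descriptor.
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield fd
    finally:
        # Runs on normal exit and on exceptions alike. Closing the only
        # descriptor of this open file description releases the lock.
        os.close(fd)
```

With this shape, an early return or an exception inside the with-block cannot leave the lock behind. A thread that is killed or blocks forever inside the block would still keep it, so that case needs to be checked separately.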