scheduler lock by grpc-frontend never released
Summary
At DESY we run CTA 5.8.7 with the scheduler store on NetApp. From time to time we observe that cta-taped processes on other hosts get stuck:
# cta-admin dr ls
library drive host .... activity age reason
ctaltolib-rz1-1 rz1-2,3,1 tpm102 .... - 3761 [STALE] -
Inspecting the cta-taped process, we found that it is waiting for a lock:
[root@tpm102 ~]# cat /proc/40890/stack
[<ffffffffc106873a>] nfs4_retry_setlk+0x26a/0x2c0 [nfsv4]
[<ffffffffc106b986>] nfs4_proc_lock+0x1f6/0x340 [nfsv4]
[<ffffffffc10205ef>] do_setlk+0xcf/0x120 [nfs]
[<ffffffffc1020871>] nfs_flock+0x81/0xe0 [nfs]
[<ffffffff974be4d9>] SyS_flock+0x139/0x1d0
[<ffffffff979c539a>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
[root@tpm102 ~]# lsof -p 40890 | grep lock
cta-tpd-r 40890 cta 8r REG 0,42 0 70267 /cta-cache/ctaObjectStore/.root.lock (nfs-vol:/cta_nfs_queue)
The file is locked by the cta-frontend-grpc process:
# lsof -p 24291 | grep .root.lock
cta-front 24291 cta 39uW REG 0,42 0 70267 /cta-cache/ctaObjectStore/.root.lock (nfs-vol:/cta_nfs_queue)
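For readers less familiar with lsof, the difference between the two outputs is in the FD column: `8r` is descriptor 8 opened read-only with no lock, while `39uW` is descriptor 39 opened read/write with a write lock on the entire file (lock character `W` per the lsof man page). The following is a minimal Python sketch of how that field can be decoded. The helper name `parse_fd_field` is hypothetical and the regular expression only covers numeric descriptors, not entries such as `cwd` or `txt`.

```python
import re

# Numeric descriptor, optional access mode (r, w, u, or -),
# optional lock character as listed in the lsof man page.
_FD_RE = re.compile(r"(\d+)([rwu-]?)([NrRwWuUxX]?)")


def parse_fd_field(line):
    """Return (fd, mode, lock) from one lsof output line, or None.

    Assumes the default lsof column order: COMMAND PID USER FD ...
    An empty lock string means lsof reported no lock on that descriptor.
    """
    columns = line.split()
    if len(columns) < 4:
        return None
    match = _FD_RE.fullmatch(columns[3])
    if match is None:
        return None  # non-numeric entries such as cwd, txt, mem
    return int(match.group(1)), match.group(2), match.group(3)


if __name__ == "__main__":
    waiter = "cta-tpd-r 40890 cta 8r REG 0,42 0 70267 /cta-cache/ctaObjectStore/.root.lock"
    holder = "cta-front 24291 cta 39uW REG 0,42 0 70267 /cta-cache/ctaObjectStore/.root.lock"
    print(parse_fd_field(waiter))
    print(parse_fd_field(holder))
```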
Possible causes
Our current assumption is that some code path forgets to unlock, or that the thread dies before the lock is released.
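To illustrate the suspected failure mode, here is a minimal Python sketch. It is not CTA code, and all function names in it are hypothetical. A `flock` lock stays attached to its open file descriptor, so if an error path skips the unlock and the descriptor is never closed, the lock remains held for as long as the process lives, even if the thread that took it has gone away. The sketch runs against a local temporary file; on NFS the Linux client emulates `flock` with byte-range locks, so details may differ from what is shown here.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


def try_lock(path):
    """Try to take an exclusive flock without blocking; return the fd or None."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def leaky_update(path):
    """Hypothetical buggy path: the exception skips unlock and leaks the fd."""
    fd = os.open(path, os.O_RDWR)
    fcntl.flock(fd, fcntl.LOCK_EX)
    raise RuntimeError("simulated failure before unlock")


@contextmanager
def scoped_lock(path):
    """Hold an exclusive flock only for the duration of the with-block."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield fd
    finally:
        # Runs on normal exit and on exceptions alike.
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def demo():
    """Return (stuck_after_leak, free_after_scoped)."""
    with tempfile.NamedTemporaryFile() as leaked, tempfile.NamedTemporaryFile() as scoped:
        try:
            leaky_update(leaked.name)
        except RuntimeError:
            pass
        # The leaked descriptor still holds the lock, so this attempt fails.
        stuck_after_leak = try_lock(leaked.name) is None

        try:
            with scoped_lock(scoped.name):
                raise RuntimeError("simulated failure inside the guarded block")
        except RuntimeError:
            pass
        # The guard released the lock despite the exception.
        fd = try_lock(scoped.name)
        free_after_scoped = fd is not None
        if fd is not None:
            os.close(fd)
        return stuck_after_leak, free_after_scoped


if __name__ == "__main__":
    print(demo())
```

In C++ the equivalent of `scoped_lock` would be an RAII guard whose destructor unlocks and closes the descriptor. Whether the frontend's locking code already follows that pattern on every path is exactly what would need to be checked.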