Crashes / memory issues of scanConsole
I thought it would be a good idea to open an issue about this, since several groups are now reporting similar problems (tagging @gstark and @simobius; feel free to add your own experiences here).
Example
- A 45 h synFE source scan of #Paris10 was started on Friday; this module returns many readout errors when reading the synFE (the frequency varies a lot, but typically several per 5 min, with occasional bursts).
- The DAQ machine became extremely slow and unresponsive at 22h36 on Saturday (after ~30 h of scanning). The FEs were power-cycled and the HV was set to 0 V by the interlock, but I could not SSH into the machine, so I could not stop the scan. However, monitoring came back (DAQ and DCS run on the same machine), so I guess this reduced the load on the machine?
- Monitoring was lost again at 1h14 on Sunday (due to the unresponsiveness of the machine, I guess?); this time the interlock cut the LV/HV power. I still could not SSH into the machine, so things were left as they were (noise scan still running).
- Hardware reboot of the machine on Sunday at 20h11, apparently due to scanConsole. Below is the backtrace of scanConsole, extracted by Saclay IT with the crash utility (a sketch of how to open the vmcore is given after the backtrace). I am not attaching the entire /var/crash directory here because it is 2 GB, but I can share it via CERNbox if useful.
crash> bt
PID: 5019 TASK: ffff8ee78b2aa100 CPU: 2 COMMAND: "scanConsole"
#0 [ffff8eedbc483b40] machine_kexec at ffffffff9c2662c4
#1 [ffff8eedbc483ba0] __crash_kexec at ffffffff9c322a32
#2 [ffff8eedbc483c70] crash_kexec at ffffffff9c322b20
#3 [ffff8eedbc483c88] oops_end at ffffffff9c98d798
#4 [ffff8eedbc483cb0] no_context at ffffffff9c275d14
#5 [ffff8eedbc483d00] __bad_area_nosemaphore at ffffffff9c275fe2
#6 [ffff8eedbc483d50] bad_area_nosemaphore at ffffffff9c276104
#7 [ffff8eedbc483d60] __do_page_fault at ffffffff9c990750
#8 [ffff8eedbc483dd0] do_page_fault at ffffffff9c990975
#9 [ffff8eedbc483e00] page_fault at ffffffff9c98c778
[exception RIP: __list_add+15]
RIP: ffffffff9c5a668f RSP: ffff8eedbc483eb0 RFLAGS: 00010087
RAX: ffff8eed2ba21cd8 RBX: 0000000000000001 RCX: 0000000000000002
RDX: 0000000000000000 RSI: ffff8eed2ba21ce8 RDI: ffffdcff8112a3a0
RBP: ffff8eedbc483ec8 R8: 0000000000000001 R9: 000000000000028e
R10: ffff8eedbc7d9868 R11: 0000000000000001 R12: 0000000000000000
R13: ffff8eed2ba21ce8 R14: ffff8eedbc7d9800 R15: ffffdcff8112a380
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#10 [ffff8eedbc483ed0] free_pcppages_bulk at ffffffff9c3c7571
#11 [ffff8eedbc483f68] drain_pages at ffffffff9c3c78cb
#12 [ffff8eedbc483f98] drain_local_pages at ffffffff9c3c78f5
#13 [ffff8eedbc483fa8] flush_smp_call_function_queue at ffffffff9c316d13
#14 [ffff8eedbc483fd0] generic_smp_call_function_single_interrupt at ffffffff9c317413
#15 [ffff8eedbc483fe0] smp_call_function_interrupt at ffffffff9c2598bd
#16 [ffff8eedbc483ff0] call_function_interrupt at ffffffff9c99854a
--- ---
#17 [ffff8ee71d543a98] call_function_interrupt at ffffffff9c99854a
[exception RIP: get_page_from_freelist+679]
RIP: ffffffff9c3c8567 RSP: ffff8ee71d543b40 RFLAGS: 00000246
RAX: 0000000000000011 RBX: ffff8ee71d543b18 RCX: 000000000001fc25
RDX: 000000000001fcff RSI: 000000000000001f RDI: 0000000000000246
RBP: ffff8ee71d543c48 R8: fffffffffffffff2 R9: ffffffff9d240ae8
R10: 00000000000c2dc8 R11: 0000000000100000 R12: ffffffff9ccb2140
R13: ffffffff9c29b2ff R14: ffff8ee71d543b08 R15: ffff8eedbc7d9800
ORIG_RAX: ffffffffffffff03 CS: 0010 SS: 0000
#18 [ffff8ee71d543c50] __alloc_pages_nodemask at ffffffff9c3c8f04
#19 [ffff8ee71d543d80] alloc_pages_vma at ffffffff9c41cc49
#20 [ffff8ee71d543de8] handle_mm_fault at ffffffff9c3f6837
#21 [ffff8ee71d543eb0] __do_page_fault at ffffffff9c990653
#22 [ffff8ee71d543f20] do_page_fault at ffffffff9c990975
#23 [ffff8ee71d543f50] page_fault at ffffffff9c98c778
RIP: 00007f6f53677e20 RSP: 00007f6f508eac60 RFLAGS: 00010216
RAX: 00007f6f3055b440 RBX: 00007f6f3055b2d0 RCX: 00007f6f3055b440
RDX: 0000000000011978 RSI: 00007f6f305f1440 RDI: 0000000000000000
RBP: 0000000000012c00 R8: 00000000005f2000 R9: 0000000000096000
R10: 000000000000007e R11: 0000000000001000 R12: 00007f6f3055b3a0
R13: 0000000000012c00 R14: 00007f6f305f1440 R15: 00007f6f441fd260
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
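For reference, in case others want to inspect similar crashes: a minimal sketch of how such a backtrace can be extracted from the kdump vmcore with the crash utility. The kernel version, crash directory and debuginfo path below are examples only and have to be adapted to the actual machine.

# open the vmcore with the matching debug vmlinux
# (paths are illustrative; use the kernel version that produced the dump)
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
      /var/crash/127.0.0.1-2022-01-24-16:52:43/vmcore

# at the crash prompt:
crash> bt                      # backtrace of the panicking task (scanConsole here)
crash> ps | grep scanConsole   # locate the scanConsole task
crash> kmem -i                 # overall memory usage at the time of the crash
crash> log                     # kernel ring buffer (OOM / page-fault messages)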
General
These YARR crashes are now becoming one of the bottlenecks for the QC of RD53A modules at Saclay. The full list of crashes we have had (from /var/crash) since we started QC:
127.0.0.1-2021-09-01-11:01:12
127.0.0.1-2021-09-03-17:56:00
127.0.0.1-2021-09-21-18:26:47
127.0.0.1-2021-10-25-18:20:00
127.0.0.1-2021-10-30-15:50:16
127.0.0.1-2021-11-02-14:47:47
127.0.0.1-2021-11-08-09:39:41
127.0.0.1-2021-11-09-22:17:49
127.0.0.1-2021-12-01-12:22:15
127.0.0.1-2021-12-12-23:23:15
127.0.0.1-2021-12-21-06:53:00
127.0.0.1-2021-12-22-12:48:55
127.0.0.1-2022-01-13-03:58:35
127.0.0.1-2022-01-23-20:11:23
127.0.0.1-2022-01-24-16:21:46
127.0.0.1-2022-01-24-16:52:43
The last two crashes (2022-01-24) were due to using the -l option when running scanConsole, see #125 (comment 5180707), but the ones before happened when running scanConsole without the -l option.
The vast majority (if not all) of these crashes happened either when running a new scan after a previous scan had ended in a scanConsole segfault (no useful message to report here, and no particular defect of the module being tested), or during a noise scan with readout errors like this:
[ error ][Rd53aDataProcessor]: [1] Received data not valid: 0xc4e0
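Since the machine becomes slow and unresponsive long before it actually reboots, one thing that might help narrow this down is logging scanConsole's memory footprint during a long scan and checking whether it grows steadily. A minimal sketch (the one-minute interval and the log path are arbitrary choices, not part of any existing setup):

# log RSS/VSZ of scanConsole once per minute while it is running
while pgrep -x scanConsole > /dev/null; do
    date +'%F %T' >> /tmp/scanConsole_mem.log
    ps -C scanConsole -o pid,rss,vsz,%mem,etime >> /tmp/scanConsole_mem.log
    sleep 60
done

If the resident size keeps growing over the course of the scan, that would point towards a memory leak on the scanConsole side rather than a kernel problem.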
I can post more backtraces, crash directories, etc. if useful.
Thank you for your help with this issue!