Memory growth in TwinMux SWATCH cell
In July, the TwinMux experts noticed that the memory usage of their SWATCH cell was growing without any apparent upper limit; on one occasion that month, the cell used almost 100% of the machine's memory and was killed by the operating system's out-of-memory watchdog.
Since then, the twinmux cell has been restarted at least every couple of weeks to prevent it from crashing or being killed at an inconvenient time. I ran cron jobs on all subsystems' machines to monitor the memory usage, using the following command:
pidstat -l -r -C xdaq -p ALL
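For reference, the output of that command could be turned into per-process memory records along the following lines. This is only a minimal sketch, not the script that was actually used: the function name `parse_pidstat_memory`, the sample text, the host name and the command line in it are all made up for illustration, and the column layout is assumed to follow the usual `pidstat -r` header (PID, VSZ, RSS, %MEM, Command), which can differ between sysstat versions.

```python
from typing import Dict, List


def parse_pidstat_memory(text: str) -> List[Dict[str, object]]:
    """Extract PID, RSS (kB) and command line from `pidstat -r -l` output."""
    records: List[Dict[str, object]] = []
    columns = None  # token index of each column, taken from the latest header
    for line in text.splitlines():
        tokens = line.split()
        if "PID" in tokens and "RSS" in tokens and "Command" in tokens:
            # Header line. Data lines start with the same time tokens,
            # so header token indices also apply to the data lines.
            columns = {name: i for i, name in enumerate(tokens)}
            continue
        if columns is None or len(tokens) <= columns["Command"]:
            continue  # banner, blank line, or truncated row
        if tokens[0].startswith("Average"):
            continue  # summary rows have a different leading token layout
        pid_tok = tokens[columns["PID"]]
        rss_tok = tokens[columns["RSS"]]
        if not (pid_tok.isdigit() and rss_tok.isdigit()):
            continue
        records.append({
            "pid": int(pid_tok),
            "rss_kb": int(rss_tok),
            # With -l the command line contains spaces, so join the rest.
            "command": " ".join(tokens[columns["Command"]:]),
        })
    return records


# Illustrative input only (assumed format, invented values).
SAMPLE = """\
Linux 3.10.0 (example-host)  01/01/2000  _x86_64_  (8 CPU)

10:32:01      UID       PID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
10:32:01     1000      4242      1.50      0.00 2097152 524288   3.20  xdaq.exe -p 3333 -c example.xml
"""

records = parse_pidstat_memory(SAMPLE)
print(records)
```

Reading the column positions from the header rather than hard-coding them keeps the sketch tolerant of versions that omit the UID column or print an AM/PM token after the time.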
The attached plots show the memory usage of many subsystems vs. time over roughly one month, with the execution times of the engage, setup and configure transitions overlaid on the twinmux plot. The plots show that the memory growth:
- Only occurs in the twinmux cell
- Only occurs during an engage/setup/configure transition (the frequency of the monitoring is not high enough to separate the memory growth during each of those 3 transitions)
- Does not occur every time that the engage-setup-configure transition triplet is executed (only ~50% of the time)
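The kind of comparison behind these observations can be sketched as follows: for each engage-setup-configure window, take the last memory sample before the window and the first sample after it, and flag the window if the difference exceeds a threshold. This is a hedged illustration only; the function name `growth_per_window`, the threshold and all of the numbers are invented, and real samples would come from the monitoring logs.

```python
import bisect
from typing import List, Sequence, Tuple


def growth_per_window(
    samples: Sequence[Tuple[float, float]],
    windows: Sequence[Tuple[float, float]],
    threshold_kb: float,
) -> List[Tuple[float, float, bool]]:
    """Return (window start, RSS change in kB, grew?) for each window.

    `samples` are (time, rss_kb) pairs sorted by time; `windows` are
    (start, end) times of one transition triplet each.
    """
    times = [t for t, _ in samples]
    results = []
    for start, end in windows:
        i_before = bisect.bisect_left(times, start) - 1  # last sample before start
        i_after = bisect.bisect_right(times, end)        # first sample after end
        if i_before < 0 or i_after >= len(times):
            continue  # window is not bracketed by samples
        delta = samples[i_after][1] - samples[i_before][1]
        results.append((start, delta, delta > threshold_kb))
    return results


# Synthetic data (invented): one sample every 10 time units.
SAMPLES = [(0, 1000), (10, 1000), (20, 1500), (30, 1500), (40, 1500),
           (50, 1500), (60, 2000), (70, 2000), (80, 2000)]
WINDOWS = [(12, 18), (32, 38), (52, 58), (72, 78)]

results = growth_per_window(SAMPLES, WINDOWS, threshold_kb=100)
fraction = sum(1 for _, _, grew in results if grew) / len(results)
print(results, fraction)
```

Because the sampling interval is longer than an individual transition, this approach can only attribute growth to the whole window, not to a specific transition inside it.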
From 21st to 23rd of August, during LHC downtime, Alvaro and I ran several tests with the TwinMux cell at Point 5; full details are posted at https://its.cern.ch/jira/browse/CMSLITOPS-150 , but a brief summary of the conclusions is:
- The memory growth still occurs when running only the 'engage' and 'reset' transitions of the 'run control' TS operation
- Memory growth does not occur when running the cell under valgrind memcheck or massif; also, valgrind memcheck did not report any leaks that were consistent with the observed memory growth pattern and magnitude.
- The increase in memory usage during each engage transition is roughly equal to the amount of memory used by the SWATCH system and/or gatekeeper objects (including the size of the processor objects & all objects they instantiate)
  - This remains true as you decrease the size of the twinmux system; however, the probability of memory usage increasing on any given 'engage' transition is lower for systems containing fewer boards.