# Information about handling excessive memory usage
I'm opening this issue as a reference to point people to. I intend to edit it as the situation changes so that it stays up to date.
## Why is there a peak memory limit?
A 5GB memory peak limit has been imposed on Analysis Productions.
- Jobs run on shared infrastructure at grid sites, and LHCb can be sanctioned for repeatedly overusing resources.
- We do not use the grid only for APs but also for Sprucing, MC productions, etc., so this could affect all operations.
- Ideally jobs stay below 3-4GB, so 5GB is generous.
- The same constraints apply to all grid activity. Users whose jobs exceed the limits may be sanctioned.
## How can I get my jobs to adhere to the limits?
Both Moore and DaVinci stages can have high memory peaks, for different reasons. By looking at the plot of memory usage over time you may be able to determine which step is responsible.

If you are only running on data you will have a single DaVinci stage, so the plot will reflect only a DaVinci job. If you are running your own MC production you will have two Moore stages and a DaVinci stage, so the plot will show three distinct sections with the memory approaching zero at the start of each (note that MC productions will soon be run centrally, meaning you will no longer have to run the HLT steps yourself). All jobs finish with a "merging" step whose memory usage is typically negligible.
Some examples to illustrate how to interpret the memory usage plots:
### The functor problem
At the beginning of the application the "functor cache" must be compiled. This results in a spike of memory usage early in the job, which is often harmless.
There is a known problem with the functor cache: it does not scale well. This is being actively worked on; however, it will take a few months to fix. In extreme cases this spike alone can push the job over the memory limit.
Note that functors are used by both Moore and DaVinci; however, in most cases the Moore release should already contain most of the functors you need, so the problem tends to be worse for DaVinci.
If you're affected by this, there are the following known workarounds:

- Using clang platforms. Here is an example for data and here for MC.
- If this does not help you can set the variable `THOR_JIT_N_SPLITS`.
  - This variable splits the functor cache into smaller parts. For example, if you set `THOR_JIT_N_SPLITS=2` the cache will be compiled in two halves. Similarly, 4 and 8 can be used in extreme cases. Do not blindly set this variable for all jobs, as higher values waste a lot of CPU!
  - This can be set with `os.environ["THOR_JIT_N_SPLITS"] = "N"` (with `import os` at the top of your file). You should slowly increase `N` until the memory usage is within the acceptable limit.
  - This variable can be set anywhere in your options, so you can make it conditional if needed in your main function; see the sketch below.
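As a concrete illustration, here is a minimal sketch of the top of an options file setting this variable. Only the `THOR_JIT_N_SPLITS` environment variable itself comes from the workaround above; the `N_SPLITS` constant is an illustrative convention, not part of any official API.

```python
import os

# Number of parts to split the functor cache compilation into.
# Start at 2 and only increase (4, 8, ...) if the job still exceeds
# the memory limit: higher values waste a lot of CPU.
N_SPLITS = 2

# The variable must be in place before the functor cache is compiled,
# so setting it at the top of the options file (right after the
# imports) is the safest choice. It can also be set conditionally
# inside your main function if only some of your jobs are affected.
os.environ["THOR_JIT_N_SPLITS"] = str(N_SPLITS)
```

Remember to remove the override (or lower `N`) once the functor cache scaling problem is fixed, to avoid wasting CPU.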
### Slowly growing memory usage
If the memory usage grows slowly over time, this indicates a memory leak. For example, one job's memory usage plot showed a leak in both the HLT1 and HLT2 stages.
There are two known cases of this:
- Some versions of Moore contain a memory leak which has since been fixed.
- Some versions of DaVinci leak memory when DecayTreeFitter fails to fit. This is being investigated but the cause is not yet understood. If you're affected, please contact DPA-WP3 in case your example helps to identify the underlying cause.
It is also likely that other cases will be found. In all cases you should work with the appropriate support channels to figure out the underlying cause. For example:
- Moore: read Moore#813 and get support in ~upgrade-hlt2
- DaVinci: ask questions in ~dpa-wp3-offline-analysis-tools