Thanks to !2006 (merged), the event size measurement for each stream should now be accurate. However, this fix has a huge impact on the time cost of the rate & size tests.
The Sprucing test previously took 1 hour to finish, but now takes 5–6 hours.
The Hlt2 test ran for over 10 hours on the test machine and was killed by Jenkins; see an example here.
We plan to disable the Hlt2 test for now (lhcb-core/LHCbNightlyConf!960 (merged)), and we should discuss here how to deal with it.
I see two ways: either a dedicated monitoring algorithm which writes out the event size per line, or the event size is calculated on the output as done in @enoomen's studies. @enoomen and I can have a look at how to integrate her scripts into the LHCbPR tests.
Event size based on the output would include compression, so this may be more valuable.
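For the second option, a minimal sketch of the aggregation step, assuming a hypothetical per-event table (`line_sizes.csv` with columns `event`, `line`, `bytes`) standing in for whatever @enoomen's scripts actually produce:

```python
import csv
from collections import defaultdict

# Aggregate per-line event sizes measured on the output file. The input
# table ("line_sizes.csv" with columns event, line, bytes) is hypothetical,
# a stand-in for the real script's output format.
totals = defaultdict(int)   # summed bytes per line
counts = defaultdict(int)   # number of events that fired each line

with open("line_sizes.csv") as f:
    for row in csv.DictReader(f):
        totals[row["line"]] += int(row["bytes"])
        counts[row["line"]] += 1

n_events = 100_000  # total processed events, for the rate denominator
for line in sorted(totals):
    print(f"{line}: {totals[line] / counts[line]:.1f} B/event, "
          f"rate {counts[line] / n_events:.4%}")
```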
We can leave the throughput tests as they are, since they are unaffected. @sstahl and @enoomen, if you are able to provide a new script to extract the event size per line, @shunan can you make sure we keep the nice webpages for both Spruce and HLT2 using this new input?
No problem, and the second option sounds more doable to me. I'm not familiar with Gaudi algorithms, so I'm not sure whether the new algorithm (for the first solution) can be time efficient or not.
Measuring as much as we can on the output sounds attractive to me. That way we factorise the algorithms + configuration from the performance measurements as much as possible.
> Event size based on the output would include compression, so this may be more valuable.
@nskidmor, this seems like something useful for studies/extrapolations of requirements as far as storage is concerned (BTW, relevant for stream optimisation studies to be done at some point in the future), but the uncompressed numbers are very valuable as they allow for a direct comparison among lines and also different compression algorithms (including compression levels). I don't think it is a good idea at all to trash what we have so far; unless I'm misunderstanding your comment.
> The Sprucing test previously took 1 hour to finish, but now takes 5–6 hours.
> The Hlt2 test ran for over 10 hours on the test machine and was killed by Jenkins; see an example here.
This seems gigantic to me. Do you know what is swallowing time like hell? I understand that the jobs run over a lot of events and do much, but can the split of the packing explain the 5x blow-up in time? Just for my understanding.
I think that's the only explanation, because I didn't see any other change that could affect the test. Maybe @sesen can explain this.
From what I can tell in !2006 (merged), previously we used one bank writer for all streams, and now we use n writers for n streams, so that each stream can save different objects.
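If that reading is right, a made-up toy (plain Python, not the actual Moore code) shows why packing once per stream multiplies the cost of the writing step:

```python
import pickle
import time

# Toy only: contrast packing the event once for a single bank writer
# vs. once per stream for n per-stream writers. Event content is fake.
event = {"tracks": list(range(10_000)), "calo": list(range(5_000))}
n_streams = 20  # e.g. a handful of fired lines, each with its own stream

t0 = time.perf_counter()
pickle.dumps(event)                  # "one bank writer for all streams"
once = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(n_streams):           # "n writers for n streams"
    pickle.dumps(event)              # each writer packs its own copy
per_stream = time.perf_counter() - t0

print(f"pack once: {once * 1e3:.2f} ms, pack per stream: "
      f"{per_stream * 1e3:.2f} ms (~{per_stream / once:.0f}x)")
```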
These tests were always using a clever hack: give each line its own stream and use the raw bank combiner to access the bank sizes. I'm sure this can be done in a cleverer way, but I don't have time to look into this atm.
@sstahl @enoomen to clarify, do you envisage a method to report the event size per line? I.e. a drop-in replacement for what we have now (except this time using the compressed data size)?
Just one clarification: as long as we are using MDF as output, the data won't be compressed. For the Sprucing tests one could switch to DST output, but then one cannot run multi-threaded at the moment, as DST writing is not thread-safe.
This is interesting. Ultimately DST is what we are interested in for the Sprucing, but if we can only run MDF due to having to run multi-threaded, this is still equivalent to what we have now and so very useful. We can gauge the compression a different way. It might require an MR to let the Sprucing write MDFs though.
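One way to gauge it offline: apply a standard compressor to the uncompressed payload sizes after the fact. Everything in this sketch is a stand-in; zlib is used only as a proxy for whatever compression the DST writing would actually apply:

```python
import zlib

# Toy only: the payload is a fake uncompressed MDF-style event record,
# and zlib is a proxy for the real output compression settings.
payload = bytes(range(256)) * 1000
for level in (1, 6, 9):
    ratio = len(zlib.compress(payload, level)) / len(payload)
    print(f"zlib level {level}: compressed/uncompressed = {ratio:.3f}")
```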
We can leave the Sprucing test as it is for the moment, until this is sorted out.
I've managed to run a full Hlt2 rate & size test on our university's local farm, and it took ~21 days to finish (from Jan. 26 to Feb. 16).
All input files were downloaded to the local farm, so there should be no communication between the farm and lxplus. The test was run on 4 Intel(R) Xeon(R) Gold 5218 CPUs with 128 threads in total, and it processed 100k events. As a reference, the LHCbPR test machine has 1 Intel(R) Xeon(R) Silver 4216 CPU with 16 threads.
To my understanding, if an event fires N lines, it will be re-evaluated for every line, so the time cost is N times larger. As we now have more than 1500 Hlt2 lines, perhaps it's still reasonable that the time cost is several hundred times larger...
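A back-of-the-envelope version of that argument (only the ~1500-line count is from this thread; the sample multiplicities are guesses):

```python
# If an event firing N lines is re-evaluated once per fired line, the
# per-event cost scales with the mean fired-line multiplicity. The
# multiplicities below are assumed, not measured.
n_lines = 1500  # an event cannot fire more lines than exist

for mean_fired in (10, 100, 500):
    assert mean_fired <= n_lines
    print(f"mean fired lines ~{mean_fired:4d} -> ~{mean_fired}x higher time cost")
```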
Hi @shunan, thanks for doing this. Can you provide some more details about which algorithms are being added, and where, when running this rate test, compared to just running Moore in a production-like mode? Maybe the rate monitoring algorithms are doing something silly, but if this is really irreducible then surely we need to do the rate evaluation on the output of Moore (run in a production-like mode).
@mvesteri The main difference between the test and production settings is that we enable this analytic option, which is not needed in production. Although we didn't write output in this test (we do in production), I think it's completely reducible, as we will have far fewer streams.
@shunan it would also be nice if you could test what is added in lhcb-datapkg/PRConfig!279 (merged). In principle this should provide the same information but not suffer from this problem.
@sstahl I managed to run a test on that and the results look promising. I will try to launch another one with the same sample size as the rate test (100k) and see the outcome.
@sesen I've tried to launch the test with !2046 (merged). It started at 18:47 on Feb. 21 and ended at 00:46 on Feb. 22, so it takes 6 hrs now, which is significantly reduced compared to the previous 21 days. However, my test was launched with 128 threads, and I'm afraid it will still exceed the 10 hr limit on the 16-core test machine and be killed.
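For the record, the naive extrapolation behind that worry, assuming perfect linear scaling with thread count (optimistic, and ignoring the different CPU generations):

```python
# Extrapolate the 6 h / 128-thread run to the 16-thread LHCbPR machine,
# assuming wall time scales inversely with the number of threads.
measured_hours = 6
farm_threads, pr_threads = 128, 16
limit_hours = 10

estimate = measured_hours * farm_threads / pr_threads
print(f"estimated wall time on the test machine: ~{estimate:.0f} h "
      f"(Jenkins limit: {limit_hours} h)")  # ~48 h, well over the limit
```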
This is quite an old issue with old timings. There's a newer issue (#638 (closed)), so I suggest we move the discussion there. Closing; anyone here should feel free to re-open if you think I'm mistaken.