FPE handling in CA-based jobs

Merged Walter Lampl requested to merge wlampl/athena:CAFPECheck into master

With this MR, a new flag Exec.FPE controls how Floating Point Exceptions (FPEs) are handled in ComponentAccumulator-based jobs. A value of -1 is the equivalent of the old-style flag "rec.doFloatingPointException=True", i.e. the job will be aborted with a core dump on the first FPE. A value of 0 (default) will set up the FPEAuditor to print a one-line WARNING on each FPE, like the standard behavior with RecExCommon. A value greater than 0 will set the property FPEAuditor.NStacktracesOnFPE to this value, i.e. stack traces will be printed for that many FPEs.
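The three cases can be summarised in a small standalone sketch. It does not use the Athena API; the names `FPEPolicy` and `fpe_policy` are hypothetical, and rejecting values below -1 is an assumption of the sketch, since the description does not say how such values are treated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FPEPolicy:
    """Hypothetical summary of what a given Exec.FPE value requests."""
    abort_on_first_fpe: bool        # core dump on the first FPE
    n_stacktraces: Optional[int]    # value for FPEAuditor.NStacktracesOnFPE,
                                    # or None if the property is left untouched


def fpe_policy(exec_fpe: int) -> FPEPolicy:
    """Map an Exec.FPE value to the behaviour described in the MR text."""
    if exec_fpe == -1:
        # Equivalent of the old rec.doFloatingPointException=True
        return FPEPolicy(abort_on_first_fpe=True, n_stacktraces=None)
    if exec_fpe == 0:
        # Default: FPEAuditor prints a one-line WARNING per FPE
        return FPEPolicy(abort_on_first_fpe=False, n_stacktraces=None)
    if exec_fpe > 0:
        # Print stack traces for the first exec_fpe FPEs
        return FPEPolicy(abort_on_first_fpe=False, n_stacktraces=exec_fpe)
    # Assumption of this sketch: other negative values are not meaningful
    raise ValueError(f"Unsupported Exec.FPE value: {exec_fpe}")


if __name__ == "__main__":
    for value in (-1, 0, 10):
        print(value, fpe_policy(value))
```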

While working on this, I realized that Auditors were not properly handled by the ComponentAccumulator. The first commit fixes this issue.
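As a rough illustration of what merging auditors "like any other component" could mean, here is a toy model that de-duplicates by name. It is not the ComponentAccumulator implementation; the function `merge_auditors`, the (name, properties) representation, and the choice to raise on conflicting properties are all assumptions made for the sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a configured auditor: (name, properties)
Auditor = Tuple[str, Dict[str, object]]


def merge_auditors(existing: List[Auditor], incoming: List[Auditor]) -> List[Auditor]:
    """Merge two auditor lists, keeping one entry per name.

    Identical duplicates are dropped; same-named auditors with different
    properties are treated as a configuration conflict (an assumption).
    """
    merged: Dict[str, Dict[str, object]] = {
        name: dict(props) for name, props in existing
    }
    for name, props in incoming:
        if name not in merged:
            merged[name] = dict(props)
        elif merged[name] != props:
            raise ValueError(f"Conflicting settings for auditor {name!r}")
    # dicts preserve insertion order, so the first occurrence wins the position
    return list(merged.items())


if __name__ == "__main__":
    a = [("FPEAuditor", {"NStacktracesOnFPE": 10})]
    b = [("FPEAuditor", {"NStacktracesOnFPE": 10}), ("NameAuditor", {})]
    print(merge_auditors(a, b))
```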

Edited by Walter Lampl

Activity

  • Tadej Novak
  • In the meantime I found another problem that I don't understand yet: when running one of the failing ACTS tests with the option Exec.FPE=10, I expect to see stack traces from the FPEAuditor, but the TDAQ ErrorHandler gets invoked instead. For the simpler CaloRecoConfig.py, everything works as expected.

    In the log of the ACTS example I also noticed that the FPEAuditor installs its signal handler twice. I don't understand why. I verified that the ComponentAccumulator has only one instance of FPEAuditor in AuditorSvc.Auditors and that self._auditors has no duplicates either.

    I'll mark this as a draft for now.

  • Walter Lampl marked this merge request as draft

  • Walter Lampl added 236 commits

    • cae6c032...5417bfc4 - 227 commits from branch atlas:master
    • 403b0dea - ComponentAccumulator.py: Merge auditors like any other component
    • 15a6e119 - introduce FPE handling in CA-based jobs
    • a550e658 - introduce unit-test for FPE handling
    • f24a055d - remove manual FPE-Auditor cfg, now done by MainServicesCfg
    • a934def2 - re-add CompFactory to Run3DQTestingDriver.py
    • 408569f9 - FPEAndCoreDumpConfig: Protect against the absence of AthenaAuditors in some projects
    • d0876a3c - some cleanup following MR review
    • f37f49ad - remove stray comma
    • 92219529 - Set env var TDAQ_ERS_NO_SIGNAL_HANDLERS in FPEAuditor::initalize() to avoid...

  • Hi @pagessin,

    can you estimate how long it will take to fix the FPEs I described in ATLASRECTS-7341? I wonder whether we should wait for the fix or ignore the FPEs in the failing unit tests.

    • Walter
  • I'm currently trying to reproduce them in ACTS standalone (because that's more robust). I'll try reproducing the FPEs in situ next week; then I can hopefully also fix them.

    Whether you want to wait until that's done is up to you, I guess.
