Algorithm clonability should be explicitly set by Alg author
The clonability of an Algorithm is best decided by the Algorithm author, and should not be a Property of the Algorithm that can be changed at run time.
This MR removes the `IsClonable` Property of Algorithms, and changes it into a protected bool that derived Algs can modify. We should consider whether the default should be `true` or `false`. It is currently set to `true` for the sake of convenience.
If an Alg is not clonable, then its cardinality is automatically set to 1 during sysInitialize.
Validation started with lhcb-gaudi-merge#115
I would propose to set this property to false by default.
C++ makes it trivial to write Algs for which cloning does not resolve thread-safety problems. Just use a non-const accessor to a shared resource anywhere. Since cloning is about allowing progressive migration of thread-unsafe code by avoiding a disruptive "everything must now be reentrant" revolution, I think disabling it before each individual Alg has been reviewed for thread-safety would be consistent with its goals.
Note that this has nothing to do with re-entrancy, but rather with creating multiple instances. In general, Algs don't need to be thread safe, and are safe to clone unless they are doing some really stupid things, e.g. with global statics. Or if they're writing stuff to disk. I think that most Algs are in fact clonable (especially if they have already made their public tools private), and it's the exception that is not clonable. The downside is that cloning an alg that shouldn't be cloned causes lots of painful problems....
I think we are mostly saying the same thing here. Cloning makes it easier to write thread-safe code, but there are still plenty of ways to shoot yourself in the foot: static members, disk and network IO, pointer-to-mutable (T*) and pointer-like objects (iterators, shared_ptr) aiming at shared mutable resources...
I also agree that many Algs should be fine, and it would be nice if we could somehow automatically detect the problematic edge cases so as to only spend valuable human review time on these ones. How much could Scott's tooling help here?
I think that the main reason to use a property was to be able to test whether cloning an algorithm works without having to recompile.
This is the workflow:
- an old algorithm is, by default, non-clonable because you may have states that must be shared between events
- the developer tries to fix clonability and test if it works by changing job options
- once the algorithm is fixed and validated, we can change the default value of the property (for the specific algorithm)
Even though I think it's better to rewrite an algorithm to make it reentrant rather than to make it clonable, I prefer the original approach.
OTOH, I think that the change in AlgResourcePool is good, and I would add a warning message if there is an inconsistency between cardinality and clonability.
@clemenci : the problem with that approach is that a user can detrimentally override what the Alg author has decided. Furthermore, there's a lot of overlap between the properties "Cardinality" and "IsClonable". This is why I think it's best to remove the property, but have it set "true". Then, in case an Alg is shown to be unsafe to clone, it needs to be re-compiled (and maybe fixed!).
Also, in general we'll probably (eventually) have "Cardinality" set to 1 for most Algorithms, no matter how many concurrent events we're running, as in general we don't need to clone too many Algs. I see "IsClonable" as something intrinsic to the Alg, which should not be modified by anyone other than the author, and needs to be localized with the Algorithm. "Cardinality" is something that gets tweaked in every job / jobOpt file.
I agree the developer should state if the algorithm is clonable, but the developer would still like to be able to override it via options for large scale tests, for example in AlgResourcePool.
At the same level, it's probably better to set cardinality via AlgResourcePool rather than as a property of the algorithm.
I'm not sure how you envision setting the cardinality via the AlgResourcePool. Do you mean something like:
```
AlgResourcePool.DefaultCardinality = numThreads
AlgResourcePool.Cardinality = [ ("AlgName_1", 3), ("AlgName_2", 2), ("AlgName_3", 7) ]
```
I see the developer workflow like:
- default `IsClonable` set true
- try running with `Cardinality` set high for all Algorithms (via jobOptions) - job crashes in MyAlg.
- re-run with `MyAlg.Cardinality=1` - job runs cleanly
- developer looks at MyAlg. Fixes it if possible.
- If not easy/possible to fix, explicitly set `MyAlg.m_isClonable = false;` in MyAlg source code, commit.
- this global change is picked up in next set of nightlies, and no one else needs to worry about it

Once `m_isClonable` has been explicitly set false in the Algorithm, the only person who's going to be changing it is the Alg developer, who will be re-compiling it anyway. If one is trying to find MT issues, and is trying to figure out whether an Alg should or should not be clonable, and there are several Algs that have this issue, I think it's a very bad idea to turn on cloning for ALL of them - you really want to test them one at a time, or your job gets way harder to sort out.
I'm thinking more about something like:
```
AlgResourcePool.Cardinality["AlgName_1"] = 3
AlgResourcePool.Cardinality["AlgName_2"] = 2
AlgResourcePool.Cardinality["AlgName_3"] = 7
```
I do not see much need for `DefaultCardinality`, since, as you said, it should always be 1.

My developer workflow is more like:
- default IsClonable == False (most existing algorithms are not clonable)
- try a fix to clonability
- run some test job with and without clones to see if the physics results are consistent
- get the code in a release (optional)
- submit some large scale test with and without clones
- if everything is fine, change the default for the next release
> If one is trying to find MT issues, and are trying to figure out if an Alg should or should not be clonable, and there are several Algs that have this issue, I think it's a very bad idea to turn on cloning for ALL of them - you really want to test them one at a time, or your job gets way harder to sort out.
That's why I believe `IsClonable` should be false by default.
The problem with this approach is that you then have to match the name of the Algorithm against a set of strings sent to the AlgResourcePool, which only gets checked pretty late in the game (when decodeTopAlg is called). And again, string comparison..... If "AlgName_2" is not known to the TopAlgs, is that an error? Or perhaps you've picked it up from some default global configuration, but that Alg isn't used in this job, due to a configuration flag.
You have to be careful here not to confuse behaviour of classes (clonability) and instances (number of clones). Thus clonability should actually not be a property, but something static to the class, `false` by default. For testing clonability w/o recompiling every single class before testing, I'd add a parameter to the AlgResourcePool to have a relaxed/debug check on cardinality vs. clonability, issuing only warnings in the relaxed setting. In the default production setting it should stop. Once tests show things are indeed clonable, just change the static member and off you go. Who sets the cardinality (AlgResourcePool vs. individual algorithm) I don't have a strong opinion on. I'd lean more towards Charles' approach.
I would also much rather we have a 'buy in' model for clonability - it should be `false` by default, and the developer has to make a specific, positive statement that they believe their algorithm is safe by setting it `true`. And it's for sure a static property as per @leggett's patch.

I think when one is addressing such issues, recompilation is not a very big hurdle to overcome. I worry that if we allow configuration to set dangerous stuff (overriding `m_isClonable = false`), it is inevitably going to leak into production. (ATLAS have a mode of running jobs with a flag `ignoreErrors=true` and it's used a shocking amount.) But I can live with the proposal to let it past with some big red flashing lights...

I prefer @leggett's suggestion for setting cardinality too, as it would make error handling rather more obvious.
added 1 commit
- b297364a - Change default clonability of Alg to "false"
```cpp
protected:
  /// Hook for derived classes to provide a custom visitor for data handles.
  std::unique_ptr<IDataHandleVisitor> m_updateDataHandles;

  bool m_isClonable {false};
```

The problem with declaring it a `static const` is that you can't override/overshadow it in the derived class, or the derived class has to implement the `isClonable()` method. I think it's a much simpler syntax if the author of the derived Alg simply has to do:

```cpp
MyAlg::MyAlg(...) {
  ...
  m_isClonable = true;
}
```
If it were up to me, I'd just add a virtual `isClonable` (i.e. not pure) with a default which just does

```cpp
virtual bool isClonable() const { return false; } // or whatever default is thought to be best
```

and then if the derived class wants something different, it is just a matter of doing:

```cpp
bool isClonable() const override { return true; }
```
i.e. the information should, as Benedict pointed out, be bound to the type, not to an instance.
Alternative solutions would be to e.g. use Traits, and specialize them for distinct algorithms, but in the context where the information is required, I suspect that the concrete type of the algorithm is unknown -- just the pointer-to-base -- which makes traits unsuitable. That points to using the vtbl, i.e. adding a virtual function as above...
- Resolved by Charles Leggett