canDefine speedup: use cached "canDefine" if true

Pieter David requested to merge piedavid/bamboo:speedupcandefine into master

Thanks @gsaha for the interesting case that highlighted this problem!

A bit of explanation: expressions keep a 'canDefine' flag that says whether they can be added as a column to the RDataFrame. The interesting cases are sub-expressions that cannot be defined while their parent expressions can. Think of the predicate lambda of a select (or of any other higher-order range function): it cannot be defined because it depends on the loop index variable, whereas the parent range operation can be defined because it supplies the values for this index variable.

These flags are set at construction of every TupleOp. Given the above, most of them are trivial, except for range operations, which need to check whether the predicate or transformation uses any loop index other than their own.

The performance problem was that this check looped over the whole subexpression, while it can stop descending as soon as it reaches a subexpression that can be defined, since such a subexpression does not depend on any outside loop index. This gives a nice speedup for deeply nested structures.
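To illustrate the idea, here is a minimal sketch in Python. It is not the actual bamboo code: the classes `Op`, `LoopIndex` and `RangeOp`, the helper `collect_free_indices`, and the `use_cache` and `counter` parameters are all hypothetical names made up for this example. It only shows how a cached flag can cut the traversal short.

```python
class Op:
    """Hypothetical expression node (a stand-in, not bamboo's TupleOp)."""
    def __init__(self, *deps):
        self.deps = deps
        # Trivial case: definable if every dependency is definable.
        self.canDefine = all(d.canDefine for d in deps)


class LoopIndex(Op):
    """Hypothetical loop index variable: never definable on its own."""
    def __init__(self):
        super().__init__()
        self.canDefine = False


def collect_free_indices(op, use_cache, counter):
    """Return the loop indices used in op that are not bound inside it."""
    counter[0] += 1  # count visited nodes, for illustration only
    if use_cache and op.canDefine:
        # A definable subexpression has no free loop index: stop here.
        return set()
    if isinstance(op, LoopIndex):
        return {op}
    found = set()
    for dep in op.deps:
        found |= collect_free_indices(dep, use_cache, counter)
    if isinstance(op, RangeOp):
        found.discard(op.index)  # a range operation binds its own index
    return found


class RangeOp(Op):
    """Hypothetical range operation (e.g. a select) with a predicate."""
    def __init__(self, index, predicate, use_cache=True, counter=None):
        self.index = index
        self.deps = (predicate,)
        counter = counter if counter is not None else [0]
        free = collect_free_indices(predicate, use_cache, counter)
        free.discard(index)
        # Definable only if no loop index other than its own is used.
        self.canDefine = not free


# A deeply nested, definable subexpression: 100 wrappers around a leaf.
deep = Op()
for _ in range(100):
    deep = Op(deep)

idx = LoopIndex()
pred = Op(idx, deep)  # depends on the loop index, so not definable

full, cached = [0], [0]
slow = RangeOp(idx, pred, use_cache=False, counter=full)
fast = RangeOp(idx, pred, use_cache=True, counter=cached)
print(slow.canDefine, fast.canDefine)  # True True
print(full[0], cached[0])              # 103 3
```

Both variants reach the same answer, but the cached one visits 3 nodes instead of 103 in this toy case, because it does not descend into `deep`.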

It doesn't break Gourab's plotter, and the explanation still makes sense to me after writing it down, but regressions are not excluded, so more testing is welcome.

Edited by Pieter David
