Skip to content
Snippets Groups Projects

fit toys not taking 'categories' properly into account

Merged Jonas Eschle requested to merge hotfix/toy_categories into master

Toy fits do not seem to apply a cut on 'category' when the data is loaded (or only later). This should, I think, occur before the sampling.

Or are we missing something very obvious? Unfortunately, due to another bug, it was not possible to test the behavior completely. But from static code inspection, it looks like there is something missing.

Edited by Jonas Eschle

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Jonas Eschle
  • Mmm, without looking at the code, I would say the behavior is as expected. The idea is the following (if the code doesn't do that, then needs fixing :laughing:):

    1. One dataset corresponds to one component in the fit.
    2. You load as many datasets as components are in the fit. Several PDFs fitted simultaneously are simply more components. This mimics what Nature does, which is generate a given decay/phenomenon individually, not as part of a complex "model".
    3. Each dataset, which is assigned to the proper category through the category key, is sampled correctly and then all these data are merged in a single dataset that can be fit.

    Is is more or less clear? Maybe @matzeni is trying to just generate the full model directly, but that would be extremely inefficient, or maybe there's a bug in here... I would not be surprised :wink:

  • Author Maintainer

    :laughing: then let's check:

    1. yes
    2. Agree. Most of all with the 'individually', we do not sample therefore from a 'mixed' sample but from the given decay, right?
    3. The code seems to disagree. This is what I expect as well but the code seems to do different things: it first samples from all categories, then reassigns (yes...) the wanted category to the data. Then, all datasets are merged.

    Schematic of what happens in get_datasets (inside for loop looping through different data/categories)

    dataset(3 different labels(a, b, c), 100 entries), requested category=a
    sample 10 entries -> dataset(3 different labels(a, c, b), 10 entries)
    overwrite (!) category -> dataset(1 label(a), 10 entries)

    Is this really what we want? This way we end up with n_categories times identical datasets (except from statistic fluctuation through sampling). Basically, in this case, categories are not needed at all.

    I would have expected something like:

    dataset(3 different labels(a, b, c), 100 entries), requested category=a  
    select category -> dataset(1 label(a), 42 entries)  # 42 out of 100 are of cat. a
    sample 10 entries -> dataset(1 label(a), 10 entries)

    Do I misinterpret the code somehow?

    Edited by Jonas Eschle
  • Mmm, I'm not sure I follow. Isn't

    for data_name, (data, n_events, do_poisson, category) in data_frames.items():

    actually loop over the different samples?

  • Author Maintainer

    Well, yes and no.

    To avoid confusion, preassumption: the (previously) generated toy dataframe that is loaded here contains around 20 categories. The toy fitting should also work if we want, for example, use only 10 out of the 20 categories and not use the remaining categories for a script. Is that correct? Can we do that? Or do we need a single toy hdf file for each category?

    To the yes and no. The data_frames (below just data) is generated like this .

    data[data_id] = (get_data({'source': source_toy,  # no category argument, full toy frame is loaded
                               'source-type': 'toy',
                               'tree': 'data',
                               'output-format': 'pandas',
                               'selection': data_source.get('selection')}),
                     data_source['nevents'],
                     data_source.get('poisson'),
                     data_source.get('category'))  # category only enters here into a tuple

    so in

    for data_name, (data, n_events, do_poisson, category) in data_frames.items():

    data refers to the full source_toy containing currently all categories. On the other hand, the category (which is only a str argument) refers to the category we want (want->"want to be select" not "want it to be tagged as").

    So yes, as it loops over all tuples(data, nevents, do_poisson, category) and no, as data is always the same datafile (assuming all samples we want here have been generated and saved to the same hdf file before).

    I don't think that data at this stage should contain all categories but either be already selected containing only the requested (in this case by category) data or get selected right inside the loop after the for-line.

    Does it make sense?

    Edited by Jonas Eschle
  • I understand now! As I said in a previous comment, this is not how the dataset should be generated. You generate a single large dataset for each component, so you can scale your toys in a much more flexible way. Then yes, the rest doesn't work.

    It's just not meant to work this way, but I guess it should be fixed anyway...

  • Hi Albert,

    What I did was to generate all the components individually and saving them in a single hdf file. To my understanding this can be done very easily in your framework with the joint advantage of having only one configuration file for the full generation.

    I also think this should be as flexible as having 20 different datasets if we can sample within the categories.

    However if this was not implemented to avoid the problems that arise when handling big files (e.g do not fit in memory), I will gladly split my configuration file.

  • @matzeni, if they are saved in the same hdf file, there's no way to sample from it. The idea is that if everything is separated, you can sample from those files (even different yields), while if everything is merged there's no way to do that (unless you add some code).

  • Jonas Eschle added 1 commit

    added 1 commit

    • 5308261f - catch wrong usage of fit_toys

    Compare with previous version

  • Ok, perfect :) I will change my code accordingly!

  • Jonas Eschle added 1 commit

    added 1 commit

    • 7cebdfc9 - changed syntax to valid and added TODO

    Compare with previous version

  • Author Maintainer

    Ah yes, makes sense! thanks a lot for the help!

    Yes, that strategy sounds reasonable.

    Just to make sure this won't fail (overwrite) silently in the future, I've added a catch. Not a golden solution, but surely something that may prevents errors. What do you think, @apuignav?

    Ready to merge from my side (or to be disregarded and closed).

  • Before merging, what do you say about adding a couple of sentences in the README too?

  • Jonas Eschle added 1 commit

    added 1 commit

    • 20293636 - changed readme, remove category col if not specified

    Compare with previous version

  • removed bug critical labels

  • Jonas Eschle resolved all discussions

    resolved all discussions

  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Please register or sign in to reply
    Loading