fit toys not taking 'categories' properly into account
Toy fits do not seem to apply a cut on 'category' when the data is loaded (or only later). This should, I think, occur before the sampling.
Or are we missing something very obvious? Unfortunately, due to another bug, it was not possible to test the behavior completely. But from static code inspection, it looks like there is something missing.
mentioned in issue #39 (closed)
- Resolved by Jonas Eschle
Mmm, without looking at the code, I would say the behavior is as expected. The idea is the following (if the code doesn't do that, then it needs fixing):

- One dataset corresponds to one component in the fit.
- You load as many datasets as there are components in the fit. Several PDFs fitted simultaneously are simply more components. This mimics what Nature does, which is to generate a given decay/phenomenon individually, not as part of a complex "model".
- Each dataset, which is assigned to the proper category through the `category` key, is sampled individually, and then all these data are merged into a single dataset that can be fitted.
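A minimal pandas sketch of this intended per-component workflow (the component names, shapes, and yields here are purely illustrative, not the framework's API):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One independently generated toy dataset per fit component (hypothetical shapes).
components = {
    'a': pd.DataFrame({'mass': rng.normal(5280.0, 20.0, size=1000)}),
    'b': pd.DataFrame({'mass': rng.exponential(100.0, size=1000) + 5100.0}),
}

# Sample each component separately, tag it with its own category,
# then merge everything into the single dataset that gets fitted.
samples = []
for category, frame in components.items():
    samples.append(frame.sample(n=100, random_state=42).assign(category=category))

toy_data = pd.concat(samples, ignore_index=True)
```

Each category label is attached to data that was genuinely generated for that component, so the merged dataset never mixes events across categories.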
Is this more or less clear? Maybe @matzeni is trying to just generate the full model directly, but that would be extremely inefficient, or maybe there's a bug in here... I would not be surprised.
Then let's check:

- Yes.
- Agree. Most of all with the 'individually': we do not sample from a 'mixed' sample but from the given decay, right?
- The code seems to disagree. This is what I expect as well, but the code does different things: it first samples from all categories, then reassigns (yes...) the wanted category to the data. Then, all datasets are merged.
Schematic of what happens in `get_datasets` (inside the for loop over the different data/categories):

```
dataset(3 labels (a, b, c), 100 entries), requested category = a
  sample 10 entries      -> dataset(3 labels (a, b, c), 10 entries)
  overwrite (!) category -> dataset(1 label (a), 10 entries)
```
Is this really what we want? This way we end up with `n_categories` copies of essentially the same dataset (up to statistical fluctuations from the sampling). Basically, in this case, categories are not needed at all.
I would have expected something like:

```
dataset(3 labels (a, b, c), 100 entries), requested category = a
  select category   -> dataset(1 label (a), 42 entries)  # 42 out of 100 are of cat. a
  sample 10 entries -> dataset(1 label (a), 10 entries)
```
Do I misinterpret the code somehow?
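The two flows can be contrasted in a small pandas sketch (the column and variable names here are illustrative, not the framework's):

```python
import pandas as pd

# A toy dataset containing three categories, as in the schematic above
# (33 entries each, category 'a' on every third row).
dataset = pd.DataFrame({
    'value': range(99),
    'category': ['a', 'b', 'c'] * 33,
})

# What the code currently does: sample across *all* categories,
# then overwrite the category column with the requested label.
buggy = dataset.sample(n=10, random_state=0).assign(category='a')

# What is expected instead: select the requested category first,
# then sample within that subset only.
selected = dataset[dataset['category'] == 'a']
expected = selected.sample(n=10, random_state=0)
```

Both results carry the label `a`, but only `expected` actually contains events drawn from category `a`; `buggy` contains a relabeled mixture of all three.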
Well, yes and no.
To avoid confusion, a preassumption: the (previously) generated toy dataframe that is loaded here contains around 20 categories. The toy fitting should also work if we want, for example, to use only 10 out of the 20 categories and not use the remaining ones in a given script. Is that correct? Can we do that? Or do we need a single toy hdf file for each category?
To the yes and no. The `data_frames` (below just `data`) is generated like this:

```python
.data[data_id] = (get_data({'source': source_toy,  # no category argument, the full toy frame is loaded
                            'source-type': 'toy',
                            'tree': 'data',
                            'output-format': 'pandas',
                            'selection': data_source.get('selection')}),
                  data_source['nevents'],
                  data_source.get('poisson'),
                  data_source.get('category'))  # category only enters here, into a tuple
```
So in

```python
for data_name, (data, n_events, do_poisson, category) in data_frames.items():
```

`data` refers to the full `source_toy`, currently containing all categories. On the other hand, `category` (which is only a `str` argument) refers to the category we want (want as in "want to be selected", not "want it to be tagged as"). So yes, it loops over all tuples `(data, n_events, do_poisson, category)`; and no, `data` is always the same data file (assuming all samples we want here have been generated and saved to the same hdf file before).
I don't think that `data` at this stage should contain all categories; it should either already be selected, containing only the requested (in this case by `category`) data, or get selected right inside the loop, after the `for` line. Does it make sense?
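A sketch of that fix, cutting on the requested category inside the loop before sampling (the function name is made up and the poisson handling is omitted; the tuple layout mirrors the snippet above, not the framework's actual code):

```python
import pandas as pd

def sample_datasets(data_frames):
    """Select each requested category *before* sampling, then merge.

    ``data_frames`` maps a name to ``(data, n_events, do_poisson, category)``,
    as in the tuples built above (hypothetical simplification).
    """
    samples = []
    for data_name, (data, n_events, do_poisson, category) in data_frames.items():
        if category is not None:
            # The proposed fix: cut on the requested category first...
            data = data[data['category'] == category]
        # ...and only then sample the requested number of events.
        samples.append(data.sample(n=n_events, random_state=0))
    return pd.concat(samples, ignore_index=True)
```

With this ordering, the `category` column never needs to be overwritten, so the silent relabeling disappears.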
I understand now! As I said in a previous comment, this is not how the dataset should be generated. You generate a single large dataset for each component, so you can scale your toys in a much more flexible way. Then yes, the rest doesn't work.
It's just not meant to work this way, but I guess it should be fixed anyway...
Hi Albert,
What I did was to generate all the components individually and save them in a single `hdf` file. To my understanding this can be done very easily in your framework, with the joint advantage of having only one configuration file for the full generation. I also think this should be as flexible as having 20 different datasets, provided we can sample within the categories.
However, if this was not implemented in order to avoid the problems that arise when handling big files (e.g. ones that do not fit in memory), I will gladly split my configuration file.
@matzeni, if they are saved in the same `hdf` file, there's no way to sample from it. The idea is that if everything is separated, you can sample from those files (even with different yields), while if everything is merged there's no way to do that (unless you add some code).

Ah yes, makes sense! Thanks a lot for the help!
Yes, that strategy sounds reasonable.
Just to make sure this won't fail (overwrite) silently in the future, I've added a catch. Not a golden solution, but surely something that may prevent errors. What do you think, @apuignav?
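A sketch of what such a guard could look like (hypothetical illustration, not the code from the actual commit):

```python
import pandas as pd

def tag_category(data, category):
    """Attach the requested category, refusing to silently overwrite another one."""
    if 'category' in data.columns and not data['category'].eq(category).all():
        raise ValueError(
            "dataset already carries a different category; "
            "refusing to overwrite it silently")
    return data.assign(category=category)
```

The idea is simply to turn the former silent relabeling into a loud error whenever the loaded data already carries conflicting category labels.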
Ready to merge from my side (or to be disregarded and closed).
added 1 commit
- 20293636 - changed readme, remove category col if not specified