fit toys not taking 'categories' properly into account
Toy fits do not seem to apply a cut on 'category' when the data is loaded (or only later). This should, I think, occur before the sampling.
Or are we missing something very obvious? Unfortunately, due to another bug, it was not possible to test the behavior completely. But from static code inspection, it looks like there is something missing.
mentioned in issue #39 (closed)
- Resolved by Jonas Eschle
Mmm, without looking at the code, I would say the behavior is as expected. The idea is the following (if the code doesn't do that, then it needs fixing):

- One dataset corresponds to one component in the fit.
- You load as many datasets as there are components in the fit. Several PDFs fitted simultaneously are simply more components. This mimics what Nature does, which is to generate a given decay/phenomenon individually, not as part of a complex "model".
- Each dataset, which is assigned to the proper category through the `category` key, is sampled individually, and then all these data are merged into a single dataset that can be fitted.
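A minimal pandas sketch of this intended per-component workflow (the component names, shapes, and yields here are purely illustrative, not the framework's API):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One independently generated toy dataset per fit component (hypothetical shapes).
components = {
    'a': pd.DataFrame({'mass': rng.normal(5280.0, 20.0, size=1000)}),
    'b': pd.DataFrame({'mass': rng.exponential(100.0, size=1000) + 5100.0}),
}

# Sample each component separately, tag it with its own category,
# then merge everything into the single dataset that gets fitted.
samples = []
for category, frame in components.items():
    samples.append(frame.sample(n=100, random_state=42).assign(category=category))

toy_data = pd.concat(samples, ignore_index=True)
```

Each category label is attached to data that was genuinely generated for that component, so the merged dataset never mixes events across categories.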
Is this more or less clear? Maybe @matzeni is trying to just generate the full model directly, but that would be extremely inefficient, or maybe there's a bug in here... I would not be surprised.
Then let's check:

- Yes.
- Agree. Most of all with the 'individually': we do not sample from a 'mixed' sample but from the given decay, right?
- The code seems to disagree. This is what I expect as well, but the code does different things: it first samples from all categories, then reassigns (yes...) the wanted category to the data. Then, all datasets are merged.
Schematic of what happens in `get_datasets` (inside the for loop over the different data/categories):

```
dataset(3 labels (a, b, c), 100 entries), requested category = a
  sample 10 entries      -> dataset(3 labels (a, b, c), 10 entries)
  overwrite (!) category -> dataset(1 label (a), 10 entries)
```
Is this really what we want? This way we end up with `n_categories` copies of essentially the same dataset (up to statistical fluctuations from the sampling). Basically, in this case, categories are not needed at all.
I would have expected something like:

```
dataset(3 labels (a, b, c), 100 entries), requested category = a
  select category   -> dataset(1 label (a), 42 entries)  # 42 out of 100 are of cat. a
  sample 10 entries -> dataset(1 label (a), 10 entries)
```
Do I misinterpret the code somehow?
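The two flows can be contrasted in a small pandas sketch (the column and variable names here are illustrative, not the framework's):

```python
import pandas as pd

# A toy dataset containing three categories, as in the schematic above
# (33 entries each, category 'a' on every third row).
dataset = pd.DataFrame({
    'value': range(99),
    'category': ['a', 'b', 'c'] * 33,
})

# What the code currently does: sample across *all* categories,
# then overwrite the category column with the requested label.
buggy = dataset.sample(n=10, random_state=0).assign(category='a')

# What is expected instead: select the requested category first,
# then sample within that subset only.
selected = dataset[dataset['category'] == 'a']
expected = selected.sample(n=10, random_state=0)
```

Both results carry the label `a`, but only `expected` actually contains events drawn from category `a`; `buggy` contains a relabeled mixture of all three.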
Well, yes and no.
To avoid confusion, a preassumption: the (previously) generated toy dataframe that is loaded here contains around 20 categories. The toy fitting should also work if we want, for example, to use only 10 out of the 20 categories and not use the remaining ones in a given script. Is that correct? Can we do that? Or do we need a single toy hdf file for each category?
To the yes and no. The `data_frames` (below just `data`) is generated like this:

```python
.data[data_id] = (get_data({'source': source_toy,  # no category argument, the full toy frame is loaded
                            'source-type': 'toy',
                            'tree': 'data',
                            'output-format': 'pandas',
                            'selection': data_source.get('selection')}),
                  data_source['nevents'],
                  data_source.get('poisson'),
                  data_source.get('category'))  # category only enters here, into a tuple
```
So in

```python
for data_name, (data, n_events, do_poisson, category) in data_frames.items():
```

`data` refers to the full `source_toy`, currently containing all categories. On the other hand, `category` (which is only a `str` argument) refers to the category we want (want as in "want to be selected", not "want it to be tagged as"). So yes, it loops over all tuples `(data, n_events, do_poisson, category)`; and no, `data` is always the same data file (assuming all samples we want here have been generated and saved to the same hdf file before).
I don't think that `data` at this stage should contain all categories; it should either already be selected, containing only the requested (in this case by `category`) data, or get selected right inside the loop, after the `for` line. Does it make sense?
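A sketch of that fix, cutting on the requested category inside the loop before sampling (the function name is made up and the poisson handling is omitted; the tuple layout mirrors the snippet above, not the framework's actual code):

```python
import pandas as pd

def sample_datasets(data_frames):
    """Select each requested category *before* sampling, then merge.

    ``data_frames`` maps a name to ``(data, n_events, do_poisson, category)``,
    as in the tuples built above (hypothetical simplification).
    """
    samples = []
    for data_name, (data, n_events, do_poisson, category) in data_frames.items():
        if category is not None:
            # The proposed fix: cut on the requested category first...
            data = data[data['category'] == category]
        # ...and only then sample the requested number of events.
        samples.append(data.sample(n=n_events, random_state=0))
    return pd.concat(samples, ignore_index=True)
```

With this ordering, the `category` column never needs to be overwritten, so the silent relabeling disappears.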
I understand now! As I said in a previous comment, this is not how the dataset should be generated. You generate a single large dataset for each component, so you can scale your toys in a much more flexible way. Then yes, the rest doesn't work.
It's just not meant to work this way, but I guess it should be fixed anyway...
Hi Albert,
What I did was to generate all the components individually and save them in a single `hdf` file. To my understanding this can be done very easily in your framework, with the joint advantage of having only one configuration file for the full generation. I also think this should be as flexible as having 20 different datasets, provided we can sample within the categories.
However, if this was not implemented in order to avoid the problems that arise when handling big files (e.g. ones that do not fit in memory), I will gladly split my configuration file.
@matzeni, if they are saved in the same `hdf` file, there's no way to sample from it. The idea is that if everything is separated, you can sample from those files (even with different yields), while if everything is merged there's no way to do that (unless you add some code).

Ah yes, makes sense! Thanks a lot for the help!
Yes, that strategy sounds reasonable.
Just to make sure this won't fail (overwrite) silently in the future, I've added a catch. Not a golden solution, but surely something that may prevent errors. What do you think, @apuignav?
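A sketch of what such a guard could look like (hypothetical illustration, not the code from the actual commit):

```python
import pandas as pd

def tag_category(data, category):
    """Attach the requested category, refusing to silently overwrite another one."""
    if 'category' in data.columns and not data['category'].eq(category).all():
        raise ValueError(
            "dataset already carries a different category; "
            "refusing to overwrite it silently")
    return data.assign(category=category)
```

The idea is simply to turn the former silent relabeling into a loud error whenever the loaded data already carries conflicting category labels.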
Ready to merge from my side (or to be disregarded and closed).
added 1 commit
- 20293636 - changed readme, remove category col if not specified