Define content of PersistReco

changed the description

PV tracks used in a PV can now be added to persist reco by setting pv_tracks=True in line definition.

@sstahl do you have some thoughts on how many different definitions of persist reco could be practically supported in parallel?

I think there should be only one definition of "PersistReco" which to me is the replacement of saving the raw event (and a subsequent offline processing) and should enable (roughly) the same opportunities offline. That's why I asked more or less specifically about the Sprucing. For Turbo lines I personally would not use the word PersistReco but Selective Persistence and then I guess there are infinite combinations.

My question was not about technicalities, it was more a strategic question.

I agree with your definition of "PersistReco", but I wanted to make sure I understood your point correctly. I agree that enabling roughly the same opportunities offline is important. Especially since we have to learn how to make optimal use of the significantly lower PT physics which the software trigger will give us access to.

@sstahl Thanks for this issue.

I agree with your definitions: PersistReco should contain everything (reasonable) the sprucing needs to do the sprucing.

For (most) Turbo lines, we want to have a maximally flexible SelectivePersistence, where one can essentially define a list of extra objects one wants to persist in addition to the candidates. If I understood @sesen correctly, this is in principle doable, but will need some modifications.

The PV tracks however should not be needed in any case, at least if the latest version of the PV class issued (I assume it is needed at the moment, as it is actually not used and a conversion is performed to the old PV class?)

mentioned in merge request !2119 (closed)

@isanders I had a quick chat with @sstahl, could we ask for your help to perform some studies on putting a TTracks container into the nominal persistreco list? This should actually be easier than what we discussed as a solution. What we need to know is how much this increases the bandwidth. But as a first step could you try adding the TTracks into the persistreco list?

It seems clear that we need to add every track type into the default persistreco

Keep in mind this is not only about saving them but rather fitting them all which might not be so cheap.

Currently there is already a merged v1 track container including unfitted T-tracks in the reco (used for the calo). So I suppose adapting this for the persist reco should be easy.

    tracks4calo = [all_best_tracks["v1"]] + [
        trackSelections[key] for key in out_track_types["Unfitted4Calo"]
    ]
    alltracks4calo = TrackSelectionMerger(
        InputLocations=tracks4calo).OutputLocation

I assume people want fitted tracks though..

then they can do that in the sprucing step I suppose

Don't you need clusters for that?

those (clusters) we need anyway to be persisted, right? all persisted tracks should be refitable I suppose

Another question is do we also need proto particles for all these track types? Do we need any Rich, Calo, Muon information for them?

These kind of discussions would make sense having in our joint RTA-DPA meetings backed-up by 1-2 slides IMO. Cc @benito and @decianm.

It would be ideal to start commissioning and spitting files to offline at the next occasion with a reasonably finalised decision on the list of objects to be persisted. Even more ideal, such a list should be presented to the PPG to avoid surprises and be transparents to analysts on what they should expect to have offline for analysis. This way we mitigate feedback of the kind "Oh, I can't do my analysis X because info Y is missing. I never realised this was going to be the case." My 2 cents :-).

I agree, I would very much like a "signed off" persistreco list from the physics WGs. We can schedule a slot for next Friday

I am not sure it makes sense to present a list and have an open discussion before ironing out the technical details and costs.

I see that point but if analysts need something they need it. And we have to make it work. Im talking more about the inclusive lines here that allow data mining many years in the future

I also appreciate that a lot of this work falls on you @sesen

I did not say this is meant to happen next week. But we need to be more transparent on what we make possible/impossible to analysts. It's a must. Also, as far as analysts and the PPG are concerned, technical details are irrelevant. Technical details and costs is something to discuss within RTA-DPA, and the executive summary is what you provide to the PPG. Hence I stand by my proposal.

I am very sympathetic to everyone's points here. For what my input is worth the nice thing about our framework is that everything is possible, given the relevant human effort (or costs as @sesen puts it). The physics working groups should be made very aware of those costs and warmly invited to consider how they could help absorb some of them by putting more work into this area. LHCb is frankly beyond lucky that you've worked so selflessly on this for as long as you have Sevda! That being said nothing should be seen as impossible, but physics analysts should list their priorities understanding that lower priority items will happen only later in the run (or in run 4) unless significant additional effort is found.

I have always been very appreciative of Sevda's commitment and work, and made that clear on several occasions in MRs, so assume this is beyond doubt.

Now, effort and people available to do certain tasks is a totally different matter wrt what RTA and DPA (need to) provide for analysis. We cannot go and say to analysts that we won't tell them what the content of DSTs will be because we have insufficient people to work on getting the work done. We can, and should, state what we plan to make available offline, to check that's sufficient for LHCb's physics programme across the board, while emphasising the parts of the required work that are understaffed. Hence my "2-step" proposal.

I hope this clarifies my comments.

I completely agree with the steps of @erodrigu.

We need to know what is expected offline (ie. the definition of persistreco) such that the analysts have the full flexibility that was promised to them. As @sstahl said it "should enable (roughly) the same opportunities offline "
We then need to determine the human and bandwidth costs of delivering it
Then we can iterate on the above

@cmarinbe @decianm this is one of those areas where the RTA-DPA boundaries are fuzzy. How would you like to proceed?

We had some discussion in the RTA coordination meeting today, and converged on:

PersistReco should provide a set of reconstructed objects that cover the needs of a majority of analysis. There should only be one set of persisted objects which we call PersistReco.
For exploratory studies, where you need the full information (including raw event), you can use the TurCal stream
If you want to do something special, there is the possibility to save a list of objects that fit your needs (along the lines of SelectivePersistence or SelectivePersistReco). You can always opt to persist the raw event (or a subset of raw banks) for your specific line that then enables you to do what you want to do.

This is more the general strategy. For what concerns what we want to have in the list of objects for PersistReco, I suggest we make a list of things we consider essential, and then gather input (from all interested parties), where we then decide if this is worth adding to PersistReco or not. Of course it should come (as @nskidmor said), with a cost estimate in terms of bandwidth to make a better informed decision. It would also be beneficial to know how much bandwidth-cost adding all raw banks has, as this is essentially the maximum information one can persist (and would be kind of an upper limit).

This will most likely be an iterative process, but that is ok.

May I ask if the default list in PersistReco is converged ? I see a list in Sevda's talk https://indico.cern.ch/event/1253658/contributions/5283546/attachments/2599779/4488788/Talks-12.pdf page 2 but not sure if there is any update recently...

We would like to do a survey in our Physics WG to collect feedbacks about special usecases of dedicated rawbanks, where the default persistency feature would be a very useful input to help WG members understand the situation.

If it is not converged, may I consider the following items as something that will surely be involved in the default setting (based Sevda's talk) ?

Reconstructed track objects for all track categories. But for example, for T-tracks that has been used for Long-track reconstruction, we are not sure if it will be saved by-default.
ECAL reco output: photon, pi0 and ECAL electron
Muon Reco output: including Muon-station-based PID and Muon standard-alone tracking output
Rich Reco output: the Rich PID

If yes, we may ask WG members to tell us what's the special usecase that cannot be archieved by only using these info.

And personally I would suggest considering the following two objects as "widely used by many analyses":

Calo-based PID . My naive feeling is that it also contribute to the widely-used DLL variables.
Rich raw bank. This is for persisting the ability of performing a Rich global fit considering the hypotheses of: Sigma- (upstream track), Xi- (upstream track), deuteron et. al. , for potential future analysis making use of these final states. If it's not involved in the default setting, B&Q would like to study the feasibility of migrating our detached Jpsi2MuMu, psi2S2MuMu full hlt2 line into Turcal or SelectivePersist configs. Playing with fancy reconstruction algorithms and searching for new hadrons in these rarely-used final states will be one of our future directions to boost the LHCb spectroscopy study to a new level.

Hello @mengzhen

The list has not been finalized yet, but rest assured that the Physics Working Groups will be consulted on what should go in the PersistReco list.

Great thanks Michel !

Then for the B&Q survey we would just say "the issue is under discussion, and we liaisons would like to collect feedbacks to get prepared for discussions with commissiong experts".

In our WG, I know a few analyses which might be do-able using 2023 data or even benefit from the low-multiplicity condition, and their analysis strategy could be different based on different strategy of persistency. So I would like to start the survey earlier than later. Apologize for being a bit hurry and pushing at this point :)

I thought there was plan to not to merge tracks to one best track container. At the moment, what goes into Track/Best depends on what reco runs. @dovombru @mveghel Do you know the status of this?

Indeed this is the plan, but I am not aware of any (recent) progress.

added to epic &5

added Collaboration commissioning Persistency design labels

marked this issue as related to #503

Related MR: !2338 (merged)

removed the relation with #503

mentioned in merge request !2358 (merged)

Follow up on the discussion at !2358 (comment 6849180) .

I'm not sure I got the conclusion of the discussion, is the content of PersistReco enough to be able to refit PVs?

My understanding after a discussion with @sesen is the following: If we enable pv_tracks in packing, all tracks that are used to make a PV are persisted. This should be true irrespective if we use TBLV or PatPV3DFuture. Offline one could then always refit PVs by matching the track in the candidate with the track in the PV, and excluding it in a refit, or just fitting the whole PV from scratch with it. I remember that @peilian had a look at this at some point.

To use the benefit of PrimaryVertex, one would need to essentially persist the PrimaryVertexContainer. Online the PrimaryVertex is converted to RecVertex, persisted, and offline the RecVertex is converted back to a PrimaryVertex via RecV1ToPVConverter, but the links between Velo tracks and indices are not recovered in the converter.

Long story short: With pv_tracks in packing, we should be able to do a refit offline. A proper implementation would need to persist PrimaryVertexContainer and add the index of the Velo track to the (extra info of the) tracks used by the selection (which would be adding one line in the Kalman Filter)

What is the solution for this year to enable refitting in the full stream?

This is the list of locations we plan to save for PersistReco for (the beginning of) 2023, taken from LHCb!4127 (merged)

"LongProtos": ("/Event/Rec/ProtoP/Long", "ProtoParticles"),
"DownstreamProtos": ("/Event/Rec/ProtoP/Downstream", "ProtoParticles"), # not available for now
"UpstreamProtos": ("/Event/Rec/ProtoP/Upstream", "ProtoParticles"), # not available for now
"NeutralProtos": ("/Event/Rec/ProtoP/Neutrals", "ProtoParticles"),
"LongTracks": ("/Event/Rec/Track/BestLong", "Tracks"),
"DownstreamTracks": ("/Event/Rec/Track/BestDownstream", "Tracks"), # not available for now
"UpstreamTracks": ("/Event/Rec/Track/BestUpstream", "Tracks"), # not available for now
"Ttracks": ("/Event/Rec/Track/Ttrack", "Tracks"), # unfitted
"VeloTracks": ("/Event/Rec/Track/Velo", "Tracks"), # unfitted
"PVs": ("/Event/Rec/Vertex/Primary", "PVs"),
"CaloElectrons": ("/Event/Rec/Calo/Electrons", "CaloHypos"),
"CaloPhotons": ("/Event/Rec/Calo/Photons", "CaloHypos"),
"CaloMergedPi0s": ("/Event/Rec/Calo/MergedPi0s", "CaloHypos"),
"CaloSplitPhotons": ("/Event/Rec/Calo/SplitPhotons", "CaloHypos"),
"RecSummary": ("/Event/Rec/Summary", "RecSummary")

In addition, we'll save the Raw event for the Full stream, such that any potentially missing object could be restored.

Comments?

Quick question, you say "(beginning of)" because it might change when the UT is included or any other reason?

No reason, I just did not want to predict the full 2023 running now.

Okay, stating the probably obvious, if there is not a really critical reason, data in the same bookkeeping path should have the same data format. Data with UT would end up in a different location.

indeed

mentioned in merge request !2589 (merged)

Define content of PersistReco

Designs

Child items ...

Activity