Ming-Yan Lee authored
Run 3 commissioning results of heavy-flavor jet tagging at √s=13.6 TeV with CMS data using a modern framework for data processing
CDS link: DP-2024-024
Abstract
Identifying jets originating from the hadronization of bottom and charm hadrons (heavy-flavor jets) in the CMS experiment holds significant importance for various physics analyses, including investigations of the properties of the Higgs boson and top quarks, and the search for new physics beyond the standard model. This identification primarily relies on detector inputs from reconstructed charged particle tracks and information about secondary vertices contained within hadrons reconstructed as jets. In Run 3, improved machine-learning techniques have been introduced to distinguish heavy-flavor jets from those originating from the hadronization of light-flavor (uds) quarks or gluons (g). Consequently, it is crucial to compare data and simulation for the distributions of input variables, tagging discriminants, and other pertinent kinematic observables. In the first part of this note, proton-proton collision data at √s = 13.6 TeV (Run 3) are compared to expected distributions obtained from Monte Carlo simulation in five different phase space regions: top quark-antiquark pair production (tt̅) in the dileptonic final state (enriched in b jets), tt̅ production in the semileptonic final state (enriched in b and c jets), W boson plus charm production (enriched in c jets), Drell-Yan production, and QCD multijet production (enriched in light jets). These phase spaces are shown with data corresponding to an integrated luminosity of 61.7 fb⁻¹, recorded by the CMS experiment in 2022 and 2023. In the second part, a modern and fast framework that has been developed and automated for the production of the presented comparisons is discussed along with its technical details.
Glossaries:
heavy-flavor tagging
- AK4 jets: Jets reconstructed with the anti-k_t algorithm [1] with a distance parameter of R = 0.4 using particle-flow candidates. Pileup mitigation is performed by the pileup-per-particle identification (PUPPI) algorithm [2,3], which assigns a weight to every particle depending on its probability to originate either from pileup or from the leading vertex. Dedicated jet energy corrections (JEC) derived from Run 3 data [4,5] are applied to the jets.
- Muon-jet: An AK4 jet containing a low-pT muon (pT < 25 GeV), i.e. fulfilling the requirement on the angular separation ΔR(low-pT muon, jet axis) < 0.4, where ΔR = √(Δη² + Δφ²).
- Pileup jet: A reconstructed jet that is not matched to any generator-level jet with pT > 8 GeV within ΔR < 0.4 in the simulated event.
- Heavy-flavor jets: Jets originating from the hadronization of bottom or charm hadrons.
- Light-flavor jets: Jets originating from the hadronization of light-flavor (uds) quarks or gluons.
- Secondary Vertex (SV): The point at which the b or c hadron decays. The vertex reconstruction is performed using the adaptive vertex fitter and the inclusive vertex finding (IVF) algorithm [6]. The resulting list of vertices is then subject to a cleaning procedure: if two SV candidates share 70% or more of their tracks, or if the significance of the flight distance between the two secondary vertices is less than 2, one of the two is dropped from the collection of secondary vertices.
- 3D Track SIP (Signed Impact Parameter) Above Charm: The signed 3D impact parameter of the first track that raises the combined invariant mass of the tracks above the charm threshold. Tracks in a jet are added one at a time to a four-vector sum, in decreasing order of impact parameter significance. The procedure stops once an invariant mass of at least 1.5 GeV is reached.
- BTV: B-tagging and vertexing.
- Scale factor (SF): Data-to-simulation factor that corrects the number of b-tagged jets in simulation. SFs can be applied in a number of ways, but typically involve weighting simulated events based on the b (c) jet discriminator value evaluated for each jet in the event.
- DeepJet: A multi-classification deep-neural-network algorithm [6] employing general (low-level) properties of several charged and neutral particle-flow jet constituents, supplemented with properties of secondary vertices associated with a jet. This was the state-of-the-art tagger used for heavy-flavor tagging during the 2015–2018 data-taking period at √s = 13 TeV (Run 2).
- ParticleNetAK4: A ParticleNet [7] architecture customised for AK4 jet classification. ParticleNetAK4 performs heavy-flavour and hadronic tau identification in an inclusive way, combined with a flavour-aware jet energy correction and jet energy resolution. ParticleNet is a jet tagging algorithm based on a Dynamic Graph Convolutional Neural Network. Instead of treating the jet as a collection of ordered constituents, as DeepJet does, a jet is considered as an unordered set of its constituent particles, or a “particle cloud”. This representation proves to be more efficient in incorporating additional low-level jet information and also explicitly respects the permutation symmetry.
- RobustParTAK4: A ParticleTransformer [8] model specific to the classification of AK4 jets. The transformer model introduces pairwise "interaction" features between all input jet constituents and secondary vertices. This additional layer of inputs gives a better view of the internal relations of the jet constituents, thus improving the performance of the model. For AK4 jet classification, a slightly modified ParticleTransformer model architecture [9] is used. In addition, Adversarial Training (AT) [10] is used to enhance the robustness of the model against the mismodeling of the Monte Carlo (MC) simulation. AT distorts the input features with respect to the loss function of the neural network. This allows the model to learn how to classify the jet flavour in a region around the jet input feature distributions observed in the MC simulation, thereby reducing the impact of the mismodeling. A combination of these two approaches is used to preserve the performance and improve the robustness of heavy-flavor tagging, and the resulting tagger is called RobustParTAK4.
- BvAll discriminant: Discriminates b jets from jets of other flavors (c and udsg). BvAll is defined as BvAll = P(b) / [1 - P(b)], where P(b) is the probability of identifying a b jet against all other types of jets.
- CvB and CvL discriminants: Discriminate c-quark-initiated jets from b (CvB) and light (CvL) jets. CvL and CvB are defined as CvL = P(c) / [P(c) + P(udsg)] and CvB = P(c) / [P(c) + P(b)]. P(c) is the probability of identifying a c jet, P(b) is the sum of the probabilities corresponding to jets originating from b hadron signatures, and P(udsg) is the sum of the probabilities of identifying uds and g jets.
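As a minimal sketch of the CvL and CvB definitions in the glossary above, the snippet below computes both discriminants from per-class tagger probabilities. The function names, the example probabilities, and the sentinel value of -1 returned for a vanishing denominator are assumptions made for illustration; they are not taken from the CMS software.

```python
def cvl(p_c: float, p_udsg: float) -> float:
    """CvL = P(c) / [P(c) + P(udsg)]; returns -1.0 if the denominator vanishes (assumed sentinel)."""
    denom = p_c + p_udsg
    return p_c / denom if denom > 0.0 else -1.0


def cvb(p_c: float, p_b: float) -> float:
    """CvB = P(c) / [P(c) + P(b)]; returns -1.0 if the denominator vanishes (assumed sentinel)."""
    denom = p_c + p_b
    return p_c / denom if denom > 0.0 else -1.0


# Hypothetical tagger output for one jet (the four classes sum to 1).
probs = {"b": 0.10, "c": 0.60, "uds": 0.20, "g": 0.10}
p_udsg = probs["uds"] + probs["g"]  # P(udsg) is the sum of the uds and g probabilities

print(f"CvL = {cvl(probs['c'], p_udsg):.3f}")
print(f"CvB = {cvb(probs['c'], probs['b']):.3f}")
```

Both discriminants are bounded between 0 and 1 for valid probabilities, so a c-like jet sits close to 1 in both.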
framework:
- NanoAOD [11]: An event data format that is highly compact in size, commissioned by the CMS Collaboration. It only includes high-level physics object information and is about 20 times more compact than the MiniAOD format. NanoAODs are easily customisable for development activities and support automated data analysis workflows.
- Columnar Object Framework For Effective Analysis (coffea) [12]: A Python-based package for performing columnar manipulation of data, tailored to the analysis requirements typical of high-energy collider physics (HEP) experiments. It makes use of uproot and awkward-array [13] to provide an array-based syntax for manipulating HEP event data in an efficient and numpythonic way. Additional sub-packages are dedicated to the generation of look-up tables essential for conveying scientific insights, performing data transformations, and rectifying discrepancies observed between Monte Carlo simulations and data.
Run 3 conditions
- ECAL water leakage: During Run 3 data taking in 2022, a leak was identified in the water cooling circuit, resulting in towers with high ECAL energy in the endcap region. The leak prevented cooling of the electronics serving 20 communication and control units, corresponding to a total of 491 ECAL crystals. The affected crystals were masked from Sept 17 until the end of the 2022 data-taking period, rendering approximately 7% of the ECAL endcap unusable for data collection. Jets within the event cleaning region are vetoed [14]. Consequently, the dedicated simulations used for the 2022 comparisons mimic this effect and are divided into pre-EE and post-EE periods. The leak was repaired during the 2022 year-end technical stop, resolving the issue from the 2023 data collection onwards.
- BPix issue [15]: After Technical Stop 1 of 2023 (June 19-24), 27 modules in Barrel Pixel Layers 3 and 4 (BPix 3 and BPix 4) became inoperable due to an issue in distributing the LHC clock signals to these modules. Since this incident, these modules have remained deactivated. They cover a sector spanning approximately 0.4 radians (~23 degrees) in phi at negative pseudorapidity (Bml Sector 7). Since the regions covered by these modules fully overlap in eta and phi across the two detector layers, a full gap in acceptance is produced when attempting to seed tracks with traditional “high purity” pixel-hit combinations (triplets and quadruplets). A dedicated jet energy scale and resolution correction is introduced to account for the loss in jet energy in the affected regions. The dedicated simulations used for 2023 are sub-divided into pre-BPix and post-BPix periods.
Dileptonic tt̅ phase space
Dileptonically decaying tt̅ events form a final state in which the highest purity of b jets is achieved. This event topology is relevant for deriving calibration SFs for b-tagging [16]. Events are selected with a set of electron-muon (eμ) trigger paths. The electron (muon) is required to fulfill pT > 30 (30) GeV, |η| < 2.5 (2.4) and to pass tight identification and isolation requirements [17,18]. At least 2 jets with pT > 20 GeV and |η| < 2.5, fulfilling tight identification criteria and separated by ΔR > 0.4 from the selected electron and muon, are required.
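The kinematic part of this selection can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: objects are represented as hypothetical dictionaries with `pt`, `eta` and `phi` keys, the trigger and the tight identification and isolation requirements are assumed to have been applied upstream, and the function names are invented for this example rather than taken from the BTV framework.

```python
import math


def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi


def delta_r(obj1, obj2):
    """Angular separation: sqrt(d_eta^2 + d_phi^2)."""
    return math.hypot(obj1["eta"] - obj2["eta"], delta_phi(obj1["phi"], obj2["phi"]))


def select_jets(jets, leptons, pt_min=20.0, abs_eta_max=2.5, dr_min=0.4):
    """Keep jets passing the kinematic cuts and separated from every lepton by more than dr_min."""
    return [
        jet for jet in jets
        if jet["pt"] > pt_min
        and abs(jet["eta"]) < abs_eta_max
        and all(delta_r(jet, lep) > dr_min for lep in leptons)
    ]


def passes_dilepton_selection(electron, muon, jets):
    """Kinematic cuts only; trigger, ID and isolation are assumed to be applied elsewhere."""
    leptons_ok = (
        electron["pt"] > 30.0 and abs(electron["eta"]) < 2.5
        and muon["pt"] > 30.0 and abs(muon["eta"]) < 2.4
    )
    return leptons_ok and len(select_jets(jets, [electron, muon])) >= 2


# Hypothetical event: the second jet overlaps with the electron, the fourth is too soft.
electron = {"pt": 45.0, "eta": 0.3, "phi": 0.0}
muon = {"pt": 35.0, "eta": -1.0, "phi": 2.0}
jets = [
    {"pt": 80.0, "eta": 0.5, "phi": -2.0},
    {"pt": 50.0, "eta": 0.35, "phi": 0.1},
    {"pt": 30.0, "eta": -2.0, "phi": 1.0},
    {"pt": 15.0, "eta": 1.0, "phi": 3.0},
]
print(passes_dilepton_selection(electron, muon, jets))
```

Wrapping Δφ before computing ΔR matters for objects on either side of φ = ±π, which would otherwise look far apart.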
**Figure 1.** Transverse momentum (pT, left) and pseudorapidity (η, right) of the selected jet with highest pT. Reasonable agreement for the pT and η distributions is observed.
**Figure 2.** DeepJet BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. A mismatched peak position at lower tagger scores and a downward trend are observed before the SFs are applied.
**Figure 3.** ParticleNetAK4 BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. A mismatched peak position at lower tagger scores and a downward trend are observed before the SFs are applied.
**Figure 4.** RobustParTAK4 BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. A mismatched peak position at lower tagger scores and a downward trend are observed before the SFs are applied. Slightly better agreement is observed compared to the DeepJet and ParticleNetAK4 taggers.
**Figure 5.** Value (left) and significance (right) of the 3D signed impact parameter above charm of the selected tracks. The shifted peak position and asymmetric distribution are expected due to imperfect tracking calibration.
**Figure 6.** Transverse momentum (pT, left) and χ2 value of the secondary vertex fit (right) of the first selected secondary vertex. Overall good agreement is observed.
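The "3D signed impact parameter above charm" shown in Figure 5 follows the procedure given in the glossary: tracks are added to a four-vector sum in decreasing order of impact parameter significance until the invariant mass reaches 1.5 GeV. A minimal sketch of that loop is given below. The track representation (dictionaries with momentum components, `sip3d` and `sip3d_sig`), the charged-pion mass hypothesis for every track, and the `None` return value when the threshold is never reached are assumptions made for this illustration.

```python
import math

PION_MASS = 0.13957  # GeV; assumed mass hypothesis for every track
CHARM_MASS_THRESHOLD = 1.5  # GeV, as quoted in the glossary


def sip_above_charm(tracks, threshold=CHARM_MASS_THRESHOLD):
    """Return (sip3d, sip3d_sig) of the first track lifting the summed mass above threshold."""
    # Order tracks by decreasing 3D impact parameter significance.
    ordered = sorted(tracks, key=lambda trk: trk["sip3d_sig"], reverse=True)
    energy = px = py = pz = 0.0
    for trk in ordered:
        p2 = trk["px"] ** 2 + trk["py"] ** 2 + trk["pz"] ** 2
        energy += math.sqrt(p2 + PION_MASS ** 2)
        px += trk["px"]
        py += trk["py"]
        pz += trk["pz"]
        mass2 = energy ** 2 - (px ** 2 + py ** 2 + pz ** 2)
        if mass2 > 0.0 and math.sqrt(mass2) >= threshold:
            return trk["sip3d"], trk["sip3d_sig"]
    return None  # threshold never reached (assumed convention)


# Hypothetical jet with three tracks (momenta in GeV).
tracks = [
    {"px": 0.0, "py": 5.0, "pz": 0.0, "sip3d": 0.03, "sip3d_sig": 6.0},
    {"px": 5.0, "py": 0.0, "pz": 0.0, "sip3d": 0.05, "sip3d_sig": 10.0},
    {"px": 1.0, "py": 1.0, "pz": 1.0, "sip3d": 0.01, "sip3d_sig": 2.0},
]
print(sip_above_charm(tracks))
```

In this example the most significant track alone has only the pion mass, so the second track in the ordering is the one that crosses the threshold.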
Semileptonic tt̅ phase space
Due to the hadronically decaying W boson, semileptonic tt̅ events have a significant amount of c jets, and thus can be used for calculating b-tagging and c-tagging identification SFs [16,19]. Events are selected using a single-muon trigger path. The selected muon is required to fulfill the same selection criteria as in the dileptonic tt̅ phase space. At least 4 jets with the same requirements as in the dileptonic tt̅ phase space are required. The event is required to have pTmiss above 50 GeV.
**Figure 7.** Transverse momentum (pT, left) and pseudorapidity (η, right) of the selected jet with highest pT. This phase space is enriched in b and c jets. Compared to the dileptonic tt̅ final state (see Fig. 1), the fraction of udsg and c jets is higher. Reasonable agreement between the data and simulations is observed.
**Figure 8.** DeepJet BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. A mismatched peak position at lower tagger scores and a downward trend are observed before the SFs are applied.
**Figure 9.** ParticleNetAK4 BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. A mismatched peak position at lower tagger scores and a downward trend are observed before the SFs are applied.
**Figure 10.** RobustParTAK4 BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. A mismatched peak position at lower tagger scores and a downward trend are observed before the SFs are applied. Slightly better agreement is observed compared to the DeepJet and ParticleNetAK4 taggers.
**Figure 11.** Value (left) and significance (right) of the 3D signed impact parameter above charm of the selected tracks. The shifted peak position and asymmetric distribution are expected due to imperfect tracking calibration.
**Figure 12.** Transverse momentum (pT, left) and χ2 value of the secondary vertex fit (right) of the first selected secondary vertex. Overall good agreement is observed.
W boson plus charm jet (W+c) selection
This phase space is largely enriched in c jets and is used for evaluating the c-tagging performance of the heavy-flavor tagging algorithms [19]. Events with a leptonically decaying W boson and c jets are selected. The c jets are identified using the semileptonic decay of the c hadron, which produces a soft muon within the jet in the final state. The same trigger path and the same selection criteria for the isolated muon as for the semileptonic tt̅ phase space are required. At least 1 additional soft muon with a reduced pT threshold and a relative isolation greater than 0.2 is selected and matched with 1 to 3 selected muon-jets, which pass pT > 20 GeV and |η| < 2.5. Both opposite-sign (OS) and same-sign (SS) pairs of isolated muon and soft muon are taken into account. Events that contain more than 1 reconstructed secondary vertex are considered for this study. Additional selection criteria are applied to enrich W boson events and to suppress QCD multijet and Drell–Yan contributions. To enrich the selected event sample in W bosons, the transverse mass of the system formed by pTmiss and the isolated muon four-vector is required to be larger than 55 GeV. Events stemming from Drell–Yan processes are suppressed by excluding events with an invariant di-muon mass within the Z boson mass window (80 GeV < mμμ < 100 GeV), and are further suppressed by requiring the sum of the muon and the neutral electromagnetic energy fractions to be smaller than 0.7. Low-mass di-muon events are rejected by selecting events with mμμ > 12 GeV. QCD multijet events are rejected by requiring the isolated muon to fulfill a very tight relative isolation of less than 0.05 and tight requirements on its impact parameter.
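The mass-based requirements of this selection (transverse mass above 55 GeV, mμμ above 12 GeV, and the Z boson mass window veto) can be illustrated with the short sketch below. The transverse mass is computed with the usual massless approximation, mT = √(2 pT(μ) pTmiss (1 − cos Δφ)), which is an assumption here since the note does not spell out the formula; the function names are likewise hypothetical, and the isolation, energy-fraction and impact parameter cuts are not modelled.

```python
import math


def transverse_mass(mu_pt, mu_phi, met_pt, met_phi):
    """Transverse mass of the muon + pTmiss system (massless approximation, assumed)."""
    return math.sqrt(2.0 * mu_pt * met_pt * (1.0 - math.cos(mu_phi - met_phi)))


def passes_wc_mass_cuts(m_t, m_mumu):
    """Apply the W enrichment, low-mass rejection and Z window veto quoted in the text."""
    w_enriched = m_t > 55.0                   # GeV, W boson enrichment
    not_low_mass = m_mumu > 12.0              # GeV, reject low-mass di-muon events
    outside_z = not (80.0 < m_mumu < 100.0)   # GeV, Drell-Yan suppression
    return w_enriched and not_low_mass and outside_z


# Hypothetical event: isolated muon back-to-back with pTmiss, di-muon mass of 30 GeV.
m_t = transverse_mass(mu_pt=40.0, mu_phi=0.0, met_pt=40.0, met_phi=math.pi)
print(f"mT = {m_t:.1f} GeV, selected: {passes_wc_mass_cuts(m_t, m_mumu=30.0)}")
```

The di-muon mass is taken as a precomputed input here to keep the example focused on the three thresholds.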
**Figure 13.** Transverse momentum (pT, left) and pseudorapidity (η, right) of the selected muon-jet with highest pT. Reasonable agreement for the pT and η distributions is observed.
**Figure 14.** DeepJet BvAll (left) and CvL discriminants (right) of the selected jet with highest pT. Reasonable agreement between data and MC is observed.
**Figure 15.** DeepJet CvB discriminant of the selected jet with highest pT. Reasonable agreement between data and MC is observed.
**Figure 16.** ParticleNetAK4 BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. Reasonable agreement between data and MC is observed.
**Figure 17.** ParticleNetAK4 CvB discriminant of the selected jet with highest pT. Reasonable agreement between data and MC is observed.
**Figure 18.** RobustParTAK4 BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. Reasonable agreement between data and MC is observed.
**Figure 19.** RobustParTAK4 CvB discriminant of the selected jet with highest pT. Reasonable agreement between data and MC is observed.
**Figure 20.** Value (left) and significance (right) of the 3D signed impact parameter above charm of the selected tracks. The shifted peak position and asymmetric distribution are expected due to imperfect tracking calibration.
**Figure 21.** Transverse momentum (pT, left) and χ2 value of the secondary vertex fit (right) of the first selected secondary vertex. Overall good agreement is observed.
Drell–Yan plus jets (DY+jets) selection
This phase space is enriched in light-flavor jets and is used for the calibration of udsg mistagging SFs [16]. A di-muon trigger path is employed to select Z→μμ events. The leading (subleading) muon has to fulfill pT > 15 (12) GeV, and both muons are required to satisfy |η| < 2.4, as well as tight identification and isolation requirements [18]. The invariant di-muon mass has to be at least 15 GeV and within the Z boson mass window. At least 1 jet with pT > 20 GeV and |η| < 2.5, fulfilling tight identification criteria and cleaned from the selected muons, is required.
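A minimal sketch of the di-muon part of this selection follows. It builds four-vectors from (pT, η, φ), computes the invariant mass, and applies the thresholds quoted above. The Z boson mass window is not given numerically in this section, so the default of 80 to 100 GeV is borrowed from the W+c selection as an assumption; the dictionary layout and function names are also hypothetical, and identification and isolation are not modelled.

```python
import math

MUON_MASS = 0.10566  # GeV


def four_vector(pt, eta, phi, mass=MUON_MASS):
    """Convert (pt, eta, phi, mass) into (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return energy, px, py, pz


def dimuon_mass(mu1, mu2):
    """Invariant mass of the two-muon system."""
    v1 = four_vector(mu1["pt"], mu1["eta"], mu1["phi"])
    v2 = four_vector(mu2["pt"], mu2["eta"], mu2["phi"])
    energy, px, py, pz = (a + b for a, b in zip(v1, v2))
    mass2 = energy * energy - (px * px + py * py + pz * pz)
    return math.sqrt(max(mass2, 0.0))  # guard against tiny negative rounding errors


def passes_dy_selection(mu1, mu2, z_window=(80.0, 100.0)):
    """Kinematic cuts only; z_window is an assumed default, not quoted in this section."""
    leading, subleading = sorted((mu1, mu2), key=lambda mu: mu["pt"], reverse=True)
    if leading["pt"] <= 15.0 or subleading["pt"] <= 12.0:
        return False
    if abs(mu1["eta"]) >= 2.4 or abs(mu2["eta"]) >= 2.4:
        return False
    mass = dimuon_mass(mu1, mu2)
    return mass >= 15.0 and z_window[0] < mass < z_window[1]


# Hypothetical back-to-back muon pair with a mass close to the Z boson mass.
mu_a = {"pt": 45.6, "eta": 0.0, "phi": 0.0}
mu_b = {"pt": 45.6, "eta": 0.0, "phi": math.pi}
print(f"m_mumu = {dimuon_mass(mu_a, mu_b):.2f} GeV, selected: {passes_dy_selection(mu_a, mu_b)}")
```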
**Figure 22.** Transverse momentum (pT, left) and pseudorapidity (η, right) of the selected jet with highest pT. Reasonable agreement for the pT and η distributions is observed.
**Figure 23.** Transverse momentum (pT, left) and mass (right) of the Z boson. The disagreement at low pT is expected, since the simulation configuration is not fine-tuned.
**Figure 24.** DeepJet BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. A mismatched peak position at lower tagger scores and a downward trend are observed before the SFs are applied.
**Figure 25.** ParticleNetAK4 BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. A mismatched peak position at lower tagger scores and a downward trend are observed before the SFs are applied.
**Figure 26.** RobustParTAK4 BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. A mismatched peak position at lower tagger scores and a downward trend are observed before the SFs are applied. Slightly better agreement is observed compared to the DeepJet and ParticleNetAK4 taggers.
**Figure 27.** Value (left) and significance (right) of the 3D signed impact parameter above charm of the selected tracks. The shifted peak position and asymmetric distribution are expected due to imperfect tracking calibration.
**Figure 28.** Transverse momentum (pT, left) and χ2 value of the secondary vertex fit (right) of the first selected secondary vertex. Overall good agreement is observed.
Inclusive QCD multijet selection
This phase space is dominated by udsg jets. This region is used for tagger calibrations and acts as a control region for the udsg jets. Events are selected if they satisfy a trigger requiring at least one AK4 jet with pT > 180 GeV and |η| < 2.4. Due to the high event rates, only a fraction of the events that fulfill the trigger requirement are selected (prescaled trigger). The fraction of accepted events depends on the prescale value, which varies during the data-taking period according to the instantaneous luminosity. The data are compared to simulated multijet events at leading order.
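Because a trigger with prescale N accepts on average one event in N, the luminosity effectively sampled by a prescaled trigger is the delivered luminosity divided by the prescale, summed over periods of constant prescale. The sketch below illustrates this bookkeeping. It is only one common way of handling prescales and is not taken from the note; the luminosity values, prescale values and function name are hypothetical.

```python
def effective_luminosity(blocks):
    """Sum of recorded luminosity divided by the prescale for each constant-prescale block.

    `blocks` is an iterable of (recorded_luminosity, prescale) pairs.
    Blocks where the trigger was disabled (prescale of 0) contribute nothing.
    """
    return sum(lumi / prescale for lumi, prescale in blocks if prescale > 0)


# Hypothetical data-taking blocks: (recorded luminosity in fb^-1, prescale value).
blocks = [
    (10.0, 100),  # high instantaneous luminosity, large prescale
    (5.0, 50),    # lower instantaneous luminosity, smaller prescale
    (2.0, 0),     # trigger disabled
]
print(f"Effective luminosity: {effective_luminosity(blocks):.2f} fb^-1")
```

In such a scheme, the simulation would be normalized to the effective luminosity rather than to the full recorded luminosity.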
**Figure 29.** Transverse momentum (pT, left) and pseudorapidity (η, right) of the selected jet with highest pT. A large disagreement is observed in the |η| > 2 region.
**Figure 30.** DeepJet BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. A mismatched peak position at lower tagger scores and a downward trend are observed before the SFs are applied.
**Figure 31.** ParticleNetAK4 BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. A mismatched peak position at lower tagger scores and a downward trend are observed before the SFs are applied.
**Figure 32.** RobustParTAK4 BvAll (left) and CvL (right) discriminants of the selected jet with highest pT. A mismatched peak position at lower tagger scores and a downward trend are observed before the SFs are applied. Slightly better agreement is observed compared to the DeepJet and ParticleNetAK4 taggers.
**Figure 33.** Value (left) and significance (right) of the 3D signed impact parameter above charm of the selected tracks. The shifted peak position and asymmetric distribution are expected due to imperfect tracking calibration.
Variables in different data-taking conditions: dileptonic tt̅ phase space
**Figure 34.** 3D signed impact parameter (SIP) value (left) and significance (right) of tracks in different data-taking periods in the dileptonic tt̅ phase space. The first panel shows the distributions normalized to unity, in arbitrary units (A.U.). The second panel takes the 2022 pre-EE data-taking period as the reference and shows the ratio of the other data-taking conditions to the reference. The third panel shows the ratio of data to prediction in each individual condition. A shift of the peak position towards negative 3D SIP values is observed. A wider 3D SIP distribution is observed in the 2022 post-EE era.
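The first two panels of Figure 34 rely on two simple operations: normalizing each histogram to unit area and dividing each period by the reference period bin by bin. A minimal sketch of both steps is shown below. The bin contents and period labels are hypothetical, and the choice of returning NaN for empty reference bins is an assumption.

```python
def normalize(counts):
    """Scale bin contents so that they sum to 1 (arbitrary units)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c / total for c in counts]


def ratio(numerator, denominator):
    """Bin-by-bin ratio; empty reference bins give NaN (assumed convention)."""
    return [n / d if d != 0 else float("nan") for n, d in zip(numerator, denominator)]


# Hypothetical bin contents of the 3D SIP value for two data-taking periods.
histograms = {
    "2022 pre-EE": [10.0, 30.0, 60.0],   # reference period
    "2022 post-EE": [30.0, 30.0, 40.0],
}
reference = normalize(histograms["2022 pre-EE"])
for period, counts in histograms.items():
    shape = normalize(counts)
    print(period, [round(r, 3) for r in ratio(shape, reference)])
```

Normalizing before taking the ratio removes the difference in integrated luminosity between periods, so the ratio only reflects changes in shape.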
**Figure 36.** Workflow of the common framework. The workflow starts by adding customized flavor-tagging-related information to the NanoAOD and creating flat ntuples. Next, it proceeds to the common BTV framework, where events are selected for b-/c-/udsg-jet enriched regions, with the corrections and systematic variations applied on the fly. Finally, the information is stored either as histograms (coffea [12], ROOT [21]) or as arrays (awkward [13], ROOT) to make plots or to serve as input to other frameworks (e.g. scale factor derivation).
Common BTV framework: BTVNanoCommissioning [20]
- Input: Customized information specific to heavy-flavor tagging studies, configured as an additional module in CMS software [20] whose sequence runs after the data and simulation are produced. → custom BTV NanoAOD (additional information added to NanoAOD)
- Common BTV framework based on coffea [12]:
  - Common selections: used in commissioning and scale factor studies, unifying the object selections between different phase spaces. → use of the coffea framework for efficient computing
  - Common corrections: used among all phase spaces: triggers, lepton scale factors, jet energy corrections, and jet probability.
  - Common systematics: considered in different phase spaces, including lepton efficiency corrections, trigger scale factors, and jet energy uncertainties.
- Output: arrays used for SF derivation, and histograms used for final plots and template fitting.
- Automation (GitLab CI): connects the BTV framework to GitLab continuous integration and automatically produces plots on dedicated websites for checking data/simulation comparisons.
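The idea of unifying object selections between phase spaces can be illustrated with a small configuration-driven sketch: one set of jet cuts is shared by all regions, and only the required jet multiplicity changes. The jet thresholds and multiplicities are those quoted in the phase space descriptions above; the region keys, dictionary layout and function names are hypothetical and do not reflect the actual BTVNanoCommissioning code, which is built on coffea and awkward arrays rather than plain Python lists.

```python
# Jet requirements shared by the dileptonic tt, semileptonic tt and DY+jets selections.
COMMON_JET_CUTS = {"pt_min": 20.0, "abs_eta_max": 2.5}

# Minimum number of selected jets per phase space (region names are hypothetical).
MIN_JETS = {"ttbar_dilep": 2, "ttbar_semilep": 4, "dy_jets": 1}


def good_jets(jets, cuts=COMMON_JET_CUTS):
    """Apply the common kinematic jet cuts (identification and cleaning not modelled)."""
    return [
        jet for jet in jets
        if jet["pt"] > cuts["pt_min"] and abs(jet["eta"]) < cuts["abs_eta_max"]
    ]


def passes_jet_multiplicity(region, jets):
    """Check the region-specific multiplicity on top of the common jet cuts."""
    return len(good_jets(jets)) >= MIN_JETS[region]


# Hypothetical event with three jets, one of them outside the eta acceptance.
jets = [
    {"pt": 95.0, "eta": 0.4},
    {"pt": 40.0, "eta": -1.8},
    {"pt": 35.0, "eta": 2.9},
]
for region in MIN_JETS:
    print(region, passes_jet_multiplicity(region, jets))
```

Keeping the cuts in one place means a change to the jet definition propagates to every phase space at once, which is the point of a common selection layer.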
References
[1] M. Cacciari, G. P. Salam and G. Soyez, “The anti-kt jet clustering algorithm,” JHEP 0804 (2008) 063.
[2] CMS Collaboration, “Pileup mitigation at CMS in 13 TeV data”, JINST 15 (2020) P09018.
[3] CMS Collaboration, “Pileup-per-particle identification: optimisation for Run 2 Legacy and beyond”, CMS Detector Performance Summary CMS-DP-2021-001.
[3] CMS Collaboration, “Jet Energy Scale and Resolution Measurements Using Prompt Run3 Data Collected by CMS in the First Months of 2022 at 13.6 TeV”, CMS Detector Performance Summary CMS-DP-2022-054.
[4] CMS Collaboration, “Jet Energy Scale and Resolution Measurements Using Prompt Run3 Data Collected by CMS in the Last Months of 2022 at 13.6 TeV”, CMS Detector Performance Summary CMS-DP-2023-045.
[5] CMS Collaboration, “Identification of heavy-flavour jets with the CMS detector in pp collisions at 13 TeV”, JINST 13 (2018) P05011.
[6] E. Bols, J. Kieseler, M. Verzetti, M. Stoye and A. Stakia, “Jet flavour classification using DeepJet” JINST 15 (2020) P12012.
[7] H. Qu and L. Gouskos, “Jet tagging via particle clouds”, Phys. Rev. D 101, 056019 (2020).
[8] H. Qu, C. Li, S. Qian, “Particle Transformer for Jet Tagging,” arXiv:2202.03772.
[9] CMS Collaboration, “Transformer models for heavy flavor jet identification”, CMS Detector Performance Summary CMS-DP-2022-050.