Configuration of HltPackedData{Writer,Decoder} -- a.k.a. 'JSON and TCKs'
In order to save valuable bytes in storage, the encoding of various Hlt RawBanks maps (potentially long) human-recognizable, descriptive strings (e.g. TES locations) onto small integer values. This implies that the decoding, if it wants to provide those strings, must be aware of the mapping employed by the encoder at the time of writing.
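To make the problem concrete, here is a minimal sketch of the idea, with made-up table contents (the TES locations and integer values below are purely illustrative): the writer replaces strings by small integers using a table, and the reader needs the very same table to invert that replacement.

```python
# Minimal sketch of the basic idea; table contents are hypothetical.
encode_table = {
    "/Event/HLT2/SomeLine/Particles": 1,   # hypothetical TES locations
    "/Event/HLT2/OtherLine/Particles": 2,
}
decode_table = {v: k for k, v in encode_table.items()}

def encode(location: str) -> int:
    return encode_table[location]          # this small integer is what gets persisted

def decode(key: int) -> str:
    return decode_table[key]               # only correct when using the writer's table

assert decode(encode("/Event/HLT2/SomeLine/Particles")) == "/Event/HLT2/SomeLine/Particles"
```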
These mappings are fairly stable, i.e. they are guaranteed to be the same for all events within a given run, and runs processed by the same 'configuration' (aka 'workflow' instance) will also share these mappings. However, changes in the configuration may result in different mappings.
For Run 1 and Run 2, the encoding was configured 'through a TCK', which gave access to all properties of the relevant components for any given event (more precisely, for any 'key', where the key could be obtained from the data). With knowledge of how the encoding was implemented, it was therefore possible to reverse-engineer how to obtain this table. In practice, this meant that the decoding assumed that the encoding had used the HltANNSvc to obtain the mapping, and that within this service the relevant mapping had a given 'major' name (e.g. 'Hlt1SelectionID' for the selection reports in Hlt1). While decoding the corresponding data, it would then ask the 'TCKANNSvc' for the relevant property of the HltANNSvc for this key, and subsequently use that to turn the numbers back into human-recognizable strings.
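The lookup chain can be sketched as follows. This is emphatically not the real TCKANNSvc/HltANNSvc interface; the dictionaries, function and TCK value below are stand-ins used only to illustrate that the decoder relies on an implementation detail of the encoder (the mapping being a property of the HltANNSvc, under a given 'major' name, retrievable by TCK).

```python
# Hedged sketch of the Run 1/2 lookup chain; everything below is a stand-in.
tck_database = {                           # stand-in for what is published in DBASE/TCK/HltTCK
    0x21011234: {                          # hypothetical TCK, as found in the data
        "HltANNSvc": {
            "Hlt1SelectionID": {"Hlt1TrackMVADecision": 1, "Hlt1TwoTrackMVADecision": 2},
        },
    },
}

def ask_tck_backend(tck: int, major: str) -> dict:
    """Emulate asking the TCK back-end for the HltANNSvc property with this 'major' name."""
    return tck_database[tck]["HltANNSvc"][major]

# Decoding a Hlt1 report entry: turn the stored integer back into a string.
int_to_name = {v: k for k, v in ask_tck_backend(0x21011234, "Hlt1SelectionID").items()}
print(int_to_name[1])   # -> "Hlt1TrackMVADecision"
```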
This system implied that the decoding would always access an external source of information (released through DBASE/TCK/HltTCK), and this information had to be published prior to processing the data. In addition, it encoded in the decoder an implementation detail of the encoder, namely how the latter was configured. As a result, if someone wanted to generate 'their own' samples (i.e. not run through production), they had to ensure that a valid TCK for their data was present in $HLTTCKROOT. This meant either generating a TCK for the 'entire' configuration to be used, or 'faking' enough of a configuration to satisfy the (small!) part of it which the decoding would assume was used by the encoding. In addition, it implied that any configuration had to be associated with a unique number (i.e. 'the TCK') and that no duplicate TCKs could be generated (as otherwise one could end up with an inconsistent $HLTTCKROOT and silently get wrong results).
Based on the above, there are several solutions one could consider.
By far the simplest, but also the most expensive, solution is not to encode long strings as small fixed-size integers at all. A slightly less expensive solution would be to store, for each event, the relevant mapping in the data itself, e.g. in a dedicated RawBank, or as one of the 'containers' stored in the DstData RawBank. Next would be to recognize that these data are the same for all events in a given run, and to store them only once per run, e.g. in an FSR. These solutions all have in common that no external data access is needed, and that no assumptions have to be made about how the encoding is configured in order for the decoding to know which mapping to use: the decoding would either (in the first case) not need this information at all, or (in the second and third case) obtain the mapping directly from the datastream it is currently processing. The third case is clearly the most attractive, as it amortizes the required space over (potentially) very many events.
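A sketch of the once-per-run option, assuming a hypothetical record layout (not an actual FSR schema): the number->string table is written a single time per run, while every event carries only the small integers, so decoding never leaves the data stream.

```python
# Sketch of the once-per-run option; record layout and values are hypothetical.
run_record = {
    "run": 255623,                                        # made-up run number
    "packed_locations": {1: "/Event/HLT2/SomeLine/Particles",
                         2: "/Event/HLT2/OtherLine/Particles"},
}

events = [                                                # per-event payload: integers only
    {"run": 255623, "containers": [1, 2]},
    {"run": 255623, "containers": [2]},
]

for event in events:
    assert event["run"] == run_record["run"]
    locations = [run_record["packed_locations"][k] for k in event["containers"]]
    print(locations)   # decoded without consulting anything outside the data stream
```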
A second class of solutions relies on 'external metadata' to obtain the relevant table. Either one directly configures the decoding (or, more precisely, the services used by the decoding) with these tables; this is what the 'JSON' files currently in use do, i.e. they (indirectly) configure the decoding. Or one uses a unique identifier inside the data stream (which could be the TCK, or the run number) as a key to obtain this information, which could live in e.g. the conditions database, or in 'the TCK backend'. The conditions database has the advantage that exactly the relevant information (no more, no less) can be made available for easy consumption by the decoder. The TCK backend implies that the decoder must (indirectly) know how the encoding is actually configured (i.e. which components are involved, and which of their properties are the relevant ones). The latter is the solution used during Run 1 and Run 2, partly because 'it existed when the problem had to be solved' and partly because the conditions database was indexed not by run number but by event time, and there were occasional hiccups which led to events whose event time was not within the validity interval of their run.
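The JSON variant can be sketched as below. The JSON schema, the 'major' names and the values are illustrative assumptions, not the production format; the point is only that the file supplies the tables with which the decoding services are (indirectly) configured.

```python
# Sketch of JSON metadata configuring the decoding; schema and contents are illustrative.
import json

ann_json = json.loads("""
{
  "Hlt1SelectionID": {"Hlt1TrackMVADecision": 1, "Hlt1TwoTrackMVADecision": 2},
  "PackedObjectLocations": {"/Event/HLT2/SomeLine/Particles": 500}
}
""")

def configure_decoding(tables: dict) -> dict:
    """Build the inverse (int -> string) tables that the decoding services need."""
    return {major: {v: k for k, v in mapping.items()} for major, mapping in tables.items()}

decoder_tables = configure_decoding(ann_json)
print(decoder_tables["Hlt1SelectionID"][2])          # -> "Hlt1TwoTrackMVADecision"
print(decoder_tables["PackedObjectLocations"][500])  # -> "/Event/HLT2/SomeLine/Particles"
```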
- DstData should not go to an external source (the DecReports) to know which number->string table to use
- SelReports should not go to an external source (the DecReports) to know which number->string table to use