Discussion about data formats

Summary

This issue follows up on cmsgemos#50 (closed), which should initially have focused more on implementation than on a detailed data format discussion. Please take the information in this post with caution: some of it is only deduced from source code, since official documentation is hard to find.

Depending on the situation, different data formats are currently used: AMC13 spy readout, FedKit and miniDAQ. We must understand and document them, and try to reduce their differences to simplify maintenance.

The AMC13 spy readout data format is the simplest to describe: it simply concatenates the events read out from the AMC13, without any additional header, marker or trailer. This data format is supported by the LDQM unpacker.
The FedKit data format, called the FED Raw Data format (FRD), is very similar to the AMC13 spy readout one, but it adds one header before each event. This format is understood by the LDQM unpacker, which skips the header, and by the Python unpacker, which extracts the event size from the header.
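
To make the difference between "skipping the header" and "using the event size from the header" concrete, here is a minimal Python sketch of a size-driven reader. The header layout used here (a 16-byte header with a little-endian 32-bit payload size at byte offset 4) is purely an assumption for illustration; it is not the actual FRD layout, which would have to be taken from the BU source code. The function name `iter_events` and all of its parameters are hypothetical as well.

```python
import struct
from typing import BinaryIO, Iterator

# ASSUMED layout, for illustration only: these values are NOT the real
# FRD header definition and must be replaced by the actual ones.
HEADER_SIZE = 16      # size of the per-event header, in bytes (assumption)
SIZE_OFFSET = 4       # offset of the event-size field in the header (assumption)
SIZE_FORMAT = "<I"    # little-endian unsigned 32-bit integer (assumption)


def iter_events(
    stream: BinaryIO,
    header_size: int = HEADER_SIZE,
    size_offset: int = SIZE_OFFSET,
    size_format: str = SIZE_FORMAT,
    file_header_size: int = 0,
) -> Iterator[bytes]:
    """Yield the raw payload of each event from a header-framed stream.

    The per-event header is read only to obtain the payload size and is
    then dropped, so the yielded bytes are what a header-less
    (spy-readout-like) unpacker would expect for a single event.
    """
    # Optionally skip a per-file header before the first event.
    if file_header_size:
        stream.read(file_header_size)

    while True:
        header = stream.read(header_size)
        if not header:
            return  # clean end of file
        if len(header) < header_size:
            raise ValueError("truncated event header")

        (event_size,) = struct.unpack_from(size_format, header, size_offset)

        payload = stream.read(event_size)
        if len(payload) < event_size:
            raise ValueError("truncated event payload")
        yield payload
```

With the header-less spy readout format, no such size field is available in front of each event, so an unpacker has to find the event boundaries from the event content itself.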

From what can be extracted from the source code, the header is added by the BU (see the EventInfo class). In addition, a per-file header seems to be added (see the FileInfo class). The same headers are a priori added during readout in miniDAQ. However, those files are only destined for the FU and are read out by the HLT with CMSSW (see this file and that file), with the headers defined here and there.
The miniDAQ data format is produced at the output of the HLT and is called the streamer data format. Whether the actual conversion is done by the FU, BU or macro-merger is irrelevant: this is the file format that is stored to disk at P5 and transferred to Tier-0. (Can it already be the EDM data format?)

The miniDAQ data format seems to be partially supported by the Python unpacker.

Matching the AMC13 spy readout data format (under our control) with the FedKit/BU/FRD data format seems to be a good idea: a single unpacker code path could then handle both. Unpacking the miniDAQ/streamer data format seems more complicated, but possible. Alternatives would be (1) to obtain the data directly in the FRD data format or (2) to convert between data formats with the help of CMSSW.
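
As a rough illustration of what matching the formats could mean on the writing side, the sketch below frames each spy-readout event with a small per-event header while the event boundaries are still known. The header used here (a single little-endian 32-bit payload size) is a placeholder assumption, not the FRD header; a real implementation would have to reproduce the header actually written by the BU. The name `write_framed_events` is hypothetical.

```python
import struct
from typing import BinaryIO, Iterable

# PLACEHOLDER header: only the payload size in bytes, little-endian.
# This is an assumption for illustration and NOT the real FRD header.
HYPOTHETICAL_HEADER = struct.Struct("<I")


def write_framed_events(stream: BinaryIO, events: Iterable[bytes]) -> int:
    """Write each event preceded by a header carrying its size.

    Returns the number of events written. The writer knows where each
    event ends at readout time, so recording the size here spares the
    unpacker from recovering the boundaries from the event content.
    """
    count = 0
    for payload in events:
        stream.write(HYPOTHETICAL_HEADER.pack(len(payload)))
        stream.write(payload)
        count += 1
    return count
```

The design point is only that the framing is cheap to add where the boundaries are known; which header fields are actually needed depends on the real FRD definition.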

Despite being out of the scope of this issue, it is important to mention that the trigger alignment procedure will likely have to cope with the OTMB and/or EMTF DAQ data formats. Developing our own unpacker for these will be a burden; using CMSSW at least to convert them to a more appropriate data format for analysis will likely be required.

What is the expected correct behavior?

Data formats are understood, documented and, ideally, unified.