## The CMS Event-Builder package (EvB)
This package contains the source code of the CMS event builder
(EvB). Please refer to the paper [Performance of the CMS Event
Builder](https://inspirehep.net/files/6469f6a8850f8736acab83265d0e70d1)
for a general overview.
The EvB consists of three main XDAQ Applications:
- Event Manager (EVM), which orchestrates the event building. There is
only one EVM in any EvB system. The EVM typically receives the data
from the TCDS FED, which acts as the reference for the event order. Data
from further FEDs can be sent to the EVM; they are handled identically
to the other FEDs on the readout units (RUs).
- Readout Unit (RU) handles one or multiple FED streams. The data for
a given event (trigger) from all FEDs received by a RU are combined
into a super-fragment.
- Builder Unit (BU) combines the superfragments from all RUs into a
complete event and writes multiple events into a file.
In addition to the main applications, further code is available:
- dummyFEROL emulates a FEROL TCP/IP stream. It is used for testing
the other EvB applications.
- an extensive Python-based test and measurement suite
(cf. test/HowToRunFromScripts.txt)
- fedKit.py (in the scripts directory), which allows reading out a
single FEROL stream using a simple command-line interface. The fedKit
is used for lab-based test stands.
### Implementation
The EvB applications are based on the EvBApplication template, which
inherits from the standard XDAQ classes and provides convenience
methods. The code was developed against the C++98 standard and
mostly relies on the technologies available at that time. It has
been upgraded to use C++14 features where easily possible.
### Finite State Machine (FSM)
The EvB applications extensively use nested states and rely on
entry and exit functions of the states. boost::statechart is used as
the base. The EvBStateMachine template provides the base classes and state
transitions common to all EvB applications. Further specialization of
the state machine is done for the individual applications.
The state transitions are either triggered by SOAP messages from run
control (RCMS), or by internal events, e.g., error conditions. The
SOAP reply to RCMS contains the name of the innermost state. For
example, if the system is in the "Ready" state and RCMS sends the
"Enable" SOAP message, the state machine changes to the "Active" state,
which has an inner state "Enabled". In this case, the response to RCMS
is "Enabled". This is a synchronous state transition. The state "Active"
is internal to the EvB and not visible to the outside.
There are asynchronous state changes, too. In this case, the state
name reported to RCMS ends with "ing", e.g., "Configuring" or
"Draining". These states spawn a boost::thread which performs the
tasks which may not complete instantaneously. Once the tasks are done,
an internal event is generated (e.g. "DrainingDone") which triggers
another state transition, in this case back to the "Ready" state. The
entryAction of the "Ready" state sends an asynchronous notification to
RCMS that the new state "Ready" has been reached.
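As a minimal illustration of this nested-state pattern with boost::statechart (assuming the state and event names used above; the actual EvBStateMachine template and its per-application specializations additionally handle asynchronous states, SOAP notifications and the "Fail" event):
```cpp
#include <boost/statechart/event.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/state.hpp>
#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/transition.hpp>
#include <iostream>

namespace sc = boost::statechart;

// Events corresponding to the RCMS commands
struct Enable : sc::event<Enable> {};
struct Stop : sc::event<Stop> {};

struct Ready;
struct Active;
struct Enabled;

// The outermost state machine starts in "Ready"
struct StateMachine : sc::state_machine<StateMachine, Ready> {};

// "Ready" reacts to the "Enable" event by entering "Active"
struct Ready : sc::simple_state<Ready, StateMachine>
{
  typedef sc::transition<Enable, Active> reactions;
};

// "Active" is internal to the EvB; its initial inner state is "Enabled"
struct Active : sc::simple_state<Active, StateMachine, Enabled> {};

// "Enabled" is the innermost state whose name would be reported to RCMS.
// Entry and exit actions live in the constructor and destructor.
struct Enabled : sc::state<Enabled, Active>
{
  Enabled(my_context ctx) : my_base(ctx) { std::cout << "entering Enabled\n"; }
  ~Enabled() { std::cout << "leaving Enabled\n"; }
  typedef sc::transition<Stop, Ready> reactions;
};

int main()
{
  StateMachine fsm;
  fsm.initiate();               // enters "Ready"
  fsm.process_event(Enable());  // synchronous transition to "Active"/"Enabled"
  fsm.process_event(Stop());    // back to "Ready" (the real EvB goes via "Draining")
  return 0;
}
```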
In case of a failure, the special state event "Fail" can be
fired. This event takes an xcept::Exception and takes care of the error
reporting. The event causes the state machine to go to the "Failed"
state, which is a parallel state to the outermost state "AllOk". This
ensures that all exit actions of the nested inner states are called,
which clean up any resources. It is possible to go back from the
"Failed" state to "Halted" by sending the "Halt" event. However, this
functionality is not used by RCMS.
There are two state events to leave the "Enabled" (aka running)
state: "Stop" and "Halt". The "Stop" event causes a transition to the
state "Draining", during which the EvB applications are draining all
events. If further events are sent to the EvB (e.g., because the
trigger has not stopped), or if not all events can be built
successfully (e.g., because some FED fragments are missing), the EvB
stays in the "Draining" state. In this case, an external "Clear" event
has to be sent. This causes an internal re-configuration, which clears
out the remaining data and returns to the "Ready" state. If a "Halt"
event is received, any data is dropped and the EvB application
immediately returns to the "Halted" state.
There are further states that are communicated to RCMS, e.g., the
[BU states "Cloud", "Blocked",
etc.](https://twiki.cern.ch/twiki/bin/view/CMS/CDAQOnCallHowTOs#BU_blocks_throttles_or_shows_fun).
These states are used to monitor the system, but are not acted upon.
### Threads and inter-thread communication
The EvB applications use multiple threads based on
toolbox::task::WorkLoop. The applications use separate threads for
distinct tasks, and concurrent threads to process multiple events in
parallel. Locking between the threads has been kept to a minimum,
e.g., by using local structs for monitoring data which can be locked
independently of the main monitoring task that retrieves the
information every second.
The main inter-thread communication uses the custom
evb::OneToOneQueue. It allows one producer and one consumer to
concurrently enqueue and dequeue elements without locking. If
multiple threads need to enqueue or dequeue elements of the same
queue, external locking is required. The OneToOneQueue template
also provides methods to display the queue contents on the XDAQ
hyperdaq pages.
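A minimal sketch of the single-producer/single-consumer idea behind such a queue is shown below. It is not the actual evb::OneToOneQueue implementation; the class name, the use of std::atomic, and the fixed-size ring buffer are illustrative assumptions.
```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Lock-free ring buffer for exactly one producer and one consumer thread.
template <typename T>
class OneToOneQueueSketch
{
public:
  explicit OneToOneQueueSketch(std::size_t capacity)
    : buffer_(capacity + 1), head_(0), tail_(0) {}

  // Called by the single producer thread only.
  bool enqueue(const T& element)
  {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t next = (tail + 1) % buffer_.size();
    if (next == head_.load(std::memory_order_acquire))
      return false;                       // queue is full
    buffer_[tail] = element;
    tail_.store(next, std::memory_order_release);
    return true;
  }

  // Called by the single consumer thread only.
  bool dequeue(T& element)
  {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    if (head == tail_.load(std::memory_order_acquire))
      return false;                       // queue is empty
    element = buffer_[head];
    head_.store((head + 1) % buffer_.size(), std::memory_order_release);
    return true;
  }

private:
  std::vector<T> buffer_;
  std::atomic<std::size_t> head_;  // next element to dequeue
  std::atomic<std::size_t> tail_;  // next free slot to enqueue into
};
```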
### Configuration and monitoring
All configuration parameters are defined in a struct in the
corresponding Configuration.h of the application. The parameters are
accessible in the xdaq::Infospace of the application. There is a helper
class InfoSpaceItems which encapsulates often-used patterns.
A special case is the handling of the FEROL sources: the FEROL sources
are an xdata::Vector of xdata::Bag entries containing the connection parameters
for the FEROL (cf. readoutunit/Configuration.h). These values are
either filled by the configurator or, like the IP address, resolved at
configure time. The FEROL sources contain all FEROLs which are defined
in the FED builder connected to this RU and only change when a new DAQ
configuration is made. The xdata::Vector fedSourceIds contains the
FED IDs which are participating in a given run. This list is set by
the DAQ FM with the Configure SOAP message. The FED IDs in this list
must be a subset of the FEROL IDs defined in the FEROL sources.
The runNumber is defined in EvBStateMachine.h and is set by the SOAP
message received from the DAQ FM at Enable.
Each application has a monitoring xdaq::Infospace, which is mapped to
a corresponding flash list. Note that a copy of the flash lists is
kept in the EvB repository. However, the ones actually used are part
of the XaaS configurations. A few monitoring parameters are available
in the application infospace, too. They are updated on request. These
values are used for the standalone test/measurements, as it is easier
to access the predictable URL of the application infospace.
None of the worker threads update the monitoring infospace
directly. This avoids any locking between uncorrelated
threads. Instead, each task accumulates its own statistics, typically
encapsulated in a struct with its own mutex. The
evb::PerformanceMonitor class provides commonly used methods for
this. There is one thread in each application which runs every second
and which collects all monitoring quantities of the individual tasks
and updates the monitoring infospace.
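A minimal sketch of this per-task statistics pattern is shown below. The names are illustrative and not the actual evb::PerformanceMonitor interface; resetting the counters on each collection is an assumption to show how per-second rates can be derived.
```cpp
#include <cstdint>
#include <iostream>
#include <mutex>

// Per-task statistics, protected by the task's own mutex only.
struct TaskStats
{
  std::mutex mutex;
  uint64_t eventCount = 0;
  uint64_t byteCount = 0;

  // Called from the worker thread's hot path.
  void addEvent(uint64_t bytes)
  {
    std::lock_guard<std::mutex> guard(mutex);
    ++eventCount;
    byteCount += bytes;
  }
};

// Called once per second from the dedicated monitoring thread. The
// counters are copied under the task's own lock and reset, so the
// worker threads never touch the monitoring infospace directly.
void collectOncePerSecond(TaskStats& stats)
{
  uint64_t events, bytes;
  {
    std::lock_guard<std::mutex> guard(stats.mutex);
    events = stats.eventCount;
    bytes = stats.byteCount;
    stats.eventCount = 0;
    stats.byteCount = 0;
  }
  std::cout << "rate: " << events << " Hz, throughput: " << bytes << " B/s\n";
}
```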
Much more monitoring information is available on the hyperdaq
pages. These pages directly access the monitoring information
accumulated in the worker threads. Therefore, accessing the hyperdaq
pages can influence the performance of the EvB application. It is
not recommended to reload the pages automatically or at a high rate.
### Message passing
The messages used for the event building are based on I2O technology
and use the XDAQ PeerTransport layer, which encapsulates the
network-specific code. Thus, the EvB does not directly use or profit from
RDMA-specific methods to transfer the data.
The structs used for the I2O messages are defined in
I2OMessages.h. There are three types of messages: to request events,
to send the data, and to request the lumisection information.
#### Event requests
The EventRequest is sent by the BU to the EVM. The BU fills its TID,
the ID of the builder thread inside the BU (buResourceId), and the
number of events to request, as well as how many previously requested
events have been built. The BU adds a priority to the request, which
is based on its available resources. The lower the number, the more
resources the BU has available. In addition, the BU adds a time-stamp
based on its local clock. This allows measuring the round-trip time
of the request.
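An illustrative view of these fields is sketched below. The authoritative definition lives in I2OMessages.h; the names and types here are assumptions for readability and omit the I2O message frame.
```cpp
#include <cstdint>

// Reduced, illustrative view of the fields the BU fills into an
// EventRequest (the EVM later appends the list of assigned EvBids
// and the participating RU TIDs before forwarding it to the RUs).
struct EventRequestSketch
{
  uint32_t buTid;          // I2O target id of the requesting BU
  uint32_t buResourceId;   // id of the builder thread inside the BU
  uint16_t nbRequests;     // number of events requested
  uint16_t nbEventsBuilt;  // previously requested events that have been built
  uint16_t priority;       // lower value = more resources available on the BU
  uint64_t timeStampNS;    // local BU clock, used to measure the round-trip time
};
```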
The EventRequests from all BUs are handled on the EVM, which decides
which BU gets the next batch of events (triggers) assigned. The EVM
adds information to the EventRequest: the EvBids corresponding to the
number of events to be sent to the given BU. This number might be
lower than the one requested by the BU if not enough triggers are
available to fulfill the request. The EVM adds a list of
RU TIDs participating in the event building, too. This list is used
later by the BU to decide if it got the superfragments from all
RUs. This list can in principle be different for each request, which
would allow (temporarily) masking certain RUs. However, this
functionality has never been used.
The EVM sends the EventRequests to all RUs, which use this information
to pack the corresponding event fragments into a superfragment and to
deliver it to the requesting BU.
To keep the message rate low, several EventRequests are packed into a
ReadoutMsg. The packing is dynamic, i.e., individual EventRequests are
sent when the trigger rate is low, but more EventRequests are packed
into one ReadoutMsg when the rate increases.
#### SuperFragment
The SuperFragment contains the event data from all FEDs handled by the
given RU or EVM. The maximum size of the SuperFragment is given by the
configuration parameter 'blockSize', which must be less than or equal to
the corresponding message size on the PeerTransport layer. If
the data does not fit into a single message, multiple
SuperFragments are used.
The SuperFragment contains a list of FEDids which were dropped by the
RU. This list is empty unless the RU is configured to
'tolerateCorruptedEvents' or 'tolerateOutOfSequenceEvents', which have
not been used in production.
The DataBlockMsg contains all SuperFragments for all EvBids requested
in the EventRequest message. The DataBlockMsg size is at most
'blockSize' bytes. Several DataBlockMsgs are sent if needed. The
DataBlockMsg contains the ID of the builder thread inside the BU
(buResourceId), and the timestamp as found in the EventRequest
message. The first DataBlockMsg for a given EventRequest contains the
list of EvBids and the information on the RU TIDs, too.
#### LumiSectionInfo
The merger of the output data from the HLT needs to know for each
lumisection how many events in total have been seen. This information
is queried by each BU from the EVM when the BU finishes the
lumisection. The request is done by exchanging two I2O messages
(sketched below):
- RequestLumiSectionInfoMsg is sent from the BU to the EVM. It
contains the lumisection number for which the information is
requested.
- LumiSectionInfoMsg is the reply from the EVM to the requesting
BU. It contains the lumisection number and the total number of
events built for the given lumisection.
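A schematic view of the payloads of these two messages (names and types are assumptions; the authoritative definitions, including the I2O message frame, are in I2OMessages.h):
```cpp
#include <cstdint>

// Request sent from the BU to the EVM when it finishes a lumisection.
struct RequestLumiSectionInfoSketch
{
  uint32_t lumiSection;    // lumisection for which the event count is requested
};

// Reply from the EVM to the requesting BU.
struct LumiSectionInfoSketch
{
  uint32_t lumiSection;    // echoed lumisection number
  uint32_t nbEventsBuilt;  // total number of events built for this lumisection
};
```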
### EVM & RU
The EVM and RU applications share much functionality. Therefore, the
main functionality is implemented in the templated
readoutunit::ReadoutUnit class. A template approach has been chosen to
avoid calling virtual functions at a high rate.
The data from the FEROL TCP/IP streams is received in socket buffers
from pt::blit. The interface to pt::blit is implemented in
the classes readoutunit::FerolConnectionManager and
readoutunit::PipeHandler. The splitting of the socket buffers into
individual event fragments is done in readoutunit::SocketStream which
is a specialization of readoutunit::FerolStream. Each stream is
handled by a separate thread. All streams are owned by the
readoutunit::Input class. There are two special stream classes which
can replace the FerolStream class: LocalStream is used to generate
fake event-fragments and MetaDataStream is used to inject data
retrieved from DIP into the event stream.
Note that an alternative implementation of the socket handling is
available on the branch 'feature_94_evb_directSocketRead'. This code
uses boost::asio to read the sockets directly in
readoutunit::FerolStream and does not use pt::blit. This code has been
tested in daq3val and uses fewer resources than the pt::blit
approach. It improves the data locality and is less sensitive to NUMA
settings. However, it has not been used in production.
All FED fragments handled on a given RU/EVM and belonging to the same
trigger are combined into a super-fragment. The super-fragment is
built upon reception of an event request (ReadoutMsg) from the BU in the
readoutunit::BUproxy class. The handling of the ReadoutMsg is the main
difference between the EVM and the RUs. The specializations of the
template methods are found in EVM.cc and RU.cc, respectively:
- The EVM arbitrates the requests from the BUs according to priorities
in a round-robin scheme. The RUs process the requests in
chronological order.
- The EVM has one FED stream, the master stream, which dictates the
order of triggers. This is typically the TCDS FED. Each request from
the BU for a certain number of events is matched against the
available triggers. If not enough triggers are available, the EVM
waits up to 'maxTriggerAgeMSec' (defaults to one second) before
fulfilling the request. The EVM defines the EvBid, which is used as the
unique key for the event building. The EVM extracts the lumi-section
number from the TCDS payload, too. Once the list of EvBids
corresponding to a given BU request is defined, the EVM sends this
list as an extended ReadoutMsg to all RUs. The RUs use this list to
build their corresponding superfragment to send to the BU.
The consistency and sanity of the event fragments are checked when
receiving the data from the FEROLs, and when building the
super-fragment upon reception of the ReadoutMsg. The event numbers
have to be successive, the fragment structure with FED header and
trailer needs to be intact, and the event numbers of all fragments
must match the number of the master stream defined in the EVM. If
a serious problem is found, 'MismatchDetected' or
'EventOutOfSequence' state events are fired, which cause the run to
be stopped. There are several options to tolerate deviations from this
strict scheme, but they have never been used in production.
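The sketch below illustrates the event-number continuity check in isolation. The names are hypothetical; the real code additionally validates the FED header and trailer structure and fires the corresponding state events instead of throwing a plain exception.
```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>

// Hypothetical, reduced view of a FED fragment.
struct FragmentSketch
{
  uint16_t fedId;
  uint32_t eventNumber;  // trigger number taken from the FED header
};

// Complain if the fragment does not continue the event-number sequence
// for this FED stream.
void checkEventNumber(const FragmentSketch& fragment, uint32_t& expectedEventNumber)
{
  if (fragment.eventNumber != expectedEventNumber)
  {
    std::ostringstream msg;
    msg << "FED " << fragment.fedId
        << " sent event number " << fragment.eventNumber
        << " while " << expectedEventNumber << " was expected";
    throw std::runtime_error(msg.str());
  }
  ++expectedEventNumber;
}
```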
The building of superfragments is done concurrently in
'numberOfResponders' threads. The handling of the ReadoutMsg is
serialized by a mutex, but the memcpy'ing of the fragments from
the socket buffers into contiguous memory of the I2O message sent over
IB is done concurrently. The sending of the I2O message using pt::ibv
is again serialized, as the peer-transport layer is not reentrant.
### BU
The BU requests events from the EVM, receives all superfragments from the
EVM and RUs, and builds complete events. Complete events are checked
for consistency and written into a file residing on a local
RAMdisk. Each BU works independently from all other BUs. Therefore, a
failure of a BU does not stop the data taking. In addition to building
events, the BU acts as interface between run control (RCMS) and the
HLT infrastructure daemon hltd running on the same machine. The
interface between these is file-based. The BU provides the
configuration files retrieved from RCMS in a local directory (hltd in
the run directory) and [reduces or stops the processing speed depending
on the status of the
HLT](https://twiki.cern.ch/twiki/bin/view/CMS/CDAQOnCallHowTOs#BU_blocks_throttles_or_shows_fun)
running on the filter units (FUs) attached to the given BU machine.
The BU has 'numberOfBuilders' threads which request and build events
mostly independently. The event building operates on pointers to
the I2O buffers received from pt::ibv. Complete events are then handed
to 'numberOfDiskWriters' threads, which write the events into separate
files. The memcpy'ing of the data from I2O buffers into the files is
done in these threads using the system call writev.
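A minimal sketch of gather-writing one event with writev(2) is shown below. Buffer management, file handling and partial-write handling are simplified, and the names are illustrative.
```cpp
#include <sys/uio.h>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// One entry per (non-contiguous) I2O buffer that makes up a complete event.
struct BufferRef
{
  const void* data;
  size_t size;
};

// Write all fragments of one event to an already opened file descriptor
// in a single gather write.
void writeEvent(int fd, const std::vector<BufferRef>& fragments)
{
  std::vector<struct iovec> iov(fragments.size());
  for (size_t i = 0; i < fragments.size(); ++i)
  {
    iov[i].iov_base = const_cast<void*>(fragments[i].data);
    iov[i].iov_len = fragments[i].size;
  }

  const ssize_t written = ::writev(fd, iov.data(), static_cast<int>(iov.size()));
  if (written < 0)
    throw std::runtime_error(std::string("writev failed: ") + std::strerror(errno));
  // A production implementation would loop on partial writes and
  // respect the IOV_MAX limit; this sketch omits that for brevity.
}
```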