Draft: Change storage to `std::list` from `std::vector` (!291) · Merge requests · HepMC / HepMC3

Christian Holm Christensen requested to merge cholmcc_storage_is_list into master Nov 28, 2023

This is a big one, and may introduce some incompatibilities with 3rd party code. The commit also makes quite a bit of changes in code, and should not be accepted lightly. That said, I do believe it improves the code base.

Changes

Storage of GenVertex and GenParticle has been changed from std::vector to std::list.
The identifiers of GenVertex and GenParticle (i.e., data members id) are stored with the GenVertexData and GenParticleData.

Consequences

id of GenVertex and GenParticle are preserved on I/O (except for HEPEVT).
Since particles and vertices are stored in std::list containers it means that random access (e.g., vertices[id]) isn't directly supported and is less efficient. "Compatibility" member functions GenEvent::particle(int) and GenEvent::vertex(int) are provided.
When particles and vertices were stored in std::vector, the operations of adding or removing such objects to a GenEvent had at least a complexity of O(n log n), where n is the size of the event. With this change, the operation is more like O(1). The reason for this is simply because indices does not have to be updated.

This can speed up pruning (see examples/PruneExample) of an event by a factor 4 almost. See figure below

The figure shows the effect of pruning (top left is the correlation of number of particles before and after pruning, top right, the ratio of after to before), and the timing using the old (std::vector) and new (std::list) containers. Bottom left Is the ratio of mean time of new to old, and bottom right shows the mean time. The function fitted to the data is

A n^p \log(n)

The use of std::list also improves read, as shown in the figure below.

To the left is the ratio of the mean time to read and event using std::list to std::vector. Again, the right panel shows the mean time and the function

A n^p \log(n)

is fitted to the data. Again, we see a 3-fold increased speed, especially for larger events.
The protobuf messages of particles and vertices has an extra id field. This is marked as optional which means that backward compatibility with existing protobuf data is kept.
By moving to a more suitable container type (std::list as opposed to std::vector) we can eliminate a lot of new and delete calls, which, especially with large events, is very inefficient.

Another storage option would be std::unordered_map which would allow for random access, but I believe that use case is rather infrequent, and a faster alternative seems to be std::list. One could perhaps also opt for std::forward_list, since we generally traverse particles or vertices in one direction only.

Yours, Christian

Draft: Change storage to `std::list` from `std::vector`

Merge request reports