Collecting outputs from parallel runs is very slow

Several people have noticed that when a large number of events and files are processed in a parallel mode (batch system, local multiprocessing, etc.), the individual tasks run quickly, but the final collection step can take a long time. It would be good to understand why this happens and to speed up that step as much as possible; merging many pandas dataframes should not take this long.
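As a starting point for investigation, here is a minimal timing sketch. It does not use the project's actual collection code: it assumes, purely for illustration, that each task returns a small pandas dataframe, and it compares merging them one at a time in a loop against a single `pd.concat` call. All function names (`make_task_outputs`, `merge_incremental`, `merge_once`, `timed`) and the column names are hypothetical. Whether the real slowdown comes from this pattern is not known; the sketch only shows one way to measure it.

```python
import time

import numpy as np
import pandas as pd


def make_task_outputs(n_tasks=200, n_rows=50, seed=0):
    """Build fake per-task dataframes (hypothetical stand-in for real outputs)."""
    rng = np.random.default_rng(seed)
    return [
        pd.DataFrame({
            "bin": np.arange(n_rows),
            "count": rng.integers(0, 100, n_rows),
        })
        for _ in range(n_tasks)
    ]


def merge_incremental(frames):
    """Merge one dataframe at a time; every step copies the growing result."""
    total = frames[0]
    for df in frames[1:]:
        total = pd.concat([total, df], ignore_index=True)
    return total


def merge_once(frames):
    """Merge all dataframes in a single pd.concat call."""
    return pd.concat(frames, ignore_index=True)


def timed(func, frames):
    """Return the result of func(frames) and the elapsed wall time in seconds."""
    start = time.perf_counter()
    result = func(frames)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    frames = make_task_outputs()
    for func in (merge_incremental, merge_once):
        _, elapsed = timed(func, frames)
        print(f"{func.__name__}: {elapsed:.4f} s")
```

If the incremental variant turns out to be much slower at realistic task counts, that would point at the merge pattern; if both are fast, the time is more likely spent elsewhere (for example reading task outputs from disk), and a profiler run over the real collection step would be the next thing to try.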