Improve package structuring
In an attempt to better understand (and document) the components of CTA, I started working on an overview of all the different packages that we have in the project: https://eoscta.docs.cern.ch/latest/dev/building_cta/cmake_overview/. Note that this overview is (so far) solely based on the contents of the CMakeLists.txt
files. It is still WIP, so not everything is complete/correct yet.
Quick note: when I refer to a package here it is somewhat loosely defined. I will use it to mean a collection of files that generate some executables/targets/libraries. I'm also fine with the word module
, but I think that package is slightly more descriptive (given e.g. the context of RPMs).
Areas for Improvement
Anyway, while reviewing our code base and the resulting diagram, I noticed a few things:
- It is unclear what constitutes the public interface of a package. This makes it more difficult for developers to know when their changes affect or break something in other packages. Additionally, it does not encourage careful thinking about what classes other packages should be able to use, as it can be tempting to just include another random header file from package x.
- Packages lack a standard directory structure. There are often lots of files in a single directory without any grouping into further subdirectories. This means it is not clear what subcomponents a package consists of, which makes life for developers more difficult.
- It is unclear whether a directory in the root repository is a package or something else. Directories such as
cta-release
orpython
do not suggest them to be packages or source code related. Having these package directories on the same level as non-package directories such ascmake
,continuousintegration
, orassets
is a bit confusing. Additionally, this makes searching in IDEs or usinggrep
more difficult, because search results in these non-source code directories are getting mixed with search results in the packages. If everything is grouped in e.g. apackages/
directory, then the search results are also grouped. - Looking at the diagram, there are quite a few packages that have a relatively high degree of fan-out (dependency-wise). This might be unavoidable, but this could be a good opportunity to review this and potentially clean it up a little bit.
Proposal
I do not want to propose a giant refactor that will take ages to complete, but I do think we should consider restructuring this. Almost all the changes required to improve this have to do with directory structure or are cmake
related. Of course, that still takes a little bit of time (which I would be happy to invest), but it is not something that takes a humongous effort. The primary challenge will be ensuring that people currently working on features in specific packages do not end up with a thousand merge conflicts, but more on this later.
In the long run (especially with the relatively high turnover rate of developers in this team), improving this can save developers time and reduce technical debt. It should increase the maintainability and readability of the project by a lot and make the dev team happy (primarily after this is done probably; I am not vouching for the "during" phase).
My proposal is to clearly separate the packages from the rest and give each (c++) package a standardized directory structure:
/
├── packages/ # Directory containing all packages
│ ├── packageA/
│ │ ├── include/packageA/
│ │ ├── src/
│ │ ├── tests/
│ │ ├── CMakeLists.txt
│ │ └── README.md
│ ├── packageB/
│ │ ├── include/packageB/
│ │ ├── src/
│ │ ├── tests/
│ │ ├── CMakeLists.txt
│ │ └── README.md
Each package has:
-
include/packageX
: contains the header files that constitute the public interface of this package. This way, developers can clearly see what parts of this package can be used from other parts. We useinclude/packageX/
so that we can simply include theinclude/
directory withcmake
and it won't change anything about the current import statements elsewhere in the code. -
src
: contains the source code files and private header files -
tests/
: contains the (unit) tests of this package if present. -
CMakeLists.txt
: self-explanatory. Should ideally be somewhat consistent with theCMakeLists.txt
of other packages (i.e. try to define libraries, targets, executables in the same order where possible) -
README.md
: brief description of the package. We could also make this more elaborate and then sync thisREADME.md
file with our documentation website; then we only need to maintain it in one place. That is for later though.
This standardized structure is common in the industry and similar to how many other large C++ projects do this (e.g. opencv or grpc).
If we do this restructuring progressively for each package, this will allow us to:
- Review the directory structure of each package and improve the grouping of files. This will improve readability and potentially allow us to spot further opportunities for decoupling subcomponents in the future.
- Review out-of-date or irrelevant files, which allows us to clean up and simplify the repository.
- Review what the current public interface of the package is and what we want it to be. Similar to the above, this will reduce coupling and make it easier for developers to work on a particular component of the system without needing to know the details of other components.
- Review all the dependencies of the package. If we spot things that can be simplified or removed, then this will improve decoupling. If we don't find anything, then it's good that we at least know : )
Plan of attack
There are a few important considerations in doing this:
- Blindly refactoring this will mess up existing branches and cause a merge-conflict nightmare
- We need to be careful not to mess up the git history. I.e.
git blame
should still be useful and the Git history should be maintained.
To combat the first issue, I think it's best if we go on a per-package basis; don't move all packages at once. We only move a package when the people working on said package (if any) give the green light. This should prevent messing up existing branches as much as possible. This is also one of the reasons why #813 is important; the fewer active branches we have, the less there is to mess up.
We also don't do everything and once. First, we move stuff around, and then at a later stage, we can look at improving the actual content. This allows for easier backtracking in case things go wrong, and also allows us to utilise .git-blame-ignore-revs
to prevent messing up git blame
. Additionally, to preserve the git history, we execute the move of files first in a single commit (using git mv
) and then later make the fixes. This should allow Git to properly track the renames.
Stage 1: Moving things around
Start by simply moving the packages into this structure. Concretely that means:
- Moving
packageX
topackages/packageX
- Separate out the headers currently used by other packages and put them in
include/
- Update the
CMakeLists.txt
(of both the package and the packages using this package) so that import statements don't need to be updated.
Notes:
- The move should be a single commit, where all the moving is done using
git mv
. - The next commit should be any potential fixes to get the new location to work.
- The first commit should be added to the
.git-blame-ignore-revs
file. As such, initially, each MR should consist of two commits.
Stage 2: Cleaning up each package
Once all packages have been moved, we can start with:
- looking at improving the grouping of files within a package. This should only result in files moving and import statements being updated. As we have separated out the public interface, this does not affect other packages. This should once again be a commit that is added to the
.git-blame-ignore-revs
. - Separating out the tests from the source code (if relevant)
- Removing old files (e.g. leftover pdf files or similar).
- Cleaning up the
CMakeLists.txt
. - Adding a (very) brief description to the README
Again, this is done on a per-package basis and not all at once.
Stage 3: Improve dependencies
At this point, we should have a look at the public interface of each package and see whether it is what we want it to be. Does a package have more dependencies than it needs? Is there a way that we can simplify things? This last stage is rather open, but at least it is now easier to do a proper review of the dependencies and add this to the documentation.
Conclusion
All in all, I think this is an important step to make for the long-term health of the project. I know I have been proposing quite a few refactorings lately, but things such as this can make a significant positive long-term impact for (comparatively) little effort. I can take the lead on this, as this is also a nice way to get more familiar with all the components in the project, making me a more efficient dev in the future : )
Thank you for reading my manifesto.
Further suggestions and criticisms are of course welcome.