From 17fa3df656d14ed3df89f0befa4b3b2e1240269d Mon Sep 17 00:00:00 2001
From: Anthony Correia <anthony.correia@cern.ch>
Date: Mon, 12 Feb 2024 02:45:52 +0100
Subject: [PATCH 01/11] Fix typo

---
 readme/setup/1_installation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/readme/setup/1_installation.md b/readme/setup/1_installation.md
index 56c05344..69811305 100644
--- a/readme/setup/1_installation.md
+++ b/readme/setup/1_installation.md
@@ -45,7 +45,7 @@ For each new session, follow these steps to prepare your environment:
 1. Source the `setup/setup.sh` file, which accomplishes the following:
     - Defines the environment variable `ETX4VELO_REPO`, containing the absolute path
     to this repository.
-    - Adds `montetracko`, `etx4velo and` `etx4velo/pipeline` to the `PYTHONPATH`.
+    - Adds `montetracko`, `etx4velo` and `etx4velo/pipeline` to the `PYTHONPATH`.
     ```bash
     source setup/setup.sh
     ```
-- 
GitLab


From ce6d591119afe55e731ce2aa72c3b803d0de80c3 Mon Sep 17 00:00:00 2001
From: Anthony Correia <anthony.correia@cern.ch>
Date: Mon, 12 Feb 2024 03:10:37 +0100
Subject: [PATCH 02/11] Update path to collect_test_samples.py

---
 readme/guide/3_training.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/readme/guide/3_training.md b/readme/guide/3_training.md
index b98b5869..579598bf 100644
--- a/readme/guide/3_training.md
+++ b/readme/guide/3_training.md
@@ -66,7 +66,7 @@ The essential steps are outlined below:
     source setup/setup.sh
     cd etx4velo
     # Run the test sample collection script
-    ./evaluation/collect_test_samples.py
+    ./scripts/collect_test_samples.py
     ```
    Once you've completed these steps, the configuration for the test samples will be available
    in the `etx4velo/evaluation/test_samples.yaml` file, ready for use in the next steps.
-- 
GitLab


From 2be3dea2d5d57af8a2717c282b627cd5fe290738 Mon Sep 17 00:00:00 2001
From: Anthony Correia <anthony.correia@cern.ch>
Date: Mon, 12 Feb 2024 03:11:12 +0100
Subject: [PATCH 03/11] Write first section of tutorial about configuration

---
 readme/tutorial/01_configuration.ipynb | 203 +++++++++++++++++++++++++
 1 file changed, 203 insertions(+)
 create mode 100644 readme/tutorial/01_configuration.ipynb

diff --git a/readme/tutorial/01_configuration.ipynb b/readme/tutorial/01_configuration.ipynb
new file mode 100644
index 00000000..5fad62c7
--- /dev/null
+++ b/readme/tutorial/01_configuration.ipynb
@@ -0,0 +1,203 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# ETX4VELO Configuration\n",
+    "\n",
+    "Welcome to this second section regarding the configuration of the ETX4VELO repository.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Repository Organisation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The root directory of the ETX4VELO repository contains several folders:\n",
+    "- `etx4velo`: the main repository that contains the models, pipeline configurations,\n",
+    "notebooks, etc.\n",
+    "- `readme`: the README markdown files used in the documentation website.\n",
+    "- `docs`: the source files to build the documentation with sphinx.\n",
+    "- `setup`: the environment and configuration files.\n",
+    "\n",
+    "The main folder is `etx4velo`, where you can find the following folders\n",
+    "- `pipeline`: is the heart of ETX4VELO, containing all the packages and models.\n",
+    "- `notebooks`: contains Notebooks for interactively run trainings and evaluations.\n",
+    "- `pipeline_configs`: contains all the pipeline configurations for training and inference.\n",
+    "- `scripts`: contains scripts to run some steps of the pipeline from the command line.\n",
+    "- `snakefiles`: Snakemake files to run automated and reproducible evaluation of\n",
+    "the ETX4VELO pipeline.\n",
+    "- `analyses`: random notebooks I use to debug or understand problems\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup file\n",
+    "\n",
+    "First, source the `setup/setup.sh` file.\n",
+    "```bash\n",
+    "source setup/setup.sh\n",
+    "```\n",
+    "This defines the environment variable `ETX4VELO_REPO`, containing the absolute path\n",
+    "to this repository, and add `montetracko`, `etx4velo`, `etx4velo/pipeline` to \n",
+    "the `PYTHONPATH` environment variable."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "ETX4VELO_REPO environment variable: /home/acorreia/Documents/tracking/etx4velo\n",
+      "\n",
+      "PYTHONPATH content:\n",
+      "['/home/acorreia/Documents/tracking/etx4velo/readme/tutorial',\n",
+      " '/home/acorreia/Documents/tracking/etx4velo/readme/tutorial',\n",
+      " '/home/acorreia/Documents/tracking/etx4velo/etx4velo',\n",
+      " '/home/acorreia/Documents/tracking/etx4velo/etx4velo/pipeline',\n",
+      " '/home/acorreia/Documents/tracking/etx4velo/montetracko',\n",
+      " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python310.zip',\n",
+      " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python3.10',\n",
+      " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python3.10/lib-dynload',\n",
+      " '',\n",
+      " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python3.10/site-packages']\n"
+     ]
+    }
+   ],
+   "source": [
+    "import os\n",
+    "import sys\n",
+    "from pprint import pprint\n",
+    "\n",
+    "print(\"ETX4VELO_REPO environment variable:\", os.environ[\"ETX4VELO_REPO\"])\n",
+    "\n",
+    "print(\"\\nPYTHONPATH content:\")\n",
+    "pprint(sys.path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Configuration Files"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "First edit the `setup/common_config.yaml` file to your liking, more particularly\n",
+    "the `directories` section:\n",
+    "```yaml\n",
+    "directories:\n",
+    "  # Directory where the processed files are saved. You may need space to store this folder.\n",
+    "  data_directory: /scratch/acorreia/data\n",
+    "  # Directory where the model parameters are saved during training\n",
+    "  artifact_directory: artifacts\n",
+    "  # The plots and reports of a given experiment are saved under this folder\n",
+    "  performance_directory: output\n",
+    "  # Directory that contains the reference (test) samples\n",
+    "  reference_directory: /scratch/acorreia/reference_samples\n",
+    "  # Directory that contains other figures, used for presentations for instance\n",
+    "  analysis_directory: output/analysis\n",
+    "  # Directory that contains the exported model\n",
+    "  export_directory: model_export\n",
+    "```\n",
+    "\n",
+    "The relative paths are expressed w.r.t. the `etx4velo` folder of the repository.\n",
+    "\n",
+    "The configuration that you are likely to change are:\n",
+    "- `data_directory`\n",
+    "- `reference_directory`: change it to where you extracted the `reference_samples_tutorial.tar.lz4` archive"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For convenience, these directories can be retrieved using the `cdirs` object."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "cdirs.data_directory          : /scratch/acorreia/data\n",
+      "cdirs.artifact_directory      : /home/acorreia/Documents/tracking/etx4velo/etx4velo/artifacts\n",
+      "cdirs.performance_directory   : /home/acorreia/Documents/tracking/etx4velo/etx4velo/output\n",
+      "cdirs.reference_directory     : /scratch/acorreia/reference_samples\n",
+      "cdirs.analysis_directory      : /home/acorreia/Documents/tracking/etx4velo/etx4velo/output/analysis\n",
+      "cdirs.export_directory        : /home/acorreia/Documents/tracking/etx4velo/etx4velo/model_export\n"
+     ]
+    }
+   ],
+   "source": [
+    "from utils.commonutils.config import cdirs\n",
+    "\n",
+    "for dirtype in [\"data\", \"artifact\", \"performance\", \"reference\", \"analysis\", \"export\"]:\n",
+    "    attribute_name = f\"{dirtype}_directory\"\n",
+    "    print(f\"{f'cdirs.{attribute_name}':<30}:\", getattr(cdirs, attribute_name))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Collect Test Samples"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Please move to the `etx4velo` directory and run the following script\n",
+    "\n",
+    "```bash\n",
+    "./scripts/collect_test_samples.py\n",
+    "```\n",
+    "which produces, the  `etx4velo/test_samples.yaml` file, which is the configuration\n",
+    "for the test samples. The test samples are collected by navigating through the folders\n",
+    "in `cdirs.reference_directory`.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
-- 
GitLab
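
The notebook added in this patch states that relative directories in `setup/common_config.yaml` are resolved with respect to the `etx4velo` folder of the repository. Below is a minimal sketch of what that implies for the `cdirs` attributes, assuming the default `artifact_directory: artifacts` entry; the check itself is illustrative and not part of the repository.

```python
import os
import os.path as op

from utils.commonutils.config import cdirs

# `artifact_directory` is a relative entry ("artifacts") in setup/common_config.yaml,
# so it should resolve under the `etx4velo` folder of the repository.
expected = op.join(os.environ["ETX4VELO_REPO"], "etx4velo", "artifacts")
print("cdirs.artifact_directory:", cdirs.artifact_directory)
print("resolved under etx4velo/:", str(cdirs.artifact_directory) == expected)
```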


From 69fb2a08679d3835c040dea70f6d3311a4f45348 Mon Sep 17 00:00:00 2001
From: Anthony Correia <anthony.correia@cern.ch>
Date: Tue, 13 Feb 2024 09:22:13 +0100
Subject: [PATCH 04/11] Fix all type hints in ModelBase

---
 .../pipeline/utils/modelutils/basemodel.py    | 45 ++++++++++++++-----
 1 file changed, 34 insertions(+), 11 deletions(-)

diff --git a/etx4velo/pipeline/utils/modelutils/basemodel.py b/etx4velo/pipeline/utils/modelutils/basemodel.py
index 6c70cf0d..85726279 100644
--- a/etx4velo/pipeline/utils/modelutils/basemodel.py
+++ b/etx4velo/pipeline/utils/modelutils/basemodel.py
@@ -24,7 +24,7 @@ class ModelBase(LightningModule):
         super().__init__()
         self._trainset = None
         self._valset = None
-        self.testset: typing.List[Data] | None = None
+        self._testset: typing.List[Data] | None = None
         self.save_hyperparameters(hparams)
         self._idx_trainset_split: int | None = None
         self._trainset_split_indices: typing.List[npt.NDArray] | None = None
@@ -32,7 +32,7 @@ class ModelBase(LightningModule):
     def setup(self, stage):
         self.load_partition("train")
         self.load_partition("val")
-        self.testset = None
+        self._testset = None
 
     @property
     def lazy(self) -> bool:
@@ -62,9 +62,18 @@ class ModelBase(LightningModule):
         if self._valset is None:
             self.load_partition(partition="val")
         assert self._valset is not None
-        assert not isinstance(self._valset, LazyDatasetBase)
         return self._valset
 
+    @property
+    def testset(self) -> typing.List[Data]:
+        if self._testset is None:
+            raise ValueError(
+                "Test set not loaded. Please load it with `fetch_partition` "
+                "or `load_testset_from_directory`."
+            )
+        else:
+            return self._testset
+
     @valset.setter
     def valset(self, batches: typing.List[Data]):
         self._valset = batches
@@ -79,8 +88,15 @@ class ModelBase(LightningModule):
     def train_dataloader(self):
         """Train dataloader, with random splitting of epochs."""
         print("Load train dataloader.")
-        if len(self.trainset) > 0:
+        trainset = self.trainset
+        if len(trainset) > 0:
             if (trainset_split := self.hparams.get("trainset_split")) is not None:
+                if not isinstance(trainset, LazyDatasetBase):
+                    raise TypeError(
+                        "In order to use the `trainset_split` property, "
+                        "the trainset should be loaded in a lazy way. "
+                        "Please consider switching `lazy` to `True`."
+                    )
                 if self._trainset_split_indices is None:
                     print("Define random splitting of epochs")
                     self.load_trainset_split_indices(trainset_split)
@@ -91,8 +107,8 @@ class ModelBase(LightningModule):
                 print("Load subset number", self._idx_trainset_split)
 
                 trainset = Subset(
-                    self.trainset,
-                    self._trainset_split_indices[self._idx_trainset_split],
+                    trainset,
+                    self._trainset_split_indices[self._idx_trainset_split],  # type: ignore
                 )
 
                 # Prepare next already
@@ -104,20 +120,25 @@ class ModelBase(LightningModule):
             else:
                 trainset = self.trainset
                 shuffle = True
-            return DataLoader(trainset, batch_size=1, num_workers=8, shuffle=shuffle)
+            return DataLoader(
+                trainset,  # type: ignore
+                batch_size=1,
+                num_workers=8,
+                shuffle=shuffle,
+            )
         else:
             return None
 
     def val_dataloader(self):
         """Validation dataloader."""
         if len(self.valset) > 0:
-            return DataLoader(self.valset, batch_size=1, num_workers=8)
+            return DataLoader(self.valset, batch_size=1, num_workers=0)
         else:
             return None
 
     def test_dataloader(self):
         """Test dataloader."""
-        if self.testset is not None and len(self.testset) > 0:
+        if self._testset is not None and len(self._testset) > 0:
             return DataLoader(self.testset, batch_size=1, num_workers=8)
         else:
             return None
@@ -180,7 +201,7 @@ class ModelBase(LightningModule):
                 pickles files.
         """
         lazy_dataset = self.get_lazy_dataset(input_dir=input_dir, **kwargs)
-        self.testset = self.fetch_datasets(lazy_dataset=lazy_dataset)
+        self._testset = self.fetch_datasets(lazy_dataset=lazy_dataset)
 
     def get_lazy_dataset_partition(
         self,
@@ -284,9 +305,11 @@ class ModelBase(LightningModule):
         if partition == "train":
             self._trainset = datasets
         elif partition == "val":
+            assert not isinstance(datasets, LazyDatasetBase)  # shouldn't be the case
             self._valset = datasets
         else:
-            self.testset = datasets
+            assert not isinstance(datasets, LazyDatasetBase)  # shouldn't be the case
+            self._testset = datasets
 
     def get_input_data(self, all_features: torch.Tensor) -> torch.Tensor:
         return get_input_features(
-- 
GitLab
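
A short sketch of the behaviour introduced by this patch, assuming a `ModelBase` instantiated with minimal hyperparameters: accessing `testset` before anything is loaded now raises a `ValueError` instead of silently returning `None`.

```python
from utils.modelutils.basemodel import ModelBase

# Hypothetical minimal instantiation; real configurations pass more hyperparameters.
model = ModelBase(hparams={})

try:
    model.testset  # the property raises while self._testset is still None
except ValueError as err:
    print("testset not loaded yet:", err)

# After `load_testset_from_directory(...)` (or `fetch_partition`), the property
# returns the list of `Data` objects instead of raising.
```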


From db2babbc18e3c16e4f8b0fd73f233efbfac6cf46 Mon Sep 17 00:00:00 2001
From: Anthony Correia <anthony.correia@cern.ch>
Date: Tue, 13 Feb 2024 09:22:51 +0100
Subject: [PATCH 05/11] Finish part 1 about configuration

---
 readme/tutorial/01_configuration.ipynb | 158 +++++++++++++++++++------
 1 file changed, 119 insertions(+), 39 deletions(-)

diff --git a/readme/tutorial/01_configuration.ipynb b/readme/tutorial/01_configuration.ipynb
index 5fad62c7..dad916d7 100644
--- a/readme/tutorial/01_configuration.ipynb
+++ b/readme/tutorial/01_configuration.ipynb
@@ -52,31 +52,25 @@
     "the `PYTHONPATH` environment variable."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once this file is sourced, you may launch `jupyter-lab`\n",
+    "```bash\n",
+    "cd etx4velo\n",
+    "jupyter-lab --port 8889 --no-browser\n",
+    "```\n",
+    "and open this notebook on your internet browser.\n",
+    "\n",
+    "You can inspect your environment variables and the PYTHONPATH content:"
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": 11,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "ETX4VELO_REPO environment variable: /home/acorreia/Documents/tracking/etx4velo\n",
-      "\n",
-      "PYTHONPATH content:\n",
-      "['/home/acorreia/Documents/tracking/etx4velo/readme/tutorial',\n",
-      " '/home/acorreia/Documents/tracking/etx4velo/readme/tutorial',\n",
-      " '/home/acorreia/Documents/tracking/etx4velo/etx4velo',\n",
-      " '/home/acorreia/Documents/tracking/etx4velo/etx4velo/pipeline',\n",
-      " '/home/acorreia/Documents/tracking/etx4velo/montetracko',\n",
-      " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python310.zip',\n",
-      " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python3.10',\n",
-      " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python3.10/lib-dynload',\n",
-      " '',\n",
-      " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python3.10/site-packages']\n"
-     ]
-    }
-   ],
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
    "source": [
     "import os\n",
     "import sys\n",
@@ -99,6 +93,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "To properly use the ETX4VELO repository, there is still a few things you need to do.\n",
+    "\n",
     "First edit the `setup/common_config.yaml` file to your liking, more particularly\n",
     "the `directories` section:\n",
     "```yaml\n",
@@ -133,22 +129,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 31,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "cdirs.data_directory          : /scratch/acorreia/data\n",
-      "cdirs.artifact_directory      : /home/acorreia/Documents/tracking/etx4velo/etx4velo/artifacts\n",
-      "cdirs.performance_directory   : /home/acorreia/Documents/tracking/etx4velo/etx4velo/output\n",
-      "cdirs.reference_directory     : /scratch/acorreia/reference_samples\n",
-      "cdirs.analysis_directory      : /home/acorreia/Documents/tracking/etx4velo/etx4velo/output/analysis\n",
-      "cdirs.export_directory        : /home/acorreia/Documents/tracking/etx4velo/etx4velo/model_export\n"
-     ]
-    }
-   ],
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
    "source": [
     "from utils.commonutils.config import cdirs\n",
     "\n",
@@ -177,6 +160,103 @@
     "for the test samples. The test samples are collected by navigating through the folders\n",
     "in `cdirs.reference_directory`.\n"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Pipeline Configuration"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The pipeline configurations are stored in in the `pipeline_configs` directory.\n",
+    "Let's focus on the `pipeline_configs.yaml` configuration.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "config_path = os.path.join(cdirs.repository, \"etx4velo\", \"pipeline_configs\", \"example.yaml\")\n",
+    "print(\"config_path:\", config_path)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To load the configuration, you should always use the `load_config` function,\n",
+    "because it alters the configuration for convenience."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from utils.commonutils.config import load_config\n",
+    "\n",
+    "config = load_config(config_path)\n",
+    "assert config == load_config(config) # pass-through if it already a dictionary, for convenience!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pprint(config)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The configuration is essentially a dictionary of dictionaries.\n",
+    "It is divided into several sections, corresponding the pipeline steps."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"Configuration sections:\", list(config.keys()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "First look at the `common` section:\n",
+    "```yaml\n",
+    "common:\n",
+    "  experiment_name: example # Optional: this is automatically set to the name of the config file\n",
+    "  # Name of the test datasets to use (defined in `evaluation/test_samples.yaml`)\n",
+    "  test_dataset_names:\n",
+    "  - minbias-sim10b-xdigi_v2.4_1496\n",
+    "  - minbias-sim10b-xdigi_v2.4_1498\n",
+    "  detector: velo # default to the first entry in `detectors` in `common_config.yaml`\n",
+    "```\n",
+    "which defines:\n",
+    "- the `experiment_name`, set to the name of the configuration file by `load_config`!\n",
+    "- the test dataset names made available to the pipeline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We'll go over the next sections of the configuration in subsequent parts of this tutorial."
+   ]
   }
  ],
  "metadata": {
-- 
GitLab
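
Since `load_config` passes dictionaries through unchanged (as the notebook above asserts), helper code can accept either a path or an already-loaded configuration. A small illustrative sketch; the `summarise_config` helper is hypothetical and only uses keys shown in the `common` section.

```python
import os.path as op

from utils.commonutils.config import cdirs, load_config

config_path = op.join(cdirs.repository, "etx4velo", "pipeline_configs", "example.yaml")


def summarise_config(path_or_config) -> None:
    """Print the experiment name and test datasets of a pipeline configuration."""
    config = load_config(path_or_config)  # works for a path or an already-loaded dict
    print("experiment:", config["common"]["experiment_name"])
    print("test datasets:", config["common"]["test_dataset_names"])


summarise_config(config_path)                # from the YAML file
summarise_config(load_config(config_path))   # from a dictionary
```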


From f1cd75710337d75b4094383dbe949a9f6abf4f76 Mon Sep 17 00:00:00 2001
From: Anthony Correia <anthony.correia@cern.ch>
Date: Tue, 13 Feb 2024 09:23:14 +0100
Subject: [PATCH 06/11] Write part about processing

---
 readme/tutorial/02_preprocessing.ipynb | 657 +++++++++++++++++++++++++
 1 file changed, 657 insertions(+)
 create mode 100644 readme/tutorial/02_preprocessing.ipynb

diff --git a/readme/tutorial/02_preprocessing.ipynb b/readme/tutorial/02_preprocessing.ipynb
new file mode 100644
index 00000000..f1bfc36b
--- /dev/null
+++ b/readme/tutorial/02_preprocessing.ipynb
@@ -0,0 +1,657 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "f8f81a32-5c0f-4d01-a768-f00c42c4c5e1",
+   "metadata": {},
+   "source": [
+    "# Preprocessing and Processing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "217e7045-9a17-4566-b09d-a0f8d6472d8e",
+   "metadata": {},
+   "source": [
+    "To follow this section, please open the `etx4velo/notebooks/full_pipeline.ipynb` notebook."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3234485f-c1f5-438c-a484-aa238d42fbd3",
+   "metadata": {},
+   "source": [
+    "## Files Produced by XDIGI2CSV"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7ae6199d-b229-485f-bd62-afe31b7bac47",
+   "metadata": {},
+   "source": [
+    "The first two steps of the pipeline consists of preparing the data for training.\n",
+    "First, let's look at the data downloaded from my EOS space."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4b6b9ab8-9e3f-484d-8f44-54fa1a0ac510",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Update this variable with the directory where you folder actually is\n",
+    "original_datadir = \"/scratch/acorreia/minbias-sim10b-xdigi_subset\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "349fb4ea-544e-4829-851a-92a0e49acf50",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!ls -1 {original_datadir}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b693e24c-ba42-4c29-8b10-9c6fc6d8336e",
+   "metadata": {},
+   "source": [
+    "The files were obtained using the [XDIGI2CSV repository](https://gitlab.cern.ch/gdl4hep/xdigi2csv).\n",
+    "Each folder contains about 2000 events. Let's have a look at the first folder."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fe509889-8bdf-4d21-90e7-6ba198c0bbaa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!ls -1 {original_datadir}/0"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "660ce07e-4e34-4a94-b2ce-350902306f1d",
+   "metadata": {},
+   "source": [
+    "The `log.yaml` file contains information about where the events come from\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "17f4b2fc-79f8-4d04-a3a8-7c0169d73039",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!cat {original_datadir}/log.yaml"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c4e9ce68-59ca-4c46-85e9-ffdd2d3a3861",
+   "metadata": {},
+   "source": [
+    "- the events correspond to the ones stored in the Logical File Name (LFN) `LFN:/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi`.<br/>\n",
+    "- The other LFN is \"banned\" because it was stored in a server that I deemed unreliable.\n",
+    "- The returncode, equal to 0, indicates that the file was produced properly."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b15f3b06-69c7-487e-a82f-3549717b2e82",
+   "metadata": {},
+   "source": [
+    "The 2 files of interest for this tutorial are `hits_velo.parquet.lz4` and `mc_particles.parquet.lz4`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5514a598-2edd-40dc-b43d-0f40c0431046",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os.path as op\n",
+    "import pandas as pd\n",
+    "\n",
+    "df_hits_particles = pd.read_parquet(\n",
+    "    op.join(original_datadir, \"0\", \"hits_velo.parquet.lz4\")\n",
+    ")\n",
+    "df_particles = pd.read_parquet(\n",
+    "    op.join(original_datadir, \"0\", \"mc_particles.parquet.lz4\")\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "833ff888-e5d8-4044-9087-33e554e6b458",
+   "metadata": {},
+   "source": [
+    "Each row of the dataframe of particles is uniquely identified by \n",
+    "- `run`: the run number\n",
+    "- `event`: the event number within this run\n",
+    "- `mcid`: the particle ID\n",
+    "\n",
+    "Other columns give information about the particle."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bd771a61-682a-4a88-9b28-48f5ea6c0189",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_particles"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eb2a9a85-bc28-40bd-b273-30aa44d259ed",
+   "metadata": {},
+   "source": [
+    "Each row of the dataframe of hits-particles is uniquely identified by \n",
+    "- `run`: the run number\n",
+    "- `event`: the event number within this run\n",
+    "- `lhcbid`: the cluster ID\n",
+    "- `mcid`: the particle ID\n",
+    "\n",
+    "Other columns give information about the cluster position."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "29485b9d-f0d8-4889-ae99-7bc54cf6b85d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_hits_particles"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "368ad422-b3bb-41ff-b5f0-5c857d9a3020",
+   "metadata": {},
+   "source": [
+    "A `mcid` equal to `-1` corresponds to a noise hit (for the velo, from spillover)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cd763363-a44f-40a7-bef7-6e78184b1a77",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "n_hits = df_hits_particles[[\"run\", \"event\", \"lhcbid\"]].drop_duplicates().shape[0]\n",
+    "n_fake_hits = (df_hits_particles[\"mcid\"] == -1).sum()\n",
+    "\n",
+    "print(\"Proportion of fake hits:\", f\"{n_fake_hits / n_hits:%}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0e1935b6-57ab-41ce-add8-91b699043127",
+   "metadata": {},
+   "source": [
+    "A hit may be associated to more than one column."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e0b9c9f5-0891-471d-855b-f003de6fdf53",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_hits_particles_grouped_by_hits = (\n",
+    "    df_hits_particles[df_hits_particles[\"mcid\"] != -1]\n",
+    "    .groupby([\"run\", \"event\", \"lhcbid\"])[\"mcid\"]\n",
+    "    .count()\n",
+    "    .rename(\"n_particles\")\n",
+    ")\n",
+    "print(\n",
+    "    \"Proportion of true hits belonging to more than one particle:\",\n",
+    "    f\"{(df_hits_particles_grouped_by_hits > 2).sum() / df_hits_particles_grouped_by_hits.shape[0]:%}\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c4c41d60-29d6-4a99-9128-787afadaf654",
+   "metadata": {},
+   "source": [
+    "You can add particle information to the dataframe of hits-particles by merging\n",
+    "the dataframe of particles to the dataframe of hits-particles:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "49939160-5fc1-4f76-a3c8-48020b451096",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Add `pid` information\n",
+    "df_hits_particles.merge(\n",
+    "    df_particles[[\"run\", \"event\", \"mcid\", \"pid\"]],\n",
+    "    how=\"left\",\n",
+    "    on=[\"run\", \"event\", \"mcid\"],\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cdc001f9-2b4f-486e-8eff-af5e98ec0661",
+   "metadata": {},
+   "source": [
+    "For an extensive description of the the meaning of each column, please refer to the [XDIGI2CSV documentation](https://xdigi2csv.docs.cern.ch/master/Access/1.csv_description.html)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7ac2e43f-0d8c-4cd3-9da9-ed03f4f8875a",
+   "metadata": {},
+   "source": [
+    "## Preprocessing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a2b59dde-b81e-4cee-995f-b1965de8d43c",
+   "metadata": {},
+   "source": [
+    "Open again the pipeline configuration `etx4velo/pipeline_configs/example.yaml` to analyze\n",
+    "the `preprocessing` section.\n",
+    "\n",
+    "```yaml\n",
+    "preprocessing:\n",
+    "  input_dir: /scratch/acorreia/minbias-sim10b-xdigi_subset\n",
+    "  # Can be\n",
+    "  # - Integer: Last subdirectory that can be used (starting from `0`). `-1` for all.\n",
+    "  # - String or list of strings: sub-directories that can be used\n",
+    "  # - `null`: use `input_dir` directly\n",
+    "  # - Dictionary with keys `start` and `stop`\n",
+    "  subdirs: {\"start\": 0, \"stop\": 10}\n",
+    "  output_subdirectory: \"preprocessed\"\n",
+    "  # Preprocessing will stop once the required number of events has been preprocessed.\n",
+    "  # if `null`, default to `n_train_events + n_test_events`.\n",
+    "  n_events: null\n",
+    "  # Number of jobs dataframes processed in parallel\n",
+    "  # If more than 1 is required, the preprocessing will not stop after producing\n",
+    "  # the `n_events` events and all the input events will be preprocessed.\n",
+    "  n_workers: 1\n",
+    "\n",
+    "  processing: # Processing function(s), defined in `Preprocessing/process_custom.py`\n",
+    "  - remove_curved_particles\n",
+    "  num_true_hits_threshold: 500 # Minimal number of genuine hits\n",
+    "\n",
+    "  # Columns to keep in the dataframes of hits-particles and particles\n",
+    "  # (excluding `event`, `particle_id` and `lhcbid`)\n",
+    "  # `null` means keep everything\n",
+    "  hits_particles_columns: [\"x\", \"y\", \"z\", \"plane\"]\n",
+    "  particles_columns: null\n",
+    "```\n",
+    "\n",
+    "Please update `input_dir` to the location of the value your `original_datadir`.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9e73c0e3-c2d6-4db4-8018-005baa1c6ad2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pprint import pprint\n",
+    "from utils.commonutils.config import cdirs, load_config\n",
+    "\n",
+    "config_path = op.join(cdirs.repository, \"etx4velo\", \"pipeline_configs\", \"example.yaml\")\n",
+    "config = load_config(config_path)\n",
+    "\n",
+    "pprint(config[\"preprocessing\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2a389fca-5708-4fa9-b607-27dd325dab54",
+   "metadata": {},
+   "source": [
+    "As you can see, the `load_config` function has turned `output_subdirectory`\n",
+    "into `output_dir = {cdirs.data_directory}/{experiment_name}/{output_subdirectory}`.\n",
+    "That is why the configuration must be loaded using `load_config`!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7a1d5112-7cfc-41b8-99d4-6865d1b0e0a9",
+   "metadata": {},
+   "source": [
+    "You're now ready to move to `full_pipeline.ipynb` and run the preprocessing.\n",
+    "```python\n",
+    "from Preprocessing.run_preprocessing import run_preprocessing\n",
+    "run_preprocessing(CONFIG, reproduce=False)\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e3ca78f6-3787-428b-abcf-ab9dc0d2afde",
+   "metadata": {},
+   "source": [
+    "Here is what it is gonna do:\n",
+    "- loop over the events in `{input_dir}/{number}/hits_velo.parquet.lz4` and `{input_dir}/{number}/mc_particles.parquet.lz4` (as configured in `setup/common_config.yaml`.\n",
+    "- Only load the hits-particles columns in `hits_particles_columns` and the particle columns in `particles_columns`.\n",
+    "- Define the following columns.\n",
+    "    - `particle_id = mcid + 1`\n",
+    "    - `event_id = {9 numbers corresponding to the run}{9 numbers corresponding to the event}`\n",
+    "\n",
+    "- Apply the processing functions specified in `processing`, defined in `pipeline/preprocessing/process_custom.py`\n",
+    "- Only save the events with a number of genuine hits higher than `num_true_hits_threshold`\n",
+    "- For each event, save 2 parquet files `{event_id}-hits_particles.parquet` and `{event_id}-particles.parquet`\n",
+    "- Once the required number of events for training is reached, touch the `done` file, that indicates that this step properly finished.\n",
+    "\n",
+    "The preprocessing step supports parallelism over input files, using the `joblib` library. To enable it, you may increase `n_workers` to a number higher than 1."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "37302ac3-63d9-49e7-bef0-1e25dc89ceff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!(ls {config[\"preprocessing\"][\"output_dir\"]} | head -10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4f33f1a5-4493-4806-a326-74c21f9d41c9",
+   "metadata": {},
+   "source": [
+    "The preprocessing of the test samples can also be run in `full_pipeline.ipynb` through\n",
+    "\n",
+    "```python\n",
+    "from utils.commonutils.ctests import get_required_test_dataset_names\n",
+    "from Preprocessing.run_preprocessing import run_preprocessing_test_dataset\n",
+    "\n",
+    "for required_test_dataset_name in get_required_test_dataset_names(CONFIG):\n",
+    "    run_preprocessing_test_dataset(\n",
+    "        test_dataset_name=required_test_dataset_name,\n",
+    "        reproduce=False,\n",
+    "    )\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "59158488-6774-4040-baf2-b1d9d97d0569",
+   "metadata": {},
+   "source": [
+    "The preprocessed files of the test samples are common to all the pipelines.\n",
+    "For this reason, they are saved in `{datadir}/__test__/{detector}/{test_dataset_name}/`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f6af51a7-a627-4378-a534-eecbffaceb23",
+   "metadata": {},
+   "source": [
+    "## Processing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0949d099-6356-4d94-b6e0-25d3221ffcb7",
+   "metadata": {},
+   "source": [
+    "The processing step consists, for each event, of\n",
+    "1. Defining the (normalised) input features of the networks\n",
+    "2. Building the true edge indices\n",
+    "3. Defining the columns to keep\n",
+    "4. Defining the # train and validation samples\n",
+    "\n",
+    "Here is the current configuration of the processing step\n",
+    "```yaml\n",
+    "processing:\n",
+    "  input_subdirectory: \"preprocessed\"\n",
+    "  output_subdirectory: \"processed\"\n",
+    "  n_workers: 1 # Number of processes in parallel in the processing stage\n",
+    "\n",
+    "  features: [\"r\", \"phi\", \"z\"] # Name of the features to use\n",
+    "  feature_means: [18., 0.0, 281.0] # Means for normalising the features\n",
+    "  feature_scales: [9.75, 1.82, 287.0] # Scales for normalising the features\n",
+    "\n",
+    "  # List of the columns to keep in the PyTorch batches, in the dataframe of hits\n",
+    "  # Here the columns `x`, `y` and `z` are renamed `un_x`, `un_y` and `un_z`.\n",
+    "  kept_hits_columns: [\"plane\", {\"un_x\": \"x\"}, {\"un_y\": \"y\"}, {\"un_z\": \"z\"}]\n",
+    "  # List of columns in the dataframe of particles that are merged to the dataframe\n",
+    "  # of hits and stored in the PyTorch batches\n",
+    "  kept_particles_columns: [\"nhits_velo\"]\n",
+    "\n",
+    "  n_train_events: 5000 # Number of training events\n",
+    "  n_val_events: 500 # Number of validation events\n",
+    "  split_seed: 0 # Seed used for the splitting train-val\n",
+    "\n",
+    "  # How the true edges are computed\n",
+    "  # - sortwise: sort by z\n",
+    "  # - modulewise: sort by distance to production vertex\n",
+    "  # - planewise: hits belonging to same particle and belonging to adjacent planes\n",
+    "  true_edges_column: planewise\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "381a9c9b",
+   "metadata": {},
+   "source": [
+    "To run the processing:\n",
+    "\n",
+    "```python\n",
+    "from Processing.run_processing import run_processing_from_config\n",
+    "run_preprocessing(CONFIG, reproduce=False)\n",
+    "```\n",
+    "and to run the processing of the test samples:\n",
+    "```python\n",
+    "for required_test_dataset_name in get_required_test_dataset_names(CONFIG):\n",
+    "    run_preprocessing_test_dataset(\n",
+    "        test_dataset_name=required_test_dataset_name,\n",
+    "        reproduce=False,\n",
+    "    )\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "081e8579-4ae3-4fc8-a60b-137850db9fda",
+   "metadata": {},
+   "source": [
+    "Let's have a look at the output."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6ca2bb89-75b9-4ce7-9c35-3d3746156895",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "output_dir = config[\"processing\"][\"output_dir\"]\n",
+    "\n",
+    "print(\"Output dir:\", output_dir)\n",
+    "\n",
+    "!ls -1 {output_dir}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b08d635e",
+   "metadata": {},
+   "source": [
+    "The file `splitting.json` contains information about which events belong to the\n",
+    "train sample, and which events belong to the test sample."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "27ecf423",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!head {output_dir}/splitting.json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "01310a6b",
+   "metadata": {},
+   "source": [
+    "The `train` and `val` directory contain the processed event files."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "86e2ef60",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!ls -1 {output_dir}/val | head -10"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "84b0dde5",
+   "metadata": {},
+   "source": [
+    "The `test` folder contains the same files for the various test samples."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "85287986",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!ls {output_dir}/test"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "65a246d1",
+   "metadata": {},
+   "source": [
+    "Let's try to open an event file on the validation set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7e3f75a5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "first_val_path = next(iter(os.scandir(op.join(output_dir, \"val\")))).path\n",
+    "print(\"Opening\", first_val_path)\n",
+    "event = torch.load(first_val_path)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "45285ab8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for key, description in {\n",
+    "    \"x\": \"Hit features\",\n",
+    "    \"plane\": \"Plane index of each hit\",\n",
+    "    \"signal_true_edges\": \"True edge indices of the graph\",\n",
+    "    \"particle_id_hit_idx\": \"Allow to re-build the dataframe of hits-particles.\",\n",
+    "    \"un_x\": \"Unormalised x-coordinates of the hits\",\n",
+    "    \"un_y\": \"Unormalised y-coordinates of the hits\",\n",
+    "    \"un_z\": \"Unormalised z-coordinates of the hits\",\n",
+    "    \"unique_particle_id\": \"Unique particle ids in the event\",\n",
+    "    \"particle_nhits_velo\": \"Number of velo hits for the particles in `unique_particle_id`\",\n",
+    "}.items():\n",
+    "    key_str = f'\"{key}\"'\n",
+    "    print(\n",
+    "        f'{f\"batch[{key_str}]:\":<30}', f\"{str(event[key].shape):<20}\", \"-\", description\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe8010da",
+   "metadata": {},
+   "source": [
+    "The columns `un_x`, `un_y`, `un_z` were specified in `kept_hits_columns`.<br>\n",
+    "The column `particle_nhits_velo` (that comes with `unique_particle`) was specified\n",
+    "in `kept_particle_columns`.\n",
+    "\n",
+    "The goal would be not to rely on the preprocessed samples anymore.\n",
+    "However, sometimes, instead of reproducing the processed samples (and the samples\n",
+    "of the subsequent steps), it might less time consuming of loading the preprocessed\n",
+    "file directly.\n",
+    "For this reason, the element `truncated_path` allow to easily access the preprocessed\n",
+    "file of the corresponding event."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6bd4adcf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "truncated_path = event[\"truncated_path\"]\n",
+    "print(\"truncated_path:\", truncated_path)\n",
+    "print(f\"$ls {truncated_path}*\")\n",
+    "!ls {truncated_path}*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0515ae52",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_hits_particles = pd.read_parquet(truncated_path + \"-hits_particles.parquet\")\n",
+    "df_hits_particles"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
-- 
GitLab
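
As a complement to the per-hit counts computed in the notebook above, here is a hedged sketch (not part of the repository) that counts the number of VELO clusters per particle from the same `df_hits_particles` dataframe, using only the columns documented above (`run`, `event`, `lhcbid`, `mcid`); the data location is an assumption to adapt to your setup.

```python
import os.path as op

import pandas as pd

# Assumed location, mirroring the notebook: adjust to where your folder actually is.
original_datadir = "/scratch/acorreia/minbias-sim10b-xdigi_subset"
df_hits_particles = pd.read_parquet(
    op.join(original_datadir, "0", "hits_velo.parquet.lz4")
)

# Number of distinct clusters per genuine particle (mcid == -1 marks noise hits).
n_hits_per_particle = (
    df_hits_particles[df_hits_particles["mcid"] != -1]
    .groupby(["run", "event", "mcid"])["lhcbid"]
    .nunique()
    .rename("n_hits")
)
print(n_hits_per_particle.describe())
```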


From 87ace3462bec0a079de7da1af071e605f696f38a Mon Sep 17 00:00:00 2001
From: Anthony Correia <anthony.correia@cern.ch>
Date: Tue, 13 Feb 2024 09:23:34 +0100
Subject: [PATCH 07/11] Write section about ModelBase

---
 readme/tutorial/03_training.ipynb | 260 ++++++++++++++++++++++++++++++
 1 file changed, 260 insertions(+)
 create mode 100644 readme/tutorial/03_training.ipynb

diff --git a/readme/tutorial/03_training.ipynb b/readme/tutorial/03_training.ipynb
new file mode 100644
index 00000000..36c1e688
--- /dev/null
+++ b/readme/tutorial/03_training.ipynb
@@ -0,0 +1,260 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "f8f81a32-5c0f-4d01-a768-f00c42c4c5e1",
+   "metadata": {},
+   "source": [
+    "# Training"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9742e077",
+   "metadata": {},
+   "source": [
+    "Let's load the pipeline configuration once again."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "85933e74",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os.path as op\n",
+    "from utils.commonutils.config import cdirs, load_config\n",
+    "config_path = op.join(cdirs.repository, \"etx4velo\", \"pipeline_configs\", \"example.yaml\")\n",
+    "config = load_config(config_path)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "91da5888",
+   "metadata": {},
+   "source": [
+    "## `ModelBase`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f4b8b3ac-8ac1-4a66-a593-b081d57df75a",
+   "metadata": {},
+   "source": [
+    "Every model if this repository inherits from `ModelBase`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "80c91afc-36fd-4562-8efe-6a47ad007f6a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from utils.modelutils.basemodel import ModelBase\n",
+    "model = ModelBase(\n",
+    "    hparams={\n",
+    "        \"input_dir\": config[\"processing\"][\"output_dir\"]\n",
+    "    }\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "05da2cfe",
+   "metadata": {},
+   "source": [
+    "The `trainset` are `valset` are loaded on access:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0a1fef10-07d5-437f-af4b-c36c790ab048",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "trainset = model.trainset\n",
+    "trainset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b2f820d5",
+   "metadata": {},
+   "source": [
+    "The can also be loaded using the `load_partition` method."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4a0604eb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model._trainset = None # let's unload the trainset\n",
+    "model.load_partition(\"train\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ebab5e9d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# The trainset is already loaded\n",
+    "trainset = model.trainset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6175db90",
+   "metadata": {},
+   "source": [
+    "however, the `trainset` can be particularly large so it is not worth loading\n",
+    "it entirely. In this case, `lazy` can be turned to `True`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a2c8277b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.hparams[\"lazy\"] = True\n",
+    "model._trainset = None # let's unload the trainset\n",
+    "trainset = model.trainset # and load it again\n",
+    "trainset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cac5a6ad",
+   "metadata": {},
+   "source": [
+    "Now, the `trainset` is an instance of `LazyDatasetBase` that inherits from \n",
+    "the `torch.utils.data.Dataset` class.\n",
+    "Events are only when accessed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1b5a968b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from utils.loaderutils.dataiterator import LazyDatasetBase\n",
+    "\n",
+    "\n",
+    "assert isinstance(trainset, LazyDatasetBase)\n",
+    "print(\"Let's access\", trainset.input_paths[0])\n",
+    "event = trainset[0]\n",
+    "print(event)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "81a3fc78",
+   "metadata": {},
+   "source": [
+    "This only regards the `trainset`. The validation sample is still loaded entirely."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "08370472",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "valset = model.valset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4e850235",
+   "metadata": {},
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f9c64fc7",
+   "metadata": {},
+   "source": [
+    "The `testset` is not loaded automatically, because there might be more than\n",
+    "one testset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1eb43c63",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.testset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "81e7b550",
+   "metadata": {},
+   "source": [
+    "You can use the very same method `load_partition` to load it"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e515a883",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.load_partition(\"minbias-sim10b-xdigi_v2.4_1496\")\n",
+    "model.testset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "752e0a16",
+   "metadata": {},
+   "source": [
+    "## Load Models"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fff67f55",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
-- 
GitLab
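
To make the lazy-loading behaviour described in this notebook concrete, here is a minimal sketch assuming the processed `input_dir` and the `example.yaml` configuration used above; passing `lazy` directly in the hyperparameters is an assumption (the notebook sets `model.hparams["lazy"] = True` after construction). With `lazy` enabled, each event file is only read when the corresponding index is accessed.

```python
import os.path as op

from utils.commonutils.config import cdirs, load_config
from utils.loaderutils.dataiterator import LazyDatasetBase
from utils.modelutils.basemodel import ModelBase

config = load_config(op.join(cdirs.repository, "etx4velo", "pipeline_configs", "example.yaml"))
model = ModelBase(hparams={"input_dir": config["processing"]["output_dir"], "lazy": True})

trainset = model.trainset            # a LazyDatasetBase: no event is loaded yet
assert isinstance(trainset, LazyDatasetBase)

for idx in range(min(3, len(trainset))):
    event = trainset[idx]            # the event file is read only here
    print(trainset.input_paths[idx], tuple(event["x"].shape))
```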


From d888417fa721ac171b49b43c260283b8d1ef060e Mon Sep 17 00:00:00 2001
From: Anthony Correia <anthony.correia@cern.ch>
Date: Tue, 13 Feb 2024 09:26:24 +0100
Subject: [PATCH 08/11] Fix full_pipeline.ipynb

---
 etx4velo/notebooks/full_pipeline.ipynb | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/etx4velo/notebooks/full_pipeline.ipynb b/etx4velo/notebooks/full_pipeline.ipynb
index b2baec61..340b4b72 100644
--- a/etx4velo/notebooks/full_pipeline.ipynb
+++ b/etx4velo/notebooks/full_pipeline.ipynb
@@ -46,8 +46,8 @@
     "from Embedding.embedding_plots import plot_best_performances_squared_distance_max\n",
     "\n",
     "from scripts.train_model import train_model\n",
-    "from scripts.embedding_run import run as run_embedding_inference\n",
-    "from scripts.track_building import build as build_track_candidates\n",
+    "from scripts.build_graph_using_embedding import run as run_embedding_inference\n",
+    "from scripts.build_tracks import build as build_track_candidates\n",
     "\n",
     "from utils.plotutils import performance_mpl as perfplot_mpl\n",
     "from utils.commonutils.ctests import get_required_test_dataset_names\n",
@@ -148,7 +148,6 @@
     "for required_test_dataset_name in get_required_test_dataset_names(CONFIG):\n",
     "    run_preprocessing_test_dataset(\n",
     "        test_dataset_name=required_test_dataset_name,\n",
-    "        path_or_config_test=\"../evaluation/test_samples.yaml\",\n",
     "        reproduce=False,\n",
     "    )\n"
    ]
@@ -1096,7 +1095,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.11"
+   "version": "3.10.12"
   },
   "vscode": {
    "interpreter": {
-- 
GitLab


From d8362ec4b4d412df8eb3796be07bf389640b42c1 Mon Sep 17 00:00:00 2001
From: Anthony Correia <anthony.correia@cern.ch>
Date: Tue, 13 Feb 2024 09:29:44 +0100
Subject: [PATCH 09/11] Fix import of `compare_etx4velo_vs_allen`

---
 etx4velo/notebooks/full_pipeline.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/etx4velo/notebooks/full_pipeline.ipynb b/etx4velo/notebooks/full_pipeline.ipynb
index 340b4b72..2d4e3cad 100644
--- a/etx4velo/notebooks/full_pipeline.ipynb
+++ b/etx4velo/notebooks/full_pipeline.ipynb
@@ -1036,7 +1036,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from scripts.evaluate_allen import compare_etx4velo_vs_allen\n"
+    "from scripts.evaluation.compare_allen_vs_etx4velo import compare_etx4velo_vs_allen\n"
    ]
   },
   {
-- 
GitLab


From fcabfd4baa928bb2c4fecabc35892768e83abf9e Mon Sep 17 00:00:00 2001
From: Anthony Correia <anthony.correia@cern.ch>
Date: Tue, 13 Feb 2024 09:43:30 +0100
Subject: [PATCH 10/11] Improve example.yaml

---
 etx4velo/pipeline_configs/example.yaml | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/etx4velo/pipeline_configs/example.yaml b/etx4velo/pipeline_configs/example.yaml
index 29b52c27..308f194d 100644
--- a/etx4velo/pipeline_configs/example.yaml
+++ b/etx4velo/pipeline_configs/example.yaml
@@ -1,9 +1,10 @@
 common:
-  experiment_name: example
+  experiment_name: example # Optional: this is automatically set to the name of the config file
   # Name of the test datasets to use (defined in `evaluation/test_samples.yaml`)
   test_dataset_names:
   - minbias-sim10b-xdigi_v2.4_1496
   - minbias-sim10b-xdigi_v2.4_1498
+  detector: velo # default to the first entry in `detectors` in `common_config.yaml`
 
 preprocessing:
   input_dir: /scratch/acorreia/minbias-sim10b-xdigi_subset
@@ -59,7 +60,7 @@ processing:
   # - planewise: hits belonging to same particle and belonging to adjacent planes
   true_edges_column: planewise
 
-metric_learning:
+embedding:
   # Dataset parameters
   input_subdirectory: "processed"
   output_subdirectory: "embedding_processed"
@@ -78,14 +79,26 @@ metric_learning:
   emb_hidden: 128 # Number of hidden units / layer in the Dense Neural Network
   nb_layer: 3 # Number of layers
   emb_dim: 4 # Embedding dimension
-  activation: Tanh # Action function used in the MLP
-  weight: 3 # Weight for positive examples
+  activation: Tanh # Activation function used in the MLP
+  weight: 6 # Weight for positive examples
+
+  # Requirement to apply to all particles in the training and validation samples
+  particle_requirement: null
+  # Requirement to apply to the particles of the query points
+  # in the training and validation samples
+  query_particle_requirement: "(abs(pid) != 11) and has_velo and (((eta > -5) and (eta < -2)) or ((eta > 2) and (eta < 5)))"
+  # Requirement that defines the target particles in the training and validation samples
+  # Can be used to put more weight on the target particles.
+  # It was finally deemed unnecessary to use this parameter.
+  target_requirement: null
+  # non_target_weight: 0.05 # Weight for non-target particles in the loss
+  # target_weight: 0.05 # Weight for target particles in the loss
 
   # Available regimes
   # - rp: random pairs
   # - hnm: hard negative mining
   # - norm: perform L2 normalisation
-  regime: [rp, hnm, norm]
+  regime: [rp, hnm]
   randomisation: 1 # Number of random pairs per hit
   points_per_batch: 100000 # Number of query points to consider
   squared_distance_max: 0.015 # Maximum distance for hard-negative mining during training
@@ -131,6 +144,8 @@ gnn:
   # The GNN that is used. Switch to another GNN (such as the default `interaction`)
   # might not work properly.
   model: triplet_interaction
+  edge_checkpointing: True
+  triplet_checkpointing: True
 
   # Minimal edge score used to filter out fake edges before building the triplets
   edge_score_cut: 0.5
-- 
GitLab
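
A small sanity-check sketch for the section rename above (`metric_learning` to `embedding`): downstream code should now look the hyperparameters up under the new key. The check itself is illustrative only; the keys used are those visible in the diff.

```python
import os.path as op

from utils.commonutils.config import cdirs, load_config

config = load_config(op.join(cdirs.repository, "etx4velo", "pipeline_configs", "example.yaml"))

embedding_config = config["embedding"]   # formerly config["metric_learning"]
assert "metric_learning" not in config

print("embedding dim :", embedding_config["emb_dim"])
print("regime        :", embedding_config["regime"])
print("positive weight:", embedding_config["weight"])
```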


From d4d2f4651b0f598bdb3a4cd50fc110b8726c8787 Mon Sep 17 00:00:00 2001
From: Anthony Correia <anthony.correia@cern.ch>
Date: Tue, 13 Feb 2024 09:43:41 +0100
Subject: [PATCH 11/11] Finish section about training

---
 readme/tutorial/03_training.ipynb | 98 ++++++++++++++++++++++++++++++-
 1 file changed, 96 insertions(+), 2 deletions(-)

diff --git a/readme/tutorial/03_training.ipynb b/readme/tutorial/03_training.ipynb
index 36c1e688..428ed19d 100644
--- a/readme/tutorial/03_training.ipynb
+++ b/readme/tutorial/03_training.ipynb
@@ -224,7 +224,49 @@
    "id": "752e0a16",
    "metadata": {},
    "source": [
-    "## Load Models"
+    "## Load and Train Models"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bf71c693",
+   "metadata": {},
+   "source": [
+    "The ETX4VELO repository provides a central way for instantiating, loading and\n",
+    "training models.\n",
+    "\n",
+    "For instance, let's consider the GNN for a moment. various types of GNNs are defined\n",
+    "in the repository for exploration purposes. To load the class corresponding\n",
+    "to the correct GNN model, you may use the `get_model` function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "72cc159b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pipeline import get_model\n",
+    "GNNModel = get_model(config_path, step=\"gnn\")\n",
+    "GNNModel"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0b0a7fd9",
+   "metadata": {},
+   "source": [
+    "This function calls the embedding and GNN `get_model` function located in\n",
+    "`pipeline/Embedding/models/__init__.py` and `pipeline/GNN/models/__init__.py`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ef4fa3b2",
+   "metadata": {},
+   "source": [
+    "You can instantiate a model in order to train it using `instantiate_model_for_training`"
    ]
   },
   {
@@ -233,7 +275,59 @@
    "id": "fff67f55",
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "from pipeline import instantiate_model_for_training\n",
+    "embedding_model = instantiate_model_for_training(config_path, step=\"embedding\")\n",
+    "embedding_model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6fc1690e",
+   "metadata": {},
+   "source": [
+    "Finally, you can load a trainer model using `load_trained_model`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7af48a33",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pipeline import load_trained_model\n",
+    "embedding_model = load_trained_model(config_path, step=\"embedding\")\n",
+    "# -> ready for testing!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c1658f85",
+   "metadata": {},
+   "source": [
+    "To train a model, you can use the function made available in the\n",
+    "`scripts/train_model.py` script."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a9fcfd4d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from scripts.train_model import train_model\n",
+    "train_model(config_path, step=\"embedding\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c6fdf1e8",
+   "metadata": {},
+   "source": [
+    "You're now ready to run any training you want!"
+   ]
   }
  ],
  "metadata": {
-- 
GitLab
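
Putting the helpers from this last notebook together, here is a condensed sketch of the embedding workflow. The function names and `step` values are those used in the notebook; calling `get_model` with `step="embedding"` and any keyword arguments beyond `step=` are assumptions.

```python
import os.path as op

from utils.commonutils.config import cdirs
from pipeline import get_model, instantiate_model_for_training, load_trained_model
from scripts.train_model import train_model

config_path = op.join(cdirs.repository, "etx4velo", "pipeline_configs", "example.yaml")

EmbeddingModel = get_model(config_path, step="embedding")                # resolve the model class
model = instantiate_model_for_training(config_path, step="embedding")   # fresh, untrained model
train_model(config_path, step="embedding")                              # run the training
model = load_trained_model(config_path, step="embedding")               # reload the trained model for testing
```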