From 17fa3df656d14ed3df89f0befa4b3b2e1240269d Mon Sep 17 00:00:00 2001 From: Anthony Correia <anthony.correia@cern.ch> Date: Mon, 12 Feb 2024 02:45:52 +0100 Subject: [PATCH 01/11] Fix typo --- readme/setup/1_installation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/readme/setup/1_installation.md b/readme/setup/1_installation.md index 56c05344..69811305 100644 --- a/readme/setup/1_installation.md +++ b/readme/setup/1_installation.md @@ -45,7 +45,7 @@ For each new session, follow these steps to prepare your environment: 1. Source the `setup/setup.sh` file, which accomplishes the following: - Defines the environment variable `ETX4VELO_REPO`, containing the absolute path to this repository. - - Adds `montetracko`, `etx4velo and` `etx4velo/pipeline` to the `PYTHONPATH`. + - Adds `montetracko`, `etx4velo` and `etx4velo/pipeline` to the `PYTHONPATH`. ```bash source setup/setup.sh ``` -- GitLab From ce6d591119afe55e731ce2aa72c3b803d0de80c3 Mon Sep 17 00:00:00 2001 From: Anthony Correia <anthony.correia@cern.ch> Date: Mon, 12 Feb 2024 03:10:37 +0100 Subject: [PATCH 02/11] Update path to collect_test_samples.py --- readme/guide/3_training.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/readme/guide/3_training.md b/readme/guide/3_training.md index b98b5869..579598bf 100644 --- a/readme/guide/3_training.md +++ b/readme/guide/3_training.md @@ -66,7 +66,7 @@ The essential steps are outlined below: source setup/setup.sh cd etx4velo # Run the test sample collection script - ./evaluation/collect_test_samples.py + ./scripts/collect_test_samples.py ``` Once you've completed these steps, the configuration for the test samples will be available in the `etx4velo/evaluation/test_samples.yaml` file, ready for use in the next steps. 
-- GitLab From 2be3dea3d5d57af8a2717c282b627cd5fe290738 Mon Sep 17 00:00:00 2001 From: Anthony Correia <anthony.correia@cern.ch> Date: Mon, 12 Feb 2024 03:11:12 +0100 Subject: [PATCH 03/11] Write first section of tutorial about configuration --- readme/tutorial/01_configuration.ipynb | 203 +++++++++++++++++++++++++ 1 file changed, 203 insertions(+) create mode 100644 readme/tutorial/01_configuration.ipynb diff --git a/readme/tutorial/01_configuration.ipynb b/readme/tutorial/01_configuration.ipynb new file mode 100644 index 00000000..5fad62c7 --- /dev/null +++ b/readme/tutorial/01_configuration.ipynb @@ -0,0 +1,203 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# ETX4VELO Configuration\n", + "\n", + "Welcome to this second section of the tutorial, which covers the configuration of the ETX4VELO repository.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Repository Organisation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The root directory of the ETX4VELO repository contains several folders:\n", + "- `etx4velo`: the main folder, containing the models, pipeline configurations,\n", + "notebooks, etc.\n", + "- `readme`: the README markdown files used in the documentation website.\n", + "- `docs`: the source files to build the documentation with sphinx.\n", + "- `setup`: the environment and configuration files.\n", + "\n", + "The main folder is `etx4velo`, where you can find the following folders:\n", + "- `pipeline`: the heart of ETX4VELO, containing all the packages and models.\n", + "- `notebooks`: contains notebooks to run trainings and evaluations interactively.\n", + "- `pipeline_configs`: contains all the pipeline configurations for training and inference.\n", + "- `scripts`: contains scripts to run some steps of the pipeline from the command line.\n", + "- `snakefiles`: Snakemake files to run automated and reproducible evaluation of\n", + "the ETX4VELO pipeline.\n", + "- `analyses`: random notebooks I use to debug or understand problems\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup file\n", + "\n", + "First, source the `setup/setup.sh` file.\n", + "```bash\n", + "source setup/setup.sh\n", + "```\n", + "This defines the environment variable `ETX4VELO_REPO`, containing the absolute path\n", + "to this repository, and adds `montetracko`, `etx4velo` and `etx4velo/pipeline` to \n", + "the `PYTHONPATH` environment variable."
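For readers who open this notebook without having sourced `setup/setup.sh` first, the sketch below reproduces its two effects from Python. It is not the actual script: the repository location is an assumption and should be adjusted to wherever ETX4VELO was cloned.

```python
# Rough Python equivalent of sourcing setup/setup.sh (a sketch, not the real script).
import os
import sys
from pathlib import Path

# Assumption: adjust this path to the root of your ETX4VELO clone.
repo = Path.home() / "Documents" / "tracking" / "etx4velo"

os.environ["ETX4VELO_REPO"] = str(repo)
for subdir in ("montetracko", "etx4velo", "etx4velo/pipeline"):
    entry = str(repo / subdir)
    if entry not in sys.path:
        sys.path.append(entry)
```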
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ETX4VELO_REPO environment variable: /home/acorreia/Documents/tracking/etx4velo\n", + "\n", + "PYTHONPATH content:\n", + "['/home/acorreia/Documents/tracking/etx4velo/readme/tutorial',\n", + " '/home/acorreia/Documents/tracking/etx4velo/readme/tutorial',\n", + " '/home/acorreia/Documents/tracking/etx4velo/etx4velo',\n", + " '/home/acorreia/Documents/tracking/etx4velo/etx4velo/pipeline',\n", + " '/home/acorreia/Documents/tracking/etx4velo/montetracko',\n", + " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python310.zip',\n", + " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python3.10',\n", + " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python3.10/lib-dynload',\n", + " '',\n", + " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python3.10/site-packages']\n" + ] + } + ], + "source": [ + "import os\n", + "import sys\n", + "from pprint import pprint\n", + "\n", + "print(\"ETX4VELO_REPO environment variable:\", os.environ[\"ETX4VELO_REPO\"])\n", + "\n", + "print(\"\\nPYTHONPATH content:\")\n", + "pprint(sys.path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Configuration Files" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First edit the `setup/common_config.yaml` file to your liking, more particularly\n", + "the `directories` section:\n", + "```yaml\n", + "directories:\n", + " # Directory where the processed files are saved. You may need space to store this folder.\n", + " data_directory: /scratch/acorreia/data\n", + " # Directory where the model parameters are saved during training\n", + " artifact_directory: artifacts\n", + " # The plots and reports of a given experiment are saved under this folder\n", + " performance_directory: output\n", + " # Directory that contains the reference (test) samples\n", + " reference_directory: /scratch/acorreia/reference_samples\n", + " # Directory that contains other figures, used for presentations for instance\n", + " analysis_directory: output/analysis\n", + " # Directory that contains the exported model\n", + " export_directory: model_export\n", + "```\n", + "\n", + "The relative paths are expressed w.r.t. the `etx4velo` folder of the repository.\n", + "\n", + "The configuration that you are likely to change are:\n", + "- `data_directory`\n", + "- `reference_directory`: change it to where you extracted the `reference_samples_tutorial.tar.lz4` archive" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For convenience, these directories can be retrieved using the `cdirs` object." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "cdirs.data_directory : /scratch/acorreia/data\n", + "cdirs.artifact_directory : /home/acorreia/Documents/tracking/etx4velo/etx4velo/artifacts\n", + "cdirs.performance_directory : /home/acorreia/Documents/tracking/etx4velo/etx4velo/output\n", + "cdirs.reference_directory : /scratch/acorreia/reference_samples\n", + "cdirs.analysis_directory : /home/acorreia/Documents/tracking/etx4velo/etx4velo/output/analysis\n", + "cdirs.export_directory : /home/acorreia/Documents/tracking/etx4velo/etx4velo/model_export\n" + ] + } + ], + "source": [ + "from utils.commonutils.config import cdirs\n", + "\n", + "for dirtype in [\"data\", \"artifact\", \"performance\", \"reference\", \"analysis\", \"export\"]:\n", + " attribute_name = f\"{dirtype}_directory\"\n", + " print(f\"{f'cdirs.{attribute_name}':<30}:\", getattr(cdirs, attribute_name))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Collect Test Samples" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Please move to the `etx4velo` directory and run the following script\n", + "\n", + "```bash\n", + "./scripts/collect_test_samples.py\n", + "```\n", + "which produces, the `etx4velo/test_samples.yaml` file, which is the configuration\n", + "for the test samples. The test samples are collected by navigating through the folders\n", + "in `cdirs.reference_directory`.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} -- GitLab From 69fb2a08679d3835c040dea70f6d3311a4f45348 Mon Sep 17 00:00:00 2001 From: Anthony Correia <anthony.correia@cern.ch> Date: Tue, 13 Feb 2024 09:22:13 +0100 Subject: [PATCH 04/11] Fix all type hints in ModelBase --- .../pipeline/utils/modelutils/basemodel.py | 45 ++++++++++++++----- 1 file changed, 34 insertions(+), 11 deletions(-) diff --git a/etx4velo/pipeline/utils/modelutils/basemodel.py b/etx4velo/pipeline/utils/modelutils/basemodel.py index 6c70cf0d..85726279 100644 --- a/etx4velo/pipeline/utils/modelutils/basemodel.py +++ b/etx4velo/pipeline/utils/modelutils/basemodel.py @@ -24,7 +24,7 @@ class ModelBase(LightningModule): super().__init__() self._trainset = None self._valset = None - self.testset: typing.List[Data] | None = None + self._testset: typing.List[Data] | None = None self.save_hyperparameters(hparams) self._idx_trainset_split: int | None = None self._trainset_split_indices: typing.List[npt.NDArray] | None = None @@ -32,7 +32,7 @@ class ModelBase(LightningModule): def setup(self, stage): self.load_partition("train") self.load_partition("val") - self.testset = None + self._testset = None @property def lazy(self) -> bool: @@ -62,9 +62,18 @@ class ModelBase(LightningModule): if self._valset is None: self.load_partition(partition="val") assert self._valset is not None - assert not isinstance(self._valset, LazyDatasetBase) return self._valset + @property + def testset(self) -> typing.List[Data]: + if self._testset is None: + raise ValueError( + "Test set not loaded. 
Please load it with `fetch_partition` " + "or `load_testset_from_directory`." + ) + else: + return self._testset + @valset.setter def valset(self, batches: typing.List[Data]): self._valset = batches @@ -79,8 +88,15 @@ class ModelBase(LightningModule): def train_dataloader(self): """Train dataloader, with random splitting of epochs.""" print("Load train dataloader.") - if len(self.trainset) > 0: + trainset = self.trainset + if len(trainset) > 0: if (trainset_split := self.hparams.get("trainset_split")) is not None: + if not isinstance(trainset, LazyDatasetBase): + raise TypeError( + "In order to use the `trainset_split` property, " + "the trainset should be loaded in a lazy way. " + "Please consider switching `lazy` to `True`." + ) if self._trainset_split_indices is None: print("Define random splitting of epochs") self.load_trainset_split_indices(trainset_split) @@ -91,8 +107,8 @@ class ModelBase(LightningModule): print("Load subset number", self._idx_trainset_split) trainset = Subset( - self.trainset, - self._trainset_split_indices[self._idx_trainset_split], + trainset, + self._trainset_split_indices[self._idx_trainset_split], # type: ignore ) # Prepare next already @@ -104,20 +120,25 @@ class ModelBase(LightningModule): else: trainset = self.trainset shuffle = True - return DataLoader(trainset, batch_size=1, num_workers=8, shuffle=shuffle) + return DataLoader( + trainset, # type: ignore + batch_size=1, + num_workers=8, + shuffle=shuffle, + ) else: return None def val_dataloader(self): """Validation dataloader.""" if len(self.valset) > 0: - return DataLoader(self.valset, batch_size=1, num_workers=8) + return DataLoader(self.valset, batch_size=1, num_workers=0) else: return None def test_dataloader(self): """Test dataloader.""" - if self.testset is not None and len(self.testset) > 0: + if self._testset is not None and len(self._testset) > 0: return DataLoader(self.testset, batch_size=1, num_workers=8) else: return None @@ -180,7 +201,7 @@ class ModelBase(LightningModule): pickles files. """ lazy_dataset = self.get_lazy_dataset(input_dir=input_dir, **kwargs) - self.testset = self.fetch_datasets(lazy_dataset=lazy_dataset) + self._testset = self.fetch_datasets(lazy_dataset=lazy_dataset) def get_lazy_dataset_partition( self, @@ -284,9 +305,11 @@ class ModelBase(LightningModule): if partition == "train": self._trainset = datasets elif partition == "val": + assert not isinstance(datasets, LazyDatasetBase) # shouldn't be the case self._valset = datasets else: - self.testset = datasets + assert not isinstance(datasets, LazyDatasetBase) # shouldn't be the case + self._testset = datasets def get_input_data(self, all_features: torch.Tensor) -> torch.Tensor: return get_input_features( -- GitLab From db2babbc18e3c16e4f8b0fd73f233efbfac6cf46 Mon Sep 17 00:00:00 2001 From: Anthony Correia <anthony.correia@cern.ch> Date: Tue, 13 Feb 2024 09:22:51 +0100 Subject: [PATCH 05/11] Finish part 1 about configuration --- readme/tutorial/01_configuration.ipynb | 158 +++++++++++++++++++------ 1 file changed, 119 insertions(+), 39 deletions(-) diff --git a/readme/tutorial/01_configuration.ipynb b/readme/tutorial/01_configuration.ipynb index 5fad62c7..dad916d7 100644 --- a/readme/tutorial/01_configuration.ipynb +++ b/readme/tutorial/01_configuration.ipynb @@ -52,31 +52,25 @@ "the `PYTHONPATH` environment variable." 
] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once this file is sourced, you may launch `jupyter-lab`\n", + "```bash\n", + "cd etx4velo\n", + "jupyter-lab --port 8889 --no-browser\n", + "```\n", + "and open this notebook on your internet browser.\n", + "\n", + "You can inspect your environment variables and the PYTHONPATH content:" + ] + }, { "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "ETX4VELO_REPO environment variable: /home/acorreia/Documents/tracking/etx4velo\n", - "\n", - "PYTHONPATH content:\n", - "['/home/acorreia/Documents/tracking/etx4velo/readme/tutorial',\n", - " '/home/acorreia/Documents/tracking/etx4velo/readme/tutorial',\n", - " '/home/acorreia/Documents/tracking/etx4velo/etx4velo',\n", - " '/home/acorreia/Documents/tracking/etx4velo/etx4velo/pipeline',\n", - " '/home/acorreia/Documents/tracking/etx4velo/montetracko',\n", - " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python310.zip',\n", - " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python3.10',\n", - " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python3.10/lib-dynload',\n", - " '',\n", - " '/scratch/acorreia/mambaforge/envs/etx4velo_env_tutorial/lib/python3.10/site-packages']\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "import os\n", "import sys\n", @@ -99,6 +93,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "To properly use the ETX4VELO repository, there is still a few things you need to do.\n", + "\n", "First edit the `setup/common_config.yaml` file to your liking, more particularly\n", "the `directories` section:\n", "```yaml\n", @@ -133,22 +129,9 @@ }, { "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "cdirs.data_directory : /scratch/acorreia/data\n", - "cdirs.artifact_directory : /home/acorreia/Documents/tracking/etx4velo/etx4velo/artifacts\n", - "cdirs.performance_directory : /home/acorreia/Documents/tracking/etx4velo/etx4velo/output\n", - "cdirs.reference_directory : /scratch/acorreia/reference_samples\n", - "cdirs.analysis_directory : /home/acorreia/Documents/tracking/etx4velo/etx4velo/output/analysis\n", - "cdirs.export_directory : /home/acorreia/Documents/tracking/etx4velo/etx4velo/model_export\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from utils.commonutils.config import cdirs\n", "\n", @@ -177,6 +160,103 @@ "for the test samples. The test samples are collected by navigating through the folders\n", "in `cdirs.reference_directory`.\n" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pipeline Configuration" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The pipeline configurations are stored in in the `pipeline_configs` directory.\n", + "Let's focus on the `pipeline_configs.yaml` configuration.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "config_path = os.path.join(cdirs.repository, \"etx4velo\", \"pipeline_configs\", \"example.yaml\")\n", + "print(\"config_path:\", config_path)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To load the configuration, you should always use the `load_config` function,\n", + "because it alters the configuration for convenience." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from utils.commonutils.config import load_config\n", + "\n", + "config = load_config(config_path)\n", + "assert config == load_config(config) # pass-through if it is already a dictionary, for convenience!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pprint(config)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The configuration is essentially a dictionary of dictionaries.\n", + "It is divided into several sections, corresponding to the pipeline steps." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Configuration sections:\", list(config.keys()))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First look at the `common` section:\n", + "```yaml\n", + "common:\n", + " experiment_name: example # Optional: this is automatically set to the name of the config file\n", + " # Name of the test datasets to use (defined in `evaluation/test_samples.yaml`)\n", + " test_dataset_names:\n", + " - minbias-sim10b-xdigi_v2.4_1496\n", + " - minbias-sim10b-xdigi_v2.4_1498\n", + " detector: velo # default to the first entry in `detectors` in `common_config.yaml`\n", + "```\n", + "which defines:\n", + "- the `experiment_name`, set to the name of the configuration file by `load_config`!\n", + "- the test dataset names made available to the pipeline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll go over the next sections of the configuration in subsequent parts of this tutorial." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} -- GitLab From f1cd75710337d75b4094383dbe949a9f6abf4f76 Mon Sep 17 00:00:00 2001 From: Anthony Correia <anthony.correia@cern.ch> Date: Tue, 13 Feb 2024 09:23:14 +0100 Subject: [PATCH 06/11] Write part about processing --- readme/tutorial/02_preprocessing.ipynb | 657 +++++++++++++++++++++++++ 1 file changed, 657 insertions(+) create mode 100644 readme/tutorial/02_preprocessing.ipynb diff --git a/readme/tutorial/02_preprocessing.ipynb b/readme/tutorial/02_preprocessing.ipynb new file mode 100644 index 00000000..f1bfc36b --- /dev/null +++ b/readme/tutorial/02_preprocessing.ipynb @@ -0,0 +1,657 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f8f81a32-5c0f-4d01-a768-f00c42c4c5e1", + "metadata": {}, + "source": [ + "# Preprocessing and Processing" + ] + }, + { + "cell_type": "markdown", + "id": "217e7045-9a17-4566-b09d-a0f8d6472d8e", + "metadata": {}, + "source": [ + "To follow this section, please open the `etx4velo/notebooks/full_pipeline.ipynb` notebook." + ] + }, + { + "cell_type": "markdown", + "id": "3234485f-c1f5-438c-a484-aa238d42fbd3", + "metadata": {}, + "source": [ + "## Files Produced by XDIGI2CSV" + ] + }, + { + "cell_type": "markdown", + "id": "7ae6199d-b229-485f-bd62-afe31b7bac47", + "metadata": {}, + "source": [ + "The first two steps of the pipeline consist of preparing the data for training.\n", + "First, let's look at the data downloaded from my EOS space."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4b6b9ab8-9e3f-484d-8f44-54fa1a0ac510", + "metadata": {}, + "outputs": [], + "source": [ + "# Update this variable with the directory where you folder actually is\n", + "original_datadir = \"/scratch/acorreia/minbias-sim10b-xdigi_subset\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "349fb4ea-544e-4829-851a-92a0e49acf50", + "metadata": {}, + "outputs": [], + "source": [ + "!ls -1 {original_datadir}" + ] + }, + { + "cell_type": "markdown", + "id": "b693e24c-ba42-4c29-8b10-9c6fc6d8336e", + "metadata": {}, + "source": [ + "The files were obtained using the [XDIGI2CSV repository](https://gitlab.cern.ch/gdl4hep/xdigi2csv).\n", + "Each folder contains about 2000 events. Let's have a look at the first folder." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe509889-8bdf-4d21-90e7-6ba198c0bbaa", + "metadata": {}, + "outputs": [], + "source": [ + "!ls -1 {original_datadir}/0" + ] + }, + { + "cell_type": "markdown", + "id": "660ce07e-4e34-4a94-b2ce-350902306f1d", + "metadata": {}, + "source": [ + "The `log.yaml` file contains information about where the events come from\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "17f4b2fc-79f8-4d04-a3a8-7c0169d73039", + "metadata": {}, + "outputs": [], + "source": [ + "!cat {original_datadir}/log.yaml" + ] + }, + { + "cell_type": "markdown", + "id": "c4e9ce68-59ca-4c46-85e9-ffdd2d3a3861", + "metadata": {}, + "source": [ + "- the events correspond to the ones stored in the Logical File Name (LFN) `LFN:/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi`.<br/>\n", + "- The other LFN is \"banned\" because it was stored in a server that I deemed unreliable.\n", + "- The returncode, equal to 0, indicates that the file was produced properly." + ] + }, + { + "cell_type": "markdown", + "id": "b15f3b06-69c7-487e-a82f-3549717b2e82", + "metadata": {}, + "source": [ + "The 2 files of interest for this tutorial are `hits_velo.parquet.lz4` and `mc_particles.parquet.lz4`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5514a598-2edd-40dc-b43d-0f40c0431046", + "metadata": {}, + "outputs": [], + "source": [ + "import os.path as op\n", + "import pandas as pd\n", + "\n", + "df_hits_particles = pd.read_parquet(\n", + " op.join(original_datadir, \"0\", \"hits_velo.parquet.lz4\")\n", + ")\n", + "df_particles = pd.read_parquet(\n", + " op.join(original_datadir, \"0\", \"mc_particles.parquet.lz4\")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "833ff888-e5d8-4044-9087-33e554e6b458", + "metadata": {}, + "source": [ + "Each row of the dataframe of particles is uniquely identified by \n", + "- `run`: the run number\n", + "- `event`: the event number within this run\n", + "- `mcid`: the particle ID\n", + "\n", + "Other columns give information about the particle." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd771a61-682a-4a88-9b28-48f5ea6c0189", + "metadata": {}, + "outputs": [], + "source": [ + "df_particles" + ] + }, + { + "cell_type": "markdown", + "id": "eb2a9a85-bc28-40bd-b273-30aa44d259ed", + "metadata": {}, + "source": [ + "Each row of the dataframe of hits-particles is uniquely identified by \n", + "- `run`: the run number\n", + "- `event`: the event number within this run\n", + "- `lhcbid`: the cluster ID\n", + "- `mcid`: the particle ID\n", + "\n", + "Other columns give information about the cluster position." 
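As a quick sanity check of this key (a small sketch that reuses the `df_hits_particles` dataframe loaded above), you can verify that no `(run, event, lhcbid, mcid)` combination appears twice:

```python
# Each (run, event, lhcbid, mcid) combination should identify exactly one row.
key_columns = ["run", "event", "lhcbid", "mcid"]
n_duplicated = df_hits_particles.duplicated(subset=key_columns).sum()
print("Number of duplicated (run, event, lhcbid, mcid) rows:", n_duplicated)
```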
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29485b9d-f0d8-4889-ae99-7bc54cf6b85d", + "metadata": {}, + "outputs": [], + "source": [ + "df_hits_particles" + ] + }, + { + "cell_type": "markdown", + "id": "368ad422-b3bb-41ff-b5f0-5c857d9a3020", + "metadata": {}, + "source": [ + "A `mcid` equal to `-1` corresponds to a noise hit (for the velo, from spillover)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cd763363-a44f-40a7-bef7-6e78184b1a77", + "metadata": {}, + "outputs": [], + "source": [ + "n_hits = df_hits_particles[[\"run\", \"event\", \"lhcbid\"]].drop_duplicates().shape[0]\n", + "n_fake_hits = (df_hits_particles[\"mcid\"] == -1).sum()\n", + "\n", + "print(\"Proportion of fake hits:\", f\"{n_fake_hits / n_hits:%}\")" + ] + }, + { + "cell_type": "markdown", + "id": "0e1935b6-57ab-41ce-add8-91b699043127", + "metadata": {}, + "source": [ + "A hit may be associated with more than one particle." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0b9c9f5-0891-471d-855b-f003de6fdf53", + "metadata": {}, + "outputs": [], + "source": [ + "df_hits_particles_grouped_by_hits = (\n", + " df_hits_particles[df_hits_particles[\"mcid\"] != -1]\n", + " .groupby([\"run\", \"event\", \"lhcbid\"])[\"mcid\"]\n", + " .count()\n", + " .rename(\"n_particles\")\n", + ")\n", + "print(\n", + " \"Proportion of true hits belonging to more than one particle:\",\n", + " f\"{(df_hits_particles_grouped_by_hits > 1).sum() / df_hits_particles_grouped_by_hits.shape[0]:%}\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "c4c41d60-29d6-4a99-9128-787afadaf654", + "metadata": {}, + "source": [ + "You can add particle information to the dataframe of hits-particles by merging\n", + "the dataframe of particles to the dataframe of hits-particles:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "49939160-5fc1-4f76-a3c8-48020b451096", + "metadata": {}, + "outputs": [], + "source": [ + "# Add `pid` information\n", + "df_hits_particles.merge(\n", + " df_particles[[\"run\", \"event\", \"mcid\", \"pid\"]],\n", + " how=\"left\",\n", + " on=[\"run\", \"event\", \"mcid\"],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "cdc001f9-2b4f-486e-8eff-af5e98ec0661", + "metadata": {}, + "source": [ + "For an extensive description of the meaning of each column, please refer to the [XDIGI2CSV documentation](https://xdigi2csv.docs.cern.ch/master/Access/1.csv_description.html)." + ] + }, + { + "cell_type": "markdown", + "id": "7ac2e43f-0d8c-4cd3-9da9-ed03f4f8875a", + "metadata": {}, + "source": [ + "## Preprocessing" + ] + }, + { + "cell_type": "markdown", + "id": "a2b59dde-b81e-4cee-995f-b1965de8d43c", + "metadata": {}, + "source": [ + "Open again the pipeline configuration `etx4velo/pipeline_configs/example.yaml` to analyze\n", + "the `preprocessing` section.\n", + "\n", + "```yaml\n", + "preprocessing:\n", + " input_dir: /scratch/acorreia/minbias-sim10b-xdigi_subset\n", + " # Can be\n", + " # - Integer: Last subdirectory that can be used (starting from `0`). 
`-1` for all.\n", + " # - String or list of strings: sub-directories that can be used\n", + " # - `null`: use `input_dir` directly\n", + " # - Dictionary with keys `start` and `stop`\n", + " subdirs: {\"start\": 0, \"stop\": 10}\n", + " output_subdirectory: \"preprocessed\"\n", + " # Preprocessing will stop once the required number of events has been preprocessed.\n", + " # if `null`, default to `n_train_events + n_test_events`.\n", + " n_events: null\n", + " # Number of dataframes processed in parallel\n", + " # If more than 1 is required, the preprocessing will not stop after producing\n", + " # the `n_events` events and all the input events will be preprocessed.\n", + " n_workers: 1\n", + "\n", + " processing: # Processing function(s), defined in `Preprocessing/process_custom.py`\n", + " - remove_curved_particles\n", + " num_true_hits_threshold: 500 # Minimal number of genuine hits\n", + "\n", + " # Columns to keep in the dataframes of hits-particles and particles\n", + " # (excluding `event`, `particle_id` and `lhcbid`)\n", + " # `null` means keep everything\n", + " hits_particles_columns: [\"x\", \"y\", \"z\", \"plane\"]\n", + " particles_columns: null\n", + "```\n", + "\n", + "Please update `input_dir` to the value of your `original_datadir`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e73c0e3-c2d6-4db4-8018-005baa1c6ad2", + "metadata": {}, + "outputs": [], + "source": [ + "from pprint import pprint\n", + "from utils.commonutils.config import cdirs, load_config\n", + "\n", + "config_path = op.join(cdirs.repository, \"etx4velo\", \"pipeline_configs\", \"example.yaml\")\n", + "config = load_config(config_path)\n", + "\n", + "pprint(config[\"preprocessing\"])" + ] + }, + { + "cell_type": "markdown", + "id": "2a389fca-5708-4fa9-b607-27dd325dab54", + "metadata": {}, + "source": [ + "As you can see, the `load_config` function has turned `output_subdirectory`\n", + "into `output_dir = {cdirs.data_directory}/{experiment_name}/{output_subdirectory}`.\n", + "That is why the configuration must be loaded using `load_config`!"
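To see concretely what `load_config` adds on top of the raw file, you can compare the two. This is a sketch: it assumes a YAML parser such as PyYAML is available in the environment.

```python
# Compare the raw YAML content with the configuration returned by load_config.
import yaml

with open(config_path) as stream:
    raw_config = yaml.safe_load(stream)

print("Keys in the raw file:  ", sorted(raw_config["preprocessing"]))
print("Keys after load_config:", sorted(config["preprocessing"]))
# The raw file only specifies `output_subdirectory`; the loaded configuration
# also exposes the derived `output_dir` used by the rest of the pipeline.
```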
+ ] + }, + { + "cell_type": "markdown", + "id": "7a1d5112-7cfc-41b8-99d4-6865d1b0e0a9", + "metadata": {}, + "source": [ + "You're now ready to move to `full_pipeline.ipynb` and run the preprocessing.\n", + "```python\n", + "from Preprocessing.run_preprocessing import run_preprocessing\n", + "run_preprocessing(CONFIG, reproduce=False)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "e3ca78f6-3787-428b-abcf-ab9dc0d2afde", + "metadata": {}, + "source": [ + "Here is what it does:\n", + "- Loop over the events in `{input_dir}/{number}/hits_velo.parquet.lz4` and `{input_dir}/{number}/mc_particles.parquet.lz4` (as configured in `setup/common_config.yaml`).\n", + "- Only load the hits-particles columns in `hits_particles_columns` and the particle columns in `particles_columns`.\n", + "- Define the following columns:\n", + " - `particle_id = mcid + 1`\n", + " - `event_id = {9 numbers corresponding to the run}{9 numbers corresponding to the event}`\n", + "\n", + "- Apply the processing functions specified in `processing`, defined in `pipeline/preprocessing/process_custom.py`\n", + "- Only save the events with a number of genuine hits higher than `num_true_hits_threshold`\n", + "- For each event, save 2 parquet files `{event_id}-hits_particles.parquet` and `{event_id}-particles.parquet`\n", + "- Once the required number of events for training is reached, touch the `done` file, which indicates that this step finished properly.\n", + "\n", + "The preprocessing step supports parallelism over input files, using the `joblib` library. To enable it, you may increase `n_workers` to a number higher than 1." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "37302ac3-63d9-49e7-bef0-1e25dc89ceff", + "metadata": {}, + "outputs": [], + "source": [ + "!(ls {config[\"preprocessing\"][\"output_dir\"]} | head -10)" + ] + }, + { + "cell_type": "markdown", + "id": "4f33f1a5-4493-4806-a326-74c21f9d41c9", + "metadata": {}, + "source": [ + "The preprocessing of the test samples can also be run in `full_pipeline.ipynb` through\n", + "\n", + "```python\n", + "from utils.commonutils.ctests import get_required_test_dataset_names\n", + "from Preprocessing.run_preprocessing import run_preprocessing_test_dataset\n", + "\n", + "for required_test_dataset_name in get_required_test_dataset_names(CONFIG):\n", + " run_preprocessing_test_dataset(\n", + " test_dataset_name=required_test_dataset_name,\n", + " reproduce=False,\n", + " )\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "59158488-6774-4040-baf2-b1d9d97d0569", + "metadata": {}, + "source": [ + "The preprocessed files of the test samples are common to all the pipelines.\n", + "For this reason, they are saved in `{datadir}/__test__/{detector}/{test_dataset_name}/`" + ] + }, + { + "cell_type": "markdown", + "id": "f6af51a7-a627-4378-a534-eecbffaceb23", + "metadata": {}, + "source": [ + "## Processing" + ] + }, + { + "cell_type": "markdown", + "id": "0949d099-6356-4d94-b6e0-25d3221ffcb7", + "metadata": {}, + "source": [ + "The processing step consists, for each event, of\n", + "1. Defining the (normalised) input features of the networks\n", + "2. Building the true edge indices\n", + "3. Defining the columns to keep\n", + "4. 
Defining the train and validation samples\n", + "\n", + "Here is the current configuration of the processing step\n", + "```yaml\n", + "processing:\n", + " input_subdirectory: \"preprocessed\"\n", + " output_subdirectory: \"processed\"\n", + " n_workers: 1 # Number of processes in parallel in the processing stage\n", + "\n", + " features: [\"r\", \"phi\", \"z\"] # Name of the features to use\n", + " feature_means: [18., 0.0, 281.0] # Means for normalising the features\n", + " feature_scales: [9.75, 1.82, 287.0] # Scales for normalising the features\n", + "\n", + " # List of the columns to keep in the PyTorch batches, in the dataframe of hits\n", + " # Here the columns `x`, `y` and `z` are renamed `un_x`, `un_y` and `un_z`.\n", + " kept_hits_columns: [\"plane\", {\"un_x\": \"x\"}, {\"un_y\": \"y\"}, {\"un_z\": \"z\"}]\n", + " # List of columns in the dataframe of particles that are merged to the dataframe\n", + " # of hits and stored in the PyTorch batches\n", + " kept_particles_columns: [\"nhits_velo\"]\n", + "\n", + " n_train_events: 5000 # Number of training events\n", + " n_val_events: 500 # Number of validation events\n", + " split_seed: 0 # Seed used for the splitting train-val\n", + "\n", + " # How the true edges are computed\n", + " # - sortwise: sort by z\n", + " # - modulewise: sort by distance to production vertex\n", + " # - planewise: hits belonging to same particle and belonging to adjacent planes\n", + " true_edges_column: planewise\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "381a9c9b", + "metadata": {}, + "source": [ + "To run the processing:\n", + "\n", + "```python\n", + "from Processing.run_processing import run_processing_from_config\n", + "run_processing_from_config(CONFIG, reproduce=False)\n", + "```\n", + "and to run the processing of the test samples:\n", + "```python\n", + "for required_test_dataset_name in get_required_test_dataset_names(CONFIG):\n", + " run_preprocessing_test_dataset(\n", + " test_dataset_name=required_test_dataset_name,\n", + " reproduce=False,\n", + " )\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "081e8579-4ae3-4fc8-a60b-137850db9fda", + "metadata": {}, + "source": [ + "Let's have a look at the output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6ca2bb89-75b9-4ce7-9c35-3d3746156895", + "metadata": {}, + "outputs": [], + "source": [ + "output_dir = config[\"processing\"][\"output_dir\"]\n", + "\n", + "print(\"Output dir:\", output_dir)\n", + "\n", + "!ls -1 {output_dir}" + ] + }, + { + "cell_type": "markdown", + "id": "b08d635e", + "metadata": {}, + "source": [ + "The file `splitting.json` contains information about which events belong to the\n", + "train sample, and which events belong to the validation sample." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27ecf423", + "metadata": {}, + "outputs": [], + "source": [ + "!head {output_dir}/splitting.json" + ] + }, + { + "cell_type": "markdown", + "id": "01310a6b", + "metadata": {}, + "source": [ + "The `train` and `val` directories contain the processed event files." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "86e2ef60", + "metadata": {}, + "outputs": [], + "source": [ + "!ls -1 {output_dir}/val | head -10" + ] + }, + { + "cell_type": "markdown", + "id": "84b0dde5", + "metadata": {}, + "source": [ + "The `test` folder contains the same files for the various test samples."
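If you want to check how many events ended up in each partition, a small sketch like the following can be used; it assumes `splitting.json` maps each partition name to a list of event identifiers.

```python
# Count the events assigned to each partition in splitting.json.
import json
import os.path as op

with open(op.join(output_dir, "splitting.json")) as stream:
    splitting = json.load(stream)

print({partition: len(events) for partition, events in splitting.items()})
```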
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "85287986", + "metadata": {}, + "outputs": [], + "source": [ + "!ls {output_dir}/test" + ] + }, + { + "cell_type": "markdown", + "id": "65a246d1", + "metadata": {}, + "source": [ + "Let's try to open an event file from the validation set." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e3f75a5", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "import torch\n", + "\n", + "first_val_path = next(iter(os.scandir(op.join(output_dir, \"val\")))).path\n", + "print(\"Opening\", first_val_path)\n", + "event = torch.load(first_val_path)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45285ab8", + "metadata": {}, + "outputs": [], + "source": [ + "for key, description in {\n", + " \"x\": \"Hit features\",\n", + " \"plane\": \"Plane index of each hit\",\n", + " \"signal_true_edges\": \"True edge indices of the graph\",\n", + " \"particle_id_hit_idx\": \"Allows re-building the dataframe of hits-particles.\",\n", + " \"un_x\": \"Unnormalised x-coordinates of the hits\",\n", + " \"un_y\": \"Unnormalised y-coordinates of the hits\",\n", + " \"un_z\": \"Unnormalised z-coordinates of the hits\",\n", + " \"unique_particle_id\": \"Unique particle ids in the event\",\n", + " \"particle_nhits_velo\": \"Number of velo hits for the particles in `unique_particle_id`\",\n", + "}.items():\n", + " key_str = f'\"{key}\"'\n", + " print(\n", + " f'{f\"batch[{key_str}]:\":<30}', f\"{str(event[key].shape):<20}\", \"-\", description\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "fe8010da", + "metadata": {}, + "source": [ + "The columns `un_x`, `un_y`, `un_z` were specified in `kept_hits_columns`.<br>\n", + "The column `particle_nhits_velo` (that comes with `unique_particle_id`) was specified\n", + "in `kept_particles_columns`.\n", + "\n", + "The goal is eventually not to rely on the preprocessed samples anymore.\n", + "However, instead of reproducing the processed samples (and the samples\n", + "of the subsequent steps), it is sometimes less time-consuming to load the preprocessed\n", + "file directly.\n", + "For this reason, the element `truncated_path` allows easy access to the preprocessed\n", + "file of the corresponding event."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6bd4adcf", + "metadata": {}, + "outputs": [], + "source": [ + "truncated_path = event[\"truncated_path\"]\n", + "print(\"truncated_path:\", truncated_path)\n", + "print(f\"$ls {truncated_path}*\")\n", + "!ls {truncated_path}*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0515ae52", + "metadata": {}, + "outputs": [], + "source": [ + "df_hits_particles = pd.read_parquet(truncated_path + \"-hits_particles.parquet\")\n", + "df_hits_particles" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} -- GitLab From 87ace3462bec0a079de7da1af071e605f696f38a Mon Sep 17 00:00:00 2001 From: Anthony Correia <anthony.correia@cern.ch> Date: Tue, 13 Feb 2024 09:23:34 +0100 Subject: [PATCH 07/11] Write section about ModelBase --- readme/tutorial/03_training.ipynb | 260 ++++++++++++++++++++++++++++++ 1 file changed, 260 insertions(+) create mode 100644 readme/tutorial/03_training.ipynb diff --git a/readme/tutorial/03_training.ipynb b/readme/tutorial/03_training.ipynb new file mode 100644 index 00000000..36c1e688 --- /dev/null +++ b/readme/tutorial/03_training.ipynb @@ -0,0 +1,260 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f8f81a32-5c0f-4d01-a768-f00c42c4c5e1", + "metadata": {}, + "source": [ + "# Training" + ] + }, + { + "cell_type": "markdown", + "id": "9742e077", + "metadata": {}, + "source": [ + "Let's load the pipeline configuration once again." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "85933e74", + "metadata": {}, + "outputs": [], + "source": [ + "import os.path as op\n", + "from utils.commonutils.config import cdirs, load_config\n", + "config_path = op.join(cdirs.repository, \"etx4velo\", \"pipeline_configs\", \"example.yaml\")\n", + "config = load_config(config_path)\n" + ] + }, + { + "cell_type": "markdown", + "id": "91da5888", + "metadata": {}, + "source": [ + "## `ModelBase`" + ] + }, + { + "cell_type": "markdown", + "id": "f4b8b3ac-8ac1-4a66-a593-b081d57df75a", + "metadata": {}, + "source": [ + "Every model in this repository inherits from `ModelBase`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80c91afc-36fd-4562-8efe-6a47ad007f6a", + "metadata": {}, + "outputs": [], + "source": [ + "from utils.modelutils.basemodel import ModelBase\n", + "model = ModelBase(\n", + " hparams={\n", + " \"input_dir\": config[\"processing\"][\"output_dir\"]\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "05da2cfe", + "metadata": {}, + "source": [ + "The `trainset` and `valset` are loaded on access:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0a1fef10-07d5-437f-af4b-c36c790ab048", + "metadata": {}, + "outputs": [], + "source": [ + "trainset = model.trainset\n", + "trainset" + ] + }, + { + "cell_type": "markdown", + "id": "b2f820d5", + "metadata": {}, + "source": [ + "They can also be loaded using the `load_partition` method."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a0604eb", + "metadata": {}, + "outputs": [], + "source": [ + "model._trainset = None # let's unload the trainset\n", + "model.load_partition(\"train\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ebab5e9d", + "metadata": {}, + "outputs": [], + "source": [ + "# The trainset is already loaded\n", + "trainset = model.trainset" + ] + }, + { + "cell_type": "markdown", + "id": "6175db90", + "metadata": {}, + "source": [ + "However, the `trainset` can be particularly large, so it is not always worth loading\n", + "it entirely. In this case, `lazy` can be turned to `True`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2c8277b", + "metadata": {}, + "outputs": [], + "source": [ + "model.hparams[\"lazy\"] = True\n", + "model._trainset = None # let's unload the trainset\n", + "trainset = model.trainset # and load it again\n", + "trainset" + ] + }, + { + "cell_type": "markdown", + "id": "cac5a6ad", + "metadata": {}, + "source": [ + "Now, the `trainset` is an instance of `LazyDatasetBase` that inherits from \n", + "the `torch.utils.data.Dataset` class.\n", + "Events are only loaded when accessed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1b5a968b", + "metadata": {}, + "outputs": [], + "source": [ + "from utils.loaderutils.dataiterator import LazyDatasetBase\n", + "\n", + "\n", + "assert isinstance(trainset, LazyDatasetBase)\n", + "print(\"Let's access\", trainset.input_paths[0])\n", + "event = trainset[0]\n", + "print(event)" + ] + }, + { + "cell_type": "markdown", + "id": "81a3fc78", + "metadata": {}, + "source": [ + "This only regards the `trainset`. The validation sample is still loaded entirely." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "08370472", + "metadata": {}, + "outputs": [], + "source": [ + "valset = model.valset" + ] + }, + { + "cell_type": "markdown", + "id": "4e850235", + "metadata": {}, + "source": [] + }, + { + "cell_type": "markdown", + "id": "f9c64fc7", + "metadata": {}, + "source": [ + "The `testset` is not loaded automatically, because there might be more than\n", + "one testset."
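With the `testset` property introduced earlier in this patch series, accessing it before loading raises a `ValueError`. A small sketch to display the message without interrupting the notebook:

```python
# Accessing the test set before loading it raises a ValueError (see ModelBase.testset).
try:
    model.testset
except ValueError as error:
    print("As expected:", error)
```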
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1eb43c63", + "metadata": {}, + "outputs": [], + "source": [ + "model.testset" + ] + }, + { + "cell_type": "markdown", + "id": "81e7b550", + "metadata": {}, + "source": [ + "You can use the very same method `load_partition` to load it" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e515a883", + "metadata": {}, + "outputs": [], + "source": [ + "model.load_partition(\"minbias-sim10b-xdigi_v2.4_1496\")\n", + "model.testset" + ] + }, + { + "cell_type": "markdown", + "id": "752e0a16", + "metadata": {}, + "source": [ + "## Load Models" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fff67f55", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} -- GitLab From d888417fa721ac171b49b43c260283b8d1ef060e Mon Sep 17 00:00:00 2001 From: Anthony Correia <anthony.correia@cern.ch> Date: Tue, 13 Feb 2024 09:26:24 +0100 Subject: [PATCH 08/11] Fix full_pipeline.ipynb --- etx4velo/notebooks/full_pipeline.ipynb | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/etx4velo/notebooks/full_pipeline.ipynb b/etx4velo/notebooks/full_pipeline.ipynb index b2baec61..340b4b72 100644 --- a/etx4velo/notebooks/full_pipeline.ipynb +++ b/etx4velo/notebooks/full_pipeline.ipynb @@ -46,8 +46,8 @@ "from Embedding.embedding_plots import plot_best_performances_squared_distance_max\n", "\n", "from scripts.train_model import train_model\n", - "from scripts.embedding_run import run as run_embedding_inference\n", - "from scripts.track_building import build as build_track_candidates\n", + "from scripts.build_graph_using_embedding import run as run_embedding_inference\n", + "from scripts.build_tracks import build as build_track_candidates\n", "\n", "from utils.plotutils import performance_mpl as perfplot_mpl\n", "from utils.commonutils.ctests import get_required_test_dataset_names\n", @@ -148,7 +148,6 @@ "for required_test_dataset_name in get_required_test_dataset_names(CONFIG):\n", " run_preprocessing_test_dataset(\n", " test_dataset_name=required_test_dataset_name,\n", - " path_or_config_test=\"../evaluation/test_samples.yaml\",\n", " reproduce=False,\n", " )\n" ] @@ -1096,7 +1095,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.10.12" }, "vscode": { "interpreter": { -- GitLab From d8362ec4b4d412df8eb3796be07bf389640b42c1 Mon Sep 17 00:00:00 2001 From: Anthony Correia <anthony.correia@cern.ch> Date: Tue, 13 Feb 2024 09:29:44 +0100 Subject: [PATCH 09/11] Fix import of `compare_etx4velo_vs_allen` --- etx4velo/notebooks/full_pipeline.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/etx4velo/notebooks/full_pipeline.ipynb b/etx4velo/notebooks/full_pipeline.ipynb index 340b4b72..2d4e3cad 100644 --- a/etx4velo/notebooks/full_pipeline.ipynb +++ b/etx4velo/notebooks/full_pipeline.ipynb @@ -1036,7 +1036,7 @@ "metadata": {}, "outputs": [], "source": [ - "from scripts.evaluate_allen import compare_etx4velo_vs_allen\n" + "from scripts.evaluation.compare_allen_vs_etx4velo import 
compare_etx4velo_vs_allen\n" ] }, { -- GitLab From fcabfd4baa928bb2c4fecabc35892768e83abf9e Mon Sep 17 00:00:00 2001 From: Anthony Correia <anthony.correia@cern.ch> Date: Tue, 13 Feb 2024 09:43:30 +0100 Subject: [PATCH 10/11] Improve example.yaml --- etx4velo/pipeline_configs/example.yaml | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/etx4velo/pipeline_configs/example.yaml b/etx4velo/pipeline_configs/example.yaml index 29b52c27..308f194d 100644 --- a/etx4velo/pipeline_configs/example.yaml +++ b/etx4velo/pipeline_configs/example.yaml @@ -1,9 +1,10 @@ common: - experiment_name: example + experiment_name: example # Optional: this is automatically set to the name of the config file # Name of the test datasets to use (defined in `evaluation/test_samples.yaml`) test_dataset_names: - minbias-sim10b-xdigi_v2.4_1496 - minbias-sim10b-xdigi_v2.4_1498 + detector: velo # default to the first entry in `detectors` in `common_config.yaml` preprocessing: input_dir: /scratch/acorreia/minbias-sim10b-xdigi_subset @@ -59,7 +60,7 @@ processing: # - planewise: hits belonging to same particle and belonging to adjacent planes true_edges_column: planewise -metric_learning: +embedding: # Dataset parameters input_subdirectory: "processed" output_subdirectory: "embedding_processed" @@ -78,14 +79,26 @@ metric_learning: emb_hidden: 128 # Number of hidden units / layer in the Dense Neural Network nb_layer: 3 # Number of layers emb_dim: 4 # Embedding dimension - activation: Tanh # Action function used in the MLP - weight: 3 # Weight for positive examples + activation: Tanh # Activation function used in the MLP + weight: 6 # Weight for positive examples + + # Requirement to apply to all particles in the training and validation samples + particle_requirement: null + # Requirement to apply to the particles of the query points + # in the training and validation samples + query_particle_requirement: "(abs(pid) != 11) and has_velo and (((eta > -5) and (eta < -2)) or ((eta > 2) and (eta < 5)))" + # Requirement that defines the target particles in the training and validation samples + # Can be used to put more weight on the target particles. + # It was finally deemed unecessary to use this parameter. + target_requirement: null + # non_target_weight: 0.05 # Weight for non-target particles in the loss + # target_weight: 0.05 # Weight for target particles in the loss # Available regimes # - rp: random pairs # - hnm: hard negative mining # - norm: perform L2 normalisation - regime: [rp, hnm, norm] + regime: [rp, hnm] randomisation: 1 # Number of random pairs per hit points_per_batch: 100000 # Number of query points to consider squared_distance_max: 0.015 # Maximum distance for hard-negative mining during training @@ -131,6 +144,8 @@ gnn: # The GNN that is used. Switch to another GNN (such as the default `interaction`) # might not work properly. 
model: triplet_interaction + edge_checkpointing: True + triplet_checkpointing: True # Minimal edge score used to filter out fake edges before building the triplets edge_score_cut: 0.5 -- GitLab From d4d2f4651b0f598bdb3a4cd50fc110b8726c8787 Mon Sep 17 00:00:00 2001 From: Anthony Correia <anthony.correia@cern.ch> Date: Tue, 13 Feb 2024 09:43:41 +0100 Subject: [PATCH 11/11] Finish section about training --- readme/tutorial/03_training.ipynb | 98 ++++++++++++++++++++++++++++++- 1 file changed, 96 insertions(+), 2 deletions(-) diff --git a/readme/tutorial/03_training.ipynb b/readme/tutorial/03_training.ipynb index 36c1e688..428ed19d 100644 --- a/readme/tutorial/03_training.ipynb +++ b/readme/tutorial/03_training.ipynb @@ -224,7 +224,49 @@ "id": "752e0a16", "metadata": {}, "source": [ - "## Load Models" + "## Load and Train Models" + ] + }, + { + "cell_type": "markdown", + "id": "bf71c693", + "metadata": {}, + "source": [ + "The ETX4VELO repository provides a central way for instantiating, loading and\n", + "training models.\n", + "\n", + "For instance, let's consider the GNN for a moment. Various types of GNNs are defined\n", + "in the repository for exploration purposes. To load the class corresponding\n", + "to the correct GNN model, you may use the `get_model` function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "72cc159b", + "metadata": {}, + "outputs": [], + "source": [ + "from pipeline import get_model\n", + "GNNModel = get_model(config_path, step=\"gnn\")\n", + "GNNModel" + ] + }, + { + "cell_type": "markdown", + "id": "0b0a7fd9", + "metadata": {}, + "source": [ + "This function calls the embedding and GNN `get_model` functions located in\n", + "`pipeline/Embedding/models/__init__.py` and `pipeline/GNN/models/__init__.py`." + ] + }, + { + "cell_type": "markdown", + "id": "ef4fa3b2", + "metadata": {}, + "source": [ + "You can instantiate a model in order to train it using `instantiate_model_for_training`" ] }, { @@ -233,7 +275,59 @@ "id": "fff67f55", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "from pipeline import instantiate_model_for_training\n", + "embedding_model = instantiate_model_for_training(config_path, step=\"embedding\")\n", + "embedding_model" + ] + }, + { + "cell_type": "markdown", + "id": "6fc1690e", + "metadata": {}, + "source": [ + "Finally, you can load a trained model using `load_trained_model`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7af48a33", + "metadata": {}, + "outputs": [], + "source": [ + "from pipeline import load_trained_model\n", + "embedding_model = load_trained_model(config_path, step=\"embedding\")\n", + "# -> ready for testing!" + ] + }, + { + "cell_type": "markdown", + "id": "c1658f85", + "metadata": {}, + "source": [ + "To train a model, you can use the function made available in the\n", + "`scripts/train_model.py` script." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a9fcfd4d", + "metadata": {}, + "outputs": [], + "source": [ + "from scripts.train_model import train_model\n", + "train_model(config_path, step=\"embedding\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "c6fdf1e8", + "metadata": {}, + "source": [ + "You're now ready to run any training you want!" + ] } ], "metadata": { -- GitLab
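To close the tutorial, here is a hedged sketch of how the same helper could be chained over the trainable steps of the pipeline. Whether `train_model` accepts `step="gnn"` under this exact name is an assumption based on the configuration sections shown above; check `pipeline_configs/example.yaml` for your setup.

```python
# Sketch: train the embedding and GNN steps one after the other.
from scripts.train_model import train_model

for step in ("embedding", "gnn"):  # step names assumed from pipeline_configs/example.yaml
    train_model(config_path, step=step)
```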