# EXPERIMENT CURRENT USE AND VIEW ON METADATA
## ARCHIVE METADATA
ATLAS started, but did not finish, a prototype with FTS to send archive metadata to tape endpoints for the purpose of the ATLAS "Data carousel".
If the destination was tape, the metadata was appended to the destination URL of each file using the standard `?key=value&...` query-string syntax. This works for any protocol: SRM, WebDAV, XRootD, etc.
This prototype was never deployed in production and was only tested in a dev environment.
The prototype used the following triplet of archive metadata:
```
collection = <name of the parent dataset(s)>
nr_files = <number of files in the parent dataset>
nr_bytes = <total size in bytes of the parent dataset>
```
Where *collection* is a purposely generic term meant to cover:
* CMS *blocks*
* ATLAS *datasets*
* LHCb *samples/tracks*
The `collection` value is a comma-separated list of strings provided for each file: for example, multiple parent datasets for ATLAS.
For example: `collection=mc15_13TeV:parent_dataset1,mc15_13TeV:parent_dataset2,user.mlassnig:my_dataset`
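As a minimal sketch of this scheme (the function name and the endpoint below are made-up for illustration; they are not part of the actual prototype), the metadata triplet could be appended to a destination URL like this:
```
from urllib.parse import urlencode

def add_archive_metadata(dst_url: str, collections: list[str],
                         nr_files: int, nr_bytes: int) -> str:
    """Append the archive metadata triplet to a tape destination URL
    using the standard ?key=value&... query-string syntax."""
    query = urlencode({
        "collection": ",".join(collections),  # comma-separated parent datasets
        "nr_files": nr_files,                 # number of files in the parent dataset
        "nr_bytes": nr_bytes,                 # total size in bytes of the parent dataset
    }, safe=":,")                             # keep ':' and ',' readable, as in the example above
    separator = "&" if "?" in dst_url else "?"
    return f"{dst_url}{separator}{query}"

print(add_archive_metadata(
    "root://eosctaatlas.cern.ch//eos/ctaatlas/archive/file1",
    ["mc15_13TeV:parent_dataset1", "mc15_13TeV:parent_dataset2"],
    nr_files=1200, nr_bytes=3_500_000_000))
```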
### Official datasets workflow
Official dataset names:
* data prep decides the names of the official datasets
* the dataset is created by T0 (Armin) in advance
* files are uploaded to T0 DISK
* Armin finds the files upon rucio registration and attaches them to the pre-created datasets
* a rucio subscription catches the dataset file-attachment event and creates/associates a rule for the files
* rucio then moves the files according to those rules
There are 2 concepts in rucio, illustrated in the sketch after this list:
* A `Subscription` is an automatic creator of rules. Rucio subscriptions are defined by ADC: a subscription takes files in a set of `scopes` (i.e. the dataset prefix before `:`) and creates rules (typically 1 per dataset for official data). See https://rucio-ui.cern.ch to list datasets.
* `Rules` express interest in a file/dataset/container (a set of datasets) at a specific location (user created).
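For illustration, a rule expressing interest in one dataset at one location can be created through the rucio client roughly as follows (the DID and the RSE expression are made-up examples; this is a sketch of the user-facing side, not the ADC subscription machinery, which creates equivalent rules automatically server-side):
```
from rucio.client.ruleclient import RuleClient

rule_client = RuleClient()

# Express interest: "keep 1 copy of this dataset at the given RSE".
# Scope/name and RSE expression below are made-up examples.
rule_ids = rule_client.add_replication_rule(
    dids=[{"scope": "mc15_13TeV", "name": "parent_dataset1"}],
    copies=1,
    rse_expression="CERN-PROD_TAPE",
)
print(rule_ids)  # list containing the id of the newly created rule
```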
Various metadata is attached to the dataset and could be passed at archival time as well:
* datatype: RAW/AOD/...
* stream: bulk/calibration/minbias/...
Another small twist: rucio cannot populate `nr_files` and `nr_bytes` before the dataset is *fully uploaded/done/closed*. As `T0 tape` is on the critical path of data taking, archiving starts before the dataset's full size is known. Even the export of RAW datasets from TZDISK to T1_TAPE is ongoing before the fill is finished.
ATLAS are happy to change *collection* to *dataset* if this is easier to understand and smooths communication.
ATLAS have voiced interest in tape storage systems at the Tier 1s receiving "Data carousel" metadata when archiving files.
The ATLAS data carousel will be the first time Analysis Object Data (AOD) files are automatically retrieved at large scale from tape in order to supply analysis jobs. In the past such retrievals were done either manually or in a semi-automatic way. The ATLAS data carousel is effectively treating tape as cheap disk.
ATLAS want to analyse the "Data carousel" archive metadata associated with the AODs they have automatically retrieved for their analysis jobs. From this study they hope to determine the potential benefits of "smart writing", which is the writing of related AODs to the same physical tape(s). If the gains are proven to be worthwhile, then ATLAS imagine a future system where files being written to tape are held back in a disk-based "archive staging" area. They are eventually written to tape in batches to ensure related AODs are stored on the same physical tape(s).
The rate of creation of AODs that will be held back and written to the same physical tape(s) is considered to be slow enough for a realistic solution to be developed/implemented. The creation rate of such AODs will not be as high as the rate at which raw data are archived to Tier 0 tape storage.
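A minimal sketch of such an "archive staging" batcher follows. The class, the threshold, and the flush logic are hypothetical, purely to illustrate holding files back per collection; a real system would also need timeouts, persistence, and failure handling:
```
from collections import defaultdict

class ArchiveStagingArea:
    """Hold files back on disk, grouped by collection, and flush a
    collection to tape once enough data has accumulated (hypothetical sketch)."""

    def __init__(self, flush_threshold_bytes: int = 500 * 1000**3):
        self.flush_threshold_bytes = flush_threshold_bytes
        self.pending = defaultdict(list)      # collection -> [(path, size), ...]
        self.pending_bytes = defaultdict(int)

    def add_file(self, collection: str, path: str, size: int) -> None:
        self.pending[collection].append((path, size))
        self.pending_bytes[collection] += size
        if self.pending_bytes[collection] >= self.flush_threshold_bytes:
            self.flush(collection)

    def flush(self, collection: str) -> None:
        batch = self.pending.pop(collection, [])
        self.pending_bytes.pop(collection, None)
        # Writing the whole batch together makes it likely that related
        # AODs end up on the same physical tape(s).
        print(f"archiving {len(batch)} files of {collection} to tape")
```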
Tier 0 should receive archive metadata/hints for:
* Tier 0 tape raw
* Data carousel
Tier 1 should receive archive metadata/hints for:
* Tier 0 export raw: T0 source is T0 disk
The use of "Tier 0 tape raw" and "Tier 0 export raw" is different from "Data carousel". The "Tier XXXX" metadata/hints are used to specify "activity shares" as already used in FTS. An activity share is a rough channel/bandwidth/highway lane reservation between two endpoints, including tape.
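As an illustration of how such shares behave (a generic weighted-selection sketch; the weights and activity names are assumptions, and this is not FTS's actual scheduler or configuration format), an activity share can be thought of as a weight used when picking which queue to serve next:
```
import random

# Hypothetical share weights per activity (not FTS's real config format).
activity_shares = {
    "T0 tape": 0.5,    # raw-data archival, highest share
    "T0 export": 0.3,  # TZDISK -> T1_TAPE export
    "Staging": 0.2,    # data carousel recalls
}

def pick_next_activity(queues: dict[str, list]) -> str:
    """Pick the activity whose queue to serve next, proportionally to its
    share, considering only activities that have queued transfers."""
    candidates = {a: w for a, w in activity_shares.items() if queues.get(a)}
    activities, weights = zip(*candidates.items())
    return random.choices(activities, weights=weights, k=1)[0]

queues = {"T0 tape": ["f1", "f2"], "Staging": ["f3"]}
print(pick_next_activity(queues))  # "T0 tape" ~71% of the time, "Staging" ~29%
```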
## RETRIEVE METADATA
In theory ATLAS can start adding retrieve metadata to each file URL of their XRootD stage-in requests. Such a change is currently under discussion. No decision has been made.
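A sketch of what that could look like, using the XRootD Python bindings (the endpoint, the paths, and the `activity` key are illustrative assumptions, not an agreed convention):
```
from XRootD import client
from XRootD.client.flags import PrepareFlags

# Tag each file of the stage-in request with retrieve metadata as an
# opaque "?key=value" query; endpoint and paths are made-up examples.
fs = client.FileSystem("root://eosctaatlas.cern.ch")
files = [
    "/eos/ctaatlas/archive/file1?activity=Staging",
    "/eos/ctaatlas/archive/file2?activity=Staging",
]
status, response = fs.prepare(files, PrepareFlags.STAGE)
print(status.ok, response)
```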
Current activity allocation:
* `Staging`: currently used by data carousel recalls AND Luc Goossens for higher priority recalls (mess)
Future activity allocation:
* `Staging`: staging activity for data carousel (T0 and T1)
* `T0 export`: TZDISK -> T1_TAPE (T0_TAPE not involved)
* `T0 tape`: T0 tape archival activity and T0 staging (Luc Goossens)
Activity shares as seen from the rucio conveyor, before FTS, can be monitored [here](http://rucio-graphite-prod.cern.ch:3000).
## Some tickets and presentations from the past
* [Pass user metadata to storage systems](https://its.cern.ch/jira/browse/FTS-1284)
* [The ATLAS Data Carousel Project](https://indico.cern.ch/event/773049/contributions/3474425/attachments/1937449/3211192/atlas-data-carousel-chep19-xin.pdf)
* [Smart writing presentation](https://gitlab.cern.ch/cta/CTA/-/blob/master/doc/Presentations/20191205_Atlas_week/2019_12_05_CTA_smart_writing.pdf)