New SIP structure and sip.json

Antonio Vivace requested to merge sip-updates into develop

Should work for any invenio1.x job, but the changes are quite disruptive and all the other pipelines will need to be adapted.

Implements everything we decided on the new SIP and SIP.json specs:

File schema

SOURCE

The following values relate to the upstream source that provided the file.

url

The URL where the file can be found.

E.g. https://somewebsite/assets/sachiel.png

source_filename

The complete file name from the source.

E.g.: sachiel.png

source_path

The most complete path available on the file system where the file originally resided.

E.g.: /home/users/Antonio/Documents/archive/Pictures

Since this can expose private information, the "excludedata" flag of the local pipeline can be used to omit this value.

JY Comment: excludeData (or excludePath) could take an argument to specify the part to be ignored, because the remaining part (on the right) contains some potential 'metadata' e.g.: --excludePath /home/users/Antonio/
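As a rough illustration of the suggestion above, the following sketch strips a given leading part from source_path and keeps the remainder. The function name strip_excluded_prefix and its behaviour when the prefix does not match are assumptions made for this example; they are not part of the current pipeline.

```python
from pathlib import PurePosixPath
from typing import Optional


def strip_excluded_prefix(source_path: str, excluded: Optional[str]) -> str:
    """Hypothetical helper: drop the `excluded` leading part of `source_path`.

    If no prefix is given, or the prefix is not a parent of the path,
    the path is returned unchanged (an assumption of this sketch).
    """
    if not excluded:
        return source_path
    try:
        # relative_to raises ValueError when `excluded` is not a parent.
        return str(PurePosixPath(source_path).relative_to(PurePosixPath(excluded)))
    except ValueError:
        return source_path


print(strip_excluded_prefix(
    "/home/users/Antonio/Documents/archive/Pictures",
    "/home/users/Antonio/",
))
```

With the example values from above, the private part /home/users/Antonio/ is dropped while Documents/archive/Pictures, which may still carry useful context, is kept.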

path

[JY: could be renamed to relativeSourcePath]

The path, relative to the resource folder, in which the file is to be found. If empty ("") it means the file is at the root.

E.g. Pictures/

Be aware that a "" value is different from None or null here: "" still states that we know the file is in the root folder, whereas null means we don't have that information.
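A minimal sketch of how a consumer of sip.json might keep the two cases apart. The helper name describe_location is hypothetical; the point is to test for None explicitly instead of relying on truthiness, since both "" and None are falsy in Python.

```python
from typing import Optional


def describe_location(path: Optional[str]) -> str:
    """Hypothetical helper that interprets the `path` value of a file entry."""
    if path is None:
        # null in sip.json: the relative location is not known
        return "location unknown"
    if path == "":
        # empty string: the file is known to sit at the resource root
        return "resource root"
    return f"subfolder {path}"


print(describe_location(""))
print(describe_location(None))
print(describe_location("Pictures/"))
```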

BAG

These values describe how to find the file in the SIP payload.

bag_fullpath

The path, relative to the Bag, where the file can be found, including the file name.

E.g.: data/content/sachiel.png.

It is not required to match the original folder structure or file name. E.g. a SIP could be created by flattening the directory structure and renaming files:

Original folder:

Pictures
    > sachiel.png
    > thumbnails:
        > sachiel.png

SIP:

sip::local::pictures
    > ...
    > data
        > content
            > sachiel.png
            > sachiel_thumb.png
        > meta
            > sip.json        

The original file names and directory structure can always be reconstructed with the other information in the SIP manifest.
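The reconstruction could look roughly like the sketch below. It assumes that Pictures is the resource folder in the example above, that the manifest lists one entry per file with the source_filename, path and bag_fullpath keys described in this document, and that the helper name original_relative_path is free to choose; none of this is taken from the actual implementation.

```python
from pathlib import PurePosixPath
from typing import Optional


def original_relative_path(entry: dict) -> Optional[str]:
    """Hypothetical helper: rebuild the original location of a file,
    relative to the resource folder, from its manifest entry."""
    path = entry.get("path")
    if path is None:
        # Relative location unknown: nothing to reconstruct.
        return None
    return str(PurePosixPath(path) / entry["source_filename"])


# Illustrative entries for the flattened SIP shown above.
files = [
    {"source_filename": "sachiel.png", "path": "",
     "bag_fullpath": "data/content/sachiel.png"},
    {"source_filename": "sachiel.png", "path": "thumbnails/",
     "bag_fullpath": "data/content/sachiel_thumb.png"},
]

# Map each file in the Bag back to where it originally lived.
mapping = {f["bag_fullpath"]: original_relative_path(f) for f in files}
for bag_path, original in mapping.items():
    print(bag_path, "->", original)
```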

Modifications to the SIP.json - meeting on 30.09.2021

  • remove the "remote" key
  • in the tool section, rename url into webSite
  • make url an array, to allow storing multiple pointers to the same file, e.g. http, xrootd, ftp...
  • add the key bagId with value source::recid::timestamp
    • For local source, recid = uuencoded(foldername)-aRandomNumber
  • add pointers to the SIP.json schema and to the SIP specification snapshots
  • add the fileHash and digestAlgorithm keys for each file
  • add a message key which can optionally contain a text explaining the rationale for creating the bag
  • add a user entity, OCFL-like:
"user": {
        "address": "mailto:somebody-else@example.org",
        "name": "Somebody Else",
        "orcid": "CERN-234loq8f87wrq",
        "CERNId": "23678"
      }
  • rename inside the bag the subdirectory data/content into data/payloads (still under discussion)
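To make two of the points above concrete, here is a minimal sketch of how the bagId value and the per-file fileHash and digestAlgorithm keys could be produced. The function names, the use of Unix epoch seconds for the timestamp, and sha256 as the default algorithm are assumptions of this example, not decisions taken in the meeting. The recid is taken as an input, since its encoding for local sources is defined separately.

```python
import hashlib
import time
from typing import Optional


def make_bag_id(source: str, recid: str, timestamp: Optional[int] = None) -> str:
    """Build a bagId of the form source::recid::timestamp.

    The timestamp format (Unix epoch seconds) is an assumption.
    """
    if timestamp is None:
        timestamp = int(time.time())
    return f"{source}::{recid}::{timestamp}"


def digest_entry(data: bytes, algorithm: str = "sha256") -> dict:
    """Return the fileHash and digestAlgorithm keys for a file's content.

    `algorithm` is any name accepted by hashlib.new; sha256 is only a default
    picked for this sketch.
    """
    return {
        "fileHash": hashlib.new(algorithm, data).hexdigest(),
        "digestAlgorithm": algorithm,
    }


print(make_bag_id("local", "pictures-42", 1632960000))
print(digest_entry(b"example file content"))
```

Storing the algorithm name next to each hash keeps every entry verifiable on its own, even if different files in the same bag were hashed with different algorithms.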
Edited by Antonio Vivace