New SIP structure and sip.json
Should work for any invenio1.x job, but the changes are quite disruptive and all the other pipelines will need to be adapted.
Implements everything we decided on for the new SIP and SIP.json specs:
File schema
SOURCE
The following values are related to the upstream source that provided the file
url
The URL where the file can be found.
E.g. https://somewebsite/assets/sachiel.png
source_filename
The complete file name from the source.
E.g.: sachiel.png
source_path
The most complete path available on the file system the file originally came from.
E.g.: /home/users/Antonio/Documents/archive/Pictures
Since this can expose private information, the "excludedata" flag of the local pipeline can be used to omit this value.
JY Comment: excludeData (or excludePath) could take an argument to specify the part to be ignored, because the remaining part (on the right) contains some potential 'metadata' e.g.:
--excludePath /home/users/Antonio/
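A minimal sketch of what JY's suggestion could look like, assuming the argument is a plain path prefix. `strip_excluded_prefix` is a hypothetical helper written for illustration; it is not an existing option or function of the pipeline.

```python
from pathlib import PurePosixPath
from typing import Optional


def strip_excluded_prefix(source_path: str, excluded: Optional[str]) -> str:
    """Drop a leading `excluded` prefix from `source_path`, if present.

    Hypothetical helper modelling the proposed `--excludePath` argument.
    """
    if not excluded:
        # Nothing to exclude: keep the full path.
        return source_path
    try:
        # relative_to() raises ValueError when `excluded` is not a parent.
        remainder = PurePosixPath(source_path).relative_to(PurePosixPath(excluded))
    except ValueError:
        return source_path
    # An empty remainder is rendered as "."; map it to "".
    return "" if str(remainder) == "." else str(remainder)


print(strip_excluded_prefix(
    "/home/users/Antonio/Documents/archive/Pictures",
    "/home/users/Antonio/",
))
```

With the example above, only the part to the right of the excluded prefix (Documents/archive/Pictures) would be kept in source_path.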
path
[JY: could be renamed to relativeSourcePath]
The folder, relative to the resource folder, in which the file is to be found. If empty (""), the file is at the root.
E.g. Pictures/
Be aware that a "" value is different from None or null here: "" affirms that we know the file is in the root folder, whereas null means we don't have that information.
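For consumers of sip.json, this means a plain truthiness check on path is not enough. A small sketch of the distinction (`describe_location` is a hypothetical name, used only for illustration):

```python
from typing import Optional


def describe_location(path: Optional[str]) -> str:
    """Tell apart "file is at the root" from "location not known"."""
    if path is None:
        # null in sip.json: we have no information about the folder.
        return "location unknown"
    if path == "":
        # Empty string: we know the file sits in the resource root.
        return "resource root"
    return f"subfolder {path}"


print(describe_location(""))
print(describe_location(None))
print(describe_location("Pictures/"))
```

Note that `if not path:` would treat "" and None the same way, which is exactly the confusion to avoid.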
BAG
These values are informative on how to find the file in the SIP payload.
bag_fullpath
The path, relative to the Bag, where the file can be found, including the file name.
E.g.: data/content/sachiel.png
It's not required that it matches the original folder structure or filename. E.g. a SIP could be created by flattening the directory structure and renaming files:
Original folder:
Pictures
  > sachiel.png
  > thumbnails:
    > sachiel.png
SIP:
sip::local::pictures
  > ...
  > data
    > content
      > sachiel.png
      > sachiel_thumb.png
    > meta
      > sip.json
The original file names and directory structure can always be reconstructed with the other information in the SIP manifest.
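A rough sketch of that reconstruction, assuming the manifest exposes one entry per file with the source_filename, path and bag_fullpath keys described above. The list-of-dicts layout, the sample values (taking Pictures as the resource folder) and the `original_layout` helper are assumptions for illustration, not the actual sip.json structure.

```python
from pathlib import PurePosixPath

# Illustrative entries mirroring the flattened example; not real sip.json output.
files = [
    {
        "source_filename": "sachiel.png",
        "path": "",
        "bag_fullpath": "data/content/sachiel.png",
    },
    {
        "source_filename": "sachiel.png",
        "path": "thumbnails/",
        "bag_fullpath": "data/content/sachiel_thumb.png",
    },
]


def original_layout(entries):
    """Map each bag_fullpath to the file's original relative location."""
    layout = {}
    for entry in entries:
        if entry["path"] is None:
            # Original folder unknown: nothing to reconstruct for this file.
            continue
        original = PurePosixPath(entry["path"]) / entry["source_filename"]
        layout[entry["bag_fullpath"]] = str(original)
    return layout


for bag_path, original in original_layout(files).items():
    print(f"{bag_path} <- {original}")
```

Both files in the bag are called differently, but each one maps back to its own original folder and name.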
Modifications to the SIP.json - meeting on 30.09.2021
- remove the "remote" key
- in the tool section, rename url into webSite
- url as an array, to enable storing multiple pointers to the same file, e.g. http, xrootd, ftp...
- add the key bagId with value source::recid::timestamp
  - For local source, recid = uuencoded(foldername)-aRandomNumber
  - For
- add pointers to the SIP.json schema and to the SIP specification snapshots
- add for each file its fileHash and digestAlgorithm keys
- add a message key which can optionally contain a text explaining the rationale for creating the bag
- add a user entity, OCFL-like:
  "user": {
    "address": "mailto:somebody-else@example.org",
    "name": "Somebody Else",
    "orcid": "CERN-234loq8f87wrq",
    "CERNId": "23678"
  }
- rename inside the bag the subdirectory data/content into data/payloads (still under discussion)
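To make the bagId and per-file hash items more concrete, here is a minimal sketch. It assumes the timestamp is a Unix epoch in seconds and that recid is computed elsewhere; neither is fixed by the notes above. `make_bag_id` and `file_digest_fields` are hypothetical names, not functions of the pipeline.

```python
import hashlib
import time
from typing import Optional


def make_bag_id(source: str, recid: str, timestamp: Optional[int] = None) -> str:
    """Build a bagId of the form source::recid::timestamp.

    Assumption: the timestamp is a Unix epoch in seconds.
    """
    ts = int(time.time()) if timestamp is None else timestamp
    return f"{source}::{recid}::{ts}"


def file_digest_fields(content: bytes, algorithm: str = "sha256") -> dict:
    """Return the fileHash and digestAlgorithm keys for one file."""
    return {
        "fileHash": hashlib.new(algorithm, content).hexdigest(),
        "digestAlgorithm": algorithm,
    }


print(make_bag_id("local", "pictures-42", 1632990000))
print(file_digest_fields(b"example payload"))
```

Storing digestAlgorithm next to fileHash keeps each file entry self-describing, so the algorithm can change later without ambiguity.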