Indexing of transcripts
We need to start indexing every VTT file produced by TLP that belongs to the Weblecture Service.
Every document sent to cernsearch should look like [1]. How externals are taken into account can be checked here: https://gitlab.cern.ch/webcast/cernsearch-asr/-/blob/master/services/cernsearch.py#L126
Before putting this feature into production we need to take a snapshot of the current state of the captions and index the legacy ones. That way, once the feature is active, we do not have to worry about the old ones. If needed, cernsearch-asr can be used for ad-hoc jobs, e.g. deleting or adding a particular VTT.
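Both the live indexing and the legacy snapshot first need each VTT file split into cues. The following is a minimal sketch of that step, assuming one document is indexed per cue, as the begin_hour/begin_min/begin_sec fields in [1] suggest. The names `Cue` and `iter_cues` are hypothetical and are not taken from cernsearch-asr.

```python
import re
from typing import Iterator, NamedTuple

# WebVTT cue timing line, e.g. "00:04:31.000 --> 00:04:35.500".
# The hours component is optional in WebVTT, so it is optional here too.
_TIMING = re.compile(r"^(?:(\d{2,}):)?(\d{2}):(\d{2})\.\d{3}\s+-->\s+")


class Cue(NamedTuple):
    """Start time (as zero-padded strings) and text of one cue."""
    begin_hour: str
    begin_min: str
    begin_sec: str
    text: str


def iter_cues(vtt: str) -> Iterator[Cue]:
    """Yield one Cue per cue block; header and NOTE blocks are skipped."""
    # Blocks are separated by blank lines; normalise line endings first.
    blocks = re.split(r"\n\s*\n", vtt.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _TIMING.match(line)
            if match is None:
                continue  # cue identifier, "WEBVTT" header, NOTE text, ...
            hour, minute, second = match.groups()
            # Everything after the timing line is the cue payload.
            text = " ".join(lines[i + 1:]).strip()
            if text:
                yield Cue(hour or "00", minute, second, text)
            break
```

Note that this sketch joins multi-line cues with a space and strips surrounding whitespace, whereas the mtext value in [1] keeps a trailing newline; which form cernsearch expects would need to be checked.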
[1]
{
  "_access": {
    "owner": [
      "group:weblecture-service"
    ],
    "delete": [
      "group:weblecture-service"
    ],
    "update": [
      "group:weblecture-service"
    ],
    "read": [
      "group:indico-atlas-managers",
      "group:secretariat-atlas",
      "group:atlas-cb-chair",
      "group:atlas-students-group",
      "group:atlas-readaccess-active-members",
      "group:atlas-mgt-members",
      "user:goldfarb",
      "user:mdesnyde",
      "user:lamontm",
      "user:atlasip",
      "user:evelina",
      "user:awiedema",
      "user:eb65a4c8fbd2959b95f7",
      "user:joseph.m.muse@ou.edu"
    ]
  },
  "$schema": "https://asrservice-search.web.cern.ch/schemas/asrservice/vtt_ttaas_v1.0.0.json",
  "language": "en",
  "ttpmediaid": "legacy_1149016c23sc3",
  "asrtype": "vtt",
  "begin_hour": "00",
  "begin_min": "04",
  "begin_sec": "31",
  "_data": {
    "mtext": "a it seems not so thanks once again vincent let's move on\n",
    "url": "https://weblecture-player.web.cern.ch/?id=1149016c23sc3&year=2022&time=04m31s"
  },
  "su_account": "legacy"
}
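As an illustration of how a document shaped like [1] could be assembled for one cue, here is a hedged sketch. The function `build_document` and its parameters are hypothetical. Two details are assumptions inferred only from the example: that ttpmediaid is the su_account and the media id joined by an underscore, and that a non-zero hour would be prefixed to the URL time parameter as `NNh` (the example only shows the `04m31s` form). The actual logic lives in cernsearch-asr and may differ.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode

SCHEMA = "https://asrservice-search.web.cern.ch/schemas/asrservice/vtt_ttaas_v1.0.0.json"
PLAYER_URL = "https://weblecture-player.web.cern.ch/"
SERVICE_GROUP = "group:weblecture-service"


def build_document(
    media_id: str,
    year: int,
    begin_hour: str,
    begin_min: str,
    begin_sec: str,
    text: str,
    read_access: List[str],
    language: str = "en",
    su_account: str = "legacy",
) -> Dict[str, Any]:
    """Build one cernsearch document for a single cue, mirroring example [1]."""
    time_param = f"{begin_min}m{begin_sec}s"
    if begin_hour != "00":
        # Assumption: the example in [1] never shows a non-zero hour.
        time_param = f"{begin_hour}h{time_param}"
    query = urlencode({"id": media_id, "year": year, "time": time_param})
    return {
        "_access": {
            # Only the service group may own, delete or update the document.
            "owner": [SERVICE_GROUP],
            "delete": [SERVICE_GROUP],
            "update": [SERVICE_GROUP],
            # Read access is passed in by the caller ("group:..." / "user:...").
            "read": list(read_access),
        },
        "$schema": SCHEMA,
        "language": language,
        # Assumption: "<su_account>_<media id>", as in "legacy_1149016c23sc3".
        "ttpmediaid": f"{su_account}_{media_id}",
        "asrtype": "vtt",
        "begin_hour": begin_hour,
        "begin_min": begin_min,
        "begin_sec": begin_sec,
        "_data": {"mtext": text, "url": f"{PLAYER_URL}?{query}"},
        "su_account": su_account,
    }
```

Called with the values from [1] (media id `1149016c23sc3`, year 2022, start time 00:04:31), this reproduces the same ttpmediaid and url fields as the example.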