Indexing of transcripts

We need to start indexing every VTT file produced by TLP that belongs to the Weblecture Service.
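As a rough illustration of what indexing a VTT involves, the sketch below splits a WebVTT file into cues and extracts the start time in the same hour/minute/second pieces used by the fields in [1]. It uses only the standard library. The names `Cue` and `iter_cues` are hypothetical and are not taken from TLP or cernsearch-asr; the trailing newline on the cue text is an assumption based on the `mtext` value in [1].

```python
import re
from typing import Iterator, NamedTuple

# WebVTT timing line: "[hh:]mm:ss.ttt --> [hh:]mm:ss.ttt", optionally followed by cue settings.
TIMING_RE = re.compile(
    r"^(?:(\d{2,}):)?(\d{2}):(\d{2})\.\d{3}\s+-->\s+(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
)


class Cue(NamedTuple):
    """Hypothetical container for one cue; field names mirror the document in [1]."""
    begin_hour: str
    begin_min: str
    begin_sec: str
    text: str


def iter_cues(vtt: str) -> Iterator[Cue]:
    """Yield one Cue per cue block; blocks without a timing line (header, NOTE) are skipped."""
    blocks = re.split(r"\n{2,}", vtt.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING_RE.match(line)
            if match:
                # The hour part is optional in WebVTT; default it to "00".
                hour = match.group(1) or "00"
                # Assumption: keep a trailing newline, as in the "mtext" example.
                text = "\n".join(lines[i + 1:]) + "\n"
                yield Cue(hour, match.group(2), match.group(3), text)
                break
```

Each yielded cue would then correspond to one document sent to cernsearch.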

Every document sent to cernsearch should look something like [1]. How externals are taken into account can be checked here: https://gitlab.cern.ch/webcast/cernsearch-asr/-/blob/master/services/cernsearch.py#L126

Before putting this feature into production we need to take a snapshot of the current state of the captions and index the legacy ones, so that from the moment the feature is active we don't have to worry about the old ones. If needed, cernsearch-asr can be used for ad-hoc jobs, e.g. deleting or adding a particular VTT.

[1]

{
  "_access": {
    "owner": [
      "group:weblecture-service"
    ],
    "delete": [
      "group:weblecture-service"
    ],
    "update": [
      "group:weblecture-service"
    ],
    "read": [
      "group:indico-atlas-managers",
      "group:secretariat-atlas",
      "group:atlas-cb-chair",
      "group:atlas-students-group",
      "group:atlas-readaccess-active-members",
      "group:atlas-mgt-members",
      "user:goldfarb",
      "user:mdesnyde",
      "user:lamontm",
      "user:atlasip",
      "user:evelina",
      "user:awiedema",
      "user:eb65a4c8fbd2959b95f7",
      "user:joseph.m.muse@ou.edu"
    ]
  },
  "$schema": "https://asrservice-search.web.cern.ch/schemas/asrservice/vtt_ttaas_v1.0.0.json",
  "language": "en",
  "ttpmediaid": "legacy_1149016c23sc3",
  "asrtype": "vtt",
  "begin_hour": "00",
  "begin_min": "04",
  "begin_sec": "31",
  "_data": {
    "mtext": "a it seems not so thanks once again vincent let's move on\n",
    "url": "https://weblecture-player.web.cern.ch/?id=1149016c23sc3&year=2022&time=04m31s"
  },
  "su_account": "legacy"
}
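
A minimal sketch of how a document shaped like [1] could be assembled for a single cue. This is not the cernsearch-asr implementation: `build_document` and its parameters are hypothetical, the `<su_account>_<media id>` pattern for `ttpmediaid` is inferred from the single example above, and the `time` format for cues past the first hour is a guess, since the example only shows `04m31s`.

```python
from typing import List

# Constants copied from the example document [1].
SCHEMA = "https://asrservice-search.web.cern.ch/schemas/asrservice/vtt_ttaas_v1.0.0.json"
PLAYER_URL = "https://weblecture-player.web.cern.ch/"
OWNER = "group:weblecture-service"


def build_document(
    media_id: str,
    year: int,
    hour: str,
    minute: str,
    sec: str,
    text: str,
    read_access: List[str],
    language: str = "en",
    su_account: str = "legacy",
) -> dict:
    """Build one cernsearch document for one cue (hypothetical helper)."""
    time_param = f"{minute}m{sec}s"
    if hour != "00":
        # ASSUMPTION: the example only covers hour "00"; the "NNh" prefix is a guess.
        time_param = f"{hour}h{time_param}"
    return {
        "_access": {
            "owner": [OWNER],
            "delete": [OWNER],
            "update": [OWNER],
            "read": list(read_access),
        },
        "$schema": SCHEMA,
        "language": language,
        # ASSUMPTION: "<su_account>_<media id>", inferred from "legacy_1149016c23sc3".
        "ttpmediaid": f"{su_account}_{media_id}",
        "asrtype": "vtt",
        "begin_hour": hour,
        "begin_min": minute,
        "begin_sec": sec,
        "_data": {
            "mtext": text,
            "url": f"{PLAYER_URL}?id={media_id}&year={year}&time={time_param}",
        },
        "su_account": su_account,
    }
```

Calling `build_document("1149016c23sc3", 2022, "00", "04", "31", ...)` with the cue text and read list from [1] reproduces the `ttpmediaid` and `url` values shown there.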