Various Oracle devdb failures blocked all pipelines for 24h
Description
During the past 24h almost all jobs on our CI runners failed with no clear signature:
- init pod timed out on some jobs because `cta-catalogue-schema-drop` never exited (for example: https://gitlab.cern.ch/cta/CTA/-/jobs/38336586)
- ORA-12514 errors: `TNS:listener does not currently know of service requested in connect descriptor`, while nothing changed in the TNS definition provided by the Oracle service for castorint
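A cheap mitigation for the first failure mode would be to never let the schema-drop hang the init pod indefinitely. Below is a minimal sketch, assuming GNU coreutils `timeout` is available in the init pod image; the wrapper name, the 120s limit, and the example invocation are illustrative assumptions, not the project's actual setup:

```shell
# Hypothetical wrapper: run a command with a hard time limit so a hung
# Oracle connection fails the job fast instead of stalling the init pod.
run_with_timeout() {
  # usage: run_with_timeout <seconds> <command...>
  secs="$1"; shift
  timeout --signal=TERM "$secs" "$@"
  rc=$?
  if [ "$rc" -eq 124 ]; then
    # GNU timeout exits with 124 when the command was killed on timeout
    echo "command timed out after ${secs}s (hung DB connection?)" >&2
  fi
  return "$rc"
}

# Illustrative call (config path is a placeholder, not the real one):
# run_with_timeout 120 cta-catalogue-schema-drop /path/to/catalogue.conf
```

With this in place, a listener outage turns the job into a fast, clearly-labelled failure instead of a silent timeout.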
How bad was this?
This graph shows job success (green) and failure (red) on our MHVTL runners only:
All jobs failed between 2024-04-24T15:20 and 2024-04-25T09:50:
- 184 failed jobs
- only 10 went through around 2024-04-24T23:00
Pipelines have been working again since 2024-04-25T09:50, on the same CI runners, with the exact same code...
This looks like a castorint backend issue, but this failure rate completely blocks CTA development workflows, and catching up by manually re-running the failed jobs our merge requests need is very expensive on our side.
I will check with the DB team for the exact reason, but we could also adapt our DB CI strategy: a local Postgres backend would be good enough for 99% of our jobs, and we could keep Oracle for some schedules and stress tests (merge pipelines with catalogue changes?)...
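One possible shape for that strategy is sketched below. This is only an illustration of the idea, not our actual CI config: the variable name, job name, and script are assumptions; `CI_PIPELINE_SOURCE` and `rules:if` are standard GitLab CI features.

```yaml
# Hypothetical .gitlab-ci.yml fragment: Postgres is the default catalogue
# backend, Oracle runs only on scheduled pipelines. All names are illustrative.
variables:
  CTA_CATALOGUE_BACKEND: "postgres"   # default for regular MR pipelines

catalogue-tests-oracle:
  variables:
    CTA_CATALOGUE_BACKEND: "oracle"
  rules:
    # only scheduled pipelines (e.g. nightly stress tests) hit the Oracle devdb
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - ./run_catalogue_tests.sh "$CTA_CATALOGUE_BACKEND"
```

This keeps Oracle coverage on a schedule while decoupling day-to-day merge requests from the devdb's availability.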
Oracle vs Postgres as default for runners
We should list here the various reasons for switching our default to Postgres.