Various Oracle devdb failures blocked all pipelines for 24h
Description
During the past 24h almost all jobs on our CI runners failed with no clear signature:
- init pod timed out on some jobs because `cta-catalogue-schema-drop` never exited (for example: https://gitlab.cern.ch/cta/CTA/-/jobs/38336586)
- ORA-12514 errors: `TNS:listener does not currently know of service requested in connect descriptor`, while nothing changed in the TNS definition provided by the Oracle service for castorint
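A cheap mitigation for the first failure mode would be to never let the schema-drop hang the init pod indefinitely. Below is a minimal sketch, assuming GNU coreutils `timeout` is available in the init pod image; the wrapper name, the 120s limit, and the example invocation are illustrative assumptions, not the project's actual setup:

```shell
# Hypothetical wrapper: run a command with a hard time limit so a hung
# Oracle connection fails the job fast instead of stalling the init pod.
run_with_timeout() {
  # usage: run_with_timeout <seconds> <command...>
  secs="$1"; shift
  timeout --signal=TERM "$secs" "$@"
  rc=$?
  if [ "$rc" -eq 124 ]; then
    # GNU timeout exits with 124 when the command was killed on timeout
    echo "command timed out after ${secs}s (hung DB connection?)" >&2
  fi
  return "$rc"
}

# Illustrative call (config path is a placeholder, not the real one):
# run_with_timeout 120 cta-catalogue-schema-drop /path/to/catalogue.conf
```

With this in place, a listener outage turns the job into a fast, clearly-labelled failure instead of a silent timeout.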
How bad was this?
This graph shows job success (green) and failure (red) on our MHVTL runners only:
All jobs failed between 2024-04-24T15:20 and 2024-04-25T09:50:
- 184 failed jobs
- only 10 went through around 2024-04-24T23:00
Pipelines have been working again since 2024-04-25T09:50, on the same CI runners, with the exact same code...
This looks like a castorint backend issue, but this failure rate completely blocks CTA development workflows, and catching up by manually re-running the failed jobs our merge requests need is very expensive on our side.
I will check with the DB team for the exact reason, but we could also adapt our DB CI strategy: a local Postgres backend would be good enough for 99% of our jobs, and we could keep Oracle for some schedules and stress tests (merge pipelines with catalogue changes?)...
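One possible shape for that strategy is sketched below. This is only an illustration of the idea, not our actual CI config: the variable name, job name, and script are assumptions; `CI_PIPELINE_SOURCE` and `rules:if` are standard GitLab CI features.

```yaml
# Hypothetical .gitlab-ci.yml fragment: Postgres is the default catalogue
# backend, Oracle runs only on scheduled pipelines. All names are illustrative.
variables:
  CTA_CATALOGUE_BACKEND: "postgres"   # default for regular MR pipelines

catalogue-tests-oracle:
  variables:
    CTA_CATALOGUE_BACKEND: "oracle"
  rules:
    # only scheduled pipelines (e.g. nightly stress tests) hit the Oracle devdb
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - ./run_catalogue_tests.sh "$CTA_CATALOGUE_BACKEND"
```

This keeps Oracle coverage on a schedule while decoupling day-to-day merge requests from the devdb's availability.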
Oracle vs Postgres as default for runners
We should list here the various reasons for switching our default to Postgres.