The pod in question is kops-registry-prod-harbor-jobservice-867f4967d5-qqg8s, which was OOMKilled a few minutes before 2025-03-10T16:59:32Z.
We see an increase in SLACK jobs.
I am also attaching the logs from just before the pod exited (the "previous" container logs):
I don't think it's clear what eats the memory; it may be the SLACK or IMAGE_SCAN jobs.
It could also be the flood of errors like: [ERROR] [/jobservice/hook/hook_agent.go:155]: Retry: sending hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID
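To see which job kinds and errors actually dominate the attached log dump, a quick counter like the following can help. This is a hedged sketch: the matching is plain substring search based on the single sample error line above, and the job-kind names beyond SLACK and IMAGE_SCAN are guesses; adjust the patterns to the real log format.

```python
from collections import Counter

# Job kinds we suspect; only SLACK and IMAGE_SCAN are confirmed in the notes,
# the rest are plausible Harbor job kinds added for illustration.
JOB_KINDS = ("SLACK", "IMAGE_SCAN", "WEBHOOK", "REPLICATION")

def triage(lines):
    """Count occurrences of job kinds and the NOT_FOUND hook-retry error."""
    counts = Counter()
    for line in lines:
        for kind in JOB_KINDS:
            if kind in line:
                counts[kind] += 1
        if '"code":"NOT_FOUND"' in line:
            counts["hook NOT_FOUND retries"] += 1
    return counts

if __name__ == "__main__":
    sample = [
        'launching job: SLACK',
        '[ERROR] [/jobservice/hook/hook_agent.go:155]: Retry: sending hook '
        'event error: {"errors":[{"code":"NOT_FOUND","message":"task with '
        'job ID x not found"}]}',
    ]
    print(triage(sample))
```

Feeding the whole previous-container log through this would tell us whether the NOT_FOUND retries or one job kind dwarfs the rest.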
Some notes from this morning's standup discussion around next steps:
- Increase memory limits for jobservice and trivy.
- Investigate the orphan jobservice items we see in the logs: they don't exist in the DB but still get reported (likely related to REDIS being restarted?).
- Understand memory consumption in jobservice and trivy.
- Go back to REDIS HA; the TN instance still has HA REDIS.
- Investigate what the jobservice dashboard shows and what the fields mean (vendor ID, etc.), and how to link jobs/schedules to projects, …
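For the orphan-items action point, the reconciliation idea boils down to a set difference. A minimal sketch, assuming we can dump the job IDs jobservice reports and the task job IDs present in the DB (both input lists are hypothetical; how to extract them is exactly what we still need to figure out):

```python
def find_orphans(jobservice_ids, db_task_ids):
    """Job IDs that jobservice still reports but that have no task row
    in the DB. These are the entries behind the NOT_FOUND hook errors;
    they plausibly appear when REDIS is restarted while the DB keeps
    its state (or vice versa)."""
    return sorted(set(jobservice_ids) - set(db_task_ids))

# Illustrative only: real IDs would come from redis and the harbor DB.
print(find_orphans(["job-a", "job-b", "job-c"], ["job-b"]))
```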
To replicate the issue with redis, I will set up a bucket with the production data in "s3://registry-staging-large" and a clone of the production DB (harbordb_prod_clone_20250311150912).
jobservice memory usage normally goes up with many jobs; I don't think there is a leak, it's simply that many jobs are queued.
Now, problems can happen when several things coincide: many image pushes (which trigger image_scan jobs), many replications to TN for the ACC project, and many webhooks that time out but still consume resources.
Jobservice memory consumption is influenced by the number of tasks jobservice is running. Image pulls can increase the job count because image scans may be triggered and webhooks may be executed. We had 5 broken webhooks that were firing frequently; they were not leaking memory, but they added load with jobs waiting to time out (and polluted the logs).
We observe hourly spikes because lxcvmfs is pulling ~2.5K images every hour within a 5-10 minute window.
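The shape of that spike matters more than the hourly average. Back-of-the-envelope arithmetic, using the ~2.5K/hour figure from the dashboard (the 7.5-minute window is just the midpoint of the observed 5-10 minute range):

```python
# lxcvmfs pulls ~2.5K images per hour, compressed into a 5-10 minute window.
images = 2500
window_min = 7.5                    # assumed midpoint of the 5-10 min window
burst_rate = images / window_min    # pulls/min during the spike
avg_rate = images / 60              # pulls/min if spread over the whole hour

print(f"burst: {burst_rate:.0f}/min vs average: {avg_rate:.0f}/min")
```

So the short-term job arrival rate is roughly 8x the hourly average, which is consistent with the hourly memory spikes we see rather than a steady climb.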
Jobservice needs headroom for those memory spikes, and redis needs memory to accommodate the jobs stored there.
I'm leaving the webhooks disabled since they are all failing.