The pod in question is kops-registry-prod-harbor-jobservice-867f4967d5-qqg8s, which was OOMKilled a few minutes before 2025-03-10T16:59:32Z.
We see an increase in SLACK jobs.
I am also attaching the logs from just before the pod exited (the "previous" container logs):
I don't think it's clear what eats the memory; it may be the SLACK or IMAGE_SCAN jobs.
It could also be the flood of errors like: [ERROR] [/jobservice/hook/hook_agent.go:155]: Retry: sending hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID
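To see which job kinds and errors actually dominate the attached log dump, a quick counter like the following can help. This is a hedged sketch: the matching is plain substring search based on the single sample error line above, and the job-kind names beyond SLACK and IMAGE_SCAN are guesses; adjust the patterns to the real log format.

```python
from collections import Counter

# Job kinds we suspect; only SLACK and IMAGE_SCAN are confirmed in the notes,
# the rest are plausible Harbor job kinds added for illustration.
JOB_KINDS = ("SLACK", "IMAGE_SCAN", "WEBHOOK", "REPLICATION")

def triage(lines):
    """Count occurrences of job kinds and the NOT_FOUND hook-retry error."""
    counts = Counter()
    for line in lines:
        for kind in JOB_KINDS:
            if kind in line:
                counts[kind] += 1
        if '"code":"NOT_FOUND"' in line:
            counts["hook NOT_FOUND retries"] += 1
    return counts

if __name__ == "__main__":
    sample = [
        'launching job: SLACK',
        '[ERROR] [/jobservice/hook/hook_agent.go:155]: Retry: sending hook '
        'event error: {"errors":[{"code":"NOT_FOUND","message":"task with '
        'job ID x not found"}]}',
    ]
    print(triage(sample))
```

Feeding the whole previous-container log through this would tell us whether the NOT_FOUND retries or one job kind dwarfs the rest.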
Some notes from this morning's standup discussion around next steps:
- Increase memory limits for jobservice and trivy.
- Investigate the orphan jobservice items we see in the logs: they don't exist in the DB but still get reported (likely related to REDIS being restarted?).
- Understand memory consumption in jobservice and trivy.
- Go back to REDIS HA; the TN instance still has HA REDIS.
- Investigate what the jobservice dashboard shows and what the fields mean (vendor ID, etc.), and how to link jobs/schedules to projects, …
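For the orphan-items action point, the reconciliation idea boils down to a set difference. A minimal sketch, assuming we can dump the job IDs jobservice reports and the task job IDs present in the DB (both input lists are hypothetical; how to extract them is exactly what we still need to figure out):

```python
def find_orphans(jobservice_ids, db_task_ids):
    """Job IDs that jobservice still reports but that have no task row
    in the DB. These are the entries behind the NOT_FOUND hook errors;
    they plausibly appear when REDIS is restarted while the DB keeps
    its state (or vice versa)."""
    return sorted(set(jobservice_ids) - set(db_task_ids))

# Illustrative only: real IDs would come from redis and the harbor DB.
print(find_orphans(["job-a", "job-b", "job-c"], ["job-b"]))
```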
To replicate the issue with redis, I will set up a bucket with the production data in "s3://registry-staging-large" and a clone of the production DB (harbordb_prod_clone_20250311150912).
jobservice memory usage normally goes up with many jobs; I don't think there is a leak, it's simply that many jobs are queued.
Now, problems can happen when several things coincide: many image pushes (which trigger image_scan jobs), many replications to TN for the ACC project, and many webhooks that time out but still consume resources.
Jobservice memory consumption is influenced by the number of tasks jobservice is running. Image pulls can increase the job count because image scans may be triggered and webhooks may be executed. We had 5 broken webhooks that were firing frequently; they were not leaking memory, but they added load with jobs waiting to time out (and polluted the logs).
We observe hourly spikes because lxcvmfs is pulling ~2.5K images every hour within a 5-10 minute window.
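The shape of that spike matters more than the hourly average. Back-of-the-envelope arithmetic, using the ~2.5K/hour figure from the dashboard (the 7.5-minute window is just the midpoint of the observed 5-10 minute range):

```python
# lxcvmfs pulls ~2.5K images per hour, compressed into a 5-10 minute window.
images = 2500
window_min = 7.5                    # assumed midpoint of the 5-10 min window
burst_rate = images / window_min    # pulls/min during the spike
avg_rate = images / 60              # pulls/min if spread over the whole hour

print(f"burst: {burst_rate:.0f}/min vs average: {avg_rate:.0f}/min")
```

So the short-term job arrival rate is roughly 8x the hourly average, which is consistent with the hourly memory spikes we see rather than a steady climb.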
Jobservice needs headroom for those memory spikes, and redis needs memory to accommodate the jobs stored there.
I'm leaving the webhooks disabled since they are all failing.