
Investigate issues with kubeconfig authentication with big images

We see 401 errors, even though the image gets pulled partially:

Aug 21 09:32:58 k8s-fpga-large-pdc-v-m4-szcr2oa7kzpi-node-4 containerd-stargz-grpc[4185956]: {"error":"failed to resolve layer: failed to resolve layer \"sha256:d68ab6c47a05135b1ddfe77779dd15618b6e62c8fe056bf6de055aad7248572c\" from \"registry.cern.ch/kubernetes-developers-private/vivado:2024.2\": failed to resolve the blob: failed to resolve the source: cannot resolve layer: failed to redirect (host \"registry.cern.ch\", ref:\"registry.cern.ch/kubernetes-developers-private/vivado:2024.2\", digest:\"sha256:d68ab6c47a05135b1ddfe77779dd15618b6e62c8fe056bf6de055aad7248572c\"): failed to access to the registry with code 401: failed to resolve: failed to resolve target","key":"k8s.io/280/extract-869361022-h85L sha256:5bf24d5f5949c7ccdc146d1536b16a14a5e3ccef67041b63a5f83abe5be99b7e","level":"warning","msg":"failed to prepare remote snapshot","parent":"sha256:0819b451287becf4007b4d0e8964bc5dbc91e699cf0d5ce5d30aa09091f1ef8a","remote-snapshot-prepared":"false","time":"2025-08-21T09:32:58.918620247Z"}

or

Aug 21 09:30:29 k8s-fpga-large-pdc-v-m4-szcr2oa7kzpi-node-4 containerd-stargz-grpc[4185956]: {"key":"k8s.io/234/extract-120841962-xDLA sha256:1cd77fb1592db412c9623885ada2f3a8042326d29167ad74d0d2c032ed2a0ab3","level":"info","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/141/fs","msg":"Received status code: 401 Unauthorized. Refreshing creds...","parent":"sha256:f209f8ee2d4dcb628434a5499d01575008b033152559fb2964fd7d751bce46e4","src":"registry.cern.ch/kubernetes-developers-private/vivado:2021.2/sha256:ec5a61381d639c13ef85a5183ad26678af495ed2ef86491537ee96d356987b11","time":"2025-08-21T09:30:29.222167783Z"}

We also see failures when fetching specific layers:

Aug 21 09:34:18 k8s-fpga-large-pdc-v-m4-szcr2oa7kzpi-node-4 containerd-stargz-grpc[4185956]: {"error":"failed to read \"file_list_3.txt\" (off:4194304,size:4194304): cacheWithReader.peek: fileReader.ReadAt.peek: failed to fetch region [{3900000 3949999} {3600000 3649999} {4050000 4099999} {3500000 3549999} {4000000 4049999} {4100000 4149999} {3550000 3599999} {3700000 3749999} {3850000 3899999} {3650000 3699999} {3750000 3799999} {3950000 3999999} {3800000 3849999}]","level":"warning","msg":"failed to fetch whole layer=sha256:bf0a3ddfc4a09424b2609e1089cd54ee24d99d235fb473d576e0f67e465eb819","time":"2025-08-21T09:34:18.924409075Z"}
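The failed regions in the message above are listed out of order, which makes it hard to see how much of the blob is affected. A minimal sketch for collapsing them, assuming each `{a b}` pair is an inclusive start/end byte offset (this is our reading of the log format, not something the log itself states); `parse_regions` and `merge_regions` are hypothetical helper names:

```python
import re

def parse_regions(message):
    """Extract (start, end) pairs written as '{start end}' in a log message."""
    return [(int(a), int(b)) for a, b in re.findall(r"\{(\d+) (\d+)\}", message)]

def merge_regions(regions):
    """Merge byte ranges that overlap or touch.

    Assumption: both offsets are inclusive, so (0, 9) and (10, 19) are adjacent.
    """
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous range instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Under that assumption, the thirteen regions in the error above collapse into a single contiguous range, 3500000 to 4149999 (650000 bytes), which suggests one failed stretch of the blob rather than scattered misses.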

Sometimes this prevents the pod from starting (with missing-layer errors). Other times stargz falls back to containerd, which tries to pull the full image until it fills the disk and kills the node.
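To see how often each failure mode occurs on a node, the journal lines can be bucketed by error type. The following is a rough sketch, not a definitive tool: it assumes every relevant line has the shape shown above (a journald prefix ending in `]: ` followed by one JSON object), and the category labels `auth-401`, `region-fetch` and `other` are our own names.

```python
import json
import re
from collections import Counter

def parse_line(line):
    """Return the JSON payload of a journald line, or None if there is none."""
    start = line.find("]: {")
    if start == -1:
        return None
    try:
        return json.loads(line[start + 3:])
    except json.JSONDecodeError:
        return None

def classify(entry):
    """Bucket a log entry using only its 'error' and 'msg' fields."""
    text = entry.get("error", "") + " " + entry.get("msg", "")
    # Match "code 401" / "code: 401" rather than a bare "401",
    # which could also appear inside a digest.
    if re.search(r"code:? 401\b", text):
        return "auth-401"
    if "failed to fetch region" in text:
        return "region-fetch"
    return "other"

def summarize(lines):
    """Count entries per category, skipping lines without a JSON payload."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[classify(entry)] += 1
    return counts
```

One way to use it is to save the output of `journalctl -t containerd-stargz-grpc` to a file and pass the open file to `summarize`, assuming the snapshotter logs under that syslog identifier as in the lines above.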