
Wiring Amazon ECR to EKS: Authentication, IAM, and Production-Ready Image Pulls


Your EKS pods are stuck in ImagePullBackOff. The event log says 401 Unauthorized against your ECR registry, your IAM policies look correct, your node group has the right role attached, and you’ve triple-checked the registry URI. Both services are AWS-native—this integration is supposed to be seamless. And yet.

The friction here is almost always the same: engineers treat ECR like Docker Hub and expect static credentials to solve the problem. ECR doesn’t work that way. It issues short-lived authorization tokens that expire after 12 hours, retrieved via GetAuthorizationToken at runtime, not stored secrets passed at deploy time. The entire authentication chain runs through IAM—specifically through the EC2 instance profile attached to your EKS node group—and if any link in that chain is broken, the kubelet silently fails to pull and Kubernetes gives you a cryptic 401 with no further explanation.

Cross-account setups amplify the confusion. Pull-through caches, shared registries, and multi-account architectures each introduce additional trust relationships that have to be explicitly wired together. Miss one resource-based policy on the ECR side or one assume-role permission on the node side, and you’re back to staring at the same error.

Getting this right means understanding exactly how ECR tokens are issued, how the kubelet credential provider fetches and caches them, and where the IAM trust chain can silently break—not just in initial setup, but 13 hours later when your token expires mid-deployment.

Start with how ECR authentication actually works under the hood, because it’s fundamentally different from every other container registry you’ve used.

How ECR Authentication Actually Works (and Why It Differs from Docker Hub)

When engineers first wire ECR into an EKS cluster, they often treat it like any other container registry: create credentials, store them as a Kubernetes secret, reference them in the pod spec. That mental model breaks down fast with ECR, and understanding why is the key to avoiding a class of production failures that only surface at the worst possible time.

Visual: ECR token issuance flow from GetAuthorizationToken through kubelet credential provider to image pull

Tokens, Not Passwords

ECR does not use static credentials. Every authentication interaction starts with a call to the AWS GetAuthorizationToken API, which returns a base64-encoded token valid for exactly 12 hours. That token is a standard HTTP Basic Auth credential—username AWS, password set to the token value—that ECR’s registry endpoint accepts. After 12 hours, it expires. No rotation grace period, no warning. The next image pull against an expired token fails immediately.
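To make the token mechanics concrete, here is a small illustration of the credential format; the values below are dummies, not real credentials. In practice the token comes from `aws ecr get-authorization-token --query 'authorizationData[0].authorizationToken' --output text`.

```shell
## Illustration only: simulate what GetAuthorizationToken returns.
## The API hands back base64("AWS:<token-body>") -- a plain Basic Auth pair.
TOKEN=$(printf 'AWS:example-token-body' | base64)

## Decoding it recovers the username and password the registry expects
DECODED=$(printf '%s' "$TOKEN" | base64 -d)
echo "${DECODED%%:*}"   # the Basic Auth username is always AWS
```

Everything after the first colon is the actual token body, which is what `docker login --password-stdin` consumes in the scripts later in this guide.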

This is the root cause of a specific, frustrating failure mode: a deployment runs fine for hours, then pods restart (due to a rollout, node replacement, or OOM kill) and suddenly enter ImagePullBackOff. The container image hasn’t changed. The IAM permissions haven’t changed. The token expired while the cluster was otherwise healthy, and nothing refreshed it.

Docker Hub and most other OCI-compliant registries issue long-lived tokens or support static robot credentials. ECR’s 12-hour window is a deliberate security boundary, not a limitation—but it requires the authentication layer to be active and automatic rather than a one-time configuration step.

How EKS Nodes Handle This Automatically

EKS clusters running recent Amazon Linux 2 or Bottlerocket node images solve the token refresh problem through the kubelet credential provider plugin. The kubelet calls the credential provider binary before each image pull that targets an ECR registry endpoint (matching the pattern *.dkr.ecr.*.amazonaws.com). The plugin calls GetAuthorizationToken using the node’s EC2 instance profile IAM role and returns a fresh token. The kubelet caches this token and uses it for subsequent pulls within the validity window.

The critical dependency here is the EC2 instance profile role. The node needs ecr:GetAuthorizationToken attached to its instance profile—not a Kubernetes secret, not a service account annotation for this specific operation. The IAM call happens at the node level, outside the Kubernetes API, before the container runtime even attempts the pull.

💡 Pro Tip: The credential provider plugin replaced the legacy in-tree ECR credential provider compiled into older kubelet versions. If you are running EKS node AMIs older than mid-2023, verify your kubelet configuration explicitly enables the external credential provider rather than relying on the compiled-in ECR support, which lacks the same refresh reliability.
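For reference, a kubelet credential provider configuration for ECR looks like the following sketch. Recent EKS AMIs ship an equivalent file out of the box; the field names mirror the upstream kubelet `CredentialProviderConfig` API, and the match pattern is the same ECR endpoint pattern described above.

```yaml
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
  - name: ecr-credential-provider
    apiVersion: credentialprovider.kubelet.k8s.io/v1
    matchImages:
      - "*.dkr.ecr.*.amazonaws.com"
    defaultCacheDuration: "12h"
```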

Why This Distinction Matters in Production

Understanding that the IAM trust chain—AWS account → EC2 instance profile → GetAuthorizationToken → registry token—is the actual authentication path explains why troubleshooting starts at the IAM layer, not at Kubernetes secrets. A missing policy on the node role surfaces as an ImagePullBackOff that no amount of imagePullSecrets configuration fixes.

The next section covers exactly which IAM policies belong on the node role, what the minimal permission set looks like, and how to scope access to specific repositories rather than granting blanket registry access.

IAM Configuration: Node Role Policies and the Minimal Permission Set

Every EKS node runs with an IAM role attached to the underlying EC2 instance. When kubelet needs to pull an image from ECR, it calls the ECR credential helper, which in turn calls the AWS credentials endpoint on the instance metadata service. The node role is what determines whether that call succeeds or fails. Getting the permissions right—tight enough to satisfy security teams, broad enough to avoid runtime failures—is the first operational decision you make when wiring ECR to EKS.

The AmazonEC2ContainerRegistryReadOnly Managed Policy

AWS provides a managed policy named AmazonEC2ContainerRegistryReadOnly that grants the permissions kubelet needs to authenticate and pull images. Attaching it to the node role is the fastest path to a working cluster.

The policy grants the following actions across all ECR resources in your account:

AmazonEC2ContainerRegistryReadOnly-effective-permissions.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetRepositoryPolicy",
        "ecr:DescribeRepositories",
        "ecr:ListImages",
        "ecr:DescribeImages",
        "ecr:BatchGetImage",
        "ecr:GetLifecyclePolicyPreview",
        "ecr:GetLifecyclePolicy"
      ],
      "Resource": "*"
    }
  ]
}

ecr:GetAuthorizationToken is the critical action here. It operates on * by design—it does not accept resource-level conditions because the token it returns is account-scoped, not repository-scoped. Every other action in the list operates on specific repository ARNs, which gives you a lever for tightening scope.

Scoping Permissions to Specific Repositories

The managed policy grants read access to every ECR repository in the account. For production environments where multiple teams share an account, that blast radius is too wide. You can write a customer-managed policy that locks access to specific repositories while still allowing the account-level GetAuthorizationToken call.

ecr-scoped-readonly-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowTokenVending",
      "Effect": "Allow",
      "Action": "ecr:GetAuthorizationToken",
      "Resource": "*"
    },
    {
      "Sid": "AllowScopedPull",
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": [
        "arn:aws:ecr:us-east-1:123456789012:repository/payments-service",
        "arn:aws:ecr:us-east-1:123456789012:repository/auth-service"
      ]
    }
  ]
}

Attach this policy to the node role in place of the managed policy. Nodes in that group pull only from the listed repositories, and any other ECR resource returns an authorization error.

💡 Pro Tip: Use a repository naming convention like team-name/service-name and a wildcard ARN (arn:aws:ecr:us-east-1:123456789012:repository/payments/*) to cover an entire team’s repositories without enumerating each one. Wildcards work in resource ARNs for all ECR actions except GetAuthorizationToken.
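The tip above, expressed as a single policy statement; the payments/ prefix is an assumed naming convention for illustration:

```json
{
  "Sid": "AllowTeamScopedPull",
  "Effect": "Allow",
  "Action": [
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage"
  ],
  "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/payments/*"
}
```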

IRSA: Pod-Level Control Without Node Role Sprawl

Attaching ECR permissions to the node role works, but it means every pod running on that node inherits the same access—regardless of what that pod actually needs. IRSA (IAM Roles for Service Accounts) solves this by binding an IAM role to a Kubernetes service account, then projecting short-lived credentials into each pod individually.

The trust policy for an IRSA role references your cluster’s OIDC provider:

irsa-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:sub": "system:serviceaccount:payments:payments-puller"
        }
      }
    }
  ]
}

Attach the scoped ECR policy from above to this role and annotate the payments-puller service account with the role ARN; pods in the payments namespace using that service account then receive short-lived credentials for ECR API calls. One caveat: kubelet-initiated image pulls still authenticate through the node's instance profile role, because the kubelet fetches images before any pod credentials exist. Where IRSA shines is for workloads that call ECR APIs directly (in-cluster image builders, CI runners, controllers that push or inspect images): these can carry tightly scoped ECR permissions without widening the node role.
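The annotation itself is a one-line addition to the service account manifest; the role name payments-ecr-access is an assumed example, not a name from earlier in this guide:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-puller
  namespace: payments
  annotations:
    # Role ARN is an assumed example; substitute the IRSA role you created
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payments-ecr-access
```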

For any cluster where workloads belong to distinct teams or have different compliance boundaries, IRSA is the correct default—not the exception.

With IAM permissions in place, the next step is creating the ECR repositories themselves and establishing the image push workflow that feeds them.

Setting Up ECR Repositories and Pushing Your First Image

With IAM policies in place, the next step is creating the ECR repositories that EKS will pull from and establishing the push workflow your CI/CD pipeline will repeat on every release. This section covers repository creation, storage management, authentication, and the verification step that closes the loop before a manifest references your image.

Creating the Repository

Create a repository using the AWS CLI, specifying the region explicitly to avoid relying on ambient configuration:

create-ecr-repo.sh
aws ecr create-repository \
--repository-name my-app/backend \
--region us-east-1 \
--image-scanning-configuration scanOnPush=true \
--image-tag-mutability IMMUTABLE

Two flags here are worth making standard across all repositories in your account. scanOnPush=true runs ECR's basic vulnerability scan on every pushed image without additional configuration, surfacing known CVEs before an image can be deployed (enhanced scanning through Amazon Inspector requires a separate registry-level setting, covered later). IMMUTABLE tag mutability prevents any tag from being overwritten after it is pushed—a guard that eliminates an entire category of production incident where a rollback lands on a different image than expected.

Once created, capture the repository URI. You will use it in every subsequent step, so exporting it as an environment variable early keeps scripts readable:

get-repo-uri.sh
REPO_URI=$(aws ecr describe-repositories \
--repository-names my-app/backend \
--region us-east-1 \
--query 'repositories[0].repositoryUri' \
--output text)
echo $REPO_URI
## 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app/backend

Setting a Lifecycle Policy

Without a lifecycle policy, ECR retains every image you push indefinitely. Storage costs accumulate quickly in active pipelines, particularly when builds produce untagged intermediate layers. Apply a policy immediately after creating the repository rather than treating it as a cleanup task for later:

lifecycle-policy.sh
aws ecr put-lifecycle-policy \
--repository-name my-app/backend \
--region us-east-1 \
--lifecycle-policy-text '{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep last 20 tagged images",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["v"],
        "countType": "imageCountMoreThan",
        "countNumber": 20
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 2,
      "description": "Remove untagged images after 7 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 7
      },
      "action": { "type": "expire" }
    }
  ]
}'

This policy keeps the last 20 versioned releases—enough to support rollbacks across multiple environments—and clears untagged build artifacts after one week. Adjust countNumber based on your release cadence and how many prior versions your runbooks require access to.

Authenticating Docker to ECR

ECR uses short-lived tokens rather than static credentials. Request a token and pipe it directly to docker login:

ecr-auth.sh
AWS_ACCOUNT_ID=123456789012
AWS_REGION=us-east-1
aws ecr get-login-password --region $AWS_REGION | \
docker login \
--username AWS \
--password-stdin \
$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com

ECR tokens expire after 12 hours. In CI/CD pipelines, always run get-login-password at the start of each job rather than caching the token between runs. If your pipeline spans multiple stages that execute hours apart, add the authentication step to each stage that performs a push or pull.

Tagging and Pushing

Tag your local image using the full ECR URI, then push. Using a deterministic version tag—rather than latest—makes every deployment traceable to a specific build:

push-image.sh
IMAGE_TAG=v1.4.2
docker tag my-app-backend:latest $REPO_URI:$IMAGE_TAG
docker push $REPO_URI:$IMAGE_TAG

For pipelines that build on every commit, a common convention is to combine a semantic version with the short Git SHA—for example, v1.4.2-a3f9c1d—so the tag encodes both the release and the exact source revision without requiring a registry lookup.
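As a sketch, composing such a tag in a pipeline step might look like this. VERSION and GIT_SHA are hard-coded stand-ins here; a real pipeline would take them from release tooling and `$(git rev-parse --short HEAD)` respectively.

```shell
## Stand-in values; replace with release tooling output and git rev-parse
VERSION=v1.4.2
GIT_SHA=a3f9c1d

## The tag encodes both the release and the exact source revision
IMAGE_TAG="${VERSION}-${GIT_SHA}"
echo "$IMAGE_TAG"   # v1.4.2-a3f9c1d
```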

Verifying the Push

Before referencing an image in a Kubernetes manifest, confirm ECR received it and inspect the manifest digest:

verify-image.sh
aws ecr describe-images \
--repository-name my-app/backend \
--region us-east-1 \
--image-ids imageTag=v1.4.2 \
--query 'imageDetails[0].{digest:imageDigest,size:imageSizeInBytes,pushed:imagePushedAt}' \
--output table

The digest returned here—a sha256: prefixed string—is the value you should pin in production manifests rather than the mutable tag. Even with IMMUTABLE tag mutability enabled, referencing the digest directly ties a deployment to a byte-for-byte specific image, making it immune to any registry-side change. Record the digest as a build artifact alongside your release notes so it is available during incident response without requiring a registry query.
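A minimal sketch of composing a digest-pinned reference; the digest below is a placeholder, and in practice it comes from the describe-images call shown above:

```shell
REPO_URI=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app/backend

## Placeholder digest; resolve the real one with:
##   aws ecr describe-images --repository-name my-app/backend \
##     --image-ids imageTag=v1.4.2 --query 'imageDetails[0].imageDigest' --output text
DIGEST=sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef

## Digest references use @, not :, between repository and identifier
PINNED="${REPO_URI}@${DIGEST}"
echo "$PINNED"
```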

With images verified and digests in hand, the next step is authoring Kubernetes manifests that reference these images reliably and deploying them to your EKS cluster.

Deploying EKS Workloads That Pull from ECR

With your ECR repository populated and your node role carrying the correct policies, deploying workloads that pull from ECR reduces to writing accurate manifest files and understanding how Kubernetes interacts with the ECR token exchange. The authentication layer you configured in the previous sections is invisible at the manifest level—your pod specs reference ECR image URIs the same way they reference any other registry.

ECR Image URIs in Pod Specs

An ECR image URI follows a fixed structure:

<account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:<tag>

A minimal Deployment pulling from ECR looks like this:

api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-api:1.4.2
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080

No imagePullSecrets block. No registry credentials mounted as Secrets. The kubelet on each node calls the ECR credential provider, which uses the node’s IAM role to fetch a temporary token, and the pull proceeds. This is the clean path—any manifest requiring imagePullSecrets for ECR is working around a misconfigured IAM setup, not solving it.

Bootstrapping Clusters with eksctl

When creating a new cluster, eksctl attaches the ECR read policy to managed node group roles by default; declaring the policy ARNs explicitly in the config keeps the grant visible and version-controlled:

cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-east-1
managedNodeGroups:
  - name: ng-standard
    instanceType: m5.large
    desiredCapacity: 3
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

Running eksctl create cluster -f cluster.yaml provisions the cluster with nodes pre-authorized to pull from any ECR repository in the same account. No post-creation IAM surgery required.

Validating Pulls with kubectl

After applying your Deployment, use kubectl describe pod to verify the pull succeeded:

Terminal window
kubectl describe pod -l app=api -n production

Look at the Events section at the bottom of the output. A successful pull produces:

Events:
Normal Pulling 12s kubelet Pulling image "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-api:1.4.2"
Normal Pulled 9s kubelet Successfully pulled image "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-api:1.4.2" in 2.847s
Normal Created 9s kubelet Created container api
Normal Started 9s kubelet Started container api

A failed pull surfaces here as an ErrImagePull or ImagePullBackOff event with the specific error from ECR—typically an authorization denial or a nonexistent image tag. The Events section is the first place to look before reaching for CloudTrail.
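When a Deployment spans many pods, filtering events across the namespace is faster than describing pods one at a time. A sketch (kubelet emits pull failures with reason Failed, and retry loops with reason BackOff):

```shell
## Surface recent pull failures across the namespace, newest last
kubectl get events -n production \
  --field-selector reason=Failed \
  --sort-by=.lastTimestamp
```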

imagePullPolicy Behavior Against ECR

IfNotPresent is the correct policy for most production workloads. It skips the registry round-trip when the image already exists on the node, reducing pull latency and ECR API call volume. Use Always only when you deploy mutable tags like latest in development environments where you need the kubelet to re-check the registry on every pod start.

💡 Pro Tip: Pinning to immutable digest references (image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-api@sha256:a1b2c3d4...) combined with IfNotPresent gives you both reproducibility and pull efficiency. The digest guarantees you get exactly the layer set you tested, regardless of what happens to the tag after deployment.

With single-account pulls working end-to-end, the next operational challenge is giving workloads in one account access to images living in another—which requires shifting from IAM identity policies to ECR’s resource-based repository policies.

Cross-Account Image Pulls: ECR Repository Policies

Most EKS deployments eventually outgrow a single AWS account. Platform teams build shared service registries, security teams enforce account isolation, and organizations adopt multi-account structures through AWS Organizations. When your EKS node role lives in account 111122223333 but the ECR repository lives in account 444455556666, the single-account IAM model breaks down—and ECR’s resource-based policies take over.

Why Identity Policies Alone Won’t Work

IAM identity policies (attached to roles or users) control what principals in your account can do. They cannot grant cross-account access on their own. For cross-account ECR pulls, AWS requires a matching resource-based policy on the repository itself—similar to how S3 bucket policies work. The ECR repository policy in the registry account explicitly permits the foreign principal to perform pull operations. Without it, even a perfectly crafted node role policy in the workload account returns an authorization failure.

Both sides of the trust chain must be in place simultaneously: the repository policy in the registry account grants the permission, and the node role’s identity policy in the workload account allows it to request ECR authorization tokens. Either side missing causes an access denial that can be difficult to distinguish without careful error analysis.

Writing the Repository Policy

In the registry account (444455556666), apply a repository policy that names the workload account’s node role as a principal. The policy grants the minimum set of ECR read actions required for a successful image pull.

ecr-cross-account-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCrossAccountPull",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/eks-node-role-production"
      },
      "Action": [
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchCheckLayerAvailability"
      ]
    }
  ]
}

Apply it with the AWS CLI from the registry account:

apply-repo-policy.sh
aws ecr set-repository-policy \
--registry-id 444455556666 \
--repository-name platform/base-images \
--policy-text file://ecr-cross-account-policy.json \
--region us-east-1

The node role in the workload account still needs ecr:GetAuthorizationToken in its identity policy—that call always hits the local account’s IAM endpoint, regardless of where the repository lives. Omitting this action is a common misconfiguration that produces a confusing failure during the Docker login step rather than during the pull itself.

Verifying Access Without a Full Deployment

Before rolling this into a workload, validate the trust chain directly using the workload account’s credentials:

verify-cross-account-pull.sh
## Assume the node role in the workload account and export its credentials
CREDS=$(aws sts assume-role \
--role-arn arn:aws:iam::111122223333:role/eks-node-role-production \
--role-session-name cross-account-test \
--query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
--output text)
export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | cut -f1)
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | cut -f2)
export AWS_SESSION_TOKEN=$(echo "$CREDS" | cut -f3)
## Authenticate to the registry account's ECR
aws ecr get-login-password --region us-east-1 \
| docker login --username AWS --password-stdin \
444455556666.dkr.ecr.us-east-1.amazonaws.com
## Attempt a pull
docker pull 444455556666.dkr.ecr.us-east-1.amazonaws.com/platform/base-images:latest

A successful pull confirms both sides of the trust chain are wired correctly. A not authorized error on the pull (after a successful login) points to a missing or misconfigured repository policy. An error during get-login-password points to the node role’s identity policy.

Note: When granting access to an entire workload account rather than a specific role, use "AWS": "arn:aws:iam::111122223333:root" as the principal. This delegates access control to the workload account’s IAM policies, giving that team flexibility to assign pull permissions to any role they manage—useful for platform teams that don’t control downstream node role names.
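The account-wide variant of the repository policy statement, under that assumption:

```json
{
  "Sid": "AllowAccountWidePull",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::111122223333:root"
  },
  "Action": [
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchCheckLayerAvailability"
  ]
}
```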

With cross-account pulls functioning, the next layer of production readiness is hardening the images and the registry itself—through signing, vulnerability scanning, and tag immutability policies that prevent silent image replacement.

Security Hardening: Image Signing, Scanning, and Immutable Tags

Pulling images securely from ECR is only half the equation. The other half is ensuring those images are trustworthy before they ever reach your cluster. This means scanning for vulnerabilities at push time, preventing tag mutations in production, and verifying cryptographic signatures at admission time.

Visual: ECR security hardening layers from scan findings through immutable tags to admission webhook signature verification

Vulnerability Scanning with ECR Basic and Enhanced

ECR offers two scanning tiers. Basic scanning runs on push or on demand (it was originally powered by the open-source Clair engine; AWS has since moved it to its own scanning technology). It catches known CVEs at no additional cost and is the minimum baseline for any production repository.

Enhanced scanning integrates AWS Inspector and adds continuous, automated re-scanning of images already in the registry as new CVEs are published. Unlike basic scanning, you get findings surfaced in the Inspector console, EventBridge events for automation, and software composition analysis that covers OS packages and programming language dependencies simultaneously.

Enable enhanced scanning at the registry level in the ECR console under Scanning configuration, or via the AWS CLI. Once active, every push triggers an Inspector scan automatically, and findings appear within minutes.
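Via the CLI, enabling enhanced scanning registry-wide can be sketched as follows; the wildcard rule applies continuous scanning to every repository, so adjust the filter for a narrower rollout:

```shell
## Switch the registry's scanning tier to enhanced (Inspector-backed)
aws ecr put-registry-scanning-configuration \
  --region us-east-1 \
  --scan-type ENHANCED \
  --rules '[{"scanFrequency":"CONTINUOUS_SCAN","repositoryFilters":[{"filter":"*","filterType":"WILDCARD"}]}]'
```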

💡 Pro Tip: Wire ECR scan findings to EventBridge and route CRITICAL or HIGH severity events to a Lambda that blocks deployment pipelines. This closes the loop between registry and CI/CD without any manual review step.

Immutable Image Tags

In production, tag mutability is a supply chain liability. If latest or v2.3.1 can be overwritten after deployment, your rollback story collapses and audits become unreliable.

Enable tag immutability per repository under Image tag mutability → Immutable. Once set, any push attempting to overwrite an existing tag is rejected with an ImageTagAlreadyExistsException. This forces your CI pipeline to always produce unique tags—commit SHA or build number—which also makes tracing a running image back to a specific commit trivial.
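The same setting is available from the CLI, which is the practical path when enforcing immutability across many existing repositories; the repository name here is an assumed example:

```shell
## Flip an existing repository to immutable tags
aws ecr put-image-tag-mutability \
  --repository-name my-app/backend \
  --region us-east-1 \
  --image-tag-mutability IMMUTABLE
```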

Image Signing with AWS Signer

AWS Signer integrates with ECR through Notation, the CNCF-standard image signing tool. You create a signing profile backed by an AWS-managed key, sign images post-build using the notation sign command, and store the signature as an OCI artifact in the same repository.

Enforcement happens at the admission layer. The Notation admission webhook—deployed into your EKS cluster—verifies signatures against a trust policy before any pod reaches the Pending state. Unsigned images are rejected without ever pulling a byte to a node.

Lifecycle Policies for Registry Hygiene

Unbounded image accumulation increases storage costs and audit surface. ECR lifecycle policies automatically expire images based on age or count. A sensible baseline expires untagged images after one day and retains only the fifty most recent tagged images per repository. Policies run daily, and deletions are permanent—configure them against a staging registry first to validate the rule logic before applying to production.

With your registry secured at the artifact layer, the next section moves to the operational side: integrating ECR authentication into CI/CD pipelines and building a troubleshooting playbook for the image pull failures that will inevitably surface in production.

Operational Patterns: CI/CD Integration and Troubleshooting Playbook

With your EKS cluster pulling images from ECR and cross-account access configured, the next operational layer is automating image delivery through CI/CD and building the reflexes to diagnose authentication failures before they become incidents.

GitHub Actions with OIDC Federation

Avoid long-lived AWS credentials in GitHub secrets. Instead, configure OIDC federation so GitHub Actions assumes an IAM role directly, with tokens scoped to each workflow run.

First, create the OIDC provider in your AWS account (one-time setup):

create-oidc-provider.sh
aws iam create-open-id-connect-provider \
--url https://token.actions.githubusercontent.com \
--client-id-list sts.amazonaws.com \
--thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1

Then attach a trust policy to the IAM role your workflow will assume:

github-actions-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:my-org/my-app:*"
        },
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}

The workflow itself stays clean — no AWS_ACCESS_KEY_ID secrets required:

.github/workflows/build-push.yml
name: Build and Push to ECR
on:
  push:
    branches: [main]
permissions:
  id-token: write
  contents: read
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-ecr-push
          aws-region: us-east-1
      - name: Login to ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push
        env:
          REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $REGISTRY/my-app:$IMAGE_TAG .
          docker push $REGISTRY/my-app:$IMAGE_TAG

💡 Pro Tip: Pin the role-to-assume condition to a specific branch (ref:refs/heads/main) rather than using a wildcard. This prevents feature branches from pushing images that could shadow production tags.
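Concretely, the trust policy's condition block from earlier would change from a wildcard match to an exact one:

```json
"Condition": {
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
    "token.actions.githubusercontent.com:sub": "repo:my-org/my-app:ref:refs/heads/main"
  }
}
```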

Troubleshooting Playbook

Expired authorization tokens. ECR tokens expire after 12 hours. If your node’s kubelet starts reporting ImagePullBackOff after half a day of uptime, token refresh on the node has stopped working. Verify the ecr:GetAuthorizationToken permission is present on the node role, then check the kubelet credential provider configuration on the node (recent EKS AMIs use the ecr-credential-provider plugin; only legacy Docker-runtime nodes rely on amazon-ecr-credential-helper entries in /etc/docker/daemon.json).

Missing or misconfigured IAM policies. Run a quick policy simulation before assuming a role is correct:

verify-iam-permissions.sh
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::123456789012:role/my-eks-node-role \
--action-names ecr:GetAuthorizationToken ecr:BatchGetImage ecr:GetDownloadUrlForLayer \
--resource-arns "*"

Any DENY result here explains a pull failure without requiring a live cluster test.

Wrong region endpoint. ECR endpoints are regional. A pod configured with 123456789012.dkr.ecr.us-west-2.amazonaws.com cannot pull from a repository that exists in us-east-1. Confirm the image URI in your Deployment manifest matches the region where the repository was created.

VPC endpoint misconfiguration. For clusters in private subnets, image pulls route through VPC endpoints for ECR (com.amazonaws.us-east-1.ecr.api and com.amazonaws.us-east-1.ecr.dkr) and S3 (for image layers). If pulls hang rather than fail immediately, verify both endpoints exist in the VPC, their security groups allow HTTPS from node subnets, and the S3 gateway endpoint covers the correct route table.

When you need to verify connectivity without local tooling, AWS CloudShell is the fastest path. Launch a session inside the same VPC and test the ECR endpoint directly:

cloudshell-connectivity-check.sh
## Run from CloudShell or a bastion in the same VPC
curl -I https://123456789012.dkr.ecr.us-east-1.amazonaws.com/v2/

A 401 Unauthorized response confirms network connectivity is healthy — authentication is the remaining issue. A connection timeout points to the VPC endpoint configuration. CloudShell eliminates the variable of local network or credential state, making it the preferred first diagnostic step for private-subnet pull failures.

ECR API throttling at scale. ECR applies per-account API rate limits that become visible when many nodes pull simultaneously during a rollout. Distribute pull pressure by staggering maxSurge in your Deployment strategy and enabling ECR pull-through cache for base images.

Monitor pull latency alongside throttle events to catch degradation before it surfaces as ImagePullBackOff errors. In CloudWatch, track RepositoryPullCount under the AWS/ECR namespace for pull volume, and watch for throttling through API usage metrics (CallCount in the AWS/Usage namespace) or ThrottlingException errors in CloudTrail. Alarm on sustained throttling, and plot RepositoryPullCount against p99 pull latency during rollouts to identify whether a single registry is becoming a bottleneck. At sufficient node counts, distributing base image pulls through a regional pull-through cache dramatically reduces both latency and throttle exposure.

The authentication and IAM patterns covered throughout this guide form a complete foundation for reliable image delivery in production EKS environments — from initial credential flow through cross-account access, security controls, and operational automation.

Key Takeaways

  • Attach AmazonEC2ContainerRegistryReadOnly to your EKS node role as a baseline, then migrate workloads to IRSA for per-service least-privilege ECR access
  • Use immutable ECR tags in production and enable on-push scanning to catch vulnerabilities before pods ever attempt a pull
  • Diagnose ImagePullBackOff by checking node role policies first, then token expiry, then VPC endpoint routing—in that order