One of the most common Kubernetes production issues engineers run into is:
Readiness probe failed
At first glance, the application may look healthy:
* pods are running
* containers started successfully
* CPU and memory look normal
But traffic suddenly stops reaching the pod.
This usually happens because Kubernetes marked the pod as Not Ready after the readiness probe started failing.
When this happens:
* services stop routing traffic to the pod
* rollouts can stall while new pods wait to become Ready
* rolling updates can time out and fail
* applications may become partially unavailable
In large Kubernetes environments, readiness probe failures are extremely common during:
* deployments
* startup spikes
* dependency failures
* resource pressure
* networking instability
What Is a Readiness Probe in Kubernetes?
A readiness probe tells Kubernetes whether a container is ready to receive traffic.
If the readiness probe succeeds:
* Kubernetes adds the pod to the service endpoint list
* traffic starts flowing to the pod
If the readiness probe fails:
* Kubernetes temporarily removes the pod from load balancing
* traffic stops going to that pod
Unlike liveness probes, readiness probes do not restart containers.
They only control whether the pod should receive production traffic.
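The Ready / Not Ready decision is based on runs of consecutive probe results, not on a single probe. A minimal sketch of that logic, assuming the default thresholds (failureThreshold 3, successThreshold 1). The class and method names are made up for illustration and are not part of any Kubernetes API:

```python
from typing import Optional


class ReadinessTracker:
    """Toy model of how consecutive probe results flip a pod's Ready state."""

    def __init__(self, failure_threshold: int = 3, success_threshold: int = 1):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.ready = False                  # a probed container starts Not Ready
        self._last: Optional[bool] = None   # result of the previous probe
        self._streak = 0                    # length of the current run of equal results

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting Ready state."""
        self._streak = self._streak + 1 if probe_ok == self._last else 1
        self._last = probe_ok
        if probe_ok and self._streak >= self.success_threshold:
            self.ready = True    # pod is added to the Service endpoints
        elif not probe_ok and self._streak >= self.failure_threshold:
            self.ready = False   # pod is removed from the Service endpoints
        return self.ready


tracker = ReadinessTracker()
results = [True, False, False, True, False, False, False]
print([tracker.observe(r) for r in results])
# [True, True, True, True, True, True, False]
```

Two isolated failures do not remove the pod. Only the third failure in a row does.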
Why “Readiness Probe Failed” Happens
A readiness probe can fail for many reasons.
The most common causes are:
* slow application startup
* incorrect probe configuration
* application dependency failures
* high CPU or memory usage
* networking issues
* database connection problems
* overloaded nodes
* port binding failures
* Kubernetes resource pressure
In many real-world incidents, the actual issue is not Kubernetes itself.
It is usually the application or infrastructure underneath.
Common Causes of Readiness Probe Failures
1. Application Started Slowly
This is one of the most common causes.
The container starts successfully, but the application inside is still:
* loading dependencies
* warming caches
* establishing database connections
* compiling assets
* starting background workers
Kubernetes begins probing before the application can answer, so the probes fail and the pod stays Not Ready.
This happens frequently with:
* Java applications
* Spring Boot services
* large Node.js applications
* AI/ML workloads
Common Fix
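Tune:
* initialDelaySeconds, to delay the first probe until the application is likely to be listening
* timeoutSeconds, to give each probe longer to answer
* failureThreshold, to tolerate more consecutive failures before the pod is marked Not Ready
For applications with long or unpredictable startup, a startupProbe is usually the better tool: readiness and liveness checks do not run until it succeeds.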
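To see what the delay actually changes, here is a simplified timeline model. It assumes probes fire exactly at initialDelaySeconds and then every periodSeconds, and it ignores the kubelet's start-up jitter. The function name and parameters are hypothetical:

```python
def probe_timeline(startup_seconds, initial_delay=0, period=10, success_threshold=1):
    """Return (failed_probes, ready_at_seconds) for an application that starts
    answering its readiness endpoint `startup_seconds` after the container starts."""
    t = initial_delay
    failed = 0
    successes = 0
    while True:
        if t >= startup_seconds:
            # The application is up, so this probe and all later ones succeed.
            successes += 1
            if successes >= success_threshold:
                return failed, t
        else:
            failed += 1
        t += period


print(probe_timeline(45))                    # (5, 50)
print(probe_timeline(45, initial_delay=40))  # (1, 50)
```

In this model a 40-second delay removes four "Readiness probe failed" events, but the pod becomes Ready at the same moment. The delay mostly reduces noise. It does not make startup faster.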
2. Wrong Readiness Probe Path
Sometimes the readiness endpoint itself is incorrect.
Example:
* /health does not exist
* application returns 404
* endpoint requires authentication
* wrong port configured
Every probe fails, so the pod never becomes Ready.
Common Fix
Verify:
* endpoint path
* port
* protocol
* response status
using:
kubectl describe pod <pod-name>
and:
kubectl logs <pod-name>
3. Database or Dependency Failures
Many applications fail readiness checks because dependencies are unavailable.
Examples:
- PostgreSQL unavailable
- Redis connection timeout
- Kafka unreachable
- external API failure
The process stays running, but its readiness endpoint reports failure because a dependency check inside it fails.
This is extremely common in microservice environments.
4. Resource Starvation
Under high load:
- CPU throttling
- memory pressure
- disk IO contention
can delay readiness responses.
The application responds too slowly and Kubernetes marks the probe as failed.
This often happens during:
- traffic spikes
- deployments
- autoscaling events
- noisy neighbor workloads
5. Kubernetes Networking Issues
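Sometimes the probe itself cannot reach the container properly because of:
- CNI instability
- service mesh sidecars intercepting or rejecting the probe request
DNS failures and ingress misconfiguration rarely break the probe directly, because the kubelet sends it straight to the pod IP without going through a Service or Ingress. They matter when the readiness endpoint itself depends on name resolution or on calls routed through the ingress.
This becomes more common in large multi-cluster environments.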
How to Troubleshoot “Readiness Probe Failed”
Step 1: Describe the Pod
Run:
kubectl describe pod <pod-name>
Look for:
- readiness probe errors
- timeout messages
- HTTP status failures
- connection refused errors
This usually reveals the first clue.
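The event message usually points at one of the causes above. A rough sketch of that mapping follows. The message fragments are typical of kubelet events but are assumptions here, since exact wording differs between Kubernetes versions, and the function name is hypothetical:

```python
# (fragment, likely cause), checked in order. Fragments are illustrative only.
HINTS = [
    ("connection refused", "nothing is listening on the probe port (slow start or wrong port)"),
    ("context deadline exceeded", "the endpoint did not answer within timeoutSeconds"),
    ("statuscode: 404", "the probe path does not exist"),
    ("statuscode: 401", "the endpoint requires authentication"),
    ("statuscode: 403", "the endpoint requires authentication"),
    ("statuscode: 5", "the application reports itself unhealthy, often a failed dependency"),
]


def explain_probe_event(message: str) -> str:
    """Map a probe failure message to the most likely cause."""
    lowered = message.lower()
    for fragment, hint in HINTS:
        if fragment in lowered:
            return hint
    return "unrecognized message, check the container logs"


print(explain_probe_event("Readiness probe failed: HTTP probe failed with statuscode: 503"))
```

Treat the result as a starting point for the next steps, not as a diagnosis.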
Step 2: Check Container Logs
Run:
kubectl logs <pod-name>
Look for:
- startup delays
- dependency failures
- database connection issues
- memory errors
- crashes
Step 3: Test the Endpoint Manually
Exec into the pod:
kubectl exec -it <pod-name> -- sh
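Then test the same path and port the probe uses:
curl -i localhost:<port>/health
Verify:
- endpoint responds
- response time is within timeoutSeconds (1 second by default)
- status code is between 200 and 399, the range an HTTP probe counts as success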
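The same check can be scripted. A minimal sketch, assuming Python is available wherever you run it (inside the pod, or on your machine through kubectl port-forward). The function name is hypothetical, and the timeout here applies to each socket operation rather than to the whole request, so it only approximates timeoutSeconds:

```python
import http.client
import time


def http_probe(host, port, path, timeout_seconds=1.0):
    """Send one GET roughly the way an HTTP readiness probe would.

    Returns (ok, status, elapsed_seconds). status is None if no response arrived.
    """
    started = time.monotonic()
    conn = http.client.HTTPConnection(host, port, timeout=timeout_seconds)
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS error, or a malformed response.
        status = None
    finally:
        conn.close()
    elapsed = time.monotonic() - started
    # An HTTP probe succeeds on any status from 200 up to, but not including, 400.
    ok = status is not None and 200 <= status < 400
    return ok, status, elapsed
```

For example, http_probe("localhost", 8080, "/health") returns the verdict, the status code, and how long the endpoint took to answer. Port 8080 is a placeholder for your own probe port.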
Step 4: Check Resource Usage
Run:
kubectl top pod <pod-name> --containers
This needs the metrics-server add-on to be installed. High CPU or memory pressure may slow responses enough to fail probes.
Step 5: Review Probe Configuration
A badly configured readiness probe is very common.
Check:
- timeoutSeconds
- periodSeconds
- initialDelaySeconds
- failureThreshold
A timeoutSeconds shorter than the endpoint's worst-case response time under load causes false failures in production.
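These settings also decide how long a broken pod keeps receiving traffic. A back-of-the-envelope sketch, using the Kubernetes defaults (periodSeconds 10, failureThreshold 3, timeoutSeconds 1) as parameter defaults. The function name is hypothetical and the result is an approximation that ignores endpoint propagation delay:

```python
def worst_case_detection_seconds(period_seconds=10, failure_threshold=3, timeout_seconds=1):
    """Approximate upper bound on the time between an application breaking
    and the pod being marked Not Ready.

    Worst case: the application breaks right after a successful probe, so the
    first failing probe starts one full period later, and the last of the
    `failure_threshold` failing probes runs until its timeout expires.
    """
    return period_seconds * failure_threshold + timeout_seconds


print(worst_case_detection_seconds())         # 31
print(worst_case_detection_seconds(5, 2, 2))  # 12
```

Shorter periods and lower thresholds detect failures sooner, but they also make the probe more sensitive to brief slowdowns.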
Difference Between Readiness Probe and Liveness Probe
This is a common source of confusion.
Readiness Probe
Controls:
- whether traffic reaches the pod
Failure result:
- pod removed from service endpoints
Container keeps running.
Liveness Probe
Controls:
- whether Kubernetes should restart the container
Failure result:
- container restart
Liveness failures are usually more disruptive, because a restart drops in-flight requests and any in-memory state.
Best Practices for Readiness Probes
Use Dedicated Health Endpoints
Do not use heavy application endpoints for readiness checks.
Use lightweight endpoints like:
/ready
/healthz
/status
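A dedicated endpoint can be very small. A minimal sketch using only the Python standard library. The /ready path matches the list above, while the app_ready flag, the function name, and port 8080 are assumptions for illustration:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set by the application once its startup work has finished (hypothetical flag).
app_ready = threading.Event()


class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ready":
            code, body = 404, b"not found\n"
        elif app_ready.is_set():
            code, body = 200, b"ok\n"          # send traffic to this pod
        else:
            code, body = 503, b"not ready\n"   # keep this pod out of rotation
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the application log


def serve_readiness(port=8080):
    """Serve the readiness endpoint on a background thread."""
    server = HTTPServer(("0.0.0.0", port), ReadyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The handler does no work beyond reading a flag, so it stays fast even when the application is busy.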
Avoid Expensive Dependency Checks
Your readiness endpoint should not perform:
- heavy DB queries
- external API calls
- expensive computations
Otherwise transient slowdowns can remove healthy pods from traffic.
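One possible pattern, when a dependency check is unavoidable, is to run it in the background and let the readiness endpoint read only the cached result. A minimal sketch under that assumption. The class name, method names, and intervals are hypothetical:

```python
import threading
import time


class CachedCheck:
    """Runs a possibly slow check off the request path and caches the result."""

    def __init__(self, check, interval_seconds=5.0, max_age_seconds=30.0):
        self._check = check              # callable returning True or False
        self._interval = interval_seconds
        self._max_age = max_age_seconds
        self._ok = False
        self._checked_at = None          # monotonic time of the last run
        self._lock = threading.Lock()

    def refresh(self):
        """Run the check once and store the result."""
        try:
            ok = bool(self._check())
        except Exception:
            ok = False                   # a crashing check counts as a failure
        with self._lock:
            self._ok = ok
            self._checked_at = time.monotonic()

    def start(self):
        """Refresh forever on a daemon thread."""
        def loop():
            while True:
                self.refresh()
                time.sleep(self._interval)
        threading.Thread(target=loop, daemon=True).start()

    def is_ready(self):
        """Cheap read for the probe handler: no I/O, only the cached value."""
        with self._lock:
            if self._checked_at is None:
                return False
            fresh = time.monotonic() - self._checked_at <= self._max_age
            return self._ok and fresh
```

The probe handler calls is_ready() and returns immediately, so a slow database cannot push the probe past its timeout. A result older than max_age_seconds is treated as not ready, which covers the case where the background thread itself gets stuck.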
Tune Probe Timing Carefully
Default probe settings (a 1-second timeout, a 10-second period, and a failure threshold of 3) are often too aggressive for production workloads.
Adjust:
- startup delays
- timeouts
- retry thresholds
based on real application behavior.
Monitor Probe Failures
Frequent readiness failures usually indicate:
- unstable deployments
- overloaded infrastructure
- application bottlenecks
This should be monitored proactively.
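A simple starting point is counting probe failure events per pod. A rough sketch that works on the parsed output of kubectl get events -o json. The function name is hypothetical, and the sample event below is invented for illustration:

```python
from collections import Counter


def readiness_failures_by_pod(event_list):
    """Count readiness probe failures per pod from a parsed Event list
    (a dict with an "items" list)."""
    counts = Counter()
    for event in event_list.get("items", []):
        if event.get("reason") != "Unhealthy":
            continue
        if "Readiness probe failed" not in event.get("message", ""):
            continue
        pod = event.get("involvedObject", {}).get("name", "<unknown>")
        # Repeated identical events are folded into one event with a count.
        counts[pod] += event.get("count") or 1
    return counts


sample = {"items": [{
    "reason": "Unhealthy",
    "message": "Readiness probe failed: HTTP probe failed with statuscode: 503",
    "involvedObject": {"name": "example-pod"},
    "count": 4,
}]}
print(readiness_failures_by_pod(sample).most_common(5))
```

Events expire after a short retention window, so for long-term trends the counts should go to a metrics system.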
Why Readiness Probe Failures Matter in Production
Many engineers underestimate readiness failures because containers stay running.
But in production environments, readiness instability can cause:
- partial outages
- traffic imbalance
- deployment failures
- autoscaling problems
- cascading incidents
Unstable readiness is often the first visible sign of deeper application or infrastructure issues.
How Modern Teams Reduce Readiness Probe Incidents
Enterprise SRE and CloudOps teams increasingly use:
- Kubernetes observability
- AI-assisted root cause analysis
- deployment correlation
- topology mapping
- automated investigation workflows
to identify probe failures faster.
This becomes critical in large-scale environments where manually correlating:
- logs
- deployments
- alerts
- resource pressure
- networking changes
takes too much time during incidents.
What does “Readiness Probe Failed” mean in Kubernetes?
It means Kubernetes determined that the pod is not ready to receive production traffic based on the readiness check configuration.
Does readiness probe failure restart the container?
No. Readiness probe failures only remove the pod from traffic routing. They do not restart containers.
What causes readiness probe failures?
Common causes include:
- slow startup
- wrong endpoint configuration
- database failures
- memory pressure
- networking issues
- dependency timeouts
How do I check readiness probe errors?
Run:
kubectl describe pod <pod-name>
and inspect probe failure events.
Can high CPU usage cause readiness probe failures?
Yes. CPU throttling or overloaded containers can delay probe responses and trigger failures.
What is the difference between readiness and liveness probes?
Readiness probes control traffic routing. Liveness probes determine whether the container should restart.
Can readiness probe failures cause downtime?
Yes. If enough pods become Not Ready, applications may become partially or fully unavailable.
What are best practices for readiness probes?
Best practices include:
- lightweight health endpoints
- proper timeout tuning
- avoiding expensive checks
- monitoring probe failures proactively