Troubleshooting Common Issues

Azure Compute Solutions

Florin Angelescu

Azure Cloud Architect

Why troubleshooting matters

 

Pod Failure

  • Pods may fail to start

Pod Communication

  • Services might not route traffic

Node Failure

  • Nodes could run out of resources

Why troubleshooting matters

 

  • Structured approach:
    • Observe
    • Identify
    • Test
    • Resolve
  • Working through these steps in order keeps you from chasing symptoms instead of root causes.

Pod failures

 

 

  • Pods failing to start.
  • Causes:
    • Incorrect container images
    • Missing secrets
    • Insufficient resources

Pod failures

kubectl

  • kubectl describe pod: shows pod events
  • kubectl logs: shows container output

Policy

  • Check image pull policies
  • Check registry credentials

Logs

  • Investigate application logs
  • Investigate resource requests

Probe

  • Configure readiness and liveness probes
  • Liveness probes restart unhealthy containers; readiness probes keep traffic away from pods that are not ready
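
The kubectl checks on this slide can be wrapped in a small helper. This is a minimal sketch, not part of the course material: the function name inspect_pod and its arguments are hypothetical, and it assumes kubectl is already pointed at the AKS cluster.

```shell
# Hypothetical helper: collect the usual first-look data for one pod.
# Usage: inspect_pod <pod-name> [namespace]
inspect_pod() {
  pod="$1"
  ns="${2:-default}"

  # The Events section at the end of the output usually explains
  # image pull, missing secret, and scheduling failures.
  kubectl describe pod "$pod" --namespace "$ns"

  # Output of the current container, limited to the last 50 lines.
  kubectl logs "$pod" --namespace "$ns" --tail=50

  # Output of the previous container instance, useful after a crash or a
  # liveness-probe restart. Errors are ignored if there was no restart.
  kubectl logs "$pod" --namespace "$ns" --previous --tail=50 || true
}
```

For example, inspect_pod web-0 demo would inspect a pod named web-0 in a namespace named demo (both names are placeholders). Pods with several containers may also need the -c flag on the logs commands.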

Networking problems

 

 

  • Networking issues can prevent services from reaching:
    • Pods
    • External clients
  • Verify:
    • Services are correctly defined
    • Selectors match pod labels
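
One way to verify that a selector matches pod labels is to put the Service, the pods, and the endpoints side by side. The sketch below is illustrative only: check_service is a hypothetical name, and the label selector is something you supply (for example app=web), not a value taken from the slides.

```shell
# Hypothetical helper: compare a Service's selector with the pods it should match.
# Usage: check_service <service-name> <label-selector> [namespace]
check_service() {
  svc="$1"
  selector="$2"      # labels you expect the Service to select, e.g. app=web
  ns="${3:-default}"

  # The SELECTOR column of the wide output shows what the Service actually uses.
  kubectl get svc "$svc" --namespace "$ns" -o wide

  # Pods carrying the expected labels; compare these with the SELECTOR column.
  kubectl get pods --namespace "$ns" -l "$selector" --show-labels

  # No addresses listed here means no ready pod matched the selector.
  kubectl get endpoints "$svc" --namespace "$ns"
}
```

If the pod list is populated but the endpoints are empty, the Service selector and the pod labels most likely disagree, or the pods are failing their readiness probes.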

Networking problems

kubectl

  • kubectl get svc
  • kubectl get endpoints

Ingress

  • Ingress controllers may need additional configuration
  • Check TLS certificates and path rules

Connectivity

  • Test connectivity from inside a pod
  • kubectl exec with curl

Packet capture

  • Packet capture tools
  • Azure Network Watcher
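
A connectivity test with kubectl exec and curl could look like the sketch below. The function name test_from_pod and the example URL are hypothetical, and the sketch assumes the container image of the chosen pod includes curl, which many minimal images do not.

```shell
# Hypothetical helper: call a Service from inside an existing pod.
# Usage: test_from_pod <pod-name> <url> [namespace]
test_from_pod() {
  pod="$1"
  url="$2"           # e.g. http://web.default.svc.cluster.local:80/
  ns="${3:-default}"

  # -sS: quiet, but still print errors; -m 5: give up after 5 seconds;
  # -o /dev/null -w: discard the body and print only the HTTP status code.
  kubectl exec "$pod" --namespace "$ns" -- \
    curl -sS -m 5 -o /dev/null -w '%{http_code}\n' "$url"
}
```

Running the request from inside the cluster separates Service and DNS problems from ingress or load balancer problems, since it bypasses the external path entirely.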

Scaling challenges

 

  • Autoscaler settings are misconfigured.
  • Nodes lack capacity.

Scaling challenges

 

  • Check Horizontal Pod Autoscaler metrics:
    • kubectl get hpa
  • Ensure the Cluster Autoscaler is enabled.
  • Review resource requests and limits:
    • Overly restrictive values can prevent pods from scheduling
  • Inspect node pool quotas and adjust thresholds.
  • Simulate load during testing.
  • Monitor scaling events in Azure Monitor.
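
The first few checks in this list can be scripted. The following is a rough sketch under stated assumptions: check_scaling is a hypothetical name, the resource group and cluster name are placeholders you pass in, and the Azure CLI is assumed to be installed and logged in.

```shell
# Hypothetical helper: first-look checks when workloads do not scale.
# Usage: check_scaling <resource-group> <cluster-name> [namespace]
check_scaling() {
  rg="$1"
  cluster="$2"
  ns="${3:-default}"

  # Current versus target metrics and replica counts for each autoscaler.
  kubectl get hpa --namespace "$ns"

  # Pods stuck in Pending often mean no node has room for their requests.
  kubectl get pods --namespace "$ns" --field-selector=status.phase=Pending

  # Per node pool: is the cluster autoscaler enabled, and what are its bounds?
  az aks show --resource-group "$rg" --name "$cluster" \
    --query "agentPoolProfiles[].{pool:name, autoscale:enableAutoScaling, min:minCount, max:maxCount}" \
    --output table
}
```

If the autoscaler is enabled and pods are still Pending, the node pool may already be at its maximum count or the subscription may have hit a VM quota.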

Resource constraints

 

  • Nodes can run out of:
    • CPU
    • Memory
    • Disk space
  • Resource pressure on a node can cause pods to be evicted.
  • Monitor resource usage:
    • Azure Monitor
    • kubectl top
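
The kubectl side of this monitoring can be sketched as a single helper. The name check_node_pressure is hypothetical, and kubectl top only works when a metrics server is running in the cluster.

```shell
# Hypothetical helper: show where CPU, memory, and disk pressure is building up.
# Usage: check_node_pressure [node-name]
check_node_pressure() {
  node="${1:-}"

  # Per-node CPU and memory usage (needs the metrics server).
  kubectl top nodes

  # Heaviest memory consumers across all namespaces.
  kubectl top pods --all-namespaces --sort-by=memory

  # Evicted pods remain listed in the Failed phase, alongside other failures.
  kubectl get pods --all-namespaces --field-selector=status.phase=Failed

  # For one node: conditions such as MemoryPressure and DiskPressure,
  # plus the requests and limits already allocated on it.
  if [ -n "$node" ]; then
    kubectl describe node "$node"
  fi
}
```

Calling it without arguments gives the cluster-wide view; passing a node name adds the detailed conditions for that node.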

Resource constraints

Resources

  • Over-committing resources leads to instability
  • Define realistic requests and limits

Priority

  • Taints and tolerations control pod placement
  • Ensure critical workloads have priority

Audit

  • Regularly audit resource allocation and quotas
  • Catch bottlenecks before they affect workloads

Nodes

  • Multiple node pools with different VM sizes
  • Balance workloads efficiently
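
Defining requests and limits can be done in the manifest or from the command line. The sketch below uses kubectl set resources; the function name set_deployment_resources is hypothetical and the values in the usage comment are placeholders, not recommendations.

```shell
# Hypothetical helper: set explicit requests and limits on a deployment.
# Usage: set_deployment_resources <deployment> <requests> <limits> [namespace]
# Placeholder values: requests "cpu=250m,memory=256Mi", limits "cpu=500m,memory=512Mi".
set_deployment_resources() {
  deploy="$1"
  requests="$2"
  limits="$3"
  ns="${4:-default}"

  # Applies to every container in the pod template and triggers a rollout.
  kubectl set resources deployment "$deploy" --namespace "$ns" \
    --requests="$requests" --limits="$limits"

  # Confirm what the pod template now declares.
  kubectl get deployment "$deploy" --namespace "$ns" \
    -o jsonpath='{.spec.template.spec.containers[*].resources}'
}
```

Realistic values are best derived from observed usage, for example the figures reported by kubectl top or Azure Monitor over a representative period.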

Recap

 

  • Troubleshooting in AKS involves diagnosing:
    • Pod failures
    • Networking issues
    • Scaling challenges
    • Resource constraints

Recap

 

  • Combining Kubernetes tools with Azure integrations:
    • Lets you resolve problems quickly and maintain reliability
  • Building a troubleshooting playbook for your team:
    • Gives you consistent responses and faster resolution times

Let's practice!
