
I have recently upgraded my AKS cluster from 1.16.x to 1.18.17 (a jump of two minor versions). I did the upgrade through the Azure Portal, not the CLI.

The upgrade itself has worked, I can see my cluster is now on version 1.18.17 and on first glance everything seems to be working as expected, but at the top of the Overview panel, this message is displayed:

This container service is in a failed state. Click here to go to diagnose and solve problems.

With the cluster in this state I can't scale or upgrade; I get an error telling me the operation isn't available whilst the cluster is upgrading or in a failed state.

The supporting page the error links to doesn't give me any useful information. It doesn't even mention the fact my cluster is in a failed state.

I've seen this error once before, when I was approaching the limit of our VM compute quota. At the moment, though, I am only using 10%, and I don't have enough pods and nodes to push it over. The only other quota that's maxed out is network watchers, and I don't think that's related.

The scaling operation links to this support document: aka.ms/aks-cluster-failed, and the suggestion there is about quota sizes, which I have already tried.

I'm really scratching my head with this one; I can't find any useful support documents, blog posts, or other questions, so any help would be greatly appreciated!

4 Answers


Answering my own question in the hope it can help others, or myself in the future.

I managed to get more information about the error by running an update with the Azure CLI: Upgrade an Azure Kubernetes Service (AKS) cluster.

You can also use the CLI to check for available updates: Check for available AKS cluster upgrades.

Using the CLI tends to be more informative when troubleshooting.
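For reference, a sketch of the two commands (resource group and cluster names are placeholders; the script only composes and prints the commands rather than calling Azure):

```shell
RG="my-resource-group"       # placeholder resource group name
CLUSTER="my-aks-cluster"     # placeholder cluster name

# Lists the Kubernetes versions this control plane is allowed to move to:
CHECK="az aks get-upgrades --resource-group $RG --name $CLUSTER --output table"

# Re-runs the upgrade; --debug prints the underlying ARM requests and
# responses, which usually contain the real failure message:
UPGRADE="az aks upgrade --resource-group $RG --name $CLUSTER --debug"

echo "$CHECK"
echo "$UPGRADE"
```

Running the upgrade command with `--debug` is what surfaced the detailed error in my case.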



AKS Cluster in Failed Status

When you perform an operation on an AKS cluster, it may fail for MANY reasons: a failed upgrade operation, exhausted quotas, nodes turned off, memory pressure, misconfiguration, etc. Whenever an operation fails, the cluster enters a "Failed" status!

The solution may depend on what caused the error, but a simple way to try to solve this issue is to rescale the failed nodes:

  • Just click on the node with a "Failed" status, then click on "Scale node pool."
  • When the pop-up opens, click on "Apply" without making any changes. This will trigger an operation that will re-provision the node and possibly solve your problem.
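If you want to see from the CLI which node pool is the failed one before rescaling it in the portal, each pool's provisioning state can be queried. This is a sketch with placeholder names; it composes and prints the command rather than calling Azure:

```shell
RG="my-resource-group"       # placeholder resource group name
CLUSTER="my-aks-cluster"     # placeholder cluster name

# Show every node pool's name and provisioning state; a pool reporting
# "Failed" is the one the portal rescale needs to target:
POOLS="az aks nodepool list --resource-group $RG --cluster-name $CLUSTER --query '[].{name:name,state:provisioningState}' -o table"

echo "$POOLS"
```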

Another simple way is to send an empty upgrade operation with no version defined.

DISCLAIMER: On the current az CLI version (2.45.0), when you don't define a specific version to upgrade to, AKS will upgrade to the current version. So before proceeding with this approach, check whether this is the behavior of your CLI version!

To perform this operation, you will need to use the Cloud Shell or the az CLI, ensuring that you are on the correct tenant. The command to perform the operation is:

az aks upgrade --resource-group <resource_group_name> --name <cluster_name> --yes --debug

Note: When using the az CLI to manage Kubernetes, use the --debug flag during the operation! This will be VERY helpful in understanding what is happening during the upgrade process, allowing you to check for errors and take appropriate action. The --yes flag is used to skip the confirmation dialog.


Azure Kubernetes Service cluster in a failed state

Please go ahead and reconcile the cluster.

Reconcile command:

az resource update --name xxxxxx --namespace Microsoft.ContainerService --resource-group xxxx --resource-type ManagedClusters --subscription xxxxx
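As a sketch with placeholder names (the script only composes and prints the commands rather than calling Azure), the reconcile can be followed by a check that the cluster actually left the Failed state:

```shell
RG="my-resource-group"       # placeholder resource group name
CLUSTER="my-aks-cluster"     # placeholder cluster name
SUB="my-subscription-id"     # placeholder subscription

# Reconcile: an empty az resource update re-runs the last deployment and
# often clears a transient "Failed" provisioning state:
RECONCILE="az resource update --name $CLUSTER --namespace Microsoft.ContainerService --resource-type ManagedClusters --resource-group $RG --subscription $SUB"

# Afterwards, confirm the cluster reports Succeeded again:
STATE="az aks show --resource-group $RG --name $CLUSTER --query provisioningState -o tsv"

echo "$RECONCILE"
echo "$STATE"
```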

1 Comment

@Jim this should be the accepted answer!

In my case, the issue was with a PodDisruptionBudget (PDB) in the ingress controller namespace. First, I deleted the PDB:

kubectl delete pdb ingress-nginx-controller -n ingress-basic 
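If you're not sure which PDB is the culprit, you can list them all first; a budget stuck at zero allowed disruptions will block the node drain that an upgrade or scale has to perform. A sketch of the inspection commands (composed and printed rather than run against a live cluster; the PDB name and namespace are the ones from my cluster):

```shell
# List every PodDisruptionBudget across namespaces; look for one whose
# ALLOWED DISRUPTIONS column shows 0:
LIST="kubectl get pdb --all-namespaces"

# Inspect the suspect PDB before deleting it:
DESCRIBE="kubectl describe pdb ingress-nginx-controller -n ingress-basic"

echo "$LIST"
echo "$DESCRIBE"
```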

Second, I reconciled the AKS cluster:

az aks update --resource-group {RG} --name {ClusterName} 

Lastly, I did the same for my node pool, and the issue was fixed:

az aks nodepool update --resource-group {RG} --cluster-name {ClusterName} --name {NODEPOOL}

