All of our AKS clusters have the following error reported in Azure Portal:

This container service is in a failed state. Click here to open a new support request.

It seems we also cannot edit the cluster. When trying to scale out the nodes, I am getting the following error:

Failed to save container service 'test-aks'. Error: Operation is not allowed while cluster is being upgrading or failed in upgrade

When looking into the AKS properties, I see a provisioning state of "Failed".

We don't know how to troubleshoot this problem.

  • I'd contact support and go to #sig-azure on the k8s Slack. Commented Feb 11, 2019 at 13:56
  • Did you make any changes to your cluster recently, like upgrading to another version? Commented Feb 12, 2019 at 18:39
  • Use the az aks scale command to scale the cluster nodes using Azure CLI as described here and share the results: learn.microsoft.com/en-us/azure/aks/… It is likely that you exceeded the core quota. Let me know. Commented Feb 12, 2019 at 18:45
  • Any more questions? If it was helpful, you can accept it as the answer. Commented Feb 13, 2019 at 9:19
  • It was because I submitted an update request to the cluster, but there were no vCPUs available in my subscription. This set the provisioning state of the update to "Failed", but with no reason given. I had to increase my quota and rerun the update command. Commented Feb 13, 2019 at 9:27

2 Answers

Use the az aks scale command to scale the cluster nodes using Azure CLI as described here: https://learn.microsoft.com/en-us/azure/aks/scale-cluster#scale-the-cluster-nodes

az aks show --resource-group myResourceGroup --name myAKSCluster --query agentPoolProfiles

The Azure CLI will show you the descriptive error message. It is likely that you exceeded your core quota. More details are discussed in this thread: https://github.com/Azure/AKS/issues/542
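For reference, here is a rough sketch of how the check-then-retry flow might look from the CLI. The function names (aks_state, aks_quota, aks_scale) are made up for illustration, and myResourceGroup / myAKSCluster are the placeholder names from the docs, so substitute your own. It assumes the cluster has a single node pool; with several pools, az aks scale also needs --nodepool-name.

```shell
# Placeholder names; replace with your own resource group and cluster.
RG=myResourceGroup
CLUSTER=myAKSCluster

# Print the cluster's provisioning state (for example "Failed" or "Succeeded").
aks_state() {
  az aks show --resource-group "$RG" --name "$CLUSTER" \
    --query provisioningState --output tsv
}

# List compute usage against limits (including vCPU quotas) in the cluster's region.
aks_quota() {
  location=$(az aks show --resource-group "$RG" --name "$CLUSTER" \
    --query location --output tsv)
  az vm list-usage --location "$location" --output table
}

# Retry the scale operation; $1 is the desired node count.
# If it fails again, the CLI prints the underlying error.
aks_scale() {
  az aks scale --resource-group "$RG" --name "$CLUSTER" --node-count "$1"
}
```

Something like `aks_state`, then `aks_quota`, then `aks_scale 3` once the quota has room, would be the order I'd try.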

For the error that you show:

This container service is in a failed state. Click here to open a new support request.

This also happened to me. Usually, there is a limit on the resources a user can consume. On my side, I can only use 10 vCPUs, so I got the error when I scaled up to more nodes and there were no vCPUs left. I think this is a possible reason for you as well. It's worth checking.
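To make the arithmetic concrete, here is a small sketch. The 10 vCPU limit is the one from my subscription; the 2 vCPUs per node and the fits_quota helper are only assumptions for illustration, so plug in your own limit and node size.

```shell
# Succeed if scaling to the target node count stays within the vCPU limit.
# Args: limit, vCPUs already used elsewhere, vCPUs per node, target node count.
fits_quota() {
  limit=$1; used_elsewhere=$2; per_node=$3; nodes=$4
  needed=$(( used_elsewhere + per_node * nodes ))
  [ "$needed" -le "$limit" ]
}

fits_quota 10 0 2 5 && echo "5 nodes fit"               # needs 10 of 10 vCPUs
fits_quota 10 0 2 6 || echo "6 nodes exceed the quota"  # needs 12, limit is 10
```

So with those numbers, a sixth node is the one that would push the update into the failed state.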

2 Comments

It was because I submitted an update request to the cluster, but there were no vCPUs available in my subscription. This set the provisioning state of the update to "Failed", but with no reason given. I had to increase my quota and rerun the update command. Thanks
@davenewza You mean you increased the quota and it worked? Maybe it's a limit on other resources. You can get more details from the log.
