What happened?
When the in-tree GCE PD storage driver is used for PVs backed by regional storage, the GCE regional disk may not be deleted after the PV resource is deleted.
This issue is caused by the error handling logic in the in-tree GCE PD storage driver. If the API request to delete a regional disk fails with an error other than NotFound, the deletion is regarded as complete even though the regional disk has not been deleted.
What did you expect to happen?
The storage driver should retry the deletion when the request fails with any error other than NotFound.
How can we reproduce it (as minimally and precisely as possible)?
- Create a regional GKE cluster at version 1.21.x or earlier without enabling the CSI PD driver
- Create a large number of PersistentVolumes
- Delete the PersistentVolumes at the same time. This can hit a QuotaError, which can trigger the issue.
There are probably other, easier ways to reproduce this. The problem appears whenever an error other than NotFound occurs while the in-tree storage driver is deleting a regional GCE PD; the error does not need to be a quota error.
Anything else we need to know?
Root cause
L920 in the following file should be corrected:

```go
return nil, mc.Observe(nil) // <- This should be: return nil, mc.Observe(err)
```

When an error other than NotFound is raised, the `err` object will be nil at L937:

```go
regionalDisk, err := g.getRegionalDiskByName(diskName)
if err == nil {
	return regionalDisk, err
}
```

The `if` condition then does not catch the failure even when it was just a quota-related error on the regional disk. The code after L940 runs, but no zonal disk is found, so the deletion is regarded as complete.
Possible mitigations
- Upgrade your cluster to 1.22 or above (the CSI PD driver will then be used instead of the in-tree driver).
- Migrate your storage driver to the CSI PD driver:
https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/gce-pd-csi-driver
Kubernetes version
Only GKE 1.21.x or earlier can be affected.
This is not a problem from 1.22.x on, because CSI Volume Migration is enabled starting with that version.
Cloud provider
GCP
OS version
N/A
Install tools
N/A
Container runtime (CRI) and version (if applicable)
Any
Related plugins (CNI, CSI, ...) and versions (if applicable)
In-tree GCE persistent disk driver
This is not a problem when the GCE PD CSI driver is used on the cluster.