What happened?
When the in-tree GCE PD storage driver is used for PVs backed by regional storage, the GCE regional disk may not be deleted after the PV resource is deleted.
This issue is caused by the error handling logic in the in-tree GCE PD storage driver. If the API request to delete a regional disk fails with an error other than NotFound, the deletion is regarded as complete even though the regional disk has not been deleted.
What did you expect to happen?
The storage driver should retry the deletion when the request fails with any error other than NotFound.
How can we reproduce it (as minimally and precisely as possible)?
- Create a regional GKE cluster at version 1.21.x or earlier without enabling the CSI PD driver
- Create a large number of PersistentVolumes
- Delete the PersistentVolumes at the same time. This can hit a QuotaError, which can trigger the issue.
There are probably other, easier ways to reproduce this. The problem appears whenever an error other than NotFound occurs while the in-tree storage driver is deleting a regional GCE PD; the error does not need to be a quota error.
Anything else we need to know?
Root cause
L920 in the following file should be corrected:

```go
return nil, mc.Observe(nil) // <- This should be: return nil, mc.Observe(err)
```

When an error other than NotFound is raised, the `err` object will be nil at L937:

```go
regionalDisk, err := g.getRegionalDiskByName(diskName)
if err == nil {
	return regionalDisk, err
}
```

The `if` condition then does not catch the failure even when it was just a quota-related error on the regional disk. The code after L940 runs, but no zonal disk is found, so the deletion is regarded as complete.
Possible mitigations
- Upgrade your cluster to 1.22 or above (the CSI PD driver will then be used instead of the in-tree driver).
- Migrate your storage driver to the CSI PD driver:
https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/gce-pd-csi-driver
Kubernetes version
Only GKE 1.21.x or earlier can be affected.
This is not a problem from 1.22.x on, because CSI Volume Migration is enabled starting with that version.
Cloud provider
GCP
OS version
N/A
Install tools
N/A
Container runtime (CRI) and version (if applicable)
Any
Related plugins (CNI, CSI, ...) and versions (if applicable)
In-tree GCE persistent disk driver
This is not a problem when the GCE PD CSI driver is used on the cluster.