In-tree GCE PD storage driver can leak regional GCE PDs when the related PV is deleted #109328

@kyasbal

Description

What happened?

When the in-tree GCE PD storage driver is used for PVs backed by regional storage, the regional GCE disk may not be deleted after the PV resource is deleted.

This issue is caused by the error handling logic in the in-tree GCE PD storage driver. If the API request to delete a regional disk fails with an error other than Not Found, the deletion request is treated as completed even though the regional disk has not been deleted.

What did you expect to happen?

The storage driver should retry the deletion when the request fails with any error other than NotFound.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a regional GKE cluster running 1.21.x or earlier without enabling the CSI PD driver.
  2. Create a large number of PersistentVolumes.
  3. Delete many of the PersistentVolumes at the same time. This can hit a quota error, which can trigger this issue.

There are probably other, easier ways to reproduce this. The problem appears whenever an error other than NotFound occurs while the in-tree storage driver is deleting a regional GCE PD. The error does not have to be a quota error.

Anything else we need to know?

Root cause

L920 in the following file should be corrected:

return nil, mc.Observe(nil) // <- This should be return nil, mc.Observe(err) 

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/gce/gce_disks.go#L920

When an error other than Not Found is raised, the err object will be nil at L937.

regionalDisk, err := g.getRegionalDiskByName(diskName)
if err == nil {
	return regionalDisk, err
}

The if condition is then false even when the failure was only a quota-related error on regional disks. The code after L940 then runs, but no zonal disk is found, so the deletion is regarded as complete.

Possible mitigations

Kubernetes version

Only affects GKE 1.21.x or earlier.

This won't be a problem from 1.22.x onward because CSI Volume Migration is enabled starting with that version.

Cloud provider

GCP

OS version

N/A

Install tools

N/A

Container runtime (CRI) and version (if applicable)

Any

Related plugins (CNI, CSI, ...) and versions (if applicable)

In-tree GCE persistent disk driver

This won't be a problem when the GCE PD CSI driver is used on a cluster.

Metadata

Labels

kind/bug: Categorizes issue or PR as related to a bug.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.
