Report and repair faulty hosts with TPUs in All Capacity mode

If you notice issues on an All Capacity mode VM that you can't resolve otherwise, such as consistently high ICI latency metrics or a consistently high temperature compared to peers, we recommend that you report its host as faulty. When you report a host as faulty, Compute Engine marks the host as faulty and automatically repairs the VM by running host maintenance. With All Capacity mode, TPU VMs aren't migrated to another host during the repair. Instead, they are restarted on the same host if there is sufficient capacity. You can only report a faulty host that has at least one running VM.

Use the report-host-as-faulty command to report a faulty host, and use the required --fault-behavior parameter to provide additional information about the problem.

# --fault-behavior is required; --disruption_policy is optional.
gcloud compute instances report-host-as-faulty example-tpu-vm \
  --fault-behavior=FAULT_BEHAVIOR \
  --description="silent data corruption affecting our ML job…" \
  [--disruption_policy=FUTURE | IMMEDIATE]

You can pass one of the following values for --fault-behavior:

PERFORMANCE: use this to report performance degradation on an instance
SILENT_DATA_CORRUPTION: use this to report any suspected silent data corruption on an instance
CHIP_ERROR: use this to report any TPU errors or faults where the accelerator becomes unresponsive
BEHAVIOR_UNSPECIFIED: use this to report an issue that does not fit in the other three behavior groups
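
The four values above can be checked before the command is invoked. The following is a minimal sketch, not part of the gcloud CLI: the function names is_valid_fault_behavior and build_report_command are hypothetical, and the command line it prints only mirrors the flags named in this section. It prints the command for review and does not run it.

```shell
# Hypothetical helpers (not part of gcloud): validate a --fault-behavior value
# and print the report command for review instead of executing it.

# Returns 0 if $1 is one of the four values listed in this section.
is_valid_fault_behavior() {
  case "$1" in
    PERFORMANCE|SILENT_DATA_CORRUPTION|CHIP_ERROR|BEHAVIOR_UNSPECIFIED) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the command line for VM $1 with fault behavior $2 and description $3.
# The description is inserted verbatim, so embedded double quotes would need
# escaping before the printed line is pasted into a shell.
build_report_command() {
  local vm="$1" behavior="$2" description="$3"
  if ! is_valid_fault_behavior "$behavior"; then
    echo "unknown fault behavior: $behavior" >&2
    return 1
  fi
  printf 'gcloud compute instances report-host-as-faulty %s --fault-behavior=%s --description="%s"\n' \
    "$vm" "$behavior" "$description"
}
```

For example, build_report_command example-tpu-vm CHIP_ERROR "accelerator unresponsive" prints a command line that can be inspected before it is run, while an unlisted value is rejected with a non-zero exit status.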

The report faulty host operation usually takes 10-12 minutes to complete. Once the report operation completes, host repair starts within a minute if the disruption policy is set to IMMEDIATE. If the disruption policy is set to FUTURE, no repair action is taken immediately; instead, Compute Engine schedules a repair if any fault is detected in the future. Once the repair is initiated, the VM is powered off. The VM might stay in the pending state until the host is repaired. Repairing the faulty host can take 3-14 days or longer. In All Capacity mode, TPU VMs aren't migrated to another host during the repair. Instead, they are restarted on the same host if there is sufficient capacity.
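
Because the VM is powered off during the repair and restarted on the same host afterward, it can be useful to check periodically whether it is running again. The following is a rough sketch under stated assumptions: the function names get_vm_status and wait_until_running, the default interval of 600 seconds, and the default of 6 attempts are all hypothetical choices. It relies on gcloud compute instances describe with --format='value(status)' and only tests for the RUNNING status; it does not try to interpret any other state.

```shell
# Hypothetical helper: prints the current status of VM $1 in zone $2.
get_vm_status() {
  gcloud compute instances describe "$1" --zone="$2" --format='value(status)'
}

# Polls the status of VM $1 in zone $2 every $3 seconds (default 600),
# up to $4 attempts (default 6). Returns 0 once the VM reports RUNNING,
# or 1 if it never does within the allowed attempts.
wait_until_running() {
  local vm="$1" zone="$2" interval="${3:-600}" max_attempts="${4:-6}"
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$(get_vm_status "$vm" "$zone")"
    if [ "$status" = "RUNNING" ]; then
      echo "$vm is RUNNING"
      return 0
    fi
    echo "attempt $attempt: $vm status is ${status:-unknown}" >&2
    attempt=$((attempt + 1))
    # Sleep only if another attempt will follow.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$interval"
    fi
  done
  return 1
}
```

Since a repair can take days, a loop like this is better suited to a short check run from a scheduled job than to a single long-lived shell session.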