@veralauee

Converting a quantization-aware trained (QAT) model from TF to ONNX has several issues:

  1. QuantizeLinear and DequantizeLinear are fused into the conv layer, but the downstream compiler (e.g., TensorRT) needs the Q/DQ layers to decide whether to run the layer in int8. See issue "QDQ node for weight tensor of Con2D undergoes Constant folding (enabled for node using tf type=FakeQuantWithMinMaxVarsPerChannel)" #1972. We need to keep the Q/DQ layers unfused. QuantizeLinear and DequantizeLinear correspond to FakeQuantWithMinMaxVars in TensorFlow, so excluding that op from can_fold in tf_utils.py solves it (see the sketch after this list).
  2. We need to allow narrow_range in quantized nodes. TensorRT maps [min, max] to [-127, 127] (see Page 12), which requires 0 in fp32 to map exactly to 0 in int8. Also see narrow_range=True in TensorRT/tools/tensorflow-quantization here. A small numeric illustration follows the list.
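
For point 1, here is a minimal, self-contained sketch of the kind of guard this suggests for the can_fold check in tf_utils.py. The `Node` structure and `_QUANT_OPS_TO_KEEP` set below are illustrative assumptions, not the actual tf2onnx internals, which operate on TensorFlow NodeDef objects:

```python
# Sketch: exclude FakeQuant ops from constant folding so explicit
# QuantizeLinear/DequantizeLinear nodes survive for TensorRT.
from collections import namedtuple

# Illustrative stand-in for a TF graph node (real code uses NodeDef).
Node = namedtuple("Node", ["name", "type"])

# Ops that must not be folded into Conv weights.
_QUANT_OPS_TO_KEEP = {
    "FakeQuantWithMinMaxVars",
    "FakeQuantWithMinMaxVarsPerChannel",
}

def can_fold(node):
    """Return False for quantization ops so they are kept as Q/DQ layers."""
    if node.type in _QUANT_OPS_TO_KEEP:
        return False
    # ... the remaining (original) foldability checks would go here ...
    return True

if __name__ == "__main__":
    print(can_fold(Node("conv1/weight_quant", "FakeQuantWithMinMaxVarsPerChannel")))  # False
    print(can_fold(Node("conv1/weights", "Const")))  # True
```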
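For point 2, the snippet below is a numeric illustration (not tf2onnx code) of why narrow_range matters: symmetric quantization over [-127, 127] keeps the zero point at exactly 0, so fp32 0.0 maps to int8 0, which is what TensorRT expects. Function names here are made up for the example:

```python
# Symmetric (narrow-range) int8 quantization: [-amax, amax] -> [-127, 127], zero point 0.
import numpy as np

def quantize_narrow_range(x, amax):
    """Quantize fp32 values to int8 with a symmetric range and zero point 0."""
    scale = amax / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
    q, scale = quantize_narrow_range(x, amax=1.0)
    print(q)                     # [-127  -64    0   64  127] -> fp32 0.0 maps to int8 0
    print(dequantize(q, scale))  # values recovered up to quantization error
```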