This folder contains the code to perform Half-Quadratic Quantization (HQQ) presented in our articles:
HQQ is a fast and accurate model quantizer that skips the need for calibration data. It's super simple to implement (just a few lines of code for the optimizer). It can crunch through quantizing the Llama2-70B model in only 4 minutes! ๐
First, make sure you have a Pytorch 2 version that matches your CUDA version: https://pytorch.org/
You can install hqq via pip install hqq.
To get the latest version, you can install the core library directly via pip install git+https://github.com/mobiusml/hqq.git.
Alternatively, clone the repo and run pip install . from this current folder.
To perform quantization with HQQ, you simply need to replace the linear layers ( torch.nn.Linear) as follows:
from hqq.core.quantize import * #Quantization settings quant_config = BaseQuantizeConfig(nbits=4, group_size=64) #Replace your linear layer hqq_layer = HQQLinear(your_linear_layer, #torch.nn.Linear or None quant_config=quant_config, #quantization configuration compute_dtype=torch.float16, #compute dtype device='cuda', #cuda device initialize=True, #Use False to quantize later del_orig=True #if True, delete the original layer )The quantization parameters are set as follows:
nbits(int): supports 8, 4, 3, 2, 1 bits.group_size(int): no restrictions as long asweight.numel()is divisible by thegroup_size.quant_zero(bool): if True, it quantizes the zero-point to 8-bit without grouping.quant_scale(bool): if True, it quantizes the scaling factor to 8-bit with a group_size of 128.offload_meta(bool): if True, meta-data is offloaded to the CPU.view_as_float(bool): if True, the quantized parameter is viewed as float instead of a int type.
Setting offload_meta=True drastically decreases the GPU memory requirements but makes processing slightly slower for smaller group-sizes. With this setting, you can run Llama2-70B and Mixtral with HQQ 2-bit using only 18.8GB and 13GB VRAM respectively!
You can try to change the backend which could speed-up the runtime:
HQQLinear.set_backend(HQQBackend.PYTORCH) #Pytorch backend HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE) #Compiled Pytorch via dynamo HQQLinear.set_backend(HQQBackend.ATEN) #C++ Aten/CUDA backend (set automatically by default if available)The HQQBackend.ATEN backend is automatically installed and used by default when available.
Below you can find the speed-up benchmark with various backends, HQQBackend.PYTORCH being the baseline:
- Llama (Hugging Face + VLLM) ๐ฆ
- Mistral (Hugging Face)
- Mixtral-8x7B (Hugging Face)
- Phi + Phi_opt (Hugging Face)
- ViT-CLIP (timm) ๐ผ๏ธ
- Hugging Face
First, make sure you have your Hugging Face token properly set via:
huggingface-cli login --token <your-token> You can quantize a Hugging Face model as follows:
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer #Model and setttings model_id = 'meta-llama/Llama-2-7b-chat-hf' compute_dtype = torch.float16 device = 'cuda:0' #Load model on the CPU ###################### model = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype) tokenizer = AutoTokenizer.from_pretrained(model_id) #Quantize the model ###################### from hqq.core.quantize import * quant_config = BaseQuantizeConfig(nbits=4, group_size=64) model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=device) You can save/load a quantized model as follows:
#Save the quantized model model.save_quantized(model, save_dir=save_dir) #Load from local directory or Hugging Face Hub on a specific device model = HQQModelForCausalLM.from_quantized(save_dir_or_hfhub, device='cuda')For multimodal models, you can quantize the models separately. Here's an example that quantizes the Llama language model in Llava:
#Load the model on CPU import transformers model_id = "llava-hf/llava-1.5-13b-hf" compute_dtype = torch.float16 device = 'cuda:0' processor = transformers.AutoProcessor.from_pretrained(model_id) model = transformers.LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=compute_dtype) #Quantize and offload to GPU from hqq.core.quantize import * from hqq.models.hf.llama import LlamaHQQ quant_config = BaseQuantizeConfig(nbits=4, group_size=64) LlamaHQQ.quantize_model(model.language_model, quant_config=quant_config, compute_dtype=compute_dtype, device=device) #Use fp16 CLIP and tower model.vision_tower = model.vision_tower.to(device=device, dtype=compute_dtype) model.multi_modal_projector = model.multi_modal_projector.to(device=device, dtype=compute_dtype) model = model.eval(); #Optimize/compile (Optional) model.vision_tower = torch.compile(model.vision_tower) model.multi_modal_projector = torch.compile(model.multi_modal_projector)If the model architecture is not manally defined in hqq/models/hf, you can try the automatic mode that doesn't require knowing the architecture in advance:
from hqq.models.hf.base import AutoHQQHFModel #Quantize AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device) #Save AutoHQQHFModel.save_quantized(model, save_dir) #Load model = AutoHQQHFModel.from_quantized(save_dir)By default, VLLM is not installed to avoid CUDA version problems. Make sure you install the right version that matches your CUDA settings (vllm <= 0.2.2): https://docs.vllm.ai/en/latest/getting_started/installation.html
After installation, you can quantize VLLM models as follows:
from hqq.engine.vllm import HQQLLM model_id = 'meta-llama/Llama-2-7b-chat-hf' #Loads the model (on CPU) ###################### model = HQQLLM(model=model_id) #Quantize the model and dispatch on GPU ###################### from hqq.core.quantize import * quant_config = BaseQuantizeConfig(nbits=4, group_size=64) model.quantize_model(quant_config=quant_config)Additionally, you can use the quantized model in Langchain (requires pip install langchain) as follows:
from hqq.engine.vllm import LangchainVLLM llm = LangchainVLLM(max_new_tokens=1000, top_p=0.90, temperature=0.6).set(model) print(llm("Who is Elon Musk?"))You can save/load a quantized model as follows:
#Save the quantized model model.save_quantized(model, save_dir=save_dir) #Load from local directory or Hugging Face Hub model = HQQLLM.from_quantized(save_dir_or_hfhub)Notes:
- Support is broken since post 0.2.2 update.
- The VLLM backend only works with a single GPU for now.
- Only VLLM models created via
save_quantizedcan be loaded withHQQLLM.from_quantized.
Timm backend is also supported. Here's how you use it:
model_id = 'vit_large_patch14_clip_224.laion2b' #Load model on the CPU ###################### from hqq.engine.timm import HQQtimm model = HQQtimm.create_model(model_id, pretrained=True) #Quantize the model ###################### from hqq.core.quantize import * quant_config = BaseQuantizeConfig(nbits=4, group_size=64) model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16)You can save/load the quantized models as follows:
#Save the quantized model model.save_quantized(model, save_dir=save_dir) #Load from local directory or Hugging Face Hub model = HQQtimm.from_quantized(save_dir_or_hfhub)If you want to quantize your own model architecture, you need to write a patching logic that goes through all the linear layers and replaces them with HQQLinear. You can follow the examples provided in hqq/models.
You can specify different quantization configs for different layers by feeding a dictionary in the form linear_tag: BaseQuantizeConfig(), The following example uses 4-bit for self_attn.v_proj and 2-bit for the rest of the layers:
from hqq.core.quantize import * q2_config = BaseQuantizeConfig(nbits=2, group_size=16) #2-bit config q4_config = BaseQuantizeConfig(nbits=4, group_size=64) #4-bit config linear_tags = HQQModelForCausalLM.get_linear_tags(model) #List of tags for the linear layers of the model quant_config = {k: q2_config for k in linear_tags} quant_config['self_attn.v_proj'] = q4_configYou can use HQQ for LoRA training as follows:
#First, quantize/load a quantized HQQ model the from hqq.core.peft import PeftUtils base_lora_params = {'lora_type':'default', 'r':32, 'lora_alpha':64, 'dropout':0.05, 'train_dtype':torch.float32} lora_params = {'self_attn.q_proj': base_lora_params, 'self_attn.k_proj': base_lora_params, 'self_attn.v_proj': base_lora_params, 'self_attn.o_proj': base_lora_params, 'mlp.gate_proj' : None, 'mlp.up_proj' : None, 'mlp.down_proj' : None} #Add LoRA to linear/HQQ modules PeftUtils.add_lora(model, lora_params) #Optional: faster but might not work on older GPUs HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP) #Train .... #Convert LoRA weights to the same model dtype for faster inference model.eval() PeftUtils.cast_lora_weights(model, dtype=torch.float16) #Save LoRA weights PeftUtils.save_lora_weights(model, filename) #Load LoRA weights: automatically calls add_lora PeftUtils.load_lora_weights(model, filename)We provide a complete example to train a model with HQQ/LoRA that you can find in examples/lora/train_hqq_lora_example.py.
If you want to use muti-gpu training via FSDP, check out this awesome repo by Answer.AI: https://github.com/AnswerDotAI/fsdp_qlora
We provide a variety of examples demonstrating model quantization across different backends within the examples directory.
In the examples/llama2_benchmarkdirectory, you'll find code to replicate our Llama2 benchmark. By default, this benchmark quantizes the Llama2-7B model with 4-bit precision and provides perplexity metrics on wikitext-2.
To execute the benchmark, ensure you have the datasets package installed by running pip install datasets. Additionally, for the GPTQ and AWQ demos, you'll need to install the following packages: pip install auto-gptq[triton]==0.4.2 autoawq==0.1.4 triton==2.0.0
After installation, configure your Hugging Face ๐ค token either through the command line or within the demo files, and you're all set!
@misc{badri2023hqq, title = {Half-Quadratic Quantization of Large Machine Learning Models}, url = {https://mobiusml.github.io/hqq_blog/}, author = {Hicham Badri and Appu Shaji}, month = {November}, year = {2023} 
