A PyTorch-based implementation of GPTQ (Generative Pre-trained Transformer Quantization) with distributed computing support. This package enables efficient quantization of large language models across multiple GPUs and even multiple nodes.
- Single GPU Support: Quantize models on a single GPU efficiently
- Multi-GPU Support: Data parallel quantization across multiple GPUs on a single node
- Distributed Support: Scale across multiple nodes (4-8+ GPUs)
- Flexible Quantization: Support for 2, 3, 4, and 8-bit quantization
- Model Agnostic: Works with various transformer architectures
- Memory Efficient: Optimized memory usage with gradient checkpointing
- Easy Integration: Simple API and CLI interface
- Checkpointing: Save and resume quantization progress
Install from PyPI:

```bash
pip install distributed-gptq
```

Or install from source:

```bash
git clone https://github.com/yourusername/distributed-gptq
cd distributed-gptq
pip install -e .
```

With CUDA extras:

```bash
pip install distributed-gptq[cuda]
```

Quick start on a single GPU:

```python
from distributed_gptq import quantize_model_simple
from transformers import AutoModelForCausalLM

# Load your model
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Prepare calibration data (list of input tensors)
calibration_data = [...]  # Your calibration dataset

# Quantize to 4 bits
quantized_model = quantize_model_simple(
    model,
    calibration_data,
    bits=4,
    save_path="quantized_model.pt"
)
```

For finer control, use the quantizer class directly:

```python
from distributed_gptq import DistributedGPTQuantizer, QuantizationConfig

# Configure quantization
config = QuantizationConfig(
    bits=4,
    group_size=128,
    calibration_samples=128
)

# Create quantizer
quantizer = DistributedGPTQuantizer(config)

# Quantize model
quantized_model = quantizer.quantize_model(
    model,
    calibration_data,
    save_path="quantized_model.safetensors"
)
```
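The examples above leave `calibration_data` as a placeholder. One way to build it, assuming the quantizer accepts a list of tokenized input-ID tensors (as the comment in the quick-start example suggests):

```python
# A minimal sketch of preparing calibration data, assuming the quantizer
# accepts a list of input-ID tensors as shown in the quick-start example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Quantization reduces model size with little accuracy loss.",
    # ... in practice, use on the order of 128 samples from your target domain
]

calibration_data = [
    tokenizer(t, return_tensors="pt", truncation=True, max_length=512).input_ids
    for t in texts
]
```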
Launch on multiple nodes using torchrun:

```bash
# Node 0 (master)
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
    --master_addr=<MASTER_IP> --master_port=29500 \
    your_script.py --distributed

# Node 1
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 \
    --master_addr=<MASTER_IP> --master_port=29500 \
    your_script.py --distributed
```

The CLI covers quantization, benchmarking, and format conversion:

```bash
# Quantize a model
distributed-gptq quantize facebook/opt-1.3b \
    -o quantized_model.safetensors \
    -b 4 \
    --calibration-samples 128

# Benchmark the quantized model against the original
distributed-gptq benchmark \
    original_model.pt \
    quantized_model.safetensors \
    --test-samples 100

# Convert between checkpoint formats
distributed-gptq convert \
    model.pt \
    model.safetensors \
    --input-format pytorch \
    --output-format safetensors
```

Advanced configuration exposes the full set of quantization and distribution options:

```python
from distributed_gptq import (
    DistributedGPTQuantizer,
    QuantizationConfig,
    DistributedConfig,
    QuantizationMode
)

# Advanced quantization config
quant_config = QuantizationConfig(
    bits=4,                    # Quantization bits
    group_size=128,            # Group size for quantization
    actorder=False,            # Use activation order
    percdamp=0.01,             # Percentage dampening
    blocksize=128,             # Block size for quantization
    calibration_samples=256,   # Number of calibration samples
    use_triton=True,           # Use Triton kernels (if available)
)

# Distributed config
dist_config = DistributedConfig(
    mode=QuantizationMode.HYBRID_PARALLEL,  # Hybrid parallelism
    world_size=8,                           # Total GPUs
    backend="nccl",                         # Communication backend
)

# Initialize quantizer
quantizer = DistributedGPTQuantizer(
    quantization_config=quant_config,
    distributed_config=dist_config
)
```
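For reference, a minimal `your_script.py` of the kind launched by the torchrun commands above might look like the following sketch; the `--distributed` flag handling is an assumption, not a documented entry point:

```python
# your_script.py -- a hypothetical sketch of a torchrun-launched entry point.
# The --distributed flag and its handling are assumptions, not documented API.
import argparse

from transformers import AutoModelForCausalLM

from distributed_gptq import DistributedGPTQuantizer, QuantizationConfig


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--distributed", action="store_true",
                        help="Assumed flag enabling multi-node coordination")
    args = parser.parse_args()

    model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
    calibration_data = [...]  # Your calibration dataset (see the note above)

    quantizer = DistributedGPTQuantizer(QuantizationConfig(bits=4))
    quantizer.quantize_model(
        model,
        calibration_data,
        save_path="quantized_model.safetensors",
    )


if __name__ == "__main__":
    main()
```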
The package is organized into modular components:

- Core: GPTQ algorithm implementation
- Distributed: Multi-GPU/node coordination
- Models: Model-specific adaptations
- Utils: Helper utilities
- CLI: Command-line interface
- Data Parallel: Split calibration data across GPUs (see the sketch after this list)
- Model Parallel: Split model layers across GPUs
- Hybrid Parallel: Combination of both strategies
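As a conceptual illustration of the data-parallel strategy (not the package's exact internals): each rank can own a disjoint shard of the calibration set, and the per-layer Hessian statistics GPTQ accumulates can then be averaged across ranks.

```python
# Conceptual sketch of data-parallel calibration, assuming torch.distributed
# is already initialized (e.g., by torchrun). Not the package's exact internals.
import torch
import torch.distributed as dist


def shard_calibration(calibration_data, rank=None, world_size=None):
    """Give each rank a disjoint, interleaved shard of the calibration set."""
    rank = dist.get_rank() if rank is None else rank
    world_size = dist.get_world_size() if world_size is None else world_size
    return calibration_data[rank::world_size]


def sync_hessian(H: torch.Tensor) -> torch.Tensor:
    """Average a per-layer Hessian accumulator (sums of x xᵀ) across ranks."""
    dist.all_reduce(H, op=dist.ReduceOp.SUM)
    H /= dist.get_world_size()
    return H
```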
| Model Size | GPUs | Quantization Time | Memory Usage |
|---|---|---|---|
| 125M | 1 | ~2 min | 4 GB |
| 1.3B | 4 | ~5 min | 8 GB/GPU |
| 6.7B | 8 | ~15 min | 16 GB/GPU |
| 13B | 8 | ~30 min | 24 GB/GPU |
We welcome contributions! Please see our Contributing Guide for details.
```bash
# Clone the repo
git clone https://github.com/yourusername/distributed-gptq
cd distributed-gptq

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black distributed_gptq/
isort distributed_gptq/
```

If you use this package in your research, please cite:
```bibtex
@software{distributed-gptq,
  title  = {Distributed GPTQ: Efficient Quantization for Large Language Models},
  author = {Your Name},
  year   = {2024},
  url    = {https://github.com/yourusername/distributed-gptq}
}
```

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
This implementation is based on the original GPTQ paper:

- Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323.
- CUDA Out of Memory: Reduce batch size, group size, or use more GPUs (see the memory-lean config sketch after this list). 72B models typically require 8+ GPUs with 40GB+ VRAM each.
- Distributed Timeout: Ensure all nodes have network connectivity and matching CUDA/NCCL versions.
- Model Loading: For very large models, ensure sufficient CPU RAM and disk space.
- CLI Version: Check your install with `distributed-gptq --version` or get help with `distributed-gptq --help`.
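For the out-of-memory case above, a memory-leaner configuration might look like the following sketch; it reuses the `QuantizationConfig` fields shown earlier, and the specific values are illustrative assumptions:

```python
# A memory-leaner setup for OOM situations; the values are illustrative
# assumptions, using the QuantizationConfig fields documented above.
from distributed_gptq import QuantizationConfig

low_mem_config = QuantizationConfig(
    bits=4,
    group_size=64,           # smaller groups shrink the per-layer working set
    blocksize=64,            # smaller blocks reduce peak memory per solve
    calibration_samples=64,  # fewer samples cut activation memory
)
```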
For more, see the examples/README.md and PROJECT_SUMMARY.md.