Cupcake is a compression scheduler that optimizes the scalability of communication-efficient distributed training. It provides a general framework and implementations of several popular gradient compression algorithms. Cupcake automatically schedules compression operations to optimize the performance of compression algorithms without requiring knowledge of model architectures or system parameters. It significantly reduces compression overhead while overlapping communication with computation to reduce communication overhead. We release the PyTorch version of Cupcake.
If you find our project useful in your research, please consider citing:
```bibtex
@article{wang2023cupcake,
  title={Cupcake: A Compression Scheduler for Scalable Communication-Efficient Distributed Training},
  author={Wang, Zhuang and Wu, Xinyu and Xu, Zhaozhuo and Ng, TS},
  journal={Proceedings of Machine Learning and Systems},
  volume={5},
  year={2023}
}
```

The code is built with the following libraries (see requirements.txt):
- Python >= 3.7
- PyTorch
- Horovod
- numpy
- tensorboardX
- tqdm
- openmpi
- NCCL
- CuPy (depending on your CUDA Toolkit version)
```bash
# Step 1: install PyTorch: https://pytorch.org/get-started/locally/

# Step 2: install Horovod
HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
horovodrun --check-build

# Step 3: download Cupcake
git clone https://github.com/zhuangwang93/Cupcake.git
```

The four main components of the Cupcake framework are the `Communicator`, `Compressor`, `Memory`, and `Scheduler` abstract classes.
- `Communicator` implementations define the communication primitives used by Cupcake.
- `Compressor` implementations provide the `compress` and `decompress` operations for the various gradient compression algorithms.
- `Memory` implementations provide the `partition`, `update`, `compensate`, and `reduce` methods for memory.
- `Scheduler` implementations provide the scheduler that optimizes the training speed (obsolete).
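For illustration, the sketch below shows what a top-k `Compressor` implementation could look like. The class name, method signatures, and return values here are assumptions for illustration only and may not match Cupcake's actual interfaces.

```python
# Hypothetical top-k Compressor sketch; Cupcake's actual base class and
# signatures may differ.
import torch


class TopKCompressor:
    """Keep only the k largest-magnitude gradient entries (illustrative)."""

    def __init__(self, compress_ratio=0.01):
        self.compress_ratio = compress_ratio

    def compress(self, tensor):
        # Flatten the gradient and select the top-k entries by magnitude.
        flat = tensor.flatten()
        k = max(1, int(flat.numel() * self.compress_ratio))
        _, indices = torch.topk(flat.abs(), k)
        values = flat[indices]
        # The context carries everything needed to rebuild the dense tensor.
        ctx = (flat.numel(), tensor.shape)
        return (values, indices), ctx

    def decompress(self, compressed, ctx):
        values, indices = compressed
        numel, shape = ctx
        # Scatter the kept values back into a dense zero tensor.
        flat = torch.zeros(numel, dtype=values.dtype, device=values.device)
        flat[indices] = values
        return flat.view(shape)
```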
Cupcake supports the following configurations, and we are extending it with more options.
| Configurations | Options |
|---|---|
| Communicator | Allreduce, Allgather |
| Compressors | FP32, FP16, DGC, Top-k, Rand-k, QSGD, EFSignSGD, SignSGD, TernGrad, OneBit, SigNUM |
| Memory | Residual, DGC, None |
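Among the `Memory` options, residual memory implements error feedback: the part of the gradient dropped by compression is accumulated and added back before the next compression step. The sketch below is only a conceptual illustration of that idea; the method names echo the `compensate`/`update` interface mentioned above, but Cupcake's actual signatures may differ.

```python
# Conceptual residual (error-feedback) memory; Cupcake's actual API may differ.
class ResidualMemory:
    def __init__(self):
        self.residuals = {}

    def compensate(self, tensor, name):
        """Add the error left over from the previous step back into the gradient."""
        if name in self.residuals:
            tensor = tensor + self.residuals[name]
        return tensor

    def update(self, tensor, name, compressor, compressed, ctx):
        """Store what the compressor dropped so it can be compensated next step."""
        decompressed = compressor.decompress(compressed, ctx)
        self.residuals[name] = tensor - decompressed
```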
The primary benchmark is provided in `compress_benchmark.py`.
- For example, the following command runs the benchmark on 4 GPUs with efsignsgd as the compression algorithm, allgather as the communication primitive, and residual as the memory:
```bash
horovodrun -np 4 python compress_benchmark.py --compress --compressor efsignsgd --comm allgather --memory residual --fusion-num 1 --model=resnet50
```

- `--fusion-num` specifies the number of partition groups; `--fusion-num 0` enables layer-wise compression (see the sketch below for what partitioning into fusion groups means).
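To make the notion of partition groups concrete, the following sketch shows one simple way a list of gradient tensors could be split into a fixed number of fusion groups before compression. It is a conceptual illustration only, not Cupcake's actual scheduling logic; the helper name is made up, and PyTorch tensors are assumed as inputs.

```python
# Conceptual illustration of fusion groups; not Cupcake's actual scheduler.
def partition_into_fusion_groups(tensors, fusion_num):
    """Split gradient tensors into `fusion_num` groups that are compressed and
    communicated together. fusion_num == 0 means layer-wise compression,
    i.e. every tensor forms its own group."""
    if fusion_num == 0:
        return [[t] for t in tensors]
    groups = [[] for _ in range(fusion_num)]
    sizes = [0] * fusion_num
    # Greedily assign the largest tensors first to balance group sizes.
    for t in sorted(tensors, key=lambda t: t.numel(), reverse=True):
        i = sizes.index(min(sizes))
        groups[i].append(t)
        sizes[i] += t.numel()
    return groups
```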
Real examples are provided under the `examples` folder. We provide two kinds of benchmarks for the evaluation: ResNet50 and ResNet101 trained on CIFAR10 and on ImageNet.
For the CIFAR10 benchmark, the model and dataset are downloaded automatically; no additional setup is required.
For the ImageNet benchmark, you need to download the ImageNet dataset yourself before running the evaluation.
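If helpful, the snippet below shows a standard way to load a local ImageNet copy with `torchvision`, assuming the usual layout with one subdirectory per class under `train/` and `val/`. The path is a placeholder, and the actual data-loading code in the examples folder may differ.

```python
# Assumes ImageNet was extracted into the standard layout:
#   /path/to/imagenet/train/<class>/*.JPEG
#   /path/to/imagenet/val/<class>/*.JPEG
# The path is a placeholder.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_dataset = datasets.ImageFolder("/path/to/imagenet/train", transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64,
                                           shuffle=True, num_workers=4)
```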