Instructions

Notice

Make sure Git LFS is installed and initialized. The CIFAR-10 dataset is now included in the distributed-training repo via Git LFS.
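A quick pre-flight check can confirm that the Git LFS binary is on the PATH before cloning or pulling. This is a minimal sketch and not part of the repo: the helper name `git_lfs_available` is hypothetical, and it only detects the `git-lfs` executable. It does not verify that `git lfs install` has been run.

```python
import shutil


def git_lfs_available() -> bool:
    """Return True if the `git-lfs` executable is found on PATH.

    Hypothetical helper for illustration; it does not check whether
    `git lfs install` has already been run for the current user.
    """
    return shutil.which("git-lfs") is not None


if __name__ == "__main__":
    if git_lfs_available():
        print("git-lfs found")
    else:
        print("git-lfs not found; install it, then run `git lfs install`")
```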

ImageNet

Make sure you have the ImageNet dataset at ~/data. Here is a possible tree structure of ~/data:

    /home/ubuntu/data
    ├── cifar-10-batches-py
    │   ├── batches.meta
    │   ├── data_batch_1
    │   ├── data_batch_2
    │   ├── data_batch_3
    │   ├── data_batch_4
    │   ├── data_batch_5
    │   ├── readme.html
    │   └── test_batch
    ├── cifar-10-python.tar.gz
    └── imagenet
        ├── bounding_boxes
        ├── idxar_map.p
        ├── idxar_map_192.p
        ├── idxar_map_64.p
        ├── imagenet_2012_bounding_boxes.csv
        ├── sorted_idxar.p
        ├── train
        ├── trn_file2size.p
        ├── val_file2size.p
        └── validation
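To confirm the layout before a long run, a small check along these lines may help. It is a sketch under assumptions: the helper `missing_entries` is hypothetical, and the `EXPECTED` list is only a subset picked from the tree above. Which entries the training scripts actually require is not stated here, so adjust the list to your setup.

```python
from pathlib import Path

# Entries picked from the example tree; which of them the training
# scripts really need is an assumption, so edit this list as required.
EXPECTED = [
    "cifar-10-batches-py",
    "imagenet/train",
    "imagenet/validation",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root).expanduser()
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("~/data")
    if missing:
        print("Missing under ~/data:", ", ".join(missing))
    else:
        print("Dataset layout looks complete")
```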

Modify configurations in the `training-configs` folder

The main change is adding the server IPs. The following file is training-configs/cifar10-resnet50-2p3dn/2-p3dn-resnet50-cifar10-40G.json. Change the "nodes" field in the config file, using the EC2 instances' private IPs. For example, if you have two instances, 172.31.31.15 and 172.31.29.187, and 172.31.29.187 is the local host where the scripts are placed, set "nodes" to ["localhost", "172.31.31.15"].

    {
        "comments": "unlimited bandwidth",
        "host_user": "ubuntu",
        "host_user_dir": "/home/ubuntu",
        "host_ssh_key": "~/.ssh/id_rsa",
        "docker_user_dir": "/home/cluster",
        "docker_user": "cluster",
        "docker_ssh_port": 2022,
        "docker_ssh_key": "./DockerEnv/ssh-keys/id_rsa",
        "script_path": "~/distributed-training/test_scripts/pytorch_resnet50_cifar10.py",
        "script_args": "--epochs 20",
        "nodes": ["localhost", ""],
        "nGPU": 8,
        "eth": "ens5",
        "bw_limit": "40Gbit",
        "default_bw": "100Gbit",
        "log_folder": "p3dn-ResNet50-CIFAR10"
    }
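The "nodes" edit described above can also be scripted. The following is a minimal sketch, not the repo's own tooling: `set_nodes` is a hypothetical helper, and rejecting an empty peer list is an assumption about what a multi-node run needs. It keeps "localhost" as the first entry, matching the example, and leaves every other field untouched.

```python
import copy
import json


def set_nodes(config, peer_ips):
    """Return a copy of config whose "nodes" list is localhost plus peer_ips."""
    # Drop blank entries such as the "" placeholder in the template config.
    peers = [ip.strip() for ip in peer_ips if ip.strip()]
    if not peers:
        # Assumption: a multi-node run needs at least one remote peer.
        raise ValueError("at least one peer IP is required")
    updated = copy.deepcopy(config)
    updated["nodes"] = ["localhost", *peers]
    return updated


if __name__ == "__main__":
    template = {"nodes": ["localhost", ""], "nGPU": 8, "eth": "ens5"}
    print(json.dumps(set_nodes(template, ["172.31.31.15"]), indent=2))
```

Returning a copy rather than mutating the input makes it easy to keep the template config unchanged and write the result to a new file.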

Run script

single node

python3 batch_run_st.py

multi-nodes

python3 docker_dt.py <config-file>
# e.g. python3 docker_dt.py training-configs/cifar10-resnet50-2p3dn/2-p3dn-resnet50-cifar10-40G.json
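Before launching a multi-node run, it can be worth checking the config for obvious gaps, such as a "nodes" entry left as an empty string. The sketch below is illustrative only: `config_problems` and `REQUIRED_KEYS` are hypothetical names, and whether `docker_dt.py` treats these fields as mandatory is an assumption based on the example config.

```python
# Fields taken from the example config in this README; whether docker_dt.py
# treats all of them as mandatory is an assumption.
REQUIRED_KEYS = ("script_path", "nodes", "nGPU", "eth", "log_folder")


def config_problems(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = [f'missing key "{k}"' for k in REQUIRED_KEYS if k not in config]
    if "nodes" in config:
        nodes = config["nodes"]
        if not isinstance(nodes, list) or not nodes:
            problems.append('"nodes" must be a non-empty list')
        else:
            # Flag placeholders such as "" that were never filled in.
            for i, node in enumerate(nodes):
                if not isinstance(node, str) or not node.strip():
                    problems.append(f'"nodes"[{i}] is empty')
    return problems


if __name__ == "__main__":
    example = {
        "script_path": "~/distributed-training/test_scripts/pytorch_resnet50_cifar10.py",
        "nodes": ["localhost", ""],
        "nGPU": 8,
        "eth": "ens5",
        "log_folder": "p3dn-ResNet50-CIFAR10",
    }
    for problem in config_problems(example):
        print(problem)
```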

mimic distributed training scripts

python3 docker_mt.py <config-file> <debug-flag>
# e.g. python3 docker_mt.py training-configs/mimic_config_template.json

Note: logs are saved to chaokun_logs/<sub-dir>, so the log folder must be specified.

Sample outputs of `docker_dt.py`

The sample outputs are located in example-script-output.

Other logs

Program logs will be saved into log_archives
