alphastar-avi/computeShare
ComputeShare: Distributed Mac Training

ComputeShare is a lightweight federated machine learning system built for Apple Silicon. It enables users to distribute PyTorch training tasks across multiple Macs over the internet, utilizing native MPS (Metal Performance Shaders) acceleration to split the workload and drastically reduce training time.

Architecture

The system consists of two primary components:

  1. Parameter Server (server.py): The central node that holds the global PyTorch model. It asynchronously waits for workers to submit their computed gradients, decompresses each payload, averages the gradients across workers, and applies an SGD update to the global weights.
  2. Workers (worker.py): Distributed clients that pull the latest model from the server, process their own shard of the training dataset on the local GPU via MPS, and submit the computed gradients back to the server.
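
The server's averaging step can be sketched as follows (the function name, gradient layout, and learning rate here are illustrative assumptions, not the actual server.py code):

```python
import torch

def average_and_apply(model, worker_grads, lr=0.01):
    """Average per-parameter gradients from all workers, then take one SGD step.

    worker_grads: one entry per worker, each a list of tensors aligned
    with model.parameters(). Names and layout are assumptions.
    """
    with torch.no_grad():
        for i, param in enumerate(model.parameters()):
            # Mean of the i-th gradient tensor across all workers
            avg = torch.stack([grads[i] for grads in worker_grads]).mean(dim=0)
            param -= lr * avg  # vanilla SGD update on the global weights
```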

To drastically reduce network overhead (and avoid, e.g., Ngrok request blocks), gradients are not serialized as JSON. Instead, each worker serializes its PyTorch tensors into an io.BytesIO() buffer, compresses the bytes locally with gzip, and transmits the binary payload (application/octet-stream) to the parameter server over HTTP.
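
The serialize-then-compress round trip described above can be sketched like this (function names are illustrative, assuming torch.save is the serializer; the actual worker.py may differ):

```python
import gzip
import io
import torch

def pack_gradients(tensors):
    """Worker side: serialize a list of tensors into an in-memory
    buffer with torch.save, then gzip the raw bytes for transmission."""
    buf = io.BytesIO()
    torch.save(tensors, buf)
    return gzip.compress(buf.getvalue())

def unpack_gradients(payload):
    """Server side: gunzip the payload and load the tensors back."""
    return torch.load(io.BytesIO(gzip.decompress(payload)))
```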

All external HTTP communication is secured with a mandatory 4-digit PIN header.
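
A server-side check of that PIN header might look like the following (the header name X-Pin is an assumption; consult server.py for the real one):

```python
import hmac

def pin_ok(request_headers, expected_pin):
    """Reject any request whose PIN header does not match.

    compare_digest gives a constant-time comparison, which avoids
    leaking information about the 4-digit PIN via response timing.
    """
    return hmac.compare_digest(request_headers.get("X-Pin", ""), expected_pin)
```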

Usage Guide

1. Initialize the Parameter Server

Designate one machine to act as the parameter server.

python server.py --pin 1234

You will be prompted to define:

  • WORLD_SIZE: Total number of distributed workers.
  • TOTAL_GLOBAL_BATCHES: The number of gradient averaging rounds before the model is saved.

Note: For external connections, expose port 8000 using LocalTunnel: npx -y localtunnel --port 8000

2. Connect the Workers

On any machine participating in the training, execute the worker script:

python worker.py --pin 1234

You will be prompted for:

  • WORLD_SIZE: Must match the server configuration.
  • RANK: The worker's unique ID (0 to WORLD_SIZE - 1).
  • BATCH_SIZE: Number of images to process per forward pass.
  • TOTAL_GLOBAL_BATCHES: Must match the server configuration.

Note: Edit SERVER_URL on line 30 of worker.py if connecting via LocalTunnel.
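
One common way to turn RANK and WORLD_SIZE into a disjoint data shard per worker is PyTorch's DistributedSampler; a minimal sketch (the exact mechanism in worker.py may differ):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def make_shard_loader(dataset, rank, world_size, batch_size):
    """Give each worker an evenly sized, non-overlapping slice of the dataset.

    Passing num_replicas and rank explicitly means no process group
    needs to be initialized on the worker.
    """
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```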

3. Evaluate the Model

Once the server reaches the target number of global batches, it saves the final weights to trained_model.pth. Run the testing script to evaluate accuracy on 10,000 held-out test images:

python test.py
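
The core of such an evaluation is a standard top-1 accuracy loop; a minimal version (the model architecture and test DataLoader are whatever test.py actually builds):

```python
import torch

def evaluate(model, loader, device="cpu"):
    """Return top-1 accuracy of the model over a test DataLoader."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)  # predicted class per image
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total
```

In test.py the model would first be restored with something like model.load_state_dict(torch.load("trained_model.pth")) before calling this.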
