ComputeShare is a lightweight federated machine learning training system built with PyTorch. It enables users to distribute PyTorch training tasks across multiple Macs over the internet, using native MPS (Metal Performance Shaders) acceleration to split the workload and drastically reduce training time.
The system consists of two primary components operating in a Bulk Synchronous Parallel formation:
- **Parameter Server** (`server.py`): The central node that holds the global PyTorch model. It asynchronously waits for workers to submit their computed gradients, decompresses each payload, averages the gradients, and applies an SGD update to the global weights. The server implements the Linear Scaling Rule, multiplying the learning rate by the total worker `WORLD_SIZE` so that averaged gradients do not shrink the effective step size in multi-node clusters.
- **Workers** (`worker.py`): Distributed clients that pull the latest model from the server, process a unique shard of the training dataset on their local GPU, and submit the computed gradients back to the server.
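The averaging step and the Linear Scaling Rule can be sketched with plain Python lists standing in for tensors (the names `BASE_LR` and `worker_grads` are illustrative, not taken from `server.py`):

```python
# Illustrative sketch: gradient averaging plus the Linear Scaling Rule,
# with plain lists standing in for PyTorch tensors.
WORLD_SIZE = 2              # number of workers (matches the server setting)
BASE_LR = 0.01
lr = BASE_LR * WORLD_SIZE   # Linear Scaling Rule: scale lr by worker count

weights = [1.0, -0.5]
worker_grads = [            # one gradient vector submitted per worker
    [0.2, 0.4],
    [0.0, 0.2],
]

# Element-wise average across workers, then a vanilla SGD step.
avg_grad = [sum(g) / WORLD_SIZE for g in zip(*worker_grads)]
weights = [w - lr * g for w, g in zip(weights, avg_grad)]
print(avg_grad, weights)
```

Because the averaged gradient is `1/WORLD_SIZE` the size of the summed gradient, scaling the learning rate by `WORLD_SIZE` keeps the update magnitude comparable to single-node training.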
- **Stale Gradient Rejection**: To prevent a slow worker from polluting the global weights, workers attach an `X-Worker-Version` HTTP header to their gradient submissions. If the server has already advanced to a newer global version, it immediately rejects the stale payload (HTTP 409 Conflict) and forces the worker to re-pull the latest weights and recompute.
- **Connection Continuity**: Workers use extended `requests` timeouts (15s/30s) to survive aggressive LocalTunnel connection spikes without process-ending timeout drops.
- **Compression**: To drastically decrease network overhead (e.g., preventing Ngrok blocks), gradients are not serialized into JSON. Workers serialize their PyTorch tensors into an `io.BytesIO()` buffer, compress the bytes with `gzip`, and transmit the binary payload as `application/octet-stream`.
- **Authentication**: All external HTTP communication is secured with a mandatory 4-digit PIN header.
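The serialize-then-compress round trip can be illustrated with the standard library alone. Here `pickle` stands in for PyTorch's tensor serialization, and the payload dict is hypothetical:

```python
import gzip
import io
import pickle

# Hypothetical gradient payload; ComputeShare serializes real PyTorch
# tensors, but pickled lists demonstrate the same BytesIO + gzip pattern.
gradients = {"fc1.weight": [0.01] * 1000, "fc1.bias": [0.0] * 10}

# Serialize into an in-memory buffer, then gzip the raw bytes.
buffer = io.BytesIO()
pickle.dump(gradients, buffer)
raw = buffer.getvalue()
payload = gzip.compress(raw)   # sent as application/octet-stream

# Receiving side: decompress, then deserialize.
restored = pickle.loads(gzip.decompress(payload))
print(len(raw), len(payload))
```

Binary serialization plus gzip avoids both the size blow-up of JSON-encoded floats and the per-request payload limits some tunneling services impose.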
Before running the system, initialize the Python environment and install the required dependencies:
```shell
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Designate one machine to act as the parameter server. The server can be initialized interactively or via inline arguments to bypass prompts.
Interactive Initialization:
```shell
python server.py --pin 1234
```

Inline Initialization:

```shell
python server.py --pinSizEpo <PIN> <WORLD_SIZE> <TOTAL_GLOBAL_BATCHES>
# Example: python server.py --pinSizEpo 1234 2 50
```

You will be prompted to define:

- `WORLD_SIZE`: Total number of distributed workers.
- `TOTAL_GLOBAL_BATCHES`: The number of gradient-averaging rounds before the model is saved.
Note: For external connections, expose port 8000 using LocalTunnel:

```shell
npx -y localtunnel --port 8000
```
On any machine participating in the training, execute the worker script.
Interactive Initialization:
```shell
python worker.py --pin 1234
```

Inline Initialization:

```shell
python worker.py --pinSizRanBatEpo <PIN> <WORLD_SIZE> <RANK> <BATCH_SIZE> <TOTAL_GLOBAL_BATCHES>
# Example: python worker.py --pinSizRanBatEpo 1234 2 0 32 50
```

Regardless of initialization method, workers require:

- `WORLD_SIZE`: Must match the server configuration.
- `RANK`: The worker's unique ID (`0` to `WORLD_SIZE - 1`).
- `BATCH_SIZE`: Number of images to process per forward pass.
- `TOTAL_GLOBAL_BATCHES`: Must match the server configuration.
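A round-robin shard selection consistent with the `RANK`/`WORLD_SIZE` parameters might look like the following (the slicing scheme is an assumption for illustration; `worker.py` may shard differently):

```python
# Illustrative shard selection: each worker takes every WORLD_SIZE-th
# sample, offset by its RANK, so shards are disjoint and cover the data.
WORLD_SIZE = 2
RANK = 0
BATCH_SIZE = 4

dataset = list(range(20))            # stand-in for dataset indices
shard = dataset[RANK::WORLD_SIZE]    # this worker's unique slice

# Group the shard into BATCH_SIZE chunks for forward/backward passes.
batches = [shard[i:i + BATCH_SIZE] for i in range(0, len(shard), BATCH_SIZE)]
print(shard[:4], len(batches))
```

Because every index belongs to exactly one `RANK`, no two workers compute gradients on the same samples.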
Note: Edit `SERVER_URL` on line 30 of `worker.py` if connecting via LocalTunnel.
Once the server reaches the target number of global batches, it saves the final weights to `trained_model.pth`. Execute the testing script to evaluate the model's accuracy on 10,000 unseen images:

```shell
python test.py
```