SoftGrad

A lightweight, educational deep learning framework built on MLX for Apple Silicon. SoftGrad provides a clean, intuitive API for building and training neural networks while maintaining full transparency into the forward and backward pass computations.

Philosophy

SoftGrad is designed to help you understand deep learning by implementing it from scratch:

  • Explicit gradients: See exactly how backpropagation flows through each layer
  • Clean abstractions: Simple, readable code that mirrors mathematical definitions
  • Native MLX: Runs on Apple Silicon GPUs through MLX's Metal backend
  • Educational focus: Learn by building real models that actually work

Features

  • Core Layers: Linear, Conv2d, MaxPool2d, Embedding, CausalSelfAttention
  • Structural Layers: Sequential, Parallel, Residual, ProjectionResidual
  • Normalization Layers: LayerNorm, BatchNorm
  • Activations: ReLU, LeakyReLU, Softmax, and custom function support
  • Loss Functions: Cross Entropy, Binary Cross Entropy, MSELoss
  • Optimizers: SGD, AdamW, Lion
  • Checkpointing: Save and load model weights
  • MLX Interop: Use MLX models directly or load PyTorch weights

Quick Start

Installation

```bash
# Clone the repository
git clone https://github.com/AlexPetrusca/softgrad.git
cd softgrad

# Install dependencies
pip install -r requirements.txt
```

Hello World: Training a Simple Network

```python
import mlx.core as mx
from softgrad import Network
from softgrad.layer.core import Linear, Activation
from softgrad.function.activation import relu
from softgrad.optim import SGD
from softgrad.function.loss import cross_entropy_loss

# Build network
network = Network(input_shape=784)
network.add_layer(Linear(256))
network.add_layer(Activation(relu))
network.add_layer(Linear(128))
network.add_layer(Activation(relu))
network.add_layer(Linear(10))

# Setup optimizer
optimizer = SGD(eta=0.01, momentum=0.9)
optimizer.bind_loss_fn(cross_entropy_loss)
optimizer.bind_network(network)

# Training loop
# `dataloader` is not defined here: any iterable of (x_batch, y_batch) pairs
for epoch in range(10):
    for x_batch, y_batch in dataloader:
        optimizer.step(x_batch, y_batch)
```
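For intuition about what `optimizer.step` does with `eta` and `momentum`, here is a minimal pure-Python sketch of one common SGD-with-momentum rule. The function name `sgd_momentum_step` and the exact update convention are assumptions for illustration. SoftGrad's `SGD` may order or scale the terms differently.

```python
def sgd_momentum_step(weights, grads, velocity, eta=0.01, momentum=0.9):
    """Apply one in-place SGD-with-momentum update to a flat list of weights.

    Convention assumed here (one of several in common use):
        v <- momentum * v - eta * grad
        w <- w + v
    """
    for i, g in enumerate(grads):
        velocity[i] = momentum * velocity[i] - eta * g
        weights[i] += velocity[i]


# Minimize f(w) = w^2 (gradient 2w), starting from w = 1.0
weights, velocity = [1.0], [0.0]
for _ in range(200):
    grads = [2.0 * w for w in weights]
    sgd_momentum_step(weights, grads, velocity)

final_w = weights[0]  # ends up close to the minimum at 0
```

The velocity term carries information from earlier gradients into the current step, which is why `momentum=0.9` usually moves faster along consistent directions than plain SGD with the same `eta`.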

Examples

A few examples of what SoftGrad can do.

1. Image Classification with CNN

```python
from softgrad import Network
from softgrad.layer.conv import Conv2d, MaxPool2d
from softgrad.layer.core import Linear, Activation
from softgrad.layer.transform import Flatten
from softgrad.function.activation import relu

# Build a simple CNN
network = Network(input_shape=(32, 32, 3))

# Convolutional layers
network.add_layer(Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1))
network.add_layer(Activation(relu))
network.add_layer(MaxPool2d(kernel_size=2, stride=2))

network.add_layer(Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1))
network.add_layer(Activation(relu))
network.add_layer(MaxPool2d(kernel_size=2, stride=2))

# Classification head
network.add_layer(Flatten())
network.add_layer(Linear(256))
network.add_layer(Activation(relu))
network.add_layer(Linear(10))
```
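Because `Linear(256)` takes only an output size, the input size after `Flatten()` has to be inferred from the shapes that flow through the network. The sketch below traces those shapes by hand using the standard output-size formula for convolution and pooling windows. The helper `conv_out` is hypothetical, and the sketch assumes `Conv2d` uses a stride of 1 by default, which the example above does not state.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Output length along one spatial axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel_size) // stride + 1


side = 32                                          # input is 32 x 32 x 3
side = conv_out(side, kernel_size=3, padding=1)    # first Conv2d keeps 32
side = conv_out(side, kernel_size=2, stride=2)     # first MaxPool2d halves to 16
side = conv_out(side, kernel_size=3, padding=1)    # second Conv2d keeps 16
side = conv_out(side, kernel_size=2, stride=2)     # second MaxPool2d halves to 8

flat_features = side * side * 64                   # 64 channels after the second Conv2d
print(side, flat_features)                         # prints: 8 4096
```

Under these assumptions, the first `Linear` layer in the classification head sees 4096 input features.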

2. Transformer for Language Modeling (GPT)

```python
from softgrad import Network
from softgrad.function.activation import Relu
from softgrad.function.core import Concatenate, Add
from softgrad.layer.attn import CausalSelfAttention
from softgrad.layer.core import Linear, Activation, Embedding
from softgrad.layer.core import Sequential, Parallel, Residual
from softgrad.layer.norm import LayerNorm
from softgrad.layer.transform.PositionIndices import PositionIndices

# Assumes vocab_size, block_size, n_embd, n_head, and n_layer are defined at module level.


class FeedForward(Sequential):
    """Position-wise MLP with expansion and non-linearity"""

    def __init__(self, n_embd):
        super().__init__([
            Linear(4 * n_embd),
            Activation(Relu()),
            Linear(n_embd)
        ])


class MultiHeadAttention(Sequential):
    """Multiple heads of causal self-attention in parallel"""

    def __init__(self, num_heads, head_size, block_size):
        super().__init__([
            Parallel([
                CausalSelfAttention(n_embd, head_size, block_size)  # heads
                for _ in range(num_heads)
            ], Concatenate()),
            Linear(n_embd)  # projection
        ])


class TransformerBlock(Sequential):
    """Transformer block: communication followed by computation"""

    def __init__(self, n_embd, n_head):
        super().__init__([
            Residual(Sequential([
                LayerNorm(),
                MultiHeadAttention(n_head, n_embd // n_head, block_size)
            ])),
            Residual(Sequential([
                LayerNorm(),
                FeedForward(n_embd)
            ]))
        ])


network = Network(input_shape=(block_size,))

# Token and positional embeddings
network.add_layer(Parallel([
    Embedding(vocab_size, n_embd),
    Sequential([
        PositionIndices(),
        Embedding(block_size, n_embd)
    ])
], Add()))

# Transformer blocks
network.add_layer(Sequential([
    TransformerBlock(n_embd, n_head) for _ in range(n_layer)
]))

# LLM head
network.add_layer(LayerNorm())
network.add_layer(Linear(vocab_size))
```

See examples/transformer/minimal_transformer.py for a complete GPT-style transformer trained on Shakespeare.
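The "causal" part of `CausalSelfAttention` means each token may attend only to itself and to earlier positions. The pure-Python sketch below shows just that masking step on a matrix of raw attention scores. It leaves out the query, key, and value projections and the score scaling, and the function name `causal_attention_weights` is hypothetical. How SoftGrad applies the mask internally is not shown in this README.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with a causal mask.

    scores[i][j] is the raw affinity of query position i for key position j.
    Entries with j > i (the future) are excluded, so they get weight 0.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                        # keys at positions 0..i
        peak = max(visible)                           # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked = [e / total for e in exps] + [0.0] * (len(row) - i - 1)
        weights.append(masked)
    return weights


# With equal scores, each token attends uniformly to itself and its past.
attn = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```

Each row of `attn` sums to 1, and everything above the diagonal is zero, so no information flows backward from later tokens.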

3. DeepDream with VGG16

```python
from examples.deepdream import deep_dream_octaves
from load_vgg16 import load_vgg16_pretrained

# Load pretrained VGG16
vgg16 = load_vgg16_pretrained()

# Generate DeepDream
deep_dream_octaves(
    img_path="input.png",
    output_path="output.png",
    layer_names=['conv4_3', 'conv5_2'],
    octaves=4,
    n_iterations=10
)
```

See examples/deepdream/ for the complete DeepDream implementation.

Architecture

Forward and Backward Flow

Every layer implements three core methods:

```python
import mlx.core as mx


class Layer:
    def _link(self):
        """Initialize parameters based on input shape"""

    def _forward(self, x_in: mx.array) -> mx.array:
        """Compute forward pass"""

    def _backward(self, dx_out: mx.array) -> mx.array:
        """Compute backward pass (gradient w.r.t. input)"""
```
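To make the three-method contract concrete, here is a minimal sketch of a toy layer written in plain Python lists rather than `mx.array`. `ToyScale`, its `input_shape` attribute, and the private `_x_in` field are all hypothetical. It does not subclass SoftGrad's `Layer` and only mirrors the method names shown above.

```python
class ToyScale:
    """Hypothetical layer computing y[i] = g[i] * x[i] on a flat list of floats."""

    def __init__(self):
        self.params = {}
        self.input_shape = None

    def _link(self):
        # The parameter size is only known once the input shape is known.
        n = self.input_shape
        self.params["g"] = [1.0] * n
        self.params["dg"] = [0.0] * n

    def _forward(self, x_in):
        self._x_in = x_in                      # kept for the backward pass
        return [g * x for g, x in zip(self.params["g"], x_in)]

    def _backward(self, dx_out):
        # dL/dg[i] = dL/dy[i] * x[i], accumulated into the gradient slot
        for i, (d, x) in enumerate(zip(dx_out, self._x_in)):
            self.params["dg"][i] += d * x
        # dL/dx[i] = dL/dy[i] * g[i], handed to the previous layer
        return [d * g for d, g in zip(dx_out, self.params["g"])]


layer = ToyScale()
layer.input_shape = 3                          # a network would set this before linking
layer._link()
layer.params["g"] = [2.0, 3.0, 4.0]
y = layer._forward([1.0, 1.0, 2.0])
dx = layer._backward([1.0, 1.0, 1.0])
```

Deferring parameter creation to `_link` is what lets layers such as `Linear(256)` be declared with only an output size.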

Parameter Management

Parameters are stored with explicit gradient tracking:

```python
# Setting parameters
layer.params["W"] = mx.array(weights)
layer.params["b"] = mx.array(bias)

# Accessing gradients (automatic "d" prefix)
weight_grad = layer.params["dW"]
bias_grad = layer.params["db"]
```
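One way such a `"d"`-prefixed pairing could work is a small dict subclass that creates a zeroed gradient slot whenever a parameter is assigned. This is a sketch on plain lists under that assumption. `ParamStore` is a hypothetical name, and SoftGrad's real parameter container may be implemented differently.

```python
class ParamStore(dict):
    """Hypothetical parameter dict that pairs each key "X" with a gradient slot "dX".

    Simplification: any key starting with "d" is treated as a gradient, so a
    parameter literally named "decay" would be misclassified by this sketch.
    """

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        if not key.startswith("d"):
            # Create a zero gradient of the same length alongside the parameter.
            super().__setitem__("d" + key, [0.0] * len(value))

    def zero_grad(self):
        for key in self:
            if key.startswith("d"):
                # Overwriting an existing key is safe while iterating.
                super().__setitem__(key, [0.0] * len(self[key]))


params = ParamStore()
params["W"] = [0.5, -1.0]
params["dW"][0] += 0.25                        # a backward pass would do this
grad_before_reset = list(params["dW"])
params.zero_grad()
```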

Context Saving

Layers automatically save forward pass values for backward computation:

```python
def _forward(self, x_in: mx.array) -> mx.array:
    x_out = x_in @ self.params["W"] + self.params["b"]
    # Context automatically stores x_in and x_out
    return x_out

def _backward(self, dx_out: mx.array) -> mx.array:
    # Access saved values
    x_in = self.ctx.x_in
    self.params["dW"] += x_in.T @ dx_out
    return dx_out @ self.params["W"].T
```
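The two formulas in `_backward` can be checked against a finite-difference estimate. The sketch below does this in pure Python for a small bias-free linear map with a toy loss, half the sum of squared outputs, chosen so that the upstream gradient `dx_out` is simply the output itself. The helpers `matmul`, `transpose`, and `loss` are hypothetical and exist only for this check.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def loss(x, w):
    """Toy scalar loss: half the sum of squares of all entries of x @ w."""
    return 0.5 * sum(v * v for row in matmul(x, w) for v in row)


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]       # batch of 3, 2 input features
w = [[0.1, -0.2], [0.3, 0.4]]                  # 2 inputs -> 2 outputs

# Analytic gradients, following the _backward shown above.
dx_out = matmul(x, w)                          # dL/dy = y for this loss
dw = matmul(transpose(x), dx_out)              # x_in.T @ dx_out
dx_in = matmul(dx_out, transpose(w))           # dx_out @ W.T

# Central finite difference on a single weight, w[0][1].
eps = 1e-6
w_plus = [row[:] for row in w]
w_plus[0][1] += eps
w_minus = [row[:] for row in w]
w_minus[0][1] -= eps
numeric = (loss(x, w_plus) - loss(x, w_minus)) / (2 * eps)

matches = abs(dw[0][1] - numeric) < 1e-4
```

The same check works for any entry of `dw` or `dx_in`, and it is a handy habit when writing a new layer's backward pass by hand.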

Gradient Accumulation

Gradients accumulate across mini-batches:

```python
# In backward pass
self.params["dW"] += gradient  # Accumulate

# After optimizer step
layer.params.zero_grad()  # Reset for next batch
```
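Accumulation with `+=` is what makes several small backward passes equivalent to one large one. The sketch below checks this on a one-weight model with a sum-reduced squared error. The function `grad_w` and the data are made up for the illustration, and whether SoftGrad's losses sum or average over the batch is not stated here. With a mean-reduced loss, the accumulated value would need rescaling.

```python
def grad_w(xs, ys, w):
    """Gradient of sum((w * x - y)^2) over a batch, for a single scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full_batch = grad_w(xs, ys, w)

dw = 0.0                                       # plays the role of params["dW"]
for start in range(0, len(xs), 2):             # two mini-batches of size 2
    dw += grad_w(xs[start:start + 2], ys[start:start + 2], w)   # accumulate

accumulated = dw
dw = 0.0                                       # what zero_grad() does before the next step
```

Forgetting the reset is a classic bug: gradients from earlier steps leak into later updates.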

Advanced Features

Loading PyTorch Weights

```python
from softgrad.util.pytorch_loader import load_pytorch_weights_into_network

# Automatic layer mapping
network.load_from_pytorch(pytorch_model.features)
```
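One detail any such mapping has to handle is weight layout. PyTorch's `nn.Linear` stores its weight as `(out_features, in_features)`, while the `x_in @ W` convention used in this README implies `(in_features, out_features)`. The pure-Python sketch below shows the transpose involved. Both helper functions are hypothetical, and the sketch is an assumption about what a loader needs to do, not a description of SoftGrad's loader.

```python
def torch_linear_to_matmul_layout(weight):
    """Convert an (out_features, in_features) weight to (in_features, out_features)."""
    return [list(col) for col in zip(*weight)]


def linear_forward(x, w, b):
    """Compute x @ w + b for a single input row x."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]


torch_weight = [[1.0, 2.0, 3.0],               # shape (2, 3): 2 outputs, 3 inputs
                [4.0, 5.0, 6.0]]
bias = [0.5, -0.5]

w = torch_linear_to_matmul_layout(torch_weight)    # shape (3, 2)
y = linear_forward([1.0, 0.0, -1.0], w, bias)
```

Convolution weights need a similar axis permutation, since PyTorch and MLX order the channel and kernel axes differently.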

Using MLX Models Directly (MLX Interop)

```python
from mlx import nn
from softgrad import Network
from softgrad.layer.shim import MLX

# Wrap any MLX model
mlx_model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

network = Network(input_shape=784)
network.add_layer(MLX(mlx_model))
```

Contributing

Contributions welcome! Areas of interest:

  • Data augmentation utilities
  • More layer types (GroupNorm, Dropout, etc.)
  • Visualization tools
  • Performance benchmarks
  • More examples

Acknowledgments

  • Built on MLX by Apple
  • DeepDream implementation based on Google's original work
  • GPT implementations based on Andrej Karpathy's minGPT and nanoGPT

⭐ If you find this project helpful, please consider starring it!


Why SoftGrad?

Because understanding comes from building. This framework is intentionally simple, readable, and educational. Every abstraction serves a pedagogical purpose. If you want to truly understand how neural networks work under the hood, build them yourself with SoftGrad.

Happy learning! 🚀
