Isn't the decoder/PPO doing all the learning?
No, this is precisely why there are ablations. The footage you see in the video was taken using a 0-bias full linear readout decoder, meaning that the action selected is a linear function of the output spikes from the CL1; the CL1 is doing the learning. There is a noticeable difference when using the ablation (both random and 0 spikes result in zero learning) versus actual CL1 spikes.
Isn't the encoder/PPO doing all the learning?
This question largely assumes that the cells are static, which is incorrect; it is not a memory-less feed X in get Y machine. Both the policy and the cells are dynamical systems; biological neurons have an internal state (membrane potential, synaptic weights, adaptation currents). The same stimulation delivered at different points in training will produce different spike patterns, because the neurons have been conditioned by prior feedback. During testing, we froze encoder weights and still observed improvements in the reward.
How is DOOM converted to electrical signals?
We train an encoder in our PPO policy that dictates the stimulation pattern (frequency, amplitude, pulses, and even which channels to stimulate). Because the CL1 spikes are non-differentiable, the encoder is trained through PPO policy gradients using the log-likelihood trick (REINFORCE-style), i.e., by including the encoder’s sampled stimulation log-probs in the PPO objective rather than backpropagating through spikes.
- PPO:
learning_rate=3e-4,gamma=0.99,gae_lambda=0.95(many RL implementations actually use far lower gamma 0.95 and lambda 0.90 for GAE but this can severely affect training on lower levels due to the long range dependencies since you take less damage and therefore live longer),clip_epsilon=0.2,entropy_coef=0.02,steps_per_update=2048(use higher for stability but it can be really slow without parallelization of some sort, that would probably require GRPO instead of PPO),batch_size=256,num_epochs=4. - Observation & Action: Screen buffer enabled at
RES_320X240; hybrid action spaces are used (and greatly preferred) unlessuse_discrete_action_set=True. Realistically, you only flip this if all else fails to reduce entropy as it greatly reduces the movement fidelity of the agent and just doesn't look as cool. - Scenario Config:
doom_configdefaults toprogressive_deathmatch.cfg(progressive_deathmatch.wad) — similar to survival, but kills don't reset ammo count (encouraging proper ammo management) with movement tweaks to make movement easier to train. Also available:survival.cfg(survival.wad) and the deadly corridor curriculum (deadly_corridor_1.cfgthroughdeadly_corridor_5.cfg, all usingdeadly_corridor.wad). Filesdeadly_corridor_1.cfgtodeadly_corridor_4.cfgramp difficulty gradually, butdeadly_corridor_5.cfgis a significant jump (and the actual benchmark). Progress through 1-4 builds basic policies yet may result in movement habits that underperform on 5 (straight running toward armor). Adjust curriculum pacing accordingly. - Feedback Defaults: Episode feedback now follows overall reward unless
episode_positive_feedback_event/episode_negative_feedback_eventare set. Reward channels usefeedback_positive_amplitude=2.0andfeedback_negative_amplitude=2.0, dynamically scaled but clipped via_limit_scaled_amplitude.
The specific values below are tuned for the deadly corridor scenario (
deadly_corridor_1.cfg–deadly_corridor_5.cfg). Treat them as a starting point only — other scenarios (progressive deathmatch, survival) will likely require different values for feedback scaling, reward shaping, ray-cast geometry, and curriculum pacing.
use_reward_feedback: Uses rewards to drive postive/negative feedback rather than action specific feedback, if you do decide to use action feedback, tweakevent_feedback_settingsaccordingly, these were values were arbitarily set.decoder_enforce_nonnegative=False,decoder_freeze_weights=False: decoder stays free to mirror encoder intent; set to True if you need tight control over decoded spike weights.decoder_zero_bias=True: keeps bias at zero so decoded actions depend solely on encoder output; helped prevent a lot of decoder-sided learning in testing but may be different on actual hardware since the sdk spikes were random; this should definitely be tested with ablations!decoder_use_mlp=False,decoder_mlp_hidden=32: default linear decoder keeps hardware mapping transparent; enable the MLP when you require richer non-linear policies (expect higher sample complexity, decoder also tends to start becoming a policy head but might be due to random spike noise from the SDK).decoder_weight_l2_coef=0.0,decoder_bias_l2_coef=0.0: L2 regularization hooks, raise these only if decoder weights diverge on long runs.wall_ray_count=12,wall_ray_max_range=64,wall_depth_max_distance=18.0: ray-cast features tuned for corridor geometry; adjust if you alter field-of-view or scenario scale.encoder_trainable=True,encoder_entropy_coef=-0.10: encoder keeps learning with a negative entropy coefficient that encourages confident (low-variance) stimulation because of the Beta distribution head.decoder_ablation_mode='none': swap torandomorzeroto test policy robustness when decoder contributions are removed.encoder_use_cnn=True,encoder_cnn_channels=16,encoder_cnn_downsample=4: lightweight CNN for spatial features; bump channels/downsample when raising resolution, can disable if needed to rely soley on raycasting data.encoder_same_frame_encoding=False,encoder_same_frame_repeats=3,encoder_same_frame_methods=('raw', 'edge', 'contrast'): optional same-frame encoding (SFE) repeats one Doom observation through a configurable number of encoder rounds and concatenates the resulting spike rounds before decoding. If you provide fewer methods than repeats, the remaining rounds default toraw. This is experimental in this repo, has not yet been ablation-tested, and can make decoder overfitting easier because the decoder sees a larger stacked spike vector.spike_artifact_wait_s=0.050: the training side waits 50 ms after each UDP stimulation before reading the CL1 spike reply, and the CL1 interface now exposes a matching--artifact-wait-ms 50 --collect-window-ms 50device-side window so residual stimulation artifacts are less likely to leak into the decoder input.episode_reset_delay_s=1.0: wait one second between Doom episodes so membrane potentials can settle before the next reset.episode_positive_feedback_event=None,episode_negative_feedback_event=None: default to global reward feedback; set to event keys if you need action-specific filters.- Surprise scaling knobs (
feedback_surprise_gain,_max_scale,_freq_gain,_amp_gainvariants): modulate how unexpected TD errors boost frequency/amplitude; higher values emphasize novel negative events. enemy_distance_normalization=1312.0: normalization constant for distance-based shaping; leave untouched unless you change WAD geometry or measurement units.- We use SiLU instead of GELU for the encoder/decoder, as GELU can be a tad less efficient, especially if the tanh approximation isn’t vectorized, and we don't particularly need the encoder to be perfect, just adaptive enough for the CL1 neurons.
- Install Requirements Torch 2.10 was used with CUDA 13.0 in testing but the version does not matter too much here, use whatever is compatible with the hardware available.
python3 -m venv .venv source .venv/bin/activate pip install -r requirements.txt - Pick a Scenario
- Default is
PPOConfig.doom_config = "progressive_deathmatch.cfg". Also available:"survival.cfg". - For the deadly corridor curriculum, start with
"deadly_corridor_1.cfg"and advance sequentially; consider fine-tuning ondeadly_corridor_5.cfgwith a lower learning rate to adapt movement behavior.
- Default is
- Run Training
python3 ppo_doom.py
- Checkpoints land in
PPOConfig.checkpoint_dir, logs inPPOConfig.log_dir. - Use TensorBoard to monitor:
tensorboard --logdir checkpoints/l5_2048_rand/logs.
- Checkpoints land in
- Tweak Defaults
- Override fields when instantiating
PPOConfig, e.g.PPOConfig(doom_config="deadly_corridor_1.cfg"). - For event-specific episode feedback, set
episode_positive_feedback_event/episode_negative_feedback_eventto keys defined inevent_feedback_settings.
- Override fields when instantiating
- Parameter Monitoring
tensorboard --logdir checkpoints/l5_2048_rand/logs --port 6006
# Run with cpu and a tick rate of 10 Hz python3 ppo_doom.py \ --device "cpu" \ --tick_frequency_hz 10 # Run locally and showing the window python3 ppo_doom.py \ --device "cpu" \ --tick_frequency_hz 10 \ --recording_path ./recordings \ --show_windowSee USAGE.md
Same-frame encoding in this repo follows the broad diffusion-neuron idea of repeating a single input frame across multiple stimulation rounds, but the view transforms are tuned for the current Doom RL setup rather than copied exactly. The round count is configurable. The default round template is raw, edge, and contrast, and any missing configured rounds fall back to raw.
Available SFE types:
raw: passes the encoder the original grayscale screen view with no extra preprocessing. Use this as the baseline round.edge: applies a Sobel edge-magnitude transform so the round emphasizes walls, enemy outlines, and other strong geometric transitions.contrast: recenters the grayscale frame around its own mean and pushes differences through atanhcontrast curve, which highlights bright/dark deviations without explicitly collapsing the view to edges.
This should be treated as experimental:
- It has not yet been ablation-tested against the single-round baseline in this repo.
- The decoder now receives stacked spike rounds, which increases decoder capacity and can let it overfit without actually relying on useful CL1 signal.
- If you enable it, compare against
--decoder-ablation randomorzerobefore trusting any reward gain.