FAQ

Isn't the decoder/PPO doing all the learning?

No, this is precisely why there are ablations. The footage you see in the video was taken using a 0-bias full linear readout decoder, meaning that the action selected is a linear function of the output spikes from the CL1; the CL1 is doing the learning. There is a noticeable difference when using the ablation (both random and 0 spikes result in zero learning) versus actual CL1 spikes.

Isn't the encoder/PPO doing all the learning?

This question largely assumes that the cells are static, which is incorrect; it is not a memory-less feed X in get Y machine. Both the policy and the cells are dynamical systems; biological neurons have an internal state (membrane potential, synaptic weights, adaptation currents). The same stimulation delivered at different points in training will produce different spike patterns, because the neurons have been conditioned by prior feedback. During testing, we froze encoder weights and still observed improvements in the reward.

How is DOOM converted to electrical signals?

We train an encoder in our PPO policy that dictates the stimulation pattern (frequency, amplitude, pulses, and even which channels to stimulate). Because the CL1 spikes are non-differentiable, the encoder is trained through PPO policy gradients using the log-likelihood trick (REINFORCE-style), i.e., by including the encoder’s sampled stimulation log-probs in the PPO objective rather than backpropagating through spikes.

ppo_doom.py Quickstart

Default Parameters

PPO: learning_rate=3e-4, gamma=0.99, gae_lambda=0.95 (many RL implementations actually use far lower gamma 0.95 and lambda 0.90 for GAE but this can severely affect training on lower levels due to the long range dependencies since you take less damage and therefore live longer), clip_epsilon=0.2, entropy_coef=0.02, steps_per_update=2048 (use higher for stability but it can be really slow without parallelization of some sort, that would probably require GRPO instead of PPO), batch_size=256, num_epochs=4.
Observation & Action: Screen buffer enabled at RES_320X240; hybrid action spaces are used (and greatly preferred) unless use_discrete_action_set=True. Realistically, you only flip this if all else fails to reduce entropy as it greatly reduces the movement fidelity of the agent and just doesn't look as cool.
Scenario Config: doom_config defaults to progressive_deathmatch.cfg (progressive_deathmatch.wad) — similar to survival, but kills don't reset ammo count (encouraging proper ammo management) with movement tweaks to make movement easier to train. Also available: survival.cfg (survival.wad) and the deadly corridor curriculum (deadly_corridor_1.cfg through deadly_corridor_5.cfg, all using deadly_corridor.wad). Files deadly_corridor_1.cfg to deadly_corridor_4.cfg ramp difficulty gradually, but deadly_corridor_5.cfg is a significant jump (and the actual benchmark). Progress through 1-4 builds basic policies yet may result in movement habits that underperform on 5 (straight running toward armor). Adjust curriculum pacing accordingly.
Feedback Defaults: Episode feedback now follows overall reward unless episode_positive_feedback_event/episode_negative_feedback_event are set. Reward channels use feedback_positive_amplitude=2.0 and feedback_negative_amplitude=2.0, dynamically scaled but clipped via _limit_scaled_amplitude.

Architecture & Feedback Tuning (Deadly Corridor)

The specific values below are tuned for the deadly corridor scenario (deadly_corridor_1.cfg – deadly_corridor_5.cfg). Treat them as a starting point only — other scenarios (progressive deathmatch, survival) will likely require different values for feedback scaling, reward shaping, ray-cast geometry, and curriculum pacing.

use_reward_feedback: Uses rewards to drive postive/negative feedback rather than action specific feedback, if you do decide to use action feedback, tweak event_feedback_settings accordingly, these were values were arbitarily set.
decoder_enforce_nonnegative=False, decoder_freeze_weights=False: decoder stays free to mirror encoder intent; set to True if you need tight control over decoded spike weights.
decoder_zero_bias=True: keeps bias at zero so decoded actions depend solely on encoder output; helped prevent a lot of decoder-sided learning in testing but may be different on actual hardware since the sdk spikes were random; this should definitely be tested with ablations!
decoder_use_mlp=False, decoder_mlp_hidden=32: default linear decoder keeps hardware mapping transparent; enable the MLP when you require richer non-linear policies (expect higher sample complexity, decoder also tends to start becoming a policy head but might be due to random spike noise from the SDK).
decoder_weight_l2_coef=0.0, decoder_bias_l2_coef=0.0: L2 regularization hooks, raise these only if decoder weights diverge on long runs.
wall_ray_count=12, wall_ray_max_range=64, wall_depth_max_distance=18.0: ray-cast features tuned for corridor geometry; adjust if you alter field-of-view or scenario scale.
encoder_trainable=True, encoder_entropy_coef=-0.10: encoder keeps learning with a negative entropy coefficient that encourages confident (low-variance) stimulation because of the Beta distribution head.
decoder_ablation_mode='none': swap to random or zero to test policy robustness when decoder contributions are removed.
encoder_use_cnn=True, encoder_cnn_channels=16, encoder_cnn_downsample=4: lightweight CNN for spatial features; bump channels/downsample when raising resolution, can disable if needed to rely soley on raycasting data.
encoder_same_frame_encoding=False, encoder_same_frame_repeats=3, encoder_same_frame_methods=('raw', 'edge', 'contrast'): optional same-frame encoding (SFE) repeats one Doom observation through a configurable number of encoder rounds and concatenates the resulting spike rounds before decoding. If you provide fewer methods than repeats, the remaining rounds default to raw. This is experimental in this repo, has not yet been ablation-tested, and can make decoder overfitting easier because the decoder sees a larger stacked spike vector.
spike_artifact_wait_s=0.050: the training side waits 50 ms after each UDP stimulation before reading the CL1 spike reply, and the CL1 interface now exposes a matching --artifact-wait-ms 50 --collect-window-ms 50 device-side window so residual stimulation artifacts are less likely to leak into the decoder input.
episode_reset_delay_s=1.0: wait one second between Doom episodes so membrane potentials can settle before the next reset.
episode_positive_feedback_event=None, episode_negative_feedback_event=None: default to global reward feedback; set to event keys if you need action-specific filters.
Surprise scaling knobs (feedback_surprise_gain, _max_scale, _freq_gain, _amp_gain variants): modulate how unexpected TD errors boost frequency/amplitude; higher values emphasize novel negative events.
enemy_distance_normalization=1312.0: normalization constant for distance-based shaping; leave untouched unless you change WAD geometry or measurement units.
We use SiLU instead of GELU for the encoder/decoder, as GELU can be a tad less efficient, especially if the tanh approximation isn’t vectorized, and we don't particularly need the encoder to be perfect, just adaptive enough for the CL1 neurons.

Getting Started

Install Requirements
```
python3 -m venv .venv source .venv/bin/activate pip install -r requirements.txt
```
Torch 2.10 was used with CUDA 13.0 in testing but the version does not matter too much here, use whatever is compatible with the hardware available.
Pick a Scenario
- Default is PPOConfig.doom_config = "progressive_deathmatch.cfg". Also available: "survival.cfg".
- For the deadly corridor curriculum, start with "deadly_corridor_1.cfg" and advance sequentially; consider fine-tuning on deadly_corridor_5.cfg with a lower learning rate to adapt movement behavior.
Run Training
```
python3 ppo_doom.py
```
- Checkpoints land in PPOConfig.checkpoint_dir, logs in PPOConfig.log_dir.
- Use TensorBoard to monitor: tensorboard --logdir checkpoints/l5_2048_rand/logs.
Tweak Defaults
- Override fields when instantiating PPOConfig, e.g. PPOConfig(doom_config="deadly_corridor_1.cfg").
- For event-specific episode feedback, set episode_positive_feedback_event/episode_negative_feedback_event to keys defined in event_feedback_settings.

Parameter Monitoring

tensorboard --logdir checkpoints/l5_2048_rand/logs --port 6006

Running with increased configuration (2025-11-19)

# Run with cpu and a tick rate of 10 Hz python3 ppo_doom.py \ --device "cpu" \ --tick_frequency_hz 10 # Run locally and showing the window python3 ppo_doom.py \ --device "cpu" \ --tick_frequency_hz 10 \ --recording_path ./recordings \ --show_window

Running Local Server + Connecting to CL1

See USAGE.md

Same-Frame Encoding Notes

Same-frame encoding in this repo follows the broad diffusion-neuron idea of repeating a single input frame across multiple stimulation rounds, but the view transforms are tuned for the current Doom RL setup rather than copied exactly. The round count is configurable. The default round template is raw, edge, and contrast, and any missing configured rounds fall back to raw.

Available SFE types:

raw: passes the encoder the original grayscale screen view with no extra preprocessing. Use this as the baseline round.
edge: applies a Sobel edge-magnitude transform so the round emphasizes walls, enemy outlines, and other strong geometric transitions.
contrast: recenters the grayscale frame around its own mean and pushes differences through a tanh contrast curve, which highlights bright/dark deviations without explicitly collapsing the view to edges.

This should be treated as experimental:

It has not yet been ablation-tested against the single-round baseline in this repo.
The decoder now receives stacked spike rounds, which increases decoder capacity and can let it overfit without actually relying on useful CL1 signal.
If you enable it, compare against --decoder-ablation random or zero before trusting any reward gain.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
docker		docker
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
USAGE.md		USAGE.md
_vizdoom.ini		_vizdoom.ini
cl1_neural_interface.py		cl1_neural_interface.py
deadly_corridor.wad		deadly_corridor.wad
deadly_corridor_1.cfg		deadly_corridor_1.cfg
deadly_corridor_2.cfg		deadly_corridor_2.cfg
deadly_corridor_3.cfg		deadly_corridor_3.cfg
deadly_corridor_4.cfg		deadly_corridor_4.cfg
deadly_corridor_5.cfg		deadly_corridor_5.cfg
mjpeg_server.py		mjpeg_server.py
nashmove.pk3		nashmove.pk3
ppo_doom.py		ppo_doom.py
progressive_deathmatch.cfg		progressive_deathmatch.cfg
progressive_deathmatch.dbs		progressive_deathmatch.dbs
progressive_deathmatch.wad		progressive_deathmatch.wad
requirements.txt		requirements.txt
survival.cfg		survival.cfg
survival.wad		survival.wad
training_server.py		training_server.py
udp_protocol.py		udp_protocol.py
visualisation.html		visualisation.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FAQ

ppo_doom.py Quickstart

Default Parameters

Architecture & Feedback Tuning (Deadly Corridor)

Getting Started

Running with increased configuration (2025-11-19)

Running Local Server + Connecting to CL1

Same-Frame Encoding Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Languages

Folders and files

Latest commit

History

Repository files navigation

FAQ

ppo_doom.py Quickstart

Default Parameters

Architecture & Feedback Tuning (Deadly Corridor)

Getting Started

Running with increased configuration (2025-11-19)

Running Local Server + Connecting to CL1

Same-Frame Encoding Notes

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Languages

Packages