Each script runs one experiment end-to-end. All scripts resolve paths relative to their own location, so they can be called from any directory.
All Python scripts run inside the `sae` conda environment via `conda run -n sae`.
## Data collection (requires GPU / VM)

| Script | Description |
| --- | --- |
| `collect_per_token_sae.sh` | Collect per-token SAE feature activations for all 8 layers. Edit `LAYERS=(...)` to select which layers. Outputs `artifacts/activations/per_token_sae_l{L}.jsonl`. Must run on a machine with the Qwen2.5-7B model and SAE weights. |
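A minimal sketch of what consuming the per-token JSONL might look like. The field names (`sample_id`, `label`, `tokens`, `token_features`) and the sparse feature-index-to-activation encoding are assumptions for illustration, not the script's actual schema:

```python
import json
import os
import tempfile

def mean_feature_activation(record):
    """Mean activation per SAE feature across all tokens of one sample."""
    n_tok = len(record["tokens"])
    totals = {}
    for tok_feats in record["token_features"]:  # one sparse dict per token
        for feat, act in tok_feats.items():
            totals[feat] = totals.get(feat, 0.0) + act
    return {feat: total / n_tok for feat, total in totals.items()}

# Hypothetical record: only active (nonzero) features are stored per token
record = {
    "sample_id": "sample-0001",
    "label": 1,  # 1 = vulnerable, 0 = secure (assumed convention)
    "tokens": ["int", "main", "("],
    "token_features": [{"1024": 0.7}, {"512": 1.3, "1024": 0.2}, {}],
}

# Round-trip through a JSONL file, one JSON object per line
path = os.path.join(tempfile.mkdtemp(), "per_token_sae_l11.jsonl")
with open(path, "w") as f:
    f.write(json.dumps(record) + "\n")

with open(path) as f:
    for line in f:
        print(mean_feature_activation(json.loads(line)))
```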
## Probing — does the model encode vulnerability?

| Script | Description | Key output |
| --- | --- | --- |
| `mean_pool_probe.sh` | Mean-token pooling probe across all 8 layers. Compares AUROC of mean-token vs last-token pooling. Main evidence that the vulnerability signal is diffuse across positions. | Appendix L table + figure |
| `within_language_probe.sh` | Within-language (C / PHP / JS) probe. Controls for the programming-language confound by running the binary probe within each language stratum separately. | Appendix N figure |
| `within_language_mean_pool_probe.sh` | Same as above but with mean-token pooling. Checks whether the AUROC gain survives within a single language. | Appendix Q table |
| `nonlinear_probe.sh` | Compares linear (LogReg) vs nonlinear (MLP, random forest) probes at all layers. Rules out the possibility that near-chance AUROC is an artefact of linear probing. | Appendix M figure |
| `length_controlled_probe.sh` | Length-residualised and length-stratified probes. Controls for token-count differences between secure and vulnerable code. | Appendix figure |
| `advanced_pooling_probe.sh` | Compares four pooling strategies: last-token, mean-token, attention-weighted, and diff-restricted (tokens on changed lines). | |
| | Computes cosine similarity between the vulnerability direction d^L across all layer pairs. Shows whether the direction rotates or stays stable with depth. | |
| | Two analyses at L11: (1) within-pair total activation comparison; (2) activation-magnitude scatter — 95.1% of secure-enriched features have higher mean activation on secure code. | `fig_paired_suppression.pdf`, Appendix figure |
| `magnitude_asymmetry_crosslayer.sh` | Repeats the magnitude-asymmetry analysis across all 8 standard SAE layers (from `TO_UPload/` JSONLs). | Appendix S table + 2×4 scatter grid |
| `feature_asymmetry_crosslayer.sh` | Replicates the Δf feature-count asymmetry (e.g. 3.65× at L11) at L0 and L27 using standard SAE activations. | `fig_feature_asymmetry_crosslayer.pdf` |
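The four pooling strategies compared by `advanced_pooling_probe.sh` can be sketched over a single `(n_tokens, d)` activation matrix. The function name, strategy labels, and toy data below are illustrative, not the script's actual API:

```python
import numpy as np

def pool(acts, strategy="mean", weights=None, mask=None):
    """Pool a (n_tokens, d) activation matrix into a single d-vector.

    last - final token only
    mean - average over all tokens
    attn - weighted average with externally supplied attention weights
    diff - mean over a boolean token mask (e.g. tokens on changed lines)
    """
    if strategy == "last":
        return acts[-1]
    if strategy == "mean":
        return acts.mean(axis=0)
    if strategy == "attn":
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()  # normalise weights to sum to 1
        return w @ acts
    if strategy == "diff":
        return acts[np.asarray(mask, dtype=bool)].mean(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

# Toy example: 4 tokens, 3-dimensional features
acts = np.arange(12, dtype=float).reshape(4, 3)
print(pool(acts, "last"))
print(pool(acts, "mean"))
print(pool(acts, "attn", weights=[0, 0, 0, 1]))  # equals last-token pooling
print(pool(acts, "diff", mask=[1, 1, 0, 0]))     # mean of first two tokens
```

The pooled vector is what the downstream linear probe is trained on, so the strategy choice determines which positions can contribute signal.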
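The cross-layer direction-stability check reduces to a pairwise cosine-similarity matrix over the per-layer vulnerability directions d^L. A minimal sketch, with random stand-in directions and assumed layer indices:

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [0, 4, 8, 11, 15, 19, 23, 27]          # illustrative layer indices
dirs = {L: rng.normal(size=64) for L in layers}  # stand-ins for each d^L

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = np.array([[cosine(dirs[a], dirs[b]) for b in layers] for a in layers])
# Diagonal is 1.0 by construction; high off-diagonal values would mean the
# direction stays stable with depth, low values that it rotates.
print(np.round(sim, 2))
```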
## Positional analyses — where in the sequence is the signal?

| Script | Description | Key output |
| --- | --- | --- |
| `position_stratified_probe.sh` | Mean SAE feature activation as a function of normalised token position (0→1). Shows the signal is distributed, not last-token only. | Appendix O figure |
| `positional_probe_b.sh` | Drops the first position bin and checks whether the discriminative signal persists. Uses `positional_profiles_raw.jsonl` (no GPU needed). | Appendix figure |
| `token_feature_viz.sh` | Per-token coloured heatmaps for selected SAE features. Edit `--features` to choose which features to visualise. | PDFs in `token_viz/figures/` |
| `token_trajectory_3d.sh` | Per-token residual-stream trajectory in vulnerability-direction PCA space (x = d^L, y/z = top orthogonal PCs). | 3-D trajectory PDF |
| `token_pca_3d.sh` | 3-D PCA trajectory of per-token SAE activations coloured by position. Requires the JSONL from `collect_per_token_sae.sh`. Edit `--layers` to match collected files. | |
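The position-stratified analysis above amounts to binning each token's normalised position into fixed bins and averaging activation per bin. A minimal sketch, assuming 10 bins and one scalar activation per token (both assumptions, not the script's actual parameters):

```python
import numpy as np

def positional_profile(activations, n_bins=10):
    """Mean activation per normalised-position bin (0 -> 1).

    activations: 1-D array with one scalar per token, in sequence order.
    """
    acts = np.asarray(activations, dtype=float)
    n = len(acts)
    pos = np.arange(n) / max(n - 1, 1)               # normalised position in [0, 1]
    bins = np.minimum((pos * n_bins).astype(int), n_bins - 1)
    return np.array([acts[bins == b].mean() if (bins == b).any() else np.nan
                     for b in range(n_bins)])

# A flat profile indicates a distributed signal; a spike in the final bin
# would indicate a last-token-only signal.
profile = positional_profile(np.ones(100), n_bins=10)
print(profile)  # all ones: perfectly flat
```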