CUDA AES-128 Bitslice (32-way lanes) — `TMP` + `constexpr` secret keys

This repo provides:

- A 32-way bitsliced AES-128 implementation in CUDA
- Compile-time round keys (no runtime key storage) using templates/constexpr
- Implicit ShiftRows (via column remapping) in the main rounds
- Tools for input generation, unpacking, and verification

Build

cmake -S . -B build -DGPU_ARCH=90   # adjust SM (e.g. 86, 89)
cmake --build build -j

The build prints ptxas resource usage; expect 0 bytes of local memory (lmem) for every kernel, meaning the whole state stays in registers.

Generate inputs (control grid & block)

python3 tools/make_inputs.py --grid 2x1x1 --block 256x1x1 --seed 1
# writes inputs/run_YYYYmmdd_HHMMSS/{plaintexts.bin, plaintexts.hex, slices_u32_le.bin, meta.json}

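groups = grid.x*grid.y*grid.z * block.x*block.y*block.z (the total thread count)
Each thread processes one group of 32 plaintexts, held as a single bitsliced state: 128 32-bit slices, one per bit position of the 128-bit block, with one bit lane per plaintext.
Bitsliced input layout: groups * 128 little-endian uint32_t slices.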
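
The layout above can be sketched in host C++ as a pack/unpack pair. This is only an illustration: the names `pack_group`, `unpack_group`, `thread_count` and `slice_file_bytes` are hypothetical, and the bit ordering (slice index = 8*byte + bit, lane j = plaintext j) is an assumption, not something the repository confirms. tools/make_inputs.py may order slices differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Block  = std::array<std::uint8_t, 16>;     // one 128-bit AES block
using Slices = std::array<std::uint32_t, 128>;   // one bitsliced group

// ASSUMPTION: slice index = 8*byte + bit (bit 0 = least significant bit),
// and lane j (bit j of every slice) belongs to plaintext j of the group.
inline Slices pack_group(const std::array<Block, 32>& blocks) {
    Slices s{};  // zero-initialised
    for (std::size_t lane = 0; lane < 32; ++lane)
        for (std::size_t byte = 0; byte < 16; ++byte)
            for (std::size_t bit = 0; bit < 8; ++bit) {
                std::uint32_t v = (blocks[lane][byte] >> bit) & 1u;
                s[8 * byte + bit] |= v << lane;
            }
    return s;
}

// Inverse of pack_group under the same assumed ordering.
inline std::array<Block, 32> unpack_group(const Slices& s) {
    std::array<Block, 32> blocks{};
    for (std::size_t lane = 0; lane < 32; ++lane)
        for (std::size_t byte = 0; byte < 16; ++byte)
            for (std::size_t bit = 0; bit < 8; ++bit) {
                std::uint32_t v = (s[8 * byte + bit] >> lane) & 1u;
                blocks[lane][byte] |= static_cast<std::uint8_t>(v << bit);
            }
    return blocks;
}

// Size bookkeeping taken directly from the formulas above.
constexpr std::size_t thread_count(std::size_t gx, std::size_t gy, std::size_t gz,
                                   std::size_t bx, std::size_t by, std::size_t bz) {
    return gx * gy * gz * bx * by * bz;
}
constexpr std::size_t slice_file_bytes(std::size_t groups) {
    return groups * 128 * sizeof(std::uint32_t);
}

// --grid 2x1x1 --block 256x1x1 gives 512 groups, i.e. 16384 plaintexts.
static_assert(thread_count(2, 1, 1, 256, 1, 1) == 512, "512 threads");
static_assert(slice_file_bytes(512) == 262144, "512 * 128 * 4 bytes");
```

Under these assumptions, `unpack_group(pack_group(x)) == x` for any 32 blocks, and the slice file for the example grid and block is 262144 bytes.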

Run (full AES)

IN=$(ls -d inputs/run_*/slices_u32_le.bin | tail -n 1)   # newest run; a glob is not expanded in a plain assignment
OUT=outputs/run_full_slices_u32_le.bin
./build/cuda-aes-full "$IN" "$OUT" 2x1x1 256x1x1

Stores bitsliced ciphertext to OUT.

Verify and unpack to standard bytes

python3 tools/verify_outputs.py \
  --meta inputs/run_*/meta.json \
  --slices_out "$OUT" \
  --keyhex 2b7e151628aed2a6abf7158809cf4f3c

Also writes:
- outputs/ciphertexts_from_cuda.bin (unpacked CUDA output)
- outputs/ciphertexts_from_python.bin (Python AES-128 reference)

Other test targets

S-box only:

./build/cuda-aes-sbox-only inputs/.../slices_u32_le.bin outputs/sbox_only.bin 256

MixColumns only:

./build/cuda-aes-mix-only inputs/.../slices_u32_le.bin outputs/mix_only.bin 256

Compile-time keys (no runtime storage)

Round keys are computed in templates (include/aes_keys.hpp).
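AddRoundKey is resolved at compile time: for each key bit equal to 1 the corresponding slice register is inverted (~reg, i.e. XOR with all-ones), and for each key bit equal to 0 nothing is emitted. The keys therefore occupy no registers or memory.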
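
A minimal host-side sketch of this idea follows, written in plain C++17 so it compiles without nvcc. `RoundKeySketch`, `xor_key_bit` and `add_round_key` are hypothetical names, not the repository's API, and the slice ordering (slice i holds bit i % 8 of state byte i / 8) is an assumption. In a CUDA kernel the functions would additionally carry `__device__` qualifiers and the slices would live in registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

using Slices = std::array<std::uint32_t, 128>;

// Hypothetical stand-in for one compile-time round key (16 bytes).
template <std::uint8_t... Bytes>
struct RoundKeySketch {
    static_assert(sizeof...(Bytes) == 16, "an AES round key is 16 bytes");
    static constexpr std::uint8_t bytes[16] = {Bytes...};
    // ASSUMPTION: slice i holds bit (i % 8) of state byte (i / 8).
    static constexpr bool bit(std::size_t i) {
        return ((bytes[i / 8] >> (i % 8)) & 1) != 0;
    }
};

// Handle one slice: the branch is resolved at compile time.
template <class Key, std::size_t I>
inline void xor_key_bit(Slices& s) {
    if constexpr (Key::bit(I)) {
        s[I] = ~s[I];  // XOR with 0xFFFFFFFF flips this bit in all 32 lanes
    }
    // Key bit 0: XOR with zero is the identity, so no code is generated.
}

template <class Key, std::size_t... I>
inline void add_round_key_impl(Slices& s, std::index_sequence<I...>) {
    (xor_key_bit<Key, I>(s), ...);  // fold expression unrolls all 128 slices
}

template <class Key>
inline void add_round_key(Slices& s) {
    add_round_key_impl<Key>(s, std::make_index_sequence<128>{});
}
```

Calling `add_round_key<Key>(s)` then has the same effect as XORing every slice with an all-ones or all-zeros mask chosen by the key bit, but the key never exists as data at run time.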

Edit the key in src/aes_full_kernel.cu inside run_aes_bs_full():

using MyKey = StaticKey<
    0x2B,0x7E,0x15,0x16, 0x28,0xAE,0xD2,0xA6,
    0xAB,0xF7,0x15,0x88, 0x09,0xCF,0x4F,0x3C >;
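
To illustrate how a key type like this could yield all eleven round keys at compile time, here is a hedged C++17 sketch. It is not the code in include/aes_keys.hpp: `StaticKeySketch`, `make_sbox`, `expand_key` and `kSbox` are hypothetical names, and the S-box is generated rather than tabulated only to keep the example short. The `static_assert` checks use the key expansion values from FIPS-197 Appendix A.1 for this same key.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using u8 = std::uint8_t;
using RoundKeys = std::array<std::array<u8, 16>, 11>;

constexpr u8 rotl8(u8 x, int s) {
    return static_cast<u8>((x << s) | (x >> (8 - s)));
}

// Build the AES S-box at compile time by walking GF(2^8) with generator 3.
constexpr std::array<u8, 256> make_sbox() {
    std::array<u8, 256> sbox{};
    u8 p = 1, q = 1;
    do {
        p = static_cast<u8>(p ^ (p << 1) ^ ((p & 0x80) ? 0x1B : 0));  // p *= 3
        q = static_cast<u8>(q ^ (q << 1));                            // q /= 3
        q = static_cast<u8>(q ^ (q << 2));
        q = static_cast<u8>(q ^ (q << 4));
        q = static_cast<u8>(q ^ ((q & 0x80) ? 0x09 : 0));
        // q is now the inverse of p; apply the affine transform.
        u8 x = static_cast<u8>(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^ rotl8(q, 3) ^ rotl8(q, 4));
        sbox[p] = static_cast<u8>(x ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63;  // zero has no inverse
    return sbox;
}
constexpr std::array<u8, 256> kSbox = make_sbox();

// AES-128 key schedule, byte-oriented.
constexpr RoundKeys expand_key(const std::array<u8, 16>& key) {
    RoundKeys rk{};
    for (std::size_t i = 0; i < 16; ++i) rk[0][i] = key[i];
    u8 rcon = 1;
    for (std::size_t r = 1; r <= 10; ++r) {
        const auto& prev = rk[r - 1];
        // SubWord(RotWord(last word of previous key)) XOR Rcon
        const u8 t[4] = {static_cast<u8>(kSbox[prev[13]] ^ rcon), kSbox[prev[14]],
                         kSbox[prev[15]], kSbox[prev[12]]};
        for (std::size_t i = 0; i < 16; ++i) {
            // The first word uses t; later words chain from the word before.
            u8 x = (i < 4) ? t[i] : rk[r][i - 4];
            rk[r][i] = static_cast<u8>(prev[i] ^ x);
        }
        rcon = static_cast<u8>((rcon << 1) ^ ((rcon & 0x80) ? 0x1B : 0));
    }
    return rk;
}

// Hypothetical analogue of StaticKey: round keys exist only as constants.
template <u8... K>
struct StaticKeySketch {
    static_assert(sizeof...(K) == 16, "an AES-128 key is 16 bytes");
    static constexpr RoundKeys round_keys = expand_key(std::array<u8, 16>{K...});
};

using MyKeySketch = StaticKeySketch<0x2B,0x7E,0x15,0x16, 0x28,0xAE,0xD2,0xA6,
                                    0xAB,0xF7,0x15,0x88, 0x09,0xCF,0x4F,0x3C>;
static_assert(MyKeySketch::round_keys[1][0] == 0xA0 && MyKeySketch::round_keys[1][3] == 0x17,
              "round 1 key starts a0 fa fe 17");
static_assert(MyKeySketch::round_keys[10][0] == 0xD0 && MyKeySketch::round_keys[10][15] == 0xA6,
              "round 10 key is d014f9a8 ... b6630ca6");
```

Because `round_keys` is a compile-time constant, each key bit can be consumed by a template or `if constexpr`, which is what allows AddRoundKey to avoid any runtime key storage.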

Notes

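In the main rounds, ShiftRows is folded into MixColumns: instead of moving bytes, MixColumns is called on the byte groups {0,5,10,15}, {4,9,14,3}, {8,13,2,7}, {12,1,6,11}, which are the columns of the state after ShiftRows.
The final round has no MixColumns, so it performs SubBytes and an explicit ShiftRows, then applies the last round key. That ShiftRows is a single register-only permutation.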
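
The byte groups listed above follow directly from the ShiftRows definition, which the sketch below checks at compile time. It assumes the standard column-major AES state (byte index = 4*column + row); the names `shifted_source`, `mixcolumns_inputs` and `shift_rows` are hypothetical and not taken from the repository. In the bitsliced kernel each "byte" would stand for eight slice registers rather than one value.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Column = std::array<std::size_t, 4>;

// ShiftRows rotates row r left by r columns, so after ShiftRows the byte at
// (row r, column c) is the pre-ShiftRows byte at (row r, column (c + r) % 4).
constexpr std::size_t shifted_source(std::size_t col, std::size_t row) {
    return 4 * ((col + row) % 4) + row;
}

// For each output column, the four pre-ShiftRows byte indices MixColumns reads.
constexpr std::array<Column, 4> mixcolumns_inputs() {
    std::array<Column, 4> cols{};
    for (std::size_t c = 0; c < 4; ++c)
        for (std::size_t r = 0; r < 4; ++r)
            cols[c][r] = shifted_source(c, r);
    return cols;
}

constexpr bool is_column(const Column& col, std::size_t a, std::size_t b,
                         std::size_t c, std::size_t d) {
    return col[0] == a && col[1] == b && col[2] == c && col[3] == d;
}

constexpr std::array<Column, 4> kCols = mixcolumns_inputs();
static_assert(is_column(kCols[0], 0, 5, 10, 15), "column 0");
static_assert(is_column(kCols[1], 4, 9, 14, 3), "column 1");
static_assert(is_column(kCols[2], 8, 13, 2, 7), "column 2");
static_assert(is_column(kCols[3], 12, 1, 6, 11), "column 3");

// Explicit ShiftRows, as needed once in the final round: a pure index
// permutation with no arithmetic on the data.
template <class T>
constexpr std::array<T, 16> shift_rows(const std::array<T, 16>& in) {
    std::array<T, 16> out{};
    for (std::size_t c = 0; c < 4; ++c)
        for (std::size_t r = 0; r < 4; ++r)
            out[4 * c + r] = in[shifted_source(c, r)];
    return out;
}
```

Since the permutation is fixed at compile time, a fully unrolled kernel can express it as register renaming in the main rounds and as a single set of moves in the final round.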
