Branch: AdaLovelace — Optimized for NVIDIA Ada Lovelace architecture (RTX 4060 Ti, sm_89)
A high-performance CUDA implementation of the Scale-Invariant Feature Transform (SIFT) algorithm. It runs the complete SIFT pipeline on the GPU; on an RTX 4060 Ti, feature extraction takes under 1 ms at resolutions up to 1280x960 and 1.38 ms at 1080p.
Based on the original work by Mårten Björkman (Celebrandil), with Ada Lovelace architecture optimizations.
| Spec | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 4060 Ti |
| Architecture | Ada Lovelace (sm_89) |
| CUDA Cores | 4352 |
| VRAM | 8 GB GDDR6 |
| Memory Bandwidth | 288 GB/s |
| FP32 Performance | ~22.1 TFLOPS |
| L2 Cache | 32 MB |
| Driver | 595.71 |
| Resolution | Size | Features | Extract (ms) | Match (ms) | Total (ms) | FPS |
|---|---|---|---|---|---|---|
| VGA | 640x480 | 653 | 0.77 | 0.12 | 1.04 | 965 |
| 720p | 1280x720 | 1155 | 0.91 | 0.20 | 1.47 | 681 |
| SXGA | 1280x960 | 1326 | 0.99 | 0.21 | 1.66 | 601 |
| 1080p | 1920x1080 | 1911 | 1.38 | 0.34 | 2.49 | 402 |
| 1440p | 2560x1440 | 2244 | 1.85 | 0.38 | 3.56 | 281 |
| 4K UHD | 3840x2160 | 2829 | 3.53 | 0.51 | 6.95 | 144 |
Benchmarked on RTX 4060 Ti (Driver 595.71, CUDA 13.1). Compute Capability 8.9, 34 SMs, 8187 MB VRAM, 128-bit bus, 32768 KB L2 cache.
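The FPS column is consistent with the reciprocal of the Total column, within rounding (for example, 1000 / 2.49 ms is about 402). Total is larger than Extract + Match in every row; the remainder is presumably the host-to-device upload that the benchmark also reports, although that is an assumption. A minimal sketch of the conversion, where `fpsFromMillis` is a hypothetical helper name and not part of the project:

```cpp
#include <cassert>

// Frames per second implied by a per-frame latency given in milliseconds.
// Hypothetical helper for reading the table; not part of the CudaSift API.
inline double fpsFromMillis(double totalMs) {
    assert(totalMs > 0.0);
    return 1000.0 / totalMs;
}
```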
| Features | Match Time (ms) |
|---|---|
| 1911 (self-match) | 0.33 |
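The matching stage tracks the best and second-best correlation for every feature and reports their ratio as the ambiguity. The following CPU sketch illustrates that bookkeeping only; it is not the tiled GPU kernel. `MatchResult`, `correlation`, and `bestMatch` are hypothetical names, and the sketch assumes unit-length, non-negative descriptors so that the dot product acts as the correlation score.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kDescDim = 128;  // descriptor length, as in SiftPoint::data
using Descriptor = std::array<float, kDescDim>;

// Hypothetical result record; field names follow the SiftPoint match fields.
struct MatchResult {
    int match = -1;          // index of the best candidate, -1 if none correlates
    float score = 0.0f;      // best correlation
    float ambiguity = 0.0f;  // second-best / best
};

// Dot product; equals the cosine similarity when both inputs are unit length.
inline float correlation(const Descriptor& a, const Descriptor& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < kDescDim; ++i) sum += a[i] * b[i];
    return sum;
}

// Brute-force search that keeps the two highest correlations seen so far.
inline MatchResult bestMatch(const Descriptor& query,
                             const std::vector<Descriptor>& candidates) {
    MatchResult r;
    float best = 0.0f, second = 0.0f;
    for (std::size_t j = 0; j < candidates.size(); ++j) {
        const float c = correlation(query, candidates[j]);
        if (c > best) {
            second = best;   // previous best becomes the runner-up
            best = c;
            r.match = static_cast<int>(j);
        } else if (c > second) {
            second = c;
        }
    }
    assert(best >= second);
    r.score = best;
    r.ambiguity = best > 0.0f ? second / best : 0.0f;
    return r;
}
```

A low ambiguity means the best candidate clearly beats the runner-up; a value near 1.0 means two candidates are almost equally good.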
| Octaves | Features | Extract (ms) |
|---|---|---|
| 3 | 1741 | 1.24 |
| 4 | 1877 | 1.35 |
| 5 | 1911 | 1.60 |
| 6 | 1920 | 1.80 |
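Each additional octave works on an image with half the width and height of the previous one, which is a likely reason why going from 5 to 6 octaves adds only 9 features in the table above. The sketch below lists the per-octave dimensions; `octaveSizes` is a hypothetical helper, and rounding down on each halving is an assumption about the implementation.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Width and height of every octave, assuming each octave halves the previous
// one with integer division. Stops early once a dimension reaches zero.
inline std::vector<std::pair<int, int>> octaveSizes(int width, int height,
                                                    int numOctaves) {
    assert(width > 0 && height > 0 && numOctaves > 0);
    std::vector<std::pair<int, int>> sizes;
    for (int o = 0; o < numOctaves; ++o) {
        sizes.emplace_back(width, height);
        width /= 2;
        height /= 2;
        if (width == 0 || height == 0) break;
    }
    return sizes;
}
```

For a 1920x1080 input and 5 octaves this yields 1920x1080, 960x540, 480x270, 240x135, and 120x67.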
| Threshold | Features | Extract (ms) |
|---|---|---|
| 1.0 | 7081 | 2.07 |
| 2.0 | 3700 | 1.79 |
| 3.0 | 1911 | 1.59 |
| 5.0 | 542 | 1.32 |
| 10.0 | 6 | 1.35 |
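The table above shows the feature count falling as the threshold rises. One way to picture why: a pixel becomes a keypoint candidate only if its DoG response is strong enough and it is an extremum among its 26 neighbors in space and scale. The sketch below is a CPU illustration under the assumption that `thresh` is compared directly against the absolute DoG response; the GPU kernel may scale or apply the threshold differently. `Neighborhood` and `isKeypointCandidate` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// 3 scales x 3 rows x 3 columns of DoG responses; the candidate is at [1][1][1].
using Neighborhood = std::array<std::array<std::array<float, 3>, 3>, 3>;

// True if the center is a strict maximum or strict minimum of its 26 neighbors
// and its magnitude exceeds the threshold.
inline bool isKeypointCandidate(const Neighborhood& dog, float thresh) {
    assert(thresh >= 0.0f);
    const float center = dog[1][1][1];
    if (std::fabs(center) <= thresh) return false;  // response too weak
    bool isMax = true, isMin = true;
    for (int s = 0; s < 3; ++s) {
        for (int y = 0; y < 3; ++y) {
            for (int x = 0; x < 3; ++x) {
                if (s == 1 && y == 1 && x == 1) continue;  // skip the center
                const float v = dog[s][y][x];
                if (v >= center) isMax = false;
                if (v <= center) isMin = false;
            }
        }
    }
    return isMax || isMin;
}
```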
| Arch | GPU | Extract 1280x960 (ms) | Extract 1920x1080 (ms) | Match (ms) | FP32 GFLOPS | BW (GB/s) |
|---|---|---|---|---|---|---|
| Pascal | GTX 1080 Ti | 1.20* | 1.70* | 2.20* | 11340 | 484 |
| Turing | RTX 2080 Ti | 0.42* | 0.56* | 0.30* | 11750 | 616 |
| Ada | RTX 4060 Ti | 0.99 | 1.38 | 0.33 | 22060 | 288 |
\* Values from the original CudaSift benchmarks. Ada values were measured with Driver 595.71 and CUDA 13.1.
```
Input Image (Host -> Device)
                         |
                         v
+--------------------------------------------------+
|  Gaussian Scale Space                            |
|  Octave 0 (full) -> Octave 1 (1/2) -> ... -> N   |
|        |                                         |
|        v                                         |
|  LaplaceMulti: DoG computation                   |
|  (5 scales + 3 border per octave)                |
+--------------------------------------------------+
                         |
                         v
+--------------------------------------------------+
|  Keypoint Detection                              |
|  FindPointsMulti:                                |
|    - 3D extrema detection (26 neighbors)         |
|    - Edge response rejection                     |
|    - Sub-pixel localization (Taylor expansion)   |
+--------------------------------------------------+
                         |
                         v
+--------------------------------------------------+
|  Orientation Assignment                          |
|  ComputeOrientations:                            |
|    - 32-bin gradient histogram                   |
|    - Gaussian-weighted 11x11 window              |
|    - Secondary peak -> duplicate feature         |
+--------------------------------------------------+
                         |
                         v
+--------------------------------------------------+
|  Descriptor Computation                          |
|  ExtractSiftDescriptors:                         |
|    - 4x4 spatial bins x 8 orientations           |
|    - 128-D vector per feature                    |
|    - Two-pass normalization (clip + renorm)      |
+--------------------------------------------------+
                         |
                         v
+--------------------------------------------------+
|  Feature Matching                                |
|  FindMaxCorr10 (brute-force):                    |
|    - 32x32 feature block tiling                  |
|    - float4 vectorized loads                     |
|    - Warp shuffle reductions                     |
|    - Best + second-best tracking (ambiguity)     |
|                                                  |
|  FindHomography (RANSAC):                        |
|    - 4-point DLT on GPU                          |
|    - Parallel hypothesis testing                 |
|    - Iterative refinement (CPU, Cholesky)        |
+--------------------------------------------------+
```

| Kernel | Block Size | Shared Mem | Description |
|---|---|---|---|
| ScaleDown | 68x1 | 2 KB | 2x downsampling with 5-tap Gaussian |
| LaplaceMulti | 136x1 | 4 KB | Multi-scale DoG computation |
| FindPointsMulti | 32x1 | 1 KB | 3D extrema detection + sub-pixel |
| ComputeOrientations | 121x1 | 0.5 KB | Gradient histogram, peak detection |
| ExtractSiftDescriptors | 16x8 | 0.7 KB | 128-D descriptor with trilinear interp |
| FindMaxCorr10 | 32x8 | 32 KB | Tiled brute-force matching |
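The descriptor stage ends with a two-pass normalization (clip + renorm). A minimal CPU sketch of that idea follows: normalize to unit length, clip large components, then normalize again. The clip value of 0.2 is the one from Lowe's paper and is an assumption here, since this document does not state which value the kernel uses; `l2Norm`, `normalize`, and `normalizeTwoPass` are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

using Descriptor = std::array<float, 128>;

// Euclidean length of the descriptor.
inline float l2Norm(const Descriptor& d) {
    float sum = 0.0f;
    for (float v : d) sum += v * v;
    return std::sqrt(sum);
}

// Scales d to unit length in place; an all-zero vector is left untouched.
inline void normalize(Descriptor& d) {
    const float n = l2Norm(d);
    if (n > 0.0f) {
        for (float& v : d) v /= n;
    }
}

// Pass 1: normalize and clip each component. Pass 2: renormalize.
// The default clip of 0.2 follows Lowe (2004) and is an assumption.
inline void normalizeTwoPass(Descriptor& d, float clip = 0.2f) {
    assert(clip > 0.0f);
    normalize(d);
    for (float& v : d) v = std::min(v, clip);
    normalize(d);
}
```

Clipping limits the influence of a few very strong gradients, so a single dominant bin cannot swamp the rest of the descriptor.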
- CUDA Toolkit 11.8+ (the first release that can target sm_89; 12.x recommended for Ada Lovelace)
- OpenCV 4.x
- CMake 3.18+
- C++17 compatible compiler
With the generic CMake build driver:

```bash
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release
```

With Make:

```bash
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
```

With the build script:

```bash
bash scripts/build.sh Release
```

CMake options:

| Option | Default | Description |
|---|---|---|
| `BUILD_TESTS` | ON | Build test and benchmark programs |
| `BUILD_EXAMPLES` | ON | Build example programs |
| `USE_MANAGED_MEM` | OFF | Use CUDA managed memory |
| `VERBOSE_OUTPUT` | ON | Enable verbose timing output |
```bash
# Default (img1.png & img2.png)
./cudasift

# Specify GPU device and image set
./cudasift 0 1   # device 0, PGM image set
```

```bash
./demo_extract [image_path] [gpu_id] [threshold] [num_octaves]
./demo_extract data/img1.png 0 3.0 5
```

Output: `data/keypoints.png` with detected keypoints drawn.

```bash
./demo_match [img1] [img2] [gpu_id]
./demo_match data/img1.png data/img2.png 0
```

Output: `data/matches.png` with match lines drawn between the images.

```bash
./demo_video [source] [gpu_id] [threshold]
./demo_video 0           # Webcam
./demo_video video.mp4   # Video file
```

Keys: `q` quit, `+`/`-` adjust threshold.

```bash
./benchmark [gpu_id] [num_runs] [threshold]
./benchmark 0 200 3.0
```

Outputs performance tables at multiple resolutions with extraction, matching, and upload times.

```bash
# Individual tests
./test_extract      # Feature extraction correctness
./test_match        # Matching and quality tests
./test_homography   # Geometric verification tests

# All tests + benchmark
bash scripts/run_benchmark.sh
```

| Test Suite | Passed | Total | Rate |
|---|---|---|---|
| test_extract | 10 | 10 | 100% |
| test_match | 11 | 11 | 100% |
| test_homography | 8 | 8 | 100% |
| Total | 29 | 29 | 100% |
| Test | What It Verifies |
|---|---|
| BasicExtraction | Features detected, valid positions/scales |
| DifferentThresholds | Higher threshold = fewer features |
| DifferentOctaves | More octaves = more features |
| Reproducibility | Identical results across runs |
| ScaleUp | 2x upsampling detects more features |
| SelfMatch | Self-matching gives perfect scores |
| CrossMatch | Cross-image matching produces valid results |
| Homography | RANSAC + refinement finds inliers |
| Translation | Recovers known translation |
| Rotation | Handles 10 degree rotation |
| Scale | Handles 80% scale change |
| PGMImages | Stereo pair matching |
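The geometric tests (Homography, Translation, Rotation, Scale) come down to checking how many matches agree with a homography. The sketch below counts inliers by reprojection error on the CPU. It assumes a row-major 3x3 matrix stored in `float[9]` that maps a point to its matched position, and that the threshold is a pixel distance; both are assumptions about the convention. `Correspondence`, `reprojectionError`, and `countInliers` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical pair of positions, mirroring xpos/ypos and match_xpos/match_ypos.
struct Correspondence {
    float x, y;    // point in the first image
    float mx, my;  // matched point in the second image
};

// Distance between the projected point H * (x, y, 1) and the matched point.
// Assumes h is a row-major 3x3 homography.
inline float reprojectionError(const float h[9], const Correspondence& c) {
    const float w = h[6] * c.x + h[7] * c.y + h[8];
    if (std::fabs(w) < 1e-12f) return std::numeric_limits<float>::infinity();
    const float px = (h[0] * c.x + h[1] * c.y + h[2]) / w;
    const float py = (h[3] * c.x + h[4] * c.y + h[5]) / w;
    return std::hypot(px - c.mx, py - c.my);
}

// Number of correspondences whose reprojection error is below thresh pixels.
inline int countInliers(const float h[9], const std::vector<Correspondence>& cs,
                        float thresh = 5.0f) {
    assert(thresh > 0.0f);
    int inliers = 0;
    for (const Correspondence& c : cs) {
        if (reprojectionError(h, c) < thresh) ++inliers;
    }
    return inliers;
}
```

For a pure translation by (10, -5), the matrix is `{1, 0, 10, 0, 1, -5, 0, 0, 1}` and every correctly matched point has zero error.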
```cpp
// Initialize CUDA device
void InitCuda(int devNum = 0);

// Allocate/free temporary GPU memory for extraction
float *AllocSiftTempMemory(int width, int height, int numOctaves, bool scaleUp = false);
void FreeSiftTempMemory(float *memoryTmp);

// Extract SIFT features from a GPU image
void ExtractSift(SiftData &siftData, CudaImage &img, int numOctaves,
                 double initBlur, float thresh, float lowestScale = 0.0f,
                 bool scaleUp = false, float *tempMemory = 0);

// Initialize/free SIFT data container
void InitSiftData(SiftData &data, int num = 1024, bool host = false, bool dev = true);
void FreeSiftData(SiftData &data);

// Match two sets of SIFT features on GPU
double MatchSiftData(SiftData &data1, SiftData &data2);

// Find homography using RANSAC
double FindHomography(SiftData &data, float *homography, int *numMatches,
                      int numLoops = 1000, float minScore = 0.85f,
                      float maxAmbiguity = 0.95f, float thresh = 5.0f);
```

```cpp
struct SiftPoint {
    float xpos, ypos;              // Sub-pixel position
    float scale;                   // Feature scale (sigma)
    float sharpness;               // DoG response value
    float edgeness;                // Edge response ratio
    float orientation;             // Dominant orientation (degrees)
    float score;                   // Match correlation score
    float ambiguity;               // Second-best / best ratio
    int match;                     // Index of best match
    float match_xpos, match_ypos;  // Matched point position
    float match_error;             // Reprojection error
    float subsampling;             // Octave subsampling factor
    float data[128];               // 128-D descriptor vector
};

struct SiftData {
    int numPts;         // Number of detected features
    int maxPts;         // Allocated capacity
    SiftPoint *h_data;  // Host pointer
    SiftPoint *d_data;  // Device pointer
};
```

```
CudaSift/
|-- CMakeLists.txt            # Modern CMake build (sm_89)
|-- README.md                 # This file
|-- LICENSE                   # MIT License
|
|-- cudaSift.h                # Public API header
|-- cudaSiftH.cu              # Host-side SIFT pipeline
|-- cudaSiftH.h               # Host function declarations
|-- cudaSiftD.cu              # Device kernels (DoG, keypoints, descriptors)
|-- cudaSiftD.h               # Kernel constants and block sizes
|-- cudaImage.cu              # GPU image container
|-- cudaImage.h               # Image class declaration
|-- cudautils.h               # CUDA utilities (error checking, timers, shuffle)
|-- matching.cu               # Matching kernels + RANSAC homography
|-- geomFuncs.cpp             # CPU homography refinement
|-- mainSift.cpp              # Main demo program
|
|-- examples/
|   |-- demo_extract.cpp      # Single-image extraction demo
|   |-- demo_match.cpp        # Two-image matching demo
|   +-- demo_video.cpp        # Real-time video demo
|
|-- tests/
|   |-- benchmark.cpp         # Multi-resolution performance benchmark
|   |-- test_extract.cpp      # Extraction correctness tests
|   |-- test_match.cpp        # Matching quality tests
|   +-- test_homography.cpp   # Geometric verification tests
|
|-- scripts/
|   |-- build.sh              # Build script
|   +-- run_benchmark.sh      # Run all tests + benchmark
|
|-- data/
|   |-- img1.png              # Test image 1 (1280x960)
|   |-- img2.png              # Test image 2 (1280x960)
|   |-- left.pgm              # Stereo left image
|   +-- righ.pgm              # Stereo right image
|
+-- match.pdf                 # Matching kernel optimization notes
```

This branch includes the following optimizations for the Ada Lovelace architecture:
- sm_89 Compute Target -- Native code generation for RTX 40-series GPUs
- Fast Math -- `--use_fast_math` for all CUDA kernels (intrinsic sin/cos/exp/sqrt)
- Large L2 Cache -- The RTX 4060 Ti has a 32 MB L2 cache, which benefits texture lookups and DoG pyramid reads
- Warp Synchronization -- All warp-level operations use `__shfl_sync` with a full mask
- Optimized Block Sizes -- Tuned for the RTX 4060 Ti's 34 SMs (128 CUDA cores each) and Ada Lovelace occupancy characteristics
- C++17 -- C++17 language standard for both host and CUDA code
- Static Library -- Core SIFT compiled as static library for faster linking
| Parameter | Default | Description |
|---|---|---|
| `numOctaves` | 5 | Number of octaves in scale space |
| `initBlur` | 1.0 | Initial Gaussian blur sigma |
| `thresh` | 3.0 | DoG threshold for keypoint detection |
| `lowestScale` | 0.0 | Minimum scale for features |
| `scaleUp` | false | 2x upsample input for fine features |
| `maxPts` | 32768 | Maximum number of features |
| `minScore` | 0.85 | Minimum match score for RANSAC |
| `maxAmbiguity` | 0.95 | Maximum ambiguity ratio for RANSAC |
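The `minScore` and `maxAmbiguity` parameters decide which matches are allowed to take part in RANSAC. The sketch below shows one plausible reading of that filter with the default values from the table. Whether the comparisons are strict is an assumption, and `MatchStats`, `isReliable`, and `reliableIndices` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical subset of SiftPoint: only the fields the filter needs.
struct MatchStats {
    float score;      // match correlation score
    float ambiguity;  // second-best / best ratio
};

// A match is kept when it correlates strongly and clearly beats the runner-up.
inline bool isReliable(const MatchStats& m, float minScore = 0.85f,
                       float maxAmbiguity = 0.95f) {
    assert(minScore >= 0.0f && maxAmbiguity > 0.0f);
    return m.score > minScore && m.ambiguity < maxAmbiguity;
}

// Indices of the matches that pass the filter, in their original order.
inline std::vector<std::size_t> reliableIndices(const std::vector<MatchStats>& ms,
                                                float minScore = 0.85f,
                                                float maxAmbiguity = 0.95f) {
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < ms.size(); ++i) {
        if (isReliable(ms[i], minScore, maxAmbiguity)) kept.push_back(i);
    }
    return kept;
}
```

Raising `minScore` or lowering `maxAmbiguity` leaves fewer but cleaner matches for the homography estimate.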
- David G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," IJCV, 2004.
- Original CudaSift by Mårten Björkman: https://github.com/Celebrandil/CudaSift
MIT License -- see LICENSE for details.