ep186282/TailLatencyProbe

Tail Latency Probe (TLP)

A high-performance IPC latency measurement tool that measures round-trip latency between Java and C++ processes using shared memory.

[Architecture diagram]

The Tail Latency Probe measures the round-trip time (RTT) of messages sent from a Java process to a C++ process and back. The Java process sends timestamps through shared memory, the C++ process echoes them back, and the Java process calculates and visualizes the latency distribution in real-time.

Here are updated measurements with percentile data:

[Latency percentiles screenshot]

[Latency visualization screenshot]

Architecture

  1. LatencyVisualizer.java - Java GUI application that sends timestamps, receives echoes, and visualizes latency in real-time
  2. SharedMemory.java - Java wrapper for lock-free shared memory operations using VarHandle
  3. TailLatencyProbe.cpp - C++ backend that runs in a high-frequency loop, reading from Java and echoing back
  4. TLPMonitor.cpp - Passive monitoring tool that observes buffer health and throughput statistics

Key Techniques

Shared Memory IPC

The system uses memory-mapped files (tlp_shm.bin) to enable zero-copy communication between processes:

  • File Size: 128KB (131,072 bytes)
  • Memory Layout:
    • First 64 bytes: Atomic counters (head/tail pointers for both directions)
    • Remaining space: Two lock-free ring buffers (Java to C++ and C++ to Java)
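The mapping step can be sketched in a few lines of Java. This is an illustrative standalone sketch, not the repo's `SharedMemory.java`: the class name, the demo file name (`tlp_shm_demo.bin` rather than the real `tlp_shm.bin`), and the offset used are assumptions for demonstration; only the 128KB size and 64-byte header come from the description above.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ShmMapDemo {
    static final int SHM_SIZE = 128 * 1024;                 // 128KB (131,072 bytes)
    static final int HEADER_SIZE = 64;                      // atomic counters live here
    static final int RING_REGION = SHM_SIZE - HEADER_SIZE;  // split between the two rings

    public static void main(String[] args) throws IOException {
        Path file = Path.of("tlp_shm_demo.bin");            // demo file; real tool uses tlp_shm.bin
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping extends the file to SHM_SIZE and gives a zero-copy view
            // that another process mapping the same file can also see.
            MappedByteBuffer shm = ch.map(FileChannel.MapMode.READ_WRITE, 0, SHM_SIZE);
            shm.putLong(0, 0L);                             // e.g. initialize a counter at offset 0
            System.out.println(shm.getLong(0));
        }
    }
}
```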

Lock-Free Ring Buffers

Both communication channels use lock-free ring buffers with atomic head/tail pointers:

  • Buffer Capacity: 4,096 entries (each entry is 8 bytes)
  • No Mutexes: All synchronization is done through atomic operations
  • Wait-Free Operations: Each read or write completes in a bounded number of steps; readers never block writers and vice versa
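The single-producer/single-consumer pattern behind each channel can be sketched as follows. This is a minimal heap-array illustration under assumed names, not the repo's actual code (the real rings live in the mapped file and are accessed through VarHandles): the writer publishes a slot with a release store of the tail, and the reader's acquire load of the tail makes the slot contents visible.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Minimal single-producer/single-consumer lock-free ring: 4,096 x 8-byte slots. */
public class SpscRing {
    static final int CAPACITY = 4096;          // power of two, so wrapping is a mask
    static final int MASK = CAPACITY - 1;
    final long[] slots = new long[CAPACITY];
    final AtomicLong head = new AtomicLong();  // next slot the consumer reads
    final AtomicLong tail = new AtomicLong();  // next slot the producer writes

    /** Producer side: returns false instead of blocking when the ring is full. */
    boolean offer(long value) {
        long t = tail.getPlain();                            // only this thread writes tail
        if (t - head.getAcquire() == CAPACITY) return false; // full
        slots[(int) (t & MASK)] = value;                     // plain write to the slot...
        tail.setRelease(t + 1);                              // ...published by the release store
        return true;
    }

    /** Consumer side: returns null when the ring is empty. */
    Long poll() {
        long h = head.getPlain();                            // only this thread writes head
        if (h == tail.getAcquire()) return null;             // empty
        long v = slots[(int) (h & MASK)];                    // visible thanks to acquire above
        head.setRelease(h + 1);                              // frees the slot for the producer
        return v;
    }
}
```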

Memory Ordering Semantics

The implementation uses acquire/release memory ordering to ensure correct visibility without full memory barriers:

  • Acquire Semantics: Used when reading the remote head/tail pointer; guarantees that all writes made before the matching release store are visible before any subsequent reads
  • Release Semantics: Used when publishing a head/tail update; guarantees that all prior writes (the buffer payload) become visible to any thread/process that acquires the pointer
  • Relaxed Semantics: Used for purely local counters where ordering isn't critical

Java Implementation:

  • Uses VarHandle with getAcquire() and setRelease() for fine-grained memory ordering control
  • Avoids the overhead of synchronized blocks and the full sequentially consistent ordering imposed by volatile fields
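The VarHandle pattern described above can be shown against a plain direct buffer (a stand-in for the real mapped file; the class name and byte offsets here are illustrative). A single byte-buffer view VarHandle can address any aligned 64-bit counter in the header with acquire/release semantics:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class VarHandleDemo {
    // One view VarHandle addresses any aligned long in a direct buffer by byte offset.
    static final VarHandle LONGS =
            MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(64);     // stand-in for the 64-byte header
        LONGS.setRelease(buf, 0, 123L);                     // release store at byte offset 0
        long v = (long) LONGS.getAcquire(buf, 0);           // acquire load pairs with the store
        System.out.println(v);
    }
}
```

Atomic access modes on view VarHandles require a direct buffer and aligned offsets, which a memory-mapped header layout naturally satisfies.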

C++ Implementation:

  • Uses std::atomic with memory_order_acquire and memory_order_release
  • The compiler emits the appropriate fences or ordered instructions for the target CPU automatically

Thread Affinity & Priority

The C++ process uses OS-level optimizations to minimize latency:

  • Thread Affinity: Pinned to a specific CPU core (default Core 0) to avoid context switching overhead
  • Real-Time Priority: Runs at REALTIME_PRIORITY_CLASS on Windows to reduce scheduling delays

Spin-Waiting

Both processes use spin-waiting instead of blocking:

  • Java: Uses Thread.onSpinWait() (JVM hint for CPU pause instruction)
  • C++: Uses YieldProcessor() (a Windows macro that emits the x86 PAUSE instruction)
  • No Sleep: Eliminates sleep/wake overhead for maximum throughput

The PAUSE instruction tells the CPU we're in a spin loop: it reduces power consumption, avoids the memory-order mis-speculation pipeline flush when the loop exits, and frees execution resources for the sibling thread on hyper-threaded cores.
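The Java side of this spin-wait can be sketched as below (class and method names are illustrative, not the repo's actual code): a loop that polls a flag with an acquire load and issues the spin hint on every iteration, which the JIT compiles to PAUSE on x86.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinWaitDemo {
    /** Spin until the flag is set, hinting the CPU each iteration; returns the spin count. */
    static long spinUntil(AtomicBoolean flag) {
        long spins = 0;
        while (!flag.getAcquire()) {   // acquire load pairs with the setter's release store
            Thread.onSpinWait();       // JIT emits PAUSE on x86; no sleep, no syscall
            spins++;
        }
        return spins;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean ready = new AtomicBoolean(false);
        Thread echoer = new Thread(() -> ready.setRelease(true)); // stands in for the C++ echo
        echoer.start();
        long spins = spinUntil(ready);
        echoer.join();
        System.out.println("echo observed after " + spins + " spins");
    }
}
```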

Zero-Copy Design

Data is written directly to shared memory buffers with no intermediate copies or serialization. Timestamps are passed as raw 64-bit integers.

Building and Running

Prerequisites

  • Java: JDK 11+
  • C++ Compiler: MSVC (Windows)

Compilation

C++ Backend:

cl /O2 /EHsc TailLatencyProbe.cpp

Java Application:

javac *.java

Execution

  1. Start Java process first (it creates and initializes the shared memory file):

    java LatencyVisualizer
  2. Start C++ backend (it waits for the shared memory file to exist):

    TailLatencyProbe.exe

The Java GUI displays real-time latency measurements: rolling window statistics for the last 500 samples (min, max, average) and lifetime tail latency percentiles (P50, P99, P99.9, P99.99).
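The tail-percentile readout can be illustrated with a nearest-rank computation over recorded RTT samples. This is one common method, sketched under assumed names; the repo's actual aggregation code isn't shown here.

```java
import java.util.Arrays;

public class PercentileDemo {
    /** Nearest-rank percentile over a copy of the samples (illustrative only). */
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // nearest-rank method
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Hypothetical RTT samples in nanoseconds, including one tail spike.
        long[] rtts = {110, 95, 3000, 120, 105, 130, 100, 90, 115, 125};
        System.out.println("P50: " + percentile(rtts, 50));  // median of the distribution
        System.out.println("P99: " + percentile(rtts, 99));  // dominated by the spike
    }
}
```

A single outlier barely moves the median but defines P99, which is exactly why the tool tracks high percentiles rather than averages.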

Performance

  • Throughput: Can handle hundreds of thousands of round-trips per second
  • Latency: Typical RTT in the hundreds of nanoseconds range (depending on CPU)
  • Tail Latency: The visualization helps identify latency spikes and distribution patterns

Technical Details

Race Condition Prevention

The C++ process uses OPEN_EXISTING when opening the shared memory file, ensuring Java creates and initializes it first. This prevents the race condition where C++ might zero out the file while Java is writing.

Buffer Full Handling

When buffers are full, both processes spin-wait using CPU pause instructions rather than blocking, maintaining low latency even under high load.

About

Cross-language latency profiler using shared memory ring buffers and lock-free synchronization to achieve efficient communication between Java and C++ services
