A high-performance IPC latency measurement tool that measures round-trip latency between Java and C++ processes using shared memory.
The Tail Latency Probe measures the round-trip time (RTT) of messages sent from a Java process to a C++ process and back. The Java process sends timestamps through shared memory, the C++ process echoes them back, and the Java process calculates and visualizes the latency distribution in real-time.
- `LatencyVisualizer.java`: Java GUI application that sends timestamps, receives echoes, and visualizes latency in real time
- `SharedMemory.java`: Java wrapper for lock-free shared-memory operations using `VarHandle`
- `TailLatencyProbe.cpp`: C++ backend that runs in a high-frequency loop, reading from Java and echoing back
- `TLPMonitor.cpp`: Passive monitoring tool that observes buffer health and throughput statistics
The system uses memory-mapped files (`tlp_shm.bin`) to enable zero-copy communication between processes:
- File Size: 128KB (131,072 bytes)
- Memory Layout:
- First 64 bytes: Atomic counters (head/tail pointers for both directions)
- Remaining space: Two lock-free ring buffers (Java to C++ and C++ to Java)
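The layout above can be captured as a handful of constants. This is a minimal sketch; the constant names and the exact split of the 64-byte header are assumptions, not the project's actual source:

```java
// Illustrative shared-memory layout constants (names are assumptions).
public class ShmLayout {
    static final int FILE_SIZE   = 128 * 1024;              // 131,072 bytes total
    static final int HEADER_SIZE = 64;                      // atomic head/tail counters
    static final int ENTRY_SIZE  = 8;                       // one 64-bit timestamp per slot
    static final int CAPACITY    = 4096;                    // entries per ring buffer
    static final int RING_BYTES  = CAPACITY * ENTRY_SIZE;   // 32 KB per direction
    static final int J2C_OFFSET  = HEADER_SIZE;             // Java -> C++ ring
    static final int C2J_OFFSET  = HEADER_SIZE + RING_BYTES;// C++ -> Java ring

    public static void main(String[] args) {
        // Header plus two 32 KB rings fit comfortably inside the 128 KB file.
        System.out.println(HEADER_SIZE + 2 * RING_BYTES); // 65600
        System.out.println(FILE_SIZE);                    // 131072
    }
}
```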
Both communication channels use lock-free ring buffers with atomic head/tail pointers:
- Buffer Capacity: 4,096 entries (each entry is 8 bytes)
- No Mutexes: All synchronization is done through atomic operations
- Wait-Free Reads: Readers never block writers and vice versa
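A single-producer/single-consumer ring in this style might look like the following sketch. The byte offsets, method names, and the use of a plain direct buffer in place of the real `MappedByteBuffer` are assumptions; a production layout would also pad the counters onto separate cache lines:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// SPSC ring over a direct buffer; in the real tool the buffer would be a
// MappedByteBuffer over tlp_shm.bin. Offsets and names are illustrative.
public class SpscRing {
    static final VarHandle LONGS =
        MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());
    static final int CAPACITY = 4096;               // entries (power of two)
    static final int MASK = CAPACITY - 1;           // cheap wrap-around
    static final int HEAD = 0, TAIL = 8, DATA = 64; // byte offsets (one direction only)

    final ByteBuffer shm;

    SpscRing(ByteBuffer shm) { this.shm = shm; }

    /** Producer: returns false instead of blocking when the ring is full. */
    boolean offer(long value) {
        long head = (long) LONGS.get(shm, HEAD);        // own counter: relaxed read
        long tail = (long) LONGS.getAcquire(shm, TAIL); // consumer's counter
        if (head - tail == CAPACITY) return false;      // full
        LONGS.set(shm, DATA + (int) (head & MASK) * 8, value); // plain slot write
        LONGS.setRelease(shm, HEAD, head + 1);          // publish the slot
        return true;
    }

    /** Consumer: returns null when the ring is empty. */
    Long poll() {
        long tail = (long) LONGS.get(shm, TAIL);        // own counter: relaxed read
        long head = (long) LONGS.getAcquire(shm, HEAD); // producer's counter
        if (tail == head) return null;                  // empty
        long v = (long) LONGS.get(shm, DATA + (int) (tail & MASK) * 8);
        LONGS.setRelease(shm, TAIL, tail + 1);          // free the slot
        return v;
    }
}
```

The acquire on the remote counter pairs with the release on the local one, so the slot contents are guaranteed visible before the counter update that publishes them.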
The implementation uses acquire/release memory ordering to ensure correct visibility without full memory barriers:
- Acquire Semantics: When reading remote head/tail pointers, ensures all previous writes are visible
- Release Semantics: When updating head/tail pointers, ensures all previous writes are visible to other threads/processes
- Relaxed Semantics: Used for local counters where ordering isn't critical
Java Implementation:
- Uses `VarHandle` with `getAcquire()` and `setRelease()` for fine-grained memory-ordering control
- Avoids the heavier full fences implied by `synchronized` blocks and `volatile` variables
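The acquire/release pairing on a head pointer can be illustrated with a plain `VarHandle`; the class and field names here are illustrative, not the project's:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Minimal illustration of VarHandle acquire/release (names are assumptions).
public class HeadPointer {
    long head;                       // no volatile needed; the VarHandle orders access
    static final VarHandle HEAD;
    static {
        try {
            HEAD = MethodHandles.lookup()
                    .findVarHandle(HeadPointer.class, "head", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void publish(long next) {
        HEAD.setRelease(this, next);          // prior writes visible before this store
    }

    long observe() {
        return (long) HEAD.getAcquire(this);  // reads after this see the published writes
    }
}
```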
C++ Implementation:
- Uses `std::atomic` with `memory_order_acquire` and `memory_order_release`
- The compiler generates the appropriate CPU instructions automatically
The C++ process uses OS-level optimizations to minimize latency:
- Thread Affinity: Pinned to a specific CPU core (default Core 0) to avoid context switching overhead
- Real-Time Priority: Runs at `REALTIME_PRIORITY_CLASS` on Windows to reduce scheduling delays
Both processes use spin-waiting instead of blocking:
- Java: Uses `Thread.onSpinWait()` (JVM hint that maps to the CPU pause instruction)
- C++: Uses `YieldProcessor()` (the x86 `PAUSE` instruction)
- No Sleep: Eliminates sleep/wake overhead for maximum throughput
The `PAUSE` instruction tells the CPU it is in a spin loop, which reduces power consumption, avoids memory-order mis-speculation pipeline flushes when the loop exits, and frees execution resources for the sibling hyper-thread.
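On the Java side, the spin-wait idiom looks like the sketch below. The flag-based example is illustrative; the real code polls the ring-buffer counters instead:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Busy-wait sketch: poll a condition without sleeping, hinting the CPU each pass.
public class SpinWait {
    public static void awaitTrue(AtomicBoolean flag) {
        while (!flag.get()) {
            Thread.onSpinWait();  // JIT emits PAUSE on x86; a no-op elsewhere
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean done = new AtomicBoolean(false);
        Thread worker = new Thread(() -> done.set(true));
        worker.start();
        awaitTrue(done);   // spins (never sleeps) until the worker sets the flag
        worker.join();
    }
}
```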
Data is written directly to shared memory buffers with no intermediate copies or serialization. Timestamps are passed as raw 64-bit integers.
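A sketch of the timestamp arithmetic, assuming the send time comes from `System.nanoTime()` and is echoed back unchanged (the class name is hypothetical):

```java
// Timestamps travel as raw 64-bit integers: no serialization, just a long slot.
public class Rtt {
    /** RTT given the echoed send-timestamp (both sides of the diff from System.nanoTime()). */
    static long rttNanos(long echoedSendTime) {
        return System.nanoTime() - echoedSendTime;
    }

    public static void main(String[] args) {
        long sent = System.nanoTime();
        // ... the round trip through shared memory would happen here ...
        System.out.println("RTT ns: " + rttNanos(sent));
    }
}
```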
- Java: JDK 11+
- C++ Compiler: MSVC (Windows)
C++ Backend:
```
cl /O2 /EHsc TailLatencyProbe.cpp
```

Java Application:

```
javac *.java
```

1. Start the Java process first (it creates and initializes the shared memory file):

   ```
   java LatencyVisualizer
   ```

2. Start the C++ backend (it waits for the shared memory file to exist):

   ```
   TailLatencyProbe.exe
   ```
The Java GUI displays real-time latency measurements: rolling window statistics for the last 500 samples (min, max, average) and lifetime tail latency percentiles (P50, P99, P99.9, P99.99).
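The lifetime percentiles could be computed with a nearest-rank method like the sketch below; the visualizer's actual method is not shown in this document, so treat this as illustrative:

```java
import java.util.Arrays;

// Nearest-rank percentile sketch, assuming samples are RTTs in nanoseconds.
public class Percentiles {
    /** Smallest sample with at least p percent of the data at or below it. */
    static long percentile(long[] sorted, double p) {
        int idx = (int) Math.ceil(p * sorted.length / 100.0) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        long[] samples = new long[10_000];
        for (int i = 0; i < samples.length; i++) samples[i] = i + 1; // 1..10000 ns
        Arrays.sort(samples);
        System.out.println("P50:  " + percentile(samples, 50.0));   // 5000
        System.out.println("P99:  " + percentile(samples, 99.0));   // 9900
        System.out.println("P99.9:" + percentile(samples, 99.9));
    }
}
```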
- Throughput: Can handle hundreds of thousands of round-trips per second
- Latency: Typical RTT in the hundreds of nanoseconds range (depending on CPU)
- Tail Latency: The visualization helps identify latency spikes and distribution patterns
The C++ process opens the shared memory file with `OPEN_EXISTING`, ensuring Java creates and initializes it first. This prevents the race condition where C++ might zero out the file while Java is writing.
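The Java side of that handshake might look like this sketch: create the file, size it (which zero-fills it), and map it before the C++ process attaches. The class name and use of `RandomAccessFile` are assumptions:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Create + size + map the shared-memory file; the mapping stays valid after
// the channel is closed. Names here are illustrative.
public class ShmCreate {
    public static MappedByteBuffer create(File f, int size) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            raf.setLength(size);  // extends the file with zero bytes
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
    }
}
```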
When buffers are full, both processes spin-wait using CPU pause instructions rather than blocking, maintaining low latency even under high load.