A high-performance IPC latency measurement tool that measures round-trip latency between Java and C++ processes using shared memory.
The Tail Latency Probe measures the round-trip time (RTT) of messages sent from a Java process to a C++ process and back. The Java process sends timestamps through shared memory, the C++ process echoes them back, and the Java process calculates and visualizes the latency distribution in real-time.
- `LatencyVisualizer.java`: Java GUI application that sends timestamps, receives echoes, and visualizes latency in real time
- `SharedMemory.java`: Java wrapper for lock-free shared-memory operations using `VarHandle`
- `TailLatencyProbe.cpp`: C++ backend that runs in a high-frequency loop, reading from Java and echoing back
- `TLPMonitor.cpp`: Passive monitoring tool that observes buffer health and throughput statistics
The system uses memory-mapped files (`tlp_shm.bin`) to enable zero-copy communication between processes:
- File Size: 128KB (131,072 bytes)
- Memory Layout:
- First 64 bytes: Atomic counters (head/tail pointers for both directions)
- Remaining space: Two lock-free ring buffers (Java to C++ and C++ to Java)
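The layout above can be captured as a handful of constants. This is a minimal sketch; the constant names and the exact split of the 64-byte header are assumptions, not the project's actual source:

```java
// Illustrative shared-memory layout constants (names are assumptions).
public class ShmLayout {
    static final int FILE_SIZE   = 128 * 1024;              // 131,072 bytes total
    static final int HEADER_SIZE = 64;                      // atomic head/tail counters
    static final int ENTRY_SIZE  = 8;                       // one 64-bit timestamp per slot
    static final int CAPACITY    = 4096;                    // entries per ring buffer
    static final int RING_BYTES  = CAPACITY * ENTRY_SIZE;   // 32 KB per direction
    static final int J2C_OFFSET  = HEADER_SIZE;             // Java -> C++ ring
    static final int C2J_OFFSET  = HEADER_SIZE + RING_BYTES;// C++ -> Java ring

    public static void main(String[] args) {
        // Header plus two 32 KB rings fit comfortably inside the 128 KB file.
        System.out.println(HEADER_SIZE + 2 * RING_BYTES); // 65600
        System.out.println(FILE_SIZE);                    // 131072
    }
}
```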
Both communication channels use lock-free ring buffers with atomic head/tail pointers:
- Buffer Capacity: 4,096 entries (each entry is 8 bytes)
- No Mutexes: All synchronization is done through atomic operations
- Wait-Free Reads: Readers never block writers and vice versa
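A single-producer/single-consumer ring in this style might look like the following sketch. The byte offsets, method names, and the use of a plain direct buffer in place of the real `MappedByteBuffer` are assumptions; a production layout would also pad the counters onto separate cache lines:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// SPSC ring over a direct buffer; in the real tool the buffer would be a
// MappedByteBuffer over tlp_shm.bin. Offsets and names are illustrative.
public class SpscRing {
    static final VarHandle LONGS =
        MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());
    static final int CAPACITY = 4096;               // entries (power of two)
    static final int MASK = CAPACITY - 1;           // cheap wrap-around
    static final int HEAD = 0, TAIL = 8, DATA = 64; // byte offsets (one direction only)

    final ByteBuffer shm;

    SpscRing(ByteBuffer shm) { this.shm = shm; }

    /** Producer: returns false instead of blocking when the ring is full. */
    boolean offer(long value) {
        long head = (long) LONGS.get(shm, HEAD);        // own counter: relaxed read
        long tail = (long) LONGS.getAcquire(shm, TAIL); // consumer's counter
        if (head - tail == CAPACITY) return false;      // full
        LONGS.set(shm, DATA + (int) (head & MASK) * 8, value); // plain slot write
        LONGS.setRelease(shm, HEAD, head + 1);          // publish the slot
        return true;
    }

    /** Consumer: returns null when the ring is empty. */
    Long poll() {
        long tail = (long) LONGS.get(shm, TAIL);        // own counter: relaxed read
        long head = (long) LONGS.getAcquire(shm, HEAD); // producer's counter
        if (tail == head) return null;                  // empty
        long v = (long) LONGS.get(shm, DATA + (int) (tail & MASK) * 8);
        LONGS.setRelease(shm, TAIL, tail + 1);          // free the slot
        return v;
    }
}
```

The acquire on the remote counter pairs with the release on the local one, so the slot contents are guaranteed visible before the counter update that publishes them.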
The implementation uses acquire/release memory ordering to ensure correct visibility without full memory barriers:
- Acquire Semantics: When reading remote head/tail pointers, ensures all previous writes are visible
- Release Semantics: When updating head/tail pointers, ensures all previous writes are visible to other threads/processes
- Relaxed Semantics: Used for local counters where ordering isn't critical
Java Implementation:
- Uses `VarHandle` with `getAcquire()` and `setRelease()` for fine-grained memory-ordering control
- Avoids the heavier full fences implied by `synchronized` blocks and `volatile` variables
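The acquire/release pairing on a head pointer can be illustrated with a plain `VarHandle`; the class and field names here are illustrative, not the project's:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Minimal illustration of VarHandle acquire/release (names are assumptions).
public class HeadPointer {
    long head;                       // no volatile needed; the VarHandle orders access
    static final VarHandle HEAD;
    static {
        try {
            HEAD = MethodHandles.lookup()
                    .findVarHandle(HeadPointer.class, "head", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void publish(long next) {
        HEAD.setRelease(this, next);          // prior writes visible before this store
    }

    long observe() {
        return (long) HEAD.getAcquire(this);  // reads after this see the published writes
    }
}
```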
C++ Implementation:
- Uses `std::atomic` with `memory_order_acquire` and `memory_order_release`
- The compiler generates the appropriate CPU instructions automatically
The C++ process uses OS-level optimizations to minimize latency:
- Thread Affinity: Pinned to a specific CPU core (default Core 0) to avoid context switching overhead
- Real-Time Priority: Runs at `REALTIME_PRIORITY_CLASS` on Windows to reduce scheduling delays
Both processes use spin-waiting instead of blocking:
- Java: Uses `Thread.onSpinWait()` (JVM hint that maps to the CPU pause instruction)
- C++: Uses `YieldProcessor()` (the x86 `PAUSE` instruction)
- No Sleep: Eliminates sleep/wake overhead for maximum throughput
The `PAUSE` instruction tells the CPU it is in a spin loop, which reduces power consumption, avoids memory-order mis-speculation pipeline flushes when the loop exits, and frees execution resources for the sibling hyper-thread.
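On the Java side, the spin-wait idiom looks like the sketch below. The flag-based example is illustrative; the real code polls the ring-buffer counters instead:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Busy-wait sketch: poll a condition without sleeping, hinting the CPU each pass.
public class SpinWait {
    public static void awaitTrue(AtomicBoolean flag) {
        while (!flag.get()) {
            Thread.onSpinWait();  // JIT emits PAUSE on x86; a no-op elsewhere
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean done = new AtomicBoolean(false);
        Thread worker = new Thread(() -> done.set(true));
        worker.start();
        awaitTrue(done);   // spins (never sleeps) until the worker sets the flag
        worker.join();
    }
}
```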
Data is written directly to shared memory buffers with no intermediate copies or serialization. Timestamps are passed as raw 64-bit integers.
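A sketch of the timestamp arithmetic, assuming the send time comes from `System.nanoTime()` and is echoed back unchanged (the class name is hypothetical):

```java
// Timestamps travel as raw 64-bit integers: no serialization, just a long slot.
public class Rtt {
    /** RTT given the echoed send-timestamp (both sides of the diff from System.nanoTime()). */
    static long rttNanos(long echoedSendTime) {
        return System.nanoTime() - echoedSendTime;
    }

    public static void main(String[] args) {
        long sent = System.nanoTime();
        // ... the round trip through shared memory would happen here ...
        System.out.println("RTT ns: " + rttNanos(sent));
    }
}
```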
- Java: JDK 11+
- C++ Compiler: MSVC (Windows)
C++ Backend:
```
cl /O2 /EHsc TailLatencyProbe.cpp
```

Java Application:

```
javac *.java
```

1. Start the Java process first (it creates and initializes the shared memory file):

   ```
   java LatencyVisualizer
   ```

2. Start the C++ backend (it waits for the shared memory file to exist):

   ```
   TailLatencyProbe.exe
   ```
The Java GUI displays real-time latency measurements: rolling window statistics for the last 500 samples (min, max, average) and lifetime tail latency percentiles (P50, P99, P99.9, P99.99).
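The lifetime percentiles could be computed with a nearest-rank method like the sketch below; the visualizer's actual method is not shown in this document, so treat this as illustrative:

```java
import java.util.Arrays;

// Nearest-rank percentile sketch, assuming samples are RTTs in nanoseconds.
public class Percentiles {
    /** Smallest sample with at least p percent of the data at or below it. */
    static long percentile(long[] sorted, double p) {
        int idx = (int) Math.ceil(p * sorted.length / 100.0) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        long[] samples = new long[10_000];
        for (int i = 0; i < samples.length; i++) samples[i] = i + 1; // 1..10000 ns
        Arrays.sort(samples);
        System.out.println("P50:  " + percentile(samples, 50.0));   // 5000
        System.out.println("P99:  " + percentile(samples, 99.0));   // 9900
        System.out.println("P99.9:" + percentile(samples, 99.9));
    }
}
```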
- Throughput: Can handle hundreds of thousands of round-trips per second
- Latency: Typical RTT in the hundreds of nanoseconds range (depending on CPU)
- Tail Latency: The visualization helps identify latency spikes and distribution patterns
The C++ process opens the shared memory file with `OPEN_EXISTING`, ensuring Java creates and initializes it first. This prevents the race condition where C++ might zero out the file while Java is writing.
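The Java side of that handshake might look like this sketch: create the file, size it (which zero-fills it), and map it before the C++ process attaches. The class name and use of `RandomAccessFile` are assumptions:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Create + size + map the shared-memory file; the mapping stays valid after
// the channel is closed. Names here are illustrative.
public class ShmCreate {
    public static MappedByteBuffer create(File f, int size) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            raf.setLength(size);  // extends the file with zero bytes
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
    }
}
```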
When buffers are full, both processes spin-wait using CPU pause instructions rather than blocking, maintaining low latency even under high load.