I am enthusiastic, in principle, about io-uring asynchronous IO - which seems an extremely good fit for network communication. The appeal of the asynchronous idiom is that eliminating blocking calls avoids the need to run more application threads than there are (hyperthreaded) cores on the CPU in order to maximise throughput. The complication, of course, is that I must avoid all blocking calls - not just blocking calls relating to network IO - if I'm to derive the full benefit.
If I had used a traditional design, with one application thread (perhaps drawn from a thread pool) per conceptual interaction... I might have implemented read access to durable storage using memory-mapped IO. While the simplicity of memory-mapped IO seems desirable, the prospect of my worker thread blocking while it awaits my mapped file/device is extremely undesirable.
I like the idea that, with memory-mapped IO, the OS should cache as much of my file as fits conveniently into memory - taking into account the requirements of all active processes on the host. I dislike the fact that, when my process accesses a page that is not cached in RAM, it will block until that page has been read into memory.
I wish that I could asynchronously suspend my current thread if a page were not in memory... and resume (on a completion queue event) when that page had been loaded. I realise that if I were to implement an application-level Least-Recently-Used (LRU) page cache... and populate it using the io-uring API to read individual pages... then I could get the asynchronous non-blocking behaviour I want. The downside (besides the effort required to implement the application-level LRU cache) is that my process would not cooperate with other processes to ensure that RAM (for caching) is shared sensibly between all processes on the host.
Am I overlooking a mechanism by which I can get the best of both worlds... with only asynchronous (non-blocking) calls to minimise the number of threads I need for maximum throughput... and an LRU cache for pages of my (huge) data file that exploits as much RAM as is available (without adversely affecting other processes) on my host?
Are programs that adopt the asynchronous idiom forced to choose between using mmap (to gain benefit from the OS' shared LRU cache when reading pages from large files) and implementing an application-level page cache populated with io_uring_prep_read()?
Discoveries since asking this question:
- It seems the idea of an 'asynchronous page fault' has existed for ~15 years in the context of KVM/QEMU... It's not clear, to me, how such kernel internals might relate to io-uring features.
- In 2021, this "Support Asynchronous Page Fault" message suggests that modern Linux kernels may support (something like) the sort of feature I want... but it's not clear, to me, how existing features could be used in the context of io-uring oriented asynchronous programming.
- I've discovered this man page for userfaultfd(2) - and this one for ioctl_userfaultfd(2). The example code uses poll() rather than io-uring - but, perhaps, an asynchronous read of the userfaultfd descriptor could afford io-uring notification about page-fault events. I've read this kernel documentation - though it remains unclear, to me, how I'd use such features, within io-uring asynchronous idioms, to avoid the faulting thread blocking when a page on disk happens not to be cached.
I think there is a way for user-space code to be informed about page faults - but I don't see a way to avoid blocking the thread that accesses the virtual address of an uncached page. I don't see a way for my user-space code to react to a page fault by calling co_await (or any comparable asynchronous idiom) rather than blocking until the page fault is resolved. Nor can I see a strategy to indicate that a page will be required - but where the thread that establishes this requirement does something other than sleep awaiting the page to become available.
Section 3.2 of this 2022 paper seems to offer circumstantial evidence for the idea that io-uring and mmap are somewhat incompatible.
You can use preadv2(…, RWF_NOWAIT) to read from the page cache without blocking, which then gives you the opportunity to defer work to a work queue, uring or whatever. I guess you could also check for presence in the page cache with mincore(), but between checking and accessing there is always time to lose a page. If uring could do mlock() that would help, but I don't think it supports that at the moment.