I am enthusiastic, in principle, about io-uring asynchronous IO - which seems an extremely good fit for network communication. The appeal of the asynchronous idiom is that eliminating blocking calls avoids the need to run more application threads than there are (hyperthreaded) cores on the CPU in order to maximise throughput. The complication, of course, is that I must avoid all blocking calls - not just blocking calls relating to network IO - if I'm to derive the full benefit.
If I had used a traditional design, with one application thread (perhaps drawn from a thread pool) per conceptual interaction... I might have implemented read access to durable storage using memory-mapped IO. While the simplicity of memory-mapped IO seems desirable, the prospect of my worker thread blocking while it awaits my mapped file/device is extremely undesirable.
I like the idea that, with memory-mapped IO, the OS should cache as much of my file as fits conveniently into memory - taking into account the requirements of all active processes on the host. I dislike the fact that, when my process accesses a page that is not cached in RAM, it will block until that page has been read into memory.
I wish that I could asynchronously suspend my current thread if a page were not in memory... and resume (on a completion queue event) when that page had been loaded. I realise that if I were to implement an application-level Least-Recently-Used (LRU) page cache... and populate it using the io-uring API to read individual pages... then I could get the asynchronous non-blocking behaviour I want. The downside (besides the effort required to implement the application-level LRU cache) is that my process would not cooperate with other processes to ensure that RAM (for caching) is shared sensibly between all processes on the host.
Am I overlooking a mechanism by which I can get the best of both worlds... with only asynchronous (non-blocking) calls to minimise the number of threads I need for maximum throughput... and an LRU cache for pages of my (huge) data file that exploits as much RAM as is available (without adversely affecting other processes) on my host?
Are programs that adopt the asynchronous idiom forced to choose between using mmap (to gain benefit from the OS' shared LRU cache when reading pages from large files) and implementing an application-level page cache populated with io_uring_prep_read()?
Discoveries since asking this question:
- It seems the idea of an 'asynchronous page fault' has existed for ~15 years in the context of KVM/QEMU... It's not clear, to me, how such kernel internals might relate to io-uring features.
- In 2021, this "Support Asynchronous Page Fault" message suggests that modern Linux kernels may support (something like) the sort of feature I want... but it's not clear, to me, how existing features could be used in the context of io-uring oriented asynchronous programming.
- I've discovered this man page for userfaultfd(2) - and this one for ioctl_userfaultfd(2). The example code uses poll() rather than io-uring - but, perhaps, an asynchronous read of the userfaultfd descriptor could afford io-uring notification about page-fault events. I've read this kernel documentation - though it remains unclear, to me, how I'd use such features, within io-uring asynchronous idioms, to avoid the faulting thread blocking when a page on disk happens not to be cached.
I think there is a way for user-space code to be informed about page faults - but I don't see a way to avoid blocking the thread that accesses the virtual address of an uncached page. I don't see a way for my user-space code to react to a page fault by calling co_await (or any comparable asynchronous idiom) rather than blocking until the page fault is resolved. Nor can I see a strategy to indicate that a page will be required - but where the thread that establishes this requirement does something other than sleep awaiting the page to become available.
Section 3.2 of this 2022 paper seems to offer circumstantial evidence for the idea that io-uring and mmap are somewhat incompatible.
You can use preadv2(…, RWF_NOWAIT) to read from the page cache without blocking, which then gives you the opportunity to defer work to a work queue, uring or whatever. I guess you could also check for presence in the page cache with mincore(), but between checking and accessing there is always time to lose a page. If uring could do mlock() that would help, but I don't think it supports that at the moment.