Context: I need to read a huge file (~200 GB), starting from an offset that is not a multiple of 512 (or of the block size), in chunks into a buffer as fast and as efficiently as possible. A networking application uses the data in this buffer to create packets and send them out of the NIC. My sending rate is > 30 Gbps, so the buffer must always have data available so that the sender never runs out of data.
SSD details: KIOXIA XG8 Series (M.2) with 7000 MB/s (56 Gbps) maximum sequential read speed.
Initially I was using libaio with O_DIRECT, and I was able to read fast enough (~50 Gbps) to keep my application running. Since O_DIRECT enforces alignment requirements, I am looking for alternatives. Without O_DIRECT (plain O_RDONLY), I can read the file at the required offset, but that is very slow (the reason could be that aio without O_DIRECT is no longer asynchronous I/O; see the io_uring white paper).
As a (hopefully better) alternative, I am using liburing to read my file. I use O_RDONLY since it allows reading at any offset. However, the read speed is too low.
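(For reference, the extra bookkeeping that O_DIRECT would force on me looks roughly like the untested sketch below: round the offset down to the block size, issue a fully aligned read, and discard the leading bytes. The 512-byte alignment and the helper name are assumptions for illustration; this is exactly what I would like to avoid.)

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <stdexcept>

static constexpr uint64_t kAlign = 512;  // assumed logical block size of the device

// Read `len` bytes at an arbitrary `offset` from a file opened with O_RDONLY | O_DIRECT,
// by issuing a fully aligned read and copying out only the requested range.
ssize_t read_at_unaligned_offset(int fd, char* dst, uint64_t len, uint64_t offset) {
    uint64_t aligned_off = offset & ~(kAlign - 1);                     // round offset down
    uint64_t skip        = offset - aligned_off;                       // leading bytes to drop
    uint64_t need        = (skip + len + kAlign - 1) & ~(kAlign - 1);  // round length up

    void* tmp = nullptr;                                               // O_DIRECT also needs an aligned buffer
    if (posix_memalign(&tmp, kAlign, need))
        throw std::runtime_error("posix_memalign failed");

    ssize_t n = pread(fd, tmp, need, aligned_off);                     // aligned offset, length and buffer
    ssize_t copied = (n < 0) ? -1 : 0;
    if (n > static_cast<ssize_t>(skip)) {
        copied = static_cast<ssize_t>(std::min<uint64_t>(static_cast<uint64_t>(n) - skip, len));
        std::memcpy(dst, static_cast<char*>(tmp) + skip, copied);
    }
    free(tmp);
    return copied;                                                     // bytes delivered to dst, or -1 on error
}
```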
Here is my code:
```cpp
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include "liburing.h"
#include <chrono>
#include <iostream>
#include <filesystem>

int main() {
    uint64_t block_size = 1 << 30;
    io_uring ring;
    io_uring_sqe *sqe;
    io_uring_cqe *cqe;
    void* buffer;
    uint64_t offset = 0;

    std::string file_name = "200GB_file";
    int file_fd = open(file_name.c_str(), O_RDONLY);
    if (file_fd < 0)
        throw "Cannot open file";

    uint64_t file_size = std::filesystem::file_size(file_name.c_str());
    uint64_t bytes_remaining = file_size;

    int num_entries = 1;
    int ret = io_uring_queue_init(num_entries, &ring, 0);
    if (ret < 0)
        throw "Error while initiating uring queue";

    if (posix_memalign(&buffer, block_size, block_size))
        throw "Error while mem-aligning";

    ret = io_uring_register_files(&ring, &file_fd, 1);
    if (ret)
        throw "failed to register file";

    while (bytes_remaining) {
        if (bytes_remaining < block_size)
            break;

        sqe = io_uring_get_sqe(&ring);
        if (!sqe)
            throw "Error while getting sqe";

        auto start = std::chrono::high_resolution_clock::now();

        io_uring_prep_read(sqe, file_fd, buffer, block_size, offset);
        ret = io_uring_submit(&ring);
        if (ret < 0)
            throw "io uring submit error";

        ret = io_uring_wait_cqe(&ring, &cqe);
        if (ret < 0)
            throw "io uring wait error";

        std::cout << "cqe->res: " << cqe->res << " ";

        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed_time = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        double reading_speed = double(block_size * 8) / double(elapsed_time);
        std::cout << " Reading speed: " << reading_speed << " Gbps" << std::endl;

        io_uring_cqe_seen(&ring, cqe);

        offset += block_size;
        bytes_remaining -= block_size;
    }

    close(file_fd);
    io_uring_queue_exit(&ring);
}
```

Here is the output (with offset = 0):
```
cqe->res: 1073741824 Reading speed: 31.5145 Gbps
cqe->res: 1073741824 Reading speed: 73.4665 Gbps
cqe->res: 1073741824 Reading speed: 13.1884 Gbps
cqe->res: 1073741824 Reading speed: 6.37993 Gbps
cqe->res: 1073741824 Reading speed: 6.97253 Gbps
cqe->res: 1073741824 Reading speed: 6.41963 Gbps
cqe->res: 1073741824 Reading speed: 7.44832 Gbps
cqe->res: 1073741824 Reading speed: 7.40503 Gbps
cqe->res: 1073741824 Reading speed: 7.36463 Gbps
cqe->res: 1073741824 Reading speed: 6.78501 Gbps
cqe->res: 1073741824 Reading speed: 7.10513 Gbps
cqe->res: 1073741824 Reading speed: 7.05038 Gbps
cqe->res: 1073741824 Reading speed: 6.10796 Gbps
cqe->res: 1073741824 Reading speed: 5.63116 Gbps
cqe->res: 1073741824 Reading speed: 5.54613 Gbps
...
```

Here is the output (with offset = 10):
```
cqe->res: 1073741824 Reading speed: 30.1049 Gbps
cqe->res: 1073741824 Reading speed: 76.969 Gbps
cqe->res: 1073741824 Reading speed: 74.8756 Gbps
cqe->res: 1073741824 Reading speed: 76.9209 Gbps
cqe->res: 1073741824 Reading speed: 77.1421 Gbps
cqe->res: 1073741824 Reading speed: 76.0599 Gbps
cqe->res: 1073741824 Reading speed: 77.7633 Gbps
cqe->res: 1073741824 Reading speed: 77.9113 Gbps
cqe->res: 1073741824 Reading speed: 75.8365 Gbps
cqe->res: 1073741824 Reading speed: 77.1777 Gbps
cqe->res: 1073741824 Reading speed: 76.6224 Gbps
cqe->res: 1073741824 Reading speed: 77.0741 Gbps
cqe->res: 1073741824 Reading speed: 77.0687 Gbps
cqe->res: 1073741824 Reading speed: 77.3948 Gbps
cqe->res: 1073741824 Reading speed: 76.1512 Gbps
cqe->res: 1073741824 Reading speed: 77.502 Gbps
cqe->res: 1073741824 Reading speed: 77.3421 Gbps
cqe->res: 1073741824 Reading speed: 77.4055 Gbps
cqe->res: 1073741824 Reading speed: 75.9798 Gbps
cqe->res: 1073741824 Reading speed: 55.6069 Gbps
cqe->res: 1073741824 Reading speed: 5.94714 Gbps
cqe->res: 1073741824 Reading speed: 32.7055 Gbps
cqe->res: 1073741824 Reading speed: 70.8435 Gbps
cqe->res: 1073741824 Reading speed: 23.822 Gbps
cqe->res: 1073741824 Reading speed: 5.69492 Gbps
cqe->res: 1073741824 Reading speed: 6.65288 Gbps
cqe->res: 1073741824 Reading speed: 6.76708 Gbps
cqe->res: 1073741824 Reading speed: 7.91575 Gbps
cqe->res: 1073741824 Reading speed: 6.29257 Gbps
cqe->res: 1073741824 Reading speed: 6.2675 Gbps
cqe->res: 1073741824 Reading speed: 7.31762 Gbps
cqe->res: 1073741824 Reading speed: 6.35138 Gbps
cqe->res: 1073741824 Reading speed: 7.52077 Gbps
cqe->res: 1073741824 Reading speed: 7.20884 Gbps
cqe->res: 1073741824 Reading speed: 6.05534 Gbps
cqe->res: 1073741824 Reading speed: 5.80964 Gbps
cqe->res: 1073741824 Reading speed: 6.58119 Gbps
cqe->res: 1073741824 Reading speed: 8.65097 Gbps
cqe->res: 1073741824 Reading speed: 27.1533 Gbps
cqe->res: 1073741824 Reading speed: 6.76254 Gbps
cqe->res: 1073741824 Reading speed: 5.9082 Gbps
cqe->res: 1073741824 Reading speed: 7.45788 Gbps
cqe->res: 1073741824 Reading speed: 6.89272 Gbps
cqe->res: 1073741824 Reading speed: 5.76414 Gbps
cqe->res: 1073741824 Reading speed: 5.65644 Gbps
cqe->res: 1073741824 Reading speed: 7.21536 Gbps
cqe->res: 1073741824 Reading speed: 6.09685 Gbps
cqe->res: 1073741824 Reading speed: 6.44203 Gbps
cqe->res: 1073741824 Reading speed: 6.97692 Gbps
...
```

In the case of offset = 10, why do I see read speed values higher than the maximum read speed of the SSD (like 77 Gbps)?
Can the code be improved to give "better" read speeds? To quantify "better": as close as possible to the maximum read speed supported by the SSD.
O_DIRECT will always provide the best performance. Maybe use normal I/O for the first, partial block that is misaligned.

You see read speeds above the disk speed because part of the data is probably cached in the page cache. Either that or a disk-internal cache.

libaio uses a normal pread with a thread pool. It is possible that the thread-pool size is better than what the kernel uses with liburing. The important part is to keep the SSD busy with many overlapping operations. NVMe has deep queues.

Don't use high_resolution_clock to measure. Use steady_clock, since you don't want clock shifts to affect your measurements. high_resolution_clock is probably an alias for system_clock in your case, so it is not very trustworthy for making measurements.
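As a rough, untested sketch of the "many overlapping operations" idea with liburing: the queue depth of 32, the 1 MiB chunk size and the one-buffer-per-request scheme below are placeholders to tune, not measured values, and the timing uses steady_clock as suggested above.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include "liburing.h"
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <vector>

int main() {
    constexpr unsigned QUEUE_DEPTH = 32;      // number of reads kept in flight (tunable)
    constexpr uint64_t CHUNK = 1 << 20;       // 1 MiB per read (tunable)
    const char* path = "200GB_file";

    int fd = open(path, O_RDONLY);            // or O_RDONLY | O_DIRECT with aligned offsets
    if (fd < 0) { perror("open"); return 1; }
    uint64_t file_size = lseek(fd, 0, SEEK_END);

    io_uring ring;
    if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) return 1;

    // One buffer per in-flight request so completions can be consumed independently.
    std::vector<void*> bufs(QUEUE_DEPTH);
    for (auto& b : bufs)
        if (posix_memalign(&b, 4096, CHUNK)) return 1;

    uint64_t submit_off = 0, done = 0;
    auto t0 = std::chrono::steady_clock::now();

    // Prime the ring with QUEUE_DEPTH overlapping reads.
    for (unsigned i = 0; i < QUEUE_DEPTH && submit_off + CHUNK <= file_size; ++i) {
        io_uring_sqe* sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], CHUNK, submit_off);
        io_uring_sqe_set_data(sqe, reinterpret_cast<void*>(static_cast<uintptr_t>(i)));
        submit_off += CHUNK;
    }
    io_uring_submit(&ring);

    // Reap one completion, hand its buffer to the consumer, then reuse it for the next read.
    while (done + CHUNK <= file_size) {
        io_uring_cqe* cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
        unsigned idx = static_cast<unsigned>(reinterpret_cast<uintptr_t>(io_uring_cqe_get_data(cqe)));
        // ... bufs[idx] now holds cqe->res bytes for the sender ...
        done += CHUNK;
        io_uring_cqe_seen(&ring, cqe);

        if (submit_off + CHUNK <= file_size) {  // keep the queue full
            io_uring_sqe* sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[idx], CHUNK, submit_off);
            io_uring_sqe_set_data(sqe, reinterpret_cast<void*>(static_cast<uintptr_t>(idx)));
            submit_off += CHUNK;
            io_uring_submit(&ring);
        }
    }

    double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    std::cout << (double(done) * 8 / 1e9) / secs << " Gbps average" << std::endl;

    for (auto b : bufs) free(b);
    close(fd);
    io_uring_queue_exit(&ring);
}
```

The main lever is keeping enough reads outstanding to saturate the device; the exact chunk size and depth matter less once the queue stays full.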