I have a latency-sensitive application running on an embedded system, and I'm seeing a discrepancy between writing to an ext4 partition and an ext2 partition on the same physical device. Specifically, I see intermittent delays when performing many small updates to a memory map, but only on ext4. I've tried the usual tricks for improving performance (especially reducing variation in latency) by mounting ext4 with different options, and have settled on these mount options:
```
mount -t ext4 -o remount,rw,noatime,nodiratime,user_xattr,barrier=1,data=ordered,nodelalloc /dev/mmcblk0p6 /media/mmc/data
```

barrier=0 didn't seem to provide any improvement.
For the ext2 partition, the following flags are used:
```
/dev/mmcblk0p3 on /media/mmc/data2 type ext2 (rw,relatime,errors=continue)
```

Here's the test program I'm using:
```cpp
#include <cstdio>
#include <cstring>
#include <cstdlib>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <time.h>

uint32_t getMonotonicMillis()
{
    struct timespec time;
    clock_gettime(CLOCK_MONOTONIC, &time);
    uint32_t millis = (time.tv_nsec/1000000)+(time.tv_sec*1000);
    return millis;
}

void tune(const char* name, const char* value)
{
    FILE* tuneFd = fopen(name, "wb+");
    fwrite(value, strlen(value), 1, tuneFd);
    fclose(tuneFd);
}

void tuneForFasterWriteback()
{
    tune("/proc/sys/vm/dirty_writeback_centisecs", "25");
    tune("/proc/sys/vm/dirty_expire_centisecs", "200");
    tune("/proc/sys/vm/dirty_background_ratio", "5");
    tune("/proc/sys/vm/dirty_ratio", "40");
    tune("/proc/sys/vm/swappiness", "0");
}

class MMapper
{
public:
    const char* _backingPath;
    int _blockSize;
    int _blockCount;
    bool _isSparse;
    int _size;
    uint8_t* _data;
    int _backingFile;
    uint8_t* _buffer;

    MMapper(const char* backingPath, int blockSize, int blockCount, bool isSparse) :
        _backingPath(backingPath),
        _blockSize(blockSize),
        _blockCount(blockCount),
        _isSparse(isSparse),
        _size(blockSize*blockCount)
    {
        printf("Creating MMapper for %s with block size %i, block count %i and it is%s sparse\n",
               _backingPath, _blockSize, _blockCount, _isSparse ? "" : " not");
        _backingFile = open(_backingPath, O_CREAT | O_RDWR | O_TRUNC, 0600);
        if(_isSparse)
        {
            ftruncate(_backingFile, _size);
        }
        else
        {
            posix_fallocate(_backingFile, 0, _size);
            fsync(_backingFile);
        }
        _data = (uint8_t*)mmap(NULL, _size, PROT_READ | PROT_WRITE, MAP_SHARED, _backingFile, 0);
        _buffer = new uint8_t[blockSize];
        printf("MMapper %s created!\n", _backingPath);
    }

    ~MMapper()
    {
        printf("Destroying MMapper %s\n", _backingPath);
        if(_data)
        {
            msync(_data, _size, MS_SYNC);
            munmap(_data, _size);
            close(_backingFile);
            _data = NULL;
            delete [] _buffer;
            _buffer = NULL;
        }
        printf("Destroyed!\n");
    }

    void writeBlock(int whichBlock)
    {
        memcpy(&_data[whichBlock*_blockSize], _buffer, _blockSize);
    }
};

int main(int argc, char** argv)
{
    tuneForFasterWriteback();
    int timeBetweenBlocks = 40*1000;
    //2^12 x 2^16 = 2^28 = 2^10*2^10*2^8 = 256MB
    int blockSize = 4*1024;
    int blockCount = 64*1024;
    int bigBlockCount = 2*64*1024;
    int iterations = 25*40*60; //25 counts simulates 1 layer for one second, 5 minutes here
    uint32_t startMillis = getMonotonicMillis();
    int measureIterationCount = 50;
    MMapper mapper("sparse", blockSize, bigBlockCount, true);
    for(int i=0; i<iterations; i++)
    {
        int block = rand()%blockCount;
        mapper.writeBlock(block);
        usleep(timeBetweenBlocks);
        if(i%measureIterationCount==measureIterationCount-1)
        {
            uint32_t elapsedTime = getMonotonicMillis()-startMillis;
            printf("%i took %u\n", i, elapsedTime);
            startMillis = getMonotonicMillis();
        }
    }
    return 0;
}
```

Fairly simplistic test case. I don't expect terribly accurate timing; I'm more interested in general trends. Before running the tests, I ensured that the system was in a fairly steady state, with very little disk write activity occurring, by doing something like:
```
watch grep -e Writeback: -e Dirty: /proc/meminfo
```

There was very little to no disk activity. This is also verified by seeing 0 or 1 in the wa column of the output of `vmstat 1`. I also perform a sync immediately before running the test. Note the aggressive writeback parameters being fed to the vm subsystem as well.
When I run the test on the ext2 partition, the first one hundred batches of fifty writes yield a nice solid average of 2012 ms with a standard deviation of 8 ms. When I run the same test on the ext4 partition, I see an average of 2151 ms, but an abysmal standard deviation of 409 ms. My primary concern is variation in latency, so this is frustrating. The actual times for the ext4 partition test look like this:
```
{2372, 3291, 2025, 2020, 2019, 2019, 2019, 2019, 2019, 2020, 2019, 2019, 2019, 2019, 2020, 2021, 2037, 2019, 2021, 2021, 2020, 2152, 2020, 2021, 2019, 2019, 2020, 2153, 2020, 2020, 2021, 2020, 2020, 2020, 2043, 2021, 2019, 2019, 2019, 2053, 2019, 2020, 2023, 2020, 2020, 2021, 2019, 2022, 2019, 2020, 2020, 2020, 2019, 2020, 2019, 2019, 2021, 2023, 2019, 2023, 2025, 3574, 2019, 3013, 2019, 2021, 2019, 3755, 2021, 2020, 2020, 2019, 2020, 2020, 2019, 2799, 2020, 2019, 2019, 2020, 2020, 2143, 2088, 2026, 2017, 2310, 2020, 2485, 4214, 2023, 2020, 2023, 3405, 2020, 2019, 2020, 2020, 2019, 2020, 3591}
```

Unfortunately, I don't know whether ext2 is an option for the end solution, so I'm trying to understand the difference in behavior between the file systems. I would most likely have control over at least the flags used to mount the ext4 file system, and could tweak those.
noatime/nodiratime don't seem to make much of a dent
barrier=0/1 doesn't seem to matter
nodelalloc helps a bit, but doesn't do nearly enough to smooth out the latency variation.
The ext4 partition is only about 10% full.
Thanks for any thoughts on this issue!