I'm writing some code to access an inverted index. I have two interchangeable class which perform the reads on the index. One reads the index from the disk, buffering part of it. The other load the index completely in memory, as a byte[][] (the index size is around 7Gb) and read from this multidimensional array. One would expect to have better performances while having the whole data in memory. But my measures state that working with the index on disk it's as fast as having it in memory. (The time spent to load the index in memory isn't counted in the performances)
Why is this happening? Any ideas?
Further information: I've run the code enabling HPROF. Both working "on disk" or "in memory", the most used code it's NOT the one directly related to the reads. Also, for my (limited) understanding, the gc profiler doesn't show any gc related issue.
UPDATE #1: I've instrumented my code to monitor I/O times. It seems that most of the seeks on memory take 0-2000ns, while most of the seeks on disk take 1000-3000ns. The second metric seems a bit too low for me. Is it due disk caching by Linux? Is there a way to exclude disk caching for benchmarking purposes?
UPDATE #2: I've graphed the response time for every request to the index. The line for the memory and for the disk match almost exactly. I've done some other tests using the O_DIRECT flag to open the file (thanks to JNA!) and in that case the disk version of the code is (obviously) slower than memory. So, I'm concluding that the "problem" was because the aggressive Linux disk caching, which is pretty amazing.
UPDATE #3: http://www.nicecode.eu/java-streams-for-direct-io/
Both working "on disk" or "in memory", the most used code it's NOT the one directly related to the reads.So... you have your answer, no ?