4

I'm writing some code to access an inverted index. I have two interchangeable classes that perform the reads on the index. One reads the index from disk, buffering part of it. The other loads the index completely into memory as a byte[][] (the index size is around 7 GB) and reads from this two-dimensional array. One would expect better performance with the whole index in memory, but my measurements show that working with the index on disk is as fast as having it in memory. (The time spent loading the index into memory isn't counted in the measurements.)

Why is this happening? Any ideas?

Further information: I've run the code with HPROF enabled. Whether working "on disk" or "in memory", the most-used code is NOT the code directly related to the reads. Also, to my (limited) understanding, the GC profiler doesn't show any GC-related issue.

UPDATE #1: I've instrumented my code to monitor I/O times. It seems that most of the seeks in memory take 0-2000 ns, while most of the seeks on disk take 1000-3000 ns. The second figure seems a bit too low to me. Is it due to disk caching by Linux? Is there a way to exclude disk caching for benchmarking purposes?
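For readers who want to reproduce this kind of measurement, here is a minimal sketch of how per-read latencies could be bucketed. It is not the code used above: the class name `SeekTimer`, the 1000 ns bucket width, the read size and the random-offset access pattern are all assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

/** Hypothetical helper (not the original code): buckets per-read latencies. */
final class SeekTimer {
    static final long BUCKET_WIDTH_NS = 1_000; // assumed bucket width: 1 microsecond
    static final int BUCKET_COUNT = 10;        // last bucket collects everything slower

    /** Maps an elapsed time in nanoseconds to a histogram bucket index. */
    static int bucketFor(long elapsedNs) {
        long bucket = Math.max(0, elapsedNs) / BUCKET_WIDTH_NS;
        return (int) Math.min(bucket, BUCKET_COUNT - 1);
    }

    /** Times `reads` seek+read operations at random offsets and returns the histogram. */
    static long[] timeRandomReads(String path, int reads, int readSize, long seed)
            throws IOException {
        long[] histogram = new long[BUCKET_COUNT];
        byte[] buffer = new byte[readSize];
        Random random = new Random(seed);
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long maxOffset = Math.max(1, file.length() - readSize);
            for (int i = 0; i < reads; i++) {
                long offset = (long) (random.nextDouble() * maxOffset);
                long start = System.nanoTime();
                file.seek(offset);
                file.read(buffer, 0, readSize); // may read fewer bytes near end of file
                histogram[bucketFor(System.nanoTime() - start)]++;
            }
        }
        return histogram;
    }

    public static void main(String[] args) throws IOException {
        long[] histogram = timeRandomReads(args[0], 100_000, 64, 42L);
        for (int i = 0; i < histogram.length; i++) {
            System.out.printf("from %5d ns: %d reads%n", i * BUCKET_WIDTH_NS, histogram[i]);
        }
    }
}
```

Each `System.nanoTime()` call has a cost of its own, so differences of a few tens of nanoseconds between buckets should be read with caution.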

UPDATE #2: I've graphed the response time for every request to the index. The lines for memory and for disk match almost exactly. I've run some other tests using the O_DIRECT flag to open the file (thanks to JNA!), and in that case the disk version of the code is (obviously) slower than the memory version. So I conclude that the "problem" was due to Linux's aggressive disk caching, which is pretty amazing.

UPDATE #3: http://www.nicecode.eu/java-streams-for-direct-io/

  • The memory version might be slowed down by garbage collections if you are close to your maximum heap size - have you monitored GCs? Commented Mar 19, 2013 at 18:01
  • Two possibilities: 1) OS caches disk reads 2) the code performance is not actually constrained by the speed of data access. Commented Mar 19, 2013 at 18:01
  • Even when slowed down by GC, RAM is still faster than disk (although it depends on what kind of disk we're talking about...). Commented Mar 19, 2013 at 18:02
  • You also might be swapping to disk as a result of allocating more heap than physical memory. Hard to tell without profiling. Commented Mar 19, 2013 at 18:02
  • Both working "on disk" or "in memory", the most used code it's NOT the one directly related to the reads. So... you have your answer, no? Commented Mar 19, 2013 at 18:05

2 Answers

5

Three possibilities off the top of my head:

  • The operating system is already keeping all of the index file in memory via its file system cache. (I'd still expect an overhead, mind you.)
  • The index isn't the bottleneck of the code you're testing.
  • Your benchmarking methodology isn't quite right. (It can be very hard to do benchmarking well.)

The middle option seems the most likely to me.
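To see why a faster index might not show up in the totals, here is a rough sketch based on Amdahl's law. The class name `SpeedupEstimate` and the numbers (5% of the time spent in reads, reads becoming 100 times faster) are purely hypothetical and are not measurements from the question.

```java
/** Hypothetical illustration of Amdahl's law; the figures below are made up. */
final class SpeedupEstimate {

    /**
     * Overall speedup when only the index reads get faster.
     *
     * @param readFraction fraction of total run time spent in index reads (0..1)
     * @param readSpeedup  how many times faster the reads become (> 0)
     */
    static double overallSpeedup(double readFraction, double readSpeedup) {
        if (readFraction < 0 || readFraction > 1 || readSpeedup <= 0) {
            throw new IllegalArgumentException("fraction must be in [0,1], speedup > 0");
        }
        // The part that is not reads is unchanged; the read part shrinks by readSpeedup.
        return 1.0 / ((1.0 - readFraction) + readFraction / readSpeedup);
    }

    public static void main(String[] args) {
        // Assumed scenario: reads are 5% of the run time and become 100x faster.
        System.out.printf("overall speedup: %.3f%n", overallSpeedup(0.05, 100.0));
    }
}
```

Under those assumed numbers the whole program gets only about 5% faster, which can easily disappear in measurement noise.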


2 Comments

If memory is faster than disk, and if the code performs the same number of reads for memory and for disk, shouldn't the memory version be faster?
@MatteoCatena: Yes. But if you don't perform many reads, but you spend a lot of time doing other things, then the difference may get lost in the noise.
2

No, disk can never be as fast as RAM (for random access, RAM is on the order of 100,000 times faster than magnetic disks). Most likely the OS is mapping your file into memory for you.

6 Comments

Can you be more detailed in your answer, please? It seems strange to me that the OS would cache a 7 GB file in RAM.
Of course I don't mean the whole file, but the OS might be preloading data into buffers when your process is not executing, anticipating your reads.
Is there a way to confirm this?
Checking the OS source code (if available). Maybe some profiling tools can give you more insight as well. The question is: are you sure your issue lies here? Check Jon Skeet's answer, especially point 2.
Do you know any way to exclude disk caching when benchmarking on Linux?