13

I have an image compression application that now has two different versions of memory allocation systems. In the original one, malloc is used everywhere, and in the second one, I implemented a simple pool-allocator, that just allocates chunk of memory and returns parts of that memory to myalloc() calls.

We've been noticing a huge memory overhead when malloc is used: At the height of its memory usage, the malloc() code requires about 170 megabytes of memory for a 1920x1080x16bpp image, while the pool allocator allocates just 48 megabytes, of which 47 are used by the program.

In terms of memory allocation patterns, the program allocates a lot of 8byte(most), 32-byte(many) and 1080byte-blocks(some) with the test image. Apart from these, there are no dynamic memory allocations in the code.

The OS of the testing system is Windows 7 (64 Bit).

How did we test memory usage?

With the custom allocator, we could see how much memory is used because all malloc calls are defered to the allocator. With malloc(), in Debug mode we just stepped through the code and watched the memory usage in the task manager. In release mode we did the same, but less fine grained because the compiler optimizes a lot of stuff away so we couldn't step through the code piece by piece (the memory difference between release and debug was about 20MB, which I would attribute to optimization and lack of debug information in release mode).

Could malloc alone be the cause of such a huge overhead? If so, what exactly causes this overhead inside malloc?

8
  • Normally you only write a custom version of something generic when what you write will add a lot of benefit in performance due to knowing the specifics of how you plan to use it. However I wouldn't expect a memory overhead for using malloc. Are you sure you are measuring memory use correctly? Are you sure you are freeing the memory correctly when you use malloc? Commented Oct 25, 2012 at 8:51
  • Everything in the malloc code is freed (I tested this with a memory profiler), but only at the very end of the application before it has finished executing, so the measurements occur before any free() functions are called (in both versions). The custom allocator speeds the entire thing up, and saves us about 15ms per image (since it's just one large allocation instead of lots of small ones). Commented Oct 25, 2012 at 8:57
  • A 200% overhead for malloc does seem excessive, unless there are a lot more of the 8 byte allocations than you think. Commented Oct 25, 2012 at 8:57
  • malloc may well "retain" some memory after you've freed it for immediate use. If your implementation is using std::vector you can "reserve" ahead, although when you are going to allocate such large amounts of memory, it is better not to go for a model that requires a contiguous buffer. Commented Oct 25, 2012 at 9:00
  • 1
    Can you describe how did you guys measure the memory usage? Commented Oct 25, 2012 at 9:04

3 Answers 3

10

On Windows 7 you will always get the low-fragmentation heap allocator, without explicitly calling HeapSetInformation() to ask for it. That allocator sacrifices virtual memory space to reduce fragmentation. Your program is not actually using 170 megabytes, you are just seeing a bunch of free blocks lying around, waiting for an allocation of a similar size.

This algorithm is very easy to beat with a custom allocator that doesn't do anything to reduce fragmentation. Which may well work out for you, albeit that you don't see the side effects of it until you keep the program running longer than a single debug session. You do need to make sure it is stable for days or weeks if that is the expected usage pattern.

Best thing to do is just not fret about it, 170 MB is rather small potatoes. And do keep in mind that this is virtual memory, it doesn't cost anything.

Sign up to request clarification or add additional context in comments.

6 Comments

"On Windows 7 you will always get the low-fragmentation heap allocator" You will, unless you are running under a debugger. Based on the methodology desribed (stepping through the code) it seems the debugger was used when measuring the memory consumption, therefore LFH was not used. See my answer.
That article quotes ancient history, the memory manager was significantly revised in Vista. But yes, the block padding used by the heap debugging feature can affect the amount of VM he sees getting used.
Virtual memory costs something, page misses are painful in performance.
No, you can't get a page miss from accessing a free block. That would be a bug.
@Hans Maybe I got it wrong then, can you please explain this further?
|
6

First at all malloc aligns the pointers to 16 byte boundaries. Furthermore they store at least one pointer (or allocated length) in the addresses preceding the returned value. Then they probably add a magic value or release counter to indicate that the linked list is not broken or that the memory block has not been released twice (free ASSERTS for double frees).

#include <stdlib.h> #include <stdio.h> int main(int ac, char**av) { int *foo = malloc(4); int *bar = malloc(4); printf("%d\n", (int)bar - (int)foo); } 

Return: 32

8 Comments

You mean 32 bits takes 16 bytes?
@Aki he says many 32-byte allocations. why do you come up with 32-bits allocations?
This is tested in 64-bit ubuntu. In 32-bit systems it takes most likely 16 bytes.
@UmNyobe: apparently my mistake. Anyway 8 byte malloc takes 32 bytes and 32-byte malloc takes then at least 48 bytes.
A malloc implementation certainly doesn't have to add headers. Modern allocator implementations like jemalloc have very low metadata overhead even for small allocations (~2%). By using aligned chunks, small (< chunk) allocations be distinguished from large (>= chunk) ones via an alignment check (4M chunks in jemalloc). The metadata for small allocations can be found via an offset with a global data structure for large ones. Alternatively, a global data structure can be used which is what TCMalloc does.
|
4

Caution: When you run your program in the Visual Studio or with any debugger attached, by default the malloc behaviour is changed a lot, Low Fragmentation Heap is not used and a memory overhead may be not representative of real usage (see also https://stackoverflow.com/a/3768820/16673). You need to use environment variable _NO_DEBUG_HEAP=1 to avoid being hit by this, or to measure the memory usage when not running under a debugger.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.