
A file provides me a 400x400x200x1 array and a shape. Depending on the data transmitted in the array, the shape changes. My task is to adapt the 400x400x200x1 array to the data it contains.
For example:

    import numpy as np

    shape = np.array([20, 180, 1, 1])
    b = []
    l = np.load("testfile.npy")
    d = np.reshape(l[:shape[0], :shape[1], :shape[2], :shape[3]],
                   (shape[0], shape[1])).transpose()
    b.append(d)

The idea is to create a new array whose size is adapted to its data. Now comes the problem: I have to repeat this process several times, but each time I do it, my RAM usage increases:

    import time
    import numpy as np

    shape = np.array([20, 180, 1, 1])
    b = []
    for j in range(9):
        l = np.load("testfile.npy")
        d = np.reshape(l[:shape[0], :shape[1], :shape[2], :shape[3]],
                       (shape[0], shape[1])).transpose()
        time.sleep(2)
        b.append(d)


Is that just because the appended arrays are so big? The output array I append only has a size of 180x20, yet RAM usage increases by about 0.12 GB each time. Is there a more efficient way to store arrays, without temp files?

Thanks, and sorry for my English.


2 Answers


In your example, your error is that you reload the file at each iteration of the for loop. Try:

    import time
    import numpy as np

    l = np.load("testfile.npy")
    shape = np.array([20, 180, 1, 1])
    b = []
    for j in range(9):
        d = np.reshape(l[:shape[0], :shape[1], :shape[2], :shape[3]],
                       (shape[0], shape[1])).transpose()
        time.sleep(2)
        b.append(d)

That solves the problem.

Now, the reason: at each iteration you load l from the file, implicitly create a view on part of it (d), and append that view to b. A view actually contains a reference to the whole array. So each time, you load the whole array and store an object holding a reference to it, which prevents the garbage collector from freeing the memory.

If for some reason you have to reload the file each time, the other solution is to explicitly make a copy in order to drop the reference to the whole array. In your example, replace the last line with:

b.append(d.copy()) 

Note 1: actually, because the size of the view d is negligible compared to l, you should always make a copy.
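To see why the copy is cheap and breaks the reference, a small sketch (same hypothetical stand-in array as above):

```python
import numpy as np

# Hypothetical stand-in for the loaded file.
l = np.zeros((400, 400, 200, 1), dtype=np.float32)
d = np.reshape(l[:20, :180, :1, :1], (20, 180)).transpose()

c = d.copy()           # materialises the 180x20 data into its own buffer
print(c.base is None)  # True: no reference back to l, so l can be freed
print(c.nbytes)        # 14400 bytes, versus l's 128 MB
```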

Note 2: to verify that d keeps a reference to l, you can do

d.base.base.base is l # True 

.base is the reference to the viewed array. In your case there are three levels of depth: the l[...] slice, the reshape, and the transpose.


1 Comment

In my case the big array comes from a function, and as the data changes, I have to call the function again with the new data. Your second suggestion solved my problem. Thanks a lot!

Most of your RAM usage comes from reading the whole 400x400x200x1 array into memory. Tell numpy to memory-map the input array instead, and it should be able to avoid reading most of it:

l = np.load("testfile.npy", mmap_mode='r') 

Since both reshape and transpose return views of the original array whenever possible, it may also be worthwhile to explicitly make a copy of the result. We can also simplify the indexing:

d = l[:20, :180, 0, 0].transpose().copy() 

I don't think the particular indices you've chosen allow NumPy to return a view, but when it does return a view, the view causes the entire original array to be retained. For a memory-mapped array, I believe a view will also be memory-mapped, which probably isn't what you want.
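A sketch of the memmap behaviour, using a small stand-in file (the real file is 400x400x200x1, so the shapes here are scaled down):

```python
import os
import tempfile
import numpy as np

# Create a small stand-in .npy file for the demo.
path = os.path.join(tempfile.mkdtemp(), "testfile.npy")
np.save(path, np.zeros((40, 40, 20, 1), dtype=np.float32))

# mmap_mode='r' maps the file instead of reading it all into RAM.
l = np.load(path, mmap_mode='r')
print(isinstance(l, np.memmap))       # True

# A sliced/transposed result is still backed by the mapping...
d_view = l[:20, :18, 0, 0].transpose()
print(isinstance(d_view, np.memmap))  # True

# ...while .copy() materialises it into its own in-memory buffer.
d = l[:20, :18, 0, 0].transpose().copy()
print(d.base is None)                 # True: detached from the file
```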

2 Comments

How can I use numpy.memmap when the big array is returned from a function, e.g. bigarray = functionthatgivesbigarray()? Thanks!
@Hubschr: You most likely can't. The function generally needs to provide a memory-map option.
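When the array comes from a function, the copy-then-release pattern from the first answer still works. A sketch, with functionthatgivesbigarray as a hypothetical stand-in for the commenter's producer:

```python
import numpy as np

def functionthatgivesbigarray():
    # Hypothetical stand-in for the function producing the big array.
    return np.zeros((400, 400, 200, 1), dtype=np.float32)

b = []
for j in range(9):
    l = functionthatgivesbigarray()
    # Copy the small slice so nothing in b keeps a reference to l...
    b.append(np.reshape(l[:20, :180, :1, :1], (20, 180)).transpose().copy())
    # ...and the big array is freed when l is rebound on the next iteration.
```

Only the nine 180x20 copies stay in memory, not nine full-size arrays.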
