
Memory leakage is detected via memory_profiler. Since such a big file will be uploaded from a 128 MB GCF or an f1-micro GCE instance, how can I prevent this memory leakage?

✗ python -m memory_profiler tests/test_gcp_storage.py 
67108864
Filename: tests/test_gcp_storage.py

Line #    Mem usage    Increment   Line Contents
================================================
    48   35.586 MiB   35.586 MiB   @profile
    49                             def test_upload_big_file():
    50   35.586 MiB    0.000 MiB       from google.cloud import storage
    51   35.609 MiB    0.023 MiB       client = storage.Client()
    52
    53   35.609 MiB    0.000 MiB       m_bytes = 64
    54   35.609 MiB    0.000 MiB       filename = int(datetime.utcnow().timestamp())
    55   35.609 MiB    0.000 MiB       blob_name = f'test/{filename}'
    56   35.609 MiB    0.000 MiB       bucket_name = 'my_bucket'
    57   38.613 MiB    3.004 MiB       bucket = client.get_bucket(bucket_name)
    58
    59   38.613 MiB    0.000 MiB       with open(f'/tmp/{filename}', 'wb+') as file_obj:
    60   38.613 MiB    0.000 MiB           file_obj.seek(m_bytes * 1024 * 1024 - 1)
    61   38.613 MiB    0.000 MiB           file_obj.write(b'\0')
    62   38.613 MiB    0.000 MiB           file_obj.seek(0)
    63
    64   38.613 MiB    0.000 MiB           blob = bucket.blob(blob_name)
    65  102.707 MiB   64.094 MiB           blob.upload_from_file(file_obj)
    66
    67  102.715 MiB    0.008 MiB           blob = bucket.get_blob(blob_name)
    68  102.719 MiB    0.004 MiB           print(blob.size)

Moreover, if the file is not opened in binary mode, the memory leakage is roughly twice the file size.

67108864
Filename: tests/test_gcp_storage.py

Line #    Mem usage    Increment   Line Contents
================================================
    48   35.410 MiB   35.410 MiB   @profile
    49                             def test_upload_big_file():
    50   35.410 MiB    0.000 MiB       from google.cloud import storage
    51   35.441 MiB    0.031 MiB       client = storage.Client()
    52
    53   35.441 MiB    0.000 MiB       m_bytes = 64
    54   35.441 MiB    0.000 MiB       filename = int(datetime.utcnow().timestamp())
    55   35.441 MiB    0.000 MiB       blob_name = f'test/{filename}'
    56   35.441 MiB    0.000 MiB       bucket_name = 'my_bucket'
    57   38.512 MiB    3.070 MiB       bucket = client.get_bucket(bucket_name)
    58
    59   38.512 MiB    0.000 MiB       with open(f'/tmp/{filename}', 'w+') as file_obj:
    60   38.512 MiB    0.000 MiB           file_obj.seek(m_bytes * 1024 * 1024 - 1)
    61   38.512 MiB    0.000 MiB           file_obj.write('\0')
    62   38.512 MiB    0.000 MiB           file_obj.seek(0)
    63
    64   38.512 MiB    0.000 MiB           blob = bucket.blob(blob_name)
    65  152.250 MiB  113.738 MiB           blob.upload_from_file(file_obj)
    66
    67  152.699 MiB    0.449 MiB           blob = bucket.get_blob(blob_name)
    68  152.703 MiB    0.004 MiB           print(blob.size)

GIST: https://gist.github.com/northtree/8b560a6b552a975640ec406c9f701731

  • Once blob goes out of scope, is the memory still in use? Commented Jun 27, 2019 at 14:14
  • I have tried with your code (both the binary and non-binary way) and both gave me the same file size. Using memory_profiler I didn't get any memory increment when uploading the blob on either version. Try deleting the blob object after uploading it (del blob), or try the "upload_from_filename" method to see if you face the same issue -> googleapis.github.io/google-cloud-python/latest/storage/… . Let me know. Commented Jun 27, 2019 at 15:35
  • @Maximilian I suppose the blob should be automatically released outside the with block. Commented Jun 28, 2019 at 1:06
  • @Mayeru I have run it multiple times with Python 3.7 and google-cloud-storage==1.16.1 on OS X. Are you running in a different environment? Thanks. Commented Jun 28, 2019 at 1:09
  • Some advice on how to write code for the cloud: 1) You do not have a memory leak unless you have code that is not displayed. 2) You do not want to allocate large blocks of memory to read a file into. 128 MB is big - too big. 3) Internet connections fail, time out, drop packets, and return errors, so you want to upload in smaller blocks like 64 KB or 1 MB per I/O, with retry logic (see the sketch after this comment list). 4) Performance is increased by multi-part uploads; typically, two to four threads will double the performance. I realize that your question is about "memory leaks", but write good code and then quality-check it. Commented Jul 3, 2019 at 1:52
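
To make the chunked-upload advice above concrete, here is a minimal sketch (my addition, not from the original thread): setting an explicit chunk_size on the blob makes the client perform a resumable upload in chunk-sized pieces instead of buffering the whole file, and upload_from_filename (suggested in an earlier comment) streams directly from disk. The bucket name and file path below are placeholders, and note that google-cloud-storage requires chunk_size to be a multiple of 256 KB, so the 64 KB figure from the comment cannot be used as-is.

# Sketch only: stream a local file to GCS in 1 MiB chunks via a resumable upload.
# 'my_bucket' and '/tmp/big_file' are placeholder names.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my_bucket')

# An explicit chunk_size (a multiple of 256 KB) switches the client to a
# chunked, resumable upload, so only about one chunk is held in memory at a time.
blob = bucket.blob('test/big_file', chunk_size=1024 * 1024)
blob.upload_from_filename('/tmp/big_file')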

1 Answer

To limit the amount of memory used during an upload, you need to explicitly configure a chunk size on the blob before you call upload_from_file():

blob = bucket.blob(blob_name, chunk_size=10*1024*1024)
blob.upload_from_file(file_obj)

I agree this is bad default behaviour of the Google client SDK, and the workaround is badly documented as well.
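
For reference, here is a minimal sketch (my addition, not part of the original answer) of the question's test with this workaround applied, keeping the same placeholder bucket name. With an explicit chunk_size, which must be a multiple of 256 KB, the client performs a resumable upload in chunk-sized pieces, so peak memory should stay around one chunk rather than the full 64 MiB file; depending on the library version, transient failures can then also be retried per chunk instead of restarting the whole upload.

from datetime import datetime
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my_bucket')  # placeholder bucket name, as in the question

m_bytes = 64
filename = int(datetime.utcnow().timestamp())
blob_name = f'test/{filename}'

with open(f'/tmp/{filename}', 'wb+') as file_obj:
    # Create a 64 MiB sparse test file, as in the original test.
    file_obj.seek(m_bytes * 1024 * 1024 - 1)
    file_obj.write(b'\0')
    file_obj.seek(0)

    # 10 MiB chunk size (a multiple of 256 KB), as suggested in the answer.
    blob = bucket.blob(blob_name, chunk_size=10 * 1024 * 1024)
    blob.upload_from_file(file_obj)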

1 Comment

This helps, but is not a solution to this kind of problem.
