
Memory leakage is detected via memory_profiler. Since such a big file will be uploaded from a 128 MB GCF or an f1-micro GCE instance, how can I prevent this memory leakage?

✗ python -m memory_profiler tests/test_gcp_storage.py 
67108864
Filename: tests/test_gcp_storage.py

Line #    Mem usage    Increment   Line Contents
================================================
    48   35.586 MiB   35.586 MiB   @profile
    49                             def test_upload_big_file():
    50   35.586 MiB    0.000 MiB       from google.cloud import storage
    51   35.609 MiB    0.023 MiB       client = storage.Client()
    52
    53   35.609 MiB    0.000 MiB       m_bytes = 64
    54   35.609 MiB    0.000 MiB       filename = int(datetime.utcnow().timestamp())
    55   35.609 MiB    0.000 MiB       blob_name = f'test/{filename}'
    56   35.609 MiB    0.000 MiB       bucket_name = 'my_bucket'
    57   38.613 MiB    3.004 MiB       bucket = client.get_bucket(bucket_name)
    58
    59   38.613 MiB    0.000 MiB       with open(f'/tmp/{filename}', 'wb+') as file_obj:
    60   38.613 MiB    0.000 MiB           file_obj.seek(m_bytes * 1024 * 1024 - 1)
    61   38.613 MiB    0.000 MiB           file_obj.write(b'\0')
    62   38.613 MiB    0.000 MiB           file_obj.seek(0)
    63
    64   38.613 MiB    0.000 MiB           blob = bucket.blob(blob_name)
    65  102.707 MiB   64.094 MiB           blob.upload_from_file(file_obj)
    66
    67  102.715 MiB    0.008 MiB           blob = bucket.get_blob(blob_name)
    68  102.719 MiB    0.004 MiB           print(blob.size)

Moreover, if the file is not opened in binary mode, the memory leakage is roughly twice the file size.

67108864
Filename: tests/test_gcp_storage.py

Line #    Mem usage    Increment   Line Contents
================================================
    48   35.410 MiB   35.410 MiB   @profile
    49                             def test_upload_big_file():
    50   35.410 MiB    0.000 MiB       from google.cloud import storage
    51   35.441 MiB    0.031 MiB       client = storage.Client()
    52
    53   35.441 MiB    0.000 MiB       m_bytes = 64
    54   35.441 MiB    0.000 MiB       filename = int(datetime.utcnow().timestamp())
    55   35.441 MiB    0.000 MiB       blob_name = f'test/{filename}'
    56   35.441 MiB    0.000 MiB       bucket_name = 'my_bucket'
    57   38.512 MiB    3.070 MiB       bucket = client.get_bucket(bucket_name)
    58
    59   38.512 MiB    0.000 MiB       with open(f'/tmp/{filename}', 'w+') as file_obj:
    60   38.512 MiB    0.000 MiB           file_obj.seek(m_bytes * 1024 * 1024 - 1)
    61   38.512 MiB    0.000 MiB           file_obj.write('\0')
    62   38.512 MiB    0.000 MiB           file_obj.seek(0)
    63
    64   38.512 MiB    0.000 MiB           blob = bucket.blob(blob_name)
    65  152.250 MiB  113.738 MiB           blob.upload_from_file(file_obj)
    66
    67  152.699 MiB    0.449 MiB           blob = bucket.get_blob(blob_name)
    68  152.703 MiB    0.004 MiB           print(blob.size)

GIST: https://gist.github.com/northtree/8b560a6b552a975640ec406c9f701731

  • Once blob goes out of scope, is the memory still in use? Commented Jun 27, 2019 at 14:14
  • I have tried with your code (both the binary and non-binary way) and both gave me the same file size. Using memory_profiler I didn't get any memory increment when uploading the blob on either version. Try deleting the blob object after uploading it (del blob), or try the "upload_from_filename" method to see if you face the same issue -> googleapis.github.io/google-cloud-python/latest/storage/… . Let me know. Commented Jun 27, 2019 at 15:35
  • @Maximilian I suppose the blob should be automatically released outside the with block. Commented Jun 28, 2019 at 1:06
  • @Mayeru I have run it multiple times with Python 3.7 and google-cloud-storage==1.16.1 on OS X. Are you running in a different environment? Thanks. Commented Jun 28, 2019 at 1:09
  • Some advice on how to write code for the cloud: 1) You do not have a memory leak unless you have code that is not displayed. 2) You do not want to allocate large blocks of memory to read a file into. 128 MB is big - too big. 3) Internet connections fail, time out, drop packets, and return errors, so you want to upload in smaller blocks like 64 KB or 1 MB per I/O, with retry logic (see the sketch after this comment list). 4) Performance is increased by multi-part uploads; typically, two to four threads will double the performance. I realize that your question is about "memory leaks", but write good code and then quality-check it. Commented Jul 3, 2019 at 1:52
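
To make the chunked-upload advice above concrete, here is a minimal sketch (my addition, not from the original thread): setting an explicit chunk_size on the blob makes the client perform a resumable upload in chunk-sized pieces instead of buffering the whole file, and upload_from_filename (suggested in an earlier comment) streams directly from disk. The bucket name and file path below are placeholders, and note that google-cloud-storage requires chunk_size to be a multiple of 256 KB, so the 64 KB figure from the comment cannot be used as-is.

# Sketch only: stream a local file to GCS in 1 MiB chunks via a resumable upload.
# 'my_bucket' and '/tmp/big_file' are placeholder names.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my_bucket')

# An explicit chunk_size (a multiple of 256 KB) switches the client to a
# chunked, resumable upload, so only about one chunk is held in memory at a time.
blob = bucket.blob('test/big_file', chunk_size=1024 * 1024)
blob.upload_from_filename('/tmp/big_file')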

1 Answer

To limit the amount of memory used during an upload, you need to explicitly configure a chunk size on the blob before you call upload_from_file():

blob = bucket.blob(blob_name, chunk_size=10*1024*1024)
blob.upload_from_file(file_obj)

I agree this is bad default behaviour of the Google client SDK, and the workaround is badly documented as well.
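
For reference, here is a minimal sketch (my addition, not part of the original answer) of the question's test with this workaround applied, keeping the same placeholder bucket name. With an explicit chunk_size, which must be a multiple of 256 KB, the client performs a resumable upload in chunk-sized pieces, so peak memory should stay around one chunk rather than the full 64 MiB file; depending on the library version, transient failures can then also be retried per chunk instead of restarting the whole upload.

from datetime import datetime
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my_bucket')  # placeholder bucket name, as in the question

m_bytes = 64
filename = int(datetime.utcnow().timestamp())
blob_name = f'test/{filename}'

with open(f'/tmp/{filename}', 'wb+') as file_obj:
    # Create a 64 MiB sparse test file, as in the original test.
    file_obj.seek(m_bytes * 1024 * 1024 - 1)
    file_obj.write(b'\0')
    file_obj.seek(0)

    # 10 MiB chunk size (a multiple of 256 KB), as suggested in the answer.
    blob = bucket.blob(blob_name, chunk_size=10 * 1024 * 1024)
    blob.upload_from_file(file_obj)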

1 Comment

This helps, but is not a solution to this kind of problem.
