7

I need to save some big arrays once and load them multiple times in a Flask application with Python 3. I originally stored these arrays on disk with the json library. To speed this up, I used Redis on the same machine and stored each array serialized as a JSON string. I wonder why I get no improvement (it actually takes more time on the server I use) even though Redis keeps data in RAM. I guess the JSON serialization isn't optimized, but I have no clue how to speed this up:

    import json
    import redis
    import os
    import time

    current_folder = os.path.dirname(os.path.abspath(__file__))
    file_path = os.path.join(current_folder, "my_file")

    my_array = [1]*10000000

    with open(file_path, 'w') as outfile:
        json.dump(my_array, outfile)

    start_time = time.time()
    with open(file_path, 'r') as infile:
        my_array = json.load(infile)
    print("JSON from disk : ", time.time() - start_time)

    r = redis.Redis()
    my_array_as_string = json.dumps(my_array)
    r.set("my_array_as_string", my_array_as_string)

    start_time = time.time()
    my_array_as_string = r.get("my_array_as_string")
    print("Fetch from Redis:", time.time() - start_time)

    start_time = time.time()
    my_array = json.loads(my_array_as_string)
    print("Parse JSON :", time.time() - start_time)

Result:

    JSON from disk :  1.075700044631958
    Fetch from Redis: 0.078125
    Parse JSON :      1.0247752666473389

EDIT: it seems that fetching from Redis is actually fast, but the JSON parsing is quite slow. Is there a way to fetch an array directly from Redis without the JSON serialization step? This is what we do with pyMySQL and it is fast.

9
  • Off the top of my head I'd say that the disk version is artificially fast due to disk caching. See here, for example. Writing good benchmarks is hard. Commented Sep 13, 2018 at 7:10
  • I load almost 10 gigabytes of data on a Linux machine with 196 GB of RAM. Do you think the OS caches most of this data? Commented Sep 13, 2018 at 7:32
  • "Usually, all physical memory not directly allocated to applications is used by the operating system for the page cache." Commented Sep 13, 2018 at 7:47
  • Thanks, I updated my question to be more specific. Redis is actually much faster for accessing the data, but because I store the data as JSON strings, the parsing part is really slow. I'm looking for a way to fetch the data directly into a Python object, as we do with pyMySQL. Commented Sep 13, 2018 at 7:52
  • There's always a translation step between a stream of bytes and an in-memory Python object. That said, JSON is known to be slow so you could always try msgpack or even pickle. Commented Sep 13, 2018 at 8:10

3 Answers 3

21
+50

Update: Nov 08, 2019 - Ran the same test on Python 3.6

Results:

Dump Time: JSON > msgpack > pickle > marshal
Load Time: JSON > pickle > msgpack > marshal
Space: marshal > JSON > pickle > msgpack

+---------+-----------+-----------+-------+
| package | dump time | load time | size  |
+---------+-----------+-----------+-------+
| json    | 0.00134   | 0.00079   | 30049 |
| pickle  | 0.00023   | 0.00019   | 20059 |
| msgpack | 0.00031   | 0.00012   | 10036 |
| marshal | 0.00022   | 0.00010   | 50038 |
+---------+-----------+-----------+-------+

Original answer (Python 2): I tried pickle vs json vs msgpack vs marshal.

Pickle is much, much slower than JSON in this run. And msgpack is at least 4x faster than JSON. MsgPack looks like the best option you have.

Edit: Tried marshal also. Marshal is faster than JSON, but slower than msgpack when loading.

Load time taken: Pickle > JSON > Marshal > MsgPack
Space taken: Marshal > Pickle > JSON > MsgPack

    import time
    import json
    import pickle
    import msgpack
    import marshal
    import sys

    array = [1]*10000

    start_time = time.time()
    json_array = json.dumps(array)
    print "JSON dumps: ", time.time() - start_time
    print "JSON size: ", sys.getsizeof(json_array)
    start_time = time.time()
    _ = json.loads(json_array)
    print "JSON loads: ", time.time() - start_time

    # --------------

    start_time = time.time()
    pickled_object = pickle.dumps(array)
    print "Pickle dumps: ", time.time() - start_time
    print "Pickle size: ", sys.getsizeof(pickled_object)
    start_time = time.time()
    _ = pickle.loads(pickled_object)
    print "Pickle loads: ", time.time() - start_time

    # --------------

    start_time = time.time()
    package = msgpack.dumps(array)
    print "Msg Pack dumps: ", time.time() - start_time
    print "MsgPack size: ", sys.getsizeof(package)
    start_time = time.time()
    _ = msgpack.loads(package)
    print "Msg Pack loads: ", time.time() - start_time

    # --------------

    start_time = time.time()
    m_package = marshal.dumps(array)
    print "Marshal dumps: ", time.time() - start_time
    print "Marshal size: ", sys.getsizeof(m_package)
    start_time = time.time()
    _ = marshal.loads(m_package)
    print "Marshal loads: ", time.time() - start_time

Result:

    JSON dumps:  0.000760078430176
    JSON size:  30037
    JSON loads:  0.000488042831421
    Pickle dumps:  0.0108790397644
    Pickle size:  40043
    Pickle loads:  0.0100247859955
    Msg Pack dumps:  0.000202894210815
    MsgPack size:  10040
    Msg Pack loads:  7.58171081543e-05
    Marshal dumps:  0.000118017196655
    Marshal size:  50042
    Marshal loads:  0.000118970870972
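For anyone rerunning this on Python 3, here is a minimal sketch of the same comparison. It is not the script that produced the numbers above. The helper names `benchmark` and `run_benchmarks` are made up for illustration, it keeps the best of a few repeats instead of a single run, it reports `len(payload)` (the serialized byte count) rather than `sys.getsizeof`, and it treats msgpack as optional because it is a third-party package.

```python
import json
import marshal
import pickle
import time

try:
    import msgpack  # third-party; optional in this sketch
except ImportError:
    msgpack = None


def benchmark(name, dumps, loads, data, repeat=5):
    """Return (name, best dump seconds, best load seconds, payload bytes)."""
    dump_times, load_times = [], []
    payload = dumps(data)
    for _ in range(repeat):
        start = time.perf_counter()
        payload = dumps(data)
        dump_times.append(time.perf_counter() - start)

        start = time.perf_counter()
        loads(payload)
        load_times.append(time.perf_counter() - start)
    return name, min(dump_times), min(load_times), len(payload)


def run_benchmarks(data):
    """Benchmark every available serializer on the same data."""
    codecs = [
        # json.loads accepts bytes on Python 3.6+
        ("json", lambda obj: json.dumps(obj).encode("utf-8"), json.loads),
        ("pickle",
         lambda obj: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL),
         pickle.loads),
        ("marshal", marshal.dumps, marshal.loads),
    ]
    if msgpack is not None:
        codecs.append(("msgpack", msgpack.packb, msgpack.unpackb))
    return [benchmark(name, dumps, loads, data) for name, dumps, loads in codecs]


if __name__ == "__main__":
    for name, dump_s, load_s, size in run_benchmarks([1] * 10000):
        print(f"{name:8s} dump {dump_s:.6f}s  load {load_s:.6f}s  {size} bytes")
```

Timings will differ between machines and Python versions, so treat the ordering as something to measure on your own data rather than a fixed ranking.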

2 Comments

Indeed, msgpack is about 4x faster. I'll wait a bit since I was looking for a more generic answer, but your answer is of great help. Fetch from Redis: 0.023797988891601562, Parse msgpack: 0.17844223976135254
Judging from your print statements, you used Python 2, where pickle is slow and you are advised to use the C version with 'import cPickle as pickle'. On Python 3.7, I get the following save and load times:
  - Using json: 0.739 + 0.584 ms, 30049 bytes.
  - Using ujson: 0.265 + 0.136 ms, 20050 bytes.
  - Using pickle: 0.188 + 0.132 ms, 20059 bytes.
  - Using msgpack: 0.317 + 0.059 ms, 10036 bytes.
  - Using marshal: 0.154 + 0.081 ms, 50038 bytes.
Of course, if you are storing large homogeneous arrays, use numpy and pickle:
  - Numpy array using pickle: 0.016 + 0.000 ms, 40192 bytes.
2

Some explanation:

  1. Loading data from disk doesn't always mean disk access. Often the data is returned from the in-memory OS cache, and when that happens it is even faster than getting the data from Redis (no network communication in the total time).

  2. The main performance killer is JSON parsing (Captain Obvious).

  3. JSON parsing from disk is most likely done in parallel with data loading (from the file stream).

  4. There is no option to parse from a stream with Redis (at least I don't know of such an API).


You may speed up the app with minimal changes just by storing your cache files on tmpfs. That is quite close to a Redis setup on the same server.
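A minimal sketch of the tmpfs idea, assuming `/dev/shm` is a tmpfs mount (true on most Linux systems, but check yours). The names `CACHE_DIR`, `save_cache` and `load_cache` are hypothetical. The code keeps the JSON format from the question, so the only change is where the file lives.

```python
import json
import os
import tempfile

# Assumption: /dev/shm is a tmpfs mount, as on most Linux systems.
# Fall back to the regular temp directory elsewhere.
CACHE_DIR = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()


def save_cache(name, obj, cache_dir=CACHE_DIR):
    """Write obj as JSON to cache_dir/name and return the file path."""
    path = os.path.join(cache_dir, name)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as outfile:
        json.dump(obj, outfile)
    # Rename into place so readers never see a partially written file.
    os.replace(tmp_path, path)
    return path


def load_cache(name, cache_dir=CACHE_DIR):
    """Read back an object previously written by save_cache."""
    with open(os.path.join(cache_dir, name), "r") as infile:
        return json.load(infile)
```

Note that this only removes the disk from the picture. The JSON parsing cost stays the same, and files on tmpfs disappear on reboot.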

I agree with @RoopakANelliat: msgpack is about 4x faster than JSON. Changing the format will boost parsing performance (if that is possible).
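As one illustration of a format change (not something measured in this thread), a list of integers can be stored as raw bytes with the standard library `array` module, which avoids text parsing entirely. This is a sketch under assumptions: every value fits in a signed 64-bit integer, the bytes are read back on a machine with the same byte order, and `client` is any object with redis-py style `set(name, value)` and `get(name)` methods. The helper names are hypothetical.

```python
from array import array


def pack_ints(values):
    """Serialize a list of integers as raw signed 64-bit values."""
    return array("q", values).tobytes()


def unpack_ints(payload):
    """Rebuild the list of integers from bytes produced by pack_ints."""
    numbers = array("q")
    numbers.frombytes(payload)
    return numbers.tolist()


def store_ints(client, key, values):
    # Assumption: client exposes a redis-py style set(name, value).
    client.set(key, pack_ints(values))


def fetch_ints(client, key):
    # Assumption: client.get returns bytes, as redis-py does by default.
    return unpack_ints(client.get(key))
```

With a real server this would be called as `store_ints(redis.Redis(), "my_array", my_array)`. It only works for homogeneous numeric data; for nested or mixed structures, msgpack or pickle remain the more general options.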


1

I made brain-plasma specifically for this reason - fast loading and reloading of big objects in a Flask app. It's a shared-memory object namespace for Apache Arrow-serializable objects, including pickle'd bytestrings generated by pickle.dumps(...).

    $ pip install brain-plasma
    $ plasma_store -m 10000000 -s /tmp/plasma  # 10MB memory

    from brain_plasma import Brain
    brain = Brain()
    brain['a'] = [1]*10000
    brain['a']
    # >>> [1,1,1,1,...]

