
I have a python script that is working with a large (~14gb) textfile. I end up with a dictionary of keys and values, but I am getting a memory error when I try to sort the dictionary by value.

I know the dictionary is too big to load into memory and then sort, but how could I go about accomplishing this?

  • which version of python are you using? Commented Mar 26, 2016 at 8:44
  • I'm not sure how useful this'll prove, but check this link: stackoverflow.com/questions/14262433/… .. Can you use pandas for your purpose? Commented Mar 26, 2016 at 8:45
  • @amirouche python 2.7 Commented Mar 26, 2016 at 8:49
  • How about using a database like built-in sqlite3? Even the simplest database will easily cope with 14G of data, while you will have to re-invent wheels over wheels... Commented Mar 26, 2016 at 8:52
  • You can use a NoSQL database such as mongodb; this would save you the trouble of defining a database schema. Commented Mar 26, 2016 at 8:54

1 Answer


You can use an ordered key/value store like wiredtiger, leveldb, or bsddb. All of them keep keys ordered, and some support custom sort functions. leveldb is the easiest to use, but if you are on Python 2.7, bsddb is included in the stdlib. If you only need lexicographic ordering, you can use the btopen function (which opens a B-tree database whose keys stay sorted on disk; hashopen, by contrast, gives no ordering guarantee) to get a persistent sorted dictionary:

    from bsddb import btopen

    db = btopen('dict.db')
    db['020'] = 'twenty'
    db['002'] = 'two'
    db['value'] = 'value'
    db['key'] = 'key'
    print(db.keys())

This outputs

    ['002', '020', 'key', 'value']

Don't forget to close the db after your work:

    db.close()

Mind that the btopen defaults might not suit your needs. In that case, I recommend leveldb, which has a simple API, or wiredtiger for speed.

To order by value in bsddb, you have to use the composite key pattern (also called key composition), which boils down to building a database key that preserves the ordering you want. In this example we pack the original dict value first (so that small values appear first), followed by the original dict key (so that the bsddb key stays unique):

    import struct
    from bsddb import btopen

    my_dict = {'a': 500, 'abc': 100, 'foobar': 1}

    # insert
    db = btopen('dict.db')
    for key, value in my_dict.iteritems():
        composite_key = struct.pack('>Q', value) + key
        db[composite_key] = ''  # the value is not useful in this case, but required
    db.close()

    # read
    db = btopen('dict.db')
    for key, _ in db.iteritems():  # iterate over the database in key order
        size = struct.calcsize('>Q')
        # unpack the composite key back into value and original key
        value, key = key[:size], key[size:]
        value = struct.unpack('>Q', value)[0]
        print key, value
    db.close()

This outputs the following:

    foobar 1
    abc 100
    a 500

1 Comment

Cool, thanks for all the info. I decided to go the route of sqlite, since I could write regular SQL from Python, which made it easier for me to immediately see how to go about it. I am sure these solutions are great too. Thanks for pointing me in the right direction.
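For reference, the sqlite route mentioned here can be sketched roughly as follows: stream the key/value pairs into an indexed table and let the database sort on disk instead of in memory. The table and column names are made up for illustration, and a small in-memory dict stands in for the 14 GB file:

```python
import sqlite3

# Use a file path (e.g. 'dict.db') for real data; ':memory:' keeps this demo self-contained.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value INTEGER)')

# In the real script you would insert rows while scanning the large file,
# committing in batches instead of ever holding the whole dict in memory.
my_dict = {'a': 500, 'abc': 100, 'foobar': 1}
conn.executemany('INSERT INTO kv VALUES (?, ?)', my_dict.items())
conn.commit()

# An index on value lets SQLite walk the rows in sorted order
# without loading the whole table at once.
conn.execute('CREATE INDEX kv_value_idx ON kv (value)')
rows = list(conn.execute('SELECT key, value FROM kv ORDER BY value'))
print(rows)  # [('foobar', 1), ('abc', 100), ('a', 500)]
conn.close()
```

In practice you would iterate over the SELECT cursor row by row rather than materializing the result in a list, so the sorted output is also streamed.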
