
I am trying to deal with a super-massive NetworkX Graph object with hundreds of millions of nodes. I'd like to be able to write it to a file so that it doesn't consume all my machine's memory. However, I constantly need to search across existing nodes, update edges, etc.

Is there a good solution for this? I'm not sure how it would work with any of the file formats provided on http://networkx.lanl.gov/reference/readwrite.html

The only solution I can think of is to store each node as a separate file, with references to other nodes in the filesystem - that way, opening one node for examination doesn't overload memory. Is there an existing system for large amounts of data (e.g. PyTables) that does this without my writing my own boilerplate code?

3 Answers


First try pickle; it's designed to serialize arbitrary objects.

An example of creating a DiGraph and serializing to a file:

    import pickle
    import networkx as nx

    dg = nx.DiGraph()
    dg.add_edge('a', 'b')
    dg.add_edge('a', 'c')
    pickle.dump(dg, open('/tmp/graph.txt', 'wb'))

An example of loading a DiGraph from a file:

    import pickle
    import networkx as nx

    dg = pickle.load(open('/tmp/graph.txt', 'rb'))
    print(dg.edges())

Output:

    [('a', 'b'), ('a', 'c')]

If this isn't efficient enough, I would write a custom routine to serialize:

  1. edges and
  2. nodes (in case a node is incident to no edges).

Note that list comprehensions, where applicable, may be much more efficient than standard for loops.
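A minimal sketch of such a routine, serializing edges plus any isolated nodes to a plain text file. The function names and the "n "-prefix line format are illustrative choices, not from any library; the nodes/edges arguments would come from g.nodes() and g.edges() on a NetworkX graph:

```python
import io

def dump_graph(nodes, edges, fh):
    # One line per edge ("u v"), then isolated nodes prefixed with "n ".
    lines = ["%s %s" % (u, v) for u, v in edges]  # list comprehension per the note above
    touched = {n for edge in edges for n in edge}
    lines += ["n %s" % n for n in nodes if n not in touched]
    fh.write("\n".join(lines))

def load_graph(fh):
    # Returns (edges, isolated_nodes); feed these to g.add_edges_from / g.add_nodes_from.
    edges, isolated = [], []
    for line in fh.read().splitlines():
        if line.startswith("n "):
            isolated.append(line[2:])
        else:
            u, v = line.split()
            edges.append((u, v))
    return edges, isolated

buf = io.StringIO()  # stands in for a real file
dump_graph(["a", "b", "c", "d"], [("a", "b"), ("a", "c")], buf)
buf.seek(0)
edges, isolated = load_graph(buf)
```

Because this streams one line at a time, loading never needs more than one edge in memory beyond the graph you are rebuilding.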

If this is not efficient enough, I'd call a C++ routine from within Python: http://docs.python.org/extending/extending.html


10 Comments

+1 pickle is a great thing, never heard about that before, thanks!
Pickle generates MASSIVE files for objects, and if this is already a large network, pickle is almost certainly not going to work. It is a great and underused package for many other reasons though!
First of all, use cPickle; it's much faster. Second, use HIGHEST_PROTOCOL, which will save it in a more efficient binary format.
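(In Python 3, cPickle became the C implementation behind the standard pickle module, so only the protocol choice remains relevant. A minimal sketch; the dict here is just a stand-in for a graph object:)

```python
import pickle

graph = {"a": ["b", "c"], "b": [], "c": []}  # stand-in for a graph object

# HIGHEST_PROTOCOL selects the most efficient binary format available.
data = pickle.dumps(graph, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(data)
```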
@ericmjl good question: It will. One of the lines in the serialized file specifies the object type (for the example provided, the line says (cnetworkx.classes.digraph).
If you get the error TypeError: write() argument must be str, not bytes, try this instead: pickle.dump(G, open('filename.pickle', 'wb')) to save and G = pickle.load(open('filename.pickle', 'rb')) to load. Note the 'wb' and 'rb' options to open() used here.

If you've built this as a NetworkX graph, then it will already be in memory. For this large of a graph, my guess is you'll have to do something similar to what you suggested with separate files. But, instead of using separate files, I'd use a database to store each node with many-to-many connections between nodes. In other words you'd have a table of nodes, and a table of edges, then to query for the neighbors of a particular node you could just query for any edges that have that particular node on either end. This should be fast, though I'm not sure if you'll be able to take advantage of NetworkX's analysis functions without first building the whole network in memory.
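A minimal sketch of this two-table layout with SQLite (any database would do; the table and index names are illustrative). With an index on each end of the edge table, fetching a node's neighbors touches only the matching rows, not the whole graph:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # a file path here would keep the graph on disk
con.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
con.execute("CREATE INDEX idx_src ON edges (src)")
con.execute("CREATE INDEX idx_dst ON edges (dst)")
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [("a", "b"), ("a", "c"), ("b", "c")])

def neighbors(node):
    # An edge touching `node` on either end contributes its other endpoint.
    rows = con.execute(
        "SELECT dst FROM edges WHERE src = ? "
        "UNION SELECT src FROM edges WHERE dst = ?", (node, node))
    return sorted(r[0] for r in rows)
```

For example, neighbors("a") returns ['b', 'c'], and neighbors("c") returns ['a', 'b'], each via two indexed lookups.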

3 Comments

Thanks Luis. Essentially I'm storing in a database. However, querying nodes to fetch neighbors is extremely expensive. I can only imagine what Google's servers are like...
If the graph is already in RAM, then why would serializing it be a problem? (disk space is cheaper than RAM) Or does NetworkX have some sort of internal method that compresses the representation, and would balloon during serialization? I am curious.
I think the question isn't focused on serializing as much as saving it in a structure that will allow efficient querying. That is where my suggestion for a database came from.

I forgot what problem I came to StackOverflow to solve originally, but I stumbled on this question and (nearly a decade too late!) can recommend Grand, a networkx-like library we wrote to solve exactly this problem:

Before

    import networkx as nx

    g = nx.DiGraph()
    g.add_edge("A", "B")
    print(len(g.edges()))

After

    import grand
    from grand.backends import SQLBackend  # or choose another!

    g = grand.Graph(backend=SQLBackend())
    g.nx.add_edge("A", "B")
    print(len(g.nx.edges()))

The API is the same as NetworkX, but the data live in SQL, DynamoDB, etc.

2 Comments

Does this code "write to file"? and "read from file"?
It does, if you use a file-based backend. It can also read and write to a database or in-memory storage!
