Graph / Network Analytics
April 28th, 2009
Graph / network data structures are very useful. A number of tools and APIs have been written to work with such structures. Many of these include functions that can be run on the graphs. A function to calculate node degree is an example. Unfortunately most of the APIs require that the entire graph be loaded into memory before any such functions can be called. This presents a significant limitation on the size of the graph that can be analyzed. Examples include:
- igraph: This looks to be a great C API with all sorts of useful functions. Unfortunately as the documentation notes: “The rule of thumb is that if your graph fits into the physical memory then igraph can handle it.”
- Boost Graph Library: As a Boost library BGL uses the C++ STL to represent graphs and includes a number of functions for analytics. These include the vector, list and set which are by default if not by definition memory-only containers.
- Prefuse: A Java API, Prefuse uses a TupleSet interface to vertices and edges. These are by default implemented with tables in memory.
- JUNG: Another Java API with a terrific set of algorithms for graph analytics. Perhaps it would be useful to compare Prefuse and JUNG to see if the back-end storage mechanisms could be unified.
Relational databases are mature technologies that manage the streaming of data on disk through result sets in memory in a very fast and efficient manner. One option for extending graph analytics to larger graphs would be to take advantage of relational databases for the storage rather than relying on containers in memory. BGL appears to offer some potential for this. The key would be to find a C++ container that both exposes the necessary interface for a BGL adjacency list and also uses a relational database for storage and retrieval. The other APIs might be extended in this manner as well by swapping out the vertex (node) and edge lists with some dynamic container that uses a DBMS and efficiently cached result set well. For Prefuse the trick would be to find a JDBC container that could be wrapped with a TupleSet interface.