Well, that was a bit of a gap in blogging terms.
What happened? Well, I took up a job at the Politecnico di Torino, I got rather busy, and … oh, details can wait. In the meantime, I thought I’d share some recent fun stuff that I’ve been doing with the D programming language — a small and currently very basic library for complex graphs.
Complex networks are something that I’ve been working close to professionally for a long time and yet, oddly enough, I’ve never particularly gone into the dark details of coding them. This is most likely because, absent some pressing need, many of the basic things one wants from a graph library are easier to knock up on the fly. It takes much less time and effort to write something quick’n’easy that does the handful of things you want, than to bend your otherwise completely original code to the needs of some specific library solution.
Sometimes, though, you want the heavy-duty toolbox, and when this comes along you have a number of options. There’s the Boost Graph Library, an STL-style C++ generic programming solution; there’s NetworkX, an attractive Python library; and then there’s the serious big hitter, igraph, which is written in C but also has interfaces in Python and R.
igraph in particular is known for being lightning-fast and capable of scaling to very large graphs indeed. It also has a very wide ranging set of different tools for network analysis, and this makes it an obvious choice for any use case involving research on large scale networks. On the other hand, it suffers from the inevitable problem that all complex programs written in C succumb to: its code is extremely verbose and difficult to follow, and its developers have had to implement and maintain custom versions of a great many ‘standard’ algorithms and tools that other languages give you out of the box.
Contrast this with the D programming language, which I’ve been using intensively and extensively over the past year. Without wanting to go into too many details, D delivers all the power, flexibility and elegance that we’re used to from higher level languages, while offering program speeds that are in the ballpark with C/C++. It’s a joy to use with scientific programming, because that simulation that might have taken a ton of code to write in C, or involved a finnicky debugging process in C++, can suddenly be written briefly and clearly and yet still delivers the performance you need; far more of your time can be spent actually generating results from your code, rather than writing and debugging it.
So, faced with the conclusion that I really did need some seriously thought out graph code to move forward on certain problems, I was faced with a quandary. I wanted to use D, but that would mean having to write hooks for one of the existing graph libraries. igraph was the obvious choice — it’s fairly straightforward in principle to use a C library from D — but on the other hand, given how extensive igraph’s codebase is, I was worried that writing wrappers would in practice be almost as complex as writing a custom solution from scratch.
That was an irritation — and then it became an inspiration. Why not have a play with the data structures and algorithms of igraph and see if they could find a friendly home rewritten in D? This would be a nice baptism-of-fire test for the language’s performance capabilities, and it would also be interesting to see whether the elegant surface of idiomatic D code could be preserved in the face of the need for speed and memory efficiency. And so, last week, my little Dgraph library was born.
* * *
Now, I should emphasize up front that Dgraph is not so much in its infancy as at the point of first post-conception cell division. The reason I’m writing now is that I think it’s instructive how readily it was possible to get quick and relatively elegant code up and running and with good performance. Let’s start with the basics: the core data structure of the graph library.
igraph’s core data structure is a model of simplicity:
1 2 3 4 5 6 7 8 9 10 11 |
typedef struct igraph_s { igraph_integer_t n; igraph_bool_t directed; igraph_vector_t from; igraph_vector_t to; igraph_vector_t oi; igraph_vector_t ii; igraph_vector_t os; igraph_vector_t is; void *attr; } igraph_t; |
What’s going on here? n
is the number of nodes, and directed
unsurprisingly indicates if the graph is directed or not. from
and to
contain respectively the ‘head’ and ‘tail’ vertices of graph edges — that is, the i
’th edge goes from vertex from[i]
to vertex to[i]
. Vertex IDs are assumed to be numbers in the range 0, 1, …, |V| – 1, where |V| is the total number of vertices.
Now it gets more interesting: oi
is the set of edge index values 0, 1, …, |E| – 1 (where |E| is the total number of edges), sorted according to the value of from
. That is, if we iterate over from[oi[0]]
, from[oi[1]]
, …, we’ll get an increasing sequence of vertex IDs; and the same with respect to ii
, except now we’re sorting according to to
. In other words, oi
and ii
give us the sequence of edges sorted either to the vertices from which they are outgoing, or the vertices to which they are incoming. Obviously oi
and ii
both have the same length as from
and to
, that is, the total number of edges in the graph.
os
and is
by contrast represent sums of the numbers of outgoing or incoming edges. More precisely, os[v]
is the total number of edges whose from
value is less than v
, while is[v]
is the total number of edges whose to
value is less than v
. So, if we want the total number of vertices outgoing from v
, it’s given by os[v + 1] - os[v]
, while the total number of incoming vertices is given by is[v + 1] - is[v]
. For this to work, os
and is
need to have a total length equal to |V| + 1.
Taken together, these elements take up minimal space in memory yet allow us to elegantly reconstruct other graph properties, such as a vertex’s degree or its outgoing or incoming neighbours. To see why, it may be best to switch to D — where we can most simply represent the transformations involved. Let’s start with a simple rewrite of the basic data structure:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
final class Graph(bool dir) { private: size_t[] _head; size_t[] _tail; size_t[] _indexHead; size_t[] _indexTail; size_t[] _sumHead = [0]; size_t[] _sumTail = [0]; public: enum bool directed = dir; // ... we'll get to the rest of it later ... } |
We’ve done away with the number of vertices, because we can calculate that from array lengths. from
and to
have been renamed as _head
and _tail
, partly to avoid potential naming clashes, but mainly because I like visualizing my graph edges as snakes or worms or other wriggly things (and now you have this lovely image in your head too: you’re welcome:-). _indexHead
and _indexTail
are, unsurprisingly, edge indices sorted according to head or tail values; and _sumHead
and _sumTail
are the sums of edges with head or tail values less than the corresponding value.
We also have a boolean variable that indicates if the graph is directed or not. If a graph is not directed (that is, directed == false
), then we assume a restriction that _head[e] <= _tail[e]
for all edges e
.
Let’s start with the simple stuff. The total number of vertices can be calculated from the lengths of _sumHead
or _sumTail
; we’ll pick the first, and throw in a nice D-style assert()
statement to make sure that the two actually are identical:
1 2 3 4 5 |
size_t vertexCount() @property const pure nothrow { assert(_sumHead.length == _sumTail.length); return _sumHead.length - 1; } |
What about vertex degrees, i.e. the number of edges to which a vertex is connected? In the case of a directed graph, it should be clear that the total number of edges incoming to node v
will be equal to the total number of edges incoming to nodes with id less than v + 1
less the total number of edges incoming to nodes with id less than v
, and correspondingly for outgoing edges:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
static if (directed) { size_t degreeIn(immutable size_t v) const pure nothrow { assert(v + 1 < _sumTail.length); return _sumTail[v + 1] - _sumTail[v]; } size_t degreeOut(immutable size_t v) const pure nothrow { assert(v + 1 < _sumHead.length); return _sumHead[v + 1] - _sumHead[v]; } } |
… while for an undirected graph, where incoming edges are also outgoing edges and vice versa, we have to combine these numbers:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
else { size_t degree(immutable size_t v) const pure nothrow { assert(v + 1 < _sumHead.length); assert(_sumHead.length == _sumTail.length); return (_sumHead[v + 1] - _sumHead[v]) + (_sumTail[v + 1] - _sumTail[v]); } alias degreeIn = degree; alias degreeOut = degree; } |
We still define in- and out-degree functions but this time just alias them to the main degree()
function. If you’re familiar with D’s syntax, you will have noticed that this static if () { ... } else { ... }
formulation is one simple application of D’s generic programming interface: the static if
will be evaluated at compile time so that a directed or undirected graph will have only the functions it’s meant to have.
What about vertices’ neighbours in the graph? Let’s start once more with the directed case. Suppose we want to find the incoming edges to a vertex v
, i.e., all the edges for which v
is the tail. Well, we know that there are _sumTail[v]
edges whose tail is less than v
, and _sumTail[v + 1]
edges whose tail is less than v + 1
(that is, less than or equal to v
). So, if we label the vertices in order of increasing tail value, we know that edges 0, 1, 2, ..., _sumTail[v] - 1
will have tail less than v, while edges _sumTail[v], ..., _sumTail[v + 1] - 1
will have tail equal to v
(assuming that _sumTail[v] < _sumTail[v + 1]
— if not, there are no neighbours!). And it just so happens that we have a corresponding sorted index of the edges in the form of _indexTail
, so taking that, and using a similar process for outgoing neighbours, we have:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
static if (directed) { auto neighboursIn(immutable size_t v) const { return iota(_sumTail[v], _sumTail[v + 1]).map!(a => _head[_indexTail[a]]); } auto neighboursOut(immutable size_t v) const { return iota(_sumHead[v], _sumHead[v + 1]).map!(a => _tail[_indexHead[a]]); } } else { auto neighbours(immutable size_t v) const { return chain(iota(_sumTail[v], _sumTail[v + 1]).map!(a => _head[_indexTail[a]]), iota(_sumHead[v], _sumHead[v + 1]).map!(a => _tail[_indexHead[a]])); } alias neighbors = neighbours; alias neighboursIn = neighbours; alias neighboursOut = neighbours; } |
Some explanation may be in order. iota(m, n)
gives us the range of values m, m + 1, m + 2, ..., n - 1
. For incoming neighbours, we’re mapping that range via the transformation a => _indexTail[a] => _head[_indexTail[a]]
. That is, we’re mapping the sort order of the vertex to the edge index value, and from that to the tail value. In other words, we’re mapping to the vertex IDs that correspond to the heads of the edges for which v
is the tail. Similarly, for outgoing neighbours, we map to the vertex IDs that correspond to the tails of edges for which v
is the head — and for undirected networks, we chain the two lists together. (Besides the wriggly snake/worm imagery, that’s another reason I chose to refer to ‘head’ and ‘tail’: in an undirected network, links are bidirectional so it doesn’t make sense to speak of ‘from’ and ‘to’.)
Notice how easy this is to write down, and how cheap to calculate. We’re not even storing the values of the neighbours in memory — they can be calculated lazily as they are needed! Ultimately this lazy calculation may come back to bite us in some extreme performance situations, but it’s brief, elegant and effective.
Put all of that together, and what have we got?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 |
final class Graph(bool dir) { private: size_t[] _head; size_t[] _tail; size_t[] _indexHead; size_t[] _indexTail; size_t[] _sumHead = [0]; size_t[] _sumTail = [0]; public: enum bool directed = dir; auto edge() @property const pure nothrow { return zip(_head, _tail); } size_t edgeCount() @property const pure nothrow { assert(_head.length == _tail.length); return _head.length; } size_t vertexCount() @property const pure nothrow { assert(_sumHead.length == _sumTail.length); return _sumHead.length - 1; } static if (directed) { size_t degreeIn(immutable size_t v) const pure nothrow { assert(v + 1 < _sumTail.length); return _sumTail[v + 1] - _sumTail[v]; } size_t degreeOut(immutable size_t v) const pure nothrow { assert(v + 1 < _sumHead.length); return _sumHead[v + 1] - _sumHead[v]; } } else { size_t degree(immutable size_t v) const pure nothrow { assert(v + 1 < _sumHead.length); assert(_sumHead.length == _sumTail.length); return (_sumHead[v + 1] - _sumHead[v]) + (_sumTail[v + 1] - _sumTail[v]); } alias degreeIn = degree; alias degreeOut = degree; } static if (directed) { auto neighboursIn(immutable size_t v) const { return iota(_sumTail[v], _sumTail[v + 1]).map!(a => _head[_indexTail[a]]); } auto neighboursOut(immutable size_t v) const { return iota(_sumHead[v], _sumHead[v + 1]).map!(a => _tail[_indexHead[a]]); } } else { auto neighbours(immutable size_t v) const { return chain(iota(_sumTail[v], _sumTail[v + 1]).map!(a => _head[_indexTail[a]]), iota(_sumHead[v], _sumHead[v + 1]).map!(a => _tail[_indexHead[a]])); } alias neighbors = neighbours; alias neighboursIn = neighbours; alias neighboursOut = neighbours; } alias neighborsIn = neighboursIn; alias neighborsOut = neighboursOut; } |
You’ll notice I sneakily threw in a couple more functions — edgeCount
does exactly what it says on the tin, while edge
returns a list of edges arranged as (head, tail) pairs. Oh, and (this might be a bad idea:-) I threw in a few spelling alternatives for those in the world who have heinously departed from the mother tongue … ;-)
What I’ve presented here is missing a bunch of other aspects of the implementation — for example how to add or remove vertices or edges from the graph, how to find the edge ID of a given (head, tail) pair, or to find out if there is even such an edge in the first place — and we haven’t even begun to address performance yet! Don’t worry, we’ll get to them in future blog posts. But what’s striking for me is that here we’ve implemented the core data structure and query operations of a super-efficient graph representation in barely 80 lines of code. In C, it takes more than that just to implement the function to determine a vertex’s neighbours.
[… interested in finding the source? Dgraph is on GitHub!]
EDIT 17-07-2013: I’ve tweaked the examples using map
in the code so they now use UFCS chaining, i.e. they have the form input.map!()
rather than map!()(input)
. This brings this article in line with what’s now in the codebase.