Well, that was a bit of a gap in blogging terms.

What happened? Well, I took up a job at the Politecnico di Torino, I got rather busy, and … oh, details can wait. In the meantime, I thought I’d share some recent fun stuff that I’ve been doing with the D programming language — a small and currently very basic library for complex graphs.

Complex networks are something that I’ve been working close to professionally for a long time and yet, oddly enough, I’ve never particularly gone into the dark details of coding them. This is most likely because, absent some pressing need, many of the basic things one wants from a graph library are easier to knock up on the fly. It takes much less time and effort to write something quick’n’easy that does the handful of things you want, than to bend your otherwise completely original code to the needs of some specific library solution.

Sometimes, though, you want the heavy-duty toolbox, and when this comes along you have a number of options. There’s the Boost Graph Library, an STL-style C++ generic programming solution; there’s NetworkX, an attractive Python library; and then there’s the serious big hitter, igraph, which is written in C but also has interfaces in Python and R.

igraph in particular is known for being lightning-fast and capable of scaling to very large graphs indeed. It also has a very wide ranging set of different tools for network analysis, and this makes it an obvious choice for any use case involving research on large scale networks. On the other hand, it suffers from the inevitable problem that all complex programs written in C succumb to: its code is extremely verbose and difficult to follow, and its developers have had to implement and maintain custom versions of a great many ‘standard’ algorithms and tools that other languages give you out of the box.

Contrast this with the D programming language, which I’ve been using intensively and extensively over the past year. Without wanting to go into too many details, D delivers all the power, flexibility and elegance that we’re used to from higher level languages, while offering program speeds that are in the ballpark with C/C++. It’s a joy to use with scientific programming, because that simulation that might have taken a ton of code to write in C, or involved a finnicky debugging process in C++, can suddenly be written briefly and clearly and yet still delivers the performance you need; far more of your time can be spent actually generating results from your code, rather than writing and debugging it.

So, faced with the conclusion that I really did need some seriously thought out graph code to move forward on certain problems, I was faced with a quandary. I wanted to use D, but that would mean having to write hooks for one of the existing graph libraries. igraph was the obvious choice — it’s fairly straightforward in principle to use a C library from D — but on the other hand, given how extensive igraph’s codebase is, I was worried that writing wrappers would in practice be almost as complex as writing a custom solution from scratch.

That was an irritation — and then it became an inspiration. Why not have a play with the data structures and algorithms of igraph and see if they could find a friendly home rewritten in D? This would be a nice baptism-of-fire test for the language’s performance capabilities, and it would also be interesting to see whether the elegant surface of idiomatic D code could be preserved in the face of the need for speed and memory efficiency. And so, last week, my little Dgraph library was born.

* * *

Now, I should emphasize up front that Dgraph is not so much in its infancy as at the point of first post-conception cell division. The reason I’m writing now is that I think it’s instructive how readily it was possible to get quick and relatively elegant code up and running and with good performance. Let’s start with the basics: the core data structure of the graph library.

igraph’s core data structure is a model of simplicity:

1 2 3 4 5 6 7 8 9 10 11 |
typedef struct igraph_s { igraph_integer_t n; igraph_bool_t directed; igraph_vector_t from; igraph_vector_t to; igraph_vector_t oi; igraph_vector_t ii; igraph_vector_t os; igraph_vector_t is; void *attr; } igraph_t; |

What’s going on here? `n`

is the number of nodes, and `directed`

unsurprisingly indicates if the graph is directed or not. `from`

and `to`

contain respectively the ‘head’ and ‘tail’ vertices of graph edges — that is, the `i`

’th edge goes from vertex `from[i]`

to vertex `to[i]`

. Vertex IDs are assumed to be numbers in the range 0, 1, …, |*V*| – 1, where |*V*| is the total number of vertices.

Now it gets more interesting: `oi`

is the set of edge index values 0, 1, …, |*E*| – 1 (where |*E*| is the total number of edges), sorted according to the value of `from`

. That is, if we iterate over `from[oi[0]]`

, `from[oi[1]]`

, …, we’ll get an increasing sequence of vertex IDs; and the same with respect to `ii`

, except now we’re sorting according to `to`

. In other words, `oi`

and `ii`

give us the sequence of edges sorted either to the vertices from which they are *outgoing*, or the vertices to which they are *incoming*. Obviously `oi`

and `ii`

both have the same length as `from`

and `to`

, that is, the total number of edges in the graph.

`os`

and `is`

by contrast represent *sums* of the numbers of outgoing or incoming edges. More precisely, `os[v]`

is the total number of edges whose `from`

value is less than `v`

, while `is[v]`

is the total number of edges whose `to`

value is less than `v`

. So, if we want the total number of vertices outgoing from `v`

, it’s given by `os[v + 1] - os[v]`

, while the total number of incoming vertices is given by `is[v + 1] - is[v]`

. For this to work, `os`

and `is`

need to have a total length equal to |*V*| + 1.

Taken together, these elements take up minimal space in memory yet allow us to elegantly reconstruct other graph properties, such as a vertex’s degree or its outgoing or incoming neighbours. To see why, it may be best to switch to D — where we can most simply represent the transformations involved. Let’s start with a simple rewrite of the basic data structure:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
final class Graph(bool dir) { private: size_t[] _head; size_t[] _tail; size_t[] _indexHead; size_t[] _indexTail; size_t[] _sumHead = [0]; size_t[] _sumTail = [0]; public: enum bool directed = dir; // ... we'll get to the rest of it later ... } |

We’ve done away with the number of vertices, because we can calculate that from array lengths. `from`

and `to`

have been renamed as `_head`

and `_tail`

, partly to avoid potential naming clashes, but mainly because I like visualizing my graph edges as snakes or worms or other wriggly things (and now you have this lovely image in your head too: you’re welcome:-). `_indexHead`

and `_indexTail`

are, unsurprisingly, edge indices sorted according to head or tail values; and `_sumHead`

and `_sumTail`

are the sums of edges with head or tail values less than the corresponding value.

We also have a boolean variable that indicates if the graph is directed or not. If a graph is *not* directed (that is, `directed == false`

), then we assume a restriction that `_head[e] <= _tail[e]`

for all edges `e`

.

Let’s start with the simple stuff. The total number of vertices can be calculated from the lengths of `_sumHead`

or `_sumTail`

; we’ll pick the first, and throw in a nice D-style `assert()`

statement to make sure that the two actually are identical:

1 2 3 4 5 |
size_t vertexCount() @property const pure nothrow { assert(_sumHead.length == _sumTail.length); return _sumHead.length - 1; } |

What about vertex degrees, i.e. the number of edges to which a vertex is connected? In the case of a directed graph, it should be clear that the total number of edges incoming to node `v`

will be equal to the total number of edges incoming to nodes with id less than `v + 1`

less the total number of edges incoming to nodes with id less than `v`

, and correspondingly for outgoing edges:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
static if (directed) { size_t degreeIn(immutable size_t v) const pure nothrow { assert(v + 1 < _sumTail.length); return _sumTail[v + 1] - _sumTail[v]; } size_t degreeOut(immutable size_t v) const pure nothrow { assert(v + 1 < _sumHead.length); return _sumHead[v + 1] - _sumHead[v]; } } |

… while for an undirected graph, where incoming edges are also outgoing edges and vice versa, we have to combine these numbers:

1 2 3 4 5 6 7 8 9 10 11 12 13 |
else { size_t degree(immutable size_t v) const pure nothrow { assert(v + 1 < _sumHead.length); assert(_sumHead.length == _sumTail.length); return (_sumHead[v + 1] - _sumHead[v]) + (_sumTail[v + 1] - _sumTail[v]); } alias degreeIn = degree; alias degreeOut = degree; } |

We still define in- and out-degree functions but this time just alias them to the main `degree()`

function. If you’re familiar with D’s syntax, you will have noticed that this `static if () { ... } else { ... }`

formulation is one simple application of D’s generic programming interface: the `static if`

will be evaluated at compile time so that a directed or undirected graph will have only the functions it’s meant to have.

What about vertices’ neighbours in the graph? Let’s start once more with the directed case. Suppose we want to find the incoming edges to a vertex `v`

, i.e., all the edges for which `v`

is the tail. Well, we know that there are `_sumTail[v]`

edges whose tail is less than `v`

, and `_sumTail[v + 1]`

edges whose tail is less than `v + 1`

(that is, less than or equal to `v`

). So, if we label the vertices in order of increasing tail value, we know that edges `0, 1, 2, ..., _sumTail[v] - 1`

will have tail less than v, while edges `_sumTail[v], ..., _sumTail[v + 1] - 1`

will have tail *equal* to `v`

(assuming that `_sumTail[v] < _sumTail[v + 1]`

— if not, there are no neighbours!). And it just so happens that we have a corresponding sorted index of the edges in the form of `_indexTail`

, so taking that, and using a similar process for outgoing neighbours, we have:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
static if (directed) { auto neighboursIn(immutable size_t v) const { return iota(_sumTail[v], _sumTail[v + 1]).map!(a => _head[_indexTail[a]]); } auto neighboursOut(immutable size_t v) const { return iota(_sumHead[v], _sumHead[v + 1]).map!(a => _tail[_indexHead[a]]); } } else { auto neighbours(immutable size_t v) const { return chain(iota(_sumTail[v], _sumTail[v + 1]).map!(a => _head[_indexTail[a]]), iota(_sumHead[v], _sumHead[v + 1]).map!(a => _tail[_indexHead[a]])); } alias neighbors = neighbours; alias neighboursIn = neighbours; alias neighboursOut = neighbours; } |

Some explanation may be in order. `iota(m, n)`

gives us the range of values `m, m + 1, m + 2, ..., n - 1`

. For incoming neighbours, we’re mapping that range via the transformation `a => _indexTail[a] => _head[_indexTail[a]]`

. That is, we’re mapping the sort order of the vertex to the edge index value, and from that to the tail value. In other words, we’re mapping to the vertex IDs that correspond to the heads of the edges for which `v`

is the tail. Similarly, for outgoing neighbours, we map to the vertex IDs that correspond to the tails of edges for which `v`

is the head — and for undirected networks, we chain the two lists together. (Besides the wriggly snake/worm imagery, that’s another reason I chose to refer to ‘head’ and ‘tail': in an undirected network, links are bidirectional so it doesn’t make sense to speak of ‘from’ and ‘to’.)

Notice how easy this is to write down, and how cheap to calculate. We’re not even storing the values of the neighbours in memory — they can be calculated lazily as they are needed! Ultimately this lazy calculation may come back to bite us in some extreme performance situations, but it’s brief, elegant and effective.

Put all of that together, and what have we got?

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 |
final class Graph(bool dir) { private: size_t[] _head; size_t[] _tail; size_t[] _indexHead; size_t[] _indexTail; size_t[] _sumHead = [0]; size_t[] _sumTail = [0]; public: enum bool directed = dir; auto edge() @property const pure nothrow { return zip(_head, _tail); } size_t edgeCount() @property const pure nothrow { assert(_head.length == _tail.length); return _head.length; } size_t vertexCount() @property const pure nothrow { assert(_sumHead.length == _sumTail.length); return _sumHead.length - 1; } static if (directed) { size_t degreeIn(immutable size_t v) const pure nothrow { assert(v + 1 < _sumTail.length); return _sumTail[v + 1] - _sumTail[v]; } size_t degreeOut(immutable size_t v) const pure nothrow { assert(v + 1 < _sumHead.length); return _sumHead[v + 1] - _sumHead[v]; } } else { size_t degree(immutable size_t v) const pure nothrow { assert(v + 1 < _sumHead.length); assert(_sumHead.length == _sumTail.length); return (_sumHead[v + 1] - _sumHead[v]) + (_sumTail[v + 1] - _sumTail[v]); } alias degreeIn = degree; alias degreeOut = degree; } static if (directed) { auto neighboursIn(immutable size_t v) const { return iota(_sumTail[v], _sumTail[v + 1]).map!(a => _head[_indexTail[a]]); } auto neighboursOut(immutable size_t v) const { return iota(_sumHead[v], _sumHead[v + 1]).map!(a => _tail[_indexHead[a]]); } } else { auto neighbours(immutable size_t v) const { return chain(iota(_sumTail[v], _sumTail[v + 1]).map!(a => _head[_indexTail[a]]), iota(_sumHead[v], _sumHead[v + 1]).map!(a => _tail[_indexHead[a]])); } alias neighbors = neighbours; alias neighboursIn = neighbours; alias neighboursOut = neighbours; } alias neighborsIn = neighboursIn; alias neighborsOut = neighboursOut; } |

You’ll notice I sneakily threw in a couple more functions — `edgeCount`

does exactly what it says on the tin, while `edge`

returns a list of edges arranged as (head, tail) pairs. Oh, and (this might be a bad idea:-) I threw in a few spelling alternatives for those in the world who have heinously departed from the mother tongue … ;-)

What I’ve presented here is missing a bunch of other aspects of the implementation — for example how to add or remove vertices or edges from the graph, how to find the edge ID of a given (head, tail) pair, or to find out if there is even such an edge in the first place — and we haven’t even begun to address performance yet! Don’t worry, we’ll get to them in future blog posts. But what’s striking for me is that here we’ve implemented the core data structure and query operations of a super-efficient graph representation in barely 80 lines of code. In C, it takes more than that just to implement the function to determine a vertex’s neighbours.

[... interested in finding the source? Dgraph is on GitHub!]

**EDIT 17-07-2013:** I’ve tweaked the examples using `map`

in the code so they now use UFCS chaining, i.e. they have the form `input.map!()`

rather than `map!()(input)`

. This brings this article in line with what’s now in the codebase.

A big than you to the Crayon Syntax Highlighter for the code examples. :-)

Kudos for reverse engineering the igraph data type! Some comments, based on my experience during the past 7 years, after starting igraph.

1 Don’t think that the igraph data type is the best possible, in fact it was never thoroughly benchmarked, simply because it just proved to be fast enough for us back then. But I am sure that there is room for improvement, and in fact I am experimenting with different data types right now.

2 Whichever data type you choose, the most important thing is to keep the API to the data type thin. For us everything that accesses igraph_t is in type_indexededgelist.c, a small file. So it is easy to try alternative data types. It is sometimes a pain to use this thin API when you are implementing algorithms, but it really pays off later.

3 My guess would be that it is important to make the graph data type flat. The current data type with the vectors is optimal in this respect, it is just a couple of vectors, which are continuous chunks in the memory. It is very fast to iterate over these vectors, and very fast to copy them, which is important because I need to copy them in the R wrapper, since R data types are immutable. I am not sure how a fragmented implementation, e.g. a textbook adjacency list with pointers, would do against this, I am testing it nowadays. My guess is that it’ll be much slower, but I am of course not sure.

Best,

Gabor

Thanks Gabor — I had much fun doing this and exploring your great work on igraph :-)

On data types, when I first started looking into this I’d assumed that there must have been some kind of consensus formed by now as to how to optimally solve the problem of representing a graph computationally. It was a surprise to find that that wasn’t really the case, that the different existing graph libraries had quite different approaches, which I guess reflect simply different priorities. What struck me about

`igraph_t`

was that even if it might carry a cost in terms of speed of access to certain graph elements (e.g. having to calculate the list of neighbours), that would be repaid in terms of scalability.The main concern I personally had about the current igraph approach was its way of handling the deletion of vertices, which if I understand right involves rewriting other vertex IDs. That’s not so much a problem in itself as when one wants to wrap the core graph type with maps from some other vertex ID type (e.g. strings) to the

`size_t`

vertex IDs used internally. It would also be nice if the core data type could readily handle arbitrary vertex ID choices rather than constraining vertex IDs to 0, 1, …, |V| – 1, but I imagine that’s a constraint that’s difficult to reconcile with scalability and performance.Regarding the API and trying out alternative data types, one thing that’s easy to do with D is to put in place template constraints that simply check that the chosen data type can provide a certain minimal range of functionality. So, with well written algorithms, and a small selection of core graph queries, it should be possible to drop in entirely new graph data types and indeed offer several different choices to the user. I’ll probably experiment with that at some point, and if you’d be interested I could try and make sure that a core set of benchmarking tests are put in place so that the performance of different data types could readily be compared. It could be a useful testbed to try an approach out quickly before putting in the investment of writing it in highly optimized C.

I guess you can design a data structure like that. Life is a lot simpler with continuous ids, though. E.g. just to check that the input is a valid vertex or edge id, you would need a lookup in the data structure, that is hard to implement to be O(1) and keep the data structure flat. I guess you could do some kind of hashing and still store everything in flat vectors.

I really want a data structure that is better with dynamically changing graphs, and working on this now (well, a little bit, not very actively).

To be honest, I don’t know much about D, so I don’t know how much your benchmarks are relevant to C code. But please send me an email if you have something, or post it here, I put your blog in my feed reader.

I think some of my previous comment got lost, fixed now …

Agree about the continuous IDs, but ID flexibility was an issue that people have already raised with me. I’m thinking of offering a solution with wrappers that can handle the mapping of arbitrary IDs to

`size_t`

’s, while leaving internal processing untouched. At the end of the day someone somewhere has to transform the input data into a form that’s useful for analysis, and it might as well be in the library.I agree, this seems one of the interesting challenges.

I’ll be following up here with some benchmarking results. Generally speaking there’s a small hit with respect to C, but that can often be narrowed with careful attention to the algorithms and implementation.

About C and D relations — obviously I think D is a powerful production language in its own right, but for developers working in C/C++ it can also be useful as a prototyping language. It’s simply a function of the language being more elegant and safe — you can get an idea up and running quickly, check the general scaling, and then if the results are good, re-implement in the language of choice. It’s just an option — I’d be pleased if Dgraph could be useful to other library developers in this way.

Pingback: Complex networks in D, part 2 — building the network | mad science, dreams and diversions