Complex networks in D

Well, that was a bit of a gap in blogging terms.

What happened? Well, I took up a job at the Politecnico di Torino, I got rather busy, and … oh, details can wait. In the meantime, I thought I’d share some recent fun stuff that I’ve been doing with the D programming language — a small and currently very basic library for complex graphs.

Complex networks are something I’ve worked closely with professionally for a long time and yet, oddly enough, I’ve never particularly gone into the dark details of coding them. This is most likely because, absent some pressing need, many of the basic things one wants from a graph library are easy enough to knock up on the fly. It takes much less time and effort to write something quick’n’easy that does the handful of things you want than to bend your otherwise completely original code to the needs of some specific library solution.

Sometimes, though, you want the heavy-duty toolbox, and when this comes along you have a number of options. There’s the Boost Graph Library, an STL-style C++ generic programming solution; there’s NetworkX, an attractive Python library; and then there’s the serious big hitter, igraph, which is written in C but also has interfaces in Python and R.

igraph in particular is known for being lightning-fast and capable of scaling to very large graphs indeed. It also has a very wide-ranging set of tools for network analysis, which makes it an obvious choice for any use case involving research on large-scale networks. On the other hand, it suffers from the inevitable problem to which all complex programs written in C succumb: its code is extremely verbose and difficult to follow, and its developers have had to implement and maintain custom versions of a great many ‘standard’ algorithms and tools that other languages give you out of the box.

Contrast this with the D programming language, which I’ve been using intensively and extensively over the past year. Without wanting to go into too many details, D delivers all the power, flexibility and elegance that we’re used to from higher-level languages, while offering program speeds in the same ballpark as C/C++. It’s a joy to use for scientific programming, because a simulation that might have taken a ton of code to write in C, or involved a finicky debugging process in C++, can suddenly be written briefly and clearly and yet still deliver the performance you need; far more of your time can be spent actually generating results from your code, rather than writing and debugging it.

So, having concluded that I really did need some seriously thought-out graph code to move forward on certain problems, I was faced with a quandary. I wanted to use D, but that would mean having to write hooks for one of the existing graph libraries. igraph was the obvious choice — it’s fairly straightforward in principle to use a C library from D — but given how extensive igraph’s codebase is, I was worried that writing wrappers would in practice be almost as complex as writing a custom solution from scratch.

That was an irritation — and then it became an inspiration. Why not have a play with the data structures and algorithms of igraph and see if they could find a friendly home rewritten in D? This would be a nice baptism-of-fire test for the language’s performance capabilities, and it would also be interesting to see whether the elegant surface of idiomatic D code could be preserved in the face of the need for speed and memory efficiency. And so, last week, my little Dgraph library was born.

* * *

Now, I should emphasize up front that Dgraph is not so much in its infancy as at the point of first post-conception cell division. The reason I’m writing now is that I think it’s instructive how readily it was possible to get quick and relatively elegant code up and running with good performance. Let’s start with the basics: the core data structure of the graph library.

igraph’s core data structure is a model of simplicity:
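```c
/* igraph's graph type, lightly paraphrased from its public headers;
 * the comments are mine, anticipating the explanation below. */
typedef struct igraph_s {
    igraph_integer_t n;        /* number of vertices */
    igraph_bool_t directed;    /* whether the graph is directed */
    igraph_vector_t from;      /* 'head' vertex of each edge */
    igraph_vector_t to;        /* 'tail' vertex of each edge */
    igraph_vector_t oi;        /* edge indices sorted by 'from' */
    igraph_vector_t ii;        /* edge indices sorted by 'to' */
    igraph_vector_t os;        /* cumulative out-edge counts per vertex */
    igraph_vector_t is;        /* cumulative in-edge counts per vertex */
    void *attr;                /* graph attributes (not discussed here) */
} igraph_t;
```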

What’s going on here? n is the number of nodes, and directed unsurprisingly indicates if the graph is directed or not. from and to contain respectively the ‘head’ and ‘tail’ vertices of graph edges — that is, the i’th edge goes from vertex from[i] to vertex to[i]. Vertex IDs are assumed to be numbers in the range 0, 1, …, |V| – 1, where |V| is the total number of vertices.

Now it gets more interesting: oi is the set of edge index values 0, 1, …, |E| – 1 (where |E| is the total number of edges), sorted according to the value of from. That is, if we iterate over from[oi[0]], from[oi[1]], …, we’ll get an increasing sequence of vertex IDs; and the same goes for ii, except now we’re sorting according to to. In other words, oi and ii give us the sequence of edges sorted either by the vertices from which they are outgoing or by the vertices to which they are incoming. Obviously oi and ii both have the same length as from and to: the total number of edges in the graph.

os and is, by contrast, represent cumulative counts of outgoing or incoming edges. More precisely, os[v] is the total number of edges whose from value is less than v, while is[v] is the total number of edges whose to value is less than v. So, if we want the total number of edges outgoing from v, it’s given by os[v + 1] - os[v], while the total number of incoming edges is given by is[v + 1] - is[v]. For this to work, os and is need to have a total length equal to |V| + 1.

Taken together, these elements take up minimal space in memory yet allow us to elegantly reconstruct other graph properties, such as a vertex’s degree or its outgoing or incoming neighbours. To see why, it may be best to switch to D — where we can most simply represent the transformations involved. Let’s start with a simple rewrite of the basic data structure:
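```d
// A sketch of the rewritten type; the code in the Dgraph repository
// has more to it, but these are the essentials.
struct Graph(bool directed)
{
  private:
    size_t[] _head;          // 'from' vertex of each edge
    size_t[] _tail;          // 'to' vertex of each edge
    size_t[] _indexHead;     // edge indices sorted by head value
    size_t[] _indexTail;     // edge indices sorted by tail value
    size_t[] _sumHead = [0]; // _sumHead[v] = number of edges with head < v
    size_t[] _sumTail = [0]; // _sumTail[v] = number of edges with tail < v

  public:
    // ... member functions described below ...
}
```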

We’ve done away with the number of vertices, because we can calculate that from array lengths. from and to have been renamed _head and _tail, partly to avoid potential naming clashes, but mainly because I like visualizing my graph edges as snakes or worms or other wriggly things (and now you have this lovely image in your head too: you’re welcome :-). _indexHead and _indexTail are, unsurprisingly, edge indices sorted according to head or tail values; and _sumHead and _sumTail count the edges whose head or tail value is less than the given vertex ID.

We also have a boolean template parameter, directed, that indicates whether the graph is directed. If a graph is not directed (that is, directed == false), then we impose the restriction that _head[e] <= _tail[e] for all edges e.

Let’s start with the simple stuff. The total number of vertices can be calculated from the lengths of _sumHead or _sumTail; we’ll pick the first, and throw in a nice D-style assert() statement to make sure that the two actually are identical:
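```d
// inside Graph: a sketch of the vertex count property
@property size_t vertexCount() const pure nothrow
{
    assert(_sumHead.length == _sumTail.length);
    return _sumHead.length - 1;
}
```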

What about vertex degrees, i.e. the number of edges to which a vertex is connected? In the case of a directed graph, it should be clear that the total number of edges incoming to node v will be equal to the total number of edges incoming to nodes with ID less than v + 1, minus the total number of edges incoming to nodes with ID less than v, and correspondingly for outgoing edges:
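```d
// inside Graph: degree functions for the directed case (a sketch)
static if (directed)
{
    size_t degreeIn(immutable size_t v) const pure nothrow
    {
        return _sumTail[v + 1] - _sumTail[v];
    }

    size_t degreeOut(immutable size_t v) const pure nothrow
    {
        return _sumHead[v + 1] - _sumHead[v];
    }
}
```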

… while for an undirected graph, where incoming edges are also outgoing edges and vice versa, we have to combine these numbers:
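```d
// ... and the undirected case, continuing the static if above (a sketch)
else
{
    size_t degree(immutable size_t v) const pure nothrow
    {
        return (_sumHead[v + 1] - _sumHead[v])
             + (_sumTail[v + 1] - _sumTail[v]);
    }

    alias degreeIn = degree;
    alias degreeOut = degree;
}
```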

We still define in- and out-degree functions, but this time just alias them to the main degree() function. If you’re familiar with D’s syntax, you’ll have noticed that this static if () { ... } else { ... } formulation is a simple application of D’s generic programming facilities: the static if is evaluated at compile time, so that a directed or undirected graph ends up with only the functions it’s meant to have.

What about vertices’ neighbours in the graph? Let’s start once more with the directed case. Suppose we want to find the incoming edges to a vertex v, i.e., all the edges for which v is the tail. Well, we know that there are _sumTail[v] edges whose tail is less than v, and _sumTail[v + 1] edges whose tail is less than v + 1 (that is, less than or equal to v). So, if we label the edges in order of increasing tail value, we know that edges 0, 1, 2, ..., _sumTail[v] - 1 will have tail less than v, while edges _sumTail[v], ..., _sumTail[v + 1] - 1 will have tail equal to v (assuming that _sumTail[v] < _sumTail[v + 1] — if not, there are no neighbours!). And it just so happens that we have a corresponding sorted index of the edges in the form of _indexTail, so taking that, and using a similar process for outgoing neighbours, we have:
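```d
// inside Graph: neighbour calculation (a sketch); assumes
// import std.algorithm : map; import std.range : chain, iota;
static if (directed)
{
    auto neighboursIn(immutable size_t v) const
    {
        return iota(_sumTail[v], _sumTail[v + 1])
               .map!(a => _head[_indexTail[a]]);
    }

    auto neighboursOut(immutable size_t v) const
    {
        return iota(_sumHead[v], _sumHead[v + 1])
               .map!(a => _tail[_indexHead[a]]);
    }
}
else
{
    auto neighbours(immutable size_t v) const
    {
        return chain(iota(_sumTail[v], _sumTail[v + 1])
                         .map!(a => _head[_indexTail[a]]),
                     iota(_sumHead[v], _sumHead[v + 1])
                         .map!(a => _tail[_indexHead[a]]));
    }
}
```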

Some explanation may be in order. iota(m, n) gives us the range of values m, m + 1, m + 2, ..., n - 1. For incoming neighbours, we’re mapping that range via the transformation a => _indexTail[a] => _head[_indexTail[a]]. That is, we’re mapping each sort position to the corresponding edge index value, and from that to the edge’s head value. In other words, we’re mapping to the vertex IDs that correspond to the heads of the edges for which v is the tail. Similarly, for outgoing neighbours, we map to the vertex IDs that correspond to the tails of edges for which v is the head — and for undirected networks, we chain the two lists together. (Besides the wriggly snake/worm imagery, that’s another reason I chose to refer to ‘head’ and ‘tail’: in an undirected network, links are bidirectional, so it doesn’t make sense to speak of ‘from’ and ‘to’.)

Notice how easy this is to write down, and how cheap to calculate. We’re not even storing the values of the neighbours in memory — they can be calculated lazily as they are needed! Ultimately this lazy calculation may come back to bite us in some extreme performance situations, but it’s brief, elegant and effective.

Put all of that together, and what have we got?
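Something like the following: a consolidated sketch, assuming the imports shown (the code in the GitHub repository differs in some details).

```d
import std.algorithm : map;
import std.range : chain, iota, zip;

struct Graph(bool directed)
{
  private:
    size_t[] _head;          // 'from' vertex of each edge
    size_t[] _tail;          // 'to' vertex of each edge
    size_t[] _indexHead;     // edge indices sorted by head value
    size_t[] _indexTail;     // edge indices sorted by tail value
    size_t[] _sumHead = [0]; // _sumHead[v] = number of edges with head < v
    size_t[] _sumTail = [0]; // _sumTail[v] = number of edges with tail < v

  public:
    @property size_t vertexCount() const pure nothrow
    {
        assert(_sumHead.length == _sumTail.length);
        return _sumHead.length - 1;
    }

    @property size_t edgeCount() const pure nothrow
    {
        assert(_head.length == _tail.length);
        return _head.length;
    }

    // the edges of the graph, as (head, tail) pairs
    @property auto edge() const
    {
        return zip(_head, _tail);
    }

    static if (directed)
    {
        size_t degreeIn(immutable size_t v) const pure nothrow
        {
            return _sumTail[v + 1] - _sumTail[v];
        }

        size_t degreeOut(immutable size_t v) const pure nothrow
        {
            return _sumHead[v + 1] - _sumHead[v];
        }

        auto neighboursIn(immutable size_t v) const
        {
            return iota(_sumTail[v], _sumTail[v + 1])
                   .map!(a => _head[_indexTail[a]]);
        }

        auto neighboursOut(immutable size_t v) const
        {
            return iota(_sumHead[v], _sumHead[v + 1])
                   .map!(a => _tail[_indexHead[a]]);
        }

        // spelling alternatives
        alias neighborsIn = neighboursIn;
        alias neighborsOut = neighboursOut;
    }
    else
    {
        size_t degree(immutable size_t v) const pure nothrow
        {
            return (_sumHead[v + 1] - _sumHead[v])
                 + (_sumTail[v + 1] - _sumTail[v]);
        }

        alias degreeIn = degree;
        alias degreeOut = degree;

        auto neighbours(immutable size_t v) const
        {
            return chain(iota(_sumTail[v], _sumTail[v + 1])
                             .map!(a => _head[_indexTail[a]]),
                         iota(_sumHead[v], _sumHead[v + 1])
                             .map!(a => _tail[_indexHead[a]]));
        }

        // spelling alternative
        alias neighbors = neighbours;
    }
}
```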

You’ll notice I sneakily threw in a couple more functions — edgeCount does exactly what it says on the tin, while edge returns a list of edges arranged as (head, tail) pairs. Oh, and (this might be a bad idea:-) I threw in a few spelling alternatives for those in the world who have heinously departed from the mother tongue … ;-)

What I’ve presented here is missing a bunch of other aspects of the implementation — for example, how to add or remove vertices or edges, how to find the edge ID of a given (head, tail) pair, or how to find out whether such an edge even exists in the first place — and we haven’t even begun to address performance yet! Don’t worry, we’ll get to them in future blog posts. But what’s striking for me is that here we’ve implemented the core data structure and query operations of a super-efficient graph representation in barely 80 lines of code. In C, it takes more than that just to implement the function to determine a vertex’s neighbours.

[... interested in finding the source? Dgraph is on GitHub!]

EDIT 17-07-2013: I’ve tweaked the examples using map in the code so they now use UFCS chaining, i.e. they have the form input.map!() rather than map!()(input). This brings this article in line with what’s now in the codebase.

6 thoughts on “Complex networks in D”

  1. Kudos for reverse engineering the igraph data type! Some comments, based on my experience in the 7 years since starting igraph.

    1. Don’t think that the igraph data type is the best possible; in fact it was never thoroughly benchmarked, simply because it proved to be fast enough for us back then. But I am sure that there is room for improvement, and in fact I am experimenting with different data types right now.

    2. Whichever data type you choose, the most important thing is to keep the API to the data type thin. For us, everything that accesses igraph_t is in type_indexededgelist.c, a small file, so it is easy to try alternative data types. It is sometimes a pain to use this thin API when you are implementing algorithms, but it really pays off later.

    3. My guess would be that it is important to make the graph data type flat. The current data type with the vectors is optimal in this respect: it is just a couple of vectors, which are contiguous chunks of memory. It is very fast to iterate over these vectors, and very fast to copy them, which is important because I need to copy them in the R wrapper, since R data types are immutable. I am not sure how a fragmented implementation, e.g. a textbook adjacency list with pointers, would do against this; I am testing that nowadays. My guess is that it’ll be much slower, but I am of course not sure.

    Best,
    Gabor

    • Thanks Gabor — I had much fun doing this and exploring your great work on igraph :-)

      On data types: when I first started looking into this, I’d assumed that some kind of consensus must have formed by now on how best to represent a graph computationally. It was a surprise to find that this wasn’t really the case: the existing graph libraries take quite different approaches, which I guess simply reflect different priorities. What struck me about igraph_t was that even if it might carry a cost in terms of speed of access to certain graph elements (e.g. having to calculate the list of neighbours), that would be repaid in terms of scalability.

      The main concern I personally had about the current igraph approach was its way of handling the deletion of vertices, which if I understand right involves rewriting other vertex IDs. That’s not so much a problem in itself as it is when one wants to wrap the core graph type with maps from some other vertex ID type (e.g. strings) to the size_t vertex IDs used internally. It would also be nice if the core data type could readily handle arbitrary vertex ID choices rather than constraining vertex IDs to 0, 1, …, |V| – 1, but I imagine that’s a constraint that’s difficult to reconcile with scalability and performance.

      Regarding the API and trying out alternative data types, one thing that’s easy to do with D is to put in place template constraints that simply check that the chosen data type can provide a certain minimal range of functionality. So, with well written algorithms, and a small selection of core graph queries, it should be possible to drop in entirely new graph data types and indeed offer several different choices to the user. I’ll probably experiment with that at some point, and if you’d be interested I could try and make sure that a core set of benchmarking tests are put in place so that the performance of different data types could readily be compared. It could be a useful testbed to try an approach out quickly before putting in the investment of writing it in highly optimized C.
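      To give a rough idea, here’s a hypothetical sketch of such a constraint (the names are invented for illustration):

```d
// hypothetical: a compile-time check that a type G provides
// the minimal interface a graph algorithm needs
enum bool isGraph(G) = is(typeof({
    G g;
    size_t n = g.vertexCount;
    size_t m = g.edgeCount;
    foreach (w; g.neighbours(0)) {}
}));

// algorithms can then accept any conforming graph type:
auto order(G)(ref G g)
    if (isGraph!G)
{
    return g.vertexCount;
}
```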

      • but I imagine that’s a constraint that’s difficult to reconcile with scalability and performance.

        I guess you can design a data structure like that. Life is a lot simpler with continuous ids, though. E.g. just to check that the input is a valid vertex or edge id, you would need a lookup in the data structure, which is hard to make O(1) while keeping the data structure flat. I guess you could do some kind of hashing and still store everything in flat vectors.

        I really want a data structure that is better with dynamically changing graphs, and working on this now (well, a little bit, not very actively).

        To be honest, I don’t know much about D, so I don’t know how relevant your benchmarks are to C code. But please send me an email if you have something, or post it here; I’ve put your blog in my feed reader.

        • I think some of my previous comment got lost, fixed now …

          Agree about the continuous IDs, but ID flexibility was an issue that people have already raised with me. I’m thinking of offering a solution with wrappers that can handle the mapping of arbitrary IDs to size_t’s, while leaving internal processing untouched. At the end of the day someone somewhere has to transform the input data into a form that’s useful for analysis, and it might as well be in the library.

          I really want a data structure that is better with dynamically changing graphs, and working on this now (well, a little bit, not very actively).

          I agree, this seems one of the interesting challenges.

          I’ll be following up here with some benchmarking results. Generally speaking there’s a small hit with respect to C, but that can often be narrowed with careful attention to the algorithms and implementation.

          About C and D relations — obviously I think D is a powerful production language in its own right, but for developers working in C/C++ it can also be useful as a prototyping language. It’s simply a function of the language being more elegant and safe — you can get an idea up and running quickly, check the general scaling, and then if the results are good, re-implement in the language of choice. It’s just an option — I’d be pleased if Dgraph could be useful to other library developers in this way.

  2. Pingback: Complex networks in D, part 2 — building the network | mad science, dreams and diversions
