Complex networks in D

Well, that was a bit of a gap in blogging terms.

What happened? Well, I took up a job at the Politecnico di Torino, I got rather busy, and … oh, details can wait. In the meantime, I thought I’d share some recent fun stuff that I’ve been doing with the D programming language — a small and currently very basic library for complex graphs.

Complex networks are something that I’ve been working close to professionally for a long time and yet, oddly enough, I’ve never particularly gone into the dark details of coding them. This is most likely because, absent some pressing need, many of the basic things one wants from a graph library are easier to knock up on the fly. It takes much less time and effort to write something quick’n’easy that does the handful of things you want, than to bend your otherwise completely original code to the needs of some specific library solution.

Sometimes, though, you want the heavy-duty toolbox, and when this comes along you have a number of options. There’s the Boost Graph Library, an STL-style C++ generic programming solution; there’s NetworkX, an attractive Python library; and then there’s the serious big hitter, igraph, which is written in C but also has interfaces in Python and R.

igraph in particular is known for being lightning-fast and capable of scaling to very large graphs indeed. It also has a very wide ranging set of different tools for network analysis, and this makes it an obvious choice for any use case involving research on large scale networks. On the other hand, it suffers from the inevitable problem that all complex programs written in C succumb to: its code is extremely verbose and difficult to follow, and its developers have had to implement and maintain custom versions of a great many ‘standard’ algorithms and tools that other languages give you out of the box.

Contrast this with the D programming language, which I’ve been using intensively and extensively over the past year. Without wanting to go into too many details, D delivers all the power, flexibility and elegance that we’re used to from higher level languages, while offering program speeds that are in the ballpark with C/C++. It’s a joy to use with scientific programming, because that simulation that might have taken a ton of code to write in C, or involved a finnicky debugging process in C++, can suddenly be written briefly and clearly and yet still delivers the performance you need; far more of your time can be spent actually generating results from your code, rather than writing and debugging it.

So, faced with the conclusion that I really did need some seriously thought out graph code to move forward on certain problems, I was faced with a quandary. I wanted to use D, but that would mean having to write hooks for one of the existing graph libraries. igraph was the obvious choice — it’s fairly straightforward in principle to use a C library from D — but on the other hand, given how extensive igraph’s codebase is, I was worried that writing wrappers would in practice be almost as complex as writing a custom solution from scratch.

That was an irritation — and then it became an inspiration. Why not have a play with the data structures and algorithms of igraph and see if they could find a friendly home rewritten in D? This would be a nice baptism-of-fire test for the language’s performance capabilities, and it would also be interesting to see whether the elegant surface of idiomatic D code could be preserved in the face of the need for speed and memory efficiency. And so, last week, my little Dgraph library was born.

* * *

Now, I should emphasize up front that Dgraph is not so much in its infancy as at the point of first post-conception cell division. The reason I’m writing now is that I think it’s instructive how readily it was possible to get quick and relatively elegant code up and running and with good performance. Let’s start with the basics: the core data structure of the graph library.

igraph’s core data structure is a model of simplicity:

What’s going on here? n is the number of nodes, and directed unsurprisingly indicates if the graph is directed or not. from and to contain respectively the ‘head’ and ‘tail’ vertices of graph edges — that is, the i’th edge goes from vertex from[i] to vertex to[i]. Vertex IDs are assumed to be numbers in the range 0, 1, …, |V| – 1, where |V| is the total number of vertices.

Now it gets more interesting: oi is the set of edge index values 0, 1, …, |E| – 1 (where |E| is the total number of edges), sorted according to the value of from. That is, if we iterate over from[oi[0]], from[oi[1]], …, we’ll get an increasing sequence of vertex IDs; and the same with respect to ii, except now we’re sorting according to to. In other words, oi and ii give us the sequence of edges sorted either to the vertices from which they are outgoing, or the vertices to which they are incoming. Obviously oi and ii both have the same length as from and to, that is, the total number of edges in the graph.

os and is by contrast represent sums of the numbers of outgoing or incoming edges. More precisely, os[v] is the total number of edges whose from value is less than v, while is[v] is the total number of edges whose to value is less than v. So, if we want the total number of vertices outgoing from v, it’s given by os[v + 1] - os[v], while the total number of incoming vertices is given by is[v + 1] - is[v]. For this to work, os and is need to have a total length equal to |V| + 1.

Taken together, these elements take up minimal space in memory yet allow us to elegantly reconstruct other graph properties, such as a vertex’s degree or its outgoing or incoming neighbours. To see why, it may be best to switch to D — where we can most simply represent the transformations involved. Let’s start with a simple rewrite of the basic data structure:

We’ve done away with the number of vertices, because we can calculate that from array lengths. from and to have been renamed as _head and _tail, partly to avoid potential naming clashes, but mainly because I like visualizing my graph edges as snakes or worms or other wriggly things (and now you have this lovely image in your head too: you’re welcome:-). _indexHead and _indexTail are, unsurprisingly, edge indices sorted according to head or tail values; and _sumHead and _sumTail are the sums of edges with head or tail values less than the corresponding value.

We also have a boolean variable that indicates if the graph is directed or not. If a graph is not directed (that is, directed == false), then we assume a restriction that _head[e] <= _tail[e] for all edges e.

Let’s start with the simple stuff. The total number of vertices can be calculated from the lengths of _sumHead or _sumTail; we’ll pick the first, and throw in a nice D-style assert() statement to make sure that the two actually are identical:

What about vertex degrees, i.e. the number of edges to which a vertex is connected? In the case of a directed graph, it should be clear that the total number of edges incoming to node v will be equal to the total number of edges incoming to nodes with id less than v + 1 less the total number of edges incoming to nodes with id less than v, and correspondingly for outgoing edges:

… while for an undirected graph, where incoming edges are also outgoing edges and vice versa, we have to combine these numbers:

We still define in- and out-degree functions but this time just alias them to the main degree() function. If you’re familiar with D’s syntax, you will have noticed that this static if () { ... } else { ... } formulation is one simple application of D’s generic programming interface: the static if will be evaluated at compile time so that a directed or undirected graph will have only the functions it’s meant to have.

What about vertices’ neighbours in the graph? Let’s start once more with the directed case. Suppose we want to find the incoming edges to a vertex v, i.e., all the edges for which v is the tail. Well, we know that there are _sumTail[v] edges whose tail is less than v, and _sumTail[v + 1] edges whose tail is less than v + 1 (that is, less than or equal to v). So, if we label the vertices in order of increasing tail value, we know that edges 0, 1, 2, ..., _sumTail[v] - 1 will have tail less than v, while edges _sumTail[v], ..., _sumTail[v + 1] - 1 will have tail equal to v (assuming that _sumTail[v] < _sumTail[v + 1] — if not, there are no neighbours!). And it just so happens that we have a corresponding sorted index of the edges in the form of _indexTail, so taking that, and using a similar process for outgoing neighbours, we have:

Some explanation may be in order. iota(m, n) gives us the range of values m, m + 1, m + 2, ..., n - 1. For incoming neighbours, we’re mapping that range via the transformation a => _indexTail[a] => _head[_indexTail[a]]. That is, we’re mapping the sort order of the vertex to the edge index value, and from that to the tail value. In other words, we’re mapping to the vertex IDs that correspond to the heads of the edges for which v is the tail. Similarly, for outgoing neighbours, we map to the vertex IDs that correspond to the tails of edges for which v is the head — and for undirected networks, we chain the two lists together. (Besides the wriggly snake/worm imagery, that’s another reason I chose to refer to ‘head’ and ‘tail’: in an undirected network, links are bidirectional so it doesn’t make sense to speak of ‘from’ and ‘to’.)

Notice how easy this is to write down, and how cheap to calculate. We’re not even storing the values of the neighbours in memory — they can be calculated lazily as they are needed! Ultimately this lazy calculation may come back to bite us in some extreme performance situations, but it’s brief, elegant and effective.

Put all of that together, and what have we got?

You’ll notice I sneakily threw in a couple more functions — edgeCount does exactly what it says on the tin, while edge returns a list of edges arranged as (head, tail) pairs. Oh, and (this might be a bad idea:-) I threw in a few spelling alternatives for those in the world who have heinously departed from the mother tongue … ;-)

What I’ve presented here is missing a bunch of other aspects of the implementation — for example how to add or remove vertices or edges from the graph, how to find the edge ID of a given (head, tail) pair, or to find out if there is even such an edge in the first place — and we haven’t even begun to address performance yet! Don’t worry, we’ll get to them in future blog posts. But what’s striking for me is that here we’ve implemented the core data structure and query operations of a super-efficient graph representation in barely 80 lines of code. In C, it takes more than that just to implement the function to determine a vertex’s neighbours.

[… interested in finding the source? Dgraph is on GitHub!]

EDIT 17-07-2013: I’ve tweaked the examples using map in the code so they now use UFCS chaining, i.e. they have the form input.map!() rather than map!()(input). This brings this article in line with what’s now in the codebase.

Sampling D

In other happy news (though not quite on the scale of the Higgs boson), my first contribution has been accepted to D’s standard library, Phobos. My patches implement a much faster random sampling algorithm, and also fix a couple of bugs that were present in the pre-existing random sampling functions.

Random sampling is a fun problem, which is addressed by Donald Knuth in chapter 3 (vol. 2) of The Art of Computer Programming. Here’s the essence of it: suppose you have a collection of N items, which without loss of generality we might label 1, 2, …, N or alternatively 0, 1, …, N – 1. You want to take a random subset of these items of size n < N, with the probability of picking any given record being equal to the chance of picking any other given record.

Probably the simplest way to do this is to randomly select items from the original set with probability n/N, keeping them if they have not already been selected, reselecting otherwise, and continuing until you have selected n in total. This has two problems, however: first, we have to store the already-selected items so that we can compare them with new selections, and second, with each new sample point, we have to make an increasing number of comparisons and hence run an increased risk of having to reselect the chosen item.

This doesn’t matter too much if we’re selecting only 3 points from 10, say, but suppose we want to select 1000 points out of 100,000, or 100,000 out of 1 million? It’s also problematic if you’re trying to sample records from a slow-access storage device like a tape, where it’s costly to jump back and forth between different items. We need a method which scales more elegantly in terms of both storage space and algorithmic complexity.

What if, instead, we sequentially consider each item and randomly decide whether or not it is in the sample? We can’t now use n/N as the probability to select — we’d get a sample that was only on average of size n — so we need to find an alternative rule which allows us to do this sequential walk and pick exactly n points without biasing the selection process.

Algorithm S

This is a problem which was solved in 1962 in two independently-written papers by Fan et al. and Jones, and their solution is referred to by Knuth as Algorithm S. It rests on a simple principle: the probability of picking a given item is the number of sample points left to select divided by the total number of items still remaining to consider. So, if you have so far selected m sample points and you’re on the (t + 1)st record, the chance of selecting it will be (nm)/(Nt). Once you’ve selected exactly n points, you exit the sampling process. Here’s a simple example, where we try and select 2 points out of 5:

This algorithm has a lot of nice features. It’s entirely sequential, so you could run it on a tape of records without trouble (very important back in 1962). It’s very simple to implement. On the other hand, although the worst-case scenario of considering every item is usually avoided, it’s still got an algorithmic complexity of O(N) and on average requires O(N) random variates to be generated. Again, not a big deal if you’re sampling 10 from 100, but what if you’re trying to sample 1000 from 10 million?

The problem is we have to consider each item one by one, 1, 2, 3, …, and generate a new random variate (computationally expensive!) for each item we consider. What would be much nicer is to instead jump straight to the selected record, skipping over the others. In other words, what we’d like to do is generate a ‘skip size’ S of records to ignore (which might be 0), select the record we land on, and then skip again from the following record, and so on until the sample is complete. Like this:

For this to be really worthwhile, we’d like to generate only one random variate per skip (meaning the total number scales with the sample size rather than the total number of items) and for S to be generated from that random variate in O(1) complexity. At the same time, we need to ensure that the way we generate S preserves the probabilistic character that we are looking for, i.e. that each record has an equal chance of being in the chosen sample.

Algorithms A and D

A solution to this problem was provided in 1984 by Jeffrey Scott Vitter in a paper entitled Faster methods for random sampling, which introduced several new algorithms with superior scaling properties. Improved versions of the two most effective algorithms were introduced in a later article, An efficient algorithm for sequential random sampling (free access to both these papers is available via Prof. Vitter’s online library). An in-depth presentation of all the details of these two algorithms probably needs a whole post in itself, but we can give a rough overview as follows.

The first of these algorithms, Algorithm A, derives from some simple observations on properties of the distribution function F(s) of skip sizes (for now I won’t give the mathematical details here as I’m currently missing a good solution for mathematical notation, but Vitter’s 1984 paper covers this in depth). Vitter used these properties to derive a simple inequality between N, n, s, and the value of a randomly-generated number V in the open interval (0, 1). The skip size S is then equal to the minimum value of s that satisfies this inequality for a given randomly-generated value of V. In other words, we need generate only 1 random variate per sample point, or n total, compared with O(N) for Algorithm S.

The downside to the simplicity of this approach is that the solution to the inequality is calculated inefficiently: the algorithm still has an overall complexity of O(N). This is still an improvement on Algorithm S, and one that can be optimal for cases where O(n) and O(N) are pretty much equivalent (e.g. selecting 20 out of 100, say).

Algorithm D takes advantage of this by using Algorithm A’s approach where the ratio (nm)/(Nt) is larger than a certain value, but implements an extended approach that solves the computational complexity issue for the case where n is smaller. In its most basic form it relies on generating two random variates per sample point, the first, U, uniformly distributed in (0, 1), and the second, X, from a distribution closely approximating the skip size distribution F(s). Again, Vitter derives an inequality between these values that the skip size must satisfy; but in addition he observes that there is a simpler-to-calculate inequality, which is satisfied with high probability, that guarantees compliance. In the case of either of these inequalities being satisfied, the skip size S is set equal to the ‘floor’ of X, i.e. the largest integer less than X.

So, the skip size S is calculated roughly as follows:

  1. Check if the ratio (nm)/(Nt) is sufficiently large; if so, use Algorithm A to generate this skip size.
  2. Generate the random variables U and X.
  3. Check the simple inequality; if satisfied (likely!), set S equal to floor(X), and return this value.
  4. Check the more complicated inequality; if satisfied, set set S equal to floor(X), and return this value. If not satisfied (very rare!) return to Step 2, and repeat until a satisfactory value is generated.

This basic structure can be improved on in several ways. In practice, once the ratio (nm)/(Nt) has exceeded the threshold once, we can switch over to Algorithm A for the entire remainder of the sampling; the supposed extra cost of Algorithm A is generally offset by a reduction in the number of times we have to calculate and compare the value of the ratio. This also helps to guard against worst-case scenario behaviour of Algorithm D when (nm) ≈ (Nt).

Second, Vitter shows that it’s possible in many cases to deterministically calculate the next value of X from the previous one, without violating the independence of the sequence of random values. This allows a sizeable reduction in the number of random variates required, so that in practice the total number is closer to n than 2n.

Finally, a trivial improvement is to consider separately the case where only 1 sample point remains to be selected — where of course all one has to do is generate a uniformly-distributed integer to calculate the skip across the remaining items.

Getting it into D

The D framework for random sampling, written by Andrei Alexandrescu, is a rather elegant creation (worth its own blog post) that uses D’s concept of ranges to create a lazily-evaluated structure that can sample from a broad range of different inputs — an array, a sequence of numbers generated automatically without being stored, an input file, essentially any input that can be interpreted through D’s concept of an InputRange.

I’d originally implemented Algorithms A and D as a little extension library based on the GNU Scientific Library, and it was fairly trivial to rework the code for D (licence lawyers note that the pieces reworked were entirely my own work and not derivative of GSL). To my knowledge, this makes D the only language — certainly the only systems-programming language — to have Vitter’s algorithm implemented in its standard library. I’m not even aware of any mainstream scientific library that implements it, though I’d be delighted to learn otherwise.

For now, though, it seems a bit of a unique feature — and I’m definitely proud to have brought Algorithm D to D.

Programming fun

Where programming and development is concerned it’s always good to get some time to just play, to mess around with new tools, languages or libraries just for the fun of it to see what you can do. I don’t really do this often enough, but the last few days were an exception. Here’s what I got up to …

C, C++, D …

I can’t remember when exactly I discovered the D programming language, but if my posts on the project mailing lists are anything to go by it must have been in early 2010. This overlapped very nicely with the period in which I became properly enthusiastic and confident with C++, and I was delighted by D’s much more elegant and intuitive syntax compared to the older language.

I’ve played with it on-off ever since then, but never with the commitment that the language really deserves. There were various reasons for that; one of the major ones was that the only up-to-date compiler was the partially-proprietary reference implementation, with the GCC and LLVM-based ones lagging behind.

However … the last release of Ubuntu (11.10) saw the GCC 4.6 based GDC compiler included, and the upcoming LTS release (12.04) has promoted this to being the default GDC install. In fact this is a straight import from Debian, meaning that all upcoming Debian-derived distros are available. So, it seemed an opportune moment to dive back in.

The results are quite nice. Back when I was first trying out D there was a big speed gap with C++ which the latest GDC seems to have closed. The GDC dev community is full of beans and doing excellent work. With any luck the LLVM-based LDC compiler will catch up soon and also be available in distros.

My explorations in the next weeks will probably be to do with things like template mixins and other ways in which D differs from C++ in its approach to class templates, inheritance, polymorphism and so on. I’ll also be looking into how easy it is to use D with C and C++ libraries, surely essential at this point when there is not a widespread range of D software available in Linux distros.

Clang!!!!

Looking into D made me think of LDC, which in turn reawakened my casual curiosity in the LLVM compiler project. I’ve never really worked with any of these tools beyond LDC, so thought I’d have a play with Clang, the LLVM-based C++ compiler.

A couple of tweaks to a Makefile later and … oh dear oh dear! Compiler error messages abounded. But once I’d got past the ‘But it works fine with g++!’ panic, they turned out to be useful, informative, and easy to resolve. The code in question was a policy class-based program with a ton of template metaprogramming, and it turns out that Clang/LLVM is much stricter than GCC when it comes to certain potential ambiguities in C++ templates.

In fact, the error messages gave precise instructions for resolving the problem — in this case, to insert ‘this->’ keywords in front of various function calls. Compilation itself was noticeably faster than g++ and produced code that, while slower, was not so much so as to be unpleasant.

On this basis, I’m quite tempted to switch to Clang as my development compiler, using these nice features to ensure code correctness and to avoid long compile times, and pull out g++ only when building a finished program.