Betweenness centrality in Dgraph

In this post I thought I’d take you through the start-to-finish process of implementing and optimizing one of the new functions in Dgraph — from a fairly straight copy of the pseudo-code in a research paper, to its form in the repository today. The function in question is dgraph.metric.betweenness, which implements the betweenness centrality metric for networks. Following the implementation of this algorithm offers some interesting opportunities to examine simple optimization strategies when coding in D, things that experienced D’ers know well but which are easy to miss as a newcomer.

First, some background. The concept of centrality in network analysis is used to describe the relative importance of individual vertices or edges in the network’s structure. Multiple different centrality measures exist, including degree centrality (how many other vertices are you connected to?), closeness centrality (how far do you have to go to reach other vertices?) and eigenvector centrality (which derives from the eigenvector with largest eigenvalue of the network’s adjacency matrix).

Betweenness centrality is an alternative measure that essentially reflects a vertex’s importance as a channel of communication between other vertices in the network. Specifically, the betweenness centrality of a vertex v corresponds to

\[C_{B}(v) = \sum_{s \neq t \neq v \in V}\frac{\sigma_{st}(v)}{\sigma_{st}}\]

where V is the set of all vertices in the graph, σst(v) is the number of shortest paths between vertices s and t that pass through v, and σst is the total number of shortest paths between s and t.

In non-mathematical terms, what we are doing is calculating, for every pair of other nodes in the graph, the fraction of shortest paths between them that include the vertex v — that is, the importance of v as a point of connection in going from one node to the other. We then sum these fractions to get the total betweenness centrality of v. Its practical importance is that if we knock out a vertex with high betweenness centrality, we are going to make it much more difficult to travel between other vertices, because for many vertex pairs the network distance between them will have increased.

It’s a tricky quantity to calculate, because the cost of doing so for even a single vertex winds up scaling with the size of the whole graph. A popular algorithm for (relatively!) efficient computation of betweenness centrality was proposed by Ulrik Brandes in 2001 in a paper in the Journal of Mathematical Sociology (also available as a University of Konstanz e-print), which scales as O(|V||E|), i.e. with the product of the total numbers of vertices and edges.

My original version, which was written before the API change introduced by recent updates, was a fairly straightforward copy of the pseudo-code Brandes offered in his paper:
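In outline it looked something like the sketch below. This is a reconstruction rather than the committed code: the real implementation used Dgraph’s own graph type and the VertexQueue described further down, for which I substitute a minimal adjacency-list type and a simple slice-based queue so that the example stands alone.

```d
import std.traits : isFloatingPoint;

// Minimal stand-in for the Dgraph graph type: just enough API for the sketch.
struct SimpleGraph
{
    size_t[][] adj;   // adjacency lists
    @property size_t vertexCount() const { return adj.length; }
    const(size_t)[] neighbours(immutable size_t v) const { return adj[v]; }
}

// Sketch of Brandes' algorithm, following the pseudo-code in his 2001 paper.
T[] betweenness(T = double, Graph)(ref Graph g, bool[] ignore = null)
    if (isFloatingPoint!T)
{
    T[] centrality = new T[g.vertexCount];
    centrality[] = 0;

    foreach (immutable s; 0 .. g.vertexCount)
    {
        if (ignore.length > 0 && ignore[s])
            continue;

        size_t[] stack;
        size_t[][] p = new size_t[][g.vertexCount];
        long[] sigma = new long[g.vertexCount];
        long[] d = new long[g.vertexCount];
        d[] = -1;
        sigma[s] = 1;
        d[s] = 0;

        size_t[] queue = [s];   // stand-in for VertexQueue
        while (queue.length > 0)
        {
            immutable v = queue[0];
            queue = queue[1 .. $];
            stack ~= v;
            foreach (immutable w; g.neighbours(v))
            {
                if (ignore.length > 0 && ignore[w])
                    continue;
                if (d[w] < 0)           // w found for the first time?
                {
                    queue ~= w;
                    d[w] = d[v] + 1;
                }
                if (d[w] == d[v] + 1)   // shortest path to w via v?
                {
                    sigma[w] += sigma[v];
                    p[w] ~= v;
                }
            }
        }

        T[] delta = new T[g.vertexCount];
        delta[] = 0;
        // vertices leave the stack in order of non-increasing distance from s
        foreach_reverse (immutable w; stack)
        {
            foreach (immutable v; p[w])
                delta[v] += (cast(T) sigma[v] / sigma[w]) * (1 + delta[w]);
            if (w != s)
                centrality[w] += delta[w];
        }
    }
    return centrality;
}
```

The template constraint, the optional ignore array and the queue stand-in correspond to the points discussed below.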

A few small things that may be of interest: first, note the template constraint if (isFloatingPoint!T) — the user of this function can optionally specify the return type, but we insist it’s floating point.

Second, the ignore array passed to the function is not part of Brandes’ algorithm. A typical application of betweenness centrality is to calculate an attack strategy to break apart a network — you knock out the node with highest betweenness centrality, then recalculate, then knock out the highest remaining node, and so on. Passing an array of boolean types that indicate which nodes have already been knocked out is cheaper and easier than actually modifying the network.

Third, VertexQueue is a non-growable circular queue implementation that I knocked up fairly quickly as a simplification of Bearophile’s GrowableCircularQueue example on RosettaCode (simplified because we don’t need the growable side of this, as we can guarantee the required capacity). It’s there simply because D’s standard library currently has no queue implementation of its own, and will be replaced by the standard library version as soon as one is available. Bearophile was kind enough to offer his code under the terms of the Boost licence, enabling its use (and that of its derivatives) in other codebases.

Anyway, this betweenness centrality implementation produces identical results to its igraph counterpart, but is more than 3 times slower — so how do we get it up to par?

Profiling makes clear that memory allocation and garbage collection are a major source of slowdown. Contrary to what some might expect, this isn’t an inevitable consequence of D’s use of garbage collection. My personal experience is that it can help to approach memory allocation in D much like you would in C or C++, which is to say that you should choose the places in which you allocate memory so as to minimize the overall number of allocations and deallocations required. A nice example is in what is probably the most glaring flaw of the above code: line 16,

size_t[][] p = new size_t[][g.vertexCount];

should very obviously be moved outside the foreach loop. I put it where it is because that’s the place in Brandes’ pseudo-code where p is declared, but there’s simply no need to re-allocate it entirely from scratch at each pass of the loop: we can allocate before the loop begins, and then inside the loop set p[] = [], i.e. set every element of p to be an empty array.
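The reset idiom is worth seeing in isolation: assigning through the slice p[] rewrites every element of p without reallocating p itself. A tiny demonstration (mine, not from the original post):

```d
void main()
{
    size_t[][] p = new size_t[][4];
    p[2] ~= 7;
    assert(p[2] == [7]);

    p[] = [];   // resets every element of p to an empty array ...
    assert(p.length == 4);       // ... without touching p's own length
    assert(p[2].length == 0);
}
```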

As it happens, this hardly saves us anything, probably because the D compiler and garbage collector are smart enough to figure out that the memory can be re-used rather than re-allocated at each pass of the loop. A much bigger improvement is gained by some tweaks that I made after examining the igraph implementation: while the original re-initializes the arrays sigma, d and delta in full with each pass of the loop, in fact each array need only be initialized once, and thereafter we can reset only the values that have been touched by the calculation. Similarly, the individual arrays stored in p need only have their lengths reset to zero in the event that we actually re-use them in a later pass of the loop. Implementing this optimization reduces the time taken to about 3/4 that required by the original version — but it’s still much slower than igraph. What else can we do?

Removing the obligation to pass an ignore array doesn’t really buy us anything, but there is something that will. Note that this implementation involves a lot of array appending — and doing this with regular arrays and the append operator ~ is not an efficient choice.

Instead, we convert the variable p to be an array of Appenders, a specially-defined type that is dedicated to efficient array appending. This change results in another huge speedup — taken together with the previous changes, it’s now running at twice the speed of the original implementation.
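Sketched minimally (note that .data was the usual way to read an Appender’s contents back when this was written; newer Phobos prefers slicing):

```d
import std.array : Appender;

void main()
{
    // p as an array of appenders rather than a size_t[][]:
    auto p = new Appender!(size_t[])[4];
    p[1].put(3);
    p[1].put(5);
    assert(p[1].data == [3, 5]);

    p[1].clear();   // empties the appender but keeps its capacity for re-use
    assert(p[1].data.length == 0);
}
```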

Adding in further tweaks to make the function agnostic as to the graph implementation, and we come to the current final form:

There’s probably still more we could do at this point to improve the betweenness centrality calculation — among other things, other researchers have developed a parallelized version of Brandes’ algorithm — but by now our primary bottleneck is something else: the graph type itself. We’ll discuss this in my next blog post.

EDIT: The order of the template parameters for the final version of betweenness has been reversed. This means that you can specify the desired return type without having to also specify graph type (which can be inferred by the compiler).

EDIT 2: Bearophile and I had an interesting follow-up discussion on further possibilities for optimization. Some of these I’ll leave to explore in future when Dgraph is more fully developed, but one tweak I made available immediately was to allow users to pass a buffer to betweenness for the centrality values to be returned in.

D graph library updates

Today I pushed a number of important updates to the Dgraph master repository. These will be breaking changes for anyone currently using the library, so I thought I should provide a concise overview of what has been altered and how to tweak your own code.

The first and most important change is that where previously there was a single Graph class, the concept of a graph has been generalized to any type that implements the expected graph API. A new template isGraph can be used to check whether a type conforms to expectations.

The original Graph class has been renamed to IndexedEdgeList, reflecting the name given to this data structure by the authors of igraph. It has been supplemented by a new class, CachedEdgeList, which offers greatly enhanced speed at the cost of a higher overall memory footprint. Other graph types will probably be added to the library in the not-too-distant future.

How to update your code

Because of its performance advantages, I recommend using CachedEdgeList as the default graph type in your code. The additional memory cost will be insignificant for all except the very largest graphs. The simplest way to adapt your code is therefore to do a search-and-replace of Graph for CachedEdgeList (or, in regex, s/Graph/CachedEdgeList/:-).

A better approach is to adapt your code to take advantage of the isGraph template. Here’s an example of a function using the previous version of the library:
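Something along these lines, where meanDegree is an invented example of mine (not a function from dgraph.metric) and the class body is stubbed out for illustration:

```d
// Stub standing in for the old concrete Graph class:
class Graph
{
    size_t vertexCount;
    size_t edgeCount;
}

// This helper can only ever accept the one concrete class:
double meanDegree(Graph g)
{
    return g.vertexCount ? (2.0 * g.edgeCount) / g.vertexCount : 0.0;
}

void main()
{
    auto g = new Graph;
    g.vertexCount = 4;
    g.edgeCount = 6;
    assert(meanDegree(g) == 3.0);
}
```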

Now, using isGraph, we can generalize this to accept any graph type:
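The same invented example, now templatized on the graph type. The isGraph here is a cut-down stand-in: the real template checks conformance to the full graph API, not just the two members this function happens to use:

```d
// Cut-down stand-in for dgraph.graph.isGraph: the real template checks
// the complete expected graph API, not just the members used below.
enum isGraph(G) = is(typeof(G.init.vertexCount) : size_t)
               && is(typeof(G.init.edgeCount) : size_t);

double meanDegree(G)(ref G g)
    if (isGraph!G)
{
    return g.vertexCount ? (2.0 * g.edgeCount) / g.vertexCount : 0.0;
}

struct SomeGraph   // any type satisfying the API will do
{
    size_t vertexCount = 4;
    size_t edgeCount = 6;
}

void main()
{
    auto g = SomeGraph();
    assert(meanDegree(g) == 3.0);
}
```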

You can see examples of this in how the functions in dgraph.metric have been adapted for use with the new design.

I’ll be following up with a more in-depth look at the changes, particularly in terms of how they impact performance. In the meantime, feedback is very welcome on the new design, and I hope that any inconvenience caused by the breaking changes is repaid by improved speed and flexibility.

Complex networks in D, part 2 — building the network

When we got to the end of my earlier post on the topic, we’d built a nice little graph data structure together with a core set of query operations to calculate vertex degrees and vertices’ neighbours. But all of this assumed we’d already got the graph data in place — when of course we haven’t! So this time round, we’ll examine the functions needed to add vertices and edges to the network.

Interruption! If you’re interested in following developer blogs on the D programming language, do check out John Colvin’s foreach(hour; life) {/* do something here */}. John’s a lovely guy who contributes much helpful advice and insight on the D forums.

Let’s start by reminding ourselves of the basic data structure we’re dealing with:
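The fields in question, sketched from the description below, with directedness as a compile-time template parameter and all member functions omitted:

```d
final class Graph(bool directed)
{
    private
    {
        size_t[] _head;        // head vertex of each edge
        size_t[] _tail;        // tail vertex of each edge
        size_t[] _indexHead;   // edge IDs sorted by head value
        size_t[] _indexTail;   // edge IDs sorted by tail value
        size_t[] _sumHead;     // cumulative counts of edges by head
        size_t[] _sumTail;     // cumulative counts of edges by tail
    }

    this()
    {
        _sumHead = [0];   // a graph with no vertices still has one sum entry
        _sumTail = [0];
    }

    // ... degree, neighbour and other query methods from last time ...
}

void main()
{
    auto g = new Graph!false;
    assert(g._sumHead == [0] && g._sumTail == [0]);
}
```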

_head and _tail contain respectively the head and tail vertices of each edge; _indexHead and _indexTail contain indices of the edges sorted respectively according to head and tail IDs; and _sumHead and _sumTail contain cumulative sums of the number of edges with head/tail less than the given head/tail value. This is a transcription into D of the igraph_t datatype from igraph; Gábor Csárdi, one of the authors of igraph, refers to it as an indexed edge list.

Let’s start with adding vertices. We could get a bit complicated and make the vertexCount property writeable, but for now let’s keep it simple and just match igraph with a function to increase the number of vertices in the graph:
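A sketch of how that function can look, with the unrelated fields omitted:

```d
final class Graph(bool directed)
{
    private size_t[] _sumHead, _sumTail;   // other fields omitted for brevity

    this() { _sumHead = [0]; _sumTail = [0]; }

    @property size_t vertexCount() const { return _sumHead.length - 1; }

    void addVertices(immutable size_t n)
    {
        immutable size_t l = _sumHead.length;
        _sumHead.length += n;
        _sumTail.length += n;
        // the new vertices have no edges, so we just copy the final
        // value from the original sum arrays into the new entries
        _sumHead[l .. $] = _sumHead[l - 1];
        _sumTail[l .. $] = _sumTail[l - 1];
    }
}

void main()
{
    auto g = new Graph!false;
    g.addVertices(3);
    assert(g.vertexCount == 3);
    assert(g._sumHead == [0, 0, 0, 0]);
}
```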

All we have to do is make sure that the cumulative sums of edges get extended, and since the new vertices don’t have any edges, this is a simple matter of copying the final value from the original sum array. We’re doing this using D’s slicing syntax: arr[i .. j] simply means the sub-array running from index i up to, but not including, index j; the dollar sign $ is shorthand for the array’s length. If you’re interested, you can read in much more detail about slices in Ali Çehreli’s excellent online book.

Anyway, adding vertices is boring — it’s adding edges where the exciting stuff starts to happen. In principle there are two ways we can do this: we can add individual edges, one at a time, or we can add many edges, all in one go; igraph defines two different functions, igraph_add_edge and igraph_add_edges, to do this.

We’ll go with adding one at a time to start with. The task is simple — we need to add the head and tail values, place the new edge ID in its correct location in the head/tail-sorted indices, and increment the sum values:
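Sketched below, with the index update done by a straight sort of the edge indices (the cost of which we will run into shortly):

```d
import std.algorithm : sort, swap;

final class Graph(bool directed)
{
    private size_t[] _head, _tail;
    private size_t[] _indexHead, _indexTail;
    private size_t[] _sumHead, _sumTail;

    this() { _sumHead = [0]; _sumTail = [0]; }

    void addVertices(immutable size_t n)
    {
        immutable l = _sumHead.length;
        _sumHead.length += n;
        _sumTail.length += n;
        _sumHead[l .. $] = _sumHead[l - 1];
        _sumTail[l .. $] = _sumTail[l - 1];
    }

    void addEdge(size_t head, size_t tail)
    {
        static if (!directed)
            if (tail < head)
                swap(head, tail);   // convention: head <= tail when undirected

        // add the head and tail values and the new edge ID ...
        immutable size_t e = _head.length;
        _head ~= head;
        _tail ~= tail;
        _indexHead ~= e;
        _indexTail ~= e;

        // ... place the edge ID in its sorted location ...
        _indexHead.sort!((a, b) => _head[a] < _head[b]);
        _indexTail.sort!((a, b) => _tail[a] < _tail[b]);

        // ... and increment the cumulative sums above the new head/tail
        _sumHead[head + 1 .. $] += 1;
        _sumTail[tail + 1 .. $] += 1;
    }
}

void main()
{
    auto g = new Graph!false;
    g.addVertices(3);
    g.addEdge(2, 0);   // stored as head 0, tail 2
    g.addEdge(1, 2);
    assert(g._head == [0, 1] && g._tail == [2, 2]);
    assert(g._sumHead == [0, 1, 2, 2] && g._sumTail == [0, 0, 0, 2]);
}
```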

I’m actually cheating a bit here, because when I originally wrote the indexEdges() function, I used schwartzSort!(a => _head[a], "a < b") and not the sort formulation given here. schwartzSort caches the results of the transformation a => f(a), making it cheaper where that transformation is costly; but as a => _head[a] is so cheap, we’re actually better off using regular sort, and the (a, b) transformation here provides the rule for determining whether or not a should come before b in the sort order.

Note that we insist that in an undirected graph, the head of an edge has a vertex ID less than or equal to that of the tail. This may seem like a fussy convention but it’s actually useful — we’ll see why later.

Anyway, this all seems nice and in order, so why not test it? Something like this:

Benchmarking this with a 50-vertex graph with 200 edges, the D code comes out more than 10 times faster than similar code written using igraph. A win, you think? You speak too soon — if we try 10,000 vertices and 20,000 edges, the D code is over 7 times slower! You might think that something must be seriously wrong here, and you’d be right — in fact, we’ve run into an unfortunate bug in D’s sort algorithm. In the corner case where an array is almost entirely sorted except for the last few entries, the sort displays worst-case behaviour and scales as O(n²) rather than the expected O(n log n). Fortunately for us there’s an alternative sorting algorithm available, which uses Timsort rather than the standard Quicksort:
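The switch is a one-argument change: asking Phobos for a stable sort selects its Timsort implementation. A minimal illustration of the idiom:

```d
import std.algorithm : sort, SwapStrategy;

void main()
{
    auto head  = [0, 1, 2, 3, 4, 0];   // nearly sorted: one entry out of place
    auto index = [0, 1, 2, 3, 4, 5];

    // SwapStrategy.stable selects Phobos' Timsort rather than the default
    // unstable quicksort, avoiding the quadratic corner case
    index.sort!((a, b) => head[a] < head[b], SwapStrategy.stable);
    assert(index == [0, 5, 1, 2, 3, 4]);
}
```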

This actually works out slightly slower on the smaller graph, but gives a major speedup for the larger one — suddenly what was taking over a minute to run on my system is running in less than a second! Most likely the scaling bug wasn’t being triggered at all for the smaller network, but kicks in as the length of the arrays gets above a certain size, and O(n²) as n → several thousand is going to leave a mark …

… but wait a minute. Aren’t we missing something here? We’re talking about adding one (or very few) new values to an already sorted list — we shouldn’t be using QuickSort or TimSort at all, we should be using an insertion sort. And it turns out that we can write this very simply indeed — even simpler than I’d anticipated, thanks to a technique pointed out to me by Dmitry Olshansky:
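The technique combines std.range.assumeSorted, which wraps an array in a SortedRange supporting binary search, with std.array.insertInPlace. A small worked example:

```d
import std.array : insertInPlace;
import std.range : assumeSorted;

void main()
{
    // edge 4, with head value 2, has just been appended:
    size_t[] head      = [0, 1, 3, 4, 2];
    size_t[] indexHead = [0, 1, 2, 3];    // sorted by head value, pre-append
    immutable size_t e = 4;

    // binary-search for the longest left-hand part of the index whose
    // head values are all less than that of the new edge ...
    auto lower = indexHead.assumeSorted!((a, b) => head[a] < head[b])
                          .lowerBound(e);

    // ... and drop the new edge ID into its correct location
    indexHead.insertInPlace(lower.length, e);
    assert(indexHead == [0, 1, 4, 2, 3]);
}
```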

lowerBound uses a binary search to extract from the already-sorted index list the largest left-hand part of the array such that these edges’ head/tail values are all less than the head/tail value of our new edge. insertInPlace then drops the new index value into its correct location. This is going to be expensive if instead we’re adding multiple edges in one go, so let’s make it nicer by doing all the potential memory allocation in one go:
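For dynamic arrays the tool for this is .reserve, which performs any reallocation once, up front. A small demonstration of the effect:

```d
void main()
{
    size_t[] indexHead = [0, 1, 2];
    enum size_t newEdges = 1000;

    indexHead.reserve(indexHead.length + newEdges);   // one allocation, up front
    auto ptr = indexHead.ptr;

    foreach (e; 3 .. 3 + newEdges)
        indexHead ~= e;   // these appends no longer need to reallocate

    assert(indexHead.ptr is ptr);
    assert(indexHead.length == 3 + newEdges);
}
```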

… and suddenly our adding of new edges is super-fast all round — whereas it takes igraph about 13 seconds on my machine to build a 10,000-vertex, 20,000-edge graph adding edges one at a time, the D code here delivers it in less than 0.1 seconds. Woop!

* * *

Let’s introduce a note of cold sobriety, before we go rushing off and opening the champagne. To compare with igraph like this may give a frisson of pleasure, but we’re not comparing like with like. igraph’s igraph_add_edge function simply wraps its function for adding multiple edges:

This is bound to take a performance hit, because the function for adding multiple edges does a complete rearrangement of the sorted index and sum arrays — whereas our addEdge function just tweaks the bits it needs to. The same could be implemented in igraph, and it would almost certainly beat the present D code in speed. The take-home point here isn’t that D is able to be faster — it’s that because we can write so quickly and simply and effectively in D, we had the time and mental space to implement a tailored solution like this from the get-go. (It also helps that D’s community is very friendly and helpful with suggestions!)

The second sobriety check comes when we try using igraph how it’s meant to be used — passing the graph all the edges in one big go as a vector. Suddenly it’s generating that big 10,000-vertex, 20,000-edge network in about 0.001 seconds! We clearly have some catching up to do.

Let’s assume, as igraph does (and as we already did in an example above) that our list of many edges is going to come in the form of an array edges = [head1, tail1, head2, tail2, head3, tail3, ...] (maybe we’ll change this in future, but for now it serves). So now let’s write a variant of addEdge that will accept this:
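A sketch of the variant, keeping the per-edge insertion sort for now and recalculating the cumulative sums from scratch at the end:

```d
import std.array : insertInPlace;
import std.range : assumeSorted;

final class Graph(bool directed)
{
    private size_t[] _head, _tail;
    private size_t[] _indexHead, _indexTail;
    private size_t[] _sumHead, _sumTail;

    this() { _sumHead = [0]; _sumTail = [0]; }

    void addVertices(immutable size_t n)
    {
        immutable l = _sumHead.length;
        _sumHead.length += n;
        _sumTail.length += n;
        _sumHead[l .. $] = _sumHead[l - 1];
        _sumTail[l .. $] = _sumTail[l - 1];
    }

    void addEdge(in size_t[] edges)
    {
        assert(edges.length % 2 == 0, "expected flat [head, tail, ...] pairs");
        foreach (immutable i; 0 .. edges.length / 2)
        {
            immutable size_t e = _head.length;
            _head ~= edges[2 * i];
            _tail ~= edges[2 * i + 1];

            // insertion sort, one edge at a time (for now)
            auto lh = _indexHead.assumeSorted!((a, b) => _head[a] < _head[b])
                                .lowerBound(e);
            _indexHead.insertInPlace(lh.length, e);
            auto lt = _indexTail.assumeSorted!((a, b) => _tail[a] < _tail[b])
                                .lowerBound(e);
            _indexTail.insertInPlace(lt.length, e);
        }
        sumEdges(_sumHead, _head);
        sumEdges(_sumTail, _tail);
    }

    // recalculate sum[v] = number of edges whose head/tail value is < v
    private static void sumEdges(size_t[] sum, in size_t[] vertices)
    {
        sum[] = 0;
        foreach (immutable v; vertices)
            ++sum[v + 1];
        foreach (immutable i; 1 .. sum.length)
            sum[i] += sum[i - 1];
    }
}

void main()
{
    auto g = new Graph!true;
    g.addVertices(3);
    g.addEdge([0, 1, 1, 2, 0, 2]);
    assert(g._sumHead == [0, 2, 3, 3]);
    assert(g._sumTail == [0, 0, 1, 3]);
}
```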

So, we add the head/tail values to their respective arrays, we sort the indices, and we recalculate from scratch the _sumHead and _sumTail arrays. If a large number of edges has been added, the latter should be faster than just incrementing the sums one by one as we did when adding a single edge. For now we’ll stick with the insertion sort — it’s interesting to examine how it performs.

We’re obviously doing something right, because on my machine the time to create the smaller, 50-vertex graph is cut to less than a quarter of what it was before, placing it on a par with igraph. For the 10,000-vertex graph it’s almost 5 times faster than it was adding edges one at a time — but that’s still an order of magnitude slower than igraph. Presumably this is down to the sorting technique — if we’re adding many edges all at once, then insertion sort is probably no longer optimal. Let’s try out a regular sort:

This cuts the creation time for the 50-vertex graph very slightly, and for the 10,000-vertex graph very significantly — the latter is about 8 times faster than it was before! This is probably still very slightly slower than igraph, but only slightly — it’s about 0.002 seconds to generate the larger graph with D compared to about 0.0015 with igraph. (In case you’re wondering, this is an average over 1000 realizations, which takes about 1.5s for igraph, just over 2 for the D code.)

Clearly we’re standing on the shoulders of giants — we’re able to develop these solutions quickly and easily because we have the example of igraph already there in front of us to reverse engineer. The sumEdges() function, for example, is a fairly straight D-ification of igraph’s igraph_i_create_start() function. But as with the previous post, the key thing to note is how cleanly we can implement these ideas; there’s no unnecessary cruft or boilerplate needed to ensure the code operates safely, and many of the solutions are taken straight out of the D standard library, Phobos. Others, such as sorting by head or tail, can be implemented very simply because of the elegant syntax D offers us for delegates and lambdas.

* * *

Actually we’re not quite finished here. We sorted _indexHead and _indexTail according to _head and _tail values respectively, but in fact we want a double sort criterion as there is in igraph: _indexHead should be sorted primarily with respect to _head, but secondarily with respect to _tail; and vice versa for _indexTail. Thanks to suggestions from John Colvin and Bearophile on the D forums, we know there is a ready solution for this in the form of the multiSort function:
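A minimal illustration of multiSort applied to an edge index, with the head value as the primary key and the tail value as the secondary key:

```d
import std.algorithm : multiSort;

void main()
{
    size_t[] head = [0, 1, 0, 1];
    size_t[] tail = [3, 2, 1, 0];
    auto indexHead = [0, 1, 2, 3];

    // primary sort key: head value; secondary sort key: tail value
    indexHead.multiSort!((a, b) => head[a] < head[b],
                         (a, b) => tail[a] < tail[b]);
    assert(indexHead == [2, 0, 3, 1]);
}
```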

… while a small tweak to the insertion sort in D ensures correct ordering there as well:

We do take a performance hit here — we’re no longer level pegging with igraph, although we might get back that speed by using instead the radix sort that is used by igraph. But the hit is worth it: among other things, this extra level of sorting means we are guaranteed to get a sorted list out of the neighbours function, but its main application is in making it easier to search for edges and their place in the edge list. We’ll examine this next time when we look in more detail at the range of queries we can make of a graph object, and how they shape up in performance.

Edit 19-07-2013: I’ve tweaked the article to make clear that the double sort criterion described here is also used by igraph and — unsurprisingly — was the inspiration for doing it here as well.

Complex networks in D

Well, that was a bit of a gap in blogging terms.

What happened? Well, I took up a job at the Politecnico di Torino, I got rather busy, and … oh, details can wait. In the meantime, I thought I’d share some recent fun stuff that I’ve been doing with the D programming language — a small and currently very basic library for complex graphs.

Complex networks are something that I’ve been working close to professionally for a long time and yet, oddly enough, I’ve never particularly gone into the dark details of coding them. This is most likely because, absent some pressing need, many of the basic things one wants from a graph library are easier to knock up on the fly. It takes much less time and effort to write something quick’n’easy that does the handful of things you want, than to bend your otherwise completely original code to the needs of some specific library solution.

Sometimes, though, you want the heavy-duty toolbox, and when this comes along you have a number of options. There’s the Boost Graph Library, an STL-style C++ generic programming solution; there’s NetworkX, an attractive Python library; and then there’s the serious big hitter, igraph, which is written in C but also has interfaces in Python and R.

igraph in particular is known for being lightning-fast and capable of scaling to very large graphs indeed. It also has a very wide-ranging set of different tools for network analysis, and this makes it an obvious choice for any use case involving research on large-scale networks. On the other hand, it suffers from the inevitable problem to which all complex programs written in C succumb: its code is extremely verbose and difficult to follow, and its developers have had to implement and maintain custom versions of a great many ‘standard’ algorithms and tools that other languages give you out of the box.

Contrast this with the D programming language, which I’ve been using intensively and extensively over the past year. Without wanting to go into too many details, D delivers all the power, flexibility and elegance that we’re used to from higher level languages, while offering program speeds that are in the same ballpark as C/C++. It’s a joy to use for scientific programming, because that simulation that might have taken a ton of code to write in C, or involved a finicky debugging process in C++, can suddenly be written briefly and clearly and yet still deliver the performance you need; far more of your time can be spent actually generating results from your code, rather than writing and debugging it.

So, having concluded that I really did need some seriously thought-out graph code to move forward on certain problems, I was faced with a quandary. I wanted to use D, but that would mean having to write hooks for one of the existing graph libraries. igraph was the obvious choice — it’s fairly straightforward in principle to use a C library from D — but on the other hand, given how extensive igraph’s codebase is, I was worried that writing wrappers would in practice be almost as complex as writing a custom solution from scratch.

That was an irritation — and then it became an inspiration. Why not have a play with the data structures and algorithms of igraph and see if they could find a friendly home rewritten in D? This would be a nice baptism-of-fire test for the language’s performance capabilities, and it would also be interesting to see whether the elegant surface of idiomatic D code could be preserved in the face of the need for speed and memory efficiency. And so, last week, my little Dgraph library was born.

* * *

Now, I should emphasize up front that Dgraph is not so much in its infancy as at the point of first post-conception cell division. The reason I’m writing now is that I think it’s instructive how readily it was possible to get quick and relatively elegant code up and running and with good performance. Let’s start with the basics: the core data structure of the graph library.

igraph’s core data structure is a model of simplicity:

What’s going on here? n is the number of nodes, and directed unsurprisingly indicates if the graph is directed or not. from and to contain respectively the ‘head’ and ‘tail’ vertices of graph edges — that is, the i’th edge goes from vertex from[i] to vertex to[i]. Vertex IDs are assumed to be numbers in the range 0, 1, …, |V| – 1, where |V| is the total number of vertices.

Now it gets more interesting: oi is the set of edge index values 0, 1, …, |E| – 1 (where |E| is the total number of edges), sorted according to the value of from. That is, if we iterate over from[oi[0]], from[oi[1]], …, we’ll get an increasing sequence of vertex IDs; and the same with respect to ii, except now we’re sorting according to to. In other words, oi and ii give us the sequence of edges sorted either to the vertices from which they are outgoing, or the vertices to which they are incoming. Obviously oi and ii both have the same length as from and to, that is, the total number of edges in the graph.

os and is by contrast represent sums of the numbers of outgoing or incoming edges. More precisely, os[v] is the total number of edges whose from value is less than v, while is[v] is the total number of edges whose to value is less than v. So, if we want the total number of edges outgoing from v, it’s given by os[v + 1] - os[v], while the total number of incoming edges is given by is[v + 1] - is[v]. For this to work, os and is need to have a total length equal to |V| + 1.

Taken together, these elements take up minimal space in memory yet allow us to elegantly reconstruct other graph properties, such as a vertex’s degree or its outgoing or incoming neighbours. To see why, it may be best to switch to D — where we can most simply represent the transformations involved. Let’s start with a simple rewrite of the basic data structure:
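Something like the following sketch, reconstructed from the description that follows, with directedness handled as a compile-time template parameter:

```d
final class Graph(bool directed)
{
    private size_t[] _head;        // 'from' in igraph terms
    private size_t[] _tail;        // 'to'
    private size_t[] _indexHead;   // 'oi': edge IDs sorted by head value
    private size_t[] _indexTail;   // 'ii': edge IDs sorted by tail value
    private size_t[] _sumHead;     // 'os': cumulative edge counts by head
    private size_t[] _sumTail;     // 'is': cumulative edge counts by tail
}

void main()
{
    auto g = new Graph!false;   // directedness is fixed at compile time
    assert(g !is null);
}
```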

We’ve done away with the number of vertices, because we can calculate that from array lengths. from and to have been renamed as _head and _tail, partly to avoid potential naming clashes, but mainly because I like visualizing my graph edges as snakes or worms or other wriggly things (and now you have this lovely image in your head too: you’re welcome:-). _indexHead and _indexTail are, unsurprisingly, edge indices sorted according to head or tail values; and _sumHead and _sumTail are the sums of edges with head or tail values less than the corresponding value.

We also have a boolean directed flag that indicates if the graph is directed or not (as we’ll see below, it’s a compile-time template parameter rather than a run-time variable). If a graph is not directed (that is, directed == false), then we assume a restriction that _head[e] <= _tail[e] for all edges e.

Let’s start with the simple stuff. The total number of vertices can be calculated from the lengths of _sumHead or _sumTail; we’ll pick the first, and throw in a nice D-style assert() statement to make sure that the two actually are identical:
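In code, something like:

```d
final class Graph(bool directed)
{
    private size_t[] _sumHead, _sumTail;

    this() { _sumHead = [0]; _sumTail = [0]; }

    @property size_t vertexCount() const
    {
        assert(_sumHead.length == _sumTail.length);
        return _sumHead.length - 1;
    }
}

void main()
{
    auto g = new Graph!false;
    assert(g.vertexCount == 0);   // no vertices yet: both sum arrays are [0]
}
```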

What about vertex degrees, i.e. the number of edges to which a vertex is connected? In the case of a directed graph, it should be clear that the total number of edges incoming to node v will be equal to the total number of edges incoming to nodes with id less than v + 1 less the total number of edges incoming to nodes with id less than v, and correspondingly for outgoing edges:

… while for an undirected graph, where incoming edges are also outgoing edges and vice versa, we have to combine these numbers:

We still define in- and out-degree functions but this time just alias them to the main degree() function. If you’re familiar with D’s syntax, you will have noticed that this static if () { ... } else { ... } formulation is one simple application of D’s generic programming interface: the static if will be evaluated at compile time so that a directed or undirected graph will have only the functions it’s meant to have.
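Put together, the degree functions come out something like this sketch (the constructor is bolted on purely so that the example stands alone):

```d
final class Graph(bool directed)
{
    private size_t[] _sumHead, _sumTail;

    // constructor added here purely for the sake of a runnable example
    this(size_t[] sumHead, size_t[] sumTail)
    {
        _sumHead = sumHead;
        _sumTail = sumTail;
    }

    static if (directed)
    {
        size_t degreeIn(immutable size_t v) const
        {
            return _sumTail[v + 1] - _sumTail[v];
        }

        size_t degreeOut(immutable size_t v) const
        {
            return _sumHead[v + 1] - _sumHead[v];
        }
    }
    else
    {
        size_t degree(immutable size_t v) const
        {
            return (_sumHead[v + 1] - _sumHead[v])
                 + (_sumTail[v + 1] - _sumTail[v]);
        }

        alias degreeIn = degree;
        alias degreeOut = degree;
    }
}

void main()
{
    // directed: 2 vertices with edges 0 -> 1 and 1 -> 0
    auto d = new Graph!true([0, 1, 2], [0, 1, 2]);
    assert(d.degreeOut(0) == 1 && d.degreeIn(0) == 1);

    // undirected: 2 vertices joined by two parallel edges (head <= tail)
    auto u = new Graph!false([0, 2, 2], [0, 0, 2]);
    assert(u.degree(0) == 2 && u.degreeOut(1) == 2);
}
```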

What about vertices’ neighbours in the graph? Let’s start once more with the directed case. Suppose we want to find the incoming edges to a vertex v, i.e., all the edges for which v is the tail. Well, we know that there are _sumTail[v] edges whose tail is less than v, and _sumTail[v + 1] edges whose tail is less than v + 1 (that is, less than or equal to v). So, if we label the vertices in order of increasing tail value, we know that edges 0, 1, 2, ..., _sumTail[v] - 1 will have tail less than v, while edges _sumTail[v], ..., _sumTail[v + 1] - 1 will have tail equal to v (assuming that _sumTail[v] < _sumTail[v + 1] — if not, there are no neighbours!). And it just so happens that we have a corresponding sorted index of the edges in the form of _indexTail, so taking that, and using a similar process for outgoing neighbours, we have:
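Something like the following sketch (again with a constructor bolted on, taking the pre-built arrays directly, so that the example stands alone):

```d
import std.algorithm : map;
import std.range : chain, iota;

final class Graph(bool directed)
{
    private size_t[] _head, _tail;
    private size_t[] _indexHead, _indexTail;
    private size_t[] _sumHead, _sumTail;

    this(size_t[] head, size_t[] tail, size_t[] indexHead, size_t[] indexTail,
         size_t[] sumHead, size_t[] sumTail)
    {
        _head = head; _tail = tail;
        _indexHead = indexHead; _indexTail = indexTail;
        _sumHead = sumHead; _sumTail = sumTail;
    }

    static if (directed)
    {
        auto neighboursIn(immutable size_t v)
        {
            return iota(_sumTail[v], _sumTail[v + 1])
                   .map!(a => _head[_indexTail[a]]);
        }

        auto neighboursOut(immutable size_t v)
        {
            return iota(_sumHead[v], _sumHead[v + 1])
                   .map!(a => _tail[_indexHead[a]]);
        }
    }
    else
    {
        auto neighbours(immutable size_t v)
        {
            return chain(iota(_sumTail[v], _sumTail[v + 1])
                         .map!(a => _head[_indexTail[a]]),
                         iota(_sumHead[v], _sumHead[v + 1])
                         .map!(a => _tail[_indexHead[a]]));
        }
    }
}

void main()
{
    import std.algorithm : equal;
    // directed graph on 3 vertices with edges 0->1, 0->2, 1->2
    auto g = new Graph!true([0, 0, 1], [1, 2, 2], [0, 1, 2], [0, 1, 2],
                            [0, 2, 3, 3], [0, 0, 1, 3]);
    assert(g.neighboursOut(0).equal([1, 2]));
    assert(g.neighboursIn(2).equal([0, 1]));
}
```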

Some explanation may be in order. iota(m, n) gives us the range of values m, m + 1, m + 2, ..., n - 1. For incoming neighbours, we’re mapping that range via the transformation a => _indexTail[a] => _head[_indexTail[a]]. That is, we’re mapping the sort order of the vertex to the edge index value, and from that to the tail value. In other words, we’re mapping to the vertex IDs that correspond to the heads of the edges for which v is the tail. Similarly, for outgoing neighbours, we map to the vertex IDs that correspond to the tails of edges for which v is the head — and for undirected networks, we chain the two lists together. (Besides the wriggly snake/worm imagery, that’s another reason I chose to refer to ‘head’ and ‘tail’: in an undirected network, links are bidirectional so it doesn’t make sense to speak of ‘from’ and ‘to’.)

Notice how easy this is to write down, and how cheap to calculate. We’re not even storing the values of the neighbours in memory — they can be calculated lazily as they are needed! Ultimately this lazy calculation may come back to bite us in some extreme performance situations, but it’s brief, elegant and effective.

Put all of that together, and what have we got?
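A reconstruction in the same spirit as the original listing. The constructor taking the whole edge list up front is my own addition, purely so that the example stands alone; the real code builds its graphs incrementally, which we come to in later posts:

```d
import std.algorithm : map, sort;
import std.range : chain, iota, zip;

final class Graph(bool directed)
{
    private size_t[] _head, _tail;
    private size_t[] _indexHead, _indexTail;
    private size_t[] _sumHead, _sumTail;

    this(immutable size_t vertices, in size_t[] heads, in size_t[] tails)
    {
        assert(heads.length == tails.length);
        _head = heads.dup;
        _tail = tails.dup;
        static if (!directed)
            foreach (immutable e; 0 .. _head.length)
                if (_tail[e] < _head[e])   // convention: head <= tail
                {
                    immutable h = _head[e];
                    _head[e] = _tail[e];
                    _tail[e] = h;
                }

        _indexHead = new size_t[_head.length];
        _indexTail = new size_t[_tail.length];
        foreach (immutable e; 0 .. _head.length)
            _indexHead[e] = _indexTail[e] = e;
        _indexHead.sort!((a, b) => _head[a] < _head[b]);
        _indexTail.sort!((a, b) => _tail[a] < _tail[b]);

        _sumHead = new size_t[vertices + 1];
        _sumTail = new size_t[vertices + 1];
        foreach (immutable h; _head) ++_sumHead[h + 1];
        foreach (immutable t; _tail) ++_sumTail[t + 1];
        foreach (immutable v; 1 .. vertices + 1)
        {
            _sumHead[v] += _sumHead[v - 1];
            _sumTail[v] += _sumTail[v - 1];
        }
    }

    @property size_t vertexCount() const { return _sumHead.length - 1; }
    @property size_t edgeCount() const { return _head.length; }
    @property auto edge() const { return zip(_head, _tail); } // (head, tail) pairs

    static if (directed)
    {
        size_t degreeIn(immutable size_t v) const { return _sumTail[v + 1] - _sumTail[v]; }
        size_t degreeOut(immutable size_t v) const { return _sumHead[v + 1] - _sumHead[v]; }

        auto neighboursIn(immutable size_t v) const
        {
            return iota(_sumTail[v], _sumTail[v + 1]).map!(a => _head[_indexTail[a]]);
        }

        auto neighboursOut(immutable size_t v) const
        {
            return iota(_sumHead[v], _sumHead[v + 1]).map!(a => _tail[_indexHead[a]]);
        }
    }
    else
    {
        size_t degree(immutable size_t v) const
        {
            return (_sumHead[v + 1] - _sumHead[v]) + (_sumTail[v + 1] - _sumTail[v]);
        }

        alias degreeIn = degree;
        alias degreeOut = degree;

        auto neighbours(immutable size_t v) const
        {
            return chain(iota(_sumTail[v], _sumTail[v + 1]).map!(a => _head[_indexTail[a]]),
                         iota(_sumHead[v], _sumHead[v + 1]).map!(a => _tail[_indexHead[a]]));
        }
    }
}

void main()
{
    import std.algorithm : equal;

    // undirected triangle plus a pendant vertex 3 hanging off vertex 0
    auto g = new Graph!false(4, [0, 1, 2, 3], [1, 2, 0, 0]);
    assert(g.vertexCount == 4 && g.edgeCount == 4);
    assert(g.degree(0) == 3 && g.degree(3) == 1);
    assert(g.neighbours(3).equal([0]));
}
```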

You’ll notice I sneakily threw in a couple more functions — edgeCount does exactly what it says on the tin, while edge returns a list of edges arranged as (head, tail) pairs. Oh, and (this might be a bad idea:-) I threw in a few spelling alternatives for those in the world who have heinously departed from the mother tongue … ;-)

What I’ve presented here is missing a bunch of other aspects of the implementation — for example how to add or remove vertices or edges from the graph, how to find the edge ID of a given (head, tail) pair, or to find out if there is even such an edge in the first place — and we haven’t even begun to address performance yet! Don’t worry, we’ll get to them in future blog posts. But what’s striking for me is that here we’ve implemented the core data structure and query operations of a super-efficient graph representation in barely 80 lines of code. In C, it takes more than that just to implement the function to determine a vertex’s neighbours.

[… interested in finding the source? Dgraph is on GitHub!]

EDIT 17-07-2013: I’ve tweaked the examples using map in the code so they now use UFCS chaining, i.e. they have the form input.map!() rather than map!()(input). This brings this article in line with what’s now in the codebase.