In other happy news (though not quite on the scale of the Higgs boson), my first contribution has been accepted to D’s standard library, Phobos. My patches implement a much faster random sampling algorithm, and also fix a couple of bugs that were present in the pre-existing random sampling functions.
Random sampling is a fun problem, which is addressed by Donald Knuth in chapter 3 (vol. 2) of The Art of Computer Programming. Here’s the essence of it: suppose you have a collection of N items, which without loss of generality we might label 1, 2, …, N or alternatively 0, 1, …, N – 1. You want to take a random subset of these items of size n < N, with the probability of picking any given record being equal to the chance of picking any other given record.
Probably the simplest way to do this is to randomly select items from the original set with probability n/N, keeping them if they have not already been selected, reselecting otherwise, and continuing until you have selected n in total. This has two problems, however: first, we have to store the already-selected items so that we can compare them with new selections, and second, with each new sample point, we have to make an increasing number of comparisons and hence run an increased risk of having to reselect the chosen item.
This doesn’t matter too much if we’re selecting only 3 points from 10, say, but suppose we want to select 1000 points out of 100,000, or 100,000 out of 1 million? It’s also problematic if you’re trying to sample records from a slow-access storage device like a tape, where it’s costly to jump back and forth between different items. We need a method which scales more elegantly in terms of both storage space and algorithmic complexity.
What if, instead, we sequentially consider each item and randomly decide whether or not it is in the sample? We can’t now use n/N as the probability to select — we’d get a sample that was only on average of size n — so we need to find an alternative rule which allows us to do this sequential walk and pick exactly n points without biasing the selection process.
This is a problem which was solved in 1962 in two independently-written papers by Fan et al. and Jones, and their solution is referred to by Knuth as Algorithm S. It rests on a simple principle: the probability of picking a given item is the number of sample points left to select divided by the total number of items still remaining to consider. So, if you have so far selected m sample points and you’re on the (t + 1)st record, the chance of selecting it will be (n – m)/(N – t). Once you’ve selected exactly n points, you exit the sampling process. Here’s a simple example, where we try and select 2 points out of 5:
This algorithm has a lot of nice features. It’s entirely sequential, so you could run it on a tape of records without trouble (very important back in 1962). It’s very simple to implement. On the other hand, although the worst-case scenario of considering every item is usually avoided, it’s still got an algorithmic complexity of O(N) and on average requires O(N) random variates to be generated. Again, not a big deal if you’re sampling 10 from 100, but what if you’re trying to sample 1000 from 10 million?
The problem is we have to consider each item one by one, 1, 2, 3, …, and generate a new random variate (computationally expensive!) for each item we consider. What would be much nicer is to instead jump straight to the selected record, skipping over the others. In other words, what we’d like to do is generate a ‘skip size’ S of records to ignore (which might be 0), select the record we land on, and then skip again from the following record, and so on until the sample is complete. Like this:
For this to be really worthwhile, we’d like to generate only one random variate per skip (meaning the total number scales with the sample size rather than the total number of items) and for S to be generated from that random variate in O(1) complexity. At the same time, we need to ensure that the way we generate S preserves the probabilistic character that we are looking for, i.e. that each record has an equal chance of being in the chosen sample.
Algorithms A and D
A solution to this problem was provided in 1984 by Jeffrey Scott Vitter in a paper entitled Faster methods for random sampling, which introduced several new algorithms with superior scaling properties. Improved versions of the two most effective algorithms were introduced in a later article, An efficient algorithm for sequential random sampling (free access to both these papers is available via Prof. Vitter’s online library). An in-depth presentation of all the details of these two algorithms probably needs a whole post in itself, but we can give a rough overview as follows.
The first of these algorithms, Algorithm A, derives from some simple observations on properties of the distribution function F(s) of skip sizes (for now I won’t give the mathematical details here as I’m currently missing a good solution for mathematical notation, but Vitter’s 1984 paper covers this in depth). Vitter used these properties to derive a simple inequality between N, n, s, and the value of a randomly-generated number V in the open interval (0, 1). The skip size S is then equal to the minimum value of s that satisfies this inequality for a given randomly-generated value of V. In other words, we need generate only 1 random variate per sample point, or n total, compared with O(N) for Algorithm S.
The downside to the simplicity of this approach is that the solution to the inequality is calculated inefficiently: the algorithm still has an overall complexity of O(N). This is still an improvement on Algorithm S, and one that can be optimal for cases where O(n) and O(N) are pretty much equivalent (e.g. selecting 20 out of 100, say).
Algorithm D takes advantage of this by using Algorithm A’s approach where the ratio (n – m)/(N – t) is larger than a certain value, but implements an extended approach that solves the computational complexity issue for the case where n is smaller. In its most basic form it relies on generating two random variates per sample point, the first, U, uniformly distributed in (0, 1), and the second, X, from a distribution closely approximating the skip size distribution F(s). Again, Vitter derives an inequality between these values that the skip size must satisfy; but in addition he observes that there is a simpler-to-calculate inequality, which is satisfied with high probability, that guarantees compliance. In the case of either of these inequalities being satisfied, the skip size S is set equal to the ‘floor’ of X, i.e. the largest integer less than X.
So, the skip size S is calculated roughly as follows:
- Check if the ratio (n – m)/(N – t) is sufficiently large; if so, use Algorithm A to generate this skip size.
- Generate the random variables U and X.
- Check the simple inequality; if satisfied (likely!), set S equal to floor(X), and return this value.
- Check the more complicated inequality; if satisfied, set set S equal to floor(X), and return this value. If not satisfied (very rare!) return to Step 2, and repeat until a satisfactory value is generated.
This basic structure can be improved on in several ways. In practice, once the ratio (n – m)/(N – t) has exceeded the threshold once, we can switch over to Algorithm A for the entire remainder of the sampling; the supposed extra cost of Algorithm A is generally offset by a reduction in the number of times we have to calculate and compare the value of the ratio. This also helps to guard against worst-case scenario behaviour of Algorithm D when (n – m) ≈ (N – t).
Second, Vitter shows that it’s possible in many cases to deterministically calculate the next value of X from the previous one, without violating the independence of the sequence of random values. This allows a sizeable reduction in the number of random variates required, so that in practice the total number is closer to n than 2n.
Finally, a trivial improvement is to consider separately the case where only 1 sample point remains to be selected — where of course all one has to do is generate a uniformly-distributed integer to calculate the skip across the remaining items.
Getting it into D
The D framework for random sampling, written by Andrei Alexandrescu, is a rather elegant creation (worth its own blog post) that uses D’s concept of ranges to create a lazily-evaluated structure that can sample from a broad range of different inputs — an array, a sequence of numbers generated automatically without being stored, an input file, essentially any input that can be interpreted through D’s concept of an InputRange.
I’d originally implemented Algorithms A and D as a little extension library based on the GNU Scientific Library, and it was fairly trivial to rework the code for D (licence lawyers note that the pieces reworked were entirely my own work and not derivative of GSL). To my knowledge, this makes D the only language — certainly the only systems-programming language — to have Vitter’s algorithm implemented in its standard library. I’m not even aware of any mainstream scientific library that implements it, though I’d be delighted to learn otherwise.
For now, though, it seems a bit of a unique feature — and I’m definitely proud to have brought Algorithm D to D.