Skip to content

Way back at the beginning of time, we had a post on performing PageRank on a 1.5B edge graph of who-follows-who on Twitter. We talked a bit about how several big data systems don’t do quite as well as a 40 line, single-threaded C# program. There was also a promise to show how to make things go much, much faster. So we’re going to do that today.

Recall that what goes on in the PageRank computation, and a lot of graph analysis, is that the program repeatedly scans the list of graph edges, and for every edge (i,j) does some logic on vertex state associated with vertices i and j. There are three things the program has to do, and they control how long the computation takes:

1. Read edges in from external storage (disk, flash, maybe RAM).
2. Fetch state associated with vertices i and j.
3. Do some computation on the fetched state.

For the sorts of graphs we are looking at one can describe an edge using two 32bit integers, or eight bytes. A bit of organization (grouping edges by source, for example) can bring this down to just one integer and four bytes. A modern flash-based solid state drive can read data (thing #1) in excess of 500MB/s, or 125M edges per second. This is much faster than we can fetch data from random access memory (thing #2) which takes about 50 nanoseconds to read a cold location, which would limit us to 20M edges per second. If the data exhibit locality, and our numbering of vertices reveals this, we can do better due to caches. Finally, the amount of computation involved in PageRank (thing #3) is basically just an addition, and takes effectively no time at all.

The limiting term for performance is not actually reading the data off of disk, but rather performing the random access into memory. We can confirm this on the Twitter data by loading the entire dataset into main memory (it only takes 6GB), and seeing how fast the 40 lines of code go in that case. It turns out we top out at 40 seconds to do an iteration, corresponding to 40M edges per second. This is faster than we would expect if every lookup was truly random, but there is apparently some locality that caching successfully exploits to reduce the overall time.

A Standard Approach

Before describing the clever idea we’ll use, let’s go over an approach that only sort-of works. If you have been following big data for a while, you’ve probably been told that you can make everything go faster if you use more machines. That is certainly something you can do here. By partitioning the graph vertices up, each machine can process just a fraction of the vertices and their incident edges, determining updates for their neighbors and sending those along to the machines responsible for them.

This works great at first, but it has some limitations: as you use more and more machines, the number of distinct vertices each machine needs to send an update to doesn’t necessarily shrink, or at least not linearly. So while you might get 1,000 machines, each performing 1/1000 of the total work, if each of them still needs to send a message to every other vertex in the graph they’ll still need to send as much data over the network, and will spend exactly the same amount of time waiting on the network. The amount of data and time spent waiting doesn’t need to improve just because you are using more machines. It might, or it might not, and most likely it will improve but not by as much as you would like.

A Better Approach, Using Math

Part of the problem with the approaches described above is that they exhibit limited locality of reference. Even though edges might be arranged by source, their destinations could reference arbitrary other vertices. This results in lots of cold reads from memory and messages sent over the network. But it doesn’t seem like there is much of an option short of running some clustering on the graph to try and tease out structure, which is what PowerGraph does to improve its execution times.

In fact, there is a really excellent technique used in parts of computer science for quite a while, and apparently totally unknown to other parts of computer science. The keyphrase is “Hilbert Space-Filling Curve“, hereafter (HSFC), which sounds pretty scary but is actually a pretty simple concept. The idea is that we are going to arrange the edges not by source, or by destination, but using a different order in which nearby edges are more likely to have nearby sources and destinations.

Informally, the way this works is that the HSFC recursively partitions a set of edges into four parts, each corresponding to a square in the adjacency matrix. It orders the edges within each square, and then concatenates the orders of the squares. The recursive rule is a bit clever to make sure that the order of edges stays continuous (and doesn’t leap as it moves from one square to another), but that is the spirit.

The first two steps in determining the Hilbert space-filling curve order. The adjacency matrix is broken into four parts (numbered 0, 1, 2, 3), and then each part is further subdivided and numbered (appending 0, 1, 2, 3) in a way that the sequence of squares are contiguous. The process generalizes to arbitrary depth.

Operationally, the curve corresponds to a function that maps pairs of (i,j) coordinates to a single integer with twice as many bits, and the sequence described by these integers order the pairs (i,j). Ordering the edges is no more complicated than performing the transformation and sorting. Details on how to compute this transformation (and example code, and animations, and other good stuff) are available at the excellent wikipedia page

The important property of the HSFC order is that if we lay out our edges in an adjacency matrix (entry (i,j) being where we would put an edge from i to j), and we overlay a grid of $2^b \times 2^b$ squares, then the HSFC order moves from square to square, hitting all elements within a square once and never returning to that square again. This is excellent from the perspective of cache behavior, imagining each cache line/page as a $2^b$ wide hunk of state: for each pair of cache lines/pages the HSFC ordering presents all the work (edges) requiring that pair of lines/pages together as a group.

The HSFC order not only helps with memory locality of the random-access variety, it is also a great basis for distributing our computation. Given a cluster of n machines, rather than dice the vertices into n parts and distributed the work that way (corresponding to thin horizontal slices of our adjacency matrix) we can cut the matrix into $\sqrt{n} \times \sqrt{n}$ squares, and give each machine a square. This is handy because although we’ve kept the amount of CPU work per machine constant, the amount of communication each machine requires (input read and output written) decreases as n increases. Specifically, each machine needs to receive $\#(nodes) / \sqrt{n}$ input values, and transmit $\#(nodes) / \sqrt{n}$ output values. By comparison, each machine in the vertex partitioning approach receives $\#(nodes) / n$ input values and may need to send $\#(nodes)$ output values. In the worst case, the sum of the two values is better for the HSFC approach by a factor of $\sqrt{n}/2$.

To order the edges using the HSFC you just need to apply the transformation to map each edge (i,j) to a 64bit integer, and then sort by this integer. If you just want to partition the data, you can use the observation above about the HSFC grouping $2^b \times 2^b$ grid cells, and just partition the data by the grid cells. You can then sort each part independently if you want to, which should be much cheaper than sorting the whole thing.

Going Faster in Practice

So does all this mathematical argle-bargle actually mean anything in practice? Re-running the same algorithm from the previous post, where the edges are loaded into memory but I’ve secretly changed the order of the edges to that of the HSFC, the elapsed time per iteration drops from 40 seconds to 18 seconds. So yes, it does seem to mean something.

This partitioning described above is good for clusters, but it is also good just on single machines. My laptop has two cores, each of which is hyperthreaded, and when I do a 2×2 grid and assign the four squares to different cores the running time drops from 18 seconds down to 11 seconds. Not a factor of four, but more like sqrt(four), which might be what to expect if we are bound by the memory subsystem.

To test out just how fast we can make this go, I wrote a Naiad job to farm out the work to a number of machines ranging from 4 to 64, each getting a square from an appropriately sized grid. Each of them uses four cores just like my laptop (though they are real cores, not hyperthreaded).

Reported per-iteration execution times for several systems using a 1.5 billion edge graph derived from a crawl of the Twitter follower relation. Note that for other systems these times are likely to improve. For example, the PowerGraph number for 64 machines has been reduced to 1.8 seconds.

The PowerGraph folks have since improved their implementation, with the 64 node number dropping to 1.8 seconds/iteration on slightly better hardware. At the limit of 64 machines, the Hilbert-on-Naiad approach is doing iterations in less than half a second, still four times faster than even PowerGraph’s newest numbers, and the scaling suggests that it should continue to improve for a while at least.

It is important to stress that these numbers are per-iteration once you get your data partitioned up and sorted. That takes some non-trivial time compared to each individual iteration. Depending on how much you want to do with your graph, you may be better off not doing any fancy data arrangement. But, if you plan on doing repeated graph analysis you have a decent option. PowerGraph also does some fancy data arrangement, and it could certainly use this approach too if it works out better.

Conclusions

If you want to do large-scale graph analysis, you may need to pay attention to locality if you want performance. The Hilbert Space-Filling Curve is one way to get locality out of any graph, and I think just about anyone building a graph processing platform should know about it. If you know someone building a graph processing platform, you should tell them about it.

Although PowerGraph uses some fancy data-dependent pre-processing of the edges to cluster them and improve locality, it seems here that the data-independent ordering / partitioning indicated by the HSFC can be good enough (or possibly better). This makes it a lot faster to start up, and really easy to maintain if new edges arrive or existing edges are removed. It is completely plausible that one can improve on the HSFC ordering by paying more attention to the data, and I’d be happy to try that out if anyone has concrete suggestions.

If you actually want to PageRank a graph with lots of evident structure, like the web graph, these approaches are all ridiculous. Many folks (including myself) did work on how to efficiently PageRank web graphs back when that was a rite of passage in the web community. Check out this paper for a discussion of standard ways to make PageRanking go fast.

Finally, we’ll be saying a bit more in the coming weeks about what makes Naiad fast. We should have some exciting news for folks interested in grabbing the code and playing with it.

Advertisements

From → Naiad, Opinions

Leave a Comment