Skip to content

GraphLINQ: A graph library for Naiad

by on May 8, 2014

The most recent Naiad release contains GraphLINQ, a new library of LINQ-like operators full of graph-specific optimizations. GraphLINQ is an example of how one can build up domain-specific libraries with carefully tailored implementations, which nonetheless integrate smoothly with Naiad and the rest of its libraries. This post (first of several) explains some of GraphLINQ’s methods by building up a PageRank example. We will talk more about asynchronous graph processing, and how GraphLINQ is built in future posts.

Getting GraphLINQ

GraphLINQ is a library built on top of Naiad, so the only thing you need to do is make sure you get the dll (available through GitHub or NuGet) and then add the following line to the top of your program:

using Microsoft.Research.Naiad.Frameworks.GraphLINQ;

You now have access to several new types and methods, which we'll summarize next.

If you haven't, this would be a great time to read the introduction to Naiad and how to write Naiad programs, without which the remaining discussion may be less clear. Additionally, reading about Naiad's support for high-level frameworks should set the stage for how we are going to approach a graph-specific Naiad library.

The example we work through (PageRank) is included in the current Naiad source release, and runs either locally or on Azure (the two versions differ only in how they read their input data).

Data structures and programming patterns

GraphLINQ introduces a few simple structs representing graph nodes and edges, as well as node-with-values and edge-with-values in case you want to attach additional data to either of them. Graph nodes are just simple wrappers for integer node identifiers, but they help keep the code looking clean and type-safe.

GraphLINQ is based around the manipulation of Naiad streams containing either nodes or edges, each representing collections of nodes or edges that may evolve as a computation proceeds. These streams meet in new data-parallel operators implementing functionality for aggregating node values along edges, manipulating node state in blocking or streaming manners, and other graph transformations. Most GraphLINQ computations should be thought of as computations on entire graphs and collections of node and values, rather than as computations on their underlying streams.

GraphLINQ's main graph-specific optimization is that its operators are able to use of dense integers as keys. Almost all of its methods are extension methods on Naiad streams of Node, Edge, NodeWithValue, or EdgeWithValue, exploiting the fact that the operations are data-parallel where the keys are node identifiers. Rather than use generic dictionaries, GraphLINQ is able to store data in flat arrays indexed by the dense node identifiers. This reduces the amount of memory required and the time to access that data.

GraphReduce: aggregating data over edges

Perhaps the most common pattern in graph processing is to use the graph edges to communicate node state between their endpoints. GraphLINQ exposes a few methods for doing just this, but we'll look at a simple one, GraphReduce:

GraphReduce(Stream<NodeWithValue<TValue>, TTime> nodeValues,
            Stream<Edge, TTime> edges,
            Func<TValue, TValue, TValue> combiner,
            bool useLocalAggregation)

GraphReduce takes a stream of values at nodes, a stream of edges, and a combiner function, and produces an output stream of the values at nodes resulting from combining the values at each of their neighbors. GraphReduce treats the edge stream as static, meaning it locks in the graph the first time it sees edges. On the other hand, the aggregates are produced independently for each logical time, making GraphReduce suitable for use in iterative computations.

GraphReduce is already a fairly powerful primitive to use, but to build up a computation like PageRank we will want a few other tools as well. Rather than show you all the prototypes, we'll work through the code and call out methods that are specialized in GraphLINQ, typically just to exploit dense integer keys by storing data in arrays rather than dictionaries.

An example: Writing PageRank

PageRank is fundamentally an iterative computation in which each round sees each node's rank distributed among its neighbors (nodes it has edges to). We'll write a loop body which takes a collection of initial node ranks, some node degrees (to divide ranks by), and some edges along which the divided ranks should be distributed, and produces the ranks at the next iteration.

// performs one step of PageRank, updating ranks by dividing 
// by degree, distributing along edges, and aggregating.
Stream<NodeWithValue<float>, T> 
PageRankStep<T>(Stream<NodeWithValue<float>, T> ranks,
                Stream<NodeWithValue<Int64>, T> degrs,
                Stream<Edge, T> edges)
  where T : Time<T>
{
  // join ranks with degree to divide by degree (or zero if no edges).
  return ranks.NodeJoin(degrs, (r, d) => d > 0 ? r * (0.85f / d) : 0.0f)
              .GraphReduce(edges, (x, y) => x + y, false)
              .Where(x => x.value > 0.0f);
}

There are a few things that might be surprising about the code.

First, rather than just divide by the degree we are testing whether it is zero or not. Because we are using arrays rather than dictionaries, it is not obvious to GraphLINQ whether a particular location contains valid data, or simply exists because we needed an array of a sufficient length. In a Dictionary, the presence of a key is what contained the information that a location is valid. Of course, we could always extend the datatype with an isValid bit, but in many cases (such as this one) the data itself reveals whether it is valid or not, and allows a smaller memory footprint in this case.

Similarly, GraphReduce may produce outputs with zero rank, because it is not actively tracking which ranks are valid (the result of input values) and not. One can add a valid bit if it is difficult to tell the difference, but in this case we know that a rank is valid if and only if it is strictly positive. The use of Lindi's Where method removes such zero ranks from circulation.

Second, if you are familiar with PageRank, you might think we are doing a somewhat strange version of the computation where we scale down the ranks by a factor of 0.85, but don't add 0.15 to everyone's rank, which would be the standard behavior. It turns out we are doing the right computation, which we'll see in just a moment, because our plan is to accumulate the ranks produced in all iterations, rather than simply take the ranks produced in the final iteration. This approach has better incremental performance (once a node's rank stops changing it stops sending data to its neighbors) and generally runs faster while giving the same answer.

If we put the function above into a Naiad loop (the method IterateAndAccumulate taken from Naiad's Lindi framework), we get a program that looks like the following:

// Performs PageRank computation on a collection of edges.
// Ingests edges, extracts degrees, and then iteratively 
// applies PageRank steps, accumulating the results.
void PageRankMain(Computation computation, IEnumerable<Edge> graph, int iterations)
{
  // converts a graph from IEnumerable to Stream<Edge, Epoch>.
  var edges = graph.AsNaiadStream(computation);

  // capture the node degrees.
  var degrees = edges.Select(x => x.source)
                     .CountNodes();

  // initial distribution of ranks.
  var start = degrs.Select(x => x.node.WithValue(0.15f));

  // define an iterative pagerank computation, add initial values, and aggregate up the results.
  var ranks = start.IterateAndAccumulate((lc, deltas) => deltas.PageRankStep(lc.EnterLoop(degrs),
                                                                             lc.EnterLoop(edges)),
                                         null, iterations, "PageRank")
                   .Concat(start)
                   .NodeAggregate((x, y) => x + y)
                   .Where(x => x.value > 0.0f);

  // start computation, and block until completion.
  computation.Activate();
  computation.Join();
}

As you might expect, CountNodes does a data-parallel count much like Count in Lindi and DifferentialDataflow, and NodeAggregate does a data-parallel aggregation (in fact, GraphReduce uses it internally).

That's really all there is to writing PageRank using GraphLINQ. The code above comes with the most recent release of Naiad on GitHub, and you can try it out on graphs from Stanford's SNAP page. Their LiveJournal graph (69M edges) takes a second or two per iteration on my laptop, once it is loaded up.

We will have a few more posts in the next few days about other ways to use GraphLINQ, discussing asynchronous graph processing, as well as some of the hairier issues about moving data in and out of graph form (a subject often untouched by graph-specific systems).

From → Naiad

Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 62 other followers

%d bloggers like this: