# Strongly Connected Components in Naiad

In the previous post we looked at writing an iterative algorithm for computing the connected component structure of an undirected graph. In this post we are going to expand on this, and show how to use that functionality as a subroutine in computing the *strongly* connected component structure of a directed graph. The code to follow is in NaiadExamples/StronglyConnectedComponents.cs in the Naiad download.

The strongly connected components (SCCs) of a directed graph are groups of vertices that can both reach and be reached by all other vertices in the group. Classically, the SCC structure of a graph is computed using depth-first search (Kosaraju's algorithm, for example, uses two passes of it), which is difficult to parallelize. We'll develop an alternate approach based on repeatedly performing the reachability computations that were the basis of our connected components computation. The result is a doubly-nested fixed-point computation, a pattern that doesn't fit especially well in most existing iterative big-data frameworks.

Let's sketch our algorithm first. Recall from the previous post that we put together a label propagation computation that, when run on a symmetric graph, identifies the connected components. The same computation can be run on a directed graph, but if the edges are not symmetric the resulting labeling does not correspond exactly to connected components. Instead, the computation determines for each vertex the least label that can reach that vertex. However, it is possible that the two endpoints of a directed edge receive different labels (the destination vertex may receive a smaller label from elsewhere). If we see such an edge, we know that its endpoints cannot be in the same SCC: if they were, the label could have flowed from the destination back around to the source, and the labels would be the same.
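The forward-reachability labeling can be sketched in a few lines outside Naiad. The following is plain Python illustrating the idea only (it is not the Naiad operators, and names like `propagate_labels` are our own): each node starts with its own id as its label, and smaller labels flow along directed edges until nothing changes.

```python
def propagate_labels(nodes, edges):
    """For each node, compute the least label (node id) that can reach it
    along directed edges, by iterating propagation to a fixed point."""
    label = {v: v for v in nodes}  # each node starts with its own id
    changed = True
    while changed:
        changed = False
        for (src, dst) in edges:
            if label[src] < label[dst]:  # a smaller label reaches dst
                label[dst] = label[src]
                changed = True
    return label

# On the directed cycle 0 -> 1 -> 2 -> 0 plus the extra edge 2 -> 3,
# label 0 reaches every node:
labels = propagate_labels([0, 1, 2, 3], [(0, 1), (1, 2), (2, 0), (2, 3)])
# labels == {0: 0, 1: 0, 2: 0, 3: 0}
```

Note that in this example the endpoints of the edge 2 -> 3 receive the *same* label even though node 3 is not in the cycle's SCC: one forward pass only gives evidence for removing edges, never a complete answer, which is why the algorithm below alternates forward and backward reachability.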

Our algorithm is thus: repeatedly perform forwards reachability queries and backwards reachability queries, removing edges whose endpoints receive different labels. If there are any edges between distinct SCCs we will discover at least one edge to remove, and we’ll never remove an edge within an SCC. If we iterate the process to fixed point, we should end up with just those edges within SCCs.

Let's write it out. The ConnectedComponents method is just the method produced in the previous post, though we'll run it on a not-necessarily-symmetric directed graph. To keep the code looking simple we use a bit of fakery, attaching labels to edges via Label1 and Label2 methods, which you can imagine returning a new LabeledEdge struct with populated label1 and label2 fields.

```csharp
// repeatedly removes edges between nodes in different SCCs.
Collection<Edge, T> SCC(this Collection<Edge, T> edges) where T : Lattice<T>
{
    return edges.FixedPoint(y => y.TrimAndTranspose()
                                  .TrimAndTranspose());
}

// edges labeled by reachability; edges with matching labels are kept.
Collection<Edge, T> TrimAndTranspose(this Collection<Edge, T> edges) where T : Lattice<T>
{
    // get labels from directed reachability.
    var labels = edges.ConnectedComponents();

    // look up labels, keep edges whose endpoints have the same label.
    // imagine methods Label1 and Label2 populate label fields in Edge.
    return edges.Join(labels, x => x.src, y => y.src, (x, y) => x.Label1(y))
                .Join(labels, x => x.dst, y => y.src, (x, y) => x.Label2(y))
                .Where(x => x.label1 == x.label2)
                .Select(x => new Edge(x.dst, x.src));
}
```
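To make the structure concrete, here is a plain-Python analogue of the same computation. This is a sketch of the algorithm only, under our own names (`propagate_labels`, `trim_and_transpose`, `scc_edges`); it has none of Naiad's dataflow or incremental machinery.

```python
def propagate_labels(nodes, edges):
    """Least label that can reach each node along directed edges."""
    label = {v: v for v in nodes}
    changed = True
    while changed:
        changed = False
        for (src, dst) in edges:
            if label[src] < label[dst]:
                label[dst] = label[src]
                changed = True
    return label

def trim_and_transpose(nodes, edges):
    """Drop edges whose endpoints get different labels, reverse the rest."""
    label = propagate_labels(nodes, edges)
    return [(dst, src) for (src, dst) in edges if label[src] == label[dst]]

def scc_edges(nodes, edges):
    """Iterate trim-and-transpose (twice per round, restoring edge
    direction) until the edge set stops changing; the surviving edges
    all lie within SCCs."""
    while True:
        trimmed = trim_and_transpose(nodes, trim_and_transpose(nodes, edges))
        if set(trimmed) == set(edges):
            return trimmed
        edges = trimmed

# The cycle 0 -> 1 -> 2 -> 0 is an SCC; the edge 2 -> 3 is between SCCs
# and gets trimmed by the backward (transposed) pass.
result = scc_edges([0, 1, 2, 3], [(0, 1), (1, 2), (2, 0), (2, 3)])
# result == [(0, 1), (1, 2), (2, 0)]
```

The outer `while True` loop plays the role of FixedPoint, and each call to `trim_and_transpose` runs a full reachability computation, which is exactly the doubly-nested fixed point described above.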

Just a caveat: the SCC code in the download is substantially messier than this. We went to some pains to make it go fast, and the simplest way to write the program isn't the fastest. We'll explain the differences in a future post on performance in Naiad.

The code we've written only produces the edges within SCCs, rather than the SCC label of each vertex. We could write a similar program as a fixed point on the labels collection; it just ends up being easier to explain (above) why the edge-removal process has a meaningful fixed point. For now let's just check out the performance when we fire it up.

As before, we're going to try out a random graph with 1M nodes, but since the edges are not doubled by symmetrization we'll use 2M edges rather than 1M. We are going to repeatedly re-wire single random edges (remove the edge, replace it with a new random edge), and measure the time each change takes. The initial SCC computation takes about 20 seconds (not a short time for SCC), but again we can update the computation in milliseconds.

```
00:00:19.6745379    Net edge changes within SCCs: 1269479
00:00:00.0181022    Net edge changes within SCCs: -1
00:00:00.0059060    Net edge changes within SCCs: -1
00:00:00.0097498    Net edge changes within SCCs: 2
00:00:00.0028533    Net edge changes within SCCs: 0
00:00:00.0059749    Net edge changes within SCCs: -1
00:00:00.0017057    Net edge changes within SCCs: 0
00:00:00.0048331    Net edge changes within SCCs: 0
00:00:00.0090422    Net edge changes within SCCs: 2
00:00:00.0032759    Net edge changes within SCCs: 0
00:00:00.0165632    Net edge changes within SCCs: -5
```

Well, it is just about as fast as for connected components. Despite a substantially more complicated program (including multiple invocations of the connected components code) Naiad still updates the computations pretty quickly. The secret ends up being that despite a more complicated dataflow graph, Naiad’s incremental re-computation doesn’t necessarily have to do that much more work; we certainly aren’t re-executing any of the connected components computations, just updating them.

We've swept some details under the rug in this post (incrementally updating doubly-nested fixed-point computations isn't easy stuff), but I hope we've started to pique your curiosity about how a system can manage all of these things for you, and provide solid incremental performance on such non-trivial computations.

Next we’ll start in on explaining a bit about how Naiad works, ranging from the underlying mathematics to its distributed implementation. Stay tuned!