# Connected Components in Naiad

In this post we are going to look at a writing a Naiad program that is very different from what other “big data” systems are capable of. Specifically, we’ll develop a program to compute *and incrementally maintain* the connected component structure of a graph (sets of vertices that can reach one another through a sequence of edges). Unlike other systems out there, we’ll do this for arbitrary changes to the input graph, not just edge additions. The code follows NaiadExamples/ConnectedComponents.cs in the Naiad download.

Roughly sketched, our algorithm will maintain a labeling of the graph nodes as a collection of integer pairs: (node name, node label). Initially each node is labeled with its own name, and we then iteratively improve the labels, re-labeling each node by the least label in its immediate neighborhood. If we iterate this process sufficiently, eventually each node will be labeled by the least node name in its component. Two nodes with the same label are in the same component, and nodes in different components will not have the same label.

Let’s start by defining the computation that updates a labeling to one where each node is assigned the least label amongst its immediate neighbors. We’ll just write this as a standard C# method, taking a collection of labels and returning a collection of labels.

// improves a set of labels to the min among neighbors defined by edges. Collection<IntPair,T> ImproveLabels<T>(Collection<IntPair,T> edges, Collection<IntPair,T> labels) where T : Lattice<T> { return edges.Join(labels, edge => edge.src, label => label.src, (edge,label) => new IntPair(edge.dst, label.lbl)) .Concat(labels) .Min(label => label.src, label => label.lbl); }

Some of this should look familiar, but some of it might look very weird. For the first time ever we are seeing the actual *type* of a Naiad Collection. In addition to depending on the datatype (we are using IntPair both for edges and labels) there is this weird generic type “T”, which apparently needs to implement some lattice interface. We haven’t talked about this before, and although this is part of what makes Naiad new and wonderful, we are going to need to ignore it for the moment.

The rest of the code may seem more comfortable. For a given collection of labels, we Join with the set of edges to produces label “messages” destined for neighbors. These messages are collected with the initial labels, and each node then selects the minimum label from those bearing its name. The Min operator is new, but it is a lot like Count: it takes first a key selector function (groups within which to take a min) and then a value selector. It actually produces the set of “argmins”, the records responsible for the minimum values.

The ImproveLabels method is great for improving the labels once, but we want to do it many times. In fact, we don’t know how many times we’ll have to apply it, and would really like to apply it as many times as required. Enter the new Naiad method, “FixedPoint”:

// applies ImproveLabels until the result stabilizes. labels = labels.FixedPoint(x => ImproveLabels(x, edges.ExtendTime()));

Logically, FixedPoint starts from its source data (labels) and repeatedly applies the specified loop body (the ExtendTime method brings a collection from outside a loop into the loop, related to the mysterious T lattice, but doesn’t change the data at all).

In fact, Naiad does not just apply the loop body over and over. We saw in the previous examples that Naiad is capable of responding to small changes in its inputs with small changes it its outputs, relatively quickly. FixedPoint exploits this by feeding its output changes back into its input; it keeps running until there are no more changes, at which point it has (by definition) reached a fixed point. Each time around the loop, Naiad only has to do work for the records that have changed. In this case, only when a node’s label improves does Naiad notify its neighbors and cause them to reconsider their labels. As a result, Naiad can execute this sort of computation much more efficiently than non-incremental systems.

Now, if you have been around a bit (or are very perceptive) you might worry that there is a bit of a cheat here. Having done all that label propagation by pushing differences around, what happens when I try and remove an edge from the input? Labels have already propagated along that edge, and it doesn’t seem easy to take them back (who even remembers what they were?). As it turns out, Naiad does, and this is its main difference from other systems.

If you are interested in how this actually works, we’ll have details on the math in an upcoming post. For the moment, let’s just see what happens when we run the connected components code example released with Naiad. The example loads up a modest random graph (1M nodes, 1M edges) and runs the connected components computation on it. Once it returns with the answer we start removing random input edges, supplying negative edges as inputs. While the initial computation takes about 5 seconds, each edge deletion takes just a few milliseconds:

Time to process: 00:00:05.0246803 Time to process: 00:00:00.0047145 Time to process: 00:00:00.0013096 Time to process: 00:00:00.0036108 Time to process: 00:00:00.0003741 Time to process: 00:00:00.0006532 Time to process: 00:00:00.0006444 Time to process: 00:00:00.0004846 Time to process: 00:00:00.0008126 Time to process: 00:00:00.0006309 Time to process: 00:00:00.0058109 ...

Some durations are bigger than others, because there is occasionally actual non-trivial work to do to repair the computation (labels get pulled back, and new labels need to take their place). Often, there isn’t very much to do, and Naiad only does the work that needs to be done. Exactly what that work is remains unexplained at this point, but we’ll get in to that.

To the best of our knowledge, incrementally updating iterative computations without requiring “logical monotonicity” (collections can either grow or shrink, but not both) is brand new. But this specific example is only the tip of the iceberg. We’ll see even more interesting computations in the posts to come, and try to explain what is going on that makes it possible for Naiad to update iterative computations so quickly. Stay tuned!