
What counts as big data?

October 9, 2012

When we thought about starting an SVC Big Data blog, we decided we’d like it to be a place that would stir up a bunch of thought, challenge conventional wisdom, and raise your expectations about big data research. In that spirit we’re starting out with a provocative post meant to give a taste of what we’ll talk about in the blog: interesting and often surprising conclusions about how big data is currently done, and how it might be done differently.

One very popular use of big data systems is PageRanking web-scale graphs using clusters of machines. GraphChi, one of the papers at OSDI 2012, demonstrates how you can perform many big data processing tasks on “just a PC”. Many distributed big data systems carry a lot of overhead, and GraphChi can often get pretty close to cluster performance without the cluster. In the specific example we’ll follow in this post, GraphChi performs one PageRank iteration in 156 seconds on a 1.5B-edge graph crawled from Twitter, whereas Spark reportedly takes about 90 seconds per iteration on the same graph, using a cluster of 50 machines.

So here is a question very rarely asked in these big data papers: “How long should it take?” As in, if you didn’t have all these problems of data being so “big”, how fast would a vanilla computer program run? As it turns out, PageRank on the Twitter graph is a great case study, because all the necessary state (a double or two for each vertex) fits happily in main memory on my laptop and the graph is just 6GB on my SSD.
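
To put numbers on that: at one 32-bit integer per edge (the format the code below uses), 1.5B edges × 4 bytes ≈ 6GB, which is exactly the file sitting on my SSD. The per-vertex state is a couple of double vectors, which for a graph with a few tens of millions of vertices comes to a few hundred megabytes, comfortably in RAM.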

Here is a really simple vanilla C# method to apply one step of PageRank. The graph is written on disk in binary as a sequence of records: a vertex id, its out-degree, and the list of its neighbors, all as 32-bit integers. The code just streams the records off disk and applies the corresponding additions to the result vector.
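
For concreteness, each record on disk looks like this:

   [vertex][degree][neighbor 1][neighbor 2] ... [neighbor degree]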

// propagates ranks from vector b into vector a using graph in filename.
// (needs "using System;" and "using System.IO;" at the top of the file)
void PageRankStep(string filename, double[] a, double[] b, double reset)
{
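   // note: for simplicity this assumes every Read() call returns the full
   // requested byte count; buffered sequential FileStream reads generally
   // behave that way, but robust code would loop until all bytes arrive.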
   var stopwatch = System.Diagnostics.Stopwatch.StartNew();

   var bytes = new byte[4 * a.Length];     // temp storage for bytes
   var ints = new int[a.Length];           // temp storage for ints

   for (int i = 0; i < a.Length; i++)      // initialize accumulations
      a[i] = reset;

   using (var reader = File.OpenRead(filename))
   {
      var bytesToRead = reader.Length;

      while (bytesToRead > 0)
      {
         // read vertex name and degree
         bytesToRead -= reader.Read(bytes, 0, 8);
         Buffer.BlockCopy(bytes, 0, ints, 0, 8);

         var vertex = ints[0];
         var degree = ints[1];

         // read names of neighbors
         bytesToRead -= reader.Read(bytes, 0, 4 * degree);
         Buffer.BlockCopy(bytes, 0, ints, 0, 4 * degree);

         // update each neighbor's rank
         var update = (1.0 - reset) * b[vertex] / degree;
         for (int i = 0; i < degree; i++)
            a[ints[i]] += update;
      }
   }

   Console.WriteLine(stopwatch.Elapsed);
}
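
For completeness, here is a minimal sketch of how one might drive this method. The vertex count, iteration count, and 0.15 reset value are illustrative choices of mine, not details from the measurements below:

// hypothetical driver: runs PageRank iterations, alternating rank vectors.
void PageRank(string filename, int vertexCount, int iterations)
{
   var a = new double[vertexCount];
   var b = new double[vertexCount];

   for (int i = 0; i < vertexCount; i++)      // uniform starting ranks
      b[i] = 1.0 / vertexCount;

   for (int iter = 0; iter < iterations; iter++)
   {
      PageRankStep(filename, a, b, 0.15);     // ranks flow from b into a
      var temp = a; a = b; b = temp;          // swap: b holds the newest ranks
   }
}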

If I run this code on the Twitter graph on my 2010 laptop, one PageRank iteration takes just 64 seconds. That’s compared to 156 seconds for GraphChi, and 90 seconds for Spark on a cluster of 50 machines.

These systems aren’t slower because they are bad, or because I’m particularly great at programming (see above); they just make performance compromises to provide generality we don’t need for this problem. A modern computer (my two-year-old laptop) is able to pull data off an SSD at 100MB/s and update the corresponding 25M memory locations each second, without requiring any fancy level of system support. You can make it go faster if you want (we’ll get to that in an upcoming post), but you get pretty solid performance without even trying.
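
To see where the 64 seconds comes from, a quick back-of-envelope check using the 4-bytes-per-edge format above:

   100MB/s ÷ 4 bytes per edge ≈ 25M edge updates per second
   6GB ÷ 100MB/s ≈ 60 seconds per iteration

which lines up nicely with the measured 64 seconds.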

To be clear, we aren’t here to take shots at GraphChi; it just happens to be recent. This same issue crops up with lots of distributed systems, including DryadLINQ, a system near and dear to my heart. A recent paper compared DryadLINQ with several other systems on a variety of tasks, one of which was PageRanking the ClueWeb category B graph. DryadLINQ takes about 45 seconds per iteration on 16 machines, whereas my laptop takes about 35 seconds.

This isn’t to say that scalable distributed compute systems aren’t useful, only that we need to pay a bit more attention to what they are actually useful for. Running PageRank on medium-sized graphs is probably not one of those things. However, there are a lot of things that existing systems can do that I can’t do on my laptop in just a few lines of C#. The first three I could think of are: scaling to terabytes of input data, scheduling and executing multi-stage dataflow graphs, and automatically extracting performance optimizations by tricking me into programming declaratively. It is great to have these done for me, and it is delightful to use these systems on the right problems.

In fact, there are a lot of other new and cool things to do in the big data space, and we are going to tell you about the ones we are currently doing. We’ll even come back to the question of performance, and how to beat the pants off those 40 lines of C#.
