Naiad on YARN and Azure HDInsight
One of the most commonly asked questions about Naiad is, “How on earth do you run it on a cluster?” When we first released Naiad, the only solution available was to grit your teeth and launch the distributed programs by hand, using scripts that were tailored to a particular cluster. In the meantime, Hadoop YARN has become a widely available framework for running data-processing applications on a cluster. This post describes how we ported the latest version of Naiad to run on top of Hadoop, and how this makes it easier to run Naiad programs on your data in Microsoft Azure.
At first, the idea of running Naiad on Hadoop might seem surprising. After all, we have spent several blog posts explaining why Naiad is ideal for low-latency data processing, whereas systems like Hadoop are geared more towards higher-latency batch processing. However, when Hadoop moved to version 2.0, it split the original monolithic MapReduce scheduler into two components: a generic Resource Manager that allocates cluster resources, and a per-application Application Master in which developers can replace the MapReduce scheduling logic with their own. Our colleague Michael Isard developed a simple Application Master (called Peloponnese) for running groups of persistent .NET processes, and we use it to launch the processes that make up a Naiad cluster. After the processes are launched, Naiad takes over and uses its own mechanisms for low-latency message exchange and coordination.
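To make that division of labor concrete, here is a minimal sketch in Python of what a launcher has to hand each process before it steps aside: a process ID and the addresses of every peer in the group. This is not Peloponnese's or Naiad's actual interface. The `LaunchSpec` type, the `plan_launch` and `command_line` helpers, the flag names, the port number, and the program name are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LaunchSpec:
    """What a launcher tells one worker process before stepping aside."""
    process_id: int
    host: str
    port: int
    peers: List[str]  # "host:port" of every process in the group, in ID order


def plan_launch(hosts: List[str], base_port: int = 2101) -> List[LaunchSpec]:
    """Assign one worker per host and give every worker the full peer list."""
    if not hosts:
        raise ValueError("need at least one host")
    peers = [f"{h}:{base_port}" for h in hosts]
    return [LaunchSpec(i, h, base_port, peers) for i, h in enumerate(hosts)]


def command_line(spec: LaunchSpec, program: str) -> List[str]:
    """Render a spec as an argv list; the flag names are made up for this sketch."""
    return [
        program,
        "--process-id", str(spec.process_id),
        "--listen", f"{spec.host}:{spec.port}",
        "--peers", ",".join(spec.peers),
    ]


if __name__ == "__main__":
    # Hypothetical three-node cluster and program name.
    for spec in plan_launch(["node-a", "node-b", "node-c"]):
        print(" ".join(command_line(spec, "MyNaiadProgram.exe")))
```

Once every process has been started with this information, the launcher has nothing further to do on the data path: the processes can connect to one another directly, which is why the scheduling framework's latency does not affect the running computation.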
Hadoop YARN is also gaining popularity as a Platform-as-a-Service cloud offering. The HDInsight service from Microsoft Azure now includes support for YARN in version 3.0 of the service. Our colleague Matt Winkler gave a talk at //build/ that covered some of the ways to use YARN in HDInsight. Naiad now includes a command-line tool for submitting Naiad programs to HDInsight, a support framework that makes it easier to get your data into and out of Azure Storage, and a step-by-step guide to getting started with Naiad on Azure.
We are also developing support for running Naiad on self-managed YARN clusters. The GitHub repository contains an experimental tool for deploying Naiad jobs to a YARN cluster, and we are refining the launch process and support libraries for this use case. If you have any difficulty using the tool, please raise an issue on our GitHub page.
Naiad on YARN represents one of the first end-to-end .NET solutions for implementing data analysis tasks on Hadoop. (DryadLINQ is another.) The ecosystem of tools for interoperating with other Hadoop jobs continues to grow — for instance, earlier this month the Azure team announced native .NET support for Avro serialization — and we are excited to contribute to this set of tools. Stay tuned for more announcements.
Derek Murray (@mrry) is a member of the Naiad team at Microsoft Research Silicon Valley.