We have recently been talking about Naiad’s low latency (see, for example, Derek’s presentation at SOSP). If you have ever tried to coax good performance out of a distributed system, you may be wondering exactly how we get coordination latencies of less than 1ms across 64 computers with 8 parallel workers per machine. In this series of posts we’re going to reveal to our curious – and possibly skeptical – users what we did in gory detail.
One of the most significant sources of performance issues is the network. Here, I will explain how we configured our cluster and TCP settings to deal with the problems we encountered, and also try to give you some insight as to why we had these problems in the first place.
Our test cluster consists of 64 dual quad-core 2.1 GHz AMD Opteron processors, each with 16GB of memory and an Nvidia NForce Gigabit Ethernet NIC. The machines are running Windows Server 2012 and are placed on two racks, 32 on each rack. Each rack has a Blade Networks G8052 switch, with a 40Gbps uplink to the core switch (a Blade Networks G8124) shared by the rest of the data center. There is no interference from non-Naiad traffic.
Naiad establishes all-to-all TCP connections between the cluster machines, so that each computer handles 63 simultaneous connections. The software is structured so there are two threads dedicated to each TCP connection, one for sending, the other for receiving.
Network traffic patterns
Every worker in Naiad executes the entire program over some partition of the data. For more description of how this works, I urge you to read an earlier post. Since programs are represented as dataflow graphs, and there can be synchronization between stages of the graph, it is typically the case that each worker executes a vertex from the same stage at roughly the same time (modulo skew in the data, stragglers, etc.). In between stages the workers will often, but not always, exchange a bunch of data records over the network at roughly the same time.
In addition to high-volume data exchange, the computers involved in a Naiad program also participate in a distributed coordination protocol (see the “progress protocol” in the paper for details). If coordination messages are not sent and received promptly, then actual progress is slowed down. For example, a batching operator like the Min vertex not only needs to receive all of the input data records before it can run, but it also needs to receive the information that those records have all arrived.
Thus, we have two types of traffic with conflicting requirements: high throughput and low latency. And not only that, because the data exchanges between workers tend to happen almost simultaneously, we also potentially see transient congestion in the network and/or an incast problem, where buffers at the switch or at the receiver’s NIC can’t absorb the spike in traffic volume and overflow, dropping packets.
To summarize, the entire Naiad program will temporarily stall when either a data exchange is slow to complete, or progress protocol messages are delayed, between any pair of machines. These pauses are examples of what we term micro-stragglers, and they can collectively add substantial overhead to the running time of the program. Micro-stragglers have numerous causes throughout the software stack, but in the network they are primarily produced by packet drops at the switch or receiver, as well as congestion avoidance and back-off mechanisms.
Configuring switches and NICs
To try and avoid network congestion at the physical layer, we enabled Ethernet flow control on the rack switches and configured the host NICs to use large receive buffers. Despite this, we still saw a non-trivial amount of packet loss at the receivers, most likely because of the incast problem that is inherent in Naiad’s traffic pattern. The reason increasing the receive buffers didn’t alleviate the packet loss is because TCP is designed to keep the buffers full, so any sudden increase in load to a single destination will inevitably overwhelm the receiver. Therefore we decided to tackle the problem using TCP’s end-to-end congestion control as described below.
We also took some care to constrain the CPU costs of network processing. We enabled TCP offload on the NICs, and configured Receive Side Scaling (RSS) to balance the load across multiple cores. In practice, RSS will be more important at 10Gbps than at the 1Gbps of our network. It is worth noting that since we published the SOSP paper, some new RSS options for low latency scenarios have been introduced on Windows Server 2012, which could possibly have some benefit for Naiad.
Nagle’s algorithm for TCP reduces per-packet overheads by coalescing small payloads: if there is an unacknowledged segment in flight, TCP will wait until it has enough data to send a full-sized packet. Essentially, Nagle’s algorithm increases throughput at the expense of latency for small packets, which can be made worse by the well-known poor interaction with delayed acknowledgements. In Naiad, where small packets are typically involved in the progress protocol, this is exactly the wrong trade-off and so we disable Nagling with the TCP_NODELAY socket option.
The minimum retransmit timeout (MinRto) is another important TCP setting. Round trip times in our cluster are on the order of tens of microseconds, but the default TCP configuration for Windows Server 2012 sets the minimum timeout to 300ms – orders of magnitude larger! This doesn’t matter much when lots of segments are in flight and packet loss can be detected by the continuous stream of acknowledgements. The problem for Naiad is when the message only comprises a single packet, as is usually the case for progress protocol notifications, and that single packet is lost. Since the protocol is highly delay sensitive, it’s critical to set the MinRto and the delayed acknowledgement timer to their minimum values of 20ms and 10ms respectively, which we do using the following PowerShell script:
# Read the list of machines into a string array $mcs = Get-Content .\cluster.lst # Make a new session $cluster = New-CimSession $mcs # Set Rto to 20ms on every machine Set-NetTCPSetting -CimSession $cluster -MinRtoMs 20 -InitialRtoMs 300 –DelayedAckTimeoutMs 10 # Confine effects to TCP port 2666 New-NetTransportFilter -CimSession $cluster -RemotePortStart 2666 -RemotePortEnd 2666 -LocalPortStart 0 -LocalPortEnd 65535
The script also sets the initial retransmit time to the minimum allowed value of 300ms, which helps the Naiad job to set up its all-to-all mesh of TCP connections faster. One unexpected problem we ran into as we scaled up to use more machines was difficulty in establishing these connections. Apparently the sudden onslaught of connection requests was triggering Memory Pressure Protection (MPP), which protects against TCP denial of service attacks by dropping SYN packets and closing existing connections. The feature can be disabled in the PowerShell script above using the option –MemoryPressureProtection Disabled.
What about a modern congestion control technique?
Ideally we would also use Data Center TCP (DCTCP) for congestion control, but that requires Explicit Congestion Notification (ECN), which our rack switches don’t support. DCTCP uses ECN packet marks to indicate the extent of congestion, which allows the sender to react by reducing its TCP window size in proportion to the fraction of packets that are marked. As a result, packet loss is not needed to signal congestion, as in regular TCP, which leads to shorter queues and better end-to-end latency. Since low latency is one of our requirements, DCTCP should be beneficial – we plan to try it out if we can find a large enough cluster with ECN-capable switches.
Decoupling control and data
It is tempting to decouple the control plane traffic requiring low latency (i.e. the progress protocol) from the data plane traffic requiring high throughput. We initially tried this by adding a high priority queue for outgoing protocol messages, but there was a catch: at the time, it was not safe for the progress protocol messages to overtake the data to which they pertain, so the planes could not operate completely independently without compromising the integrity of the computation. With the current version of the progress protocol this safety concern would no longer apply, but eagerly sending progress messages removes the opportunity for an optimization that reduces the volume of protocol traffic and the approach was abandoned. If you are interested in more detail on the safety properties of the progress protocol, the paper published at the 2013 Conference on Formal Techniques for Distributed Systems presents a formal specification. Our SOSP 2013 paper describes some of the progress protocol optimizations and their impact on overall performance.
Although TCP offers the in-order, reliable transmission that we need, in many respects its throughput-oriented mechanisms for controlling congestion and achieving high utilization are inappropriate for Naiad. In particular, short progress protocol messages require timely and reliable delivery, but will never contribute significantly to congestion nor stress network capacity. Therefore, we added a command-line option to optimistically send progress protocol messages over multicast UDP and then to transmit them again over unicast TCP. The first transmission is unreliable, but may arrive sooner since sending a single multicast UDP datagram is very cheap compared to sending a message on each of 64 TCP connections.
Decoupling control and data in this way is yet another incarnation of the familiar bandwidth-latency tradeoff. We saw tangible improvements in running time for some programs, but not others, and it remains future work to systematically tease out the circumstances under which decoupling is the best option.
A note on how we detect and debug network issues is warranted. Many of the pathological behaviors are triggered by the specific characteristics of Naiad jobs, and micro-benchmarks do not expose the problem. Therefore we need a visualization of how the entire system is executing, from packets on the wire right through to the causal relationships between the progress protocol and vertex execution in the Naiad program itself.
Fortunately, Windows ships with a high-performance, low-overhead tracing system called Event Tracing for Windows (ETW). It is straightforward to post events from your own code, and almost every product that Microsoft ships is extensively instrumented already, including most OS components and services, and the .NET runtime. Out-of-the-box tools, downloadable from MSDN, such as Windows Performance Analyzer (WPA) and PerfView can interpret an ETW trace in useful ways.
In diagnosing and debugging performance we used a variety of publicly available tools and we also wrote our own that provide the detailed visualizations of execution that we need for Naiad. This gave us a very powerful suite of tools for performance debugging, which I will write about in a future blog post.
In the end, we were able to achieve pretty satisfactory performance on our cluster. Here is a plot from the SOSP paper showing the latency distributions for a global coordination micro-benchmark for up to 512 worker threads on 64 computers. Note the impact of micro-stragglers revealed in the 95th percentile values as the cluster size increases.
Although good, Naiad performance is by no means a solved problem, essentially because we haven’t been able to completely eliminate loss. If we are unlucky, TCP’s exponential back-off can result in 5s stalls, which is clearly way off the chart in terms of the latencies that we aspire to. If we are really unlucky, we see occasional TCP connection resets (RSTs) caused by the maximum number of retransmits being exceeded and as a result the entire job fails. In ongoing work we are moving to high-performance, reliable networking technology like 40Gb Ethernet, and using the Winsock RIO (Registered Input/Output) API, DCTCP, RoCE (RDMA over Converged Ethernet), Data Center Bridging etc.
I have not described in this post all of the things we tried and discarded. Some of the configuration options that sounded promising, for instance, setting TCP’s congestion window restart option, appeared to have little impact, but we do not have a good explanation as to why. It is most likely that any improvements were dominated by some other effect, but it is impractical to systematically explore the space of options manually, especially when the main focus is to develop a performant system.
Fortunately, this leads to some fun opportunities for future research. To start with we could make the configuration task easier with better diagnostics for the impact of, and interactions between, different options. A more advanced research question that arises from all this gore is whether it is possible to automate the end-to-end configuration for a particular network, TCP stack, and traffic pattern. There was an interesting paper at Sigcomm this year describing a program that automatically generates the TCP congestion control algorithm, given a specification of the network and an objective. Could we go further and automatically tune the network to the traffic in order to get the best bandwidth-latency tradeoff?
Making Naiad perform well involved much more than just the network and in future posts we will cover other parts of the system. Coming next: how we optimized synchronization between I/O and worker threads on individual machines.
Regular readers of this blog might notice that the Naiad team is rather fond of defining new kinds of dataflow. Just yesterday, we did it again with a paper that defines “timely dataflow”. In this post, I’m going to introduce timely dataflow at a high level, and follow-up posts will talk more about the distributed Naiad system that we have built on top of it. Read more…
Way back at the beginning of time, we had a post on performing PageRank on a 1.5B edge graph of who-follows-who on Twitter. We talked a bit about how several big data systems don’t do quite as well as a 40 line, single-threaded C# program. There was also a promise to show how to make things go much, much faster. So we’re going to do that today.
am going to be presenting presented our work on Naiad at the GraphLab workshop, this afternoon at 3:10pm. I’ll talk about the trade-offs between data-parallel and graph-parallel systems, and show how Naiad delivers the best of both worlds.
In this post we will cover a programming pattern frequently thought to be incompatible with data-parallel systems: sequentially updating a collection of values each as a function of the most recent values. In C#, this would look something like:
// sequentially updates array of values for (int i = 0; i < values.Length; i++) values[i] = updateBasedOn(values);
This type of iterative process is at the heart of several graph algorithms, from efficient pageranking, to graph coloring, to general loopy belief propagation. Although the logic looks very sequential, we will see that it can be made very parallel in the context of graph processing.