Last week I attended the USENIX Annual Technical Conference and its affiliated workshops, all here in sunny Boston, Massachusetts. Here are my takeaways from the conference:
- Have increases in processing speed and resources on individual computer systems changed the way that parallelizable problems should be split in distributed systems? In a paper on the “seven deadly sins of cloud computing research”, Schwarzkopf et al. note (for sin #1) that “[even] If we satisfy ourselves that parallel processing is indeed necessary or beneficial, it is also worth considering whether distribution over multiple machines is required. As Rowstron et al. recently pointed out, the rapid increase in RAM available in a single machine combined with large numbers of CPU cores per machine can make it economical and worthwhile to exploit local, rather than distributed, parallelism.” The authors acknowledge that there is an advantage to distributed computation, but they note that “Modern many-core machines…can easily apply 48 or more CPUs to processing a 100+GB dataset entirely in memory, which already covers many practical use cases.” I enjoyed this observation and hope it inspires (additional) research on adaptive systems that can automatically partition a large cloud-oriented workload into a “local” component, sized appropriately for the resources on an individual node, and a “distributed” component for parallelization beyond locally-available resources. (A toy sketch of the local-fit check appears after this list.) Would it be worth pursuing the development of such an adaptive system, especially in the face of unknown workloads and heterogeneous resource availability on different nodes—or is the problem intractable relative to the benefits of local multicore processing?
- Just because you have a bunch of cores doesn’t mean you should try to utilize all of them all of the time. Lozi et al. identify two locking-related problems faced by multithreaded applications: First, when locks are heavily contended (many threads trying to acquire a single lock), overall performance suffers; second, each thread incurs cache misses when executing the critical section protected by the lock. Their interesting solution involves pinning the critical section onto a dedicated core that does nothing but run that critical section. (The authors cite related work that does similar pinning of critical sections onto dedicated hardware; I found Section 1 of Lozi’s paper especially useful for its description of related work.) The authors further created a tool that modifies the C code of legacy applications “to replace lock acquisitions by optimized remote procedure calls to a dedicated server core.” (A toy pthreads sketch of this run-it-on-a-server-core pattern appears after this list.) One of my favorite technical/research questions over the past decade is, given ever-increasing numbers of cores, “what will we do with all these cores?”—and I’ve truly enjoyed that the answer so often turns out to be that using cores very inefficiently is the best thing you can do for overall application/system performance. Ananthanarayanan et al. presented a similar inefficiency argument for small clustered jobs executing in the cloud: “Building on the observation that clusters are underutilized, we take speculation to its logical extreme—run full clones of jobs to mitigate the effect of outliers.” One dataset the authors studied had “outlier tasks that are 12 times slower than that job’s median task”; their simulation results showed a 47% improvement in completion time for small jobs at a cost of just 3% additional resources.
- There are still (always?) plenty of opportunities for low-level optimization. Luigi Rizzo presented best-paper-award-winning work on eliminating OS and driver overheads in order to send and receive packets at true 10 Gbps wire speed. I love figures like the one shown above this paragraph, where a system both (a) blows away current approaches and (b) clearly maxes out the available resources with a minimum of fuss. I found Figure 2 from this paper especially interesting: The author measured the path and execution times for a network transmit all the way down from the sendto() system call to the ixgbe_xmit() function inside the network card driver, then used these measurements to determine where to focus his optimization efforts—eliminating three per-packet costs: “per-packet dynamic memory allocations, removed by preallocating resources; system call overheads, amortized over large batches; and memory copies, eliminated by sharing buffers and metadata between kernel and userspace, while still protecting access to device registers and other kernel memory areas.” (A small illustration of the syscall-batching idea, using ordinary UDP sockets rather than Rizzo’s machinery, appears after this list.) None of these techniques is independently novel, but the author claims novelty in that his approach “is tightly integrated with existing operating system primitives, not tied to specific hardware, and easy to use and maintain.” Given his previous success creating dummynet (now part of FreeBSD), I am curious to see whether his modified architecture makes its way into the official FreeBSD and Linux codebases.
- Capturing packets is apparently difficult, despite the years of practice we’ve had in doing so. Two papers addressed aspects of efficiency and efficacy for packet capture: Taylor et al. address the problem that “modern enterprise networks can easily produce terabytes of packet-level data each day, which makes efficient analysis of payload information difficult or impossible even in the best circumstances.” Taylor’s solution involves aggregating and storing application-level information (DNS and HTTP session information) instead of storing the raw packets and later post-processing them. Section 5 of the paper presents an interesting case study of how the authors used their tool to identify potentially compromised hosts on the University of North Carolina at Chapel Hill’s computer network. Papadogiannakis et al. assert that “intrusion detection systems are susceptible to overloads, which can be induced by traffic spikes or algorithmic singularities triggered by carefully crafted malicious packets” and designed a packet pre-processing system that “gracefully responds to overload conditions by storing selected packets in secondary storage for later processing”.
- /bin/true can fail! Miller et al. described their system-call-wrapping software that introduces “gremlins” as part of automated and deterministic software testing. A gremlin causes system calls to return legitimate but unexpected responses, such as having the read() system call return only one byte at a time, forcing the caller to call read() again and again. (The authors note that “This may happen if an interrupt occurs or if a slow device does not have all requested data immediately available.”) During the authors’ testing they discovered that “the Linux dynamic loader [glibc] failed if it could not read an executable or shared library’s ELF header in one read().” As a result, any program requiring dynamic libraries to load—including, apparently, /bin/true—fails whenever system calls are wrapped in this manner. (A defensive read loop that tolerates such short reads is sketched after this list.) This fascinating result calls to mind Postel’s Law.
- Switching context to and from the hypervisor is still (somewhat) expensive. The table above shows the “exit latency” for hardware virtualization processor extensions, which I believe Agesen et al. measured as the round-trip time to trap from guest mode to the virtual machine monitor (VMM) and then return immediately to the guest. I was surprised that the numbers are still so high for recent-generation architectures. Worse, the authors verbally speculated that, given the plateau with the Westmere and Sandy Bridge architectures, we may not see much further reduction in exit latency for future processor generations. To address this high latency, Agesen described a scheme where the VMM attempts to find clusters of instructions-that-will-exit and handles each cluster collectively using only a single exit (instead of returning to guest mode after each handled instruction). One example of such a cluster: “Most operating systems, including Windows and Linux, [update] 64 bit PTEs using two back-to-back 32 bit writes to memory…. This results in two costly exits”; Agesen’s scheme collapses these into a single exit. (A toy illustration of that two-write pattern appears after this list.) The authors are also able to handle complicated cases such as control-flow logic between clustered instructions that exit. I like a lot of work on virtual machines, but I especially liked this work for its elegance and immediate practicality.
- JLG’s favorite work: We finally have a usable interactive SSH terminal for cellular or other high-latency Internet connections. Winstein et al. presented long-overdue work on making it easy to interact with remote computers using a cell phone. The authors’ key innovation was to have the client (on the phone) do local echo, so that you see what you typed without having to wait 4 seconds (or 15 seconds or more) for the dang network to echo your characters back to you. The authors further defined a communications protocol that preserves session state even over changes in the network (IP) address, meaning that your application will no longer freeze when your phone switches from a WiFi network to a cellular network. Speaking metaphorically, this work is akin to diving into a pool of cool refreshing aloe cream after years of wading through hot and humid mosquito-laden swamps. (As you can tell, I find high-latency networks troublesome and annoying.) The authors themselves were surprised at how much interest the community has shown in their work: “Mosh is free software, available from http://mosh.mit.edu. It was downloaded more than 15,000 times in the first week of its release.”
- Several papers addressed the cost savings available to clients willing to micromanage their cloud computing allocations or reservations; dynamic reduction of resources is important too. Among the more memorable: Ou et al. describe how to use microbenchmarks and application benchmarks to identify hardware heterogeneity (and the resulting performance variation) within the same instance types in the Amazon Elastic Compute Cloud. “By selecting better-performing instances to complete the same task, end-users of Amazon EC2 platform can achieve up to 30% cost saving.” I also learned from Ou’s co-author Prof. Antti Ylä-Jääski that it is permissible to chain umlauted letters; however, I’ve been trying to work out exactly how to pronounce them. Zhu et al. challenge the assumption that more cache is always better; they show that “given the skewed popularity distribution for data accesses, significant cost savings can be obtained by scaling the caching tier under dynamic load patterns”—for example, a “4x drop in load can result in 90% savings” in the amount of cache you need to provision to handle the reduced load. (A back-of-the-envelope Zipf calculation after this list shows just how skewed such access distributions tend to be.)
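A few toy sketches to make some of the ideas above concrete. First, the local-versus-distributed question from the opening bullet: below is a minimal sketch, assuming a hypothetical job launcher, of the kind of “does this even fit on one node?” check an adaptive partitioner would start from. The sysconf-based probing and the 20% headroom factor are my own illustrative choices, not anything proposed by Schwarzkopf et al.

```c
/* Hypothetical heuristic: process locally when the dataset fits comfortably
 * in this machine's RAM and there are cores to spare; otherwise distribute.
 * The 20% headroom and the decision rule itself are illustrative only. */
#include <stdio.h>
#include <unistd.h>

static int should_run_locally(unsigned long long dataset_bytes)
{
    long pages     = sysconf(_SC_PHYS_PAGES);       /* Linux/glibc-specific */
    long page_size = sysconf(_SC_PAGE_SIZE);
    long cores     = sysconf(_SC_NPROCESSORS_ONLN);

    unsigned long long ram_bytes =
        (unsigned long long)pages * (unsigned long long)page_size;

    /* Leave ~20% of RAM for the OS and the application's own working state. */
    return cores > 1 && dataset_bytes < (ram_bytes / 10) * 8;
}

int main(void)
{
    unsigned long long dataset = 100ULL << 30;      /* a 100+GB dataset */

    if (should_run_locally(dataset))
        printf("process locally across all cores\n");
    else
        printf("partition and distribute across the cluster\n");
    return 0;
}
```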
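For the locking bullet, the flavor of “replace lock acquisitions by optimized remote procedure calls to a dedicated server core” can be sketched with plain pthreads: the client never takes a lock at all; it posts the critical section (a function pointer plus argument) to a server thread pinned to one core and spins until that server has executed it. This is a single-client, single-slot toy in the spirit of Lozi et al.’s idea, not their actual tool or runtime.

```c
/* Toy sketch of executing critical sections on a dedicated server core,
 * in the spirit of Lozi et al.'s remote core locking (not their actual code).
 * One request slot and one client thread; the real system serves many. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

struct request {
    void (*fn)(void *);            /* critical section to execute */
    void *arg;
    atomic_int pending;            /* 1 = posted, 0 = completed   */
};

static struct request slot;
static atomic_int stop;
static long counter;               /* state owned by the server core */

static void *server(void *unused)
{
    (void)unused;
    /* Pin the server to core 0 so the protected data stays hot in its cache. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    while (!atomic_load(&stop)) {
        if (atomic_load(&slot.pending)) {
            slot.fn(slot.arg);                 /* run the critical section */
            atomic_store(&slot.pending, 0);    /* signal completion        */
        }
    }
    return NULL;
}

static void increment(void *arg) { counter += *(long *)arg; }

/* The "remote procedure call" that replaces a lock acquisition. */
static void remote_execute(void (*fn)(void *), void *arg)
{
    slot.fn  = fn;
    slot.arg = arg;
    atomic_store(&slot.pending, 1);
    while (atomic_load(&slot.pending))         /* spin until the server is done */
        ;
}

int main(void)
{
    pthread_t srv;
    pthread_create(&srv, NULL, server, NULL);

    long delta = 1;
    for (int i = 0; i < 100000; i++)
        remote_execute(increment, &delta);

    atomic_store(&stop, 1);
    pthread_join(srv, NULL);
    printf("counter = %ld\n", counter);        /* 100000 */
    return 0;
}
```

The point of the pattern is that the lock-protected data never leaves the server core’s cache, which is exactly the cache-miss cost Lozi et al. set out to remove.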
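For the wire-speed bullet, one of the three costs Rizzo attacks, per-packet system call overhead, can be illustrated even with ordinary sockets: Linux’s sendmmsg() pushes a whole batch of UDP packets into the kernel in a single crossing. This is only an analogy for the batching idea (assuming a receiver listening on localhost port 9000); it has nothing to do with netmap’s shared rings or preallocated buffers.

```c
/* Amortizing system-call overhead over a batch of packets with sendmmsg();
 * ordinary UDP sockets, shown only to illustrate batching -- not netmap. */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

#define BATCH 64

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in dst = { 0 };
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(9000);               /* example destination */
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

    char payload[BATCH][64];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];
    memset(msgs, 0, sizeof(msgs));

    for (int i = 0; i < BATCH; i++) {
        snprintf(payload[i], sizeof(payload[i]), "packet %d", i);
        iov[i].iov_base = payload[i];
        iov[i].iov_len  = strlen(payload[i]);
        msgs[i].msg_hdr.msg_name    = &dst;
        msgs[i].msg_hdr.msg_namelen = sizeof(dst);
        msgs[i].msg_hdr.msg_iov     = &iov[i];
        msgs[i].msg_hdr.msg_iovlen  = 1;
    }

    /* One kernel crossing for 64 packets instead of 64 sendto() calls. */
    int sent = sendmmsg(fd, msgs, BATCH, 0);
    if (sent < 0) { perror("sendmmsg"); return 1; }
    printf("sent %d packets in one system call\n", sent);
    return 0;
}
```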
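For the gremlins bullet, the glibc failure is an instance of assuming that read() delivers everything in one call. The defensive pattern the gremlin wrapper punishes programs for omitting is a loop over short reads and interrupted calls; a minimal sketch:

```c
/* Defensive read loop: keeps calling read() until the requested number of
 * bytes has arrived, tolerating the short reads and interrupts that the
 * "gremlin" wrapper deliberately injects. */
#include <errno.h>
#include <unistd.h>

ssize_t read_fully(int fd, void *buf, size_t count)
{
    size_t done = 0;
    while (done < count) {
        ssize_t n = read(fd, (char *)buf + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted: just retry          */
            return -1;              /* real error                       */
        }
        if (n == 0)
            break;                  /* end of file before count bytes   */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

int main(void)
{
    char header[16];
    /* Still gets all 16 bytes even if they trickle in one at a time. */
    ssize_t got = read_fully(STDIN_FILENO, header, sizeof(header));
    return got < 0;
}
```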
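For the virtualization bullet, the PTE example is easy to picture: a 32-bit guest kernel updates a 64-bit page-table entry with two back-to-back 32-bit stores, and without Agesen’s clustering each store is its own costly exit. The toy below merely contrasts the two store patterns on an ordinary variable (little-endian layout assumed); it does not touch real page tables or a hypervisor.

```c
/* Toy contrast of the access patterns Agesen described: a 64-bit "PTE"
 * updated either by two back-to-back 32-bit stores (each a potential exit)
 * or by one 64-bit store.  Ordinary memory only -- no real page tables. */
#include <stdint.h>
#include <stdio.h>

static union {
    uint64_t whole;
    uint32_t half[2];              /* [0] = low word on little-endian */
} fake_pte;

/* What a 32-bit guest does: two writes, hence (uncoalesced) two exits. */
static void write_pte_32x2(uint64_t value)
{
    fake_pte.half[0] = (uint32_t)value;          /* first trapped write  */
    fake_pte.half[1] = (uint32_t)(value >> 32);  /* second trapped write */
}

/* What a 64-bit guest can do: one write, one exit. */
static void write_pte_64(uint64_t value)
{
    fake_pte.whole = value;
}

int main(void)
{
    write_pte_32x2(0xdeadb000ULL | 0x1);         /* address bits | present bit */
    printf("after two 32-bit writes: %#llx\n", (unsigned long long)fake_pte.whole);

    write_pte_64(0xcafeb000ULL | 0x1);
    printf("after one 64-bit write:  %#llx\n", (unsigned long long)fake_pte.whole);
    return 0;
}
```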
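Finally, for the cache-scaling bullet: the savings Zhu et al. report come from skew. Under a roughly Zipf-shaped popularity distribution, a small head of hot items absorbs most requests, so the cache you need tracks the hot head rather than the whole working set, and it shrinks quickly as load drops. The back-of-the-envelope arithmetic below is generic Zipf(s=1) math of my own, not the authors’ model or their 4x/90% figures.

```c
/* Back-of-the-envelope skew calculation: with Zipf(s=1) popularity over N
 * items, the fraction of requests served by caching the k hottest items is
 * H_k / H_N.  Generic illustration only, not Zhu et al.'s model. */
#include <stdio.h>

int main(void)
{
    const int N = 1000000;                  /* distinct items      */
    double harmonic_N = 0.0;
    for (int k = 1; k <= N; k++)
        harmonic_N += 1.0 / k;              /* H_N, the normalizer */

    double partial = 0.0;                   /* running H_k         */
    for (int k = 1; k <= N; k++) {
        partial += 1.0 / k;
        if (k == 100 || k == 1000 || k == 10000 || k == 100000)
            printf("cache the top %6d items -> %4.0f%% of requests hit\n",
                   k, 100.0 * partial / harmonic_N);
    }
    return 0;
}
```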
In addition to the USENIX ATC I also attended portions of several workshops held coincident with the main conference, including HotStorage (4th workshop on hot topics in storage and file systems), HotCloud (4th workshop on hot topics in cloud computing), Cyberlaw (workshop on hot topics in cyberlaw), and WebApps (3rd conference on web application development).
At HotStorage I had the pleasure of watching Da Zheng, a student I work with at Johns Hopkins, present his latest work on “A Parallel Page Cache: IOPS and Caching for Multicore Systems”.
Since 2010 USENIX has held a “federated conferences week” where workshops like these take place in parallel with the main conference and where your admission fee to the main conference covers your admission to the workshops. This idea works well, especially since USENIX in its modern incarnations has only a single track of talks. Unfortunately, however, I came to feel that the “hot” workshops are no longer very hot—I was surprised at the dearth of risky, cutting-edge, groundbreaking, not-fully-fleshed-out ideas presented and the resulting lack of heated, controversial (read: interesting) discussion. (I did not attend the full workshops, however, so it could simply be that I missed the exciting papers and the stimulating discussions.)
I also sat in on the USENIX Association’s annual membership meeting. Attendance at this year’s conference was well down from what I remember of the USENIX conference I attended a decade ago (USENIX conferences have been running since the 1970s), and I got the sense that USENIX is scrambling to figure out how to stay relevant, funded/sponsored, attractive to potential authors, and interesting to potential attendees. The newly-elected USENIX board asked for community feedback so I sent the following thoughts:
Hi Usenix board,
I appreciate the opportunity to participate in your annual meeting earlier today. I wasn’t even aware that the meeting was taking place until one of the board members walked around the hallways trying to get people to attend. I otherwise would have thought that the “Usenix annual meeting” was some boring event like voting for FY13 officers or something. Maybe I missed a memo? Perhaps next year you could mention the annual meeting in the email blast you send out beforehand. (When I was at ShmooCon earlier this year the conference organizer actually ran a well-attended session, as part of the conference, about how they run a conference — logistics, costs, etc. — and in that session encouraged the same kind of feedback that you all solicited during the meeting.)
One of the points I brought up in the meeting is that Usenix should place itself at the forefront of cool technology. If there is something awesome that people are using, Usenix could be one of the first to adopt it — both in using the technology during sessions (live simulcast, electronic audience interaction, etc.) and in hosting actual technology demonstrations at the conference. Perhaps you could have a “release event” where people agree to escrow their latest cool online/network/storage/cloud/whatever tools for a few months and then release them all in bulk at the conference? You could make it into a media/publicity event and give awards for the most Usenix-spirited tools.
My wife suggests that you hold regional gatherings simultaneously as part of a large distributed conference. So perhaps it’s difficult/expensive for me to fly out to WA for Usenix Security, but maybe it’d be easy for me to attend the “northeast regional Usenix Security conclave” where all the sessions from WA are simulcast to MA (or NY or DC) — e.g. hosted at a university somewhere. Even better would be to have a few speakers present their work *here* and have that simulcast to the main conference in WA. [Some USENIX members] would complain about this approach, but think of all the news lately about the massive online university courses — perhaps Usenix could be a leader/trendsetter/definer in massive online conferences.
If I can figure out how, I’d like to be involved in bridging the gap between the “rigorous” academic conference community and the “practical” community-run conference scene. I’m giving a talk next month at a community conference, so I will try to lay the groundwork there by describing to the audience what conferences like Usenix are like. You know, YOU could try to speak at these conferences — e.g. [the USENIX president] could arrange to give an invited talk at OSCON, talking about how ordinary hacker/programmer types can submit interesting work to Usenix and eventually see their name at the top of a real conference paper.
Anyway, I appreciate the challenges you face in staying relevant in the face of change. I agree with the woman at the back of the room who said that it doesn’t matter what Usenix was 25 years ago; what’s important is what Usenix is now and in the next few years. Good luck!
Within hours the USENIX president replied:
Thanks for your thoughtful mail and especially your desire to be part of the solution — that’s what we love about our members!
I think we got a lot of good feedback last night and there are some great ideas here — the idea of simulcasting to specific remote locales where we have regional events is intriguing to me. You’ve given the board and the staff much to think about and I hope we don’t let you down!