Systems Fun at HotOS

Infra

Systems Fun at HotOS

One day somebody will tell me what systems means.

Last week I attended HotOS¹ for the first time. It was super fun. Just the kind of conference I like: single-track, a mix of academic and industry, a mix of normal practical ideas and less-normal less-practical big thinking. I went partially because a colleague twisted my arm, and partially because of this line in the CFP:

The program committee will explicitly favor papers likely to stimulate reflection and discussion.

That sounds like a good time.

Some Papers or Talks I Enjoyed

I thought I’d give a shout-out to some of the papers I enjoyed the most. These aren’t the best papers, just ones I particularly liked for my own arbitrary reasons².

The NIC should be part of the OS by Xu and Roscoe exposes a lot of what was traditionally the OS kernel’s state (like the run queue for each core) to the NIC, allowing the NIC to use that state to decide which packets to deliver and when. Their goal is to do better than existing kernel bypass methods. After all, many modern cloud applications are short bursts of compute between waiting for packets. They do this with CXL 3.0’s cache-coherent peripheral interconnect, but I don’t think peripheral cache coherence is actually critical to the big picture here. This approach of pushing scheduling work down to the NIC seems especially interesting for straight packet processing workloads, and for Serverless workloads, both of which tend to be very run and wait heavy with high concurrency. Just super fun deep systems work³.
Batching with End-to-End Performance Estimation by Borisov et al. This paper starts by questioning the common wisdom that batching is good for throughput and bad for latency, by presenting a set of scenarios that show that it can be good or bad for both latency and throughput depending on small timing differences. I love questioning the common wisdom, so this is my jam. They then point out that this situation can be improved by making batching logic aware of end-to-end latency, and propose a smart way to estimate that latency using Little’s Law.
Spork: A posix_spawn you can use as a fork by Vögele et al. One day, I decided to move to a new house. So a built a perfect replica of my old house in a new place, then tore it down, and built a new house⁴. That’s fork. Fork sucks. posix_spawn is better, but nobody uses it. So what do we do? In this paper, the authors propose to trap fork, analyze the forking binary (!) to find the exec that corresponds to the fork, and then dynamically rewrite (!) the forking program to call posix_spawn instead. If you like deep OS level optimizations, with just the right amount of unhingedness, you’ll enjoy this paper⁵.
Real Life is Uncertain. Consensus Should Be Too! by Frank et al⁷. You know how consensus papers talk about the f-threshold failure model, and safety and liveness, and every implementer of those papers realizes that f-threshold is kinda bunk because things don’t actually fail that way and the world is messy and probablistic and correlated? Yeah. This paper looks at better ways to talk about these things, building a bridge between theory and practice. It goes on to propose some ideas for building better consensus algorithms that deal with the messiness of the real world. Cool paper, important conversation.
From Ahead-of- to Just-in-Time and Back Again: Static Analysis for Unix Shell Programs by Lazarek et al. Just read that title. Static analysis for shell programs. Static analysis for shell programs. Shell programs, like fork, are a little bit of technical debt from a prior era⁶ that we just can’t make go away. But what if we can make them way safer with static analysis? If your initial thought it ‘wow, that sounds hard’, you’re on the right track.
Analyzing Metastable Failures by Isaacs et al. This is the start of something I think is super important: making conceptual and practical progress on catching metastable failures at design time. Figure 1 and Figure 3 are worth the price of admission by themselves, a pair of important concepts that might turn out to be super important to the future of large-scale systems building.

Overall, HotOS was a great time, with tons of new and interesting ideas.

Interesting Trends

Students continue to be smart, motivated, thoughtful, and generally awesome. This has always been true, but maybe runs counter to the general narrative. Tons of great up-and-coming folks.
As I noted about OSDI’23, the rise of Rust as a systems programming language continues. We’ve gone from folks justifying why Rust? to more why not Rust?
AI for systems and Systems for AI are everywhere, but in much more practical ways than in the past. We seem to be over the hump of hype here, and actually using AI as a tool to build some really cool stuff. With a few exceptions, the overall vibe matched what Narayanan and Kapoor talk about in AI as Normal Technology, which I think is a positive thing for the systems community.

Footnotes

That’s The ACM SIGOPS 20th Workshop on Hot Topics in Operating Systems to you.
If your paper or talk is not on the list, you can assume I haven’t had a chance to read it yet, or happened to miss your talk.
Both ETH Zurich and EPFL do a ton of kick ass systems work. There’s something in the water in Switzerland.
I stole this analogy from somebody, but unfortunately can’t remember who.
In some ways it reminds me of the original Xen paper. Similar crazy enough to work energy.
A prior error, one might say.
There’s something in the water in Berkeley, too, but there always has been.