QUIC restarts, slow problems: udpgrm to the rescue
At Cloudflare, we do everything we can to avoid interruption to our services. We frequently deploy new versions of the code that delivers the services, so we need to be able to restart the server processes to upgrade them without missing a beat. In particular, performing graceful restarts (also known as "zero downtime") for UDP servers has proven to be surprisingly difficult.
We've previously written about graceful restarts in the context of TCP, which is much easier to handle. We didn't have a strong reason to deal with UDP until recently, when protocols like HTTP/3 (QUIC) became critical. This blog post introduces udpgrm, a lightweight daemon that helps us to upgrade UDP servers without dropping a single packet.
Here's the udpgrm GitHub repo.
<h2>Historical context</h2>
<p>In the early days of the Internet, UDP was used for stateless request/response communication with protocols like DNS or NTP. Restarts of a server process are not a problem in that context, because it does not have to retain state across multiple requests. However, modern protocols like QUIC, WireGuard, and SIP, as well as online games, use stateful flows. So what happens to the state associated with a flow when a server process is restarted? Typically, old connections are just dropped during a server restart. Migrating the flow state from the old instance to the new instance is possible, but it is complicated and notoriously hard to get right.</p><p>The same problem occurs for TCP connections, but there a common approach is to keep the old instance of the server process running alongside the new instance for a while, routing new connections to the new instance while letting existing ones drain on the old. Once all connections finish or a timeout is reached, the old instance can be safely shut down. The same approach works for UDP, but it requires more involvement from the server process than for TCP.</p><p>In the past, we <a href="https://blog.cloudflare.com/everything-you-ever-wanted-to-know-about-udp-sockets-but-were-afraid-to-ask-part-1/"><u>described</u></a> the <i>established-over-unconnected</i> method. It offers one way to implement flow handoff, but it comes with significant drawbacks: it’s prone to race conditions in protocols with multi-packet handshakes, and it suffers from a scalability issue. Specifically, the kernel hash table used for dispatching packets is keyed only by the local IP:port tuple, which can lead to bucket overfill when dealing with many inbound UDP sockets.</p><p>Now we have found a better method, leveraging Linux’s <code>SO_REUSEPORT</code> API. By placing both old and new sockets into the same REUSEPORT group and using an eBPF program for flow tracking, we can route packets to the correct instance and preserve flow stickiness. This is how <i>udpgrm</i> works.</p>
<h2>REUSEPORT group</h2>
<p>Before diving deeper, let's quickly review the basics. Linux provides the <code>SO_REUSEPORT</code> socket option, typically set after <code>socket()</code> but before <code>bind()</code>. Please note that this has a separate purpose from the better known <code>SO_REUSEADDR</code> socket option.</p><p><code>SO_REUSEPORT</code> allows multiple sockets to bind to the same IP:port tuple. This feature is primarily used for load balancing, letting servers spread traffic efficiently across multiple CPU cores. You can think of it as a way for an IP:port to be associated with multiple packet queues. In the kernel, sockets sharing an IP:port this way are organized into a <i>reuseport group </i>— a term we'll refer to frequently throughout this post.</p>
<pre><code>┌───────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443            │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ socket #1 │ │ socket #2 │ │ socket #3 │ │
│ └───────────┘ └───────────┘ └───────────┘ │
└───────────────────────────────────────────┘
</code></pre>
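As a concrete example, here is a minimal sketch in C (ours, not udpgrm code) of two UDP sockets joining the same reuseport group; the kernel then load-balances inbound packets between their two receive queues:
<pre><code>#include &lt;arpa/inet.h&gt;
#include &lt;netinet/in.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;unistd.h&gt;

/* Create a UDP socket, enable SO_REUSEPORT before bind(), and bind it.
 * Every socket bound this way to the same IP:port joins one reuseport
 * group. */
static int make_member_socket(uint16_t port)
{
    int one = 1;
    int sd = socket(AF_INET, SOCK_DGRAM, 0);
    if (sd < 0)
        return -1;

    setsockopt(sd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(port),
        .sin_addr.s_addr = htonl(INADDR_ANY),
    };
    if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(sd);
        return -1;
    }
    return sd;
}

int main(void)
{
    /* Both sockets share 0.0.0.0:5201 and form a two-member group. */
    int a = make_member_socket(5201);
    int b = make_member_socket(5201);
    printf("sockets %d and %d share one reuseport group\n", a, b);
    return 0;
}
</code></pre>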
Linux supports several methods for distributing inbound packets across a reuseport group. By default, the kernel uses a hash of the packet's 4-tuple to select a target socket. Another method is <code>SO_INCOMING_CPU</code>, which, when enabled, tries to steer packets to sockets running on the same CPU that received the packet. This approach works but has limited flexibility.
To provide more control, Linux introduced the <code>SO_ATTACH_REUSEPORT_CBPF</code> option, allowing server processes to attach a classic BPF (cBPF) program to make socket selection decisions. This was later extended with <code>SO_ATTACH_REUSEPORT_EBPF</code>, enabling the use of modern eBPF programs. With eBPF, developers can implement arbitrary custom logic. A boilerplate program would look like this:
<pre><code>SEC("sk_reuseport")
int udpgrm_reuseport_prog(struct sk_reuseport_md *md)
{
    uint64_t socket_identifier = xxxx;
    bpf_sk_select_reuseport(md, &sockhash, &socket_identifier, 0);
    return SK_PASS;
}
</code></pre>
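For the program to take effect, it has to be attached to the reuseport group with the <code>SO_ATTACH_REUSEPORT_EBPF</code> socket option; attaching it to one member socket installs it for the whole group. A minimal sketch, assuming the program has already been loaded (for example with libbpf) and <code>prog_fd</code> is its file descriptor:
<pre><code>#include &lt;sys/socket.h&gt;

/* Attach a loaded sk_reuseport eBPF program to the group. Setting this
 * option on one SO_REUSEPORT socket is enough: the kernel will run the
 * program for every packet destined to the whole group. */
static int attach_reuseport_prog(int sd, int prog_fd)
{
    return setsockopt(sd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
                      &prog_fd, sizeof(prog_fd));
}
</code></pre>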
To select a specific socket, the eBPF program calls <code>bpf_sk_select_reuseport</code>, using a reference to a map with sockets (<code>SOCKHASH</code>, <code>SOCKMAP</code>, or the older, mostly obsolete <code>SOCKARRAY</code>), along with a key or index. For example, a declaration of a <code>SOCKHASH</code> might look like this:
<pre><code>struct {
    __uint(type, BPF_MAP_TYPE_SOCKHASH);
    __uint(max_entries, MAX_SOCKETS);
    __uint(key_size, sizeof(uint64_t));
    __uint(value_size, sizeof(uint64_t));
} sockhash SEC(".maps");
</code></pre>
This <code>SOCKHASH</code> is a hash map that holds references to sockets, even though the value size looks like a scalar 8-byte value. In our case it's indexed by a <code>uint64_t</code> key. This is pretty neat, as it allows for a simple number-to-socket mapping!
However, there's a catch: the <code>SOCKHASH</code> must be populated and maintained from user space (or a separate control plane), outside the eBPF program itself. Keeping this socket map accurate and in sync with the server process state is surprisingly difficult to get right, especially under dynamic conditions like restarts, crashes, or scaling events. The point of udpgrm is to take care of this bookkeeping, so that server processes don't have to.
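For illustration, the user-space side of that bookkeeping boils down to one map update per socket. This is a minimal sketch using libbpf, with our own function and variable names rather than udpgrm's; the value written into the <code>SOCKHASH</code> is simply the socket's file descriptor, which the kernel resolves to a socket reference:
<pre><code>#include &lt;bpf/bpf.h&gt;
#include &lt;stdint.h&gt;

/* Register a UDP socket in the SOCKHASH under a numeric key. From user
 * space the "value" passed to bpf_map_update_elem() is the socket's file
 * descriptor; the kernel stores a reference to the underlying socket. */
static int register_socket(int sockhash_map_fd, uint64_t key, int sock_fd)
{
    uint64_t value = sock_fd;
    return bpf_map_update_elem(sockhash_map_fd, &key, &value, BPF_ANY);
}
</code></pre>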
<h2>Socket generation and working generation</h2>
<p>Let’s look at how graceful restarts for UDP flows are achieved in <i>udpgrm</i>. To reason about this setup, we’ll need a bit of terminology: A <b>socket generation</b> is a set of sockets within a reuseport group that belong to the same logical application instance:</p>
<pre><code>┌─────────────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443                  │
│ ┌─────────────────────────────────────────────┐ │
│ │ socket generation 0                         │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐   │ │
│ │ │ socket #1 │ │ socket #2 │ │ socket #3 │   │ │
│ │ └───────────┘ └───────────┘ └───────────┘   │ │
│ └─────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────┐ │
│ │ socket generation 1                         │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐   │ │
│ │ │ socket #4 │ │ socket #5 │ │ socket #6 │   │ │
│ │ └───────────┘ └───────────┘ └───────────┘   │ │
│ └─────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
</code></pre>
When a server process needs to be restarted, the new version creates a new socket generation for its sockets. The old version keeps running alongside the new one, using sockets from the previous socket generation.
Reuseport eBPF routing boils down to two problems:
For new flows, we should choose a socket from the socket generation that belongs to the active server instance.
For already established flows, we should choose the appropriate socket — possibly from an older socket generation — to keep the flows sticky. The flows will eventually drain away, allowing the old server instance to shut down.
Easy, right?
Of course not! The devil is in the details. Let’s take it one step at a time.
Routing new flows is relatively easy. udpgrm simply maintains a reference to the socket generation that should handle new connections. We call this reference the working generation. Whenever a new flow arrives, the eBPF program consults the working generation pointer and selects a socket from that generation.
┌──────────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443 │
│ … │
│ Working generation ────┐ │
│ V │
│ ┌───────────────────────────────┐ │
│ │ socket generation 1 │ │
│ │ ┌───────────┐ ┌──────────┐ │ │
│ │ │ socket #4 │ │ … │ │ │
│ │ └───────────┘ └──────────┘ │ │
│ └───────────────────────────────┘ │
│ … │
└──────────────────────────────────────────────┘
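To make the mechanics concrete, the sketch below shows what the eBPF side of this lookup can look like. It is a simplified illustration, not udpgrm's actual program: the <code>working_gen</code> map, the key layout (generation number times a fixed per-generation slot count plus a hash-based index), and the constants are assumptions made for this example, and it reuses the <code>sockhash</code> map declared earlier plus the usual <code>bpf_helpers.h</code> boilerplate.
<pre><code>/* Illustrative only: route a packet from a new flow to a socket in the
 * current working generation. Map names and key layout are made up for
 * this sketch. */
#define SOCKETS_PER_GEN 64

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __uint(key_size, sizeof(uint32_t));
    __uint(value_size, sizeof(uint64_t));   /* current working generation */
} working_gen SEC(".maps");

SEC("sk_reuseport")
int route_new_flow(struct sk_reuseport_md *md)
{
    uint32_t zero = 0;
    uint64_t *gen = bpf_map_lookup_elem(&working_gen, &zero);
    if (!gen)
        return SK_DROP;

    /* Pick a socket within the working generation, spreading new flows
     * by the packet's 4-tuple hash. */
    uint64_t key = *gen * SOCKETS_PER_GEN + (md->hash % SOCKETS_PER_GEN);
    if (bpf_sk_select_reuseport(md, &sockhash, &key, 0) == 0)
        return SK_PASS;
    return SK_DROP;
}
</code></pre>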
For this to work, we first need to be able to differentiate packets belonging to new connections from packets belonging to old connections. This is very tricky and highly dependent on the specific UDP protocol. For example, QUIC has an initial packet concept, similar to a TCP SYN, but other protocols might not.
There needs to be some flexibility here, so udpgrm makes this configurable: each reuseport group is assigned a specific flow dissector.
The flow dissector has two tasks:
It distinguishes new packets from packets belonging to old, already established flows.
For recognized flows, it tells udpgrm which specific socket the flow belongs to.
These concepts are closely related and depend on the specific server. Different UDP protocols define flows differently. For example, a naive UDP server might use a typical 5-tuple to define flows, while QUIC uses a “connection ID” field in the QUIC packet header to survive NAT rebinding.
udpgrm supports three flow dissectors out of the box and is highly configurable to support any UDP protocol. More on this later.
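To make the QUIC case concrete, here is a rough sketch of that classification in plain C. It is not one of udpgrm's dissectors: it only looks at the header form bit (long-header packets carry handshake traffic, so they indicate a new flow) and assumes the server issues fixed-length connection IDs so the destination connection ID of a short-header packet can be read at a known offset.
<pre><code>#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define CID_LEN 8   /* assumed: server-chosen, fixed connection ID length */

/* Header form bit set => long header => handshake packet => new flow. */
static bool quic_is_new_flow(const uint8_t *pkt, size_t len)
{
    return len > 0 && (pkt[0] & 0x80) != 0;
}

/* In a short-header packet the destination connection ID follows the
 * first byte; its length is whatever the server picked when issuing it. */
static bool quic_extract_dcid(const uint8_t *pkt, size_t len,
                              uint8_t dcid[CID_LEN])
{
    if (len < 1 + CID_LEN || (pkt[0] & 0x80) != 0)
        return false;
    memcpy(dcid, pkt + 1, CID_LEN);
    return true;
}
</code></pre>
The extracted connection ID is what a dissector can use as the flow key when looking up the owning socket, which is what keeps a QUIC flow pinned to the right server instance even across NAT rebinding.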
<h2>Welcome udpgrm!</h2>
<p>Now that we covered the theory, we're ready for the business: please welcome <b>udpgrm</b> — UDP Graceful Restart Marshal! <i>udpgrm</i> is a stateful daemon that handles all the complexities of the graceful restart process for UDP. It installs the appropriate eBPF REUSEPORT program, maintains flow state, communicates with the server process during restarts, and reports useful metrics for easier debugging.</p><p>We can describe <i>udpgrm</i> from two perspectives: for administrators and for programmers.</p>
<h2>udpgrm daemon for the system administrator</h2>
<p><i>udpgrm</i> is a stateful daemon. To run it:</p>
<pre><code>$ sudo udpgrm --daemon
[ ] Loading BPF code
[ ] Pinning bpf programs to /sys/fs/bpf/udpgrm
[*] Tailing message ring buffer map_id 936146
</code></pre>
This sets up the basic functionality, prints rudimentary logs, and should be deployed as a dedicated systemd service, loaded after networking. However, this is not enough to fully use udpgrm: it also needs to hook into the <code>getsockopt</code>, <code>setsockopt</code>, <code>bind</code>, and <code>sendmsg</code> syscalls, and these hooks are scoped to a cgroup. To install the hooks for a specific cgroup:
<pre><code>$ sudo udpgrm --install=/sys/fs/cgroup/system.slice
</code></pre>
But a more common pattern is to install them within the current cgroup:
<pre><code>$ sudo udpgrm --install --self
</code></pre>
Better yet, use it as part of the systemd service config:
<pre><code>[Service]
…
ExecStartPre=/usr/local/bin/udpgrm --install --self
</code></pre>
Once udpgrm is running, the administrator can use the CLI to list reuseport groups, sockets, and metrics, like this:
<pre><code>$ sudo udpgrm list
[ ] Retrieving BPF progs from /sys/fs/bpf/udpgrm
192.0.2.0:4433
    netns 0x1 dissector bespoke digest 0xdead
    socket generations:
        gen 3 0x17a0da <= app 0 gen 3
    metrics:
        rx_processed_total 13777528077
        …
</code></pre>
Now, with the udpgrm daemon running and the cgroup hooks set up, we can focus on the server part.
<h2>udpgrm for the programmer</h2>
<p>We expect the server to create the appropriate UDP sockets by itself. We depend on <code>SO_REUSEPORT</code>, so that each server instance can have a dedicated socket or a set of sockets:</p>
<pre><code>sd = socket.socket(AF_INET, SOCK_DGRAM, 0)
sd.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
sd.bind(("192.0.2.1", 5201))
</code></pre>
With a socket descriptor handy, we can pursue the udpgrm magic dance. The server communicates with the udpgrm daemon using <code>setsockopt</code> calls. Behind the scenes, udpgrm provides eBPF <code>setsockopt</code> and <code>getsockopt</code> hooks and hijacks specific calls. It's not easy to set up on the kernel side, but when it works, it's truly awesome. A typical socket setup looks like this:
<pre><code>try:
    work_gen = sd.getsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN)
except OSError:
    raise OSError('Is udpgrm daemon loaded? Try "udpgrm --self --install"')

sd.setsockopt(IPPROTO_UDP, UDP_GRM_SOCKET_GEN, work_gen + 1)

for i in range(10):
    v = sd.getsockopt(IPPROTO_UDP, UDP_GRM_SOCKET_GEN, 8)
    sk_gen, sk_idx = struct.unpack('II', v)
    if sk_idx != 0xffffffff:
        break
    time.sleep(0.01 * (2 ** i))
else:
    raise OSError("Communicating with udpgrm daemon failed.")

sd.setsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN, work_gen + 1)
</code></pre>
You can see three blocks here:
First, we retrieve the working generation number and, by doing so, check for udpgrm presence. Typically, udpgrm absence is fine for non-production workloads.
Then we register the socket to an arbitrary socket generation. We choose <code>work_gen + 1</code> as the value and verify that the registration went through correctly.
Finally, we bump the working generation pointer.
That’s it! Hopefully, the API presented here is clear and reasonable. Under the hood, the udpgrm daemon installs the REUSEPORT eBPF program, sets up internal data structures, collects metrics, and manages the sockets in a <code>SOCKHASH</code>.
<h2>Advanced socket creation with udpgrm_activate.py</h2>
<p>In practice, we often need sockets bound to low ports like <code>:443</code>, which requires elevated privileges like <code>CAP_NET_BIND_SERVICE</code>. It's usually better to configure listening sockets outside the server itself. A typical pattern is to pass the listening sockets using <a href="https://0pointer.de/blog/projects/socket-activation.html"><u>socket activation</u></a>.</p><p>Sadly, systemd cannot create a new set of UDP <code>SO_REUSEPORT</code> sockets for each server instance. To overcome this limitation, <i>udpgrm</i> provides a script called <code>udpgrm_activate.py</code>, which can be used like this:</p>
<pre><code>[Service]
Type=notify                  # Enable access to fd store
NotifyAccess=all             # Allow access to fd store from ExecStartPre
FileDescriptorStoreMax=128   # Limit of stored sockets must be set
ExecStartPre=/usr/local/bin/udpgrm_activate.py test-port 0.0.0.0:5201
</code></pre>
Here, <code>udpgrm_activate.py</code> binds to <code>0.0.0.0:5201</code> and stores the created socket in the systemd FD store under the name <code>test-port</code>. The server, <code>echoserver.py</code>, will inherit this socket and receive the appropriate <code>LISTEN_FDS</code> and <code>LISTEN_FDNAMES</code> environment variables, following the typical systemd socket activation pattern.
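For reference, here is a rough sketch (ours, not part of udpgrm) of how a server can locate that named descriptor using the standard socket-activation environment variables. systemd numbers inherited descriptors from 3 upwards, puts the count in <code>LISTEN_FDS</code>, and lists the corresponding names, colon-separated, in <code>LISTEN_FDNAMES</code>:
<pre><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

#define SD_LISTEN_FDS_START 3

/* Return the inherited file descriptor stored under `name` (for example
 * "test-port"), or -1 if it is not present. A production version would
 * also verify that LISTEN_PID matches getpid(). */
static int find_named_fd(const char *name)
{
    const char *fds = getenv("LISTEN_FDS");
    const char *names = getenv("LISTEN_FDNAMES");
    if (!fds || !names)
        return -1;

    int n = atoi(fds);
    char *copy = strdup(names);
    char *save = NULL;
    int idx = 0;
    for (char *tok = strtok_r(copy, ":", &save); tok != NULL && idx < n;
         tok = strtok_r(NULL, ":", &save), idx++) {
        if (strcmp(tok, name) == 0) {
            free(copy);
            return SD_LISTEN_FDS_START + idx;
        }
    }
    free(copy);
    return -1;
}
</code></pre>
Equivalently, a server linked against libsystemd can use <code>sd_listen_fds_with_names()</code> for the same lookup.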