HTTP/3 in Practice — QUIC

25 Mar 2023 at 8:56AM in Software

The first in what I hope will be a series of articles covering my attempts to implement a HTTP/3 server from scratch in Rust. This article outlines my goals and looks at the QUIC protocol on which HTTP/3 is implemented.

This is the 1st of the 2 articles that currently make up the “HTTP/3 in Practice” series.


Having spent the last couple of years exclusively blogging about Python releases, I wanted to pursue something different for a while. Looking into the next major version of the HTTP protocol, HTTP/3, has been on my radar for a while, as has getting some practice authoring Rust. Hence, I’ve decided to combine the two and build a HTTP/3 web server in Rust. This article is the first in what I hope will be a series covering my progress on this project.

As those who’ve been keeping up with my Python release articles will know, however, speed isn’t my strong point so I suspect this series will be somewhat intermittent as I make progress. In this article I’m going to briefly outline how I intend to structure the project, and then jump into some coverage of QUIC, which is a UDP-based transport on which HTTP/3 is built.

Geta Plan

Every project needs a name, lest it languish for all time under a moniker such as “that HTTP/3 server project thingy what I wrote in Rust, like”, and with deplorably little thought I’ve decided to call mine Geta — named after the Roman emperor not the Japanese footwear. The reasons don’t bear much scrutiny, but largely because I thought it would be amusing to make it a homophone of “getter”1, and because Geta was the son and successor of Septimius Severus where “Severus” sounds a bit like “server us”. I did warn you it was tenuous, but it’ll do for a throw-away project.

My general plan, and one might reasonably assume the outline of the articles within this series, is as follows:

  1. Get a hopefully working understanding of the QUIC protocol.
  2. Get an equally hopefully working understanding of how HTTP/3 is layered upon it.
  3. Implement a simple UDP server in Rust as a first baby step.
  4. Implement QUIC and HTTP/3 libraries layered upon UDP.
  5. Finish this delicious UDP-flavour cake with some configuration and command-line niceties and some lovely lemon icing.
  6. Add HTTP/2 and HTTP/1.1 support, just for fun.
  7. Use the server to host this website to prove it works.
  8. ???
  9. Profit.

It’s a bit of a vague plan, but it has the important virtue of giving me a starting point, and moreover a starting point that lends itself very well to the first couple of blog articles — then we can see whether the rest even happens.

Some two and a half thousand years ago, Sun Tzu wrote in The Art of War:

The general who wins the battle makes many calculations in his temple before the battle is fought. The general who loses makes but few calculations beforehand.

Essentially: be prepared. In this case, I’m not going to get too far unless I have a good understanding of how the protocols work, so going over that is going to be the first step. I’m sad to say that I don’t have my own temple in which to do my calculations, but I do have a collection of RFCs and a mind that’s hungry to learn, so I shall cease this self-indulgent jabbering and start getting to grips with it.

My first target is a working understanding of QUIC, and the results of my efforts in this area will form the remainder of this article.

Why so QUIC?

At this point you might quite reasonably be asking what on earth QUIC is and how it is relevant to HTTP/32. Well, the answers to those questions are that QUIC is a transport protocol that was designed by Jim Roskind at Google in 2012 as a replacement for TCP for HTTP and similar protocols. It grew out of experience of running SPDY, the experimental protocol which later became HTTP/2, over TCP — if you’re not familiar with HTTP/2 then I wrote an article about it way back in 20153, and I’m sure you can find any number of resources with a cursory search.

QUIC attempts to improve on several aspects of TCP, but there are two main ones that I think are the key selling points:

  • Reduce the latency of setting up a secure connection.
  • Remove the remaining head-of-line blocking issues.

Before jumping into an overview of how QUIC works, I’ll briefly discuss these two issues in the context of HTTP/2 over TCP, as I think it’s always important to understand the motivations for doing something differently.

Connection Establishment Latency

The 3-way handshake involved when setting up a TCP connection is an overhead when setting up a new connection, such as connecting to a new website in your browser. To make matters much worse, however, most connections use TLS these days and over TCP that must be negotiated after the connection has been established. This means that the typical HTTP connection over TLS 1.3 over TCP requires at least three round-trip times (RTT) before the first HTTP response starts to be received, or more if pre-1.3 versions of TLS are used4.

A full discussion of either the TCP or TLS handshake is outside the scope of this article, and you can find any number of descriptions online, but for reference the connection establishment handshake for HTTP (pre-HTTP/3) over TLS 1.3 over TCP is shown in the diagram below.

TLS over TCP handshake

This isn’t so bad across fast links, but over lower-bandwidth and unreliable mobile networks it can impose a significant delay. If you factor in situations such as HTTP redirects to other sites, which require their own TCP connections, things can get really dire quite quickly.

Head-of-line Blocking

One of the problems with HTTP/1.1 was that it suffered from head-of-line (HOL) blocking. Rendering a website typically involves fetching multiple resources — HTML, CSS, Javascript, images — and HTTP/1.1 could only return these resources in a serialised manner on a single TCP connection. This means that any kind of a delay fetching a resource, say a particularly large image, would hold up fetching the rest of them. A technique called pipelining allowed multiple requests to be sent in parallel, but this didn’t help all that much because responses had to be returned in the same order and were still serialised. Also, many servers never supported pipelining properly.

HTTP/2 made a lot of progress in resolving this issue by multiplexing multiple concurrent request/response streams over a single connection. This meant that resources were now independent of each other, and also the server was free to return them in whatever order most benefited page loading time. There’s still a HOL blocking issue with the underlying TCP stream, however — if there is some network error, such as a dropped packet, everything is halted until the data can be resent and successfully acknowledged.

This is an inherent consequence of TCP being a reliable byte stream: the protocol enforces that bytes arrive at the other end in exactly the order they were sent, and that means any gaps must be filled before moving on. With HTTP/2 the browser could, in principle, still use data on streams unaffected by the dropped packets — but because TCP is unaware of the multiple streams multiplexed over it, the application never gets to see those bytes until the dropped data is recovered.

One solution would be to make multiple TCP connections, one for each resource, instead of the multiplexing approach that HTTP/2 takes. This solves all your HOL blocking issues, but at the considerable expense of the connection establishment latency we looked at above, as well as the overheads of repeating all the HTTP request headers, etc. This harks back to the bad old days of HTTP/1.0 before persistent connections, where some scarily massive proportion of Internet traffic ended up being TCP SYNs and SYN ACKs.

Over today’s wireless networks, packet loss isn’t all that uncommon, and so this additional HOL blocking issue was still a noticeable pain point with HTTP/2 over TCP. The less reliable the underlying link layer becomes, the more painful this issue is.

A Not-So-Quick QUIC Tour

Now we know the problems with TCP that QUIC is aiming to resolve, let’s take a look at its technical aspects and we’ll see its advantages over TCP as we go.

With apologies for stating the obvious, by necessity this article doesn’t go into the full technical details you’ll need to know to implement the protocol, such as byte-by-byte layout of specific frames and headers, so if that’s what you’re looking for then you’ll want to check out these RFCs:

  • RFC 9000 specifies the core protocol.
  • RFC 8999 is a short specification of the invariant aspects of the protocol which are guaranteed not to change in any version — in particular, the initial version negotiation.
  • RFC 9001 describes how TLS 1.3 is integrated into QUIC in greater detail.
  • RFC 9002 describes QUIC’s loss detection and congestion control mechanisms in greater detail.

That said, RFCs often seem to have an almost pathological conformance to the DRY principle, and this can mean that it’s hard to understand any of it until you’ve read all of it — sometimes multiple times and across multiple documents. Hopefully, therefore, this article might still be a useful preparation for understanding the greater detail covered in the RFCs without having to read them eighteen times each. I’m sure it goes without saying, however, that in case of any conflict then assume the RFCs are correct!

Datagrams, Packets and Frames

Instead of being implemented directly on top of IP, QUIC uses UDP datagrams. This is a pragmatic decision to coexist with as much current network equipment as possible whilst not using TCP. It does mean that QUIC needs to implement reliable delivery of frames on top of UDP, however, which wouldn’t have been necessary had they used, say, SCTP. Mind you, that would have presented a different challenge, which is that virtually no network equipment anywhere has any degree of support for SCTP, and a pretty substantial proportion of the software world may not even be aware of its existence — so UDP doesn’t seem like a bad choice, all things considered.

Each UDP datagram can contain one or more packets, and each packet may contain one or more frames according to its type. This may seem a little redundant, but typically including multiple packets per datagram is only done during the initial protocol handshake, and allows fewer round-trip delays to complete the process, thus improving latency. These initial packets have a longer header which includes a length field, and this is what allows multiple packets to unambiguously coexist in a single datagram. We’ll look at connection setup later, as I think it’s more useful to understand the core protocol first.

Once the handshake is completed, a single packet type, called a 1-RTT packet, is generally used for the rest of the connection. This carries frames which contain the data of the various data streams, and to reduce overheads it has a much shorter header which lacks a length field. This, in turn, means that it must occur as the final packet in a datagram, where the size of the datagram implicitly delimits the size of the packet. Generally, in fact, it will be the only packet in the datagram, since the other packet types are used during handshaking. The fact that the 1-RTT packet may include multiple frames means that the capacity of the UDP packet can still be maximised, however.
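Since the plan is to write all this in Rust, here’s a quick sketch of how this nesting might be modelled. These types are purely my own illustration, not anything from the RFCs or an existing crate, and a real implementation would be considerably more involved:

```rust
// A purely illustrative model of QUIC's containment hierarchy: one UDP
// datagram holds one or more packets, and each packet one or more frames.
struct Datagram {
    packets: Vec<Packet>,
}

enum Packet {
    // Long-header packets (Initial, Handshake, etc.) carry an explicit
    // length, which is what lets several share a datagram during the
    // handshake.
    Initial { packet_number: u64, length: usize, frames: Vec<Frame> },
    Handshake { packet_number: u64, length: usize, frames: Vec<Frame> },
    // A 1-RTT packet has no length field, so it must come last in the
    // datagram: its size is simply whatever remains.
    OneRtt { packet_number: u64, frames: Vec<Frame> },
}

// Frames are covered later in the article; treat them as opaque for now.
struct Frame {
    payload: Vec<u8>,
}

fn main() {
    let datagram = Datagram {
        packets: vec![Packet::OneRtt {
            packet_number: 0,
            frames: vec![Frame { payload: vec![] }],
        }],
    };
    println!("packets in datagram: {}", datagram.packets.len());
}
```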

Packets themselves are protected using TLS, and the nature of this protection differs depending on the packet type — since the first packets set up the TLS connection parameters, it’s clear that they can’t be protected fully themselves. That said, I’m going to continue this discussion totally disregarding the encryption so we can look at the underlying protocol without complications. We’ll look at how TLS protection is introduced later on.

Packet Numbers and Acknowledgement

The reliable data delivery is handled at the packet level rather than the frame level. Each packet has a unique ID number which is acknowledged by the other peer when received, and such acknowledgement confirms the receipt of all frames contained within that packet. Packet numbering always starts at zero, as opposed to TCP, where initial sequence numbers are deliberately randomised.

Glossing over some complexities during connection establishment, there is a single packet count maintained in each direction for the lifetime of the connection. That is, each peer remembers the packet number it most recently sent, and the packet number it most recently successfully received from its peer.

Acknowledgements of packets are sent in ACK frames. This contrasts with TCP, where the acknowledgement is a field within the packet. This may at first seem inefficient, but since packets can include multiple frames then it’s still possible to collocate data and acknowledgements in QUIC in the same way as TCP, it’s just that the mechanism differs.

Peers are allowed to delay sending of an ACK for reasons of efficiency. For example, if you’re in the process of sending data as well as receiving it, you may want to briefly wait for more data to send before sending a pending ACK, so that you can put the frames in a single packet and improve the latency of the connection overall. You may also want to wait briefly so you can acknowledge multiple packets at once to reduce the need to send multiple packets back. However, among the connection parameters that a peer can declare there is a max_ack_delay which is an upper limit on how long a peer can wait before it must send the ACK for a packet it’s received — if not specified then 25 ms is assumed. As you can see from the default value, it’s expected that an ACK will be sent promptly.

The ACK frame itself is a little more complicated than you might imagine since it doesn’t just acknowledge the maximum consecutive packet the peer has seen — it conveys gaps in the stream so that the sender can resend only those packets which are missing. It does this by listing the maximum packet number it’s received, and the number of consecutive packets before this which it has also received. Beyond this it can then optionally list a sequence of gap sizes and acknowledged packet lengths.

The diagram below shows a case where the server has sent eight packets since the most recent ACK from the client, and three of those have been dropped. This illustrates the format of ACK that the client would send in this circumstance.

QUIC ACK example

One thing that you may notice is that the ranges are all one smaller than you’d expect them to be — this is a quirk in that the range is assumed to have at least one packet in it, so the range length is the number of additional packets which precede the maximum packet number in the range. The only benefit to this I can see, beyond being able to laugh at the million or so off-by-one errors developers will have to track down, is that you slightly reduce the size of values that you’ll need to represent. This has a theoretical benefit, because the values are encoded in a variable-length fashion where the minimum number of bytes are used to represent the value. I’d argue this benefit is debatable given the additional fiddliness it puts into implementations, but it’s just an opinion.

In response to receiving the second ACK shown in the diagram, the server should resend just the dropped packets: 130, 132 and 133.
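To make those off-by-one semantics concrete, here’s a little Rust sketch which expands an ACK frame’s ranges back into packet numbers, using values consistent with the example above (and ignoring any ranges for earlier packets). The arithmetic follows RFC 9000 §19.3.1, but the structure and names are my own:

```rust
/// One "Gap, ACK Range Length" pair from an ACK frame (RFC 9000 §19.3.1).
struct AckRange {
    gap: u64,    // unacknowledged packets between ranges, minus one
    length: u64, // additional acknowledged packets in this range, minus one
}

/// Expand an ACK frame into the packet numbers it acknowledges.
fn acknowledged(largest: u64, first_range: u64, ranges: &[AckRange]) -> Vec<u64> {
    let mut acked = Vec::new();
    // The first range covers `largest` plus `first_range` packets before it.
    let mut smallest = largest - first_range;
    acked.extend(smallest..=largest);
    for r in ranges {
        // Both fields are "minus one": a gap of g skips g + 1 packets,
        // and a length of l acknowledges l + 1 packets.
        let next_largest = smallest - r.gap - 2;
        smallest = next_largest - r.length;
        acked.extend(smallest..=next_largest);
    }
    acked
}

fn main() {
    // The client received packets 131 and 134-137, but not 130, 132 or 133:
    // a gap of 1 skips two packets (133 and 132), and a length of 0 then
    // acknowledges the single packet 131.
    let acked = acknowledged(137, 3, &[AckRange { gap: 1, length: 0 }]);
    assert_eq!(acked, vec![134, 135, 136, 137, 131]);
    println!("acknowledged: {acked:?}");
}
```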

One final note on this process is that just because there are gaps in the packets received doesn’t necessarily mean that data must be ignored until gaps are filled. As we’ll see below, the frames which comprise a stream also contain an offset within the stream, so it’s possible to validate if all frames within a given stream are available. This means that the consequences of dropped packets should only impact those streams whose frames are within the dropped packets — this is one way that the HOL blocking issues are minimised.

Conversely, however, this does potentially make life slightly tricky for those implementing QUIC support. When a packet arrives out of order, it’s quite possible that some frames within it can be passed directly to the application whereas other frames must be held until some earlier dropped packet is resent. This means that it’s going to be very important to do good housekeeping on which frames have been passed on as well as which packets have been successfully acknowledged, both individually and consecutively.

If you want to know more, RFC 9002 has some more detailed discussion of this, as well as of congestion control.

Sending and Receiving Data

Now we know the basic transport mechanisms, let’s take a look at the core of the protocol: how streams of data are transferred in both directions.

Streams

As a replacement for TCP, QUIC also offers a stream abstraction, but the main difference is that it multiplexes multiple streams on to a single connection, natively providing what HTTP/2 had to build for itself on top of TCP. These streams are the level at which QUIC implements reliable delivery — data is always delivered in order within a single stream. A stream may be unidirectional or bidirectional in type, and may be initiated by either the client or the server. In the case of a unidirectional stream, it is always initiated by the peer which wishes to send data.

These two properties are encoded in the lower two bits of a 62-bit stream ID which is unique for every stream within a single connection for its lifetime. To illustrate this, the table below shows the two-bit values which encode these combinations.

LSB Meaning
0x0 Client-initiated bidirectional
0x1 Server-initiated bidirectional
0x2 Client-initiated unidirectional
0x3 Server-initiated unidirectional

It’s worth noting that the stream ID is always sequentially allocated using the remaining bits — it is a strict requirement of the protocol that stream IDs are allocated in numeric order, and this ensures that streams are created in the same order in both peers.
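Decoding those two bits is simple enough to show in a quick Rust sketch; this helper is my own invention, not from any QUIC library:

```rust
#[derive(Debug, PartialEq)]
enum Initiator { Client, Server }

#[derive(Debug, PartialEq)]
enum Direction { Bidirectional, Unidirectional }

/// Decode the two least significant bits of a stream ID, per the table above.
fn stream_type(stream_id: u64) -> (Initiator, Direction) {
    let initiator = if stream_id & 0x1 == 0 { Initiator::Client } else { Initiator::Server };
    let direction = if stream_id & 0x2 == 0 { Direction::Bidirectional } else { Direction::Unidirectional };
    (initiator, direction)
}

fn main() {
    // 0x2 is the first client-initiated unidirectional stream; the next one
    // the client opens will be 0x6, since the upper bits count sequentially.
    assert_eq!(stream_type(0x2), (Initiator::Client, Direction::Unidirectional));
    assert_eq!(stream_type(0x6), (Initiator::Client, Direction::Unidirectional));
    assert_eq!(stream_type(0x1), (Initiator::Server, Direction::Bidirectional));
}
```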

Each STREAM frame generally contains an offset and a length. The offset indicates the position of the first byte of this frame’s payload data within the stream — the first frame in a stream which carries data will have an offset of zero. This offset is how out-of-order receipt of data is detected and corrected. The length field is simply the number of bytes of stream data within this frame, although since the enclosing UDP datagram has a length already, the length can be omitted to indicate that the frame extends to the end of the packet.

Once data has been received by a peer, its receipt is confirmed using the ACK frames we saw earlier — this differs from TCP, in which the acknowledgement is a field within a normal segment. Remember that acknowledgements operate at the packet level: acknowledging a packet implicitly confirms receipt of every STREAM frame, and hence every range of stream bytes, that it carried.

Streams are regarded as transmitting a limited amount of data — since QUIC was designed with HTTP in mind, generally a stream will be a single resource being sent. Once all data has been sent on a stream, the RFC talks about the “final size” of the stream being known, so this assumption of a finite amount of data is fairly explicit — once the final size is set, it cannot be changed for a stream. The individual per-stream behaviour, therefore, is rather like the original HTTP specification where the end of a response was indicated by simply closing the TCP connection.

As well as reliable delivery, streams must conform to flow-control limits set by the peer. This is crucial because implementing an ordered stream with any degree of efficiency necessarily entails buffering of any underlying packets which may arrive out of order. Without any form of limit, a malicious or simply poorly written sender could cause memory exhaustion at the other end by requiring unlimited buffering.

I won’t drill into the full details at this stage, but briefly the limits are somewhat similar to the TCP concept of window size. Each peer advertises how many bytes it’s prepared to receive and this is done both at a per-stream level and also at a global level for the connection — the other peer is required to conform to all such limits at all times. There are additional limits on the number of concurrent streams a peer can initiate without first closing some existing ones, and these are advertised and raised in a similar fashion using MAX_STREAMS frames.

Frame Types

The main frame types involved in sending and receiving data are listed below.

STREAM
These are the main workhorse of streams — receiving one implicitly creates a new stream, and they carry stream payloads. They can also have a FIN flag set which indicates the end of a stream, aside from any retransmission of data which may have been missed.
ACK
Used for acknowledging packets (not frames) as outlined earlier in this article.
RESET_STREAM
Sent by the sending end of a stream to abruptly terminate it, typically when the data in that stream is no longer required. After sending this, the sending end will immediately cease all transmission and retransmission of any data within the stream, and the receiving end can immediately discard any data on the stream it’s received so far and indicate to the application that the stream was reset.
STOP_SENDING
Sent by the receiving end of a stream to indicate it no longer requires the data in this stream. Upon receipt of this frame, the sending end should send a RESET_STREAM as outlined above.
DATA_BLOCKED and STREAM_DATA_BLOCKED
These frames are sent to indicate that the sender has reached an advertised flow-control limit and therefore cannot send any more data, but has more data to send. The difference between them is that DATA_BLOCKED indicates that the per-connection limit has been reached, whereas STREAM_DATA_BLOCKED indicates that the per-stream limit has been reached.
MAX_DATA and MAX_STREAM_DATA
These frames are sent by a receiver to increase the flow control limit either globally (MAX_DATA) or per-stream (MAX_STREAM_DATA). This is how the receiver controls the sending rate, and we’ll look at the specifics of how this works a little later.
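To summarise, here’s how I might represent those frames as a Rust enum. This is a paraphrase of the RFC’s layouts rather than a faithful wire-format definition, with the ACK ranges and various optional fields elided:

```rust
/// An illustrative paraphrase of the stream-related frames; the real wire
/// encodings use variable-length integers and are rather more involved.
enum Frame {
    Stream { stream_id: u64, offset: u64, fin: bool, data: Vec<u8> },
    Ack { largest_acknowledged: u64 }, // ranges elided, see earlier section
    ResetStream { stream_id: u64, error_code: u64, final_size: u64 },
    StopSending { stream_id: u64, error_code: u64 },
    DataBlocked { limit: u64 },
    StreamDataBlocked { stream_id: u64, limit: u64 },
    MaxData { limit: u64 },
    MaxStreamData { stream_id: u64, limit: u64 },
}

fn main() {
    let frame = Frame::MaxData { limit: 1_048_576 };
    if let Frame::MaxData { limit } = frame {
        println!("peer may now send up to {limit} bytes on the connection");
    }
}
```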

Sending Data

Sending data can occur when either:

  • The application opens its own stream, either unidirectional or bidirectional.
  • The peer opens a bidirectional stream.

In both of these cases, the behaviour of the sender follows the same procedure, outlined in the state machine shown below.

QUIC sending state machine

When the stream is first opened, it’s in the READY state, waiting for the application to give it data to send. As soon as the application passes data, the stream moves to the SEND state and will send STREAM, STREAM_DATA_BLOCKED and DATA_BLOCKED messages as appropriate until all data is sent. Once the application indicates an end to the data, a final STREAM is sent with the FIN flag set in it — at this point all transmission is finished, aside from any required retransmissions if data isn’t acknowledged. Once ACK frames have indicated the receipt of all data, the stream is closed and can be discarded.

The path down the right-hand side of the diagram occurs if the stream is reset, at which point all data to be sent is discarded and a RESET_STREAM is immediately sent — at this point the stream stays in RESET SENT until the RESET_STREAM is acknowledged, at which point the stream is closed.
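As a sketch of how this might translate into code, here’s the send-side state machine as a Rust enum. The state names follow RFC 9000 §3.1, but the event names are entirely my own:

```rust
/// The send-side stream states, named after RFC 9000 §3.1.
#[derive(Debug)]
enum SendState {
    Ready,      // waiting for the application to provide data
    Send,       // sending STREAM and *_BLOCKED frames
    DataSent,   // FIN sent; awaiting acknowledgement of all data
    DataRecvd,  // everything acknowledged; terminal
    ResetSent,  // RESET_STREAM sent; awaiting its acknowledgement
    ResetRecvd, // reset acknowledged; terminal
}

impl SendState {
    /// A sketch of the transitions described above; the event strings are
    /// purely illustrative stand-ins for a real API.
    fn on_event(self, event: &str) -> SendState {
        match (self, event) {
            (SendState::Ready, "app_data") => SendState::Send,
            (SendState::Send, "fin_sent") => SendState::DataSent,
            (SendState::DataSent, "all_acked") => SendState::DataRecvd,
            // The right-hand path: a reset can happen from any of the
            // first three states.
            (SendState::Ready | SendState::Send | SendState::DataSent, "reset") => {
                SendState::ResetSent
            }
            (SendState::ResetSent, "reset_acked") => SendState::ResetRecvd,
            (other, _) => other, // ignore events that don't apply
        }
    }
}

fn main() {
    let state = SendState::Ready
        .on_event("app_data")
        .on_event("fin_sent")
        .on_event("all_acked");
    println!("final state: {state:?}"); // DataRecvd
}
```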

Receiving Data

The receiving side of a stream is triggered by any of:

  • Receiving a data or reset frame sent by a peer on either a unidirectional or bidirectional stream — this will be either STREAM, STREAM_DATA_BLOCKED5 or RESET_STREAM6.
  • The application opens its own bidirectional stream.
  • On a bidirectional stream (only) a MAX_STREAM_DATA or STOP_SENDING is received7.
  • If a higher-numbered stream of the same type is opened for any reason. The standard requires streams to be opened in numerical order, so any lower-numbered streams are opened implicitly at this point on the basis that presumably the packets opening those have just been temporarily delayed and will arrive shortly anyway8.

QUIC receiving state machine

You’ll notice the state machine is fairly similar to that for the sending side, but there are differences. There’s no equivalent of the READY state, because the receiver has no concept of waiting for the application — the application waits for data from it instead. The receiver keeps processing STREAM messages from the sending peer, buffering up received data until it has a new chunk of contiguous data, at which point it can pass it to the application by whatever API it’s using and then free up some space in its buffer. It issues MAX_STREAM_DATA and MAX_DATA frames as necessary to indicate to the peer it can send more, which we’ll examine in a moment in the section on flow control.
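That buffering of out-of-order data is probably the fiddliest part of the receive side, so here’s a naive Rust sketch of the idea. It assumes frames never overlap, which a real implementation certainly couldn’t:

```rust
use std::collections::BTreeMap;

/// A naive receive buffer: stores out-of-order STREAM frame payloads keyed
/// by offset, handing contiguous bytes to the application as they complete.
struct RecvBuffer {
    segments: BTreeMap<u64, Vec<u8>>, // stream offset -> data
    delivered: u64,                   // bytes already passed to the application
}

impl RecvBuffer {
    fn new() -> Self {
        RecvBuffer { segments: BTreeMap::new(), delivered: 0 }
    }

    fn on_stream_frame(&mut self, offset: u64, data: Vec<u8>) {
        // A real implementation must cope with overlapping retransmissions
        // and enforce flow-control limits; this sketch assumes neither.
        self.segments.insert(offset, data);
    }

    /// Return the next contiguous run of bytes, if any, freeing buffer space.
    fn read(&mut self) -> Option<Vec<u8>> {
        let mut out = Vec::new();
        while let Some(data) = self.segments.remove(&self.delivered) {
            self.delivered += data.len() as u64;
            out.extend(data);
        }
        if out.is_empty() { None } else { Some(out) }
    }
}

fn main() {
    let mut buf = RecvBuffer::new();
    buf.on_stream_frame(5, b"world".to_vec()); // arrives out of order
    assert!(buf.read().is_none());             // gap at offset 0, nothing readable
    buf.on_stream_frame(0, b"hello".to_vec());
    assert_eq!(buf.read().unwrap(), b"helloworld");
}
```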

Once the receiver sees a STREAM frame with the FIN bit set then it knows the final size of the stream, and at this point it knows exactly how much buffer it will need to store all the remaining data on the stream. That data may or may not have arrived yet, of course, because the packet containing the STREAM with FIN may arrive out of order — this is the circumstance in which a receiver may linger in the SIZE KNOWN state.

Once all the data has arrived the state machine still isn’t quite ready to terminate because the application still needs to read the remaining data from the receiver’s buffer. Once this happens, the buffer is no longer required, the receiver is in DATA READ state and the stream is closed for receive.

Similar to sending, a RESET_STREAM at any point puts the receiver into RESET RECEIVED. Similar to reading data, the application must be informed that the stream was reset, and then the receiver is in RESET READ and the stream is similarly closed. As a final detail, the transition from DATA RECEIVED to RESET RECEIVED is optional — if the receiving peer already has all the data for the stream, it’s free to ignore the RESET_STREAM and just allow the application to read the remaining data.

One aspect which isn’t shown above is that if a peer is no longer interested in the content of a stream while it’s still in the RECEIVING or SIZE KNOWN states, it can send a STOP_SENDING message to its peer. This is a request for the sending end to send a RESET_STREAM and terminate the stream. Once all the data has already been received on the stream, however, this would be redundant.

Data Flow Control

As you will have seen in the earlier sections, the flow control system is intrinsic to QUIC’s design. As with TCP’s window size, it presumes a well-behaved sending peer will hold off on sending data beyond an agreed limit until the receiving peer has updated that limit. It enforces separate limits per-stream, and also another per-connection limit as well.

The initial values of these limits are specified as transport parameters during the handshake process, which we’ll look at towards the end of this article. The crucial aspect is that both these parameters are specified as absolute limits across the lifetime of the connection — they can be updated, but at any given point the limits specify the maximum cumulative total amount of data that can be sent since the start of the connection, either globally or on a specific stream.

Limits are always specified by the receiving party, though of course both peers are both senders and receivers in general, so each has their own separate limit. The limit should initially reflect the amount of memory that the peer has reserved for buffering received data, and this limit is assumed not to change as the buffers are filled. However, once data has been delivered to the application then this frees up more space in the buffer and the peer sends either MAX_DATA or MAX_STREAM_DATA to give the sending peer new limits, which allows it to send more data.

If a peer becomes blocked on sending data to the receiver because it’s hit the buffer limit, it SHOULD send a DATA_BLOCKED message if it’s hit the overall connection limit, or a STREAM_DATA_BLOCKED if it’s hit a per-stream limit. That SHOULD is capitalised because it has the same meaning as in an RFC — the peer is encouraged but not required to do so. This is one reason why it’s important that the receiving peer should proactively send updated MAX_DATA and MAX_STREAM_DATA messages as buffer space is available — since the sending peer isn’t strictly required to indicate that it’s blocked, the connection may become deadlocked if the receiving peer doesn’t unblock it proactively. The second reason is that relying on notifications of being blocked wastes a round-trip time whilst the *_BLOCKED message goes one way and the *_DATA message goes the other. Throughput and latency are both much improved if the sending peer is never actually allowed to reach its limits.
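To illustrate, here’s a small Rust sketch of a receiver proactively topping up the connection-level limit. The half-window replenishment threshold is a heuristic of my own choosing, not something the RFCs mandate:

```rust
/// One direction of connection-level flow control, from the receiver's
/// point of view.
struct FlowControl {
    max_data: u64, // the limit we've advertised to the sending peer
    consumed: u64, // bytes the application has read so far
    window: u64,   // how much credit we grant at a time
}

impl FlowControl {
    /// Called when the application reads data; returns Some(new_limit) if
    /// we should proactively send a MAX_DATA frame to the peer.
    fn on_app_read(&mut self, bytes: u64) -> Option<u64> {
        self.consumed += bytes;
        // Replenish once the peer has used up over half its remaining
        // credit, so that it never actually stalls waiting for MAX_DATA.
        if self.max_data - self.consumed < self.window / 2 {
            self.max_data = self.consumed + self.window;
            Some(self.max_data)
        } else {
            None
        }
    }
}

fn main() {
    let mut fc = FlowControl { max_data: 65536, consumed: 0, window: 65536 };
    assert_eq!(fc.on_app_read(16384), None);         // plenty of credit left
    assert_eq!(fc.on_app_read(32768), Some(114688)); // past half: send MAX_DATA
}
```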

As an aside, it may seem as if the *_BLOCKED messages aren’t a lot of use — after all, if the peer has no buffer space, surely that’s still going to be true even if the sender is blocked? They do serve an important purpose, however, which is preventing an otherwise idle connection from hitting its idle timeout. If you consider the case of a connection with a single stream sending data in only one direction, if the reading application is prevented from reading for a long period, the connection will be idle — the sender is waiting for more space to send data, and the receiver QUIC stack is waiting for the application to read data so that it can free up the sender to send more. However, every connection has an idle timeout, and even though the sender still has data to send, the connection may still idle out if there’s no traffic. For this reason, the sender should periodically repeat its DATA_BLOCKED and STREAM_DATA_BLOCKED messages to make sure the connection stays open.

I can also think of a few potential additional uses for these messages, if peers wish to take advantage of it. The first is priority — it may be possible for the application to prioritise processing of data from streams that are blocked, hence unblocking them faster. Another is automatic tuning behaviour — a QUIC stack may well be running on a server with plenty of RAM, and only be limiting its usage as a matter of policy. Keeping track of how frequently peers are blocked can feed back in and trigger the allocation of more memory for buffers. Finally, this sort of feedback is also useful for operators for much the same reason, so they can determine how frequently the server’s performance has been limited by available memory.

To illustrate the overall exchange, consider the sample request/response exchange over QUIC in the sequence diagram below. It shows not only the frames transferred between the peers, but also simplified interactions between the QUIC stacks and the applications using them. Note that only the flow of stream-related frames is shown for simplicity — the grouping into packets and datagrams, and the use of ACK frames, is ignored.

QUIC flow control example sequence diagram

Of course, many peers can allocate large buffers and might never come close to hitting their limits — this is going to result in optimal throughput on a given link. However, buffering is going to vary a lot — a large server with 512GB RAM is probably going to want to prioritise performance over memory conservation, whereas some home router with an embedded webserver for its configuration interface might have only a hundred KB of memory to use for buffering. Also remember, though, that a large server may be expected to handle thousands of concurrent connections, and as soon as you operate at significant scale then the considerations that are important for embedded developers start to become important even on large servers.

One point that’s worth noting is that this mechanism of flow control means that particularly small buffers are going to have a particularly detrimental impact on performance. This is for two reasons, the first of which is that there’s a round-trip delay when a buffer limit is reached — the sending peer has to wait until the MAX_STREAM_DATA or MAX_DATA message reaches them, and then the data they send has to reach the receiving peer. This is a full round-trip delay where no data can be processed. The second reason is that if the receiving peer has smaller buffers, they’re going to be sending MAX_DATA and MAX_STREAM_DATA messages much more frequently for a given amount of data, which means higher traffic overhead on the connection. Receivers with large buffers have the luxury of sending those messages less frequently, with each one representing a larger step up in available space.

Of course this isn’t just going to depend on the QUIC stack implementation — the largest buffers you might use may eventually fill up if applications don’t read from them. So it’s also going to be important for the application layers to read data promptly and frequently, if they want to maintain high throughput and low latency.

If you want to know all the gory details that I’ve summarised here, there’s a lot of in-depth discussion of congestion control in RFC 9002. In particular, it discusses how QUIC compares to TCP in these regards, as well as quite a few implementation details that are likely only of interest to those actually building a QUIC stack.

Managing Connections

All of the interactions described so far depend on the existence of a “connection”, which must be set up as a pre-requisite for sending frames. At the start of a connection is a handshake which serves several purposes:

  • Establishing a shared secret, which will be used for encryption (discussed later).
  • Confirming that both endpoints are willing to communicate.
  • Establishing connection parameters, such as initial flow control limits.

These and other details about connections are discussed in the following sections.

Connection IDs

Each connection has two unique IDs, which are chosen independently by the two peers — this pair of identifiers is used to disambiguate between multiple connections using the same IP address and port, and also are used to allow the same connection to migrate between networks without interruption — for example, to allow a connection to move uninterrupted from a WiFi to a 5G connection, which are using entirely different IP addresses.

Actually that’s a bit of a simplification — at any point each peer has a pool of connection IDs available. Each packet only ever uses one connection ID, and it can be any of the active IDs associated with the connection — new IDs can be issued as needed, and old ones retired so they’re no longer active. The reason for this complexity is so that different IDs can be used when communicating on different networks, and this means that an eavesdropper has no way to correlate these connections over multiple paths. In particular, the standard requires that implementations allow no way for the multiple connection IDs used on a single connection to be correlated with each other, so that rules out (say) using a value where the high-order bits identify the connection and the low-order bits are an instance number within that connection.

The length of a connection ID is variable and can be chosen by each endpoint independently. In the initial version of the protocol, IDs can be anything from 8 to 20 bytes. It’s also possible for an endpoint to choose a length of zero for the connection ID, which means that endpoint will not expect its peer to use connection IDs. This means that only one connection can use a given IP address and port, and also that the connection loses both the ability to migrate between networks and the protection against being correlated across multiple paths.

The handling of connection IDs is a bit of a complicated little corner and I don’t want to drill into it too much, but at a high level some initial connection IDs are established during handshake, and then either endpoint may issue new connection IDs for its peer to use via the NEW_CONNECTION_ID frame. Each connection ID issued by a peer is assigned a sequence number, sequentially issued starting at zero — this allows its peer to detect duplicates. The peer receiving the new connection ID should add it to the pool of potential IDs it can use for that connection.

The sending peer can indicate that it will no longer use a specified connection ID by sending a RETIRE_CONNECTION_ID frame. The receiving endpoint can also request that its peer retire IDs by setting a sequence number in the Retire Prior To field in the NEW_CONNECTION_ID frame — this should trigger its peer to immediately retire all IDs with a sequence number lower than that value.
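Here’s a rough Rust sketch of maintaining that pool in response to NEW_CONNECTION_ID frames; the structure is my own, and a real stack would also need to send RETIRE_CONNECTION_ID frames for the IDs it retires:

```rust
use std::collections::BTreeMap;

/// The pool of connection IDs our peer has issued to us, keyed by
/// sequence number.
struct CidPool {
    active: BTreeMap<u64, Vec<u8>>,
}

impl CidPool {
    /// Handle an incoming NEW_CONNECTION_ID frame.
    fn on_new_connection_id(&mut self, seq: u64, cid: Vec<u8>, retire_prior_to: u64) {
        self.active.insert(seq, cid);
        // Retire everything with a sequence number below the threshold.
        self.active.retain(|&s, _| s >= retire_prior_to);
    }
}

fn main() {
    let mut pool = CidPool { active: BTreeMap::new() };
    pool.on_new_connection_id(0, vec![0xaa; 8], 0);
    pool.on_new_connection_id(1, vec![0xbb; 8], 1); // retires sequence number 0
    assert_eq!(pool.active.len(), 1);
}
```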

Identifying Packets

When a UDP datagram comes in containing QUIC packets, it must be either matched to an existing connection or, in the case of a server, be deemed to be initiating a new connection. The flowchart below shows the process by which an incoming packet is matched against existing connections.

Matching packets to connections flowchart

In summary, then, the connection ID is used to match against existing connections, unless a zero-length connection ID is found — in that case the IP address and port number are used to match the packet to a connection using the same address and port, and which is also using zero-length connection IDs. The endpoint can either just use the destination IP address and port, or it can use both source and destination address and port. The downside of using just the destination address is that only one connection may be used on a given port — this may be fine for a client, but isn’t any good for a server, which presumably expects to be servicing multiple client connections concurrently. The downside of using both source and destination is that it makes the connection more fragile, not allowing it to survive migration between networks, or any other change which alters the source IP address or port. Ultimately this is why connection IDs were added to the protocol in the first place, so it seems to me a poor trade-off to avoid using them just for the sake of some small simplification of the protocol library code.
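A minimal Rust sketch of that dispatch logic might look like the following, assuming (purely for illustration) a connection table keyed both by connection ID and by address pair:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

struct Connection; // stand-in for a real connection object

struct Endpoint {
    by_cid: HashMap<Vec<u8>, Connection>,
    by_addr: HashMap<(SocketAddr, SocketAddr), Connection>, // (source, destination)
}

impl Endpoint {
    fn match_packet(
        &self,
        dest_cid: &[u8],
        source: SocketAddr,
        destination: SocketAddr,
    ) -> Option<&Connection> {
        if !dest_cid.is_empty() {
            self.by_cid.get(dest_cid)
        } else {
            // Zero-length connection IDs: fall back to the address tuple.
            self.by_addr.get(&(source, destination))
        }
    }
}

fn main() {
    let endpoint = Endpoint { by_cid: HashMap::new(), by_addr: HashMap::new() };
    let src: SocketAddr = "192.0.2.1:50000".parse().unwrap();
    let dst: SocketAddr = "192.0.2.2:443".parse().unwrap();
    assert!(endpoint.match_packet(&[0xab; 8], src, dst).is_none()); // unknown CID
}
```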

On servers, the situation is slightly complicated by the fact that a packet which doesn’t match an existing connection must be regarded as potentially opening a new connection, and further complicated by the fact that this might be from a peer which supports a later version of the QUIC protocol than the server does. When a packet is not matched to an existing connection, the server checks to see whether this packet contains a CRYPTO frame initiating a handshake for a protocol version that it recognises — this handshake process is described towards the end of this article. If the version is recognised and supported, the new connection handshake proceeds as normal.

If the packet is not recognised as being a handshake for the server’s expected version, a simple heuristic is used to determine whether it could be a higher protocol version — the size of the UDP datagram must be at least as large as the minimum datagram size for one of the versions which is supported. If the UDP datagram is large enough, the server sends back a Version Negotiation packet (not frame!) which contains a list of the protocol versions that the server does support, allowing the client to repeat the connection attempt using the highest version that it has in common with the server.

The implication of the heuristic above is that clients must send an initial UDP datagram whose size is at least large enough to be a valid initial packet for all the protocol versions it supports. For example, let’s say a client supports both version 1 and 2, and let’s also suppose that when version 2 is specified it ends up with a smaller initial packet than version 1 had. The client will make an initial connection using version 2, since higher versions are preferred, but it must make the UDP datagram at least as large as the smallest it could have legitimately been under version 1 — this means a server only supporting version 1 will recognise it as a potential connection attempt, and respond with Version Negotiation. If necessary, there is a PADDING frame type which can be added to the packet to make it larger, but which is otherwise ignored.
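As a sketch, the padding itself is simple. The 1,200-byte QUIC v1 minimum below is real, but I’ve simplified the overhead accounting down to a single assumed parameter:

```rust
/// Pad the plaintext payload of a client Initial packet with PADDING frames
/// (each of which encodes as a single 0x00 byte) so that the final datagram
/// reaches QUIC v1's 1,200-byte minimum. The `overhead` parameter stands in
/// for the packet header and AEAD expansion, which a real implementation
/// would calculate precisely.
fn pad_initial_payload(mut frames: Vec<u8>, overhead: usize) -> Vec<u8> {
    const MIN_INITIAL_DATAGRAM: usize = 1200;
    let target = MIN_INITIAL_DATAGRAM.saturating_sub(overhead);
    if frames.len() < target {
        frames.resize(target, 0x00); // 0x00 is a PADDING frame
    }
    frames
}

fn main() {
    let crypto_frames = vec![0xc0; 300]; // pretend these are CRYPTO frame bytes
    let padded = pad_initial_payload(crypto_frames, 48); // 48 bytes of assumed overhead
    assert_eq!(padded.len() + 48, 1200);
}
```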

Establishing Connections

Since one of the main goals of QUIC is to reduce the latency of connection establishment, the protocol connection handshake intrinsically incorporates the cryptographic handshake as well as establishing the transport connection. In this section, we’ll look at how the connection is established, and then in the following section we’ll look at how the negotiated keys are used to encrypt the connection.

TLS 1.3

TLS is a transport-independent protocol used to provide a secure channel between two endpoints. It provides authentication of the server to the client, and optionally the other way around, as well as confidentiality and integrity of the data transmitted. There are several versions of TLS which are not directly compatible, but TLS itself provides a way to negotiate the version in use. TLS can be run over any transport which can provide the semantics of a reliable, ordered byte stream.

As already mentioned, QUIC is based specifically on TLS 1.3; no other versions are supported. To understand in detail what’s going on under the hood you might need a decent familiarity with TLS 1.3, which is specified in RFC 8446, but I’ve included a bare-bones outline here that’s hopefully enough to understand how TLS and QUIC work together.

TLS typically consists of two components:

  • The handshake protocol, which handles authentication, cryptographic protocol negotiation and key exchange.
  • The record protocol, which uses the shared keys established during the handshake to provide secure data transfer.

As we’ll see later, however, only the handshake is used by QUIC — this is facilitated by carrying TLS handshake messages within QUIC frames. Once the handshake is completed, however, instead of then carrying TLS records over QUIC (as would be done over TCP), the QUIC protocol itself takes over and uses the exchanged keys to provide protection for QUIC packets directly.

The handshake itself has three goals:

  1. Key exchange establishes shared secrets and selects the cryptographic protocol. Everything after this is encrypted.
  2. Server parameters are then established, such as whether the client will be authenticated.
  3. Authentication is where the server and, optionally, the client are authenticated to each other.

During key exchange the client initiates a new connection with a ClientHello message. This contains a random nonce, the protocol versions and cipher suites the client supports, key shares for its supported key exchange mechanisms (such as an ephemeral Diffie-Hellman public key), and potentially a number of extensions.

The server then responds with a ServerHello which confirms the selected cryptographic protocol and other parameters and, depending on the key exchange mechanism in use, may contain related data such as the ephemeral key for Diffie-Hellman. The server may send optional messages at this point for extensions indicated in the ClientHello, and to request client authentication.

The server immediately follows this with the authentication phase, where it sends a Certificate message, which contains its certificate, and a CertificateVerify message, whose purpose is to prove that the server has access to the private key associated with the certificate9. Finally, it follows those with a Finished message, with a MAC calculated over the whole handshake.

Once the client receives these it responds with its own Certificate and CertificateVerify, if client authentication was requested by the server, and then its own Finished message. At this point the handshake is complete and both endpoints have shared keys which can be used for further secure communication.

In the common case of no client authentication, and no additional extensions, the message exchange is summarised in the sequence diagram below. Note that this is the flow of messages, which doesn’t necessarily correspond to individual datagrams, as we’ll see shortly.

TLS 1.3 handshake sequence diagram

That’s all I’m going to cover about the TLS side of things — as I mentioned earlier, the RFC is fairly accessible if you want more details. Next we’ll see how that TLS handshake fits inside the QUIC handshake.

QUIC Handshake

QUIC supports both pre-shared key (PSK) and Diffie-Hellman key exchange via TLS. PSK requires prior communication between the endpoints, so isn’t any use for a client connecting to a new server, but if available it does support a performance optimisation called early data or 0-RTT — we’ll discuss this (briefly) later. Diffie-Hellman provides forward secrecy, which means that the encrypted session is still protected even if it’s captured and the long-term secret (i.e. server private key) is compromised at a later date. Both of these features can be used together to provide forward secrecy whilst still benefiting from 0-RTT via PSK. For the purposes of this section, though, we’ll mostly ignore PSK and look at the handshake process of a given client talking to a new server for the first time over QUIC. This is known as a 1-RTT handshake.

To start with, let’s see a diagram which illustrates a QUIC 1-RTT handshake. Don’t worry if some of it doesn’t make sense, I’ll try to explain it all. In particular the packet numbers may look odd — the reason for this will be revealed in the next section.

QUIC Handshake

The TLS handshake itself is carried in CRYPTO frames in QUIC packets, each of which carries a chunk of TLS data. Since TLS is built on the concept of an ordered byte stream, each CRYPTO frame contains fields specifying the offset of this frame within the stream, the length of this data chunk in bytes, and the data itself. As with many QUIC frames, each packet may contain several CRYPTO frames.

Since the very first thing on the QUIC connection is the TLS handshake, the client initiates the connection in the example above by asking the TLS library for the initial bytes of handshake, which will contain the ClientHello message. This is placed in a CRYPTO frame, wrapped in an Initial packet and sent by the client as the opening UDP datagram.

The server’s response is a single datagram, but contains quite a number of things:

  • The TLS ServerHello message within a CRYPTO frame in an Initial packet.
  • The Initial packet also contains an ACK frame for the Initial packet from the client.
  • A Handshake packet which contains a further CRYPTO frame which contains multiple TLS messages:
    • An EncryptedExtensions message, containing QUIC connection parameters10.
    • Certificate and CertificateVerify messages for server authentication.
    • Finished to indicate an end to the TLS handshake.
  • Potentially, a 1-RTT packet containing one or more initial STREAM frames.

The client then responds with:

  • An Initial packet which only holds the ACK frame for the Initial from the server.
  • A Handshake packet which contains the client’s Finished TLS message, and ACKs the server’s Handshake packet.
  • A 1-RTT packet containing STREAM frames with data from the client (e.g. HTTP requests) and ACK frames for the server’s 1-RTT packets.

Finally, the server replies with:

  • A final Handshake packet which ACKs the client’s Handshake.
  • A 1-RTT packet which contains:
    • A special HANDSHAKE_DONE frame, which doesn’t include any payload.
    • STREAM frames for data.
    • ACK frames to acknowledge the client’s 1-RTT packets.

The connection then continues with typically only 1-RTT packets with STREAM and ACK frames, as described earlier in this article.

Packet Types and Namespaces

At different points during the handshake process, each QUIC endpoint is at a specific encryption level. Each level has a different pair of keys, and at any given point in time the level may be different for the sending and receiving direction of each peer. We’ll look at how QUIC obtains these encryption keys in the next section.

You’ll recall from the earlier discussion that QUIC packets all have an incrementing number. The reality is slightly more complicated as there are actually three namespaces for these packet numbers, and the packet numbering starts at zero in each one. The namespaces loosely conform to the levels above, although the 0-RTT and 1-RTT share an “Application Data” namespace, to make detection of dropped and duplicate data easier, since both of these flows carry data from the application.

The types of packet are outlined below, including the encryption level and packet namespace each type is in.

Initial
Encryption keys: Initial secrets
Packet number space: Initial
This packet contains the initial CRYPTO frames from the TLS stack which implement the TLS handshake, and it can also carry ACK frames, as discussed in the Packet Numbers and Acknowledgement section above.
0-RTT
Encryption keys: 0-RTT
Packet number space: Application data
These are used for the early data optimisation which I’ll discuss in a little more detail later. If PSK isn’t in use, they won’t be used. These were not included in the example above.
Handshake
Encryption keys: Handshake
Packet number space: Handshake
Contains further CRYPTO frames after the Initial packets are exchanged. Note that these packets use session encryption keys, unlike Initial packets, so are better protected.
Retry
Encryption keys: Retry
Packet number space: Not applicable
Used if the server wishes to validate the client’s IP address before commencing the cryptographic key exchange, using the simple expedient of sending the client a token which is specific to its address and requiring the client echo that back in its Initial packets. This is designed to make it hard to spoof a client IP and use the server in a traffic amplification attack. I discuss this briefly in the section on Address Validation below, and §8 of RFC 9000 has a more detailed discussion.
Version Negotiation
Encryption keys: Not applicable
Packet number space: Not applicable
This is used to allow endpoints on different protocol versions to negotiate which to use. I discussed it briefly in the section above on Identifying Packets. If you want more details, check out §6 of RFC 9000.
1-RTT
Encryption keys: 1-RTT
Packet number space: Application data
The main workhorse of QUIC, these packets contain STREAM frames carrying data and ACK frames to acknowledge other 1-RTT packets received.

Now we know about these packet number namespaces, perhaps the packet numbers in the handshake example shown above will make more sense. The first packet of each namespace in the same direction is zero, and all further packets in the same namespace and direction increment the count by one each time. The ACK frames contain the number of the packet they’re acknowledging.

Encryption

So far we’ve seen how packets are sent and received, how streams of data are multiplexed over them, and how the connection is established in the first place. The main thing we haven’t seen, however, is how those packets are protected by TLS. This is particularly interesting in the case of QUIC, since it’s not cleanly tunnelled through TLS in the same way as some other protocols, but rather it cooperates with it.

Payload Protection

We’ve already seen how QUIC facilitates the TLS handshake, by embedding the TLS stream in CRYPTO frames. This is all handled directly by the TLS stack with QUIC just acting as an opaque transport layer. However, the encryption process is handled by QUIC using keys passed to it by the TLS stack as the handshake proceeds.

In TLS 1.3, all the ciphers are modelled as Authenticated Encryption with Associated Data (AEAD), which is defined in RFC 5116. This means that they offer confidentiality, integrity and authenticity all at once.

The AEAD function is part of what’s referred to as a cipher suite, which also includes a hash function, and it’s the cipher suite which is negotiated by TLS along with the secrets. For example, the cipher suite TLS_AES_128_GCM_SHA256 breaks down as:

  • TLS starts all cipher suite names.
  • AES_128_GCM is the AEAD algorithm used.
  • SHA256 is the hash algorithm used for key derivation.

Briefly, an AEAD function has four inputs for encryption:

  • A secret key
  • A nonce
  • Associated data, which is to be authenticated but not encrypted
  • Plaintext, to be encrypted

The output is a ciphertext, which is sent to the other party along with the associated data. The reverse decryption operation requires this ciphertext along with the key, nonce and associated data. When provided with these, the decryption operation either provides the original plaintext, or an error to indicate that the data was not authentic. We’ll now see how QUIC uses this function.

A packet consists of a header and a payload, and only the payload is directly encrypted. The key to perform this encryption is derived from the TLS secret in the appropriate direction using HKDF, a HMAC-based key derivation function specified in RFC 5869. The inputs to the key derivation function are the secret and a string label, which in QUIC is a fixed string depending on the type of data required — “quic key” is used to derive the payload encryption key, “quic iv” is used to derive the initialisation vector (IV), and “quic hp” is used to derive the header protection key. We’ll see what the header protection key is used for in a moment.

You’ll recall from the discussion above that the AEAD function also requires a nonce as input, and this is derived from the XOR of the IV and the packet number, which makes the nonce unique for each packet. The associated data for the AEAD is the entirety of the packet header, including the packet number. The plaintext is, of course, the packet payload.
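The nonce construction is simple enough to sketch in a few lines of Rust. This follows RFC 9001 §5.3, assuming the 12-byte IV used by the cipher suites defined for QUIC v1:

```rust
/// Derive the per-packet AEAD nonce: left-pad the packet number to the
/// width of the IV and XOR the two (RFC 9001 §5.3).
fn packet_nonce(iv: &[u8; 12], packet_number: u64) -> [u8; 12] {
    let mut nonce = *iv;
    let pn = packet_number.to_be_bytes(); // 8 bytes, big-endian
    for (n, p) in nonce[4..].iter_mut().zip(pn.iter()) {
        *n ^= p;
    }
    nonce
}

fn main() {
    // With an all-zero IV the nonce is just the left-padded packet number.
    let iv = [0u8; 12];
    assert_eq!(packet_nonce(&iv, 1)[11], 1);
}
```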

Header Protection

At this stage in the process, the packet now comprises the header in plaintext and the encrypted payload. Before transmission of the packet, however, an additional process called header protection is applied.

Header protection is applied selectively to parts of the header to disguise them from inspection. The parts of the header currently protected are the packet number and the key phase bit, which is discussed in the section Key Updates. A number of bytes of the ciphertext at a specific offset are sampled, where the number of bytes depends on the negotiated cipher suite, and these bytes are encrypted using the header protection key which was mentioned above. This produces 5 bytes of data which are XORed with the portion of the header to be protected.

After applying header protection, the protected header and the payload together form the packet which is ready to be placed within a UDP datagram and sent.
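Since the protection is just an XOR, applying and removing it are the same operation, as this Rust sketch shows. Producing the mask itself, by encrypting the ciphertext sample with the header protection key, is elided here:

```rust
/// Apply (or, since XOR is self-inverse, remove) header protection. The
/// `mask` is the 5 bytes produced from the ciphertext sample.
fn apply_header_protection(first_byte: &mut u8, packet_number: &mut [u8], mask: &[u8; 5]) {
    // For short-header (1-RTT) packets the low 5 bits of the first byte are
    // masked, hiding the key phase bit and the packet number length; long
    // headers mask only the low 4 bits (0x0f).
    *first_byte ^= mask[0] & 0x1f;
    // The encoded packet number (1-4 bytes) is XORed with the rest of the mask.
    for (b, m) in packet_number.iter_mut().zip(&mask[1..]) {
        *b ^= m;
    }
}

fn main() {
    let mut first = 0x41u8;
    let mut pn = [0x00, 0x2a]; // a 2-byte encoded packet number
    let mask = [0x7f, 0xaa, 0xbb, 0xcc, 0xdd];
    apply_header_protection(&mut first, &mut pn, &mask);
    apply_header_protection(&mut first, &mut pn, &mask); // removing = reapplying
    assert_eq!((first, pn), (0x41, [0x00, 0x2a]));
}
```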

The reason for hiding the packet number is that it could be used to correlate a single connection across multiple paths — as we saw earlier with the use of multiple connection IDs, QUIC goes to some lengths to make this more difficult.

Encryption Keys

So now that we’ve seen how encryption keys are used to protect packets, let’s look at which keys are used at each stage of a connection.

In the Packet Types and Namespaces section above, you can see for each packet type there is a pair of encryption keys to be used to encrypt and decrypt packets of that type. These always proceed in sequence, and TLS itself is only ever at one encryption level at a time — however, the send and receive directions may be at different levels at any given moment. Because QUIC may need to retransmit previous packets, however, it must keep the keys from previous encryption levels around so that the appropriate key can be used.

The keys used at each level are as described below.

Initial Secrets
For the Initial packets, no keys have been negotiated on the connection yet. Encryption is still used, but using a key derived from the Destination Connection ID in the client’s first Initial packet. HKDF, a HMAC-based key derivation function specified in RFC 5869, is used with a fixed salt specified in RFC 9001, and some other predefined operations which are predictable on both server and client. This offers no real protection since the connection ID is recorded in plaintext in the packet header11.
0-RTT
The use of 0-RTT packets depends on reusing encryption parameters and keys from a previous connection, and so keys must already be available to both the client and server in this case. Servers are at liberty to reject 0-RTT data and the client will resend it in conventional 1-RTT packets once the handshake is complete.
Handshake
As soon as the ClientHello and ServerHello messages have been exchanged, the TLS stacks at both ends are able to construct some session keys for the remainder of the handshake, and these are passed from TLS to QUIC for its encryption of the Handshake packets. In the TLS RFC, these are referred to as client_handshake_traffic_secret and server_handshake_traffic_secret.
1-RTT
At the completion of the TLS handshake, a new set of keys is available and should be used for all further communications. In particular, these should replace any 0-RTT keys as soon as available. In TLS these are client_application_traffic_secret_0 and server_application_traffic_secret_012.
Retry
There is a special process for deriving a key for calculating an integrity tag for Retry packets, which any entity which observed the Initial packet can perform. This is a bit of an obscure wrinkle, so I’m not going to discuss it further here.
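
As promised above, here’s a short Rust sketch of the Initial secret derivation from RFC 9001 §5.2, assuming the hkdf and sha2 crates; the fixed salt is the QUIC version 1 value from the RFC:

```rust
use hkdf::Hkdf;
use sha2::Sha256;

// Fixed salt for QUIC version 1, from RFC 9001 §5.2.
const INITIAL_SALT: [u8; 20] = [
    0x38, 0x76, 0x2c, 0xf7, 0xf5, 0x59, 0x34, 0xb3, 0x4d, 0x17,
    0x9a, 0xe6, 0xa4, 0xc8, 0x0c, 0xad, 0xcc, 0xbb, 0x7f, 0x0a,
];

// HKDF-Expand-Label from TLS 1.3 (RFC 8446 §7.1): the label is given a
// "tls13 " prefix and wrapped in an HkdfLabel structure with an empty
// context.
fn hkdf_expand_label(hk: &Hkdf<Sha256>, label: &str, len: usize) -> Vec<u8> {
    let mut info = Vec::new();
    info.extend_from_slice(&(len as u16).to_be_bytes());
    let full_label = format!("tls13 {label}");
    info.push(full_label.len() as u8);
    info.extend_from_slice(full_label.as_bytes());
    info.push(0); // zero-length context
    let mut out = vec![0u8; len];
    hk.expand(&info, &mut out).expect("length is valid for SHA-256");
    out
}

// Both sides can compute these, because the client's first Destination
// Connection ID travels in plaintext.
fn initial_secrets(client_dcid: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let hk = Hkdf::<Sha256>::new(Some(&INITIAL_SALT), client_dcid);
    let client = hkdf_expand_label(&hk, "client in", 32);
    let server = hkdf_expand_label(&hk, "server in", 32);
    (client, server)
}
```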

Key Updates

At any time after the handshake is completed and the connection is using 1-RTT packets, either endpoint may initiate a key update, where the keys used for encryption are updated. Some cipher suites have limits placed on how much data may be encrypted with a given key without impairing security, which may be a reason why an endpoint wishes to move to a new key. Instead of using TLS KeyUpdate messages, QUIC has its own way of triggering a key update — another difference from TLS is that an update always changes keys in both directions, whereas in TLS each endpoint’s key can be updated independently.

To initiate a key update, an endpoint first creates a new secret for itself using the standard TLS approach. This involves feeding the existing secret, along with the label "quic ku", into the HKDF function, and using the result as the new secret from which keys are derived. Because both sides already have both secrets, this process can be carried out at each end independently for each direction, and no new keys need ever be sent across the wire.
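
As a sketch, reusing the hkdf_expand_label helper from the Initial-secrets example above (and again assuming the hkdf and sha2 crates):

```rust
// Derive the next generation of a 1-RTT secret per RFC 9001 §6; the
// actual packet protection keys are then derived from this new secret.
fn next_secret(current_secret: &[u8]) -> Vec<u8> {
    let hk = Hkdf::<Sha256>::from_prk(current_secret)
        .expect("secret is the size of a SHA-256 hash");
    hkdf_expand_label(&hk, "quic ku", 32)
}
```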

Once the new key is generated, the endpoint starts sending its 1-RTT packets encrypted with it, toggling the key phase bit to indicate to its peer that the key has changed; this mechanism means that no dedicated packet or frame type is required to signal the update. Since the flag is a single bit, however, an endpoint cannot initiate a further key update until it has received an acknowledgement of at least one packet sent with the new key, since otherwise two successive updates would be indistinguishable from one.

Upon receiving a packet whose key phase bit differs from that of the most recent packet it sent, the other endpoint also updates the keys it’s using for both sending and receiving. From this point on the new keys are always used; it’s an error to go back to older keys on the same connection.

There are some additional details in the RFC which you might not at first have considered, such as the fact that each endpoint should have the next key pre-generated so that a pause in traffic whilst key generation occurs can’t be used as a side-channel indication that a key update has occurred. I’m going to skip over that level of detail here, however.

Other Features

I think we’ve covered the core features of interest in QUIC. There are a few additional features that have only had the briefest of mentions thus far, however, and in this final section I’ll give an outline of what I consider to be the important ones.

Path MTU Discovery

One issue that modern protocols need to consider is determining the maximum size of packet that can be sent over a given network link without resorting to fragmentation — this is known as the path MTU (PMTU), where MTU is maximum transmission unit.

It might appear that this isn’t so important for anything built on IP, since IP fragmentation means, in principle at least, that even if an overly large datagram is sent it should still arrive intact at its destination. In practice, relying on fragmentation is considered quite fragile for reasons explained at length in RFC 8900. Additionally, although IPv6 still supports fragmentation at source, it doesn’t allow intermediate routers to further fragment packets, which also makes relying upon it a chancy business at best. As a result, modern protocols do their best to determine, and conform to, the PMTU so the need for fragmentation is avoided.

QUIC starts with a basic assumption that it can send IP packets of at least 1,280 bytes, and padding is added to any datagram which contains an Initial packet so that it hits this limit — this means that if the MTU is smaller than this, and some middlebox along the way doesn’t support fragmentation, then the initial handshake will fail. This is considerably less inconvenient than a connection failing later, halfway through transferring some data.

Implementations can choose to stick to this minimum, or they can attempt to use larger datagrams for better performance — to use larger datagrams, however, they need to first validate whether they can do so reliably on a given network path.

There are two approaches QUIC implementations can use to perform PMTU discovery:

PMTU Discovery
The first option is an approach specified in two separate RFCs, RFC 1191 for IPv4 and RFC 8201 for IPv6. This relies on probing the path with different packet sizes, using the “do not fragment” bit in IPv4, and relying on appropriate ICMP error packets coming back indicating when packet sizes are too large. Multiple packet sizes can be used to home in on a PMTU accurate to within an acceptable interval. There are some nuances to relying on ICMP, such as guarding against spoofed ICMP packets being used for denial-of-service, which are discussed in the QUIC RFC. On a personal note I’d also say relying on ICMP is a tricky business, since a lot of middleboxes don’t like generating ICMP packets, or severely rate-limit them, for hand-wavy “security reasons”.
Datagram Packetisation Layer PMTU Discovery
The snappily titled DPLPMTUD, specified in RFC 8899, is another method of PMTUD which doesn’t rely on ICMP errors; as a result it’s more robust and should probably be preferred. This also involves probing the connection with different sizes, potentially using PADDING frames to increase the size of datagrams where necessary, but instead of relying on ICMP errors the receipt of an ACK, or lack thereof, is used as an indication of successfully navigating the network path. The RFC seems to recommend, but doesn’t mandate, probe packets that contain simply a PING frame, which elicits an ACK but is otherwise ignored, and an appropriately sized PADDING frame; there’s a sketch of this just below.

See §14 of RFC 9000 for more discussion about packet sizes and PMTUD.
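
For illustration, here’s what the payload of such a probe might look like in Rust; note that this ignores the packet header and AEAD tag overhead, which a real implementation would have to subtract from the probe size:

```rust
// Build the plaintext frames for a probe of the given size: a single
// PING frame (type 0x01), then PADDING frames (each a single 0x00
// byte) filling the rest. If an ACK comes back, the path supports
// datagrams of this size; if not, probe smaller.
fn probe_frames(probe_size: usize) -> Vec<u8> {
    let mut frames = vec![0u8; probe_size]; // zero bytes are PADDING frames
    frames[0] = 0x01; // PING: elicits an ACK but is otherwise ignored
    frames
}
```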

Connection Migration

QUIC connections rely on endpoints maintaining the same address through the TLS handshake, but after this the use of connection IDs allows a given connection to be migrated over to a completely different address without being interrupted. QUIC provides negotiated options to disallow active migration, although addresses can still change unintentionally by, for example, NAT rebinding.

Clients always initiate a migration — servers are assumed to have stable addresses and not require migration, whereas clients are assumed to change address more frequently. If a client receives packets from a different server address, it must discard them. It is possible for a server to include a parameter in the TLS negotiation for a “preferred address”, to which it would like clients to migrate after the handshake is completed, but it’s still the client’s decision whether to do so.

If you’re wondering how a client initiates a migration, the short answer is that it simply starts using a new address. You’ll recall from the earlier discussion of connection IDs that peers should always use a new connection ID when communicating on a new network path. The RFC has quite a bit of discussion on this, but the upshot is that it’s effectively optional because the change of address might happen outside of the peer’s knowledge or control — for example, an intervening NAT box may reassign the peer to a new external IP address. Nonetheless, peers are supposed to refrain from intentional migration unless they have a spare connection ID to use — as a result, endpoints should always ensure their peers are supplied with a decent number of connection IDs.

There’s actually an additional subtlety in that some frame types count as probing types, which can be used to test a new network path before migrating to it. Receipt of any packet which contains only these frames does not constitute a migration, and the receiving peer should continue to use the old address until a non-probing packet arrives. Probing frames are:

  • PATH_CHALLENGE
  • PATH_RESPONSE
  • NEW_CONNECTION_ID
  • PADDING

When a peer is going to start sending to a new address, it must first repeat path validation, the process of which is discussed in another section below. The first two frame types in the list above are used for path validation; the fact that these are probing frames allows an endpoint to validate a potential new path without disturbing actual traffic flow. If the path validation fails, the new path can be ignored without further impact; if it’s successful, the client can start sending non-probing frames on it to effect the migration.
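
A minimal sketch of the rule in Rust, using the frame type codes from RFC 9000 §12.4 — a packet counts as a probing packet only if every frame within it is a probing frame:

```rust
// PADDING (0x00), NEW_CONNECTION_ID (0x18), PATH_CHALLENGE (0x1a) and
// PATH_RESPONSE (0x1b) are the probing frame types.
fn is_probing_packet(frame_types: &[u64]) -> bool {
    frame_types
        .iter()
        .all(|&t| matches!(t, 0x00 | 0x18 | 0x1a | 0x1b))
}
```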

§9 of RFC 9000 goes into a lot more detail on this process, and also outlines various ways that an attacker could try to subvert the process by spoofing addresses and forwarding packets. I’m not going to cover all these details here, I just wanted to give a flavour of the procedure.

Connection Termination

Connections can be terminated in three ways:

Idle timeout
Either endpoint may specify a max_idle_timeout transport parameter during the handshake, and both endpoints are expected to silently drop the connection if no traffic occurs within an interval of this length. If the application is genuinely idle, but the connection should remain open, endpoints can send a PING frame periodically, which prompts an ACK but is otherwise ignored. As well as providing traffic to reset the idle timer, this also demonstrates that the other endpoint is still available and listening.13
Immediate close
To terminate the connection actively, either endpoint can send a CONNECTION_CLOSE frame. This is done after a protocol violation is detected, when the application using QUIC has suffered an error, or simply as the most rapid way to dispose of a connection that’s no longer required after the application-level protocol running over QUIC has agreed a graceful shutdown of its own. Having sent one of these, the endpoint should only send further CONNECTION_CLOSE frames in response to received packets; the usual packet number rules are exempted here to avoid implementations having to maintain too much state, so they can simply resend the same closing packet over and over. Once an endpoint receives one of these frames, it may send its own single CONNECTION_CLOSE in response, but must not otherwise respond to packets, or the two could end up ping-ponging CONNECTION_CLOSE frames back and forth endlessly.
Stateless reset
As an option of last resort for an endpoint which receives a packet that it cannot attribute to any active connection, it can send a stateless reset in response. An example of where this might occur is if an application crashes during handling of an active connection and loses all its state; any packets it receives after a restart may be from previously active connections. To support this, whilst a connection is active endpoints may send their peer a stateless reset token as part of a NEW_CONNECTION_ID frame. If the endpoint subsequently loses its state, it can respond to an unrecognised packet with one containing random data that ends with this token. The use of random data aims to prevent an eavesdropper from distinguishing this from any other encrypted packet on the connection. Of course, if the endpoint has crashed and lost its state then it’s presumably also lost the token that it sent earlier; reset tokens should therefore be calculated using some repeatable HMAC process, so the endpoint which lost state can regenerate them. The stateless reset is always sent in response to a received packet, which will contain the connection ID chosen by the endpoint which lost state, so that can be used as the input to the HMAC to generate the token.

[Diagram: QUIC stateless reset]
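
As an illustration of the repeatable HMAC idea, here’s a minimal sketch assuming the hmac and sha2 crates; the static key and the exact construction are up to the implementation, this is just one plausible shape:

```rust
use hmac::{Hmac, Mac};
use sha2::Sha256;

// Derive a 16-byte stateless reset token from a static key (which
// survives restarts, e.g. loaded from configuration) and the
// connection ID the token was advertised for. After a crash, the same
// computation over the Destination Connection ID of an incoming packet
// regenerates the token without any per-connection state.
fn reset_token(static_key: &[u8], connection_id: &[u8]) -> [u8; 16] {
    let mut mac = Hmac::<Sha256>::new_from_slice(static_key)
        .expect("HMAC accepts any key length");
    mac.update(connection_id);
    let digest = mac.finalize().into_bytes();
    let mut token = [0u8; 16];
    token.copy_from_slice(&digest[..16]);
    token
}
```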

Address Validation During Connect

As briefly mentioned earlier, address validation is a process designed to mitigate a potential traffic amplification attack. For example, if an attacker wishes to perform a denial of service (DoS) attack on a target address T, then one aim would be to generate massive amounts of traffic which can overwhelm the ability of host T to handle it. A cheap way to do this is to find “amplifiers”, or other hosts which can be prompted into unwittingly sending unsolicited traffic to T directly.

To see how this might work, imagine a hypothetical UDP-based live video streaming protocol where a client sends a small request in a single UDP datagram, and the server immediately starts sending back a stream of UDP video frames. An attacker could spoof T as the source address of a UDP request to a server, and it would immediately start sending video frames to T — if the attacker does this on many such servers, it generates significantly more traffic directed at T than it could have generated itself.

To mitigate against this attack, QUIC servers are required to limit the amount of data they send in response to an initial connection request — this limit is three times the amount of data in the initial request, which is still an amplification, but only a modest one. This limit applies until the address has been validated, which is essentially having received a response from the client which demonstrates that the address given was indeed genuine, and not spoofed by an attacker.

Because this limit is based on the size of the initial packet(s) from the client, however, QUIC requires that clients ensure a minimum size for these, to make sure the server can send its proper response and remain standards compliant. A client must ensure that any UDP datagram it sends containing an Initial packet has a total payload of at least 1,200 bytes, limiting the server’s responses to 3,600 bytes before the client’s address is validated. In the presence of packet loss this can still leave things deadlocked, however, as a server cannot resend without breaching the limit; RFC 9002 has some discussion of how this is avoided which I won’t go into here.
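
Here’s a tiny Rust sketch of how a server might track this limit; the names are mine, and it glosses over exactly when validation flips to true:

```rust
// Track the anti-amplification limit of RFC 9000 §8: until the peer's
// address is validated, we may send at most three times the number of
// bytes we've received from that address.
struct AntiAmplification {
    bytes_received: u64,
    bytes_sent: u64,
    address_validated: bool,
}

impl AntiAmplification {
    fn on_datagram_received(&mut self, len: u64) {
        self.bytes_received += len;
    }

    fn on_datagram_sent(&mut self, len: u64) {
        self.bytes_sent += len;
    }

    fn may_send(&self, len: u64) -> bool {
        self.address_validated || self.bytes_sent + len <= 3 * self.bytes_received
    }
}
```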

Address validation is carried out both on initial connection, and also during connection migration. During initial connection, the simplest expedient is to receive a valid Handshake packet back from the other endpoint, which demonstrates that the endpoint is both a willing participant in the connection, and is also in possession of the handshake secrets from TLS.

The server also has the option of performing address validation before starting the cryptographic handshake. This is done by sending a Retry packet with a randomly-generated token, which the client must then include in a further Initial packet; this demonstrates that the client really can receive packets sent to its claimed address.

To avoid this overhead on future connections, and to validate any 0-RTT data included, the server can send additional tokens using a NEW_TOKEN frame in any 1-RTT packet. These tokens can be used within the Initial packet sent by clients on future connections.

See §8.1 of RFC 9000 if you would like to sift through the finer details.

Path Validation

During connection migration, both peers use a process called path validation to ensure they will still be reachable by each other after the migration. Put simply, it validates that packets sent from a particular address and port on the local side to a particular address and port on the remote side are received by the remote peer successfully. You might imagine that acknowledgements would provide sufficient confirmation of this, but the RFC declares that these contain insufficient entropy and could therefore be spoofed too easily by an attacker.

Path validation is conducted by a straightforward challenge-response approach. The initiating endpoint sends the other a PATH_CHALLENGE frame, which contains an unpredictable payload. The other endpoint then simply echoes this payload back in a PATH_RESPONSE frame. The sending endpoint is also required to pad the packet (with a PADDING frame) to at least 1,200 bytes, at which point a successful response indicates two things:

  1. The endpoint address used for a connection migration is genuine, and wasn’t spoofed by an attacker.
  2. The path MTU allows datagrams of at least 1,200 bytes, the minimum required by the QUIC standards.
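
Here’s a sketch of building the frames for such a challenge in Rust, assuming the rand crate and ignoring header and AEAD overhead:

```rust
use rand::RngCore;

// PATH_CHALLENGE is frame type 0x1a followed by 8 unpredictable bytes
// (RFC 9000 §19.17); the rest of the payload is zero-byte PADDING
// frames, bringing the datagram up to the 1,200-byte minimum. The peer
// must echo the 8 bytes back in a PATH_RESPONSE (type 0x1b) frame.
fn path_challenge_frames() -> (Vec<u8>, [u8; 8]) {
    let mut data = [0u8; 8];
    rand::thread_rng().fill_bytes(&mut data);
    let mut frames = vec![0u8; 1200]; // zero bytes are PADDING frames
    frames[0] = 0x1a;
    frames[1..9].copy_from_slice(&data);
    (frames, data) // keep the data to verify the PATH_RESPONSE later
}
```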

One slightly tricky detail is that the anti-traffic-amplification requirements outlined earlier might mean that the initiating endpoint isn’t permitted to send a 1,200 byte datagram on the new path. If so, it can send a smaller challenge to first validate the endpoint address wasn’t spoofed, and once this is the case the limitations no longer apply so it can repeat the challenge with a properly padded frame to separately validate the path MTU. I have all these sorts of fun details ahead of me as I implement this stack!

See §8.2 of RFC 9000 if you want to know more about the details of path validation.

0-RTT Data

QUIC’s efforts to get to usable data transfer as quickly as possible aren’t limited to just reducing the round-trips involved in the initial handshake — it’s also possible for the initiating endpoint to send data before the secure connection is even fully established. In the case of HTTP, for example, this might be an initial GET request, which is small and may well fit in an initial datagram. This is known as 0-RTT data.

This is only possible where a client is connecting back to a server to which it’s previously connected, and furthermore requires it to have persistently stored some of the parameters from that previous connection to reuse for the 0-RTT data. The parameters that it must store are a combination of TLS and QUIC parameters, including the key from the previous connection to use for encryption.

This uses the early_data extension in TLS 1.3, described in §4.2.10 of RFC 8446, which I’m not going to discuss here; suffice to say that the client indicates its intention to use 0-RTT in the ClientHello, and also includes a 0-RTT packet in that same first datagram, which itself contains STREAM frames as normal.

The server may accept or reject 0-RTT — if it’s accepted, it will process the STREAM frames as normal, although the parameters and encryption keys are replaced and normal 1-RTT transfers are used as soon as the handshake for the new connection is complete. If the server rejects 0-RTT, it indicates such in its initial response and the client knows that it needs to resend the data once the handshake is completed.

QUIC imposes a limit of 7 days between the previous and new connections — after this delay, 0-RTT is not permitted.

This optimisation is interesting, but I suspect it’ll be much easier to simply ignore it during an initial implementation: a client using the library won’t send it, and a server will reject it. Among other things, it slightly complicates the API to create a new connection, since the application has to already have some data ready to pass to whatever function opens that connection. Normally you think of a socket-type model, where a connection is first opened and returns some sort of handle, then you send data using that handle14.
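
To make that concrete, here’s a purely hypothetical API shape; none of these names come from any real library, it’s just an illustration of where the wrinkle shows up:

```rust
use std::net::SocketAddr;

pub struct Connection;

impl Connection {
    /// Socket-style model: open the connection first, get a handle
    /// back, then send data through the handle.
    pub fn connect(addr: SocketAddr) -> Connection {
        unimplemented!()
    }

    /// 0-RTT model: the stored parameters from a previous connection
    /// and the early data must both be supplied up front, before any
    /// handle exists to send through.
    pub fn connect_with_early_data(
        addr: SocketAddr,
        stored_session: &[u8],
        early_data: &[u8],
    ) -> Connection {
        unimplemented!()
    }
}
```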

Conclusions

That’s it for my look at QUIC — I think this article took about as long to write as any other I’ve attempted, and required a fair amount of back-and-forth among the RFCs. The decision to split the protocol across multiple RFCs in particular makes things quite fragmented (no pun intended), but given that I have the luxury of going into considerably less detail than the RFCs, I can understand the reluctance to draft everything into one massive monolithic document.

The protocol itself seems generally sensible and has some clever aspects, such as the aggressively accelerated handshake and the maximal usage of every individual datagram. The simple acknowledgement and packet numbering systems are also refreshing after the complexity that TCP puts into these. The fact that stream data is split into frames, but acknowledgement is done at the higher packet level, keeps acknowledgement overhead lower than if each frame had to be acknowledged independently. The fact that acknowledgements don’t need to be contiguous is also clever, but comes with a complexity cost for implementations.

I do feel that certain aspects were a little over-engineered, however, and might more beneficially have been regarded as extensions to the core protocol rather than included in the basic functionality. The flow control approach, where each end is required to take action to grant its peer more capacity to send data, seems a little heavier than is perhaps justified. It’s likely going to take some significant tweaking of the strategy for allocating buffer space to make sure you’re not killing performance by continually throttling the other endpoint, particularly on high-latency connections where a lot of data needs to be in flight at any given time to keep throughput decent. I’m not saying it doesn’t have its advantages; the question is simply whether the benefits justify the complexity in the common cases in which QUIC will be used.

That said, it’s reassuring to note that the protocol authors didn’t try to reinvent everything: their use of UDP was a sensibly pragmatic step to ensure that it should be supported on most existing network equipment, although it may fall foul of networks with particularly aggressive security policies, which block anything they don’t explicitly recognise. QUIC’s use within the Chrome browser means that most widely used ISPs will probably not interfere with this traffic, although their reaction to expanding P2P traffic in the noughties is an indication that they’re not afraid to step in and disrupt traffic that they feel destabilises their networks. It won’t be until (and if) QUIC’s adoption really ramps up that we’ll know for sure.

The use of TLS is similarly pragmatic, although QUIC’s novel use of the encryption keys might require additional APIs which I’m not certain all TLS stacks will offer, at least initially. Its use with HTTP/3 should hopefully ensure there’s at least one decent option for each major language, however.

The final observation I have is that the flexibility of QUIC is also going to raise interesting questions about the APIs which a QUIC stack offers to applications. If you consider TLS over TCP, a single socket abstraction is fine, since you just have a single data stream. In QUIC, however, you have a single connection abstraction with multiple data streams multiplexed on top. This implies that you’ll need to expose some sort of initial connection identifier, through which some functionality is accessed, but then offer additional APIs to initiate further streams and also allow applications to accept streams that their peers have initiated. One assumes that the protocol running over QUIC (e.g. HTTP/3) will define how the application determines when these actions need to be taken, but there are all sorts of fun details to sort out; a few things that occur to me just off the top of my head are:

  • Should applications need to poll for peer-initiated streams, or use a registered callback?
  • Should applications be told the final size of a stream as soon as it’s known, or simply be told there’s no more data once they’ve read it all?
  • Should applications be involved and/or notified of connection migration events?
  • How do applications reset an active stream? How are they notified when their peer resets one?
  • Should applications be involved in buffering limits and flow control?
  • Should 0-RTT data be explicitly requested by the application, or somehow transparently used by the QUIC stack?
  • Who should be responsible for storing transport parameters for future 0-RTT, the application or the QUIC stack?

I’m sure there are a ton of other details to figure out once I start to consider how this API’s going to be structured, but these initial thoughts give you a flavour of some of the many things to consider when designing the interface. I’m intentionally not doing too much interface design now until I’ve gone through the HTTP/3 standards in more detail, as this is a concrete use-case which, along with HTTP/2, will help guide some of these design choices.

That’s it for this article — I hope that’s been of interest and/or some use, and I especially hope that I didn’t make any major mistakes in my interpretation of the standards. In the next article in this series I’ll cover my investigations into HTTP/3 about which, at time of writing, I know basically nothing except that it’s the successor to HTTP/2 and apparently it runs over QUIC.


  1. Yes I know that makes it sound like a client not a server, and that’s why it amuses me. I have a particularly contrarian sense of humour at times. 

  2. I mean, if you did ask that I might remind you that I did give you at least a brief answer in the second paragraph of the article. But I’m nothing if not accommodating, so don’t worry, I’ll explain it again… 

  3. I’m just going to need to take a moment while I come to terms with the fact that was almost eight years ago…! 

  4. Let’s be clear that TLS 1.3 support is by no means a given even now. At time of writing, the SSL Pulse report shows only 60% of the most popular websites currently support TLS 1.3. Even if the server itself supports it, it’s quite possible that a lot of users may be directed through proxies and other middleboxes which refuse to support TLS 1.3. 

  5. Note that a peer may send DATA_BLOCKED if it can’t send data on a stream it has just created, but this cannot implicitly open the receiving state machine for a stream simply because it’s not a stream-specific message and contains no stream ID. 

  6. It would be unusual for the first frame you see on a new stream to be a RESET_STREAM, but it’s possible if earlier STREAM frame(s) were dropped. 

  7. These messages may also be received on locally-initiated unidirectional streams, but in this case there is no need for a receiving state machine to be used, which is why this case isn’t included in the list. 

  8. Those of you paying attention may realise that this would leave peers vulnerable to a resource-exhaustion attack, where a malicious peer opens a very high-numbered stream, causing large amounts of resources to be consumed by opening all the lower-numbered streams as well. RFC 9000 calls this a stream commitment attack. The mitigation for this is that peers limit the maximum number of concurrent open streams and, properly chosen and enforced, this limit helps to cap the overall per-connection resources required within the peer. 

  9. In TLS these messages can actually be omitted if the server is using some other means to authenticate itself, such as using pre-shared keys instead of Diffie-Hellman key exchange, but in the case of HTTPS it’s always certificate-based authentication. 

  10. If you’re familiar with older versions of TLS, note that EncryptedExtensions is new in TLS 1.3. This removes a portion of what would have been in the ServerHello in earlier versions of TLS, and allows it to be encrypted. 

  11. The RFC doesn’t really go into detail why Initial packets have encryption applied, given that it’s trivial for any eavesdropper to derive the keys. I’m guessing it might be more about having a consistent workflow for all packet types than any security-related concern. 

  12. The reason for the _0 suffix is that either side may update its sending keys at any point within the TLS protocol. This is done by deriving new secrets in an iterative fashion from the previous one, and deriving keys from those. 

  13. As a point of interest, the QUIC RFC references RFC 4787, which is a “best practices” document for UDP implementations traversing NAT, which states in §4.3 / REQ-5 that UDP NAT mappings must persist for at least two minutes. However, the QUIC standard goes on to say that in their experience a PING resend interval of 30 seconds was required to keep the mapping active in many NAT implementations in production use. This is a useful reminder of why the robustness principle is popular — although a good engineer should strive to be RFC-compliant, they can never assume that other implementations are. My opinion is somewhat mixed, in that being overly liberal in what you accept leads to erosion of standards and quality across the industry, but it’s a nuanced position which isn’t best elucidated in an already overly long footnote. 

  14. In principle you could make this transparent to the application by briefly delaying the connection initiation and allowing the application to send an initial chunk of data using its standard interface, in an approach a little like Nagle’s algorithm. In my opinion this sort of “hide the complexity” approach in network stacks serves only to make life more painful for application developers, not less. If you expose the complexity in an optional fashion, you conform to the “pay only for what you use” mentality which keeps things intuitive and predictable, whilst still allowing simple interfaces for simple cases. 

The next article in the “HTTP/3 in Practice” series is HTTP/3 in Practice — HTTP/3
Thu 13 Apr, 2023