QUIC Protocol - Internals • luminary.blog

Packet Formats

At the wire level, QUIC defines two packet header families:

Long header: Used during connection establishment (Initial, Handshake, Retry, 0-RTT).
Short header: Used for established 1-RTT traffic.

Long-Header Packets

Fields (conceptual, in order):

Flags (1 byte):
- Header form (1 = long header).
- Fixed bit (must be 1, helps middleboxes distinguish QUIC from random UDP).
- Long packet type: 2 bits (Initial, 0-RTT, Handshake, Retry).
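- Reserved bits: 2 bits (must be zero once header protection is removed).
- Packet number length: 2 bits (encoded length minus one, covered by header protection).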
Version (4 bytes): QUIC version (e.g., 0x00000001 for v1).
DCID length (1 byte) + DCID (0–20 bytes): Destination Connection ID.
SCID length (1 byte) + SCID (0–20 bytes): Source Connection ID.
Token length + Token (Initial only): For address validation (Retry, anti-amplification).
Type-specific fields (e.g., length field as varint for remaining packet size).
Packet number (1–4 bytes): Encrypted/obfuscated with header protection.
Encrypted payload: TLS handshake messages (Initial/Handshake), or early data (0-RTT) as QUIC frames.

From an implementation standpoint:

You parse the first byte to determine long vs short header and packet type.
CID handling is critical for routing: many server deployments terminate QUIC on a fleet of workers keyed by CID, so CID often encodes a routing hint or is looked up in a connection table.
You construct the header in cleartext, run AEAD over the payload (the full header, packet number included, is the associated data, and the packet number is also mixed into the nonce), then apply header protection (HP), which masks the packet number bytes and the low bits of the first byte using a separate key.
Receivers must undo HP first to recover the packet number length and value, then validate AEAD.
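
The first parsing step above can be sketched as follows. This is a minimal illustration that assumes QUIC v1 only (no version negotiation handling) and stops before the type-specific fields; the names LongHeader and parse_long_header are hypothetical and not taken from any particular library.

```python
from dataclasses import dataclass

# Long packet type values for QUIC v1 (bits 0x30 of the first byte).
LONG_PACKET_TYPES = {0: "Initial", 1: "0-RTT", 2: "Handshake", 3: "Retry"}
MAX_CID_LEN = 20  # QUIC v1 limit on connection ID length


@dataclass
class LongHeader:
    packet_type: str
    version: int
    dcid: bytes
    scid: bytes
    rest_offset: int  # index of the first type-specific byte


def _take(buf: bytes, pos: int, n: int):
    """Return buf[pos:pos+n] and the new position, or fail on truncation."""
    if pos + n > len(buf):
        raise ValueError("truncated long header")
    return buf[pos:pos + n], pos + n


def parse_long_header(datagram: bytes) -> LongHeader:
    first, pos = _take(datagram, 0, 1)
    if not first[0] & 0x80:
        raise ValueError("header form bit is 0: this is a short header")
    if not first[0] & 0x40:
        raise ValueError("fixed bit is 0: not a QUIC v1 packet")
    packet_type = LONG_PACKET_TYPES[(first[0] >> 4) & 0x03]
    raw_version, pos = _take(datagram, pos, 4)
    cids = []
    for _ in range(2):  # DCID first, then SCID
        cid_len, pos = _take(datagram, pos, 1)
        if cid_len[0] > MAX_CID_LEN:
            raise ValueError("connection ID longer than 20 bytes")
        cid, pos = _take(datagram, pos, cid_len[0])
        cids.append(cid)
    version = int.from_bytes(raw_version, "big")
    return LongHeader(packet_type, version, cids[0], cids[1], pos)
```

Everything parsed here sits outside header protection, which is why a load balancer can route on the DCID without holding any keys.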

Short-Header Packets (1-RTT Data)

Used after handshake completes:

Flags (1 byte):
- Header form = 0 (short).
- Fixed bit = 1.
- Spin bit (optional, for RTT measurement, often disabled in privacy-sensitive deployments).
- Reserved bits.
- Key phase bit (tells you which key epoch to use for decryption; toggles when you roll traffic keys).
- Packet number length bits.
DCID (Destination Connection ID): Only the DCID, no SCID, no version, no token, no length field.
Packet number (1–4 bytes, protected). The sender truncates it to the fewest bytes that are still unambiguous; you reconstruct the full PN based on context.
AEAD-encrypted payload containing frames (STREAM, ACK, MAX_DATA, CONNECTION_CLOSE, etc.).
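
As a rough illustration of the flag layout listed above, the sketch below decodes the first byte of a short-header packet. It assumes header protection has already been removed (the low five bits are masked on the wire) and that the receiver knows its own DCID length, since the short header does not carry one. ShortHeader and parse_short_header are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ShortHeader:
    spin_bit: bool
    key_phase: int
    pn_length: int  # packet number length in bytes (1 to 4)
    dcid: bytes
    pn_offset: int  # index of the first packet number byte


def parse_short_header(packet: bytes, dcid_len: int) -> ShortHeader:
    """Decode an unprotected QUIC v1 short header (1-RTT packet)."""
    if len(packet) < 1 + dcid_len:
        raise ValueError("truncated short header")
    first = packet[0]
    if first & 0x80:
        raise ValueError("header form bit is 1: this is a long header")
    if not first & 0x40:
        raise ValueError("fixed bit is 0: not a QUIC v1 packet")
    return ShortHeader(
        spin_bit=bool(first & 0x20),
        key_phase=(first >> 2) & 0x01,
        pn_length=(first & 0x03) + 1,
        dcid=packet[1:1 + dcid_len],
        pn_offset=1 + dcid_len,
    )
```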

Key design point: packet numbers are never retransmitted; “retransmission” is moving frames to a fresh packet number, which simplifies loss detection and avoids ambiguity.

Implementation note: PN reconstruction is subtle. Given a packet number truncated to N bytes and the largest packet number successfully processed so far in that packet number space, you choose the full PN that matches the truncated bits and lies closest to the next expected value (similar to TCP extended sequence numbers). Getting this wrong feeds the wrong nonce to AEAD, so the packet fails to decrypt and looks like loss to the peer.
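
The reconstruction rule can be written in a few lines. The sketch below follows the sample decoding algorithm in Appendix A.3 of RFC 9000; the function name and argument names are illustrative.

```python
def decode_packet_number(largest_pn: int, truncated_pn: int, pn_nbits: int) -> int:
    """Recover a full packet number from its truncated encoding.

    largest_pn:   largest packet number processed so far in this PN space
    truncated_pn: value read from the header after removing header protection
    pn_nbits:     number of bits in the encoding (8, 16, 24 or 32)
    """
    expected_pn = largest_pn + 1
    pn_win = 1 << pn_nbits
    pn_hwin = pn_win // 2
    pn_mask = pn_win - 1
    # Keep the high bits of the expected value, substitute the low bits.
    candidate_pn = (expected_pn & ~pn_mask) | truncated_pn
    # Pick whichever of candidate, candidate +/- window is nearest to expected.
    if candidate_pn <= expected_pn - pn_hwin and candidate_pn < (1 << 62) - pn_win:
        return candidate_pn + pn_win
    if candidate_pn > expected_pn + pn_hwin and candidate_pn >= pn_win:
        return candidate_pn - pn_win
    return candidate_pn
```

With the example values from the RFC, a largest processed PN of 0xa82f30ea and a 16-bit encoding of 0x9b32 decode to 0xa82f9b32.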

Frames and Stream Mapping

QUIC packets contain a sequence of frames. As an implementer, you’ll build:

A generic frame parser (frame type encoded as a varint, one byte for every type in RFC 9000, followed by a type-specific body).
A dispatcher that routes frames to subsystems: streams, ACK handler, flow control, connection management, path validation, etc.
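
To make the parser half concrete, here is a small sketch of QUIC's variable-length integer decoding and a STREAM frame parser built on it. The varint rule and the STREAM flag bits (0x04 offset, 0x02 length, 0x01 FIN) come from RFC 9000; the StreamFrame type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a QUIC varint at buf[pos]; return (value, next position)."""
    if pos >= len(buf):
        raise ValueError("truncated varint")
    first = buf[pos]
    length = 1 << (first >> 6)  # top two bits select 1, 2, 4 or 8 bytes
    if pos + length > len(buf):
        raise ValueError("truncated varint")
    value = first & 0x3F
    for b in buf[pos + 1:pos + length]:
        value = (value << 8) | b
    return value, pos + length


@dataclass
class StreamFrame:
    stream_id: int
    offset: int
    fin: bool
    data: bytes


def parse_stream_frame(buf: bytes, pos: int = 0) -> Tuple[StreamFrame, int]:
    """Parse one STREAM frame (types 0x08..0x0f); return (frame, next position)."""
    frame_type, pos = decode_varint(buf, pos)
    if not 0x08 <= frame_type <= 0x0F:
        raise ValueError("not a STREAM frame")
    stream_id, pos = decode_varint(buf, pos)
    offset = 0
    if frame_type & 0x04:  # OFF bit: explicit offset present
        offset, pos = decode_varint(buf, pos)
    if frame_type & 0x02:  # LEN bit: explicit length present
        length, pos = decode_varint(buf, pos)
        if pos + length > len(buf):
            raise ValueError("STREAM data runs past end of packet")
    else:  # no length: data extends to the end of the packet
        length = len(buf) - pos
    data = buf[pos:pos + length]
    return StreamFrame(stream_id, offset, bool(frame_type & 0x01), data), pos + length
```

A dispatcher would call decode_varint on the next byte, look up the frame type, and hand the buffer to the matching parser in a loop until the payload is consumed.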

Important frame types (as seen at the transport layer):

STREAM: Carries ordered, reliable data within a stream. Includes Stream ID, offset, length, FIN flag.
ACK / ACK_ECN: Acknowledges ranges of packet numbers (uses ACK ranges instead of individual numbers, like TCP SACK but mandatory and richer). Optional ECN counters.
CRYPTO: Carries TLS handshake bytes for the various encryption levels (Initial, Handshake, 1-RTT).
MAX_DATA / MAX_STREAM_DATA: Flow control window updates (connection-level vs per-stream).
MAX_STREAMS / STREAMS_BLOCKED / DATA_BLOCKED: Stream count and flow-control coordination.
CONNECTION_CLOSE: Signals immediate close with an error code (transport vs app error codes).
PATH_CHALLENGE / PATH_RESPONSE: Used for path validation (e.g., verifying a new address for migration).
NEW_CONNECTION_ID / RETIRE_CONNECTION_ID: Manage multiple CIDs for anti-correlation and migration.
PING / PADDING: Keepalive, path validation, padding.
DATAGRAM (if extension is enabled): Unreliable messages.

For HTTP/3, each request/response is mapped to a bidirectional stream for request/response body, plus unidirectional streams for control (QPACK encoder/decoder, H3 control stream). On the QUIC side, this is just “stream id 0,1,2,… with directional semantics derived from LSBs.”
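
The "directional semantics derived from LSBs" rule is short enough to show directly. In QUIC v1 the lowest bit of a stream ID identifies the initiator and the second-lowest bit the directionality; the helper name below is illustrative.

```python
from typing import Tuple


def classify_stream(stream_id: int) -> Tuple[str, str]:
    """Return (initiator, direction) for a QUIC v1 stream ID."""
    initiator = "server" if stream_id & 0x01 else "client"
    direction = "unidirectional" if stream_id & 0x02 else "bidirectional"
    return initiator, direction
```

So client requests in HTTP/3 use stream IDs 0, 4, 8, and so on, while the client's control and QPACK streams come from the 2, 6, 10 series.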

In a typical HTTP/3 stack, your app never sees these frames directly; your QUIC library does. But you do need to understand the flow control primitives and how your library exposes backpressure signals (e.g., “can’t write more, flow-controlled”).

TLS 1.3 Handshake Over QUIC

Conceptually, QUIC has four crypto levels: Initial, Handshake, 0-RTT, and 1-RTT. Each has separate keys. Think in three layers:

UDP datagrams carrying QUIC packets.
QUIC packets carrying CRYPTO frames.
CRYPTO frames carrying TLS 1.3 handshake messages.

1-RTT Initial Connection

Client:

Picks Destination CID (server chooses what to respond with later) and Source CID.
Constructs an Initial long-header packet with:
- CRYPTO frame containing TLS 1.3 ClientHello (ALPN includes “h3” for HTTP/3, SNI, supported groups, key shares, etc.).
- Possibly PADDING frames to meet minimum size requirements (e.g., 1200 bytes).
- Possibly 0-RTT packet(s) in parallel if resuming and ALPN + config support it.
Sends UDP datagram with this Initial.

TLS in QUIC carries “QUIC transport parameters” in an extension of the ClientHello to advertise limits such as initial_max_data, initial_max_streams_bidi, max_idle_timeout, etc.
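
For a feel of what that extension body looks like, the sketch below serializes a few integer-valued transport parameters as (ID, length, value) triples, each encoded as a QUIC varint. The parameter IDs are the RFC 9000 values; the function names and the choice to handle only integer parameters are simplifications made for this example.

```python
from typing import Dict

# Transport parameter IDs from RFC 9000 (integer-valued ones only).
MAX_IDLE_TIMEOUT = 0x01          # milliseconds
INITIAL_MAX_DATA = 0x04          # bytes
INITIAL_MAX_STREAMS_BIDI = 0x08  # stream count


def encode_varint(value: int) -> bytes:
    """Encode an integer as a QUIC varint using the shortest form."""
    if value < 0:
        raise ValueError("varints are unsigned")
    for prefix, nbytes in ((0x00, 1), (0x40, 2), (0x80, 4), (0xC0, 8)):
        if value < 1 << (8 * nbytes - 2):
            encoded = bytearray(value.to_bytes(nbytes, "big"))
            encoded[0] |= prefix  # top two bits carry the length
            return bytes(encoded)
    raise ValueError("value does not fit in 62 bits")


def encode_transport_parameters(params: Dict[int, int]) -> bytes:
    """Serialize integer transport parameters as id/length/value triples."""
    out = bytearray()
    for param_id, value in params.items():
        body = encode_varint(value)
        out += encode_varint(param_id) + encode_varint(len(body)) + body
    return bytes(out)
```

For example, a 30-second idle timeout and a 1 MiB connection window serialize to 12 bytes: 01 04 80007530 04 04 80100000.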

Server:

Receives Initial, validates it (including anti-amplification rules: can only send limited bytes before address validation).
Runs TLS server handshake logic over the CRYPTO stream:
- Generates ServerHello, EncryptedExtensions, Certificate, CertificateVerify, Finished.
Sends one or more packets:
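- Initial packet with a CRYPTO frame carrying the ServerHello (plus an ACK for the client's Initial).
- Handshake packet(s) with the remaining handshake CRYPTO frames (EncryptedExtensions through Finished), which are protected with Handshake keys.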
May send early ACK frames, flow control frames, and even application data encrypted with 1-RTT keys once the handshake is far enough along.

Client:

Processes server Initial + Handshake packets, feeds CRYPTO frames into TLS stack.
When TLS reports “handshake complete,” client:
- Installs 1-RTT keys.
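- Sends Handshake + 1-RTT packets with ACKs, possibly HTTP/3 control streams and requests.
Initial keys are discarded as soon as an endpoint starts using Handshake keys (the client when it first sends a Handshake packet, the server when it first successfully processes one). Handshake keys are discarded once the handshake is confirmed.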

Sequence diagram (logical):

Client                                          Server
  |                                               |
  |--- Initial (CH in CRYPTO) ------------------->|
  |<-- Initial (SH in CRYPTO) --------------------|
  |<-- Handshake (cert, CV, Finished in CRYPTO) --|
  |--- Handshake ACK / Finished in CRYPTO ------->|
  |<== 1-RTT HTTP/3 streams =====================>|
  |                                               |

(CH = ClientHello, SH = ServerHello, CV = CertificateVerify; EncryptedExtensions travels in the server's Handshake packets alongside the certificate)

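Initial keys and the Initial PN space are dropped as soon as Handshake packets are flowing. Handshake keys and the Handshake PN space stay alive, still carrying ACKs and any retransmitted CRYPTO data, until the handshake is confirmed (the server signals this to the client with a HANDSHAKE_DONE frame), and are then discarded.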

0-RTT Resumption

If the client has a valid session ticket:

Client’s first flight includes:
- ClientHello with PSK + early data indication.
- 0-RTT data in 0-RTT packets (protected by early data keys) containing HTTP/3 requests on streams.
Server:
- Validates ticket, optionally accepts 0-RTT, and responds similarly to 1-RTT flow.
You must implement anti-replay controls and application-level idempotence for 0-RTT requests (e.g., safe for GETs, careful with POSTs).
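
One common shape for the application-level part of that rule is a gate in front of the request handler. The sketch below is an assumption about how such a gate might look, not a prescribed design: it admits only safe HTTP methods while the request arrived in 0-RTT, and otherwise signals that the caller should answer with 425 Too Early (the status defined in RFC 8470) so the client retries after the handshake. The received_in_0rtt flag is assumed to be exposed by your QUIC library.

```python
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})
TOO_EARLY = 425  # HTTP status code from RFC 8470


def admit_request(method: str, received_in_0rtt: bool) -> bool:
    """Return True if the request may be processed now.

    Requests that arrived in 0-RTT can be replayed by an attacker, so only
    methods without side effects are admitted before the handshake completes.
    A False result means: respond with TOO_EARLY instead of running the handler.
    """
    if not received_in_0rtt:
        return True
    return method.upper() in SAFE_METHODS
```

This does not replace transport-level anti-replay measures such as single-use tickets or a time-bounded replay window; it only limits the damage when a replay gets through.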

Infra Engineering Considerations

The TLS stack is not a separate layer; your QUIC core drives TLS, pulls/pushes CRYPTO frames, and takes key updates as soon as TLS says so.
You must handle three parallel PN spaces (Initial, Handshake, Application) with independent ACK/loss logic.

Congestion Control and Loss Detection

QUIC’s spec describes the framework; most real-world implementations ship NewReno or CUBIC, plus variants of BBR. The protocol is congestion control algorithm agnostic — you keep the wire format fixed and swap algorithms behind an abstract “congestion controller” interface.

Loss Detection Algorithm

You track per-packet:

Packet number, send time, size, crypto level.
Whether it contains ack-eliciting frames (e.g., STREAM, CRYPTO).

Core ideas:

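RTT measurement: An ACK yields an RTT sample only when its largest acknowledged packet is newly acknowledged and at least one newly acknowledged packet was ack-eliciting; the sample is the time elapsed since that largest acknowledged packet was sent. Maintain smoothed RTT and RTT variance akin to TCP.
Time threshold loss: A packet sent before an acknowledged packet is marked lost once it has been outstanding for longer than a multiple of the RTT (RFC 9002 recommends 9/8 × max(smoothed RTT, latest RTT)).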
Packet threshold loss: A packet is marked lost if it is missing while later packets (with higher PNs) are acknowledged, beyond a certain threshold (e.g., 3 packets).
PTO (Probe Timeout): Replaces traditional RTO; if no packets are acknowledged within a timeout based on RTT and variance, send probe packets to re-ignite ACKs and detect loss. PTO is computed from smoothed RTT, RTT variance, and max ack delay.
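
A compact way to see how these pieces fit is an RTT estimator that also computes the PTO. The sketch below uses the constants recommended by RFC 9002 (initial RTT of 333 ms, 1 ms timer granularity) and the PTO formula smoothed_rtt + max(4 × rttvar, granularity) + max_ack_delay. The class layout is illustrative, and it updates the variance before the smoothed value, as classic TCP estimators (RFC 6298) do.

```python
from dataclasses import dataclass

K_INITIAL_RTT = 0.333  # seconds, used before the first sample
K_GRANULARITY = 0.001  # seconds, assumed timer granularity


@dataclass
class RttEstimator:
    latest_rtt: float = 0.0
    min_rtt: float = 0.0
    smoothed_rtt: float = K_INITIAL_RTT
    rttvar: float = K_INITIAL_RTT / 2
    has_sample: bool = False

    def update(self, latest_rtt: float, ack_delay: float, max_ack_delay: float) -> None:
        """Feed one RTT sample (seconds) taken from the largest newly acked packet."""
        self.latest_rtt = latest_rtt
        if not self.has_sample:
            self.has_sample = True
            self.min_rtt = latest_rtt
            self.smoothed_rtt = latest_rtt
            self.rttvar = latest_rtt / 2
            return
        self.min_rtt = min(self.min_rtt, latest_rtt)
        # Trust the peer's reported delay only up to its advertised maximum.
        ack_delay = min(ack_delay, max_ack_delay)
        adjusted = latest_rtt
        if latest_rtt >= self.min_rtt + ack_delay:
            adjusted = latest_rtt - ack_delay
        self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.smoothed_rtt - adjusted)
        self.smoothed_rtt = 0.875 * self.smoothed_rtt + 0.125 * adjusted

    def pto(self, max_ack_delay: float, backoff_count: int = 0) -> float:
        """Probe timeout in seconds, doubled for each consecutive PTO expiry."""
        base = self.smoothed_rtt + max(4 * self.rttvar, K_GRANULARITY) + max_ack_delay
        return base * (2 ** backoff_count)
```

For the Initial and Handshake spaces, max_ack_delay is taken as zero, since peers are expected to acknowledge those packets immediately.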

Implementation approach:

Maintain three packet number spaces (Initial, Handshake, Application) with their own loss timers and ACK state. You don’t mix ACKs between spaces.
When ACKs arrive:
- Update RTT and variance.
- Mark any packets newly acknowledged.
- Use threshold rules to mark some outstanding packets as lost.
- Queue lost frames for retransmission in new packets (never reuse old packet numbers).
When a PTO fires, send one or two ack-eliciting packets in the PN space whose timer expired, then double the PTO (exponential backoff) until an ACK arrives. A PTO expiry by itself does not declare any packet lost.
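
The threshold step in the ACK handling above might look like the sketch below. It assumes one instance of this bookkeeping per packet number space, that newly acknowledged packets have already been removed from the sent map, and it uses the RFC 9002 recommended thresholds (3 packets, 9/8 of the RTT). SentPacket and detect_lost are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List

K_PACKET_THRESHOLD = 3
K_TIME_THRESHOLD = 9 / 8
K_GRANULARITY = 0.001  # seconds


@dataclass
class SentPacket:
    packet_number: int
    time_sent: float  # seconds, from a monotonic clock
    size: int         # bytes


def detect_lost(sent: Dict[int, SentPacket], largest_acked: int, now: float,
                smoothed_rtt: float, latest_rtt: float) -> List[SentPacket]:
    """Remove and return packets deemed lost after processing an ACK."""
    loss_delay = max(K_TIME_THRESHOLD * max(smoothed_rtt, latest_rtt), K_GRANULARITY)
    lost_send_time = now - loss_delay
    lost = []
    for pn in sorted(sent):
        if pn > largest_acked:
            continue  # only packets sent before an acked packet can be lost
        pkt = sent[pn]
        too_old = pkt.time_sent <= lost_send_time
        too_far_behind = largest_acked >= pn + K_PACKET_THRESHOLD
        if too_old or too_far_behind:
            lost.append(pkt)
    for pkt in lost:
        del sent[pkt.packet_number]
    return lost
```

The frames carried by the returned packets are what you queue for retransmission under fresh packet numbers.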

NewReno in QUIC

State variables:

cwnd (congestion window, in bytes).
ssthresh (slow-start threshold).
bytes_in_flight (sum of sizes of packets sent but not yet acknowledged).
recovery_start_time (to track whether you’re in recovery).

Algorithm sketch:

On connection start:
- cwnd = min(10 × max_datagram_size, max(2 × max_datagram_size, 14720)).
- ssthresh = ∞ (or a large value).
On ack of previously unacknowledged packet:
- If cwnd < ssthresh (slow start): cwnd += acked_bytes (classic exponential growth).
- Else (congestion avoidance): cwnd += max_datagram_size × acked_bytes / cwnd (additive increase per RTT).
On detecting loss:
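- If the lost packet was sent after recovery_start_time (i.e., this is a new congestion event):
  - ssthresh = cwnd / 2.
  - cwnd = max(ssthresh, 2 × max_datagram_size) (fast reduction, never below the minimum window).
  - recovery_start_time = now.
- Do not reduce cwnd again for lost packets that were sent at or before recovery_start_time; they belong to the congestion event you already reacted to (classic fast recovery notion).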
Send logic:
- Only send if bytes_in_flight + packet_size <= cwnd.
- Every time you send, increment bytes_in_flight; on ACK or loss, decrement by that packet’s size.

This is basically TCP NewReno but with QUIC’s better ACK signaling and PTO logic.
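
Put together, the sketch above fits in a small class. This is a simplified illustration rather than a complete controller: it ignores pacing, persistent congestion, and ECN, it clamps the window to a floor of two datagrams on loss, and it identifies "already in recovery" by comparing the packet's send time to recovery_start_time. The class and method names are made up for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class NewReno:
    max_datagram_size: int = 1200
    cwnd: int = 0
    ssthresh: float = math.inf
    bytes_in_flight: int = 0
    recovery_start_time: Optional[float] = None

    def __post_init__(self) -> None:
        mds = self.max_datagram_size
        self.cwnd = min(10 * mds, max(2 * mds, 14720))

    def can_send(self, size: int) -> bool:
        return self.bytes_in_flight + size <= self.cwnd

    def on_sent(self, size: int) -> None:
        self.bytes_in_flight += size

    def _in_recovery(self, time_sent: float) -> bool:
        # Packets sent at or before the start of recovery belong to the
        # congestion event that has already been handled.
        return self.recovery_start_time is not None and time_sent <= self.recovery_start_time

    def on_ack(self, size: int, time_sent: float) -> None:
        self.bytes_in_flight -= size
        if self._in_recovery(time_sent):
            return  # no window growth for packets sent before recovery began
        if self.cwnd < self.ssthresh:
            self.cwnd += size  # slow start
        else:
            self.cwnd += self.max_datagram_size * size // self.cwnd  # congestion avoidance

    def on_loss(self, size: int, time_sent: float, now: float) -> None:
        self.bytes_in_flight -= size
        if self._in_recovery(time_sent):
            return  # already reacted to this congestion event
        self.recovery_start_time = now
        self.ssthresh = max(self.cwnd // 2, 2 * self.max_datagram_size)
        self.cwnd = int(self.ssthresh)
```

Keeping this behind a narrow interface (on_sent, on_ack, on_loss, can_send) is what lets you swap in CUBIC or BBR later without touching the loss detector.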

CUBIC

Same signals as NewReno but uses CUBIC’s cubic growth function in congestion avoidance:
- W(t) = C(t - K)^3 + W_max (window is a cubic function of time since last loss).
Faster ramp-up in high BDP environments, better utilization on long-fat networks.
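Requires an accurate monotonic clock (the growth function depends on elapsed time since the last loss, so wall-clock jumps would distort it) and good RTT estimates.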
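
The growth function is easy to evaluate numerically. The sketch below measures the window in datagrams and assumes the commonly used constants C = 0.4 and a multiplicative decrease factor β = 0.7, with K chosen so that the curve starts at β × W_max right after a loss and returns to W_max at t = K. Function names are illustrative.

```python
def cubic_k(w_max: float, beta: float = 0.7, c: float = 0.4) -> float:
    """Time in seconds for the window to climb back to w_max after a loss."""
    return (w_max * (1 - beta) / c) ** (1 / 3)


def cubic_window(t: float, w_max: float, beta: float = 0.7, c: float = 0.4) -> float:
    """W(t) = C * (t - K)^3 + W_max, with t in seconds since the last loss."""
    k = cubic_k(w_max, beta, c)
    return c * (t - k) ** 3 + w_max
```

With W_max = 100 datagrams, the window restarts at 70, flattens out as it approaches 100 after roughly 4.2 seconds, then accelerates again to probe for more bandwidth.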

BBR

Model-based, not loss-based:
- Estimates bottleneck bandwidth and RTT (BDP).
- Sets cwnd ≈ BDP to keep pipe full but not bloated.
Works in phases (STARTUP, DRAIN, PROBE_BW, PROBE_RTT).
More complex to implement but pairs very well with QUIC’s richer timing and ACK data.
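
The BDP arithmetic behind the cwnd target is simple. The sketch below is only the core calculation, not a BBR implementation: it takes the maximum delivery rate and the minimum RTT from recent samples, multiplies them, and applies a gain. The windowing of samples, the gain value, and the four-datagram floor are assumptions made for this example.

```python
from typing import Sequence


def bdp_bytes(bottleneck_bw_bps: float, min_rtt_s: float) -> float:
    """Bandwidth-delay product in bytes for a rate in bits per second."""
    return bottleneck_bw_bps / 8 * min_rtt_s


def target_cwnd(delivery_rate_samples_bps: Sequence[float],
                rtt_samples_s: Sequence[float],
                cwnd_gain: float = 1.0,
                max_datagram_size: int = 1200) -> int:
    """Estimate a cwnd target from recent delivery-rate and RTT samples."""
    btl_bw = max(delivery_rate_samples_bps)  # max filter: best rate seen
    min_rtt = min(rtt_samples_s)             # min filter: propagation delay
    cwnd = cwnd_gain * bdp_bytes(btl_bw, min_rtt)
    return max(round(cwnd), 4 * max_datagram_size)
```

For a 96 Mbit/s bottleneck and a 62.5 ms minimum RTT, the BDP is 750,000 bytes, or about 625 full-size datagrams in flight.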

In practice: start with NewReno (simplest, spec-aligned), then plug in CUBIC/BBR.

Sequence Diagrams

Basic Handshake Plus First HTTP/3 Request

Client (C) and Server (S). Timelines show encryption level in brackets.

C → S (Initial / [Initial keys])
- CRYPTO(ClientHello)
- PADDING (transport parameters are not separate frames; they travel inside the ClientHello)
S → C (Initial / [Initial keys])
- CRYPTO(ServerHello)
S → C (Handshake / [Handshake keys])
- CRYPTO(EncryptedExtensions, Certificate, CertificateVerify, Finished)
C → S (Handshake / [Handshake keys])
- CRYPTO(Finished)
Now both sides derive 1-RTT keys.
C → S (1-RTT / [1-RTT keys])
- STREAM (HTTP/3 control stream: SETTINGS, etc.)
- STREAM (Request stream: HEADERS frame for GET /index.html; a GET normally carries no body)
S → C (1-RTT / [1-RTT keys])
- STREAM (Response headers and body on same stream ID)
- ACK frames for received data

In parallel, each side drops its Initial keys and PN space once it is using Handshake keys, and keeps acknowledging and retransmitting in the Handshake PN space until the handshake is confirmed, at which point Handshake keys and that PN space are discarded too.

Connection Migration (Wi-Fi → LTE)

Assume established 1-RTT connection:

C on Wi-Fi → S
- Short-header 1-RTT packets, DCID = server’s chosen ID, from (IP1, port1).
C moves to LTE, new 4-tuple (IP2, port2). It sends:

C(IP2,port2) → S
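- Short-header packet with some STREAM + PATH_CHALLENGE frames. On a passive NAT rebinding the DCID is unchanged; on a deliberate migration the client should switch to an unused DCID that the server issued earlier via NEW_CONNECTION_ID, so the two paths cannot be linked.
S sees a DCID it issued arriving from a new address, replies: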

S → C(IP2,port2)
- PATH_RESPONSE with challenge value
- Maybe STREAM/ACK frames
Once C receives PATH_RESPONSE, path is validated. C (or S) may then update their “active path” and start sending all traffic on LTE. Old path can be retired after a timeout.
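
The challenge/response bookkeeping is small. The sketch below shows both roles under simplifying assumptions: frames are handled as raw bytes (type byte plus 8 bytes of data, types 0x1a and 0x1b in QUIC v1), there is no timeout or retry logic, and a response is accepted regardless of the path it arrives on. PathValidator and respond are hypothetical names.

```python
import os
from typing import Dict, Optional, Tuple

Address = Tuple[str, int]
PATH_CHALLENGE = 0x1A
PATH_RESPONSE = 0x1B


class PathValidator:
    """Tracks outstanding PATH_CHALLENGE data for paths being validated."""

    def __init__(self) -> None:
        self._pending: Dict[bytes, Address] = {}

    def challenge(self, addr: Address) -> bytes:
        """Build a PATH_CHALLENGE frame with 8 unpredictable bytes for addr."""
        data = os.urandom(8)
        self._pending[data] = addr
        return bytes([PATH_CHALLENGE]) + data

    def on_path_response(self, frame: bytes) -> Optional[Address]:
        """Return the validated address, or None if the data was not expected."""
        if len(frame) != 9 or frame[0] != PATH_RESPONSE:
            return None
        return self._pending.pop(frame[1:], None)


def respond(challenge_frame: bytes) -> bytes:
    """Peer side: echo the challenge data back in a PATH_RESPONSE frame."""
    if len(challenge_frame) != 9 or challenge_frame[0] != PATH_CHALLENGE:
        raise ValueError("not a PATH_CHALLENGE frame")
    return bytes([PATH_RESPONSE]) + challenge_frame[1:]
```

The random data is what makes the check meaningful: an off-path attacker who spoofs the new address never sees the challenge and cannot produce a matching response.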

From an infra perspective, middleboxes suddenly see QUIC traffic from a new IP, but because QUIC identifies connections by CID and not 4-tuple, your server’s transport is fine.

End-to-End From Infra Perspective

Imagine you’re running a QUIC-terminating edge (HTTP/3) in front of an origin:

Client sends UDP Initial to your anycast IP.
Your edge node:
- Inspects DCID to route to a specific worker (maybe encoding shard ID in the CID).
- Worker looks up/creates connection state keyed by CID.
TLS/QUIC stack on worker:
- Processes CRYPTO frames, runs TLS handshake, derives keys.
- Starts congestion control with an initial cwnd and an initial RTT estimate (RFC 9002 recommends 333 ms) that is replaced by the first real sample.
Once handshake is sufficiently advanced:
- Server sends 1-RTT packets with HTTP/3 SETTINGS on control stream and maybe h3 HEADERS on stream 0.
- CC reacts to ACKs/losses by adjusting cwnd and pacing rate.
As client roams (e.g., Wi-Fi → LTE):
- New UDP 4-tuple appears with same DCID.
- Worker associates it with existing connection (or routes via new DCID mapping).
- Path validation using PATH_CHALLENGE/PATH_RESPONSE ensures new path is valid.
Connection closes:
- Either endpoint sends CONNECTION_CLOSE frame inside a short/long header packet.
- You keep some stateless reset token for the CID to allow stateless connection teardown if packets arrive after state is gone.
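
The "shard ID in the CID" idea from the routing step can be illustrated with a toy scheme. The sketch below is deliberately naive and rests on assumptions of its own: server-issued CIDs are a fixed 8 bytes with the worker index in the first byte in cleartext, and anything else (such as the random DCID a client picks for its first Initial) falls back to a stable hash. Production deployments usually encrypt or obfuscate the routing hint so observers cannot read it.

```python
import os
import zlib

CID_LEN = 8  # assumed fixed length for server-issued connection IDs


def issue_cid(worker_id: int) -> bytes:
    """Create a connection ID whose first byte names the owning worker."""
    if not 0 <= worker_id < 256:
        raise ValueError("worker_id must fit in one byte")
    return bytes([worker_id]) + os.urandom(CID_LEN - 1)


def worker_for(dcid: bytes, num_workers: int) -> int:
    """Pick the worker for an incoming packet based on its DCID."""
    if len(dcid) == CID_LEN and dcid[0] < num_workers:
        return dcid[0]  # routing hint embedded by issue_cid
    # Unknown format: fall back to a hash that is stable across packets,
    # so every packet carrying this DCID lands on the same worker.
    return zlib.crc32(dcid) % num_workers
```

Because routing depends only on the DCID, packets keep reaching the same worker after the client's 4-tuple changes.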

Implementation and Integration

User-Space Stacks

Almost every QUIC implementation is in user space (e.g., in Rust, Go, C) running over UDP sockets. Expect to manage:

UDP socket fan-out (SO_REUSEPORT, multiple workers).
Tight event loops (epoll/kqueue) with timer wheels for loss/PTO.
Per-connection state machines with multiple encryption levels and PN spaces.

CPU and GC

QUIC’s fine-grained packet accounting (ACK ranges, RTT samples, timers) benefits from careful allocation strategies. In GC languages, consider pooling packet/ack objects and using arenas for per-connection state to reduce churn.

Observability

You’ll want:

Per-connection RTT, cwnd, bytes_in_flight, PTO count.
Per-path stats if you support migration or multipath.
Error codes and where CONNECTION_CLOSE originated (transport vs application).
Hooks for QLOG or similar structured traces for debugging.

HTTP/3 Mapping

For app teams, QUIC should present an HTTP/2-like API: request streams, push (if you use it), flow-control callbacks, plus explicit signals for 0-RTT accept/reject and migration events.

Subsystem Architecture

As a backend/infra engineer, model the implementation around these subsystems:

Packet I/O: UDP socket handling, batching, NIC offload awareness, pacing.
Connection table + CID routing: mapping 4-tuples and CIDs to connection objects.
Crypto/TLS module: handles CRYPTO frames, provides AEAD keys, key updates, 0-RTT / 1-RTT separation.
Frame layer: parse/encode frames, dispatch to:
- Stream manager (ordered byte streams, flow control, FIN, resets).
- ACK manager (generating ACK frames, tracking ranges).
- Congestion controller + loss detector.
- Path manager (migration, validation, multi-path if you go there).
Application mapping: HTTP/3 engine on top of streams (QPACK, control streams, request routing to origin).
