Introduction

Welcome to Networking Protocols: A Developer’s Guide. This book is designed to give you a deep understanding of how computers communicate over networks—knowledge that will make you a more effective developer, debugger, and system designer.

Why Learn Network Protocols?

Every time you make an API call, load a webpage, or send a message, dozens of protocols work together to make it happen. Yet most developers treat networking as a black box. Understanding what happens beneath the surface gives you:

  • Better debugging skills: When things go wrong, you’ll know where to look
  • Informed architecture decisions: Choose the right protocol for your use case
  • Performance optimization: Understand why things are slow and how to fix them
  • Security awareness: Know what protections exist and their limitations

What This Book Covers

┌─────────────────────────────────────────────────────────────┐
│                    Your Application                         │
├─────────────────────────────────────────────────────────────┤
│  HTTP/2  │  WebSocket  │  DNS  │  SMTP  │  Custom Protocol  │
├─────────────────────────────────────────────────────────────┤
│                    TLS/SSL (Security)                       │
├─────────────────────────────────────────────────────────────┤
│              TCP                    │         UDP           │
├─────────────────────────────────────────────────────────────┤
│                    IP (IPv4 / IPv6)                         │
├─────────────────────────────────────────────────────────────┤
│                   Network Interface                         │
└─────────────────────────────────────────────────────────────┘
                    The Protocol Stack

We’ll work through this stack from bottom to top:

  1. Foundations: The conceptual models that organize network communication
  2. IP Layer: How data finds its way across the internet
  3. Transport Layer: TCP and UDP—reliability vs. speed
  4. Security Layer: TLS and how encryption protects your data
  5. Application Layer: HTTP, DNS, WebSockets, and more
  6. Real-World Patterns: Load balancing, CDNs, and production concerns

How to Read This Book

This book is structured to be read sequentially, with each chapter building on previous concepts. However, if you’re already familiar with networking basics, feel free to jump to specific topics that interest you.

Throughout the book, you’ll find:

  • ASCII diagrams illustrating packet structures and protocol flows
  • Code examples showing practical implementations
  • “Deep Dive” sections for those who want extra detail
  • “In Practice” sections with real-world tips and gotchas

Prerequisites

You should be comfortable with:

  • Basic programming concepts
  • Command-line usage
  • Reading simple code examples (we use Python and pseudocode)

No prior networking knowledge is required—we’ll build everything from the ground up.

A Note on Diagrams

Network protocols are inherently visual—packets flow, handshakes happen, connections open and close. We use ASCII diagrams extensively because they:

  1. Work everywhere (including terminals and plain text)
  2. Force clarity (no hiding complexity behind pretty graphics)
  3. Are easy to reproduce and modify

For example, here’s a TCP three-way handshake:

    Client                              Server
       │                                   │
       │─────────── SYN ──────────────────>│
       │                                   │
       │<────────── SYN-ACK ───────────────│
       │                                   │
       │─────────── ACK ──────────────────>│
       │                                   │
       │     Connection Established        │

Get used to reading diagrams like this—they’ll appear throughout the book.
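
You rarely see these packets from application code—the operating system performs the handshake inside connect() and accept(). Here's a minimal Python sketch over the loopback interface that triggers a real handshake:

```python
import socket

# The three-way handshake happens inside connect()/accept();
# the OS kernel sends SYN, SYN-ACK, and ACK for us.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))  # SYN ->, <- SYN-ACK, ACK ->
conn, addr = server.accept()         # connection established

print("established:", client.getsockname() == addr)   # established: True
client.close(); conn.close(); server.close()
```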

Let’s Begin

Networks are fascinating. They’re the invisible infrastructure that connects billions of devices, enabling everything from casual browsing to global financial systems. Understanding how they work isn’t just academically interesting—it’s practically valuable.

Let’s start with the fundamentals.

Network Fundamentals

Before diving into specific protocols, we need to establish a common vocabulary and conceptual framework. This chapter covers the foundational concepts that everything else builds upon.

What Is a Protocol?

A protocol is a set of rules that govern how two parties communicate. In networking, protocols define:

  • Message format: What does the data look like?
  • Message semantics: What does each field mean?
  • Timing: When should messages be sent?
  • Error handling: What happens when things go wrong?

Think of protocols like human languages—they’re agreements that allow parties to understand each other. Just as you can’t have a conversation if one person speaks English and the other speaks Mandarin (without translation), computers can’t communicate without agreeing on a protocol.
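
To make this concrete, here's a sketch of an invented wire format (our own toy, not a real standard): a fixed header giving a message type and payload length. The shared layout is the agreement—that agreement is the protocol:

```python
import struct

# An invented wire format: 2-byte message type, 4-byte payload length,
# then the payload itself. "!" means network (big-endian) byte order.
HEADER = struct.Struct("!HI")

def encode(msg_type: int, payload: bytes) -> bytes:
    return HEADER.pack(msg_type, len(payload)) + payload

def decode(data: bytes) -> tuple[int, bytes]:
    msg_type, length = HEADER.unpack_from(data)
    return msg_type, data[HEADER.size:HEADER.size + length]

wire = encode(1, b"hello")
print(decode(wire))   # (1, b'hello')
```

If one side packed the length in little-endian and the other read big-endian, communication would fail—exactly the kind of mismatch protocols exist to prevent.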

The Need for Layering

Early networks were monolithic—each application had to handle everything from electrical signals to data formatting. This was:

  • Inflexible: Changing one thing meant changing everything
  • Duplicative: Every application reimplemented the same logic
  • Error-prone: More code means more bugs

The solution was layering: dividing responsibilities into distinct layers, each with a specific job. This is the fundamental insight that makes modern networking possible.

┌─────────────────────────────────────────────┐
│   Application Layer                         │
│   "What data do we want to send?"           │
├─────────────────────────────────────────────┤
│   Transport Layer                           │
│   "How do we ensure reliable delivery?"     │
├─────────────────────────────────────────────┤
│   Network Layer                             │
│   "How does data find its destination?"     │
├─────────────────────────────────────────────┤
│   Link Layer                                │
│   "How do bits travel on the physical wire?"│
└─────────────────────────────────────────────┘

Each layer:

  • Provides services to the layer above
  • Uses services from the layer below
  • Has no knowledge of layers beyond its immediate neighbors

This separation of concerns is powerful. You can change the physical network (switch from Ethernet to WiFi) without touching your application. You can change applications without affecting how packets are routed.

What You’ll Learn

In this chapter, we’ll cover:

  1. The OSI Model: The theoretical seven-layer reference model
  2. The TCP/IP Stack: The practical four-layer model the internet actually uses
  3. Encapsulation: How data is wrapped and unwrapped as it moves through layers
  4. Ports and Sockets: How multiple applications share a single network connection

These concepts form the foundation for everything that follows.

The OSI Model

The Open Systems Interconnection (OSI) model is a conceptual framework that standardizes how network communication should be organized. Created by the International Organization for Standardization (ISO) in 1984, it divides networking into seven distinct layers.

The Seven Layers

┌───────────────────────────────────────────────────────────────┐
│  Layer 7: Application     │  HTTP, FTP, SMTP, DNS            │
├───────────────────────────────────────────────────────────────┤
│  Layer 6: Presentation    │  Encryption, Compression, Format │
├───────────────────────────────────────────────────────────────┤
│  Layer 5: Session         │  Session Management, RPC         │
├───────────────────────────────────────────────────────────────┤
│  Layer 4: Transport       │  TCP, UDP                        │
├───────────────────────────────────────────────────────────────┤
│  Layer 3: Network         │  IP, ICMP, Routing               │
├───────────────────────────────────────────────────────────────┤
│  Layer 2: Data Link       │  Ethernet, WiFi, MAC addresses   │
├───────────────────────────────────────────────────────────────┤
│  Layer 1: Physical        │  Cables, Radio waves, Voltages   │
└───────────────────────────────────────────────────────────────┘

Layer 1: Physical Layer

The physical layer deals with the actual transmission of raw bits over a physical medium.

Responsibilities:

  • Defining physical connectors and cables
  • Encoding bits as electrical signals, light pulses, or radio waves
  • Specifying transmission rates (bandwidth)
  • Managing physical topology (how devices connect)

Examples:

  • Ethernet cables (Cat5, Cat6)
  • Fiber optic cables
  • WiFi radio signals
  • USB connections

What it looks like:

Bit stream: 10110010 01001101 11010010 ...
            ↓
Physical:   ▁▁▔▔▁▔▁▁ ▁▔▁▁▔▔▁▔ ▔▔▁▔▁▁▔▁
            (voltage levels on copper wire)
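
The bit stream in the diagram can be reproduced in Python—the byte values 0xB2, 0x4D, 0xD2 correspond to the bits shown above:

```python
# Each byte of the stream becomes eight bits on the wire.
data = b"\xb2\x4d\xd2"
bits = " ".join(f"{byte:08b}" for byte in data)
print(bits)   # 10110010 01001101 11010010
```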

Layer 2: Data Link Layer

The data link layer handles communication between directly connected devices on the same network segment.

Responsibilities:

  • Framing: Organizing bits into frames
  • MAC (Media Access Control) addressing
  • Error detection (not correction)
  • Flow control between adjacent nodes

Key Concepts:

  • MAC Address: A 48-bit hardware address (e.g., 00:1A:2B:3C:4D:5E)
  • Frame: The unit of data at this layer

Ethernet Frame Structure:

┌──────────┬──────────┬──────┬─────────────────┬─────┐
│ Dest MAC │ Src MAC  │ Type │     Payload     │ FCS │
│ (6 bytes)│ (6 bytes)│(2 B) │  (46-1500 B)    │(4 B)│
└──────────┴──────────┴──────┴─────────────────┴─────┘

FCS = Frame Check Sequence (error detection)
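
The header portion of this frame can be parsed in a few lines with Python's struct module. This sketch omits the 4-byte FCS, since network cards normally verify and strip it before the OS ever sees the frame (the sample MAC addresses are made up):

```python
import struct

# Parse the 14-byte Ethernet header laid out above.
def parse_ethernet(frame: bytes):
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    mac = lambda raw: ":".join(f"{b:02x}" for b in raw)
    return mac(dst), mac(src), hex(ethertype), frame[14:]

frame = (bytes.fromhex("001a2b3c4d5e")    # destination MAC
         + bytes.fromhex("aabbccddeeff")  # source MAC
         + b"\x08\x00"                    # EtherType 0x0800 = IPv4
         + b"payload")                    # would be an IP packet
print(parse_ethernet(frame))
# ('00:1a:2b:3c:4d:5e', 'aa:bb:cc:dd:ee:ff', '0x800', b'payload')
```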

Layer 3: Network Layer

The network layer enables communication across different networks—it’s what makes “inter-networking” (the Internet) possible.

Responsibilities:

  • Logical addressing (IP addresses)
  • Routing packets between networks
  • Fragmentation and reassembly
  • Quality of Service (QoS)

Key Protocols:

  • IP (Internet Protocol): The primary protocol
  • ICMP (Internet Control Message Protocol): Error reporting and diagnostics
  • ARP (Address Resolution Protocol): Maps IP to MAC addresses

Routing Decision:

   Source: 192.168.1.100
   Destination: 8.8.8.8

   Is destination on local network? NO
   ↓
   Send to default gateway (router)
   ↓
   Router examines destination, forwards to next hop
   ↓
   Process repeats until packet reaches destination
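
The first step of that decision is easy to express with Python's ipaddress module. This sketch assumes our host sits on the 192.168.1.0/24 network:

```python
import ipaddress

# Is the destination on my local network, or does it need the gateway?
network = ipaddress.ip_network("192.168.1.0/24")

for dest in ("192.168.1.50", "8.8.8.8"):
    if ipaddress.ip_address(dest) in network:
        print(dest, "-> deliver directly on the local network")
    else:
        print(dest, "-> send to the default gateway")
```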

Layer 4: Transport Layer

The transport layer provides end-to-end communication services, handling the complexities of reliable data transfer.

Responsibilities:

  • Segmentation and reassembly
  • Connection management
  • Reliability (for TCP)
  • Flow control
  • Multiplexing via ports

Key Protocols:

  • TCP (Transmission Control Protocol): Reliable, ordered delivery
  • UDP (User Datagram Protocol): Fast, connectionless delivery

Port Multiplexing:

Single IP address, multiple applications:

   IP: 192.168.1.100
   ├── Port 80:   Web Server
   ├── Port 443:  HTTPS Server
   ├── Port 22:   SSH Server
   └── Port 3000: Development Server
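
You can see this multiplexing from Python: each listening socket claims its own port on the same address, and the transport layer steers incoming segments by destination port:

```python
import socket

# One IP address, several applications: each socket gets a distinct port.
servers = []
for _ in range(3):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))   # port 0: the OS assigns a free port
    s.listen(1)
    servers.append(s)

ports = [s.getsockname()[1] for s in servers]
print("distinct ports:", len(set(ports)) == 3)   # distinct ports: True
for s in servers:
    s.close()
```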

Layer 5: Session Layer

The session layer manages sessions—ongoing dialogues between applications.

Responsibilities:

  • Establishing, maintaining, and terminating sessions
  • Session checkpointing and recovery
  • Synchronization

In Practice: This layer is often merged with the application layer in real implementations. Few protocols exist purely at this layer.

Examples:

  • NetBIOS
  • RPC (Remote Procedure Call)
  • Session tokens in web applications (conceptually)

Layer 6: Presentation Layer

The presentation layer handles data representation—how information is formatted, encoded, and encrypted.

Responsibilities:

  • Data translation between formats
  • Encryption and decryption
  • Compression and decompression
  • Character encoding (ASCII, UTF-8)

In Practice: Like the session layer, this is often absorbed into the application layer. TLS can be considered a presentation layer protocol.

Examples:

  • SSL/TLS (encryption)
  • JPEG, GIF (image formatting)
  • MIME types
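
Character encoding is the easiest of these to demonstrate: the same text becomes different bytes depending on the agreed encoding—exactly the kind of translation the presentation layer is about:

```python
# Same text, different bytes, depending on the agreed encoding.
text = "café"
utf8   = text.encode("utf-8")    # b'caf\xc3\xa9' -- 5 bytes
latin1 = text.encode("latin-1")  # b'caf\xe9'     -- 4 bytes
print(len(utf8), len(latin1))                            # 5 4
print(utf8.decode("utf-8") == latin1.decode("latin-1"))  # True
```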

Layer 7: Application Layer

The application layer is where network applications and their protocols operate. This is the layer developers interact with most directly.

Responsibilities:

  • Providing network services to applications
  • User authentication
  • Resource sharing

Examples:

  • HTTP/HTTPS (web)
  • SMTP, POP3, IMAP (email)
  • FTP, SFTP (file transfer)
  • DNS (name resolution)
  • SSH (secure shell)

How Data Flows Through Layers

When you send data, it travels down the stack on your machine, across the network, and up the stack on the destination:

   Sender                                      Receiver
┌────────────┐                            ┌────────────┐
│Application │ ─────────────────────────> │Application │
├────────────┤                            ├────────────┤
│Presentation│ ─────────────────────────> │Presentation│
├────────────┤                            ├────────────┤
│  Session   │ ─────────────────────────> │  Session   │
├────────────┤                            ├────────────┤
│ Transport  │ ─────────────────────────> │ Transport  │
├────────────┤                            ├────────────┤
│  Network   │ ─────────────────────────> │  Network   │
├────────────┤      ┌─────────┐           ├────────────┤
│ Data Link  │ ─────┤ Router  ├───────────┤ Data Link  │
├────────────┤      └─────────┘           ├────────────┤
│ Physical   │ ═══════════════════════════│ Physical   │
└────────────┘     Physical Medium        └────────────┘

Each layer adds its own header (and sometimes trailer) to the data—a process called encapsulation.

OSI in the Real World

Here’s an important truth: the OSI model is a teaching tool, not a strict blueprint.

The internet wasn’t built on OSI—it was built on TCP/IP, which predates OSI and uses a simpler four-layer model. Real protocols often don’t fit neatly into single layers:

  • TLS spans presentation and session layers
  • HTTP is application layer but handles some session concerns
  • TCP handles some session-layer functions

The OSI model is valuable for:

  • Learning and discussing networking concepts
  • Troubleshooting (“Is this a Layer 2 or Layer 3 problem?”)
  • Understanding where protocols fit conceptually

But don’t expect real-world protocols to follow it rigidly.

Memorization Tricks

Many people use mnemonics to remember the layers. From Layer 1 to 7:

  • Please Do Not Throw Sausage Pizza Away
  • Physical, Data Link, Network, Transport, Session, Presentation, Application

Or from 7 to 1:

  • All People Seem To Need Data Processing

Summary

The OSI model provides a framework for understanding network communication:

Layer  Name          Key Function               Example
───────────────────────────────────────────────────────────
  7    Application   User interface to network  HTTP, DNS
  6    Presentation  Data formatting            TLS, JPEG
  5    Session       Dialog management          RPC
  4    Transport     End-to-end delivery        TCP, UDP
  3    Network       Routing between networks   IP
  2    Data Link     Local network delivery     Ethernet
  1    Physical      Bit transmission           Cables, WiFi

In the next section, we’ll look at the TCP/IP model—what the internet actually uses.

The TCP/IP Stack

While the OSI model is a useful teaching framework, the TCP/IP model is what the internet actually runs on. Developed in the 1970s by Vint Cerf and Bob Kahn, it’s simpler, more pragmatic, and battle-tested by decades of real-world use.

Four Layers vs. Seven

The TCP/IP model condenses networking into four layers:

┌─────────────────────────────────────────────────────────────┐
│          TCP/IP Model          │       OSI Model           │
├─────────────────────────────────────────────────────────────┤
│                                │   Application  (Layer 7)  │
│   Application Layer            │   Presentation (Layer 6)  │
│                                │   Session      (Layer 5)  │
├─────────────────────────────────────────────────────────────┤
│   Transport Layer              │   Transport    (Layer 4)  │
├─────────────────────────────────────────────────────────────┤
│   Internet Layer               │   Network      (Layer 3)  │
├─────────────────────────────────────────────────────────────┤
│   Network Access Layer         │   Data Link    (Layer 2)  │
│   (Link Layer)                 │   Physical     (Layer 1)  │
└─────────────────────────────────────────────────────────────┘

This simplification isn’t accidental—it reflects reality. The top three OSI layers often blend together in practice, and the bottom two are typically handled by the same hardware/drivers.

Layer 1: Network Access Layer

Also called the Link Layer, this combines OSI’s physical and data link layers. It handles everything needed to send packets across a physical network segment.

Responsibilities:

  • Physical transmission
  • MAC addressing
  • Frame formatting
  • Local delivery

The TCP/IP model is agnostic about this layer. Whether you’re using:

  • Ethernet
  • WiFi
  • Cellular (4G/5G)
  • Satellite
  • Carrier pigeon (yes, there’s an RFC for that: RFC 1149)

…the upper layers don’t care. This abstraction is what allows the internet to work across wildly different physical media.

Layer 2: Internet Layer

The internet layer handles logical addressing and routing. Its job is getting packets from source to destination across multiple networks.

Key Protocol: IP (Internet Protocol)

IP's Job: Get this packet from A to B, somehow.

   Network A          Network B           Network C
┌───────────┐      ┌───────────┐       ┌───────────┐
│   Host A  │      │  Router   │       │   Host B  │
│192.168.1.5├──────┤  1  ║  2  ├───────┤10.0.0.100 │
└───────────┘      └─────╨─────┘       └───────────┘

IP handles: addressing, routing, fragmentation
IP doesn't handle: reliability, ordering, delivery confirmation

Other Internet Layer Protocols:

  • ICMP (Internet Control Message Protocol): Error reporting, ping
  • ARP (Address Resolution Protocol): Finds MAC address for an IP
  • IGMP (Internet Group Management Protocol): Multicast group membership

Key Characteristics of IP:

  • Connectionless: Each packet is independent
  • Best-effort: No guarantee of delivery
  • Unreliable: Packets can be lost, duplicated, or reordered

This might seem like a weakness, but it’s actually a feature. By keeping IP simple, it can be fast and widely implemented. Reliability can be added at higher layers when needed.

Layer 3: Transport Layer

The transport layer provides end-to-end communication between applications. It’s where we choose between reliability and speed.

TCP (Transmission Control Protocol)

TCP provides reliable, ordered, error-checked delivery.

TCP Provides:
✓ Connection-oriented (explicit setup and teardown)
✓ Reliable delivery (acknowledgments, retransmission)
✓ Ordered delivery (sequence numbers)
✓ Flow control (don't overwhelm the receiver)
✓ Congestion control (don't overwhelm the network)

TCP Costs:
✗ Connection overhead (handshake latency)
✗ Head-of-line blocking (one lost packet stalls everything)
✗ Higher latency than UDP

UDP (User Datagram Protocol)

UDP provides minimal transport services—just multiplexing and checksums.

UDP Provides:
✓ Connectionless (no setup overhead)
✓ Fast (minimal processing)
✓ No head-of-line blocking
✓ Optional checksum

UDP Lacks:
✗ No reliability (packets can be lost)
✗ No ordering (packets can arrive out of order)
✗ No flow control
✗ No congestion control

When to use which?

Use Case         Protocol  Why
────────────────────────────────────────────────────────────────
Web browsing     TCP       Need complete, ordered pages
File transfer    TCP       Can't have missing bytes
Email            TCP       Reliability required
Video streaming  UDP*      Some loss acceptable, low latency important
Online gaming    UDP       Real-time updates, old data worthless
DNS queries      UDP       Small, single request/response
VoIP             UDP       Real-time, loss preferable to delay

*Modern streaming often uses TCP or QUIC for adaptive bitrate streaming.
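
UDP's "fire and forget" style is visible directly in code: no handshake, no connection, just an addressed datagram. A minimal loopback sketch (loopback rarely drops packets, but a real application must plan for loss—hence the timeout):

```python
import socket

# UDP: no connection, no handshake -- just address a datagram and send.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
recv.settimeout(2.0)   # don't wait forever if the datagram is lost

send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send.sendto(b"ping", recv.getsockname())   # fire and forget

data, sender = recv.recvfrom(1024)
print(data)   # b'ping'
recv.close(); send.close()
```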

Layer 4: Application Layer

The application layer is where user-facing protocols live. It combines the application, presentation, and session layers from OSI.

Common Application Layer Protocols:

┌─────────────────────────────────────────────────────────────┐
│                    Application Layer                         │
├──────────────┬──────────────┬──────────────┬───────────────┤
│     HTTP     │     DNS      │     SMTP     │     SSH       │
│   (Web)      │   (Names)    │   (Email)    │   (Secure     │
│              │              │              │    Shell)     │
├──────────────┼──────────────┼──────────────┼───────────────┤
│     FTP      │    DHCP      │    SNMP      │     NTP       │
│   (Files)    │   (Config)   │ (Management) │    (Time)     │
└──────────────┴──────────────┴──────────────┴───────────────┘

This layer handles:

  • Data formatting and encoding
  • Session management
  • Application-specific protocols
  • User authentication (in many protocols)

Putting It All Together

Let’s trace what happens when you request a webpage:

1. APPLICATION LAYER
   Your browser creates an HTTP request:
   "GET /index.html HTTP/1.1"

2. TRANSPORT LAYER
   TCP segments the data, adds:
   - Source port (e.g., 52431)
   - Destination port (80)
   - Sequence number
   - Checksum

3. INTERNET LAYER
   IP adds:
   - Source IP (192.168.1.100)
   - Destination IP (93.184.216.34)
   - TTL (Time to Live)

4. NETWORK ACCESS LAYER
   Ethernet adds:
   - Source MAC
   - Destination MAC (router's MAC)
   - Frame check sequence

5. PHYSICAL
   Converted to electrical signals on the wire

On the receiving end, this process reverses—each layer strips its header and passes data up.

The Protocol Graph

Rather than a strict stack, TCP/IP is better visualized as a graph:

                 ┌─────────────────────────────────────┐
                 │           Applications              │
                 │  HTTP   SMTP   DNS   SSH   Custom   │
                 └──────────────┬──────────────────────┘
                                │
              ┌─────────────────┴─────────────────┐
              │                                   │
         ┌────┴────┐                        ┌────┴────┐
         │   TCP   │                        │   UDP   │
         └────┬────┘                        └────┬────┘
              │                                   │
              └─────────────────┬─────────────────┘
                                │
                          ┌─────┴─────┐
                          │    IP     │
                          └─────┬─────┘
                                │
         ┌────────────────┬─────┴─────┬────────────────┐
         │                │           │                │
    ┌────┴────┐     ┌────┴────┐ ┌────┴────┐     ┌────┴────┐
    │Ethernet │     │  WiFi   │ │Cellular │     │  Other  │
    └─────────┘     └─────────┘ └─────────┘     └─────────┘

Any application can use TCP or UDP. Both use IP. IP can run over any network technology. This flexibility is why the internet works.

Why TCP/IP Won

The OSI model was designed by committee to be complete and correct. TCP/IP was designed by engineers to work. Key differences:

Aspect           OSI                     TCP/IP
──────────────────────────────────────────────────────────────
Design approach  Top-down, theoretical   Bottom-up, practical
Implementation   Came after spec         Spec described working code
Layer count      7 (sometimes awkward)   4 (pragmatic)
Real-world use   Reference model         Running on billions of devices

TCP/IP’s success came from:

  1. Working code first: The spec described implementations that already worked
  2. Simplicity: Fewer layers, clearer responsibilities
  3. Flexibility: “Be liberal in what you accept, conservative in what you send”
  4. Open standards: Anyone could implement it

Summary

The TCP/IP model is the practical foundation of the internet:

Layer           Function                  Key Protocols
────────────────────────────────────────────────────────
Application     User services             HTTP, DNS, SMTP, SSH
Transport       End-to-end delivery       TCP, UDP
Internet        Routing between networks  IP, ICMP
Network Access  Local delivery            Ethernet, WiFi

Understanding this model—especially the separation between IP (best-effort routing) and TCP (reliable delivery)—is essential for understanding how the internet works.

Next, we’ll look at how data is wrapped and unwrapped as it moves through these layers: encapsulation.

Encapsulation

Encapsulation is the process by which each layer wraps data with its own header (and sometimes trailer) information. It’s how layers communicate without knowing about each other’s internals.

The Concept

Think of encapsulation like mailing a letter:

1. You write a letter                    [Your message]
2. Put it in an envelope                 [+ Your address, recipient address]
3. The post office puts it in a bin      [+ Sorting codes, routing info]
4. The bin goes in a truck               [+ Truck manifest, destination hub]

Each layer adds information needed for its job, without looking inside what it received.

Layer-by-Layer Encapsulation

Let’s trace a web request through the TCP/IP stack:

┌─────────────────────────────────────────────────────────────────┐
│  Application Layer                                               │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │              HTTP Request (Data)                          │  │
│  │  "GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"  │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  Transport Layer (TCP)                                           │
│                                                                  │
│  ┌──────────────┬────────────────────────────────────────────┐  │
│  │ TCP Header   │              Data                          │  │
│  │ (20+ bytes)  │  (HTTP Request from above)                 │  │
│  └──────────────┴────────────────────────────────────────────┘  │
│                                                                  │
│  TCP Segment                                                     │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  Internet Layer (IP)                                             │
│                                                                  │
│  ┌──────────────┬────────────────────────────────────────────┐  │
│  │  IP Header   │              Data                          │  │
│  │  (20+ bytes) │  (TCP Segment from above)                  │  │
│  └──────────────┴────────────────────────────────────────────┘  │
│                                                                  │
│  IP Packet (or Datagram)                                         │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  Network Access Layer (Ethernet)                                 │
│                                                                  │
│  ┌──────────────┬────────────────────────────────────┬───────┐  │
│  │Ethernet Hdr  │              Data                  │  FCS  │  │
│  │  (14 bytes)  │  (IP Packet from above)            │(4 B)  │  │
│  └──────────────┴────────────────────────────────────┴───────┘  │
│                                                                  │
│  Ethernet Frame                                                  │
└─────────────────────────────────────────────────────────────────┘
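
The same wrapping can be sketched in a few lines of Python. These toy headers keep only the addressing fields—real TCP and IP headers are 20+ bytes each, as the diagram shows—but the pattern is identical: each layer prepends its header to whatever it received, without inspecting the payload:

```python
import socket
import struct

# Toy encapsulation: each layer prepends its own header.
def tcp_wrap(payload: bytes, src_port: int, dst_port: int) -> bytes:
    return struct.pack("!HH", src_port, dst_port) + payload   # toy "TCP header"

def ip_wrap(segment: bytes, src_ip: str, dst_ip: str) -> bytes:
    return socket.inet_aton(src_ip) + socket.inet_aton(dst_ip) + segment

http    = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
segment = tcp_wrap(http, 52431, 80)                           # transport layer
packet  = ip_wrap(segment, "192.168.1.100", "93.184.216.34")  # internet layer
print(len(http), len(segment), len(packet))   # 47 51 59 -- each layer adds bytes
```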

Terminology

Different layers use different names for their data units:

┌─────────────────────────────────────────┐
│  Layer        │  Data Unit Name        │
├─────────────────────────────────────────┤
│  Application  │  Message / Data        │
│  Transport    │  Segment (TCP)         │
│               │  Datagram (UDP)        │
│  Internet     │  Packet                │
│  Network      │  Frame                 │
└─────────────────────────────────────────┘

These terms matter when debugging—if someone mentions “packet loss,” they’re typically talking about the IP layer.

Detailed Header View

Here’s what each header actually contains:

Ethernet Header (14 bytes)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├───────────────────────────────────────────────────────────────┤
│                    Destination MAC Address                    │
│                         (6 bytes)                             │
├───────────────────────────────────────────────────────────────┤
│                      Source MAC Address                       │
│                         (6 bytes)                             │
├───────────────────────────────────────────────────────────────┤
│                      EtherType (2 bytes)                      │
│                (0x0800 = IPv4, 0x86DD = IPv6)                 │
└───────────────────────────────────────────────────────────────┘

IPv4 Header (20-60 bytes)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├───────┬───────┬───────────────┬───────────────────────────────┤
│Version│  IHL  │    DSCP/ECN   │         Total Length          │
├───────┴───────┴───────────────┼───────┬───────────────────────┤
│         Identification        │ Flags │    Fragment Offset    │
├───────────────┬───────────────┼───────┴───────────────────────┤
│      TTL      │   Protocol    │        Header Checksum        │
├───────────────┴───────────────┴───────────────────────────────┤
│                       Source IP Address                       │
├───────────────────────────────────────────────────────────────┤
│                    Destination IP Address                     │
├───────────────────────────────────────────────────────────────┤
│                    Options (if IHL > 5)                       │
└───────────────────────────────────────────────────────────────┘
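
This fixed 20-byte layout maps directly onto Python's struct module. A sketch that builds a hand-made sample packet and unpacks it (the checksum is left at zero for brevity; real stacks compute and verify it):

```python
import socket
import struct

# Fixed 20-byte IPv4 header: the fields in the diagram, in order.
IPV4 = struct.Struct("!BBHHHBBH4s4s")

def parse_ipv4(packet: bytes) -> dict:
    (ver_ihl, dscp_ecn, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = IPV4.unpack_from(packet)
    return {
        "version":  ver_ihl >> 4,       # high 4 bits
        "ihl":      ver_ihl & 0x0F,     # header length in 32-bit words
        "ttl":      ttl,
        "protocol": proto,              # 6 = TCP, 17 = UDP
        "src":      socket.inet_ntoa(src),
        "dst":      socket.inet_ntoa(dst),
    }

# Hand-built sample: version 4, IHL 5, total length 40, TTL 64, TCP
sample = IPV4.pack(0x45, 0, 40, 1, 0, 64, 6, 0,
                   socket.inet_aton("192.168.1.100"),
                   socket.inet_aton("93.184.216.34"))
print(parse_ipv4(sample))
```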

TCP Header (20-60 bytes)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├───────────────────────────────┬───────────────────────────────┤
│          Source Port          │       Destination Port        │
├───────────────────────────────┴───────────────────────────────┤
│                        Sequence Number                        │
├───────────────────────────────────────────────────────────────┤
│                    Acknowledgment Number                      │
├───────┬───────┬───────────────┬───────────────────────────────┤
│  Data │       │C│E│U│A│P│R│S│F│                               │
│ Offset│ Rsrvd │W│C│R│C│S│S│Y│I│          Window Size          │
│       │       │R│E│G│K│H│T│N│N│                               │
├───────┴───────┴───────────────┼───────────────────────────────┤
│           Checksum            │        Urgent Pointer         │
├───────────────────────────────┴───────────────────────────────┤
│                    Options (if Data Offset > 5)               │
└───────────────────────────────────────────────────────────────┘

Overhead Analysis

Each layer adds overhead. For a small HTTP request:

Layer           Header Size    Running Total
─────────────────────────────────────────────
HTTP Data       ~50 bytes      50 bytes
TCP Header      20 bytes       70 bytes
IP Header       20 bytes       90 bytes
Ethernet        18 bytes*      108 bytes
─────────────────────────────────────────────
                              *14 header + 4 FCS

Efficiency: 50/108 = 46% payload

For small packets, overhead can be significant. This is why protocols often batch multiple operations or use compression.
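
The overhead arithmetic above, as a quick calculation:

```python
# Header overhead for a small HTTP request in one Ethernet frame.
headers = {"TCP": 20, "IP": 20, "Ethernet": 18}   # 18 = 14 header + 4 FCS
payload = 50                                       # small HTTP request
total = payload + sum(headers.values())
print(total, f"{payload / total:.0%}")             # 108 46%
```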

Decapsulation (Receiving)

On the receiving side, each layer strips its header and passes the payload up:

   Receiving Host
   ─────────────────────────────────────────────────────────

   Frame arrives → Network Card

   ┌─────────────────────────────────────────────────────┐
   │  Link Layer                                         │
   │                                                     │
   │  1. Verify FCS (checksum)                           │
   │  2. Check destination MAC                           │
   │  3. Read EtherType → 0x0800 (IPv4)                  │
   │  4. Strip Ethernet header, pass up                  │
   └───────────────────────┬─────────────────────────────┘
                           │
                           ▼
   ┌─────────────────────────────────────────────────────┐
   │  Network Layer                                      │
   │                                                     │
   │  1. Verify header checksum                          │
   │  2. Check destination IP                            │
   │  3. Read Protocol field → 6 (TCP)                   │
   │  4. Strip IP header, pass up                        │
   └───────────────────────┬─────────────────────────────┘
                           │
                           ▼
   ┌─────────────────────────────────────────────────────┐
   │  Transport Layer                                    │
   │                                                     │
   │  1. Verify checksum                                 │
   │  2. Read destination port → 80                      │
   │  3. Find socket listening on port 80               │
   │  4. Process TCP state machine                       │
   │  5. Strip TCP header, pass up                       │
   └───────────────────────┬─────────────────────────────┘
                           │
                           ▼
   ┌─────────────────────────────────────────────────────┐
   │  Application Layer                                  │
   │                                                     │
   │  Web server receives: "GET /index.html HTTP/1.1"    │
   └─────────────────────────────────────────────────────┘

How Layers Know What’s Inside

Each layer includes a field indicating what’s in the payload:

Ethernet EtherType:
  0x0800 = IPv4
  0x86DD = IPv6
  0x0806 = ARP

IP Protocol:
  1  = ICMP
  6  = TCP
  17 = UDP
  47 = GRE

TCP/UDP Port:
  80  = HTTP
  443 = HTTPS
  22  = SSH
  53  = DNS

This is how a packet finds its way to the right application.
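The demultiplexing chain above can be sketched as a pair of lookup tables. This is a toy illustration, not a real stack's API; the handler names are invented:

```python
# Toy dispatch tables mirroring how each layer picks the next parser.
# Keys come from the field values listed above; values are illustrative.
ETHERTYPE = {0x0800: "ipv4", 0x86DD: "ipv6", 0x0806: "arp"}
IP_PROTO  = {1: "icmp", 6: "tcp", 17: "udp", 47: "gre"}

def next_layers(ethertype, ip_proto):
    """Return which parsers handle the payload at each layer."""
    return ETHERTYPE[ethertype], IP_PROTO[ip_proto]

print(next_layers(0x0800, 6))  # ('ipv4', 'tcp')
```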

Encapsulation in Code

Here’s a simplified view of building a packet in Python (conceptual):

# Application layer - your data
http_request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"

# Transport layer - add TCP header
# (conceptual: a real client sends a bare SYN first; the HTTP data
#  follows only after the handshake, in ACK/PSH segments)
tcp_segment = TCPHeader(
    src_port=52431,
    dst_port=80,
    seq_num=1000,
    ack_num=0,
    flags=SYN
) + http_request

# Network layer - add IP header
ip_packet = IPHeader(
    src_ip="192.168.1.100",
    dst_ip="93.184.216.34",
    protocol=TCP,
    ttl=64
) + tcp_segment

# Link layer - add Ethernet header
ethernet_frame = EthernetHeader(
    src_mac="00:11:22:33:44:55",
    dst_mac="aa:bb:cc:dd:ee:ff",
    ethertype=IPv4
) + ip_packet + calculate_fcs()

# Send it!
network_card.send(ethernet_frame)

In practice, the operating system’s network stack handles this, but understanding the process helps when debugging.
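The TCPHeader step can be made concrete with Python's struct module. This is a simplified sketch of the 20-byte header layout shown earlier; the checksum is left at zero, which the OS normally fills in after computing it over a pseudo-header:

```python
import struct

def tcp_header(src_port, dst_port, seq, ack, flags, window=65535):
    """Pack a minimal 20-byte TCP header (no options, checksum = 0)."""
    offset_flags = (5 << 12) | flags      # data offset = 5 words, flags in low bits
    return struct.pack('!HHIIHHHH',
                       src_port, dst_port,
                       seq, ack,
                       offset_flags, window,
                       0,                  # checksum (normally computed by the OS)
                       0)                  # urgent pointer

SYN = 0x02
segment = tcp_header(52431, 80, seq=1000, ack=0, flags=SYN)
print(len(segment))   # 20 -- the minimum TCP header size
```

Unpacking the first four bytes with `struct.unpack('!HH', ...)` recovers the source and destination ports, which is essentially what a packet capture tool does.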

Practical Implications

MTU (Maximum Transmission Unit)

The link layer limits frame size. For Ethernet, the MTU is typically 1500 bytes:

Ethernet Frame Limit: 1518 bytes total
  - Ethernet header: 14 bytes
  - Payload: 1500 bytes (MTU)
  - FCS: 4 bytes

Available for IP packet: 1500 bytes
  - IP header: 20 bytes
  - TCP header: 20 bytes
  - Application data: 1460 bytes (typical MSS)

If data exceeds this, it must be fragmented—which has performance costs.

Jumbo Frames

Some networks support larger MTUs (up to 9000 bytes):

  • Reduces overhead ratio
  • Common in data centers
  • Not universal—can cause problems if intermediate networks don’t support them

Summary

Encapsulation is the mechanism that makes layered networking work:

  1. Each layer adds its own header with information needed for its function
  2. Headers contain “next layer” indicators so receivers know how to decode
  3. Layers are independent—changes to one don’t affect others
  4. Overhead accumulates—important for small packet performance

Understanding encapsulation helps you:

  • Debug network issues at the right layer
  • Understand packet capture output
  • Make informed decisions about protocol overhead

Next, we’ll explore ports and sockets—how multiple applications share a single network connection.

Ports and Sockets

A single computer can run dozens of networked applications simultaneously—a web browser, email client, chat application, and more. How does the operating system route incoming data to the right application? The answer lies in ports and sockets.

The Problem

Consider a server with IP address 192.168.1.100 running:

  • A web server
  • An SSH server
  • A database
  • An API service

When a packet arrives addressed to 192.168.1.100, which application should receive it?

         Incoming Packets
              │
              ▼
    ┌─────────────────────┐
    │   IP: 192.168.1.100 │
    │                     │
    │   ??? Which app ??? │
    │                     │
    │   ┌───┐ ┌───┐ ┌───┐ │
    │   │Web│ │SSH│ │DB │ │
    │   └───┘ └───┘ └───┘ │
    └─────────────────────┘

Ports: Application Addressing

Ports are 16-bit numbers (0-65535) that identify specific applications or services on a host. Combined with an IP address, a port uniquely identifies an application endpoint.

┌─────────────────────────────────────────────────────────────┐
│                    Port Number Space                        │
│                                                             │
│   0 ────────── 1023 ─────── 49151 ────── 65535              │
│   │            │            │            │                  │
│   │ Well-Known │ Registered │  Dynamic/  │                  │
│   │   Ports    │   Ports    │  Private   │                  │
│   │            │            │   Ports    │                  │
│   │  (System)  │ (IANA reg) │ (Ephemeral)│                  │
└─────────────────────────────────────────────────────────────┘

Port Ranges

Range          Name               Purpose
──────────────────────────────────────────────────────────────────────
0-1023         Well-Known Ports   Reserved for standard services; require root/admin
1024-49151     Registered Ports   Can be registered with IANA for specific services
49152-65535    Dynamic/Private    Used for client-side ephemeral ports

Common Well-Known Ports

Port    Protocol    Service
────────────────────────────
20      TCP         FTP Data
21      TCP         FTP Control
22      TCP         SSH
23      TCP         Telnet
25      TCP         SMTP
53      TCP/UDP     DNS
67/68   UDP         DHCP
80      TCP         HTTP
110     TCP         POP3
143     TCP         IMAP
443     TCP         HTTPS
465     TCP         SMTPS
587     TCP         SMTP Submission
993     TCP         IMAPS
995     TCP         POP3S
3306    TCP         MySQL
5432    TCP         PostgreSQL
6379    TCP         Redis
27017   TCP         MongoDB

How Ports Enable Multiplexing

With ports, our server can now direct traffic:

         Incoming Packets
              │
    ┌─────────┴─────────┐
    │    Check port     │
    └─────────┬─────────┘
              │
    ┌─────────┼─────────────────┬──────────────┐
    │         │                 │              │
    ▼         ▼                 ▼              ▼
┌───────┐ ┌───────┐       ┌───────┐      ┌───────┐
│Port 80│ │Port 22│       │Port   │      │Port   │
│ HTTP  │ │  SSH  │       │ 5432  │      │ 3000  │
│Server │ │Server │       │Postgre│      │  API  │
└───────┘ └───────┘       └───────┘      └───────┘

Sockets: The Programming Interface

A socket is an endpoint for network communication. It’s the API that applications use to send and receive data over the network.

The Socket Tuple

A socket is uniquely identified by a 5-tuple:

┌─────────────────────────────────────────────────────────────┐
│                      Socket 5-Tuple                         │
├─────────────────────────────────────────────────────────────┤
│  1. Protocol        (TCP or UDP)                            │
│  2. Local IP        (192.168.1.100)                         │
│  3. Local Port      (80)                                    │
│  4. Remote IP       (10.0.0.50)                             │
│  5. Remote Port     (52431)                                 │
└─────────────────────────────────────────────────────────────┘

This combination uniquely identifies a connection.

Why the Tuple Matters

Multiple connections can share the same local port:

Server listening on port 80 (192.168.1.100:80)

Connection 1: (TCP, 192.168.1.100, 80, 10.0.0.50, 52431)
Connection 2: (TCP, 192.168.1.100, 80, 10.0.0.50, 52432)
Connection 3: (TCP, 192.168.1.100, 80, 10.0.0.99, 41000)
              └─┬─┘ └──────┬─────┘ └┬┘ └────┬────┘ └──┬──┘
              Proto   Local IP   Local  Remote IP  Remote
                                 Port               Port

All three connections go to the same server port, but
each is a unique connection due to different remote endpoints.

This is how a web server can handle thousands of simultaneous connections on port 80.
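The uniqueness argument can be demonstrated directly: modeling each connection as a 5-tuple and putting them in a set shows that none collide, even though the local port is shared.

```python
# Three connections to the same server port are distinct because the
# remote endpoint differs in at least one element of the 5-tuple.
connections = {
    ("TCP", "192.168.1.100", 80, "10.0.0.50", 52431),
    ("TCP", "192.168.1.100", 80, "10.0.0.50", 52432),
    ("TCP", "192.168.1.100", 80, "10.0.0.99", 41000),
}
print(len(connections))  # 3 -- no collisions despite the shared local port
```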

Socket Types

Stream Sockets (SOCK_STREAM)

Used with TCP:

  • Connection-oriented
  • Reliable, ordered byte stream
  • Most common for applications
Client                              Server
   │                                   │
   │────── connect() ─────────────────>│
   │                                   │ accept()
   │<──────────────────────────────────│
   │                                   │
   │═══════ Bidirectional Stream ══════│
   │                                   │
   │────── send(data) ────────────────>│
   │<───── send(response) ─────────────│
   │                                   │
   │────── close() ───────────────────>│

Datagram Sockets (SOCK_DGRAM)

Used with UDP:

  • Connectionless
  • Individual messages (datagrams)
  • No guarantee of delivery or order
Client                              Server
   │                                   │
   │────── sendto(data, addr) ────────>│
   │<───── sendto(response, addr) ─────│
   │                                   │
   │      (No connection state)        │
   │                                   │
   │────── sendto(data, addr) ────────>│
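The datagram exchange above can be sketched in a few lines of Python, with both endpoints in one process. Binding to port 0 on loopback lets the OS pick a free port, so the sketch won't collide with a real service:

```python
import socket

# UDP "server": bind and wait for datagrams
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(('127.0.0.1', 0))              # port 0: OS assigns a free port
server_addr = server.getsockname()

# UDP "client": no connect() needed, just sendto()
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"ping", server_addr)

data, client_addr = server.recvfrom(1024)  # client_addr holds the client's ephemeral port
server.sendto(b"pong", client_addr)

reply, _ = client.recvfrom(1024)
print(reply.decode())                      # pong
```

Note there is no handshake and no connection state: each `sendto()` is an independent datagram.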

Socket Programming Example

Here’s a simple TCP server and client in Python:

TCP Server

import socket

# Create socket
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Allow address reuse (helpful during development)
server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# Bind to address and port
server_socket.bind(('0.0.0.0', 8080))

# Listen for connections (backlog of 5)
server_socket.listen(5)

print("Server listening on port 8080...")

while True:
    # Accept incoming connection
    client_socket, client_address = server_socket.accept()
    print(f"Connection from {client_address}")

    # Receive data
    data = client_socket.recv(1024)
    print(f"Received: {data.decode()}")

    # Send response
    client_socket.send(b"Hello from server!")

    # Close connection
    client_socket.close()

TCP Client

import socket

# Create socket
client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Connect to server
client_socket.connect(('localhost', 8080))

# Send data
client_socket.send(b"Hello from client!")

# Receive response
response = client_socket.recv(1024)
print(f"Received: {response.decode()}")

# Close connection
client_socket.close()
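The server and client above can be exercised together in a single script by running one accept-and-reply cycle on a background thread. This condensed sketch binds to port 0 on loopback so it never collides with a real service:

```python
import socket
import threading

def serve_once(server_socket):
    # Accept one connection, echo a greeting, and close it
    client_conn, _ = server_socket.accept()
    client_conn.recv(1024)
    client_conn.send(b"Hello from server!")
    client_conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('127.0.0.1', 0))             # port 0: OS picks a free port
server.listen(1)
threading.Thread(target=serve_once, args=(server,)).start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
client.send(b"Hello from client!")
response = client.recv(1024)
print(response.decode())                  # Hello from server!
client.close()
```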

The Socket Lifecycle

Server Side

┌─────────────────────────────────────────────────────────────┐
│                    Server Socket Lifecycle                   │
└─────────────────────────────────────────────────────────────┘

    socket()         Create the socket
        │
        ▼
    bind()           Assign local address and port
        │
        ▼
    listen()         Mark socket as passive (accepting connections)
        │
        ▼
    ┌──────────────────────────────────────┐
    │              accept()                │◄────┐
    │   (blocks until client connects)     │     │
    └──────────────┬───────────────────────┘     │
                   │                             │
                   ▼                             │
           New connected socket                  │
                   │                             │
        ┌──────────┴──────────┐                  │
        │                     │                  │
        ▼                     ▼                  │
    recv()/send()        spawn thread/          │
        │                handle async            │
        ▼                     │                  │
    close()                   │                  │
        │                     └──────────────────┘
        │
    (Handle next connection)

Client Side

┌─────────────────────────────────────────────────────────────┐
│                    Client Socket Lifecycle                   │
└─────────────────────────────────────────────────────────────┘

    socket()         Create the socket
        │
        ▼
    connect()        Connect to remote server
        │            (OS assigns ephemeral local port)
        ▼
    send()/recv()    Exchange data
        │
        ▼
    close()          Terminate connection

Ephemeral Ports

When a client connects to a server, the OS automatically assigns an ephemeral port (a temporary, short-lived port) for the client side:

Client                                      Server
┌─────────────┐                        ┌─────────────┐
│ 10.0.0.50   │                        │192.168.1.100│
│             │                        │             │
│ Port: ???   │── connect() ──────────>│ Port: 80    │
└─────────────┘                        └─────────────┘

OS assigns ephemeral port (e.g., 52431)

┌─────────────┐                        ┌─────────────┐
│ 10.0.0.50   │                        │192.168.1.100│
│             │                        │             │
│ Port: 52431 │<═══════════════════════│ Port: 80    │
└─────────────┘     Connection         └─────────────┘
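You can observe this assignment in Python: after `connect()`, `getsockname()` reveals the ephemeral port the OS chose. The sketch below binds the listener to port 0 as well, so it runs anywhere without clashing with a real service:

```python
import socket

# A throwaway listener so the client has something to connect to
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(listener.getsockname())

# The OS has now assigned an ephemeral local port to the client socket
local_ip, local_port = client.getsockname()
print(local_port)     # e.g. 52431 -- drawn from the system's ephemeral range

conn, _ = listener.accept()
for s in (conn, client, listener):
    s.close()
```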

Ephemeral Port Range

Different systems use different ranges:

OS          Default Range
─────────────────────────
Linux       32768-60999
Windows     49152-65535
macOS       49152-65535

You can check and modify this on Linux:

$ cat /proc/sys/net/ipv4/ip_local_port_range
32768   60999

$ sudo sysctl -w net.ipv4.ip_local_port_range="10000 65535"

Port Exhaustion

Each outbound connection uses an ephemeral port. If your application makes many outbound connections, you can exhaust available ports:

Problem Scenario:
─────────────────
Application makes 50,000 connections to an API server.
Each connection uses one ephemeral port.
Default range: 32768-60999 = ~28,000 ports

If connections aren't closed properly (lingering in TIME_WAIT),
you run out of ports!

Solutions:
─────────────────
1. Expand ephemeral port range
2. Reuse TIME_WAIT sockets where safe (e.g. net.ipv4.tcp_tw_reuse on Linux)
3. Use connection pooling
4. Properly close connections

Viewing Port Usage

Linux/macOS

# List all listening ports
$ netstat -tlnp
Proto Local Address    Foreign Address  State   PID/Program
tcp   0.0.0.0:22       0.0.0.0:*        LISTEN  1234/sshd
tcp   0.0.0.0:80       0.0.0.0:*        LISTEN  5678/nginx

# Or with ss (modern replacement)
$ ss -tlnp

# List all connections
$ netstat -anp | grep ESTABLISHED

# Show which process owns a port
$ lsof -i :80

Windows

# List all listening ports
netstat -an | findstr LISTENING

# Show process IDs
netstat -ano | findstr :80

Special Port Behaviors

Binding to 0.0.0.0

Binding to 0.0.0.0 means “all interfaces”:

┌─────────────────────────────────────────────────────────────┐
│  Server with multiple interfaces                            │
│                                                             │
│  eth0: 192.168.1.100                                        │
│  eth1: 10.0.0.50                                            │
│  lo:   127.0.0.1                                            │
│                                                             │
│  bind('0.0.0.0', 80) → accepts on ALL interfaces            │
│  bind('192.168.1.100', 80) → accepts only on eth0           │
│  bind('127.0.0.1', 80) → accepts only on localhost          │
└─────────────────────────────────────────────────────────────┘

Port 0

Binding to port 0 asks the OS to assign any available port:

server_socket.bind(('0.0.0.0', 0))
actual_port = server_socket.getsockname()[1]
print(f"Assigned port: {actual_port}")  # e.g., 54321

Reserved Ports (< 1024)

On Unix systems, ports below 1024 require root privileges:

$ python -c "import socket; s=socket.socket(); s.bind(('',80))"
PermissionError: [Errno 13] Permission denied

$ sudo python -c "import socket; s=socket.socket(); s.bind(('',80))"
# Works

This prevents unprivileged users from impersonating system services.

Summary

  • Ports (0-65535) identify applications on a host
  • Sockets are the programming interface for network I/O
  • A connection is uniquely identified by the 5-tuple: (protocol, local IP, local port, remote IP, remote port)
  • Ephemeral ports are automatically assigned for outbound connections
  • Multiple connections can share a server port because remote endpoints differ

Understanding ports and sockets is essential for:

  • Writing networked applications
  • Debugging connectivity issues
  • Understanding firewall rules
  • Diagnosing port exhaustion problems

With the fundamentals covered, we’re ready to dive into the IP layer—how data finds its way across the internet.

The IP Layer

The Internet Protocol (IP) is the foundation of the internet. It provides logical addressing and routing—the ability to send packets from any device to any other device, regardless of the physical networks in between.

IP’s Simple Contract

IP makes a simple promise: “I’ll try to get this packet to its destination.”

Notice what IP doesn’t promise:

  • Packets will arrive (they might be dropped)
  • Packets will arrive in order (they might take different routes)
  • Packets will arrive only once (duplicates can happen)
  • Packets will arrive intact (corruption is possible, though detected)

This “best-effort” service might seem inadequate, but it’s deliberately minimal. By keeping IP simple, it can be:

  • Fast: Minimal processing per packet
  • Scalable: Routers don’t maintain connection state
  • Universal: Works over any link layer

Higher layers (like TCP) can add reliability when needed.

The Two IP Versions

Today’s internet runs on two versions of IP:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   IPv4 (1981)              │   IPv6 (1998)                  │
│   ─────────────────────────│─────────────────────────────   │
│   32-bit addresses         │   128-bit addresses            │
│   ~4.3 billion addresses   │   ~340 undecillion addresses   │
│   Widely deployed          │   Growing adoption             │
│   NAT commonly used        │   NAT generally unnecessary    │
│   Simpler header           │   Fixed header, extensions     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Both are in active use. Your device likely uses both daily.

What You’ll Learn

In this chapter, we’ll cover:

  1. IPv4 Addressing: The original 32-bit addressing scheme
  2. IPv6: The next generation with its vastly larger address space
  3. Subnetting: Dividing networks into smaller segments
  4. Routing: How packets find their way across networks
  5. Fragmentation: What happens when packets are too big

Key Concepts Preview

Addresses Identify Interfaces, Not Hosts

A common misconception is that an IP address identifies a computer. Actually, it identifies a network interface. A computer with two network cards has two IP addresses:

┌────────────────────────────────────────────────┐
│                    Server                      │
│                                                │
│   ┌────────────┐          ┌────────────┐      │
│   │    eth0    │          │    eth1    │      │
│   │192.168.1.10│          │ 10.0.0.10  │      │
│   └──────┬─────┘          └──────┬─────┘      │
└──────────┼───────────────────────┼────────────┘
           │                       │
      ┌────┴─────┐            ┌────┴─────┐
      │Network A │            │Network B │
      └──────────┘            └──────────┘

Routing Is Hop-by-Hop

No device knows the complete path to a destination. Each router makes a local decision about the next hop:

Source ──> Router1 ──> Router2 ──> Router3 ──> Destination

Each router:
1. Looks at destination IP
2. Consults routing table
3. Forwards to next hop
4. Forgets about the packet

No router knows the full path. Each just knows "for this
destination, send to that next router."

TTL Prevents Infinite Loops

The Time to Live (TTL) field starts at some value (typically 64 or 128) and decrements at each hop. If it reaches 0, the packet is discarded. This prevents packets from circulating forever if there’s a routing loop.

TTL at source:     64
After router 1:    63
After router 2:    62
...
If routing loop:   Eventually hits 0 → packet dropped

Let’s dive into the details, starting with IPv4.

IPv4 Addressing

IPv4 (Internet Protocol version 4) has been the backbone of the internet since 1981. Despite its age and limitations, it still carries the majority of internet traffic.

The IPv4 Address

An IPv4 address is a 32-bit number, typically written as four decimal numbers separated by dots (dotted-decimal notation):

Binary:    11000000 10101000 00000001 01100100
           └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘
Decimal:     192   .  168   .   1    .  100

Each number (octet) ranges from 0-255 (8 bits)
Total: 4 octets × 8 bits = 32 bits

Address Space Size

32 bits gives us 2³² = 4,294,967,296 addresses. Sounds like a lot, but:

  • Many are reserved for special purposes
  • Allocation was historically wasteful
  • Every device needs an address (phones, IoT, servers…)

IANA allocated the last of its unassigned IPv4 address blocks in 2011.

The IPv4 Header

Every IP packet starts with a header containing routing and handling information:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
│Version│  IHL  │    DSCP   │ECN│         Total Length          │
├───────┴───────┴───────────┴───┼───────┬───────────────────────┤
│         Identification        │ Flags │    Fragment Offset    │
├───────────────┬───────────────┼───────┴───────────────────────┤
│      TTL      │   Protocol    │        Header Checksum        │
├───────────────┴───────────────┴───────────────────────────────┤
│                       Source IP Address                       │
├───────────────────────────────────────────────────────────────┤
│                    Destination IP Address                     │
├───────────────────────────────────────────────────────────────┤
│                    Options (if IHL > 5)                       │
└───────────────────────────────────────────────────────────────┘

Minimum header size: 20 bytes (no options)
Maximum header size: 60 bytes (with options)

Key Header Fields

Field             Size      Purpose
──────────────────────────────────────────────────────────────
Version           4 bits    IP version (4 for IPv4)
IHL               4 bits    Header length in 32-bit words
DSCP/ECN          8 bits    Quality of Service hints
Total Length      16 bits   Packet size (header + data)
Identification    16 bits   Unique ID for fragmentation
Flags             3 bits    Fragmentation control
Fragment Offset   13 bits   Position in fragmented packet
TTL               8 bits    Hop limit (prevents loops)
Protocol          8 bits    Upper layer protocol (TCP=6, UDP=17)
Header Checksum   16 bits   Error detection for header
Source IP         32 bits   Sender's address
Destination IP    32 bits   Receiver's address

Address Classes (Historical)

Originally, IPv4 used a classful addressing scheme:

Class A: 0xxxxxxx.xxxxxxxx.xxxxxxxx.xxxxxxxx
         │└──────────────┬───────────────────┘
         Network (8 bits)    Host (24 bits)
         Range: 1.0.0.0 - 126.255.255.255
         Networks: 126    Hosts/Network: 16 million

Class B: 10xxxxxx.xxxxxxxx.xxxxxxxx.xxxxxxxx
         └───────┬────────┘└───────┬────────┘
         Network (16 bits)   Host (16 bits)
         Range: 128.0.0.0 - 191.255.255.255
         Networks: 16,384  Hosts/Network: 65,534

Class C: 110xxxxx.xxxxxxxx.xxxxxxxx.xxxxxxxx
         └──────────┬──────────────┘└───┬───┘
         Network (24 bits)         Host (8 bits)
         Range: 192.0.0.0 - 223.255.255.255
         Networks: 2 million  Hosts/Network: 254

Class D: 1110xxxx.xxxxxxxx.xxxxxxxx.xxxxxxxx
         Multicast addresses (224.0.0.0 - 239.255.255.255)

Class E: 1111xxxx.xxxxxxxx.xxxxxxxx.xxxxxxxx
         Reserved/Experimental (240.0.0.0 - 255.255.255.255)

This system is obsolete. It was too inflexible—an organization needing 300 addresses had to get a Class B (65,534 addresses) because Class C was too small (254). This wasted addresses. Modern networks use CIDR (classless addressing) instead.

Special and Reserved Addresses

Several address ranges have special meanings:

┌──────────────────────────────────────────────────────────────┐
│  Address Range        │  Purpose                             │
├──────────────────────────────────────────────────────────────┤
│  0.0.0.0/8            │  "This network" / unspecified        │
│  10.0.0.0/8           │  Private network (Class A)           │
│  127.0.0.0/8          │  Loopback (localhost)                │
│  169.254.0.0/16       │  Link-local (auto-config)            │
│  172.16.0.0/12        │  Private network (Class B range)     │
│  192.168.0.0/16       │  Private network (Class C range)     │
│  224.0.0.0/4          │  Multicast                           │
│  255.255.255.255      │  Broadcast                           │
└──────────────────────────────────────────────────────────────┘

Private Addresses (RFC 1918)

Three ranges are designated for private use—they’re not routable on the public internet:

10.0.0.0     - 10.255.255.255    (10.0.0.0/8)      16 million addresses
172.16.0.0   - 172.31.255.255    (172.16.0.0/12)   1 million addresses
192.168.0.0  - 192.168.255.255   (192.168.0.0/16)  65,536 addresses

Your home network almost certainly uses one of these ranges (typically 192.168.x.x). To reach the internet, your router performs NAT (Network Address Translation).

Loopback Address

127.0.0.1 (or any 127.x.x.x) is the loopback address. Traffic sent here never leaves your machine—it’s used for local testing:

$ ping 127.0.0.1
PING 127.0.0.1: 64 bytes, seq=0 time=0.054 ms

# Same as:
$ ping localhost

Broadcast Address

255.255.255.255 is the limited broadcast address. Packets sent here go to all devices on the local network segment.

Each network also has a directed broadcast address (the highest address in the range). For 192.168.1.0/24, the broadcast is 192.168.1.255.
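Python's ipaddress module (covered again later in this chapter) computes both special addresses for you:

```python
import ipaddress

net = ipaddress.ip_network('192.168.1.0/24')
print(net.network_address)     # 192.168.1.0   (lowest address: the network itself)
print(net.broadcast_address)   # 192.168.1.255 (highest address: directed broadcast)
```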

Network vs. Host Portions

An IP address has two parts:

       192.168.1.100
       └───┬───┘└┬┘
       Network  Host
       Portion  Portion

The division is determined by the subnet mask.

The network portion identifies which network a host belongs to. The host portion identifies the specific device on that network.

Subnet Mask

A subnet mask indicates how many bits are network vs. host:

IP Address:     192.168.1.100   = 11000000.10101000.00000001.01100100
Subnet Mask:    255.255.255.0   = 11111111.11111111.11111111.00000000
                                  └────────── Network ─────────────┘└ Host ┘

AND them together to get the network address:
Network:        192.168.1.0     = 11000000.10101000.00000001.00000000
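The AND operation above can be performed directly on the addresses as 32-bit integers, which is what routers and the kernel actually do:

```python
# The same masking, on the raw 32-bit integers
ip   = 0xC0A80164                 # 192.168.1.100
mask = 0xFFFFFF00                 # 255.255.255.0 (/24)
net  = ip & mask                  # 0xC0A80100

# Render the result back in dotted-decimal form
print('.'.join(str((net >> s) & 0xFF) for s in (24, 16, 8, 0)))  # 192.168.1.0
```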

CIDR Notation

CIDR (Classless Inter-Domain Routing) notation appends a slash and the number of network bits:

192.168.1.100/24
             └── 24 bits for network = 255.255.255.0 mask

Common CIDR blocks:
/8   = 255.0.0.0       = 16,777,214 hosts
/16  = 255.255.0.0     = 65,534 hosts
/24  = 255.255.255.0   = 254 hosts
/32  = 255.255.255.255 = 1 host (single address)

Determining If Two Hosts Are on the Same Network

Hosts on the same network can communicate directly. Hosts on different networks need a router.

Host A: 192.168.1.100/24
Host B: 192.168.1.200/24
Host C: 192.168.2.50/24

Apply mask to each:
A network: 192.168.1.100 AND 255.255.255.0 = 192.168.1.0
B network: 192.168.1.200 AND 255.255.255.0 = 192.168.1.0
C network: 192.168.2.50  AND 255.255.255.0 = 192.168.2.0

A and B: Same network (192.168.1.0) → Direct communication
A and C: Different networks → Need router

NAT (Network Address Translation)

With private addresses and limited IPv4 space, NAT lets many devices share one public IP:

Private Network (192.168.1.0/24)          Internet
┌─────────────────────────────────┐
│  ┌─────────┐                    │     ┌─────────────────┐
│  │ Laptop  │                    │     │                 │
│  │ .100    ├──┐                 │     │   Web Server    │
│  └─────────┘  │    ┌─────────┐  │     │  93.184.216.34  │
│               ├────┤ Router  ├──┼────>│                 │
│  ┌─────────┐  │    │  NAT    │  │     │                 │
│  │  Phone  ├──┘    │         │  │     └─────────────────┘
│  │ .101    │       │ Public: │  │
│  └─────────┘       │73.45.2.1│  │
│                    └─────────┘  │
└─────────────────────────────────┘

Laptop sends: src=192.168.1.100:52000 dst=93.184.216.34:80
NAT rewrites: src=73.45.2.1:40123    dst=93.184.216.34:80

Response comes back to 73.45.2.1:40123
NAT looks up mapping, forwards to 192.168.1.100:52000

NAT is why billions of devices can use the internet with only ~4 billion addresses.
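The rewrite-and-remember behavior above can be sketched as a translation table. This is a toy model of the outbound half of NAT, with invented addresses and port numbers, not a real implementation:

```python
# Toy NAT table: maps (private_ip, private_port) -> public port.
nat_table = {}
next_public_port = 40000          # illustrative starting point

def translate_outbound(src_ip, src_port, public_ip="73.45.2.1"):
    """Return the rewritten (public_ip, public_port) for an outbound flow."""
    global next_public_port
    key = (src_ip, src_port)
    if key not in nat_table:              # new flow: allocate a public port
        nat_table[key] = next_public_port
        next_public_port += 1
    return public_ip, nat_table[key]

print(translate_outbound("192.168.1.100", 52000))  # ('73.45.2.1', 40000)
```

A real NAT also keeps the reverse mapping so responses arriving at the public port can be forwarded back to the right private endpoint.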

Working with IP Addresses in Code

Python

import ipaddress

# Parse an address
ip = ipaddress.ip_address('192.168.1.100')
print(ip.is_private)      # True
print(ip.is_loopback)     # False

# Work with networks
network = ipaddress.ip_network('192.168.1.0/24')
print(network.num_addresses)  # 256
print(network.netmask)        # 255.255.255.0

# Check if address is in network
ip = ipaddress.ip_address('192.168.1.100')
print(ip in network)          # True

# Iterate over hosts
for host in network.hosts():
    print(host)  # 192.168.1.1 through 192.168.1.254

Bash

# Get your IP addresses
$ ip addr show
# or
$ ifconfig

# Check if you can reach an IP
$ ping -c 3 192.168.1.1

# Trace route to destination
$ traceroute 8.8.8.8

# Look up your public IP
$ curl ifconfig.me

Practical Tips

Finding Your IP Address

# Linux/Mac - local IP
$ hostname -I
192.168.1.100

# Windows - local IP
> ipconfig

# Public IP (what the internet sees)
$ curl ifconfig.me

Common Issues

“Network is unreachable”

  • Check if you have an IP (DHCP may have failed)
  • Check subnet mask is correct
  • Check default gateway is set

“No route to host”

  • Destination may be down
  • Firewall may be blocking
  • ARP resolution may have failed

“Connection refused”

  • You reached the host, but no service is listening
  • This is a good sign for network debugging—networking works!

Summary

IPv4’s 32-bit addressing scheme, while showing its age, remains the internet’s foundation:

  • Addresses are written as four octets (e.g., 192.168.1.100)
  • Network and host portions are determined by the subnet mask
  • Private ranges (10.x, 172.16-31.x, 192.168.x) are for internal use
  • NAT allows address sharing but adds complexity
  • CIDR replaced wasteful classful addressing

The address shortage led to IPv6, which we’ll cover next.

IPv6 and the Future

IPv6 was designed to solve IPv4’s address exhaustion problem—and to fix several other shortcomings along the way. With 128-bit addresses, IPv6 provides enough addresses for every grain of sand on Earth to have its own IP… many times over.

Why IPv6?

The primary driver was address space:

IPv4: 2³² = 4.3 billion addresses
IPv6: 2¹²⁸ = 340 undecillion addresses

             340,282,366,920,938,463,463,374,607,431,768,211,456

That's 340 trillion trillion trillion addresses.
Or about 50 octillion addresses per human alive today.

But IPv6 also addressed other IPv4 limitations:

  • No more NAT required (enough addresses for everyone)
  • Simplified header (faster routing)
  • Built-in security (IPsec)
  • Better multicast support
  • Stateless address autoconfiguration

IPv6 Address Format

An IPv6 address is 128 bits, written as eight groups of four hexadecimal digits:

Full form:
2001:0db8:85a3:0000:0000:8a2e:0370:7334
└──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘
  │    │    │    │    │    │    │    │
  ▼    ▼    ▼    ▼    ▼    ▼    ▼    ▼
Each group = 16 bits (4 hex digits)
8 groups × 16 bits = 128 bits

Address Shortening Rules

IPv6 addresses can be shortened for readability:

Rule 1: Remove leading zeros in each group

2001:0db8:0042:0000:0000:0000:0000:0001
     ↓
2001:db8:42:0:0:0:0:1

Rule 2: Replace one sequence of all-zero groups with ::

2001:db8:42:0:0:0:0:1
           ↓
2001:db8:42::1

Important: :: can only appear once per address (otherwise it’s ambiguous).

Examples

Full                                    Shortened
────────────────────────────────────────────────────────────
2001:0db8:0000:0000:0000:0000:0000:0001  2001:db8::1
0000:0000:0000:0000:0000:0000:0000:0001  ::1 (loopback)
0000:0000:0000:0000:0000:0000:0000:0000  :: (unspecified)
fe80:0000:0000:0000:0215:5dff:fe00:0000  fe80::215:5dff:fe00:0
2001:0db8:85a3:0000:0000:8a2e:0370:7334  2001:db8:85a3::8a2e:370:7334
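
These rules are mechanical, so they can be checked with Python's ipaddress module: `str()`/`.compressed` applies both shortening rules, and `.exploded` reverses them.

```python
import ipaddress

# .compressed applies both shortening rules; .exploded restores the full form
full = ipaddress.IPv6Address('2001:0db8:0000:0000:0000:0000:0000:0001')
print(full.compressed)  # 2001:db8::1
print(ipaddress.IPv6Address('2001:db8::1').exploded)
# 2001:0db8:0000:0000:0000:0000:0000:0001
```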

The IPv6 Header

IPv6’s header is simpler than IPv4’s—fixed at 40 bytes with no options in the base header:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
┌───────┬───────────────┬───────────────────────────────────────┐
┌───────┬───────────────┬───────────────────────────────────────┐
│Version│ Traffic Class │              Flow Label               │
├───────┴───────────────┼───────────────────┬───────────────────┤
│      Payload Length   │   Next Header     │    Hop Limit      │
├───────────────────────┴───────────────────┴───────────────────┤
│                                                               │
│                       Source Address                          │
│                       (128 bits)                              │
│                                                               │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│                    Destination Address                        │
│                       (128 bits)                              │
│                                                               │
└───────────────────────────────────────────────────────────────┘

Key Differences from IPv4

IPv4                            IPv6
──────────────────────────────────────────────────────────────
Variable header (20-60 bytes)   Fixed header (40 bytes)
Header checksum                 No checksum (relies on link layer)
Fragmentation in header         Extension headers
Options in header               Extension headers

Extension Headers

IPv6 uses extension headers for optional features. They chain together:

┌──────────────┬────────────────┬──────────────┬─────────────┐
│ IPv6 Header  │  Hop-by-Hop    │  Destination │    TCP      │
│ Next: Hop-by │  Next: Dest    │  Next: TCP   │   Segment   │
│     -Hop     │   Options      │              │             │
└──────────────┴────────────────┴──────────────┴─────────────┘

Common Extension Headers:
- Hop-by-Hop Options (processed by every router)
- Routing (specify intermediate routers)
- Fragment (for packet fragmentation)
- Authentication Header (IPsec)
- Encapsulating Security Payload (IPsec encryption)
- Destination Options (for destination only)

Address Types

IPv6 has three address types (no broadcast!):

┌─────────────────────────────────────────────────────────────┐
│                    IPv6 Address Types                       │
├─────────────────────────────────────────────────────────────┤
│  Unicast      One-to-one communication                      │
│               Single sender, single receiver                │
│                                                             │
│  Multicast    One-to-many communication                     │
│               Single sender, multiple receivers             │
│               (Replaces broadcast)                          │
│                                                             │
│  Anycast      One-to-nearest communication                  │
│               Delivered to closest node in a group          │
│               (Same address on multiple nodes)              │
└─────────────────────────────────────────────────────────────┘

Special Address Prefixes

Prefix              Type                    Purpose
──────────────────────────────────────────────────────────────
::1/128             Loopback                Local host (localhost)
::/128              Unspecified             No address assigned
fe80::/10           Link-local              Same network only
fc00::/7            Unique local            Private addresses
ff00::/8            Multicast               Group communication
2000::/3            Global unicast          Public internet
::ffff:0:0/96       IPv4-mapped             IPv4 in IPv6 format
64:ff9b::/96        IPv4-IPv6 translation   NAT64
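
The prefix classes above are exposed as boolean properties by Python's ipaddress module; a quick check:

```python
import ipaddress

# Each special prefix maps to a classification property
print(ipaddress.ip_address('::1').is_loopback)        # True
print(ipaddress.ip_address('fe80::1').is_link_local)  # True
print(ipaddress.ip_address('fc00::1').is_private)     # True  (unique local)
print(ipaddress.ip_address('ff02::1').is_multicast)   # True
```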

Every IPv6 interface automatically gets a link-local address starting with fe80::

Interface: eth0
Link-local: fe80::1a2b:3c4d:5e6f:7890

These addresses:
- Auto-generated from MAC address (or random)
- Valid only on local network segment
- Not routed beyond local link
- Always present, even without DHCP/manual config

Global Unicast Addresses

Public IPv6 addresses typically start with 2 or 3:

2001:db8:1234:5678:9abc:def0:1234:5678
└───────┬────────┘└────────┬─────────┘
  Routing Prefix      Interface ID
(Network portion)    (Host portion)

Typical allocation:
/48  - Organization gets this from ISP
/64  - Single subnet (standard recommendation)

Address Autoconfiguration

IPv6 supports Stateless Address Autoconfiguration (SLAAC)—devices can configure their own addresses without DHCP:

1. Interface comes up
   ↓
2. Generate link-local address (fe80::...)
   ↓
3. Router sends Router Advertisement (RA)
   Contains: Network prefix (e.g., 2001:db8:1::/64)
   ↓
4. Host generates global address:
   Prefix from RA + Interface ID = Global Address
   2001:db8:1::1a2b:3c4d:5e6f:7890/64
   ↓
5. Host verifies uniqueness (DAD - Duplicate Address Detection)
   ↓
6. Address is ready to use!

DHCPv6 is available for networks needing more control.
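
As an illustration of step 4, here is a sketch of the classic modified-EUI-64 derivation that turns a MAC address into a 64-bit interface ID (the helper name and example MAC are ours; modern systems often prefer random or stable-privacy interface IDs instead):

```python
def eui64_interface_id(mac):
    """Sketch of modified EUI-64: MAC -> IPv6 interface ID (RFC 4291 style)."""
    b = bytearray(int(x, 16) for x in mac.split(':'))
    b[0] ^= 0x02                  # flip the universal/local bit
    b[3:3] = b'\xff\xfe'          # insert ff:fe between the two MAC halves
    # Format as four 16-bit groups, dropping leading zeros like IPv6 notation
    return ':'.join(f'{(b[i] << 8) | b[i + 1]:x}' for i in range(0, 8, 2))

print(eui64_interface_id('00:15:5d:00:00:00'))  # 215:5dff:fe00:0
```

Note that this reproduces the interface-ID portion of the link-local example shown earlier (fe80::215:5dff:fe00:0).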

IPv4 to IPv6 Transition

The world is slowly transitioning. Several mechanisms help:

Dual Stack

Devices run both IPv4 and IPv6:

┌─────────────────────────────────────┐
│            Application              │
├──────────────┬──────────────────────┤
│     IPv4     │        IPv6          │
├──────────────┼──────────────────────┤
│   Network Interface                 │
└─────────────────────────────────────┘

Device has both:
  IPv4: 192.168.1.100
  IPv6: 2001:db8::1234
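
A dual-stack client typically discovers both address families via getaddrinfo; a minimal probe (the hostname is an example, and the results depend on the host's resolver configuration):

```python
import socket

# Ask the resolver for all families; a dual-stack host returns A and AAAA results
results = socket.getaddrinfo('localhost', 80, proto=socket.IPPROTO_TCP)
for family, _type, _proto, _canon, sockaddr in results:
    label = {socket.AF_INET: 'IPv4', socket.AF_INET6: 'IPv6'}.get(family, str(family))
    print(label, sockaddr[0])
```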

Tunneling

IPv6 packets wrapped in IPv4 to cross IPv4-only networks:

┌───────────────────────────────────────────────────────┐
│ IPv4 Header                                           │
│ (src: 203.0.113.1, dst: 198.51.100.1)                │
├───────────────────────────────────────────────────────┤
│ ┌───────────────────────────────────────────────────┐ │
│ │ IPv6 Header                                       │ │
│ │ (src: 2001:db8::1, dst: 2001:db8::2)              │ │
│ ├───────────────────────────────────────────────────┤ │
│ │ Original Data                                     │ │
│ └───────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────┘

NAT64/DNS64

Allows IPv6-only devices to reach IPv4 servers:

IPv6-only Client                NAT64 Gateway              IPv4 Server
      │                              │                          │
      │─────IPv6 packet──────────────>│                          │
      │  dst: 64:ff9b::93.184.216.34 │                          │
      │                              │──────IPv4 packet────────>│
      │                              │  dst: 93.184.216.34      │
      │                              │                          │
      │                              │<─────IPv4 response───────│
      │<─────IPv6 response───────────│                          │
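
The gateway's address mapping is simple arithmetic: per RFC 6052, the IPv4 address becomes the low 32 bits of the 64:ff9b::/96 prefix. A sketch (the helper name is ours):

```python
import ipaddress

# RFC 6052: the IPv4 address occupies the low 32 bits of the 64:ff9b::/96 prefix
def nat64_map(ipv4):
    base = int(ipaddress.IPv6Address('64:ff9b::'))
    return ipaddress.IPv6Address(base | int(ipaddress.IPv4Address(ipv4)))

print(nat64_map('93.184.216.34'))                       # 64:ff9b::5db8:d822
# The dotted suffix used in the diagram is equivalent, valid IPv6 notation:
print(ipaddress.IPv6Address('64:ff9b::93.184.216.34'))  # 64:ff9b::5db8:d822
```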

Working with IPv6

Command Line

# Show IPv6 addresses
$ ip -6 addr show
2: eth0: <BROADCAST,MULTICAST,UP>
    inet6 2001:db8::1/64 scope global
    inet6 fe80::1/64 scope link

# Ping IPv6
$ ping6 ::1
$ ping -6 google.com

# Trace route
$ traceroute6 google.com

# DNS lookup
$ dig AAAA google.com

In URLs

IPv6 addresses in URLs must be bracketed:

http://[2001:db8::1]:8080/path
       └─────────────┘
       IPv6 address in brackets

Without brackets, colons are ambiguous:
http://2001:db8::1:8080  ← Is 8080 the port or part of address?

Python

import ipaddress

# Parse IPv6
ip = ipaddress.ip_address('2001:db8::1')
print(ip.is_link_local)   # False
print(ip.is_private)      # True (2001:db8::/32 is the documentation range)
print(ip.exploded)        # 2001:0db8:0000:0000:0000:0000:0000:0001

# Network operations
net = ipaddress.ip_network('2001:db8::/32')
print(net.num_addresses)  # 79228162514264337593543950336

# Socket programming
import socket
sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
sock.connect(('2001:db8::1', 80))

IPv6 Adoption Status

As of recent measurements:

  • ~40% of Google traffic is over IPv6
  • Major cloud providers fully support IPv6
  • Mobile networks often IPv6-primary
  • Many ISPs support IPv6 (but not all)
Adoption varies by region:
  India:    ~70% IPv6
  USA:      ~50% IPv6
  Germany:  ~60% IPv6
  China:    ~30% IPv6
  Global:   ~40% IPv6 (and growing)

Practical Considerations

When You Need IPv6

  • Modern mobile app development
  • IoT devices (often IPv6-only)
  • Reaching IPv6-only users
  • Future-proofing infrastructure

Common Issues

“Network unreachable” to IPv6 addresses

  • Your network may not have IPv6 connectivity
  • Check: ping6 ::1 (should work - loopback)
  • Check: ping6 google.com (needs IPv6 internet)

Application doesn’t support IPv6

  • Some older software hardcodes IPv4
  • Check for IPv6/dual-stack support in dependencies

Firewall not configured for IPv6

  • IPv6 rules are often separate from IPv4
  • Don’t forget to configure both!

Summary

IPv6 solves IPv4’s address exhaustion with a vastly larger address space:

Feature          IPv4              IPv6
──────────────────────────────────────────────────────────
Address size     32 bits           128 bits
Address format   Dotted decimal    Colon hexadecimal
Header size      Variable (20-60)  Fixed (40)
Address config   DHCP or manual    SLAAC, DHCPv6, or manual
NAT              Common            Generally unnecessary
IPsec            Optional          Built-in

The transition to IPv6 is ongoing but inevitable. New projects should support both protocols.

Next, we’ll look at subnetting—how to divide networks into smaller, manageable pieces.

Subnetting

Subnetting is the practice of dividing a larger network into smaller, logical segments. It’s a fundamental skill for network design and one of the most practical topics in IP networking.

Why Subnet?

Without subnetting, you’d have one flat network for all devices:

Single Network (No Subnetting):

┌─────────────────────────────────────────────────────────────┐
│   All 65,534 possible hosts on one network segment         │
│                                                             │
│   ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ... thousands │
│   │PC1│ │PC2│ │PC3│ │Srv│ │Dev│ │IoT│ │...│     more      │
│   └───┘ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘               │
│                                                             │
│   Problems:                                                 │
│   - Broadcast traffic reaches everyone                      │
│   - No isolation between departments                        │
│   - Security is harder to manage                            │
│   - Single failure can affect everyone                      │
└─────────────────────────────────────────────────────────────┘

With subnetting:

Subnetted Network:

┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐
│   Engineering      │ │    Marketing       │ │      Servers       │
│   192.168.1.0/26   │ │   192.168.1.64/26  │ │  192.168.1.128/26  │
│                    │ │                    │ │                    │
│  ┌───┐ ┌───┐ ┌───┐ │ │  ┌───┐ ┌───┐      │ │  ┌───┐ ┌───┐      │
│  │PC1│ │PC2│ │PC3│ │ │  │PC4│ │PC5│      │ │  │Web│ │DB │      │
│  └───┘ └───┘ └───┘ │ │  └───┘ └───┘      │ │  └───┘ └───┘      │
│   62 usable hosts  │ │   62 usable hosts  │ │   62 usable hosts  │
└─────────┬──────────┘ └─────────┬──────────┘ └─────────┬──────────┘
          │                      │                      │
          └──────────────────────┼──────────────────────┘
                                 │
                            ┌────┴────┐
                            │ Router  │
                            └─────────┘

Benefits:

  • Broadcast containment: Broadcasts stay within subnet
  • Security: Apply different policies to different subnets
  • Organization: Logical grouping of related systems
  • Performance: Less broadcast traffic per segment
  • Troubleshooting: Easier to isolate issues

Understanding CIDR Notation

CIDR (Classless Inter-Domain Routing) notation specifies how many bits are used for the network portion:

192.168.1.0/24
            └── 24 network bits, 8 host bits

Binary breakdown:
Address:    11000000.10101000.00000001.00000000
            └────────── 24 bits ──────────┘└ 8 ┘
                     Network           Host

Subnet mask: 255.255.255.0
             11111111.11111111.11111111.00000000
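
The same breakdown can be generated programmatically:

```python
import ipaddress

# 192.168.1.0/24: the first 24 bits select the network, the last 8 the host
addr = ipaddress.ip_interface('192.168.1.0/24')
bits = format(int(addr.ip), '032b')
print(bits[:24], '|', bits[24:])  # network bits | host bits
print(addr.netmask)               # 255.255.255.0
```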

Common CIDR Blocks

CIDR   Subnet Mask        Network Bits   Host Bits   Usable Hosts
──────────────────────────────────────────────────────────────────
/8     255.0.0.0          8              24          16,777,214
/16    255.255.0.0        16             16          65,534
/24    255.255.255.0      24             8           254
/25    255.255.255.128    25             7           126
/26    255.255.255.192    26             6           62
/27    255.255.255.224    27             5           30
/28    255.255.255.240    28             4           14
/29    255.255.255.248    29             3           6
/30    255.255.255.252    30             2           2
/31    255.255.255.254    31             1           2*
/32    255.255.255.255    32             0           1

/31 is special: Used for point-to-point links (no broadcast needed).
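
The mask and usable-host columns can be regenerated with a few lines of Python (10.0.0.0 is just a placeholder network):

```python
import ipaddress

# Derive mask and usable hosts for each prefix length in the table
for prefix in (24, 25, 26, 27, 28, 29, 30):
    net = ipaddress.ip_network(f'10.0.0.0/{prefix}')
    print(f'/{prefix}  {net.netmask}  {net.num_addresses - 2} usable hosts')
```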

Why “Usable Hosts” Is Less Than 2^n

Two addresses in every subnet are reserved:

  • Network address: All host bits = 0 (identifies the subnet)
  • Broadcast address: All host bits = 1 (reaches all hosts)
192.168.1.0/24:
  Network:   192.168.1.0     (first address)
  Hosts:     192.168.1.1 - 192.168.1.254
  Broadcast: 192.168.1.255   (last address)

Usable = 2^(host bits) - 2 = 256 - 2 = 254

Calculating Subnets

Method 1: Binary Calculation

Given 192.168.1.0/24, create 4 subnets:

Step 1: Determine bits needed
  4 subnets = 2^2, so we need 2 additional network bits
  New prefix: /24 + 2 = /26

Step 2: Calculate subnet size
  Host bits = 32 - 26 = 6
  Addresses per subnet = 2^6 = 64
  Usable hosts = 64 - 2 = 62

Step 3: List subnets (increment by 64)
  Subnet 0: 192.168.1.0/26    (hosts .1-.62,   broadcast .63)
  Subnet 1: 192.168.1.64/26   (hosts .65-.126, broadcast .127)
  Subnet 2: 192.168.1.128/26  (hosts .129-.190, broadcast .191)
  Subnet 3: 192.168.1.192/26  (hosts .193-.254, broadcast .255)

Method 2: The “Magic Number” Method

The “magic number” is 256 minus the last non-zero octet of the subnet mask:

For /26: Mask = 255.255.255.192
Magic number = 256 - 192 = 64

Subnets start at multiples of 64:
  192.168.1.0, 192.168.1.64, 192.168.1.128, 192.168.1.192

Subnet Calculation Chart

CIDR   Mask               Magic#   Subnets (from /24)   Hosts   Fraction of /24
────────────────────────────────────────────────────────────────────────────────
/25    255.255.255.128     128             2             126         1/2
/26    255.255.255.192      64             4              62         1/4
/27    255.255.255.224      32             8              30         1/8
/28    255.255.255.240      16            16              14         1/16
/29    255.255.255.248       8            32               6         1/32
/30    255.255.255.252       4            64               2         1/64

Practical Examples

Example 1: Office Network Design

Requirement: Design a network for a small office with:

  • 50 employees (workstations)
  • 10 servers
  • 5 network devices
  • Room for 50% growth

Given: 192.168.10.0/24

Solution:

Department        Hosts Needed   Subnet         Usable Range
────────────────────────────────────────────────────────────────
Workstations      50 (→75)       /25 (126)      192.168.10.0/25
                                                .1 - .126
Servers           10 (→15)       /27 (30)       192.168.10.128/27
                                                .129 - .158
Network Devices   5 (→8)         /28 (14)       192.168.10.160/28
                                                .161 - .174
Future Use        -              /28 (14)       192.168.10.176/28
Management        -              /28 (14)       192.168.10.192/28

Remaining: 192.168.10.208 - 192.168.10.255 (a /28 plus a /27)

Example 2: Finding Subnet for an IP

Question: What subnet does 192.168.1.147/26 belong to?

Step 1: Find the magic number
  /26 mask = 255.255.255.192
  Magic = 256 - 192 = 64

Step 2: Find which multiple of 64 contains .147
  0, 64, 128, 192...
  128 ≤ 147 < 192

Step 3: Answer
  Network: 192.168.1.128/26
  Range: 192.168.1.128 - 192.168.1.191
  Broadcast: 192.168.1.191

Example 3: Are Two IPs on Same Subnet?

Question: Are 10.1.1.50/28 and 10.1.1.60/28 on the same subnet?

For /28: Magic number = 256 - 240 = 16

10.1.1.50: Falls in 10.1.1.48/28 (48 ≤ 50 < 64)
10.1.1.60: Falls in 10.1.1.48/28 (48 ≤ 60 < 64)

Answer: Yes, same subnet (10.1.1.48/28)
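
Both answers can be verified with ipaddress.ip_interface, which pairs a host address with its prefix and derives the containing subnet:

```python
import ipaddress

# Example 2: which /26 contains .147?
iface = ipaddress.ip_interface('192.168.1.147/26')
print(iface.network)                    # 192.168.1.128/26
print(iface.network.broadcast_address)  # 192.168.1.191

# Example 3: same /28 subnet?
a = ipaddress.ip_interface('10.1.1.50/28').network
b = ipaddress.ip_interface('10.1.1.60/28').network
print(a == b)  # True (both are 10.1.1.48/28)
```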

VLSM (Variable Length Subnet Mask)

VLSM allows different subnets to have different sizes, optimizing address usage:

Without VLSM (Fixed /26):
┌──────────────────────────────────────────────────────────────┐
│  Dept A: 60 hosts    │  Dept B: 10 hosts   │  Links: 2 hosts │
│  /26 (62 usable) ✓   │  /26 (62 usable)    │  /26 (62 usable)│
│                      │  52 wasted!          │  60 wasted!     │
└──────────────────────────────────────────────────────────────┘

With VLSM:
┌──────────────────────────────────────────────────────────────┐
│  Dept A: 60 hosts    │  Dept B: 10 hosts   │  Links: 2 hosts │
│  /26 (62 usable) ✓   │  /28 (14 usable) ✓  │  /30 (2 usable)✓│
│                      │  4 spare             │  0 wasted       │
└──────────────────────────────────────────────────────────────┘

VLSM Planning Process

  1. List requirements from largest to smallest
  2. Assign subnets starting with largest
  3. Use remaining space for smaller subnets
Given: 172.16.0.0/16
Requirements:
  - Engineering: 500 hosts
  - Sales: 100 hosts
  - HR: 50 hosts
  - Point-to-point links: 4 (need 2 hosts each)

Allocation:
  Engineering:  172.16.0.0/23    (510 hosts)   172.16.0.1 - 172.16.1.254
  Sales:        172.16.2.0/25    (126 hosts)   172.16.2.1 - 172.16.2.126
  HR:           172.16.2.128/26  (62 hosts)    172.16.2.129 - 172.16.2.190
  Link 1:       172.16.2.192/30  (2 hosts)     172.16.2.193 - 172.16.2.194
  Link 2:       172.16.2.196/30  (2 hosts)     172.16.2.197 - 172.16.2.198
  Link 3:       172.16.2.200/30  (2 hosts)     172.16.2.201 - 172.16.2.202
  Link 4:       172.16.2.204/30  (2 hosts)     172.16.2.205 - 172.16.2.206

  Remaining:    172.16.2.208 - 172.16.255.255 (available for future)
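
The largest-first process can be sketched in code. This is a simplified allocator (our own helper, not a standard tool): it assumes serving requirements in descending size keeps each subnet aligned, and it does no error handling for pool exhaustion.

```python
import ipaddress
import math

def vlsm(block, needs):
    # Serve the largest requirement first so each carved subnet stays aligned
    cursor = ipaddress.ip_network(block).network_address
    plan = []
    for name, hosts in sorted(needs, key=lambda kv: -kv[1]):
        # +2 accounts for the network and broadcast addresses
        prefix = 32 - max(2, math.ceil(math.log2(hosts + 2)))
        net = ipaddress.ip_network(f'{cursor}/{prefix}')
        plan.append((name, net))
        cursor = net.broadcast_address + 1
    return plan

for name, net in vlsm('172.16.0.0/16',
                      [('Engineering', 500), ('Sales', 100), ('HR', 50), ('Link 1', 2)]):
    print(f'{name:<12} {net}')
```

Running this reproduces the allocation above: 172.16.0.0/23, 172.16.2.0/25, 172.16.2.128/26, 172.16.2.192/30.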

Supernetting (Route Aggregation)

Supernetting (or CIDR aggregation) combines multiple smaller networks into one larger route:

Before aggregation (4 routes):
  192.168.0.0/24
  192.168.1.0/24
  192.168.2.0/24
  192.168.3.0/24

After aggregation (1 route):
  192.168.0.0/22

This reduces routing table size and improves router efficiency.

Binary visualization:
  192.168.0.0  = 11000000.10101000.000000|00.00000000
  192.168.1.0  = 11000000.10101000.000000|01.00000000
  192.168.2.0  = 11000000.10101000.000000|10.00000000
  192.168.3.0  = 11000000.10101000.000000|11.00000000
                                  └──────┘
                                  These bits vary

  Common prefix: 22 bits → /22 covers all four
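
Python's ipaddress.collapse_addresses performs exactly this aggregation:

```python
import ipaddress

# Four contiguous /24s collapse into one /22 route
routes = [ipaddress.ip_network(f'192.168.{i}.0/24') for i in range(4)]
for agg in ipaddress.collapse_addresses(routes):
    print(agg)  # 192.168.0.0/22
```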

IPv6 Subnetting

IPv6 subnetting is conceptually similar but the numbers are larger:

Standard allocation:
  ISP receives:     /32 or /48 from registry
  Organization gets: /48 from ISP
  Site/Subnet:      /64 (standard LAN)

/48 to /64 gives: 16 bits = 65,536 subnets
Each /64 has:     64 bits for hosts = 2^64 addresses

Example:
  Organization: 2001:db8:abcd::/48

  Subnets:
    2001:db8:abcd:0000::/64  - HQ Floor 1
    2001:db8:abcd:0001::/64  - HQ Floor 2
    2001:db8:abcd:0002::/64  - HQ Servers
    ...
    2001:db8:abcd:ffff::/64  - 65,536th subnet
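
Enumerating those /64s in Python:

```python
import ipaddress

org = ipaddress.ip_network('2001:db8:abcd::/48')
gen = org.subnets(new_prefix=64)  # yields 2**16 = 65,536 subnets
print(next(gen))  # 2001:db8:abcd::/64
print(next(gen))  # 2001:db8:abcd:1::/64
```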

Tools for Subnetting

Command Line

# ipcalc (Linux)
$ ipcalc 192.168.1.0/26
Address:   192.168.1.0          11000000.10101000.00000001.00 000000
Netmask:   255.255.255.192 = 26 11111111.11111111.11111111.11 000000
Wildcard:  0.0.0.63             00000000.00000000.00000000.00 111111
Network:   192.168.1.0/26       11000000.10101000.00000001.00 000000
HostMin:   192.168.1.1          11000000.10101000.00000001.00 000001
HostMax:   192.168.1.62         11000000.10101000.00000001.00 111110
Broadcast: 192.168.1.63         11000000.10101000.00000001.00 111111
Hosts/Net: 62

# sipcalc (more features)
$ sipcalc 192.168.1.0/24 -s 26

Python

import ipaddress

# Create network
network = ipaddress.ip_network('192.168.1.0/24')

# Get subnet info
print(f"Network: {network.network_address}")
print(f"Netmask: {network.netmask}")
print(f"Broadcast: {network.broadcast_address}")
print(f"Hosts: {network.num_addresses - 2}")

# Divide into subnets
subnets = list(network.subnets(new_prefix=26))
for subnet in subnets:
    print(f"  {subnet}")
# Output:
#   192.168.1.0/26
#   192.168.1.64/26
#   192.168.1.128/26
#   192.168.1.192/26

# Check if IP is in network
ip = ipaddress.ip_address('192.168.1.100')
print(ip in network)  # True

Common Mistakes

  1. Forgetting reserved addresses

    • Always subtract 2 from total for usable hosts
  2. Overlapping subnets

    • 192.168.1.0/25 and 192.168.1.64/26 overlap!
    • Plan carefully, especially with VLSM
  3. Not planning for growth

    • Networks grow; leave room for expansion
  4. Using /30 for LANs

    • /30 is for point-to-point links only
    • LANs need room for multiple hosts

Summary

Subnetting divides networks for better organization, security, and efficiency:

  • CIDR notation (/24) indicates network vs. host bits
  • Subnet mask shows the network boundary
  • Magic number (256 - mask octet) gives subnet size
  • VLSM allows different-sized subnets for efficiency
  • Supernetting aggregates routes for simpler routing

Practice is key—work through examples until it becomes intuitive.

Next, we’ll explore how packets actually find their way across networks: routing fundamentals.

Routing Fundamentals

Routing is the process of selecting paths for network traffic. When you send a packet to a destination across the internet, it passes through many intermediate devices (routers) that make forwarding decisions. Understanding routing helps you debug connectivity issues and design better network architectures.

The Core Concept

Routing works through a simple, repeated process:

At each router:
┌─────────────────────────────────────────────────────────────┐
│  1. Receive packet                                          │
│  2. Examine destination IP address                          │
│  3. Consult routing table                                   │
│  4. Forward packet to next hop (or deliver locally)         │
│  5. Decrement TTL                                           │
│  6. Forget about the packet                                 │
└─────────────────────────────────────────────────────────────┘

No router knows the complete path. Each makes a local decision.

This hop-by-hop routing is fundamental to the internet’s scalability and resilience.

Direct vs. Indirect Delivery

When a host wants to send a packet, it first determines if the destination is local (same network) or remote (different network):

Source: 192.168.1.100/24

Case 1: Destination 192.168.1.200 (same network)
┌───────────────────────────────────────────────────────────────┐
│  Apply subnet mask:                                           │
│    192.168.1.100 AND 255.255.255.0 = 192.168.1.0              │
│    192.168.1.200 AND 255.255.255.0 = 192.168.1.0              │
│  Same network! → Direct delivery via ARP                      │
└───────────────────────────────────────────────────────────────┘

Case 2: Destination 10.0.0.50 (different network)
┌───────────────────────────────────────────────────────────────┐
│  Apply subnet mask:                                           │
│    192.168.1.100 AND 255.255.255.0 = 192.168.1.0              │
│    10.0.0.50 AND 255.255.255.0 = 10.0.0.0                     │
│  Different networks! → Send to default gateway                │
└───────────────────────────────────────────────────────────────┘
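
The mask-and-compare test is one line with ipaddress (addresses taken from the scenario above):

```python
import ipaddress

local = ipaddress.ip_interface('192.168.1.100/24')

def is_local(dst):
    # The same test the host performs: does dst fall inside my subnet?
    return ipaddress.ip_address(dst) in local.network

print(is_local('192.168.1.200'))  # True  -> direct delivery via ARP
print(is_local('10.0.0.50'))      # False -> send to the default gateway
```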

The Routing Table

A routing table maps destination networks to next hops. Every device with IP networking has one:

$ ip route show  # Linux
$ netstat -rn    # Linux/Mac
$ route print    # Windows

Example output:
┌──────────────────┬──────────────────┬─────────────┬───────────┐
│   Destination    │     Gateway      │   Iface     │  Metric   │
├──────────────────┼──────────────────┼─────────────┼───────────┤
│   0.0.0.0/0      │   192.168.1.1    │    eth0     │    100    │
│   192.168.1.0/24 │   0.0.0.0        │    eth0     │    0      │
│   10.10.0.0/16   │   192.168.1.254  │    eth0     │    100    │
│   127.0.0.0/8    │   0.0.0.0        │    lo       │    0      │
└──────────────────┴──────────────────┴─────────────┴───────────┘

Entry meanings:
- 0.0.0.0/0: Default route ("everything else") → send to 192.168.1.1
- 192.168.1.0/24: Local network → deliver directly (0.0.0.0 gateway)
- 10.10.0.0/16: Route to remote network → via 192.168.1.254
- 127.0.0.0/8: Loopback → handled locally

Routing Table Lookup

When forwarding a packet, the router finds the most specific matching route (longest prefix match):

Destination: 10.10.5.100

Routing table entries:
  0.0.0.0/0      → Gateway A (default)
  10.0.0.0/8     → Gateway B
  10.10.0.0/16   → Gateway C
  10.10.5.0/24   → Gateway D

Matching process:
  0.0.0.0/0    - Matches (but only 0 bits specific)
  10.0.0.0/8   - Matches (8 bits specific)
  10.10.0.0/16 - Matches (16 bits specific)
  10.10.5.0/24 - Matches (24 bits specific) ← WINNER

Result: Forward to Gateway D (most specific match)
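
A toy longest-prefix-match lookup over this table (the gateway names are just labels from the example, and real routers use far faster trie-based structures):

```python
import ipaddress

table = {
    ipaddress.ip_network('0.0.0.0/0'):    'Gateway A',
    ipaddress.ip_network('10.0.0.0/8'):   'Gateway B',
    ipaddress.ip_network('10.10.0.0/16'): 'Gateway C',
    ipaddress.ip_network('10.10.5.0/24'): 'Gateway D',
}

def lookup(dst):
    ip = ipaddress.ip_address(dst)
    matches = [net for net in table if ip in net]
    return table[max(matches, key=lambda net: net.prefixlen)]  # most specific wins

print(lookup('10.10.5.100'))  # Gateway D
print(lookup('10.10.9.9'))    # Gateway C (no /24 covers it, but the /16 does)
```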

Static vs. Dynamic Routing

Static Routing

Routes manually configured by an administrator:

# Add a static route (Linux)
$ sudo ip route add 10.20.0.0/16 via 192.168.1.254

# Persistent (varies by distro, often in /etc/network/interfaces or netplan)

Pros:

  • Simple, predictable
  • No protocol overhead
  • Good for small, stable networks

Cons:

  • Doesn’t adapt to failures
  • Tedious for large networks
  • Error-prone at scale

Dynamic Routing

Routes learned automatically via routing protocols:

┌─────────────────────────────────────────────────────────────┐
│                    Routing Protocols                        │
├─────────────────────────────────────────────────────────────┤
│  Interior Gateway Protocols (within organization):          │
│    RIP   - Simple, distance-vector, limited scale          │
│    OSPF  - Link-state, widely used, complex                │
│    EIGRP - Cisco proprietary, efficient                    │
│    IS-IS - Link-state, used by large ISPs                  │
│                                                             │
│  Exterior Gateway Protocol (between organizations):         │
│    BGP   - Border Gateway Protocol, runs the internet      │
└─────────────────────────────────────────────────────────────┘

Pros:

  • Automatically adapts to failures
  • Scales to large networks
  • Finds optimal paths

Cons:

  • Protocol overhead
  • More complex to configure
  • Convergence time during changes

How Routing Protocols Work

Distance-Vector (RIP)

Routers share their entire routing table with neighbors periodically:

Initial state:
  Router A ─────── Router B ─────── Router C
  Knows:          Knows:          Knows:
  Net 1           Net 2           Net 3

After exchange:
  Router A        Router B        Router C
  Knows:          Knows:          Knows:
  Net 1 (direct)  Net 1 (via A)   Net 1 (via B)
  Net 2 (via B)   Net 2 (direct)  Net 2 (via B)
  Net 3 (via B)   Net 3 (via C)   Net 3 (direct)

Link-State (OSPF)

Each router learns the complete network topology and calculates best paths:

1. Each router discovers neighbors
2. Each router floods link-state info to all routers
3. Every router has identical network map
4. Each router independently calculates best paths (Dijkstra's algorithm)

Advantage: Faster convergence, no routing loops during transition
Disadvantage: More memory and CPU intensive

BGP: The Internet’s Routing Protocol

BGP (Border Gateway Protocol) is how autonomous systems (AS) exchange routing information:

┌──────────────────┐          ┌──────────────────┐
│     AS 65001     │          │     AS 65002     │
│   (Your ISP)     │──BGP────│   (Another ISP)  │
│                  │          │                  │
│  Announces:      │          │  Announces:      │
│  203.0.113.0/24  │          │  198.51.100.0/24 │
└──────────────────┘          └──────────────────┘

BGP characteristics:
- Path-vector protocol (tracks AS path)
- Policy-based routing (not just shortest path)
- Slow convergence (stability over speed)
- ~900,000+ routes in global table

BGP Path Selection

BGP chooses routes based on multiple criteria (simplified):

  1. Highest local preference
  2. Shortest AS path
  3. Lowest origin type
  4. Lowest MED (Multi-Exit Discriminator)
  5. Prefer eBGP over iBGP
  6. Lowest IGP metric to next hop
  7. … (many more tie-breakers)

Routing in Action

Let’s trace a packet from your laptop to a web server:

Your Laptop (192.168.1.100)
    │
    │ Destination: 93.184.216.34 (example.com)
    │ Different network → send to default gateway
    ▼
Home Router (192.168.1.1)
    │
    │ Routing table: default route → ISP
    ▼
ISP Router #1
    │
    │ BGP table: 93.184.216.0/24 → via AS 15133
    │ (Multiple paths available, chooses best)
    ▼
ISP Router #2
    │
    │ BGP: Next hop toward destination AS
    ▼
    ... (several more hops) ...
    │
    ▼
Destination Router
    │
    │ 93.184.216.0/24 is directly connected
    │ ARP for 93.184.216.34, deliver to server
    ▼
Web Server (93.184.216.34)

Traceroute: Seeing the Path

Traceroute reveals the path packets take by exploiting TTL:

$ traceroute example.com
 1  192.168.1.1 (192.168.1.1)     1.234 ms
 2  96.120.92.1 (96.120.92.1)     12.456 ms
 3  68.86.90.137 (68.86.90.137)   15.789 ms
 4  * * *                         (no response)
 5  be-33651-cr02.nyc (66.109.6.81) 25.123 ms
 6  93.184.216.34 (93.184.216.34) 28.456 ms

How it works:
  Send packet with TTL=1 → First router replies "TTL exceeded"
  Send packet with TTL=2 → Second router replies
  Send packet with TTL=3 → Third router replies
  ... continue until destination reached

Reading Traceroute Output

Hop 4: * * *

This means:
- Router didn't respond to traceroute probes
- Could be: firewall blocking, ICMP rate limiting, high latency
- Packets might still pass through this router fine
- Not necessarily a problem

Multiple times per hop:
 3  68.86.90.137  15.789 ms  16.123 ms  14.567 ms
    └─────────────┴─ Three separate probes, showing latency variation

Common Routing Issues

Routing Loops

Misconfiguration can cause packets to circle:

Router A: "To reach 10.0.0.0/8, send to B"
Router B: "To reach 10.0.0.0/8, send to C"
Router C: "To reach 10.0.0.0/8, send to A"

Packet bounces: A → B → C → A → B → C → ...
Until TTL reaches 0!

Solutions:
- TTL prevents infinite loops
- Routing protocols have loop prevention (split horizon, etc.)
- BGP uses AS path to detect loops

Asymmetric Routing

Outbound and inbound paths can differ:

Request:  A → B → C → D → Server
Response: Server → E → F → A

This is normal and common!
Can complicate:
- Troubleshooting
- Stateful firewalls
- Performance analysis

Black Holes

Traffic enters but doesn’t come out:

Causes:
- Null route (route to nowhere)
- Firewall silently drops
- Network failure with no alternative path
- MTU issues (packets too large)

Debugging:
- Traceroute to find where packets stop
- Check routing tables
- Verify firewall rules

Routing Table Management

Viewing Routes

# Linux
$ ip route show
$ ip -6 route show  # IPv6

# macOS
$ netstat -rn

# Windows
> route print

Adding/Removing Routes

# Linux - add route
$ sudo ip route add 10.20.0.0/16 via 192.168.1.254

# Linux - remove route
$ sudo ip route del 10.20.0.0/16

# Linux - change default gateway
$ sudo ip route replace default via 192.168.1.1

# Make persistent (varies by distro)
# Ubuntu/Debian: /etc/netplan/*.yaml
# RHEL/CentOS: /etc/sysconfig/network-scripts/route-eth0

Summary

Routing is the backbone of internet connectivity:

  • Hop-by-hop forwarding: Each router makes local decisions
  • Routing tables: Map destinations to next hops
  • Longest prefix match: Most specific route wins
  • Static routing: Manual configuration for simple networks
  • Dynamic routing: Protocols (OSPF, BGP) for automatic adaptation
  • BGP: The protocol that makes the internet work
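Longest prefix match, listed above, can be sketched with Python's standard ipaddress module (the routes and next hops here are invented examples):

```python
# Longest-prefix-match lookup over a toy routing table.
import ipaddress

routes = {
    "0.0.0.0/0":    "192.168.1.1",   # default route
    "10.0.0.0/8":   "10.0.0.1",
    "10.20.0.0/16": "10.20.0.1",
}

def lookup(dest):
    """Return the next hop of the most specific matching route."""
    addr = ipaddress.ip_address(dest)
    matches = [ipaddress.ip_network(p) for p in routes
               if addr in ipaddress.ip_network(p)]
    best = max(matches, key=lambda n: n.prefixlen)  # longest prefix wins
    return routes[str(best)]

print(lookup("10.20.5.5"))   # most specific match: 10.20.0.0/16
print(lookup("10.99.0.1"))   # falls back to 10.0.0.0/8
print(lookup("8.8.8.8"))     # only the default route matches
```

Real routers use trie structures for this lookup rather than a linear scan, but the "most specific route wins" rule is the same.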

Key debugging tools:

  • ip route / netstat -rn: View routing tables
  • traceroute / tracert: See packet paths
  • ping: Test basic connectivity

Understanding routing helps you diagnose why packets aren’t reaching their destination and design networks that are resilient to failures.

Next, we’ll cover IP fragmentation—what happens when packets are too large for a network link.

IP Fragmentation

Different network links have different maximum packet sizes. When a packet is too large for a link, it must be fragmented—split into smaller pieces. Understanding fragmentation helps you diagnose performance problems and configure networks properly.

MTU: Maximum Transmission Unit

The MTU is the largest packet size a link can carry:

┌─────────────────────────────────────────────────────────────┐
│                    Common MTU Values                        │
├─────────────────────────────────────────────────────────────┤
│  Ethernet:              1500 bytes (standard)               │
│  Jumbo Frames:          9000 bytes (data centers)           │
│  PPPoE (DSL):           1492 bytes                          │
│  VPN Tunnels:           ~1400-1450 bytes (overhead)         │
│  IPv6 minimum:          1280 bytes                          │
│  Dial-up (PPP):         576 bytes (historical)              │
└─────────────────────────────────────────────────────────────┘

When a packet exceeds the outgoing link’s MTU, something must happen.

How Fragmentation Works

IPv4 routers can fragment packets when needed:

Original Packet (3000 bytes payload + 20 byte header = 3020 bytes)
┌──────────────────────────────────────────────────────────────┐
│IP Hdr│                     Payload (3000 bytes)              │
│ 20B  │                                                       │
└──────────────────────────────────────────────────────────────┘

Link MTU: 1500 bytes
Max payload per fragment: 1500 - 20 = 1480 bytes

After Fragmentation:
┌────────────────────────────────┐
│IP Hdr│  Fragment 1 (1480 B)   │  Offset: 0, MF=1
│ 20B  │  ID: 12345             │
└────────────────────────────────┘
┌────────────────────────────────┐
│IP Hdr│  Fragment 2 (1480 B)   │  Offset: 1480, MF=1
│ 20B  │  ID: 12345             │
└────────────────────────────────┘
┌─────────────────────┐
│IP Hdr│ Fragment 3   │  Offset: 2960, MF=0 (last fragment)
│ 20B  │   (40 B)     │  ID: 12345
└─────────────────────┘

Fragmentation Header Fields

Three IP header fields manage fragmentation:

┌───────────────────────────────────────────────────────────────┐
│  Identification (16 bits)                                     │
│    Unique ID for the original packet                          │
│    All fragments share the same ID                            │
├───────────────────────────────────────────────────────────────┤
│  Flags (3 bits)                                               │
│    Bit 0: Reserved (must be 0)                                │
│    Bit 1: DF (Don't Fragment)                                 │
│            0 = May fragment                                   │
│            1 = Don't fragment (drop if too big)               │
│    Bit 2: MF (More Fragments)                                 │
│            0 = Last fragment (or unfragmented)                │
│            1 = More fragments follow                          │
├───────────────────────────────────────────────────────────────┤
│  Fragment Offset (13 bits)                                    │
│    Position of this fragment in original packet               │
│    Measured in 8-byte units (not bytes!)                      │
│    Max offset: 8191 × 8 = 65,528 bytes                        │
└───────────────────────────────────────────────────────────────┘

Fragment Offset Calculation

Fragment offsets must be multiples of 8 bytes:

Original payload: 3000 bytes
MTU: 1500 bytes
Max payload per fragment: 1480 bytes (must be multiple of 8)

Fragment 1:
  Offset: 0 (bytes) / 8 = 0
  Size: 1480 bytes
  MF: 1 (more fragments)

Fragment 2:
  Offset: 1480 / 8 = 185
  Size: 1480 bytes
  MF: 1 (more fragments)

Fragment 3:
  Offset: 2960 / 8 = 370
  Size: 40 bytes (remaining)
  MF: 0 (last fragment)
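The fragment table above can be recomputed programmatically. A small Python sketch of the offset arithmetic (the fragment function is illustrative, not a real stack's API):

```python
# Split a payload across an MTU, honoring IPv4's rule that every
# non-final fragment's data length is a multiple of 8 bytes.

IP_HEADER = 20

def fragment(payload_len, mtu):
    """Return (offset_in_8_byte_units, size, more_fragments) tuples."""
    max_data = (mtu - IP_HEADER) // 8 * 8   # largest multiple of 8 that fits
    frags, sent = [], 0
    while sent < payload_len:
        size = min(max_data, payload_len - sent)
        more = 1 if sent + size < payload_len else 0
        frags.append((sent // 8, size, more))
        sent += size
    return frags

print(fragment(3000, 1500))   # reproduces the three fragments above
```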

Reassembly

Fragments are reassembled only at the final destination, not at intermediate routers:

Sender → Router1 → Router2 → Router3 → Receiver
                                        │
                                        ▼
                             ┌──────────────────┐
                             │   Reassembly     │
                             │                  │
                             │  Wait for all    │
                             │  fragments with  │
                             │  same ID         │
                             │                  │
                             │  Arrange by      │
                             │  offset          │
                             │                  │
                             │  Check MF=0 for  │
                             │  last piece      │
                             │                  │
                             │  Reconstruct     │
                             │  original packet │
                             └──────────────────┘

Reassembly Timeout

If fragments don’t all arrive within a timeout (typically 30-120 seconds), the partial packet is discarded:

Fragment 1: ✓ Received
Fragment 2: ✓ Received
Fragment 3: ✗ Lost

After timeout:
  All fragments discarded
  ICMP "Fragment Reassembly Time Exceeded" may be sent
  Upper layer (TCP) must retransmit entire original packet

Problems with Fragmentation

Fragmentation has significant drawbacks:

1. Performance Overhead

Single 3000-byte packet vs. 3 fragments:

Original (1 packet):
  Processing: 1 header lookup
  Transmission: 1 packet

Fragmented (3 packets):
  Processing: 3 header lookups (3x)
  Headers: 60 bytes (vs. 20)
  Reassembly: Buffer management, timeout tracking

2. Fragment Loss Amplification

If any fragment is lost, the entire packet is lost:

3 fragments, 1% loss rate each:

Probability all arrive = 0.99³ = 97%
Probability of packet loss = 3%

vs. unfragmented: 1% loss

More fragments = higher effective loss rate
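A quick Python check of that arithmetic:

```python
# Effective loss rate for a packet split into n fragments, where each
# fragment is lost independently with probability p.

def effective_loss(p, n):
    return 1 - (1 - p) ** n

print(f"{effective_loss(0.01, 1):.4f}")   # unfragmented
print(f"{effective_loss(0.01, 3):.4f}")   # three fragments
print(f"{effective_loss(0.01, 10):.4f}")  # ten fragments: worse still
```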

3. Security Issues

  • Tiny fragment attacks: Malicious fragments too small to contain port numbers
  • Overlapping fragment attacks: Crafted to bypass firewalls
  • Fragment flood DoS: Exhaust reassembly buffers

Many firewalls drop fragments by default.

4. Stateful Firewall Problems

Firewall examines:
  Source/Dest IP: In every fragment ✓
  Source/Dest Port: Only in FIRST fragment!

Fragment 2 arrives first:
  No port information
  Firewall can't apply port-based rules
  May drop or pass incorrectly

Path MTU Discovery (PMTUD)

Modern systems avoid fragmentation using Path MTU Discovery:

1. Sender sends packet with DF (Don't Fragment) bit set
2. If packet is too large, router sends ICMP "Fragmentation Needed"
3. Sender reduces packet size and retries
4. Repeat until path MTU is found

┌────────┐                              ┌────────┐
│ Sender │                              │  Dest  │
└───┬────┘                              └───┬────┘
    │                                       │
    │──── 1500 byte packet, DF=1 ──────────>│
    │                                       │
    │        ┌────────┐                     │
    │        │ Router │                     │
    │        │MTU=1400│                     │
    │        └───┬────┘                     │
    │            │                          │
    │<── ICMP "Frag Needed, MTU=1400" ──────│
    │                                       │
    │──── 1400 byte packet, DF=1 ──────────>│
    │                                       │
    │<─────────── Response ─────────────────│

PMTUD Problems

Black Hole Routers: Some routers don’t send ICMP messages (or firewalls block them):

Sender → Router1 → Router2 → Dest
          │
          └── Has MTU 1400
              Drops packet (too big, DF=1)
              Doesn't send ICMP (broken/filtered)

Sender keeps trying 1500-byte packets
All silently dropped = "black hole"

Workarounds:

  • MSS clamping (TCP)
  • Fallback to minimum MTU
  • Manual MTU configuration

TCP and MTU

TCP uses the MSS (Maximum Segment Size) to avoid IP fragmentation:

MSS = MTU - IP header - TCP header
MSS = 1500 - 20 - 20 = 1460 bytes (typical)

TCP segments are sized to fit in one IP packet:
┌──────────────────────────────────────────────────┐
│ IP Hdr │ TCP Hdr │      TCP Data (≤ MSS)        │
│  20B   │   20B   │         1460 bytes           │
└──────────────────────────────────────────────────┘
             Total: ≤ MTU (1500 bytes)

MSS is negotiated during the TCP handshake:

Client → SYN, MSS=1460
Server → SYN-ACK, MSS=1400
         (Server's path MTU is smaller)

Connection uses minimum: MSS=1400
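The MSS arithmetic and the min() negotiation can be sketched in Python (the server's 1400 value is the example from above):

```python
# MSS sizing and negotiation: each side derives its MSS from its MTU,
# and the connection uses the smaller of the two advertised values.

IP_HDR, TCP_HDR = 20, 20

def mss_for(mtu):
    return mtu - IP_HDR - TCP_HDR

client_mss = mss_for(1500)   # standard Ethernet
server_mss = 1400            # server behind a smaller path MTU
negotiated = min(client_mss, server_mss)
print(client_mss, server_mss, negotiated)
```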

IPv6 and Fragmentation

IPv6 handles fragmentation differently:

IPv4:
- Routers can fragment
- Sender can fragment
- Minimum reassembly buffer: 576 bytes (link MTU may be smaller)

IPv6:
- Routers CANNOT fragment (must be done at source)
- PMTUD is mandatory
- Minimum MTU: 1280 bytes
- Uses Fragment extension header

IPv6’s approach improves router performance (no fragmentation processing) but requires working PMTUD.

Practical Considerations

Checking MTU

# Linux - show interface MTU
$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500

# macOS
$ ifconfig en0 | grep mtu

# Test path MTU with ping
$ ping -M do -s 1472 example.com   # Linux: -M do = DF bit
$ ping -D -s 1472 example.com      # macOS: -D = DF bit
# 1472 + 8 (ICMP) + 20 (IP) = 1500

# If it fails, reduce size until it works

Setting MTU

# Linux - temporary
$ sudo ip link set eth0 mtu 1400

# Linux - permanent (varies by distro)
# Netplan (Ubuntu):
# /etc/netplan/01-network.yaml
network:
  ethernets:
    eth0:
      mtu: 1400

# Windows
> netsh interface ipv4 set subinterface "Ethernet" mtu=1400

Common MTU Issues

VPN tunnels:

Original: 1500 MTU
VPN overhead: ~60-80 bytes (encryption, headers)
Effective MTU: ~1420-1440 bytes

If not configured, causes fragmentation or black holes

Docker/containers:

Host MTU: 1500
Container default: 1500
Overlay network: Adds headers

May need: MTU 1450 or lower inside containers

PPPoE (DSL):

Ethernet MTU: 1500
PPPoE overhead: 8 bytes
Effective: 1492 MTU

ISP-provided routers usually handle this
Manual configurations may need adjustment

Debugging Fragmentation Issues

Symptoms

  • Large file transfers fail, small requests work
  • Connections hang during data transfer
  • Works on LAN, fails over VPN/WAN
  • PMTUD blackhole (DF packets disappear)

Diagnostics

# Check if fragmentation is occurring
$ netstat -s | grep -i frag   # Linux
fragments received
fragments created

# Tcpdump for fragments
$ tcpdump -i eth0 'ip[6:2] & 0x3fff != 0'

# Test specific sizes
$ ping -M do -s SIZE destination

# Traceroute with MTU discovery
$ tracepath example.com

Summary

IP fragmentation handles oversized packets but comes with costs:

Aspect             Impact
─────────────────────────────────────────────────────
Performance        Multiple packets, reassembly overhead
Reliability        One lost fragment = lost packet
Security           Fragment attacks, firewall issues
Modern approach    Avoid via PMTUD, MSS clamping

Best practices:

  • Design for 1500-byte MTU (or smaller if tunneling)
  • Use PMTUD where possible
  • Configure MSS clamping on border routers
  • Test with large packets during deployment

IPv6 eliminates router fragmentation entirely, making PMTUD mandatory but more predictable.

This completes our coverage of the IP layer. Next, we’ll dive deep into TCP—the protocol that provides reliable, ordered delivery on top of IP’s best-effort service.

TCP Deep Dive

TCP (Transmission Control Protocol) transforms IP’s unreliable packet delivery into a reliable, ordered byte stream. It’s the foundation for most internet applications—web browsing, email, file transfer, and API calls all typically use TCP.

What TCP Provides

TCP adds these guarantees on top of IP:

┌─────────────────────────────────────────────────────────────┐
│                    TCP Guarantees                           │
├─────────────────────────────────────────────────────────────┤
│  ✓ Reliable Delivery                                        │
│    Lost packets are detected and retransmitted              │
│                                                             │
│  ✓ Ordered Delivery                                         │
│    Data arrives in the order it was sent                    │
│                                                             │
│  ✓ Error Detection                                          │
│    Corrupted data is detected via checksums                 │
│                                                             │
│  ✓ Flow Control                                             │
│    Sender doesn't overwhelm receiver                        │
│                                                             │
│  ✓ Congestion Control                                       │
│    Sender doesn't overwhelm the network                     │
│                                                             │
│  ✓ Connection-Oriented                                      │
│    Explicit setup and teardown                              │
│                                                             │
│  ✓ Full-Duplex                                              │
│    Data flows in both directions simultaneously             │
└─────────────────────────────────────────────────────────────┘

The Trade-offs

These guarantees come at a cost:

Reliability vs. Latency
──────────────────────────────────────────
TCP must wait for acknowledgments
Lost packet? Wait for retransmission
Connection setup requires round trips

Ordering vs. Throughput
──────────────────────────────────────────
Head-of-line blocking: One lost packet
stalls delivery of everything behind it

Packet:  1  2  3  4  5  6  7
              ↑
              Lost

Received: 1  2  ·  4  5  6  7
                 │
                 4-7 arrive but are buffered: TCP
                 can't deliver them until 3 is retransmitted

This is why some applications (real-time video, gaming) use UDP instead.

TCP vs. IP

Think of TCP and IP as two different jobs:

┌─────────────────────────────────────────────────────────────┐
│                          IP                                 │
│  "I'll try to get this packet to the destination address"   │
│                                                             │
│  - No guarantee of delivery                                 │
│  - No guarantee of order                                    │
│  - Packets are independent                                  │
│  - Fast, simple, stateless                                  │
└─────────────────────────────────────────────────────────────┘
                            ↑
                            │
┌─────────────────────────────────────────────────────────────┐
│                         TCP                                 │
│  "I'll make sure all data arrives correctly and in order"   │
│                                                             │
│  - Reliable delivery (detects loss, retransmits)            │
│  - Ordered delivery (sequence numbers)                      │
│  - Connection state (both sides track progress)             │
│  - Slower, complex, stateful                                │
└─────────────────────────────────────────────────────────────┘

Key Concepts Preview

Sequence Numbers

Every byte in a TCP stream has a sequence number:

Application sends: "Hello, World!" (13 bytes)

TCP assigns:
  Seq 1000: 'H'
  Seq 1001: 'e'
  Seq 1002: 'l'
  ...
  Seq 1012: '!'

Segments might be:
  Segment 1: Seq=1000, "Hello, "
  Segment 2: Seq=1007, "World!"
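The byte-numbering scheme can be sketched in a few lines of Python (segment size 7 is chosen just to reproduce the split above):

```python
# Every byte gets a sequence number, so a segment's sequence number is
# simply the stream position of its first byte.

def segmentize(data, isn, seg_size):
    """Split data into (seq, chunk) segments starting at the ISN."""
    return [(isn + i, data[i:i + seg_size])
            for i in range(0, len(data), seg_size)]

segments = segmentize(b"Hello, World!", 1000, 7)
for seq, chunk in segments:
    print(seq, chunk)
```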

Acknowledgments

The receiver tells the sender what it’s received:

Sender                          Receiver
   │                               │
   │──── Seq=1000, "Hello" ───────>│
   │                               │
   │<──── ACK=1005 ────────────────│
   │      "I've received up to     │
   │       byte 1004, send 1005"   │

The Window

The receiver advertises how much data it can accept:

"My receive buffer can hold 65535 more bytes"
  Window = 65535

Sender can send that much without waiting for ACKs
(Sliding window protocol)

What You’ll Learn

This chapter covers TCP in depth:

  1. The Three-Way Handshake: How connections are established
  2. TCP Header and Segments: The packet format and key fields
  3. Flow Control: Preventing receiver overload
  4. Congestion Control: Preventing network overload
  5. Retransmission: How lost data is recovered
  6. TCP States: The connection lifecycle

Understanding TCP helps you:

  • Debug connection problems
  • Optimize application performance
  • Make informed protocol choices
  • Understand why things sometimes feel slow

Let’s start with the handshake—how two systems establish a TCP connection.

The Three-Way Handshake

Before TCP can transfer data, both sides must establish a connection. This happens through a three-message exchange called the three-way handshake.

Why a Handshake?

The handshake serves several purposes:

  1. Verify both endpoints are reachable and willing
  2. Exchange initial sequence numbers (ISNs)
  3. Negotiate connection parameters (MSS, window scaling, etc.)
  4. Synchronize state between client and server

The Three Steps

    Client                                    Server
       │                                         │
       │         State: LISTEN                   │
       │         (waiting for connections)       │
       │                                         │
  ┌────┴────┐                                    │
  │  SYN    │                                    │
  │ Seq=100 │────────────────────────────────────>
  │         │        "I want to connect,         │
  └─────────┘         my ISN is 100"             │
       │                                         │
       │                                    ┌────┴────┐
       │                                    │ SYN-ACK │
        │<───────────────────────────────────│ Seq=300 │
        │    "OK, I acknowledge your SYN     │ ACK=101 │
        │     (expecting byte 101 next),     └─────────┘
       │     and here's my ISN: 300"              │
       │                                         │
  ┌────┴────┐                                    │
  │  ACK    │                                    │
  │ACK=301  │────────────────────────────────────>
  │         │    "I acknowledge your SYN,        │
  └─────────┘     expecting byte 301"            │
       │                                         │
       │       CONNECTION ESTABLISHED            │
       │                                         │

Step 1: SYN (Synchronize)

Client initiates the connection:

TCP Header:
┌─────────────────────────────────────────┐
│  Source Port: 52431                     │
│  Dest Port: 80                          │
│  Sequence Number: 100 (client's ISN)    │
│  Acknowledgment: 0 (not yet used)       │
│  Flags: SYN=1                           │
│  Window: 65535                          │
│  Options: MSS=1460, Window Scale, etc.  │
└─────────────────────────────────────────┘

The Initial Sequence Number (ISN) is randomized for security reasons (prevents sequence prediction attacks).

Step 2: SYN-ACK

Server acknowledges and synchronizes:

TCP Header:
┌─────────────────────────────────────────┐
│  Source Port: 80                        │
│  Dest Port: 52431                       │
│  Sequence Number: 300 (server's ISN)    │
│  Acknowledgment: 101 (client's ISN + 1) │
│  Flags: SYN=1, ACK=1                    │
│  Window: 65535                          │
│  Options: MSS=1460, Window Scale, etc.  │
└─────────────────────────────────────────┘

The ACK value (101) means “I’ve received everything up to byte 100, send me byte 101 next.”

Step 3: ACK

Client confirms:

TCP Header:
┌─────────────────────────────────────────┐
│  Source Port: 52431                     │
│  Dest Port: 80                          │
│  Sequence Number: 101                   │
│  Acknowledgment: 301 (server's ISN + 1) │
│  Flags: ACK=1                           │
│  Window: 65535                          │
└─────────────────────────────────────────┘

At this point, both sides have verified connectivity and exchanged initial sequence numbers.

Why Three Messages?

Could we do it in two? No—here’s why:

Two-way handshake problem:

Client ──SYN──> Server
Client <──ACK── Server

What if the server's ACK is lost?
- Server thinks connection is established
- Client thinks connection failed
- Server waits forever for data that won't come

Three-way solves this:
- Both sides must acknowledge the other's SYN
- Both sides know the other received their ISN

State Changes During Handshake

Client States                    Server States
────────────────────────────────────────────────────

CLOSED                           CLOSED
   │                                │
   │                              listen()
   │                                │
   │                                ▼
   │                             LISTEN
   │                                │
connect()                           │
   │                                │
   ▼                                │
SYN_SENT ──────── SYN ─────────────>│
   │                                │
   │                                ▼
   │                            SYN_RCVD
   │                                │
   │<─────────── SYN-ACK ───────────│
   │                                │
   ▼                                │
ESTABLISHED ────── ACK ────────────>│
   │                                │
   │                                ▼
   │                           ESTABLISHED
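The client-side transitions above can be written as a tiny table-driven state machine in Python (handshake path only; real TCP has many more states and events):

```python
# Client handshake states as a transition table: (state, event) -> state.

CLIENT_FSM = {
    ("CLOSED",   "connect/send SYN"):       "SYN_SENT",
    ("SYN_SENT", "recv SYN-ACK/send ACK"):  "ESTABLISHED",
}

state = "CLOSED"
for event in ("connect/send SYN", "recv SYN-ACK/send ACK"):
    state = CLIENT_FSM[(state, event)]
print(state)
```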

Options Negotiated in Handshake

Several important options are exchanged during the SYN and SYN-ACK:

Maximum Segment Size (MSS)

"The largest TCP segment I can receive"

Typical values:
  Ethernet: MSS = 1460 (1500 MTU - 20 IP - 20 TCP)
  Jumbo:    MSS = 8960 (9000 MTU - headers)

Both sides advertise their MSS
Connection uses the minimum of the two

Window Scaling (RFC 7323)

Original window field: 16 bits = max 65535 bytes
Too small for high-bandwidth, high-latency networks

Window scaling multiplies by 2^scale:
  Scale=7: Window can be 65535 × 128 = 8MB

SYN:     Window Scale = 7
SYN-ACK: Window Scale = 8

Enables large windows for high-performance networks
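The scaling arithmetic in Python:

```python
# Actual window = 16-bit window field × 2^scale, where the scale
# factor was exchanged once in the SYN options.

def scaled_window(window_field, scale):
    return window_field << scale   # window × 2^scale

print(scaled_window(65535, 7))     # ~8 MB with scale 7
print(scaled_window(512, 7))       # a "win 512" seen in captures
```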

Selective Acknowledgment (SACK)

"I support SACK - I can tell you exactly which
 bytes I've received, not just the contiguous ones"

Without SACK: ACK=1000 means "got 1-999"
              If 1000 is lost but 1001-2000 arrived,
              can only ACK up to 999

With SACK:   ACK=1000, SACK: 1001-2000
             "Got 1-999 and 1001-2000, missing 1000"
             Sender retransmits only byte 1000
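A sender's gap computation under SACK might look like this Python sketch (simplified; real implementations track far more state):

```python
# Given the cumulative ACK and the SACK blocks, find the byte ranges
# that still need retransmission.

def missing_ranges(ack, sack_blocks, highest_sent):
    """Return (start, end) byte ranges not yet acknowledged."""
    gaps, cursor = [], ack
    for start, end in sorted(sack_blocks):
        if cursor < start:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= highest_sent:
        gaps.append((cursor, highest_sent))
    return gaps

# ACK=1000 with SACK block 1001-2000: only byte 1000 is missing
print(missing_ranges(1000, [(1001, 2000)], 2000))
```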

Timestamps (RFC 7323)

Used for:
1. RTT measurement (Round Trip Time)
2. PAWS (Protection Against Wrapped Sequences)

TSval:  Sender's timestamp
TSecr:  Echoed timestamp from peer

Helps with:
- Accurate timeout calculation
- Distinguishing old duplicate packets

Handshake in Action

Here’s a real handshake captured with tcpdump:

$ tcpdump -i eth0 port 80

14:23:15.123456 IP 192.168.1.100.52431 > 93.184.216.34.80:
    Flags [S], seq 1823761425, win 65535,
    options [mss 1460,sackOK,TS val 1234567 ecr 0,
             nop,wscale 7], length 0

14:23:15.156789 IP 93.184.216.34.80 > 192.168.1.100.52431:
    Flags [S.], seq 2948572615, ack 1823761426, win 65535,
    options [mss 1460,sackOK,TS val 9876543 ecr 1234567,
             nop,wscale 8], length 0

14:23:15.156801 IP 192.168.1.100.52431 > 93.184.216.34.80:
    Flags [.], ack 2948572616, win 512,
    options [nop,nop,TS val 1234568 ecr 9876543], length 0

Reading the flags:

  • [S] = SYN
  • [S.] = SYN-ACK (the dot means ACK is set)
  • [.] = ACK only

Connection Latency

The handshake adds latency before data transfer can begin:

Timeline:
────────────────────────────────────────────────────────
0ms      Client sends SYN
                            │
50ms                        Server receives SYN
                            Server sends SYN-ACK
                            │
100ms    Client receives SYN-ACK
         Client sends ACK
         Client can NOW send data!
                            │
150ms                       Server receives ACK
                            Server can NOW send data!

Minimum: 1 RTT before client can send
         1.5 RTT before server can send

For a 100ms RTT connection:
  100ms before HTTP request can be sent
  150ms before HTTP response can begin

This is why connection reuse (HTTP keep-alive, connection pooling) matters for performance.

TCP Fast Open (TFO)

TCP Fast Open allows data in the SYN packet:

First connection (normal):
Client ──SYN──────────────> Server
Client <──SYN-ACK + Cookie── Server
Client ──ACK──────────────> Server
Client ──Data─────────────> Server

Subsequent connections (with TFO cookie):
Client ──SYN + Cookie + Data──> Server
                                Server can respond immediately!
Client <───────Response───────── Server

Saves 1 RTT on repeat connections!

TFO requires:

  • Both client and server support
  • Idempotent initial request (retry-safe)
  • Not universally deployed due to middlebox issues

Handshake Failures

Connection Refused

Client ──SYN──> Server (no service on port)
Client <──RST── Server

"RST" (Reset) means "I'm not accepting connections on this port"

$ telnet example.com 12345
Connection refused

Connection Timeout

Client ──SYN──> (packet lost or server unreachable)
         ... wait ...
Client ──SYN──> (retry 1)
         ... wait longer ...
Client ──SYN──> (retry 2)
         ... give up after multiple attempts

Retries use exponential backoff; defaults vary by OS
(Linux retries 6 times, giving up after roughly two minutes)

SYN Flood Attack

Attacker sends many SYNs without completing handshake:

Attacker ──SYN (fake source)──> Server
Attacker ──SYN (fake source)──> Server
Attacker ──SYN (fake source)──> Server
           ... thousands more ...

Server:
- Allocates resources for each half-open connection
- SYN queue fills up
- Can't accept legitimate connections

Mitigations:
- SYN cookies (stateless SYN handling)
- Rate limiting
- Larger SYN queues

Simultaneous Open (Rare)

Both sides can simultaneously send SYN:

Client ──SYN──> <──SYN── Server
       ↓              ↓
Client ──SYN-ACK──> <──SYN-ACK── Server

Both sides:
1. Receive SYN while in SYN_SENT
2. Send SYN-ACK
3. Move to ESTABLISHED when ACK received

Same result, different path. Rare in practice.

Summary

The three-way handshake establishes TCP connections:

Step   Direction          Flags     Purpose
─────────────────────────────────────────────────────
1      Client → Server    SYN       "I want to connect"
2      Server → Client    SYN-ACK   "OK, I acknowledge"
3      Client → Server    ACK       "Confirmed"

Key points:

  • Exchanges initial sequence numbers
  • Negotiates options (MSS, window scale, SACK)
  • Takes 1-1.5 RTT before data can flow
  • Connection reuse avoids repeated handshakes
  • SYN cookies protect against SYN floods

Next, we’ll examine the TCP header in detail—understanding each field and how they work together.

TCP Header and Segments

Understanding the TCP header is essential for debugging network issues, interpreting packet captures, and grasping how TCP works. Every TCP segment starts with a header containing control information.

TCP Segment Structure

A TCP segment is the unit of data at the transport layer:

┌─────────────────────────────────────────────────────────────┐
│                      TCP Segment                            │
├──────────────────────────────┬──────────────────────────────┤
│         TCP Header           │         TCP Payload          │
│        (20-60 bytes)         │     (0 to MSS bytes)         │
└──────────────────────────────┴──────────────────────────────┘

Segments are encapsulated in IP packets:

┌─────────────────────────────────────────────────────────────┐
│  IP Header   │  TCP Header   │        TCP Payload           │
│  (20 bytes)  │ (20-60 bytes) │       (application data)     │
└─────────────────────────────────────────────────────────────┘

The TCP Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
│          Source Port          │       Destination Port        │
├───────────────────────────────┴───────────────────────────────┤
│                        Sequence Number                        │
├───────────────────────────────────────────────────────────────┤
│                    Acknowledgment Number                      │
├───────┬───────┬─┬─┬─┬─┬─┬─┬─┬─┬───────────────────────────────┤
│  Data │       │C│E│U│A│P│R│S│F│                               │
│ Offset│ Rsrvd │W│C│R│C│S│S│Y│I│            Window             │
│       │       │R│E│G│K│H│T│N│N│                               │
├───────┴───────┴─┴─┴─┴─┴─┴─┴─┴─┼───────────────────────────────┤
│           Checksum            │        Urgent Pointer         │
├───────────────────────────────┴───────────────────────────────┤
│                    Options (if Data Offset > 5)               │
│                             ...                               │
├───────────────────────────────────────────────────────────────┤
│                           Payload                             │
│                             ...                               │
└───────────────────────────────────────────────────────────────┘
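The fixed 20-byte portion of the header can be unpacked with Python's standard struct module. The sample bytes below are packed by hand purely for illustration:

```python
# Parse the fixed 20-byte TCP header (options, if any, follow it).
import struct

def parse_tcp_header(data):
    sport, dport, seq, ack, off_flags, window, cksum, urg = \
        struct.unpack("!HHIIHHHH", data[:20])
    return {
        "src_port": sport,
        "dst_port": dport,
        "seq": seq,
        "ack": ack,
        "data_offset": (off_flags >> 12) & 0xF,  # header length, 32-bit words
        "flags": off_flags & 0xFF,               # CWR..FIN bits
        "window": window,
    }

# A SYN segment: ports 52431 -> 80, seq 100, data offset 5, SYN flag (0x02)
raw = struct.pack("!HHIIHHHH", 52431, 80, 100, 0,
                  (5 << 12) | 0x02, 65535, 0, 0)
hdr = parse_tcp_header(raw)
print(hdr)
```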

Header Fields Explained

Source Port and Destination Port (16 bits each)

Source Port:      The sender's port (often ephemeral)
Destination Port: The receiver's port (often well-known)

Together with IP addresses, these form the connection 5-tuple:
(Protocol, Source IP, Source Port, Dest IP, Dest Port)

Example:
  Client → Server HTTP request:
    Source: 192.168.1.100:52431
    Dest:   93.184.216.34:80

Sequence Number (32 bits)

Identifies the position of data in the byte stream.

If Sequence = 1000 and Payload = 100 bytes:
  This segment contains bytes 1000-1099

First SYN: Sequence = ISN (Initial Sequence Number)
           Subsequent: ISN + bytes sent

32 bits → wraps around after ~4GB
  (Timestamps help disambiguate in fast networks)

Acknowledgment Number (32 bits)

"I've received all bytes up to this number - 1"
"Send me this byte next"

If ACK = 1100:
  "I have bytes 0-1099, expecting 1100"

Only valid when ACK flag is set.

Data Offset (4 bits)

Length of TCP header in 32-bit words.

Minimum: 5 (5 × 4 = 20 bytes, no options)
Maximum: 15 (15 × 4 = 60 bytes, 40 bytes of options)

Tells receiver where the payload begins.

Reserved (4 bits)

Reserved for future use. Must be zero.
(Historically 6 bits, 2 repurposed for CWR/ECE)

Control Flags (8 bits)

Each flag is 1 bit:

┌─────┬─────────────────────────────────────────────────────────┐
│ CWR │ Congestion Window Reduced                               │
│     │ Sender reduced congestion window                        │
├─────┼─────────────────────────────────────────────────────────┤
│ ECE │ ECN-Echo                                                │
│     │ Congestion notification echo                            │
├─────┼─────────────────────────────────────────────────────────┤
│ URG │ Urgent pointer field is valid                           │
│     │ (Rarely used today)                                     │
├─────┼─────────────────────────────────────────────────────────┤
│ ACK │ Acknowledgment field is valid                           │
│     │ Set on almost every segment after SYN                   │
├─────┼─────────────────────────────────────────────────────────┤
│ PSH │ Push - deliver data immediately to application          │
│     │ Don't buffer waiting for more data                      │
├─────┼─────────────────────────────────────────────────────────┤
│ RST │ Reset - abort the connection immediately                │
│     │ Something went wrong                                    │
├─────┼─────────────────────────────────────────────────────────┤
│ SYN │ Synchronize - connection establishment                  │
│     │ Only set during handshake                               │
├─────┼─────────────────────────────────────────────────────────┤
│ FIN │ Finish - sender is done sending                         │
│     │ Graceful connection termination                         │
└─────┴─────────────────────────────────────────────────────────┘

Common flag combinations:

SYN           = Connection request
SYN + ACK     = Connection accepted
ACK           = Data or acknowledgment
PSH + ACK     = Push data (common for requests)
FIN + ACK     = Done sending, acknowledging
RST           = Connection abort
RST + ACK     = Reset with acknowledgment
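As a sketch, the 8-bit flags field can be decoded with a small lookup table (bit positions follow the table above, CWR in the high bit down to FIN in the low bit):

```python
# Flag bit positions within the 8-bit flags field
TCP_FLAGS = {0x80: "CWR", 0x40: "ECE", 0x20: "URG", 0x10: "ACK",
             0x08: "PSH", 0x04: "RST", 0x02: "SYN", 0x01: "FIN"}

def flag_names(flags: int) -> str:
    """Render a flags byte as e.g. 'ACK+SYN'."""
    names = [name for bit, name in sorted(TCP_FLAGS.items(), reverse=True)
             if flags & bit]
    return "+".join(names) or "(none)"
```

For example, a SYN-ACK carries 0x12 (ACK=0x10, SYN=0x02).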

Window (16 bits)

Receive window size: "I can accept this many more bytes"

Range: 0 - 65535 bytes

With window scaling (negotiated in SYN):
  Actual window = Window × 2^scale

Example with scale=7:
  Window=512 means 512 × 128 = 65536 bytes

Checksum (16 bits)

Covers header, data, and a pseudo-header:

┌─────────────────────────────────────────────────────────────┐
│                     Pseudo-Header                           │
├─────────────────────────────────────────────────────────────┤
│  Source IP Address (from IP header)                         │
│  Destination IP Address (from IP header)                    │
│  Zero | Protocol (6 for TCP) | TCP Length                   │
└─────────────────────────────────────────────────────────────┘

Why pseudo-header?
- Ensures segment reaches correct destination
- Protects against IP address spoofing
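Here is a minimal Python sketch of the checksum computation over the IPv4 pseudo-header plus segment, using the standard 16-bit ones'-complement sum:

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    """16-bit ones'-complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"                          # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold carry back in
    return total

def tcp_checksum(src_ip: bytes, dst_ip: bytes, segment: bytes) -> int:
    """Checksum over pseudo-header + segment (checksum field zeroed)."""
    pseudo = struct.pack("!4s4sBBH", src_ip, dst_ip, 0, 6, len(segment))
    return ones_complement_sum(pseudo + segment) ^ 0xFFFF
```

Verification uses the same routine: with the correct checksum in place, recomputing over the whole segment yields 0.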

Urgent Pointer (16 bits)

Offset to end of urgent data (if URG flag set).

Largely obsolete - rarely used in modern applications.
Was intended for out-of-band signaling.

TCP Options

Options extend the header beyond 20 bytes:

Option Format:
┌─────────┬────────┬──────────────────────┐
│  Kind   │ Length │        Data          │
│ (1 byte)│(1 byte)│   (Length-2 bytes)   │
└─────────┴────────┴──────────────────────┘

Single-byte options: Kind only (no length/data)
  Kind 0: End of Options
  Kind 1: NOP (padding)

Common Options

┌──────┬────────┬────────────────────────────────────────────────┐
│ Kind │ Length │ Description                                    │
├──────┼────────┼────────────────────────────────────────────────┤
│  0   │   -    │ End of Options List                            │
│  1   │   -    │ NOP (No Operation) - padding                   │
│  2   │   4    │ MSS (Maximum Segment Size)                     │
│  3   │   3    │ Window Scale                                   │
│  4   │   2    │ SACK Permitted                                 │
│  5   │  var   │ SACK (Selective Acknowledgment)                │
│  8   │  10    │ Timestamps (TSval, TSecr)                      │
└──────┴────────┴────────────────────────────────────────────────┘
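A sketch of walking the options area, following the kind/length format above (illustrative, with minimal error handling):

```python
def parse_tcp_options(data: bytes) -> list:
    """Walk the TCP options area; return (kind, payload) pairs."""
    options, i = [], 0
    while i < len(data):
        kind = data[i]
        if kind == 0:                  # End of Options List
            break
        if kind == 1:                  # NOP padding, no length byte
            options.append((1, b""))
            i += 1
            continue
        length = data[i + 1]           # includes kind + length bytes
        options.append((kind, data[i + 2:i + length]))
        i += length
    return options
```

For example, MSS=1460 is encoded as the four bytes `02 04 05 b4`.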

MSS Option (Kind 2)

Maximum Segment Size - largest payload sender can receive.

┌─────────┬────────┬─────────────────────┐
│ Kind=2  │ Len=4  │   MSS Value (16b)   │
└─────────┴────────┴─────────────────────┘

Typical: 1460 (Ethernet); per-segment payload drops to 1448
when the timestamps option is in use.
Only in SYN and SYN-ACK segments.

Window Scale Option (Kind 3)

Multiplier for window field: Window × 2^scale

┌─────────┬────────┬────────────┐
│ Kind=3  │ Len=3  │ Shift (8b) │
└─────────┴────────┴────────────┘

Shift range: 0-14
Max window: 65535 × 2^14 = ~1GB

Only in SYN and SYN-ACK.

SACK Option (Kind 5)

Reports non-contiguous received blocks:

┌─────────┬────────┬──────────┬──────────┬─────┐
│ Kind=5  │ Length │ Left Edge│Right Edge│ ... │
└─────────┴────────┴──────────┴──────────┴─────┘

Example: SACK 1001-1500, 2001-3000
  "I have bytes 1001-1500 and 2001-3000,
   missing 1501-2000"

Timestamps Option (Kind 8)

┌─────────┬────────┬────────────────┬────────────────┐
│ Kind=8  │ Len=10 │ TSval (32 bit) │ TSecr (32 bit) │
└─────────┴────────┴────────────────┴────────────────┘

TSval:  Sender's current timestamp
TSecr:  Echo of peer's last timestamp

Uses:
1. RTT measurement (TSecr shows when original was sent)
2. PAWS - detect old duplicates by timestamp

Example TCP Segments

SYN Segment

Client initiating connection to web server:

Source Port:      52431
Dest Port:        80
Sequence:         2837465182 (random ISN)
Acknowledgment:   0 (not used)
Data Offset:      10 (40-byte header: 20 bytes of options)
Flags:            SYN
Window:           65535
Checksum:         0x1a2b
Urgent:           0

Options:
  MSS: 1460
  SACK Permitted
  Timestamps: TSval=1234567, TSecr=0
  NOP (padding)
  Window Scale: 7

Data Segment

HTTP request being sent:

Source Port:      52431
Dest Port:        80
Sequence:         2837465183
Acknowledgment:   948271635
Data Offset:      8
Flags:            PSH, ACK
Window:           502
Checksum:         0x3c4d
Urgent:           0

Options:
  NOP, NOP
  Timestamps: TSval=1234590, TSecr=9876543

Payload (95 bytes; remaining headers omitted):
  GET / HTTP/1.1\r\n
  Host: example.com\r\n
  ...
  \r\n

ACK-only Segment

Acknowledging received data (no payload):

Source Port:      80
Dest Port:        52431
Sequence:         948272000
Acknowledgment:   2837465278
Data Offset:      8
Flags:            ACK
Window:           1024
Checksum:         0x5e6f
Urgent:           0

Options:
  NOP, NOP
  Timestamps: TSval=9876600, TSecr=1234590

Payload: (empty)

Segment Size Considerations

Maximum Segment Size (MSS)

MSS = MTU - IP Header - TCP Header
MSS = 1500 - 20 - 20 = 1460 bytes (typical Ethernet)

With timestamps (12 option bytes per segment, common):
Payload per segment = 1500 - 20 - 32 = 1448 bytes

The actual payload in a segment ≤ MSS

Why Segment Size Matters

Too small:
- More packets = more overhead
- More ACKs needed
- Less efficient

Too large:
- IP fragmentation (bad for performance)
- Higher chance of loss requiring retransmit

Optimal: Just under MTU (Path MTU Discovery helps)

Viewing TCP Headers

Using tcpdump

$ tcpdump -i eth0 -nn tcp port 80 -vvX

15:30:45.123456 IP 192.168.1.100.52431 > 93.184.216.34.80:
    Flags [S], cksum 0x1a2b (correct),
    seq 2837465182, win 65535,
    options [mss 1460,sackOK,TS val 1234567 ecr 0,
             nop,wscale 7],
    length 0

Using Wireshark

Wireshark provides a graphical view with all fields decoded:

Transmission Control Protocol, Src Port: 52431, Dst Port: 80
    Source Port: 52431
    Destination Port: 80
    Sequence Number: 2837465182
    Acknowledgment Number: 0
    Header Length: 40 bytes (10)
    Flags: 0x002 (SYN)
        000. .... .... = Reserved: Not set
        ...0 .... .... = Nonce: Not set
        .... 0... .... = Congestion Window Reduced (CWR): Not set
        .... .0.. .... = ECN-Echo: Not set
        .... ..0. .... = Urgent: Not set
        .... ...0 .... = Acknowledgment: Not set
        .... .... 0... = Push: Not set
        .... .... .0.. = Reset: Not set
        .... .... ..1. = Syn: Set
        .... .... ...0 = Fin: Not set
    Window: 65535
    Options: (20 bytes)

Summary

The TCP header contains everything needed for reliable, ordered delivery:

┌──────────────────┬──────────────┬────────────────────────────────┐
│ Field            │ Size         │ Purpose                        │
├──────────────────┼──────────────┼────────────────────────────────┤
│ Source/Dest Port │ 16 bits each │ Identify applications          │
│ Sequence Number  │ 32 bits      │ Track byte position            │
│ Acknowledgment   │ 32 bits      │ Confirm receipt                │
│ Data Offset      │ 4 bits       │ Header length                  │
│ Flags            │ 8 bits       │ Control (SYN, ACK, FIN, etc.)  │
│ Window           │ 16 bits      │ Flow control                   │
│ Checksum         │ 16 bits      │ Error detection                │
│ Options          │ Variable     │ MSS, SACK, timestamps, etc.    │
└──────────────────┴──────────────┴────────────────────────────────┘

Understanding these fields helps you:

  • Debug connection problems
  • Interpret packet captures
  • Tune TCP performance
  • Recognize attacks (SYN floods, RST attacks)

Next, we’ll explore flow control—how TCP prevents senders from overwhelming receivers.

Flow Control

Flow control prevents a fast sender from overwhelming a slow receiver. Without it, a server could blast data faster than your application can process it, leading to lost data and wasted retransmissions.

The Problem

Consider a file download:

Fast Server                           Slow Client
(100 Mbps)                            (processes 1 MB/s)

   │──── 1 MB ────────────────────────>│ Buffer: [1MB]
   │──── 1 MB ────────────────────────>│ Buffer: [2MB]
   │──── 1 MB ────────────────────────>│ Buffer: [3MB]
   │──── 1 MB ────────────────────────>│ Buffer: [4MB] ← FULL!
   │──── 1 MB ────────────────────────>│ Buffer: OVERFLOW!
   │                                   │
   │  Data lost! Must retransmit.      │
   │  Waste of bandwidth.              │

Without flow control, fast senders can:
- Overflow receiver buffers
- Cause packet loss
- Trigger unnecessary retransmissions

The Sliding Window

TCP uses a sliding window mechanism for flow control. The receiver advertises how much buffer space is available, and the sender limits itself accordingly.

Receive Window (rwnd)

Receiver's perspective:

Receive Buffer (size: 65535 bytes)
┌─────────────────────────────────────────────────────────────┐
│ Data waiting │ Application    │      Available Space        │
│ to be read   │ reading...     │       (Window)              │
│    (ACKed)   │                │                             │
├──────────────┼────────────────┼─────────────────────────────┤
│    10000     │    (consuming) │         55535               │
└──────────────┴────────────────┴─────────────────────────────┘

Window advertised to sender: 55535 bytes
"I can accept 55535 more bytes"

Sender’s View

The sender tracks three pointers:

Sent & ACKed │ Sent, waiting for ACK │   Can send   │ Cannot send
             │                       │   (Window)   │ (beyond window)
─────────────┴───────────────────────┴──────────────┴────────────────
    1000         1000-5000              5000-10000      10000+

The "window" slides forward as ACKs arrive:

Before ACK:
[=====Sent=====][=====In Flight=====][===Can Send===][  Cannot  ]
                                     └──rwnd=5000───┘

After ACK (receiver consumed data):
        [===Sent===][===In Flight===][=====Can Send=====][Cannot]
                                     └─────rwnd=8000─────┘

Window "slides" right as data is acknowledged
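The sender-side bookkeeping above can be sketched as a toy model (the names snd_una/snd_nxt follow conventional TCB terminology; this is illustrative, not a real stack):

```python
class SendWindow:
    """Toy model of sender-side flow-control bookkeeping."""
    def __init__(self, isn: int, rwnd: int):
        self.snd_una = isn      # oldest unacknowledged byte
        self.snd_nxt = isn      # next byte to send
        self.rwnd = rwnd        # receiver's advertised window

    def usable(self) -> int:
        """Bytes we may still send without overrunning the receiver."""
        return self.rwnd - (self.snd_nxt - self.snd_una)

    def send(self, nbytes: int) -> int:
        """Send up to nbytes, clipped to the usable window."""
        nbytes = min(nbytes, self.usable())
        self.snd_nxt += nbytes
        return nbytes

    def on_ack(self, ack: int, rwnd: int) -> None:
        """ACK slides the left edge; the new rwnd sets the right edge."""
        self.snd_una = max(self.snd_una, ack)
        self.rwnd = rwnd
```

Running the trace above through it: after sending 4000 bytes from ISN 1000, `usable()` is 0 until `on_ack(5000, 2000)` opens 2000 bytes again.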

Window Flow

Let’s trace a file transfer with flow control:

Sender                                              Receiver
   │        rwnd = 4000                                │
   │                                                   │
   │──── Seq=1000, 1000 bytes ────────────────────────>│
   │──── Seq=2000, 1000 bytes ────────────────────────>│
   │──── Seq=3000, 1000 bytes ────────────────────────>│
   │──── Seq=4000, 1000 bytes ────────────────────────>│
   │                                                   │
   │    (Sender has sent rwnd bytes, must wait)        │
   │                                                   │
   │<──── ACK=5000, Win=2000 ──────────────────────────│
   │      (App read 2000 bytes, 2000 space freed)      │
   │                                                   │
   │──── Seq=5000, 1000 bytes ────────────────────────>│
   │──── Seq=6000, 1000 bytes ────────────────────────>│
   │                                                   │
   │    (Window full again, wait)                      │
   │                                                   │
   │<──── ACK=7000, Win=4000 ──────────────────────────│
   │      (App caught up, more space)                  │

Window Size and Throughput

The window limits throughput based on latency:

Maximum throughput = Window Size / Round Trip Time

Example 1: Window=65535, RTT=10ms
  Throughput ≤ 65535 / 0.010 = 6.5 MB/s

Example 2: Window=65535, RTT=100ms
  Throughput ≤ 65535 / 0.100 = 655 KB/s

This is why window scaling matters for high-latency links!

Bandwidth-Delay Product (BDP)

For optimal throughput, window should be ≥ BDP:

BDP = Bandwidth × RTT

Example: 100 Mbps link, 50ms RTT
  BDP = 100,000,000 bits/s × 0.050 s
      = 5,000,000 bits = 625,000 bytes

Need window ≥ 625 KB to fully utilize the link!
Standard 65535-byte window is way too small.
Window scaling essential: 65535 × 2^4 = ~1MB
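The BDP arithmetic, and the window-scale shift needed to cover it, can be sketched as:

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight to keep the pipe full."""
    return bandwidth_bps * rtt_s / 8

def min_window_scale(bdp: float, base_window: int = 65535) -> int:
    """Smallest wscale shift so base_window << shift covers the BDP."""
    shift = 0
    while (base_window << shift) < bdp and shift < 14:  # shift capped at 14
        shift += 1
    return shift
```

For the 100 Mbps / 50 ms example, the BDP is 625,000 bytes and a shift of 4 (window up to ~1 MB) is the minimum that covers it.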

Window Scaling

Window scaling multiplies the 16-bit window field:

Without scaling:
  Max window = 65535 bytes
  On 100Mbps, 50ms link: 65535/0.050 = 1.3 MB/s (10% utilization)

With scale factor 7:
  Max window = 65535 × 128 ≈ 8.4 MB
  On 100Mbps, 50ms link: window permits 8.4M/0.050 ≈ 168 MB/s,
  far above the 12.5 MB/s link rate → full utilization

Negotiated during handshake:
  SYN: WScale=7
  SYN-ACK: WScale=8

Scale applies to window field in all subsequent segments

Zero Window

When the receiver’s buffer is full, it advertises window = 0:

Sender                                              Receiver
   │                                                   │
   │<──── ACK=5000, Win=0 ────────────────────────────│
   │      "My buffer is full, stop sending!"          │
   │                                                   │
   │    (Sender stops, starts "persist timer")        │
   │                                                   │
   │──── Window Probe (1 byte) ──────────────────────>│
   │                                                   │
   │<──── ACK=5000, Win=0 ────────────────────────────│
   │      (Still full)                                │
   │                                                   │
   │    (Wait, probe again)                           │
   │                                                   │
   │──── Window Probe (1 byte) ──────────────────────>│
   │                                                   │
   │<──── ACK=5000, Win=4000 ─────────────────────────│
   │      (Space available, resume!)                  │

Persist Timer

The persist timer prevents deadlock when window = 0:

Without persist timer:
  Receiver: Window=0 (buffer full)
  Sender: Stops sending, waits for window update
  Receiver: Window update packet is lost!
  Both sides wait forever → Deadlock

With persist timer:
  Sender periodically probes with 1-byte segments
  Eventually receives window update
  No deadlock possible

Silly Window Syndrome

A pathological condition where tiny windows cause inefficiency:

Problem scenario:
  Application reads 1 byte at a time
  Receiver advertises 1-byte window
  Sender sends 1-byte segments (huge overhead!)

1 byte payload + 20 TCP + 20 IP = 41 bytes
Efficiency: 1/41 = 2.4%

This is "Silly Window Syndrome" (SWS)

Prevention

Receiver side (Clark’s algorithm):

Don't advertise tiny windows.
Wait until either:
  - Window ≥ MSS, or
  - Window ≥ buffer/2

"I have space" → If space < MSS, advertise Win=0

Sender side (Nagle’s algorithm):

Don't send tiny segments.
If there's unacknowledged data:
  - Buffer small writes
  - Wait for ACK before sending

Can be disabled with TCP_NODELAY socket option
(Important for latency-sensitive apps)
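Disabling Nagle from an application looks like this in Python (TCP_NODELAY is a standard socket option):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle: small writes go out immediately (latency over efficiency)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
enabled = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
s.close()
```

This trades bandwidth efficiency for latency, which is the right trade for interactive protocols (SSH, games, RPC) but not for bulk transfer.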

Flow Control in Action

Here’s a real-world example captured with tcpdump:

Time    Direction  Seq      ACK      Win    Len
──────────────────────────────────────────────────
0.000   →          1        1        65535  1460   # Send data
0.001   →          1461     1        65535  1460   # More data
0.050   ←          1        2921     32768  0      # ACK, window shrunk
0.051   →          2921     1        65535  1460   # Continue
0.052   →          4381     1        65535  1460
0.100   ←          1        5841     16384  0      # Window shrinking
0.101   →          5841     1        65535  1460
0.150   ←          1        7301     0      0      # ZERO WINDOW!
0.650   →          7301     1        65535  1      # Window probe
0.700   ←          1        7302     8192   0      # Window opened
0.701   →          7302     1        65535  1460   # Resume

Tuning Flow Control

Receiver Buffer Size

# Linux - check current buffer sizes
$ sysctl net.core.rmem_default
net.core.rmem_default = 212992

$ sysctl net.core.rmem_max
net.core.rmem_max = 212992

# Increase for high-bandwidth applications
$ sudo sysctl -w net.core.rmem_max=16777216
$ sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
#                                   min  default  max

Application-Level Control

import socket

# Create socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Set receive buffer (affects advertised window)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1048576)  # 1MB

# Check actual buffer size (OS may adjust)
actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"Receive buffer: {actual}")

Visualizing the Window

Receiver's buffer over time:

Time=0 (empty buffer, large window)
┌────────────────────────────────────────────────────────────┐
│                        Available (Win=64KB)                │
└────────────────────────────────────────────────────────────┘

Time=1 (receiving faster than app reads)
┌───────────────────────────┬────────────────────────────────┐
│    Buffered (32KB)        │      Available (Win=32KB)      │
└───────────────────────────┴────────────────────────────────┘

Time=2 (app not reading, buffer filling)
┌───────────────────────────────────────────┬────────────────┐
│           Buffered (48KB)                 │ Avail(Win=16KB)│
└───────────────────────────────────────────┴────────────────┘

Time=3 (buffer full!)
┌────────────────────────────────────────────────────────────┐
│                    Buffered (64KB) - Win=0!                │
└────────────────────────────────────────────────────────────┘

Time=4 (app reads 32KB)
┌───────────────────────────┬────────────────────────────────┐
│    Buffered (32KB)        │      Available (Win=32KB)      │
└───────────────────────────┴────────────────────────────────┘

Summary

Flow control ensures receivers aren’t overwhelmed:

┌───────────────────────┬───────────────────────────────────┐
│ Mechanism             │ Purpose                           │
├───────────────────────┼───────────────────────────────────┤
│ Receive Window (rwnd) │ Advertises available buffer space │
│ Window Scaling        │ Enables windows > 65535 bytes     │
│ Zero Window           │ Signals "stop sending"            │
│ Persist Timer         │ Prevents deadlock on zero window  │
│ Nagle's Algorithm     │ Prevents sending tiny segments    │
│ Clark's Algorithm     │ Prevents advertising tiny windows │
└───────────────────────┴───────────────────────────────────┘

Key formulas:

Max throughput = Window / RTT
BDP = Bandwidth × RTT (optimal window size)

Flow control handles receiver capacity. But what about the network itself? That’s congestion control—our next topic.

Congestion Control

While flow control prevents overwhelming the receiver, congestion control prevents overwhelming the network. Without it, the internet would suffer from congestion collapse—everyone sending as fast as possible, causing massive packet loss and near-zero throughput.

The Congestion Problem

Multiple senders sharing a bottleneck:

Sender A (100 Mbps) ─┐
                     │
Sender B (100 Mbps) ─┼────[Router]────> 50 Mbps link ────> Internet
                     │     (bottleneck)
Sender C (100 Mbps) ─┘

If everyone sends at full speed:
  Input: 300 Mbps
  Capacity: 50 Mbps
  Result: Router drops 250 Mbps worth of packets!

Dropped packets → Retransmissions → Even more traffic → More drops
This is "congestion collapse"

TCP’s Solution: Congestion Window

TCP maintains a congestion window (cwnd) that limits how much unacknowledged data can be in flight:

Effective window = min(cwnd, rwnd)

rwnd: What the receiver can accept (flow control)
cwnd: What the network can handle (congestion control)

Even if receiver says "send 1 MB", if cwnd=64KB,
sender only sends 64KB before waiting for ACKs.

The Four Phases

TCP congestion control has four main phases:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  cwnd                                                       │
│    │                                                        │
│    │                    Congestion        │                 │
│    │                    Avoidance         │                 │
│    │               ____/                  │ ssthresh        │
│    │              /                       │←───────         │
│    │             /                  ______│                 │
│    │            /                  /      │                 │
│    │           /──────────────────/       │                 │
│    │          /                           │                 │
│    │         /                            │                 │
│    │        / Slow Start                  │                 │
│    │       /                              │                 │
│    │      /                               │                 │
│    │     /                                │                 │
│    │────/                                 │                 │
│    └─────────────────────────────────────────────> Time     │
│           Loss detected: cwnd cut,                          │
│           ssthresh lowered                                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

1. Slow Start

Despite the name, slow start grows cwnd exponentially:

Initial cwnd = 1 MSS (or IW, Initial Window, typically 10 MSS now)

Round 1: Send 1 segment, get 1 ACK → cwnd = 2
Round 2: Send 2 segments, get 2 ACKs → cwnd = 4
Round 3: Send 4 segments, get 4 ACKs → cwnd = 8
Round 4: Send 8 segments, get 8 ACKs → cwnd = 16

cwnd doubles every RTT (exponential growth)

Continues until:
  - cwnd reaches ssthresh (slow start threshold)
  - Packet loss is detected

Why “slow” start?

Before TCP had congestion control, senders would
immediately blast data at full speed. "Slow" start
is slow compared to that—it probes the network
capacity before going full throttle.

2. Congestion Avoidance

Once cwnd ≥ ssthresh, growth becomes linear:

For each RTT (when all cwnd bytes are acknowledged):
  cwnd = cwnd + MSS

Or equivalently, for each ACK (cwnd in bytes, per RFC 5681):
  cwnd = cwnd + MSS × MSS / cwnd

Example (MSS=1000, cwnd=10000):
  ACK received → cwnd = 10000 + 1000×1000/10000 = 10100
  One window = 10 segments → 10 ACKs per RTT
  After 10 ACKs (1 RTT) → cwnd = 11000

Linear growth: +1 MSS per RTT
This is much slower than slow start's doubling

3. Loss Detection and Response

When packet loss is detected, TCP assumes congestion:

Triple Duplicate ACK (Fast Retransmit):

Sender receives 3 duplicate ACKs for same sequence

Interpretation: "One packet lost, but others arriving"
  (Mild congestion, some packets getting through)

Response (TCP Reno/NewReno):
  ssthresh = cwnd / 2
  cwnd = ssthresh + 3 MSS  (Fast Recovery)
  Retransmit lost segment
  Stay in congestion avoidance

Timeout (RTO expiration):

No ACK received within timeout period

Interpretation: "Severe congestion, possibly multiple losses"
  (Major congestion, most packets lost)

Response:
  ssthresh = cwnd / 2
  cwnd = 1 MSS (or IW)
  Go back to slow start

4. Fast Recovery

After fast retransmit, enters fast recovery:

During Fast Recovery:
  For each duplicate ACK received:
    cwnd = cwnd + MSS
    (Indicates packets leaving network)

  When new ACK arrives (lost packet recovered):
    cwnd = ssthresh
    Exit fast recovery
    Enter congestion avoidance
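The four phases can be condensed into a toy Reno-style model (cwnd in MSS units; fast-recovery inflation omitted for brevity; illustrative only, not a real implementation):

```python
def reno_step(cwnd: float, ssthresh: float, event: str):
    """One RTT (or loss event) of TCP-Reno-style cwnd evolution."""
    if event == "rtt_ok":            # a full window was acknowledged
        if cwnd < ssthresh:
            cwnd *= 2                # slow start: double per RTT
        else:
            cwnd += 1                # congestion avoidance: +1 MSS per RTT
    elif event == "dup3":            # triple duplicate ACK → fast retransmit
        ssthresh = max(cwnd / 2, 2)
        cwnd = ssthresh              # continue near previous rate
    elif event == "timeout":         # RTO expired → severe congestion
        ssthresh = max(cwnd / 2, 2)
        cwnd = 1                     # back to slow start
    return cwnd, ssthresh
```

Stepping it from cwnd=1 reproduces the sawtooth: exponential growth to ssthresh, linear growth above it, halving on triple-dup-ACK, and collapse to 1 on timeout.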

Congestion Control Algorithms

Different algorithms for different scenarios:

TCP Reno (Classic)

The original widely-deployed algorithm

Slow Start:     Exponential growth
Cong. Avoid:    Linear growth (AIMD - Additive Increase)
Loss Response:  Multiplicative Decrease (cwnd/2)

AIMD = Additive Increase, Multiplicative Decrease
  - Increase by 1 MSS per RTT
  - Decrease by half on loss
  - Proven to converge to fair share

TCP NewReno

Improvement over Reno for multiple losses:

Problem with Reno:
  Multiple packets lost in one window
  Fast retransmit fixes one, then exits fast recovery
  Has to wait for timeout for others

NewReno:
  Stays in fast recovery until all lost packets recovered
  "Partial ACK" triggers retransmit of next lost segment
  Much better for high loss environments

TCP CUBIC (Linux Default)

Designed for high-bandwidth, high-latency networks

Key differences:
  - cwnd growth is cubic function of time since last loss
  - More aggressive than Reno in probing capacity
  - Better utilization of fat pipes

cwnd = C × (t - K)³ + Wmax

Where:
  C = scaling constant
  t = time since last loss
  K = time to reach Wmax
  Wmax = cwnd at last loss
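A sketch of the cubic growth function (constants C=0.4 and β=0.7 follow RFC 8312; windows in MSS units, t in seconds):

```python
def cubic_cwnd(t: float, w_max: float, c: float = 0.4,
               beta: float = 0.7) -> float:
    """CUBIC window as a function of time since the last loss."""
    # K: time needed to climb back to w_max after the multiplicative decrease
    k = ((w_max * (1 - beta)) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max
```

At t=0 the window sits at β×Wmax (the post-loss value), flattens as it approaches Wmax around t=K, then accelerates past it to probe for new capacity.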

BBR (Bottleneck Bandwidth and RTT)

Google's model-based algorithm (2016)

Revolutionary approach:
  - Explicitly measures bottleneck bandwidth
  - Explicitly measures minimum RTT
  - Doesn't use loss as primary congestion signal

Phases:
  Startup:   Exponential probing (like slow start)
  Drain:     Reduce queue after startup
  Probe BW:  Cycle through bandwidth probing
  Probe RTT: Periodically measure minimum RTT

Advantages:
  - Better throughput on lossy links
  - Lower latency (doesn't fill buffers)
  - Fairer bandwidth sharing

Visualizing Congestion Control

TCP Reno behavior over time:

cwnd │
     │            *
     │           *  *           *
     │          *    *         * *
     │         *      *       *   *
     │        *        *     *     *
     │       *          *   *       *
     │      *            * *         *
     │     *              *           *
     │    * (slow start)   Loss!      *
     │   *                  ↓          Loss!
     │  *                ssthresh set    ↓
     │ *
     │*
     └────────────────────────────────────────> Time

"Sawtooth" pattern is classic TCP Reno behavior

Congestion Control in Practice

Checking Your System’s Algorithm

# Linux
$ sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = cubic

# See available algorithms
$ sysctl net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_available_congestion_control = reno cubic bbr

# Change algorithm (root)
$ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

Per-Connection Algorithm (Linux)

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Set BBR for this connection
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b'bbr')

# Check what's set (kernel returns a null-padded 16-byte name)
algo = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
print(algo.decode().rstrip('\x00'))  # 'bbr'

ECN: Explicit Congestion Notification

Instead of dropping packets, routers can mark them:

Traditional congestion signal:
  Router overloaded → Drops packets
  Sender sees loss → Reduces cwnd

With ECN:
  Router overloaded → Sets ECN bits in IP header
  Receiver sees ECN → Echoes to sender via TCP
  Sender reduces cwnd → No packet loss!

Benefits:
  - No lost data
  - Faster response
  - Lower latency

ECN requires support from:

  • Routers (must mark instead of drop)
  • Both TCP endpoints (must negotiate)

Fairness

Congestion control isn’t just about throughput—it’s about sharing:

Two flows sharing a bottleneck:

Flow A: 100 Mbps network, long-running download
Flow B: 100 Mbps network, long-running download
Bottleneck: 10 Mbps

Fair outcome: Each gets ~5 Mbps

TCP's AIMD achieves this:
  - Both increase at same rate (additive)
  - Both decrease proportionally (multiplicative)
  - Over time, converges to fair share

RTT Fairness Problem

Flow A: 10 ms RTT
Flow B: 100 ms RTT
Same bottleneck

Problem: Flow A increases cwnd 10× faster!
  A: +1 MSS every 10ms = +100 MSS/second
  B: +1 MSS every 100ms = +10 MSS/second

Flow A gets ~10× more bandwidth
This is why CUBIC and BBR were designed

Bufferbloat

Excessive buffering causes latency issues:

Problem:
  Router has 100MB buffer
  TCP fills buffer to maximize throughput
  1000 Mbps link with 100MB buffer:
    Buffer delay = 100MB / 125MB/s = 800ms!

Packets wait in queue → High latency
TCP only reacts when buffer overflows → Too late

Solutions:
  - Active Queue Management (AQM)
  - CoDel, PIE, fq_codel
  - BBR (doesn't fill buffers)

Debugging Congestion

Symptoms

  • Good bandwidth but high latency = bufferbloat
  • Periodic throughput drops = congestion/loss
  • Consistently low throughput = bottleneck or small cwnd

Tools

# Linux: view cwnd and ssthresh
$ ss -ti
ESTAB  0  0  192.168.1.100:52431  93.184.216.34:80
    cubic wscale:7,7 rto:208 rtt:104/52 ato:40 mss:1448
    cwnd:10 ssthresh:7 bytes_sent:1448 bytes_acked:1449

# Trace cwnd over time
$ ss -ti | grep cwnd  # repeat or use watch

# tcptrace for analysis
$ tcptrace -l captured.pcap

Summary

Congestion control prevents network overload through self-regulation:

┌──────────────────────┬────────────────────┬──────────────────┐
│ Phase                │ Growth             │ Trigger          │
├──────────────────────┼────────────────────┼──────────────────┤
│ Slow Start           │ Exponential        │ cwnd < ssthresh  │
│ Congestion Avoidance │ Linear             │ cwnd ≥ ssthresh  │
│ Fast Recovery        │ +1 MSS per dup ACK │ 3 duplicate ACKs │
│ Timeout              │ Reset to 1 MSS     │ RTO expiration   │
└──────────────────────┴────────────────────┴──────────────────┘

Key algorithms:

  • Reno: Classic AIMD, good baseline
  • CUBIC: Default Linux, better for fat pipes
  • BBR: Model-based, good for lossy networks

Effective sending rate:

Rate = min(cwnd, rwnd) / RTT

Congestion control is why the internet works—millions of TCP connections sharing limited bandwidth without centralized coordination. Next, we’ll look at retransmission mechanisms—how TCP actually recovers lost data.

Retransmission Mechanisms

TCP guarantees reliable delivery by detecting lost packets and retransmitting them. This chapter explores how TCP knows when to retransmit and the mechanisms it uses to recover efficiently.

The Challenge

IP provides no delivery guarantee. Packets can be:

  • Lost (router overflow, corruption, route failure)
  • Duplicated (rare, but possible)
  • Reordered (different paths)
  • Delayed (congestion, buffering)

TCP must distinguish between these cases and respond appropriately.

How TCP Detects Loss

TCP uses two primary loss detection mechanisms:

1. Timeout (RTO)

If no ACK arrives within the Retransmission Timeout (RTO), assume the packet is lost:

Sender                               Receiver
   │                                    │
   │─── Seq=1000 (data) ───────X        │  ← Packet lost!
   │                                    │
   │    [Timer starts]                  │
   │    [waiting...]                    │
   │    [RTO expires!]                  │
   │                                    │
   │─── Seq=1000 (retransmit) ─────────>│
   │                                    │
   │<── ACK=1500 ───────────────────────│

2. Fast Retransmit (Triple Duplicate ACK)

Three duplicate ACKs indicate a packet was lost but later packets arrived:

Sender                               Receiver
   │                                    │
   │─── Seq=1000 ──────────────────────>│
   │─── Seq=1500 ─────────X             │  ← Lost!
   │─── Seq=2000 ──────────────────────>│
   │─── Seq=2500 ──────────────────────>│
   │─── Seq=3000 ──────────────────────>│
   │                                    │
   │<── ACK=1500 ──────────────────────│  (got 1000, want 1500)
   │<── ACK=1500 (dup 1) ──────────────│  (got 2000, still want 1500)
   │<── ACK=1500 (dup 2) ──────────────│  (got 2500, still want 1500)
   │<── ACK=1500 (dup 3) ──────────────│  (got 3000, still want 1500)
   │                                    │
   │   [3 dup ACKs = loss!]             │
   │                                    │
   │─── Seq=1500 (retransmit) ─────────>│
   │                                    │
   │<── ACK=3500 ──────────────────────│  (got everything!)

Fast retransmit is faster than waiting for timeout—often by hundreds of milliseconds.

Retransmission Timeout (RTO) Calculation

RTO must adapt to network conditions:

Too short: Unnecessary retransmissions (network already delivered)
Too long:  Slow recovery from actual loss

RTO is calculated from measured RTT:

SRTT = (1 - α) × SRTT + α × RTT_sample
       (Smoothed RTT, exponential moving average)
       α = 1/8

RTTVAR = (1 - β) × RTTVAR + β × |SRTT - RTT_sample|
         (RTT variance)
         β = 1/4

RTO = SRTT + max(G, 4 × RTTVAR)
      G = clock granularity (typically 1ms)

Example:
  SRTT = 100ms, RTTVAR = 25ms
  RTO = 100 + 4×25 = 200ms

RTO Bounds

Minimum RTO: Typically 200ms on Linux (RFC 6298 recommends a 1-second floor!)
Maximum RTO: Typically 120 seconds

Initial RTO: 1 second (before any measurements)

RTO Backoff

On repeated timeouts, RTO doubles (exponential backoff):

1st timeout: RTO = 200ms
2nd timeout: RTO = 400ms
3rd timeout: RTO = 800ms
4th timeout: RTO = 1600ms
...
Gives up after max retries (typically ~15)

This prevents overwhelming an already congested network.
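A sketch of the backoff schedule (values in seconds; the 120-second cap is the maximum RTO above):

```python
def backed_off_rtos(base_rto, max_rto=120.0, timeouts=6):
    """RTO after each successive timeout: doubles, capped at max_rto."""
    rtos, rto = [], base_rto
    for _ in range(timeouts):
        rtos.append(rto)
        rto = min(rto * 2, max_rto)
    return rtos

# backed_off_rtos(0.2) → [0.2, 0.4, 0.8, 1.6, 3.2, 6.4]
```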

Selective Acknowledgment (SACK)

SACK dramatically improves retransmission efficiency when multiple packets are lost:

Without SACK

Lost: packets 3 and 5 out of 1,2,3,4,5,6,7

Sender                               Receiver
   │                                    │
   │ Receives ACK=3                     │
   │ (receiver has 1,2)                 │
   │                                    │
   │ Retransmits 3                      │
   │                                    │
   │ Receives ACK=5                     │
   │ (receiver has 1,2,3,4)             │
   │                                    │
   │ Retransmits 5                      │
   │                                    │
   │ Finally ACK=8                      │

Each loss requires a separate round trip!

With SACK

Lost: packets 3 and 5

Sender                               Receiver
   │                                    │
   │ Receives:                          │
   │   ACK=3, SACK=4-5,6-8              │
   │   "Got 1-2 (ack), 4 (sack 4-5),    │
   │    6-7 (sack 6-8). Missing: 3, 5"  │
   │                                    │
   │ Retransmits 3 and 5 together       │
   │                                    │
   │ Receives ACK=8                     │

Both lost packets identified and retransmitted in one round trip!

SACK Format

TCP Option:
┌─────────┬────────┬─────────────┬─────────────┬─────┐
│ Kind=5  │ Length │ Left Edge 1 │ Right Edge 1│ ... │
│ (1 byte)│(1 byte)│  (4 bytes)  │  (4 bytes)  │     │
└─────────┴────────┴─────────────┴─────────────┴─────┘

Example: SACK 5001-6000, 7001-9000
  "I have bytes 5001-6000 and 7001-9000"
  "I'm missing 1-5000 and 6001-7000"

Maximum 4 SACK blocks in the 40-byte option space (only 3 fit when the timestamp option is also in use)
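From the cumulative ACK and the SACK blocks, the sender can compute every hole in one pass. A sketch (`missing_ranges` is a hypothetical helper; blocks use the option's half-open encoding, so (5001, 6001) covers bytes 5001-6000):

```python
def missing_ranges(cum_ack, sack_blocks, send_next):
    """Holes the sender must retransmit. SACK blocks are half-open
    [left, right) byte ranges, as the option encodes them."""
    holes = []
    expected = cum_ack
    for left, right in sorted(sack_blocks):
        if left > expected:
            holes.append((expected, left))   # a gap before this block
        expected = max(expected, right)
    if expected < send_next:
        holes.append((expected, send_next))  # tail not yet SACKed
    return holes

# The example above: cumulative ACK = 1, blocks covering 5001-6000
# and 7001-9000, highest byte sent so far is 9000:
# missing_ranges(1, [(5001, 6001), (7001, 9001)], 9001)
#   → [(1, 5001), (6001, 7001)]   (missing 1-5000 and 6001-7000)
```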

Duplicate Detection

TCP must handle duplicate packets (from retransmission or network duplication):

Sequence Number Check

Receiver tracks:
  RCV.NXT = next expected sequence number

Incoming sequence < RCV.NXT?
  → Duplicate! Already received. Discard (but still ACK).

Example:
  RCV.NXT = 5000
  Packet arrives: Seq=3000
  Already have this, discard.
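Because sequence numbers wrap, the "less than" comparison must be done modulo 2^32. A sketch of the serial-number comparison (`seq_before` is an illustrative name):

```python
SEQ_MOD = 1 << 32   # 32-bit sequence space

def seq_before(a, b):
    """True if sequence number a is 'before' b in wrapping
    32-bit sequence space (the comparison behind the RCV.NXT check)."""
    return 0 < (b - a) % SEQ_MOD < SEQ_MOD // 2

# seq_before(3000, 5000)         → True  (old duplicate, discard)
# seq_before(5000, 3000)         → False
# seq_before(4_294_967_000, 500) → True  (works across the wrap)
```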

PAWS (Protection Against Wrapped Sequences)

For high-speed connections, sequence numbers can wrap:

32-bit sequence: 0 to 4,294,967,295

At 1 Gbps: wraps every ~34 seconds
At 10 Gbps: wraps every ~3.4 seconds
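The wrap times follow from simple arithmetic over the 4 GiB sequence space; a sketch:

```python
# Time for the 32-bit sequence space (2^32 bytes) to wrap at a link rate:
def wrap_seconds(bits_per_second):
    return (2 ** 32) * 8 / bits_per_second

# wrap_seconds(1e9)  ≈ 34.4 s  (1 Gbps)
# wrap_seconds(10e9) ≈ 3.4 s   (10 Gbps)
```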

Problem:
  Old duplicate from previous wrap could be accepted
  as valid data!

Solution: Timestamps
  Each segment has timestamp
  Old segment has old timestamp
  Even if sequence matches, timestamp reveals age
  Reject segments with timestamps too old

Spurious Retransmissions

Sometimes TCP retransmits unnecessarily:

Causes:
  - RTT suddenly increased (but packet not lost)
  - Delay spike on reverse path (ACK delayed)
  - RTO calculated too aggressively

Problems:
  - Wastes bandwidth
  - cwnd reduced unnecessarily
  - Triggers congestion response

Mitigations:
  - F-RTO: Detect spurious timeout retransmissions
  - Eifel algorithm: Use timestamps to detect
  - DSACK: Receiver reports duplicate segments received

D-SACK (Duplicate SACK)

Receiver tells sender about duplicate segments:

Sender retransmits Seq=1000 (timeout)
Original arrives late at receiver
Retransmit also arrives

Receiver sends:
  ACK=2000, SACK=1000-1500 (D-SACK)
  "You already sent this, I got it twice"

Sender learns: My RTO was too aggressive!
Can adjust RTO calculation.

Retransmission in Action

Real-world packet capture showing loss and recovery:

Time     Direction  Seq      ACK      Flags  Notes
──────────────────────────────────────────────────────────────
0.000    →          1000              PSH    Send data
0.001    →          1500              PSH    Send more
0.002    →          2000              PSH    Lost!
0.003    →          2500              PSH    Send more
0.004    →          3000              PSH    Send more
0.005    →          3500              PSH    Send more

0.050    ←                   1500           ACK (got 1000)
0.051    ←                   2000           ACK (got 1500)
0.052    ←                   2000     DUP   DupACK 1 (2500 arrived, gap!)
0.053    ←                   2000     DUP   DupACK 2 (3000 arrived)
0.054    ←                   2000     DUP   DupACK 3 (3500 arrived)

0.055    →          2000              PSH    Fast retransmit!

0.105    ←                   4000           ACK (recovered!)

Optimizations

Tail Loss Probe (TLP)

Probes for loss when the sender goes idle:

Problem:
  Send last segment of request
  Segment lost
  No more data to send → No duplicate ACKs
  Must wait for full RTO

TLP solution:
  If no ACK within 2×SRTT after sending:
    Retransmit last segment (or send new probe)
    Triggers immediate feedback

Reduces tail latency significantly.

Early Retransmit

Allows fast retransmit with fewer than 3 dup ACKs:

Traditional: Need 3 dup ACKs
Problem: What if only 2 packets in flight?

Early retransmit:
  Small window (< 4 segments)
  Allow fast retransmit with just 1-2 dup ACKs
  Better for small transfers

RACK (Recent ACKnowledgment)

Time-based loss detection:

Traditional: Count duplicate ACKs
Problem: Reordering looks like loss

RACK approach:
  Track time of most recent ACK
  If segment sent before recent ACK hasn't been ACKed:
    Probably lost (not reordered)

Better handles reordering vs. loss distinction

Configuration

Linux Tuning

# View retransmission stats
$ netstat -s | grep -i retrans
    1234 segments retransmitted
    567 fast retransmits

# RTO settings
$ sysctl net.ipv4.tcp_retries1  # Soft threshold
net.ipv4.tcp_retries1 = 3

$ sysctl net.ipv4.tcp_retries2  # Hard maximum
net.ipv4.tcp_retries2 = 15

# Enable SACK (usually default)
$ sysctl net.ipv4.tcp_sack
net.ipv4.tcp_sack = 1

# Enable TLP (older kernels; this sysctl was removed once
# RACK-TLP became the default)
$ sysctl net.ipv4.tcp_early_retrans
net.ipv4.tcp_early_retrans = 3  # TLP enabled

Monitoring Retransmissions

# Count retransmits on a connection
$ ss -ti dst example.com
    ... retrans:5/10 ...
         │      └── Total retransmits
         └── Unrecovered retransmits

# Watch for retransmissions
# Watch for retransmissions (tcpdump output doesn't label them;
# use tshark's analysis filter instead)
$ tshark -i eth0 -Y tcp.analysis.retransmission

Summary

TCP uses multiple mechanisms to recover from loss:

Mechanism        Detection      Speed  Use Case
───────────────────────────────────────────────────
Timeout (RTO)    Timer expires  Slow   Last resort
Fast Retransmit  3 dup ACKs     Fast   Most losses
SACK             Explicit gaps  Fast   Multiple losses
TLP              Probe timeout  Fast   Tail losses

RTO calculation:

RTO = SRTT + max(G, 4 × RTTVAR)

Key principles:

  • Fast retransmit beats waiting for timeout
  • SACK enables efficient multi-loss recovery
  • Timestamps help detect spurious retransmissions
  • Modern algorithms (RACK) improve reordering tolerance

Understanding retransmission helps you diagnose network issues. High retransmission rates indicate packet loss—which could be congestion, bad hardware, or misconfiguration.

Next, we’ll cover TCP states—the lifecycle of a TCP connection from creation to termination.

TCP States and Lifecycle

A TCP connection progresses through a series of states from creation to termination. Understanding these states helps you debug connection issues, interpret netstat output, and understand why connections sometimes linger.

The State Diagram

                            ┌───────────────────────────────────────┐
                            │                CLOSED                  │
                            └───────────────────┬───────────────────┘
                                                │
              ┌─────────────────────────────────┼─────────────────────────────────┐
              │                                 │                                 │
              │ Passive Open                    │ Active Open                     │
              │ (Server: listen())              │ (Client: connect())             │
              ▼                                 ▼                                 │
      ┌───────────────┐                ┌───────────────┐                          │
      │    LISTEN     │                │   SYN_SENT    │                          │
      │               │                │               │                          │
      │ Waiting for   │                │ SYN sent,     │                          │
      │ connection    │                │ waiting for   │                          │
      │ request       │                │ SYN-ACK       │                          │
      └───────┬───────┘                └───────┬───────┘                          │
              │                                 │                                 │
              │ Receive SYN                     │ Receive SYN-ACK                 │
              │ Send SYN-ACK                    │ Send ACK                        │
              ▼                                 ▼                                 │
      ┌───────────────┐                ┌───────────────┐                          │
      │   SYN_RCVD    │                │  ESTABLISHED  │◄─────────────────────────┘
      │               │                │               │
      │ SYN received, │                │  Connection   │
      │ SYN-ACK sent  │──────────────>│  is open      │
      │ waiting ACK   │ Receive ACK   │               │
      └───────────────┘                └───────────────┘
                                                │
                                    ┌───────────┴───────────┐
                                    │                       │
                                    │ Close requested       │
                                    │                       │
                                    ▼                       ▼
                            (Active Close)          (Passive Close)
                            ┌───────────────┐      ┌───────────────┐
                            │   FIN_WAIT_1  │      │  CLOSE_WAIT   │
                            │               │      │               │
                            │ FIN sent,     │      │ FIN received, │
                            │ waiting ACK   │      │ ACK sent,     │
                            └───────┬───────┘      │ waiting for   │
                                    │              │ app to close  │
                      Receive ACK   │              └───────┬───────┘
                            ┌───────┴───────┐              │
                            │               │              │ App calls close()
                            ▼               │              │ Send FIN
                    ┌───────────────┐       │              ▼
                    │   FIN_WAIT_2  │       │      ┌───────────────┐
                    │               │       │      │   LAST_ACK    │
                    │ Waiting for   │       │      │               │
                    │ peer's FIN    │       │      │ FIN sent,     │
                    └───────┬───────┘       │      │ waiting ACK   │
                            │               │      └───────┬───────┘
                Receive FIN │               │              │
                Send ACK    │               │ Receive FIN  │ Receive ACK
                            │               │ Send ACK     │
                            ▼               ▼              ▼
                    ┌───────────────────────────┐  ┌───────────────┐
                    │        TIME_WAIT          │  │    CLOSED     │
                    │                           │  │               │
                    │  Wait 2×MSL before        │  │  Connection   │
                    │  fully closing            │  │  terminated   │
                    │  (typically 60-120 sec)   │  │               │
                    └─────────────┬─────────────┘  └───────────────┘
                                  │
                                  │ Timeout (2×MSL)
                                  ▼
                          ┌───────────────┐
                          │    CLOSED     │
                          └───────────────┘

State Descriptions

CLOSED

The starting and ending state. No connection exists.
Not actually tracked—absence of connection state.

LISTEN

Server is waiting for incoming connections.
Created by: listen() system call

$ netstat -an | grep LISTEN
tcp   0   0  0.0.0.0:80      0.0.0.0:*    LISTEN
tcp   0   0  0.0.0.0:22      0.0.0.0:*    LISTEN

SYN_SENT

Client has sent SYN, waiting for SYN-ACK.
Created by: connect() system call

Typical duration: 1 RTT (plus retries if lost)

If you see many SYN_SENT:
  - Remote server might be down
  - Firewall blocking SYN-ACKs
  - Network connectivity issues

SYN_RCVD (SYN_RECEIVED)

Server received SYN, sent SYN-ACK, waiting for ACK.
Part of the half-open connection.

Typical duration: 1 RTT

If you see many SYN_RCVD:
  - Could be SYN flood attack
  - Check SYN backlog settings
  - Consider SYN cookies

ESTABLISHED

Three-way handshake complete. Data can flow.
This is the normal "connection open" state.

$ netstat -an | grep ESTABLISHED
tcp  0   0  192.168.1.100:52431  93.184.216.34:80  ESTABLISHED

FIN_WAIT_1

Application called close(), FIN sent.
Waiting for ACK of FIN (or FIN from peer).

Brief transitional state.

FIN_WAIT_2

FIN acknowledged, waiting for peer's FIN.
Peer's application hasn't closed yet.

Can persist if peer doesn't close:
  - Application bug (not calling close())
  - Half-close intentional

Linux: tcp_fin_timeout controls how long to wait

CLOSE_WAIT

Received FIN from peer, sent ACK.
Waiting for local application to close.

If you see many CLOSE_WAIT:
  - Application not calling close()!
  - Resource leak / application bug
  - Common source of "too many open files"

LAST_ACK

Sent FIN after receiving peer's FIN.
Waiting for final ACK.

Brief transitional state.

TIME_WAIT

Connection fully closed, waiting before reuse.
The "lingering" state that often confuses people.

Duration: 2 × MSL (Maximum Segment Lifetime)
  MSL typically 30-60 seconds
  TIME_WAIT typically 60-120 seconds

Why it exists: (see below)

CLOSING

Rare state: Both sides sent FIN simultaneously.
Each waiting for ACK of their FIN.

Simultaneous close scenario.

Why TIME_WAIT Exists

TIME_WAIT serves two important purposes:

1. Reliable Connection Termination

Scenario: Final ACK is lost

Client                               Server
   │                                    │
   │──── FIN ──────────────────────────>│
   │<─── ACK, FIN ──────────────────────│
   │──── ACK ───────────X               │  ← Lost!
   │                                    │
   │    (Client in TIME_WAIT)           │  (Server in LAST_ACK)
   │                                    │
   │<─── FIN (retransmit) ──────────────│
   │──── ACK ──────────────────────────>│  (Re-ACK)
   │                                    │
   │    TIME_WAIT ensures we can        │
   │    re-acknowledge if needed        │

2. Prevent Stale Segments

Old connection: 192.168.1.100:52431 → 93.184.216.34:80
  Some segments still in network (delayed)

New connection with same 4-tuple:
  If TIME_WAIT didn't exist, could reuse immediately
  Old segments might be accepted as valid!

TIME_WAIT (2×MSL) ensures old segments expire:
  MSL = Maximum Segment Lifetime in network
  2×MSL = round-trip time for any lingering data

TIME_WAIT Problems and Solutions

The Problem

High-traffic servers can accumulate thousands of TIME_WAIT connections:

$ netstat -an | grep TIME_WAIT | wc -l
15234

Each TIME_WAIT:
  - Consumes memory (small, but adds up)
  - Holds ephemeral port (can exhaust ports!)
  - 4-tuple unavailable for new connections

Solutions

1. SO_REUSEADDR

import socket

# Allow bind() to reuse an address still in TIME_WAIT
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("0.0.0.0", 8080))

# Server can restart immediately after a crash
# Doesn't allow two live sockets to bind the same port at once

2. tcp_tw_reuse (Linux)

# Allow reusing TIME_WAIT sockets for outgoing connections
$ sysctl -w net.ipv4.tcp_tw_reuse=1

# Safe because timestamps prevent confusion
# Only for outgoing connections (client side)

3. Reduce TIME_WAIT duration (careful!)

# Not recommended - violates TCP specification
# Some systems allow it anyway

# Linux doesn't have direct control
# tcp_fin_timeout only affects FIN_WAIT_2

4. Connection pooling

Reuse established connections
  - HTTP Keep-Alive
  - Database connection pools
  - gRPC persistent connections

Fewer connections = fewer TIME_WAITs

5. Use server-side close

If server closes first → Server gets TIME_WAIT
If client closes first → Client gets TIME_WAIT

For servers with many short-lived connections:
  Let clients close first (HTTP/1.1 does this)

Viewing Connection States

Linux/macOS

# All connections with state
$ netstat -an
$ ss -an

# Count by state
$ ss -s
TCP:   2156 (estab 234, closed 1856, orphaned 12, timewait 1844)

# Filter by state
$ ss -t state established
$ ss -t state time-wait

# Show process info
$ ss -tp
$ netstat -tp

State Distribution Check

# Quick state summary
$ ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
   1844 TIME-WAIT
    234 ESTAB
     56 FIN-WAIT-2
     23 CLOSE-WAIT
      5 LISTEN
      3 SYN-SENT

Connection Termination: Normal vs. Abort

Graceful Close (FIN)

Normal termination - all data delivered

Client:  close() → sends FIN
         Waits for peer's FIN
         Both sides agree connection is done

4-way handshake:
  FIN →
  ← ACK
  ← FIN
  ACK →

Can be combined (FIN+ACK together)

Abortive Close (RST)

Immediate termination - may lose data

Triggers:
  - SO_LINGER with timeout=0
  - Receiving data on closed socket
  - Connection to non-listening port
  - Firewall timeout/rejection

No TIME_WAIT needed - immediate cleanup
But: any in-flight data is lost
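The first trigger can be produced deliberately from application code. A sketch in Python (illustrative; `struct.pack("ii", 1, 0)` encodes l_onoff=1, l_linger=0 on Linux):

```python
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# SO_LINGER with l_onoff=1, l_linger=0: close() sends RST, not FIN
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                struct.pack("ii", 1, 0))

# sock.close() would now abort the connection:
#   no 4-way handshake, no TIME_WAIT, in-flight data discarded
```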

Half-Close

TCP allows closing one direction:

Client: shutdown(SHUT_WR)
  - Client can't send more data
  - Client can still receive
  - Server sees EOF when reading

Use case:
  "I'm done sending, but I'll wait for your response"

Example: HTTP request sent, waiting for response
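The semantics are easy to see with a local socket pair (a sketch using Python's `socket.socketpair`; both ends live in one process):

```python
import socket

a, b = socket.socketpair()          # two connected endpoints

a.sendall(b"request")
a.shutdown(socket.SHUT_WR)          # a: "done sending" (half-close)

assert b.recv(1024) == b"request"   # b reads the request...
assert b.recv(1024) == b""          # ...then sees EOF from the half-close

b.sendall(b"response")              # b can still send back
assert a.recv(1024) == b"response"  # and a can still receive

a.close()
b.close()
```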

Common Issues

Too Many CLOSE_WAIT

Symptoms:
  - Connections stuck in CLOSE_WAIT
  - "Too many open files" errors
  - Application eventually fails

Cause:
  - Application receiving FIN but not calling close()
  - Bug in cleanup code
  - Exception handling not closing sockets

Fix:
  - Fix application to properly close sockets
  - Use finally blocks / context managers
  - Check for file descriptor leaks

Too Many TIME_WAIT

Symptoms:
  - Thousands of TIME_WAIT connections
  - Port exhaustion for outgoing connections
  - "Cannot assign requested address" errors

Cause:
  - Many short-lived outgoing connections
  - Server closing connections (gets TIME_WAIT)

Fix:
  - Connection pooling
  - tcp_tw_reuse (client-side)
  - Let clients close first (server-side)
  - Longer-lived connections

SYN_RECV Accumulation

Symptoms:
  - Many connections in SYN_RCVD
  - New connections rejected
  - Server appears slow or unresponsive

Cause:
  - SYN flood attack
  - Slow/lossy network (ACKs not arriving)

Fix:
  - Enable SYN cookies
  - Increase SYN backlog
  - Rate limiting
  - DDoS protection

Summary

TCP states track the connection lifecycle:

State        Side      Meaning
─────────────────────────────────────────────────────
LISTEN       Server    Waiting for connections
SYN_SENT     Client    Handshake in progress
SYN_RCVD     Server    Handshake in progress
ESTABLISHED  Both      Connection open
FIN_WAIT_1   Closer    Sent FIN, waiting ACK
FIN_WAIT_2   Closer    FIN ACKed, waiting peer FIN
CLOSE_WAIT   Receiver  Received FIN, app hasn’t closed
LAST_ACK     Receiver  Sent FIN, waiting final ACK
TIME_WAIT    Closer    Waiting to ensure clean close
CLOSED       Both      No connection

Key debugging insights:

  • CLOSE_WAIT accumulation = application not closing sockets
  • TIME_WAIT accumulation = many short connections (may be normal)
  • SYN_RCVD accumulation = possible SYN flood attack

This completes our deep dive into TCP. You now understand the protocol that powers most of the internet. Next, we’ll look at UDP—the simpler, faster alternative.

UDP: The Simple Protocol

UDP (User Datagram Protocol) is TCP’s minimalist counterpart. Where TCP provides reliability, ordering, and connection management, UDP provides almost nothing—just a thin wrapper around IP. This simplicity makes it fast, lightweight, and ideal for certain use cases.

What UDP Provides

┌─────────────────────────────────────────────────────────────┐
│                    UDP Provides                             │
├─────────────────────────────────────────────────────────────┤
│  ✓ Multiplexing via ports                                   │
│    (Multiple apps can use the network)                      │
│                                                             │
│  ✓ Checksum for error detection                             │
│    (Optional in IPv4, mandatory in IPv6)                    │
│                                                             │
│  That's it. Really.                                         │
└─────────────────────────────────────────────────────────────┘

What UDP Does NOT Provide

┌─────────────────────────────────────────────────────────────┐
│                 UDP Does NOT Provide                        │
├─────────────────────────────────────────────────────────────┤
│  ✗ Reliability (packets may be lost)                        │
│  ✗ Ordering (packets may arrive out of order)               │
│  ✗ Duplication prevention (same packet may arrive twice)    │
│  ✗ Connection state (no handshake, no teardown)             │
│  ✗ Flow control (can overwhelm receiver)                    │
│  ✗ Congestion control (can overwhelm network)               │
└─────────────────────────────────────────────────────────────┘

TCP vs. UDP at a Glance

TCP                              UDP
─────────────────────────────────────────────────────────────
Connection-oriented              Connectionless
Reliable delivery                Best-effort delivery
Ordered delivery                 No ordering guarantee
Flow control                     No flow control
Congestion control               No congestion control
Higher latency                   Lower latency
Higher overhead                  Lower overhead
Stream-based                     Message-based

Why Choose UDP?

If UDP lacks so many features, why use it?

1. Lower Latency

TCP connection setup:
  1. SYN ────────> (1 RTT)
  2. <──────── SYN-ACK
  3. ACK + Data ──> (another RTT for handshake)

UDP "setup":
  1. Data ────────> (immediate!)

For single request-response: UDP saves at least 1 RTT

2. No Head-of-Line Blocking

TCP: Packet 3 lost

  Received: 1, 2, [gap], 4, 5, 6, 7
                   │
                   └── Can't deliver 4-7 until 3 arrives!

UDP: Packet 3 lost

  Received: 1, 2, 4, 5, 6, 7  ← Deliver immediately!
                 │
                 └── Application decides what to do

For real-time apps, old data is often worthless anyway.

3. Message Boundaries Preserved

TCP is a byte stream:
  send("Hello")
  send("World")

  Receiver might get: "HelloWorld" or "Hell" + "oWorld"
  (No message boundaries)

UDP is message-based:
  sendto("Hello")
  sendto("World")

  Receiver gets: "Hello" then "World"
  (Each datagram is discrete)

4. Application Control

TCP decides:
  - When to retransmit
  - How fast to send
  - How to react to loss

UDP lets the application decide:
  - Custom retransmission logic
  - Application-specific rate control
  - Skip old data, prioritize new

When to Use UDP

┌─────────────────────────────────────────────────────────────┐
│                    UDP Is Good For:                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Real-time Applications                                     │
│    • Voice/Video calls (VoIP)                               │
│    • Live streaming                                         │
│    • Online gaming                                          │
│    • Real-time sensor data                                  │
│                                                             │
│  Simple Request-Response                                    │
│    • DNS queries                                            │
│    • NTP (time sync)                                        │
│    • DHCP                                                   │
│                                                             │
│  Broadcast/Multicast                                        │
│    • Service discovery                                      │
│    • Network announcements                                  │
│    • LAN games                                              │
│                                                             │
│  Custom Protocols                                           │
│    • QUIC (UDP-based, adds reliability)                     │
│    • Custom game protocols                                  │
│    • IoT protocols                                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

What You’ll Learn

In this chapter:

  1. UDP Header and Datagrams: The simple packet format
  2. When to Use UDP: Detailed use cases and examples
  3. UDP vs TCP Trade-offs: Making the right choice

UDP’s simplicity is its strength. By providing just enough transport-layer functionality, it enables applications to build exactly what they need on top—nothing more, nothing less.

UDP Header and Datagrams

The UDP header is remarkably simple—just 8 bytes compared to TCP’s minimum of 20. This minimalism is by design, providing just enough functionality to multiplex applications and detect corruption.

UDP Datagram Structure

┌─────────────────────────────────────────────────────────────┐
│                      UDP Datagram                           │
├─────────────────────────────────────────────────────────────┤
│     UDP Header      │           UDP Payload                 │
│      (8 bytes)      │      (0 to 65,507 bytes)              │
└─────────────────────────────────────────────────────────────┘

Maximum payload: 65,535 (IP max) - 20 (IP header) - 8 (UDP header) = 65,507 bytes

But practical limit is usually much smaller due to MTU.

The UDP Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
│          Source Port          │       Destination Port        │
├───────────────────────────────┼───────────────────────────────┤
│            Length             │           Checksum            │
├───────────────────────────────┴───────────────────────────────┤
│                                                               │
│                           Payload                             │
│                                                               │
└───────────────────────────────────────────────────────────────┘

That’s it. Four 16-bit fields. Compare to TCP’s 10+ fields!

Header Fields

Source Port (16 bits)

The sender's port number.

Optional: Can be 0 if no reply is expected
  (Though this is rarely done in practice)

Used by receiver to send responses back.

Destination Port (16 bits)

The receiver's port number.

Identifies the application/service.
Well-known ports same as TCP:
  53  = DNS
  67  = DHCP Server
  68  = DHCP Client
  69  = TFTP
  123 = NTP
  161 = SNMP
  500 = IKE (IPsec)

Length (16 bits)

Total length of UDP datagram (header + payload).

Minimum: 8 (header only, no payload)
Maximum: 65535 (theoretical, rarely practical)

Length = 8 + payload_size

Checksum (16 bits)

Error detection for header and data.

IPv4: Optional (0 = disabled)
IPv6: Mandatory

Calculated over:
  - UDP pseudo-header (from IP)
  - UDP header
  - UDP payload (padded if odd length)
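The whole header can be built and parsed with Python's `struct` module (a sketch; `build_udp_header` is an illustrative name, field order as in the diagram above):

```python
import struct

def build_udp_header(src_port, dst_port, payload_len, checksum=0):
    """Pack the four 16-bit fields in network byte order ("!")."""
    return struct.pack("!HHHH", src_port, dst_port, 8 + payload_len, checksum)

hdr = build_udp_header(52431, 53, 32)       # e.g. a 32-byte DNS query
assert len(hdr) == 8                        # the entire header
src, dst, length, csum = struct.unpack("!HHHH", hdr)
# src=52431, dst=53, length=40 (8-byte header + 32-byte payload)
```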

Pseudo-Header

Like TCP, UDP checksum covers IP addresses:

IPv4 Pseudo-Header:
┌───────────────────────────────────────────────────────────────┐
│                       Source IP Address                       │
├───────────────────────────────────────────────────────────────┤
│                    Destination IP Address                     │
├───────────────┬───────────────┬───────────────────────────────┤
│ Zero (8 bits) │ Protocol = 17 │      UDP Length (16 bits)     │
└───────────────┴───────────────┴───────────────────────────────┘

This ensures the datagram reaches the intended destination.
If IP addresses were modified, checksum fails.
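A sketch of the computation for IPv4 (assumes the checksum field in the datagram is zeroed before computing; `internet_checksum` and `udp_checksum` are illustrative names):

```python
import socket
import struct

def internet_checksum(data):
    """One's-complement sum of 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"               # pad odd-length data
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def udp_checksum(src_ip, dst_ip, udp_datagram):
    """Checksum over the IPv4 pseudo-header plus the UDP datagram."""
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, len(udp_datagram)))
    return internet_checksum(pseudo + udp_datagram)
```

To verify a received datagram, recompute the sum with the checksum field left in place; a valid datagram yields 0.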

Comparing Headers

TCP Header (minimum):               UDP Header:
┌───────────────────────────┐      ┌───────────────────────────┐
│ Source Port (16)          │      │ Source Port (16)          │
│ Destination Port (16)     │      │ Destination Port (16)     │
│ Sequence Number (32)      │      │ Length (16)               │
│ Acknowledgment (32)       │      │ Checksum (16)             │
│ Data Offset/Flags (16)    │      └───────────────────────────┘
│ Window (16)               │           8 bytes total
│ Checksum (16)             │
│ Urgent Pointer (16)       │
│ [Options...]              │
└───────────────────────────┘
     20+ bytes

TCP overhead: 20+ bytes
UDP overhead: 8 bytes

For small messages, the difference matters!

UDP Socket Programming

Sending a Datagram

import socket

# Create UDP socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# No connect() needed - just send!
message = b"Hello, UDP!"
sock.sendto(message, ("192.168.1.100", 12345))

# Can send to different destinations with same socket
sock.sendto(b"Hello A", ("192.168.1.101", 12345))
sock.sendto(b"Hello B", ("192.168.1.102", 12345))

sock.close()

Receiving Datagrams

import socket

# Create and bind UDP socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 12345))

print("Listening on port 12345...")

while True:
    # recvfrom returns data AND sender address
    data, addr = sock.recvfrom(65535)  # Buffer size
    print(f"Received from {addr}: {data.decode()}")

    # Can reply directly
    sock.sendto(b"Got it!", addr)

Connected UDP (Optional)

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Can "connect" UDP socket - sets default destination
sock.connect(("192.168.1.100", 12345))

# Now can use send() instead of sendto()
sock.send(b"Hello!")

# recv() instead of recvfrom()
response = sock.recv(1024)

# Also enables receiving ICMP errors
# (Unconnected UDP sockets don't see them)

Message Boundaries

Unlike TCP’s byte stream, UDP preserves message boundaries:

Sender:
  sendto(b"Message 1")
  sendto(b"Message 2")
  sendto(b"Message 3")

Receiver:
  recvfrom() → b"Message 1"
  recvfrom() → b"Message 2"
  recvfrom() → b"Message 3"

Each datagram is delivered as a complete unit (or not at all).
Never get partial messages or merged messages.

Datagram Size Considerations

Practical Limits

Maximum theoretical: 65,507 bytes
Maximum without fragmentation: MTU - IP header - UDP header
  Ethernet: 1500 - 20 - 8 = 1472 bytes
  Jumbo:    9000 - 20 - 8 = 8972 bytes

Recommended maximum for reliability:
  512-1400 bytes (avoids fragmentation)

DNS uses 512 bytes historically (EDNS allows larger)
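These limits are simple arithmetic; a sketch:

```python
def max_udp_payload(mtu, ip_header=20):
    """Largest UDP payload fitting one IP packet, no fragmentation."""
    return mtu - ip_header - 8      # 8-byte UDP header

# max_udp_payload(1500) → 1472   (standard Ethernet)
# max_udp_payload(9000) → 8972   (jumbo frames)
```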

Fragmentation Problem

UDP datagram > MTU gets fragmented at IP layer:

Original: 3000-byte UDP datagram
          │
          ▼
┌─────────────────┐  ┌─────────────────┐  ┌────────────┐
│ IP Frag 1       │  │ IP Frag 2       │  │ IP Frag 3  │
│ UDP hdr + 1472B │  │ Data (1480B)    │  │ Data (48B) │
└─────────────────┘  └─────────────────┘  └────────────┘

Problems:
  - Any fragment lost = entire datagram lost
  - No automatic retransmission
  - Higher effective loss rate

Best practice: Keep datagrams under MTU

No Connection State

UDP sockets don’t track connections:

TCP Server:
  listen()
  while True:
      client = accept()  ← New socket per connection
      handle(client)
      client.close()

UDP Server:
  bind()
  while True:
      data, addr = recvfrom()  ← All messages on same socket
      # addr tells you who sent it
      handle(data, addr)
      sendto(response, addr)

UDP has no notion of "accepted connections"
Just receives datagrams with source addresses.
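The single-socket server pattern looks like this in Python (a self-contained loopback sketch; port and messages are illustrative):

```python
import socket

# One UDP socket serves every client — there is no accept() and no
# per-connection socket. The source address on each datagram is the
# only "connection" state.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))                 # ephemeral port for the demo
server.settimeout(2.0)
server_addr = server.getsockname()

clients = []
for i in range(2):
    c = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    c.settimeout(2.0)
    c.sendto(b"hello from client %d" % i, server_addr)
    clients.append(c)

# Serve both datagrams from the single socket, replying to each sender.
for _ in range(2):
    data, addr = server.recvfrom(1024)
    server.sendto(b"ack: " + data, addr)

replies = [c.recv(1024) for c in clients]
print(replies)
```

Each reply is routed back by source address, so every client receives its own ack no matter which datagram the server happened to read first.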

Common UDP Patterns

Request-Response

Client                              Server
   │                                   │
   │─── Request (with client port) ───>│
   │                                   │
   │<── Response (to client port) ─────│
   │                                   │

Simple: One datagram each direction.
DNS works this way.
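A request-response client needs its own timeout-and-retry logic, because UDP never retransmits. The helper below is a sketch (names like udp_request are our own), paired with a tiny loopback responder so it runs standalone:

```python
import socket
import threading

def udp_request(server_addr, payload, timeout=1.0, retries=3):
    """One datagram out, one datagram back; resend on timeout.

    The retry loop IS the application-level reliability for this
    pattern — UDP provides none of its own.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for _ in range(retries):
            sock.sendto(payload, server_addr)
            try:
                return sock.recvfrom(4096)[0]
            except socket.timeout:
                continue                 # request or reply lost: resend
        raise TimeoutError(f"no reply after {retries} attempts")
    finally:
        sock.close()

# Tiny loopback responder so the sketch is self-contained.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
threading.Thread(
    target=lambda: server.sendto(b"reply", server.recvfrom(1024)[1]),
    daemon=True,
).start()

print(udp_request(server.getsockname(), b"query"))  # b'reply'
```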

Streaming

Source                              Destination
   │                                     │
   │─── Packet 1 (seq=1) ───────────────>│
   │─── Packet 2 (seq=2) ───────────────>│
   │─── Packet 3 (seq=3) ──X             │  Lost!
   │─── Packet 4 (seq=4) ───────────────>│
   │─── Packet 5 (seq=5) ───────────────>│
   │                                     │
   │   Receiver notices gap, may request │
   │   retransmit or skip packet 3       │

Application implements sequencing/recovery as needed.
Video streaming, gaming use this pattern.
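Gap detection via sequence numbers can be sketched without a network at all — prepend a sequence number to each payload and watch for jumps on receipt (packet framing here is our own illustration):

```python
import struct

def make_packet(seq, payload):
    # 4-byte big-endian sequence number prepended to the payload
    return struct.pack("!I", seq) + payload

def receive(packets):
    """Detect gaps the way a streaming receiver might."""
    expected, gaps, frames = 0, [], []
    for pkt in packets:
        seq = struct.unpack("!I", pkt[:4])[0]
        if seq != expected:
            gaps.extend(range(expected, seq))   # packets we never saw
        frames.append((seq, pkt[4:]))
        expected = seq + 1
    return frames, gaps

sent = [make_packet(i, b"frame %d" % i) for i in range(5)]
arrived = sent[:2] + sent[3:]        # simulate losing seq=2
frames, gaps = receive(arrived)
print(gaps)   # [2]
```

On seeing the gap, the application decides: request a retransmit, interpolate, or simply move on.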

Multicast

One sender, multiple receivers:

Source ─────────────────┬───────────────> Receiver A
    │                   │
    │    Multicast      ├───────────────> Receiver B
    │    Group          │
    │                   └───────────────> Receiver C

UDP is required for multicast (TCP is point-to-point only).
Used for IPTV, service discovery, LAN gaming.

Viewing UDP Traffic

# Linux: Show UDP sockets
$ ss -u -a
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
UNCONN  0       0       0.0.0.0:68           0.0.0.0:*
UNCONN  0       0       127.0.0.1:323        0.0.0.0:*

# Capture UDP packets
$ tcpdump -i eth0 udp port 53
14:23:15.123 IP 192.168.1.100.52431 > 8.8.8.8.53: UDP, length 32

# Show UDP statistics
$ netstat -su
Udp:
    1234567 packets received
    12 packets to unknown port received
    0 packet receive errors
    1234560 packets sent

Summary

The UDP header is minimal by design:

┌──────────────────┬─────────┬────────────────────┐
│ Field            │ Size    │ Purpose            │
├──────────────────┼─────────┼────────────────────┤
│ Source Port      │ 16 bits │ Reply address      │
│ Destination Port │ 16 bits │ Target application │
│ Length           │ 16 bits │ Datagram size      │
│ Checksum         │ 16 bits │ Error detection    │
└──────────────────┴─────────┴────────────────────┘

Key characteristics:

  • 8-byte header (vs TCP’s 20+)
  • Message-oriented (boundaries preserved)
  • Connectionless (no state to manage)
  • No fragmentation at UDP level (handled by IP)

UDP provides just enough to identify applications and detect corruption. Everything else—reliability, ordering, flow control—is the application’s responsibility (if needed at all).

Next, we’ll explore when UDP is the right choice and common use cases.

When to Use UDP

Choosing UDP over TCP is a significant architectural decision. UDP shines in specific scenarios where its characteristics—low latency, no connection overhead, and application control—outweigh the lack of built-in reliability.

Primary Use Cases

Real-Time Communication

Voice over IP (VoIP)

Why UDP?

TCP behavior on packet loss:
  Packet lost → Retransmit → Arrives 200ms later
  Audio: "Hello, how are--[200ms pause]--you?"

UDP behavior:
  Packet lost → Move on
  Audio: "Hello, how are [brief glitch] you?"

Humans tolerate small audio gaps.
Humans hate delays in conversation.

VoIP typically tolerates 1-5% packet loss gracefully.
Delay > 150ms makes conversation awkward.

Video Streaming (Live)

Live video constraints:
  - Frame every 33ms (30 fps)
  - Old frames are worthless
  - Viewer can't wait for retransmit

UDP approach:
  Lost packet? Skip it, show next frame.
  Minor visual artifact better than frozen video.

Note: Buffered streaming (Netflix) often uses TCP.
      TCP's reliability works when you can buffer ahead.

Online Gaming

Game server sends world state 60 times/second:

Frame 1: Player at (100, 200)
Frame 2: Player at (102, 201)  ← Lost!
Frame 3: Player at (104, 202)
Frame 4: Player at (106, 203)

With TCP: Wait for Frame 2 retransmit
          Game stutters, all updates delayed

With UDP: Skip Frame 2
          Frame 3 has newer position anyway!
          Smooth gameplay

Games implement their own:
  - Sequence numbers (detect loss)
  - Interpolation (smooth missing frames)
  - Prediction (guess missing data)
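Interpolation for a lost frame can be as simple as averaging the neighbors, assuming evenly spaced updates (a sketch using the positions from the example above):

```python
def interpolate(prev, nxt):
    # Midpoint between neighbors — assumes evenly spaced frames
    return tuple((a + b) / 2 for a, b in zip(prev, nxt))

frames = {1: (100, 200), 3: (104, 202), 4: (106, 203)}  # frame 2 lost
frames[2] = interpolate(frames[1], frames[3])
print(frames[2])  # (102.0, 201.0)
```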

Simple Request-Response

DNS (Domain Name System)

DNS query:
  Client: "What's the IP for example.com?"
  Server: "93.184.216.34"

Why UDP (historically)?
  - Single small request (<512 bytes)
  - Single small response
  - No connection state needed
  - Low latency critical (affects every web request)

UDP saves: 1 RTT (the TCP handshake)

Modern note: DNS over TCP exists and is growing
  - Large responses (DNSSEC)
  - DNS over HTTPS/TLS (encrypted, uses TCP)

NTP (Network Time Protocol)

Time sync:
  Client: "What time is it?"
  Server: "2024-01-15 14:23:45.123456"

Latency matters for accuracy!
  Every ms of delay affects time calculation

UDP request-response: ~1 RTT
TCP setup + request: ~2 RTT

DHCP (Dynamic Host Configuration Protocol)

Network bootstrapping:
  Client: "I need an IP address!" (broadcast)
  Server: "You can use 192.168.1.100"

Special challenge: Client has NO IP address yet!
  TCP requires a source IP to establish a connection
  UDP can broadcast from 0.0.0.0 before one is assigned

Also: DHCP's design predates many modern TCP optimizations

Broadcast and Multicast

Service Discovery

Finding services on local network:

Option 1 (TCP): Connect to every device, ask "Are you a printer?"
                Slow, inefficient, doesn't scale

Option 2 (UDP multicast):
  Send to multicast address: "Who's a printer?"
  All printers respond: "Me! I'm at 192.168.1.50"

mDNS/Bonjour uses this (224.0.0.251, port 5353)

IPTV / Live TV Distribution

Sending same video to 10,000 viewers:

TCP: 10,000 separate connections
     10,000 copies of each packet
     Massive server load

UDP Multicast: 1 stream
               Network duplicates as needed
               Scales to any audience size

Multicast REQUIRES UDP (TCP is point-to-point).

Custom Protocols

QUIC (HTTP/3)

QUIC is a custom protocol over UDP:
  - Implements reliability (like TCP)
  - Implements congestion control (like TCP)
  - But with multiplexing, 0-RTT, migration

Why not just improve TCP?
  - TCP is in operating system kernels
  - Kernel changes take years to deploy
  - Middleboxes (firewalls, NAT) expect TCP behavior

UDP is a blank slate:
  - Implement in userspace (fast iteration)
  - Passes through middleboxes (they rarely parse UDP payloads)
  - Customize behavior completely

Custom Game Protocols

Games often need:
  - Reliable delivery for some messages (chat, purchases)
  - Unreliable for others (position updates)
  - Priority levels
  - Custom congestion handling

TCP: One-size-fits-all, no customization
UDP: Build exactly what you need

Many game engines implement hybrid:
  - Reliable ordered channel (mimics TCP)
  - Reliable unordered channel
  - Unreliable channel
  All over single UDP socket.

Lightweight IoT

Sensor Networks

Thousands of sensors reporting temperature:
  - Small messages (few bytes)
  - Frequent updates
  - Individual readings not critical
  - Network/power constrained

TCP overhead per reading:
  20-byte header (often > payload!)
  Connection state on server

UDP overhead:
  8-byte header
  No state
  Fire and forget

CoAP (Constrained Application Protocol) uses UDP.

When NOT to Use UDP

File Transfer

File transfer requirements:
  ✓ Complete delivery (every byte matters)
  ✓ Correct order
  ✓ Error detection

UDP would require implementing:
  - Sequence numbers
  - Acknowledgments
  - Retransmission
  - Congestion control

...basically reimplementing TCP poorly.

Use TCP for file transfer. Or QUIC.

Web APIs / HTTP

HTTP requires:
  ✓ Reliable delivery (incomplete JSON is useless)
  ✓ Request-response matching
  ✓ Large responses

TCP is the right choice.
(HTTP/3 uses QUIC over UDP, but QUIC handles reliability)

Anything Through Firewalls

Many corporate firewalls:
  - Allow TCP 80, 443
  - Block most UDP
  - May even block all UDP

If targeting corporate networks:
  Consider TCP for better connectivity.

WebSocket (TCP) often works where custom UDP doesn't.

Decision Framework

┌─────────────────────────────────────────────────────────────┐
│               Should I Use UDP?                             │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
           ┌──────────────────────────────┐
           │ Is low latency critical?     │
           │ (Real-time, interactive)     │
           └──────────────┬───────────────┘
                          │
              ┌───────────┴───────────┐
              │                       │
             Yes                      No
              │                       │
              ▼                       ▼
   ┌────────────────────┐   ┌────────────────────┐
   │ Can you tolerate   │   │ Do you need        │
   │ some data loss?    │   │broadcast/multicast?│
   └─────────┬──────────┘   └─────────┬──────────┘
             │                        │
     ┌───────┴───────┐        ┌───────┴───────┐
     │               │        │               │
    Yes              No      Yes              No
     │               │        │               │
     ▼               ▼        ▼               ▼
  ┌──────┐     ┌───────┐   ┌──────┐     ┌───────┐
  │ UDP! │     │QUIC or│   │ UDP! │     │  TCP  │
  │      │     │  TCP  │   │      │     │       │
  └──────┘     └───────┘   └──────┘     └───────┘

Real-World Examples

Discord Voice

Text chat: TCP (reliable)
Voice chat: UDP (low latency)

Voice handling:
  - Opus codec (tolerates loss)
  - Packet loss concealment
  - Jitter buffer
  - Falls back to TCP if UDP blocked

Zoom

Video: UDP preferred, TCP fallback
Audio: UDP preferred, TCP fallback
Screen share: UDP preferred

Quality adapts to conditions:
  - High loss? Reduce quality
  - UDP blocked? Switch to TCP
  - Still works, but with higher latency

DNS

Traditional: UDP port 53
  - Fast, simple
  - Limited to 512 bytes (without EDNS)

DNS over TCP: Port 53
  - Large responses (DNSSEC)
  - Zone transfers

DNS over HTTPS: TCP port 443
  - Encrypted
  - Privacy focused
  - More overhead

Online Games

Fortnite, Valorant, etc.:
  Position updates: UDP (unreliable, frequent)
  Game events: UDP (reliable channel)
  Chat: UDP or TCP
  Downloads/patches: TCP

Hybrid approach is common.

Summary

Use UDP when:

  • Latency matters more than reliability
  • Data has short lifespan (real-time)
  • Some loss is acceptable
  • Broadcast/multicast needed
  • Building custom protocol (like QUIC)
  • Extreme resource constraints (IoT)

Use TCP when:

  • Every byte must arrive
  • Order matters
  • Simplicity preferred (let TCP handle complexity)
  • Firewall traversal important
  • Building on existing TCP-based protocols

The key question: Is old data still valuable?

  • Yes → TCP (file, web page, API)
  • No → Consider UDP (voice, video, game state)

Next, we’ll do a detailed comparison of UDP vs TCP trade-offs.

UDP vs TCP Trade-offs

Choosing between UDP and TCP isn’t about which is “better”—it’s about understanding the trade-offs and matching them to your requirements. This chapter provides a detailed comparison to help you make informed decisions.

Feature Comparison

┌─────────────────────────────────────────────────────────────────────────┐
│               Feature                  │    TCP    │    UDP    │        │
├────────────────────────────────────────┼───────────┼───────────┼────────┤
│ Reliable delivery                      │    ✓      │    ✗      │        │
│ Ordered delivery                       │    ✓      │    ✗      │        │
│ Error detection                        │    ✓      │    ✓*     │ *opt.  │
│ Flow control                           │    ✓      │    ✗      │        │
│ Congestion control                     │    ✓      │    ✗      │        │
│ Connection-oriented                    │    ✓      │    ✗      │        │
│ Message boundaries                     │    ✗      │    ✓      │        │
│ Broadcast/Multicast                    │    ✗      │    ✓      │        │
│ NAT traversal friendly                 │    ✓      │   varies  │        │
│ Firewall friendly                      │    ✓      │    ✗      │        │
└────────────────────────────────────────┴───────────┴───────────┴────────┘

Latency Analysis

Connection Establishment

TCP - New Connection:
┌────────────────────────────────────────────────────────────┐
│  0ms    Client sends SYN                                   │
│  50ms   Server receives, sends SYN-ACK                     │
│  100ms  Client receives, sends ACK + first data            │
│  150ms  Server receives first data                         │
│                                                            │
│  Minimum latency to first data: 1.5 RTT                    │
└────────────────────────────────────────────────────────────┘

TCP - Established Connection:
┌────────────────────────────────────────────────────────────┐
│  0ms    Client sends data                                  │
│  50ms   Server receives data                               │
│                                                            │
│  Latency: 0.5 RTT (one-way)                                │
└────────────────────────────────────────────────────────────┘

UDP:
┌────────────────────────────────────────────────────────────┐
│  0ms    Client sends datagram                              │
│  50ms   Server receives datagram                           │
│                                                            │
│  Latency: 0.5 RTT (always)                                 │
│  No connection overhead!                                   │
└────────────────────────────────────────────────────────────┘

Request-Response Latency

Single request, single response:

TCP (new connection):
  Handshake:  1 RTT
  Request:    0.5 RTT
  Response:   0.5 RTT
  Total:      2 RTT

TCP (existing connection):
  Request:    0.5 RTT
  Response:   0.5 RTT
  Total:      1 RTT

UDP:
  Request:    0.5 RTT
  Response:   0.5 RTT
  Total:      1 RTT

For one-shot interactions, UDP saves 1 RTT.
For repeated interactions, TCP connection reuse matches UDP.

Latency Under Loss

5% packet loss scenario:

TCP:
  Packet lost → Detected (3 dup ACKs or timeout)
  Fast retransmit: ~1 RTT additional delay
  Timeout: Several seconds delay!

  Also: Congestion window reduced
        Subsequent packets slowed

UDP:
  Packet lost → Application decides:
    - Ignore it (real-time)
    - Request retransmit (application-level)
    - Interpolate from adjacent data

  No cascading effects on other packets.

Throughput Analysis

Header Overhead

Per-packet overhead:

TCP: 20-60 bytes (typically 32 with timestamps)
UDP: 8 bytes

Efficiency for 100-byte payload:
  TCP: 100 / 132 = 76%
  UDP: 100 / 108 = 93%

Efficiency for 1400-byte payload:
  TCP: 1400 / 1432 = 98%
  UDP: 1400 / 1408 = 99%

UDP's advantage shrinks with larger payloads.
Matters most for small messages.
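The efficiency figures above fall out of one small formula (32 bytes is a typical TCP header with common options):

```python
def efficiency(payload, header):
    # Fraction of transmitted transport-layer bytes that is payload
    return payload / (payload + header)

for payload in (100, 1400):
    tcp = efficiency(payload, 32)   # TCP header with timestamps
    udp = efficiency(payload, 8)    # fixed UDP header
    print(f"{payload}B payload: TCP {tcp:.0%}, UDP {udp:.0%}")
```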

Maximum Throughput

TCP:
  Limited by: min(cwnd, rwnd) / RTT
  Congestion control prevents network overload
  Fair sharing with other flows

  Example: 64KB window, 50ms RTT
           Max: 64KB / 50ms = 1.28 MB/s

UDP:
  Limited by: Application send rate
  No built-in limits!
  Can overwhelm network

  Can achieve wire speed... if network allows.
  But may cause massive loss and collateral damage.
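The TCP window limit above is a one-line calculation (treating 64 KB as 64,000 bytes for round decimal numbers):

```python
def tcp_max_throughput(window_bytes, rtt_seconds):
    # At most one full window can be in flight per round trip
    return window_bytes / rtt_seconds

bps = tcp_max_throughput(64_000, 0.050)
print(f"{bps / 1e6:.2f} MB/s")  # 1.28 MB/s
```

To go faster over the same RTT, TCP needs a larger window (window scaling) or the application needs multiple connections.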

Behavior Under Congestion

Network congested:

TCP:
  Detects loss → Reduces cwnd
  Backs off → Congestion clears
  Gradually increases again
  "Good citizen" - shares fairly

UDP:
  No awareness of congestion
  Keeps sending at same rate
  Causes more congestion
  Other TCP flows suffer

This is why uncontrolled UDP can harm the network.
Responsible UDP apps implement their own congestion control.

Reliability Implications

Handling Loss

TCP handles loss automatically:
  1. Detects via ACK timeout or dup ACKs
  2. Retransmits lost segment
  3. Adjusts congestion window
  4. Application sees reliable byte stream

UDP loss is application's problem:
  1. Application must detect (if it cares)
  2. Application must request retransmit (if needed)
  3. Application decides what to do

Sometimes that's a feature:
  Video codec can mask lost frame
  Game can interpolate missing position
  Voice can use error concealment

Ordering Implications

TCP guarantees order:
  Sent: A B C D E
  Received: A B C D E (always)

  If C is lost:
    B arrives, delivered
    D arrives, buffered
    E arrives, buffered
    C retransmitted, arrives
    C D E delivered in order

UDP makes no guarantee:
  Sent: A B C D E
  Received: A B D C E (possible)
            A B D E (C lost)
            A D B C E (reordered)

Application must handle or ignore.
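If an application over UDP does want TCP-like ordering, the classic approach is a reorder buffer: hold out-of-order arrivals until the gap fills. A minimal sketch:

```python
def deliver_in_order(packets):
    """Buffer out-of-order datagrams and release them in sequence."""
    buffer, next_seq, delivered = {}, 0, []
    for seq, data in packets:
        buffer[seq] = data
        while next_seq in buffer:            # release any run that is now complete
            delivered.append(buffer.pop(next_seq))
            next_seq += 1
    return delivered

print(deliver_in_order([(0, "A"), (1, "B"), (3, "D"), (2, "C"), (4, "E")]))
# ['A', 'B', 'C', 'D', 'E']
```

Note this reintroduces head-of-line blocking: "D" waits in the buffer until "C" shows up.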

Resource Usage

Server Memory

TCP Server (10,000 connections):
  Per connection:
    - Socket structure
    - Send buffer (~16KB)
    - Receive buffer (~16KB)
    - TCP control block

  Total: ~320MB for buffers alone
         Plus connection tracking overhead

UDP Server (10,000 "clients"):
  Single socket:
    - One send buffer
    - One receive buffer
    - No connection state!

  Total: ~32KB
         Applications track state if needed

UDP scales better for many ephemeral interactions.

CPU Usage

TCP per packet:
  - Checksum calculation
  - Sequence number tracking
  - ACK generation
  - Window management
  - Congestion control
  - Timer management

UDP per packet:
  - Checksum calculation (optional in IPv4)
  - That's it

UDP has lower CPU overhead per packet.
But if you implement reliability, you add CPU work.

NAT and Firewall Behavior

NAT Traversal

TCP through NAT:
  1. Client connects out
  2. NAT creates mapping
  3. Server responses follow mapping
  4. Works reliably

UDP through NAT:
  1. Client sends datagram out
  2. NAT creates mapping
  3. Mapping may timeout quickly!
  4. Need keepalive packets

UDP NAT mappings often timeout in 30-120 seconds.
Long-lived UDP "connections" need periodic keepalive.

Firewall Policies

Common firewall behavior:

Corporate firewalls:
  TCP 80 (HTTP): Usually allowed
  TCP 443 (HTTPS): Usually allowed
  UDP 53 (DNS): Often allowed
  UDP 123 (NTP): Sometimes allowed
  Other UDP: Often blocked!

If targeting corporate networks:
  UDP may not work
  TCP or WebSocket more reliable
  HTTPS most reliable

When to Choose What

Choose TCP When:

✓ Data integrity critical (files, transactions)
✓ Simple implementation preferred
✓ Operating through corporate firewalls
✓ Long-lived connections
✓ Need reliable delivery without custom code
✓ Building on HTTP, TLS, or other TCP protocols

Choose UDP When:

✓ Real-time requirements (voice, video, gaming)
✓ Broadcast or multicast needed
✓ Small, independent messages
✓ Custom reliability acceptable
✓ Willing to implement congestion control
✓ Protocol requires it (DNS, DHCP, QUIC)

Consider QUIC When:

✓ Want UDP benefits with reliability
✓ Need multiple streams without HoL blocking
✓ Want 0-RTT connection resumption
✓ Willing to use a more complex library
✓ Building modern web services

Performance Comparison Summary

┌────────────────────────────────────────────────────────────────────────┐
│ Metric                   │ TCP                 │ UDP                   │
├──────────────────────────┼─────────────────────┼───────────────────────┤
│ Initial latency          │ 1-1.5 RTT overhead  │ No overhead           │
│ Steady-state latency     │ Similar             │ Similar               │
│ Latency under loss       │ High (retransmit)   │ Low (skip if desired) │
│ Throughput (clean)       │ Good                │ Can exceed            │
│ Throughput (lossy)       │ Degrades gracefully │ Application-dependent │
│ Header overhead          │ 20-60 bytes         │ 8 bytes               │
│ Server memory            │ High                │ Low                   │
│ Server CPU               │ Moderate            │ Low                   │
│ Implementation effort    │ Low (OS handles)    │ High (if reliability) │
└──────────────────────────┴─────────────────────┴───────────────────────┘

Hybrid Approaches

Many applications use both:

Example: Online Game

TCP for:
  - Authentication
  - Chat messages
  - Purchases/transactions
  - Downloading updates

UDP for:
  - Player positions
  - World state
  - Audio chat
  - Time-sensitive events

Single codebase, two transports, best of both worlds.

Summary

The choice between TCP and UDP depends on your specific requirements:

┌──────────────────────────────┬───────────────┐
│ Requirement                  │ Prefer        │
├──────────────────────────────┼───────────────┤
│ Simplicity                   │ TCP           │
│ Reliability built-in         │ TCP           │
│ Lowest latency               │ UDP           │
│ Real-time tolerance for loss │ UDP           │
│ Broadcast/multicast          │ UDP           │
│ Corporate firewall traversal │ TCP           │
│ Custom protocol over UDP     │ Consider QUIC │
└──────────────────────────────┴───────────────┘

Neither protocol is universally “better.” Understanding the trade-offs lets you make the right choice for your application—or use both where appropriate.

This completes our coverage of UDP. Next, we’ll explore DNS—the internet’s naming system that typically uses UDP for queries.

DNS: The Internet’s Directory

DNS (Domain Name System) is the internet’s phone book. It translates human-readable domain names like example.com into IP addresses like 93.184.216.34. Without DNS, we’d have to memorize IP addresses for every website—the internet would be unusable.

Why DNS Matters

Every network connection starts with DNS:

You type: https://github.com
Browser needs: IP address

1. Browser → DNS: "What's the IP for github.com?"
2. DNS → Browser: "140.82.114.3"
3. Browser → 140.82.114.3: "GET / HTTP/1.1"
4. Server → Browser: "Here's the page!"

DNS lookup happens before any connection.
DNS performance affects EVERY request.

The Hierarchical Design

DNS is a distributed database organized as a tree:

                              . (root)
                              │
         ┌────────────────────┼────────────────────┐
         │                    │                    │
        com                  org                  net
         │                    │                    │
    ┌────┼────┐          ┌────┼────┐              ...
    │    │    │          │    │
example google ...     wikipedia ...
    │
   www

Domain: www.example.com
  - "." is the root (usually implicit)
  - "com" is the Top-Level Domain (TLD)
  - "example" is the Second-Level Domain
  - "www" is a subdomain

Key Concepts

Domain Names

Fully Qualified Domain Name (FQDN):
  www.example.com.
                 └── Trailing dot means "this is complete"
                     (Usually omitted in browsers)

Labels: Separated by dots
  - Each label: 1-63 characters
  - Total FQDN: max 253 characters
  - Case-insensitive (Example.COM = example.com)
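The length rules translate directly into a validation helper (a sketch checking lengths only, not the allowed characters):

```python
def valid_dns_name_lengths(name):
    """Check only the DNS length rules, not the permitted characters."""
    name = name.rstrip(".")              # a trailing dot just marks an FQDN
    if not name or len(name) > 253:
        return False
    return all(1 <= len(label) <= 63 for label in name.split("."))

print(valid_dns_name_lengths("www.example.com"))   # True
print(valid_dns_name_lengths("a" * 64 + ".com"))   # False: label too long
```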

DNS Servers

┌─────────────────────────────────────────────────────────────┐
│                  Types of DNS Servers                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Recursive Resolver (Caching Nameserver)                    │
│    - What your computer typically talks to                  │
│    - Does the heavy lifting of finding answers              │
│    - Caches results for faster subsequent queries           │
│    - Examples: 8.8.8.8 (Google), 1.1.1.1 (Cloudflare)       │
│                                                             │
│  Authoritative Nameserver                                   │
│    - Holds the actual DNS records for a zone                │
│    - Is the "source of truth" for that domain               │
│    - Responds to queries about its zones                    │
│                                                             │
│  Root Nameservers                                           │
│    - 13 root server clusters (a.root-servers.net, etc.)     │
│    - Know where to find TLD servers                         │
│    - Foundation of the entire DNS system                    │
│                                                             │
│  TLD Nameservers                                            │
│    - Manage .com, .org, .net, country codes, etc.           │
│    - Know authoritative servers for each domain             │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The Resolution Process (Preview)

Query: "What's the IP for www.example.com?"

Your Computer → Recursive Resolver
                        │
                        ├──> Root Server: "Who handles .com?"
                        │    "Go ask a.gtld-servers.net"
                        │
                        ├──> .com TLD: "Who handles example.com?"
                        │    "Go ask ns1.example.com"
                        │
                        ├──> example.com NS: "What's www.example.com?"
                        │    "It's 93.184.216.34"
                        │
                        └──> Returns answer to your computer

Multiple round trips, but caching makes it fast.

Why DNS Uses UDP (Mostly)

Traditional DNS:
  - Small queries (~50 bytes)
  - Small responses (~100-500 bytes)
  - Single request-response
  - Speed matters (affects every page load)

UDP advantages:
  - No connection overhead
  - Faster resolution
  - Lower server load

When TCP is used:
  - Responses > 512 bytes (EDNS extends this)
  - Zone transfers between servers
  - DNS over TLS (DoT)
  - DNS over HTTPS (DoH)

What You’ll Learn

In this chapter:

  1. DNS Resolution Process: How lookups actually work
  2. Record Types: A, AAAA, CNAME, MX, and more
  3. DNS Caching: How TTLs and caching improve performance
  4. DNSSEC: Securing DNS against tampering

Understanding DNS helps you:

  • Debug “cannot resolve hostname” errors
  • Configure domains correctly
  • Understand CDN and load balancing behavior
  • Recognize DNS-based attacks

DNS Resolution Process

When your browser needs to find example.com, a complex but elegant process unfolds. Understanding this process helps you debug DNS issues and optimize performance.

The Query Journey

A full DNS resolution involves multiple servers:

┌─────────────┐    ┌───────────────┐    ┌───────────────┐
│Your Computer│───>│   Recursive   │───>│ Root Servers  │
│             │    │   Resolver    │    │ (13 clusters) │
│  (Stub      │    │  (8.8.8.8)    │    │               │
│   Resolver) │    │               │    └───────┬───────┘
└─────────────┘    │               │            │
                   │               │<───────────┘
                   │               │    ┌───────────────┐
                   │               │───>│ TLD Servers   │
                   │               │    │ (.com, .org)  │
                   │               │    │               │
                   │               │<───┴───────────────┘
                   │               │    ┌───────────────┐
                   │               │───>│ Authoritative │
                   │               │    │   Nameserver  │
                   └───────┬───────┘    │ (example.com) │
                           │            └───────────────┘
                           │
                    Answer returned
                    to your computer

Step-by-Step Resolution

Let’s trace a query for www.example.com:

Step 1: Local Stub Resolver

Your computer checks (in order):
  1. Local cache (recently resolved names)
  2. /etc/hosts file (manual overrides)
  3. If not found → Query configured DNS server

$ cat /etc/hosts
127.0.0.1   localhost
192.168.1.10  myserver.local

$ cat /etc/resolv.conf  # Linux
nameserver 8.8.8.8
nameserver 8.8.4.4

If not in cache or hosts → Send UDP query to 8.8.8.8

Step 2: Recursive Resolver Check Cache

Recursive resolver (8.8.8.8) checks its cache:

Cache might have:
  - www.example.com → 93.184.216.34 (exact match!)
  - example.com NS → ns1.example.com (partial help)
  - .com NS → a.gtld-servers.net (partial help)

Cache hit? Return immediately!
Cache miss? Start the recursive lookup.

Step 3: Query Root Servers

Resolver → Root Server (a.root-servers.net)

Q: "What's the IP for www.example.com?"

Root server response:
  "I don't know www.example.com, but .com is handled by:
   a.gtld-servers.net (192.5.6.30)
   b.gtld-servers.net (192.33.14.30)
   ... (and others)

   This is a REFERRAL, not an answer.
   Go ask them."

Type: NS (Name Server) referral

Step 4: Query TLD Servers

Resolver → .com TLD Server (a.gtld-servers.net)

Q: "What's the IP for www.example.com?"

TLD server response:
  "I don't know www.example.com, but example.com is handled by:
   ns1.example.com (93.184.216.34)
   ns2.example.com (93.184.216.35)

   Go ask them."

Type: NS referral + glue records (IPs of nameservers)

Step 5: Query Authoritative Server

Resolver → Authoritative NS (ns1.example.com)

Q: "What's the IP for www.example.com?"

Authoritative response:
  "www.example.com has address 93.184.216.34"

Type: A record (the actual answer!)

This server IS authoritative for example.com.
The answer is definitive, not a referral.

Step 6: Return to Client

Recursive resolver:
  1. Caches the answer (and intermediate results)
  2. Returns answer to your computer

Your computer:
  1. Caches the answer
  2. Uses IP to connect

Total time: 50-200ms (uncached)
Cached lookup: <1ms

Query Types

Recursive Query

Client → Recursive Resolver:
"Get me the answer, do whatever it takes"

Resolver must:
  - Return the answer, OR
  - Return an error

Client doesn't do iterative lookups itself.

Iterative Query

Resolver → Authoritative Servers:
"Tell me what you know"

Server response can be:
  - The answer (if authoritative)
  - A referral (try somewhere else)
  - Error (doesn't exist)

Resolver follows referrals iteratively.

DNS Message Format

DNS Query/Response Structure:

┌────────────────────────────────────────────────────────────┐
│                        Header                              │
│  - Query ID (match responses to queries)                   │
│  - Flags (query/response, recursion desired, etc.)         │
│  - Question count, Answer count, Authority count, etc.     │
├────────────────────────────────────────────────────────────┤
│                       Question                             │
│  - Name: www.example.com                                   │
│  - Type: A (or AAAA, MX, etc.)                             │
│  - Class: IN (Internet)                                    │
├────────────────────────────────────────────────────────────┤
│                        Answer                              │
│  - Name: www.example.com                                   │
│  - Type: A                                                 │
│  - TTL: 3600                                               │
│  - Data: 93.184.216.34                                     │
├────────────────────────────────────────────────────────────┤
│                      Authority                             │
│  (Nameservers for the zone)                                │
├────────────────────────────────────────────────────────────┤
│                      Additional                            │
│  (Extra helpful records, like NS IP addresses)             │
└────────────────────────────────────────────────────────────┘
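The header and question sections can be built by hand with struct, which makes the layout concrete (a sketch: the query ID is fixed for illustration, and QTYPE 1 means an A record in class IN):

```python
import struct

def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet for the given name."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1,
    # then zero answer/authority/additional counts.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question: each label is length-prefixed, terminated by a zero byte,
    # followed by QTYPE and QCLASS (IN = 1).
    qname = b"".join(bytes([len(l)]) + l.encode() for l in name.split("."))
    question = qname + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question

pkt = build_query("www.example.com")
print(len(pkt))  # 33 bytes: 12 header + 17 name + 4 type/class
```

Sending this datagram to a resolver on UDP port 53 would yield a response whose first two bytes echo the query ID.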

DNS Query in Action

Using dig to see the resolution:

$ dig www.example.com +trace

; <<>> DiG 9.16.1 <<>> www.example.com +trace
;; global options: +cmd
.                       518400  IN  NS  a.root-servers.net.
.                       518400  IN  NS  b.root-servers.net.
;; Received 262 bytes from 8.8.8.8#53(8.8.8.8) in 12 ms

com.                    172800  IN  NS  a.gtld-servers.net.
com.                    172800  IN  NS  b.gtld-servers.net.
;; Received 828 bytes from 198.41.0.4#53(a.root-servers.net) in 24 ms

example.com.            172800  IN  NS  ns1.example.com.
example.com.            172800  IN  NS  ns2.example.com.
;; Received 268 bytes from 192.5.6.30#53(a.gtld-servers.net) in 32 ms

www.example.com.        3600    IN  A   93.184.216.34
;; Received 56 bytes from 93.184.216.34#53(ns1.example.com) in 16 ms

Negative Responses

What if the domain doesn’t exist?

NXDOMAIN

Query: thisdomaindoesnotexist.com

Response:
  Status: NXDOMAIN (Non-Existent Domain)
  Meaning: Domain doesn't exist at all

This is authoritative - the domain really doesn't exist.
Can be cached (negative caching).

NODATA

Query: example.com (type AAAA for IPv6)

Response:
  Status: NOERROR, with zero answer records ("NODATA")
  Meaning: Domain exists but has no record of this type

example.com has A records but no AAAA records.
Also cached negatively.

Resolver Behavior

Timeouts and Retries

Resolver query to server times out:

Default behavior:
  Timeout: ~2 seconds
  Retries: 2-3 attempts
  Tries alternate servers in list

Total resolution might take:
  Best case: <50ms (cached)
  Typical: 50-200ms (uncached)
  Worst case: Several seconds (timeouts)

Server Selection

Multiple nameservers for redundancy:

ns1.example.com
ns2.example.com

Resolver tracks:
  - Response times per server
  - Failure counts
  - Prefers faster/more reliable servers

"Smoothed Round Trip Time" (SRTT) helps pick fastest.

Common Resolution Issues

“Could not resolve hostname”

Causes:
  1. DNS server unreachable (network issue)
  2. Domain doesn't exist (NXDOMAIN)
  3. DNS server returning errors
  4. Local resolver misconfigured

Debug:
  $ nslookup example.com
  $ dig example.com
  $ ping 8.8.8.8  # Can you reach DNS server?

Slow Resolution

Causes:
  1. Cache empty (first lookup is slow)
  2. DNS server far away
  3. DNS server overloaded
  4. Network latency

Solutions:
  - Use closer DNS server
  - Increase local cache size/TTL
  - Pre-resolve critical domains
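
Pre-resolving and local caching can be sketched with a tiny in-process cache. The class name, TTL default, and injected lookup function below are hypothetical, chosen so the sketch stays self-contained:

```python
import time

class DnsCache:
    """Minimal in-process DNS cache with a fixed TTL per entry."""

    def __init__(self, lookup, ttl=300):
        self.lookup = lookup          # function: name -> IP string
        self.ttl = ttl                # seconds each entry stays fresh
        self._cache = {}              # name -> (ip, expires_at)

    def resolve(self, name, now=None):
        now = time.monotonic() if now is None else now
        hit = self._cache.get(name)
        if hit and hit[1] > now:      # still fresh: serve from cache
            return hit[0]
        ip = self.lookup(name)        # cache miss: do the real lookup
        self._cache[name] = (ip, now + self.ttl)
        return ip
```

At startup you could call `resolve()` once for each critical domain so the first user request never pays the lookup latency.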

Stale Cache

Situation:
  Website changed IP
  Your cache still has old IP
  Connection fails

Solutions:
  $ sudo resolvectl flush-caches          # Linux systemd
  $ sudo dscacheutil -flushcache          # macOS
  $ ipconfig /flushdns                    # Windows

  Or wait for TTL to expire.

Programming with DNS

Basic Lookup (Python)

import socket

# Simple lookup
ip = socket.gethostbyname('example.com')
print(ip)  # 93.184.216.34

# Get all addresses (IPv4 + IPv6)
infos = socket.getaddrinfo('example.com', 80)
for info in infos:
    family, socktype, proto, canonname, sockaddr = info
    print(f"{family.name}: {sockaddr[0]}")

Using dnspython Library

import dns.resolver

# A record lookup
answers = dns.resolver.resolve('example.com', 'A')
for rdata in answers:
    print(f"IP: {rdata}")

# MX record lookup
answers = dns.resolver.resolve('example.com', 'MX')
for rdata in answers:
    print(f"Mail server: {rdata.exchange} (priority {rdata.preference})")

# Tracing (like dig +trace)
import dns.query
import dns.zone

# ... more advanced queries

Summary

DNS resolution follows a hierarchical pattern:

Your Computer
    │
    ▼
Recursive Resolver (does the work)
    │
    ├──> Root Servers (.com? .org? .net?)
    │
    ├──> TLD Servers (example.com? github.com?)
    │
    └──> Authoritative Servers (www? mail? api?)
            │
            ▼
         Answer!

Key points:

  • Stub resolvers on your computer do minimal work
  • Recursive resolvers (like 8.8.8.8) do the heavy lifting
  • Caching at every level makes it fast
  • Authoritative servers are the source of truth
  • TTL values control cache duration

Next, we’ll explore the different types of DNS records and their uses.

DNS Record Types

DNS stores more than just IP addresses. Different record types serve different purposes—from pointing domain names to servers, to routing email, to verifying domain ownership.

Common Record Types

┌───────────────────────────────────────────────────────────────────────┐
│ Type  │ Name                   │ Purpose                              │
├───────┼────────────────────────┼──────────────────────────────────────┤
│  A    │ Address                │ Maps name to IPv4 address            │
│ AAAA  │ IPv6 Address           │ Maps name to IPv6 address            │
│ CNAME │ Canonical Name         │ Alias to another name                │
│  MX   │ Mail Exchange          │ Email server for domain              │
│  TXT  │ Text                   │ Arbitrary text (verification, SPF)   │
│  NS   │ Name Server            │ Authoritative servers for zone       │
│ SOA   │ Start of Authority     │ Zone metadata and parameters         │
│  PTR  │ Pointer                │ Reverse DNS (IP to name)             │
│  SRV  │ Service                │ Service location (port, priority)    │
│ CAA   │ Cert. Authority Auth.  │ Which CAs can issue certificates     │
└───────┴────────────────────────┴──────────────────────────────────────┘

A Record (Address)

Maps a domain name to an IPv4 address.

Record:
  Name:  example.com
  Type:  A
  TTL:   3600
  Value: 93.184.216.34

Lookup:
$ dig example.com A
example.com.    3600    IN    A    93.184.216.34

Multiple A Records

Load balancing via DNS:

example.com.    300    IN    A    192.0.2.1
example.com.    300    IN    A    192.0.2.2
example.com.    300    IN    A    192.0.2.3

Client picks one (often randomly or round-robin).
Simple load distribution, no dedicated load balancer.

Drawbacks:
  - No health checking
  - Uneven distribution possible
  - Cached entries persist after server failure
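
The client-side behavior can be sketched in a few lines; the addresses are the example records from this section:

```python
# Picking among multiple A records: random choice (common stub-resolver
# behavior) vs. round-robin rotation (what a server may do per query).
import itertools
import random

a_records = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]

server = random.choice(a_records)             # random selection

rotation = itertools.cycle(a_records)         # round-robin rotation
first_three = [next(rotation) for _ in range(3)]
```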

AAAA Record (IPv6 Address)

Maps a domain name to an IPv6 address.

Record:
  Name:  example.com
  Type:  AAAA
  TTL:   3600
  Value: 2606:2800:220:1:248:1893:25c8:1946

Lookup:
$ dig example.com AAAA
example.com.    3600    IN    AAAA    2606:2800:220:1:248:1893:25c8:1946

"AAAA" = four times "A" = four times the address size (32 → 128 bits)

Dual Stack

Many domains have both A and AAAA:

example.com.    A       93.184.216.34
example.com.    AAAA    2606:2800:220:1:248:1893:25c8:1946

Client chooses based on connectivity:
  - Happy Eyeballs algorithm prefers IPv6
  - Falls back to IPv4 if IPv6 fails

CNAME Record (Canonical Name)

Creates an alias pointing to another domain name.

Record:
  Name:  www.example.com
  Type:  CNAME
  TTL:   3600
  Value: example.com

Lookup:
$ dig www.example.com
www.example.com.    3600    IN    CNAME    example.com.
example.com.        3600    IN    A        93.184.216.34

Resolver follows the chain:
  www.example.com → example.com → 93.184.216.34
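
Chain-following can be modeled with a toy record table. The dict and the depth limit below are illustrative; real resolvers likewise cap chain length to guard against loops:

```python
# Toy resolver that follows CNAMEs until it reaches an A record.
records = {
    "www.example.com": ("CNAME", "example.com"),
    "example.com":     ("A", "93.184.216.34"),
}

def resolve(name: str, max_depth: int = 8) -> str:
    for _ in range(max_depth):       # guard against CNAME loops
        rtype, value = records[name]
        if rtype == "A":
            return value
        name = value                 # follow the alias
    raise RuntimeError("CNAME chain too long")
```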

CNAME Use Cases

1. WWW alias:
   www.example.com → example.com

2. CDN integration:
   cdn.example.com → d1234.cloudfront.net

3. Service endpoints:
   api.example.com → api-prod.company.internal

4. Environment switching:
   app.example.com → staging.example.com  (during testing)
   app.example.com → production.example.com  (in production)

CNAME Restrictions

Cannot coexist with other records at same name:

INVALID:
  example.com    CNAME    other.com
  example.com    A        1.2.3.4      ← Conflict!

INVALID:
  example.com    CNAME    other.com
  example.com    MX       mail.example.com  ← Conflict!

Therefore: Cannot use CNAME at zone apex (example.com)
           Must use A/AAAA records there

Workarounds:
  - ALIAS records (provider-specific, not standard DNS)
  - ANAME records (draft standard)

MX Record (Mail Exchange)

Specifies email servers for a domain.

Record:
  Name:     example.com
  Type:     MX
  TTL:      3600
  Priority: 10
  Value:    mail.example.com

Lookup:
$ dig example.com MX
example.com.    3600    IN    MX    10 mail.example.com.
example.com.    3600    IN    MX    20 backup.example.com.

MX Priority

Lower number = higher priority

example.com.    MX    10 primary.mail.example.com.
example.com.    MX    20 secondary.mail.example.com.
example.com.    MX    30 backup.mail.example.com.

Email delivery attempts:
  1. Try primary (priority 10)
  2. If unavailable, try secondary (priority 20)
  3. Last resort: backup (priority 30)
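
The delivery order falls out of a simple sort on preference; the hostnames are the examples above:

```python
# Sort MX records by preference: lower number = tried first.
mx_records = [
    (20, "secondary.mail.example.com"),
    (10, "primary.mail.example.com"),
    (30, "backup.mail.example.com"),
]
delivery_order = [host for pref, host in sorted(mx_records)]
# primary, then secondary, then backup
```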

Email Flow with MX

Sending email to user@example.com:

1. Sender's MTA queries: example.com MX
2. Gets: mail.example.com (priority 10)
3. Queries: mail.example.com A
4. Gets: 93.184.216.100
5. Connects to 93.184.216.100:25 (SMTP)
6. Delivers email

TXT Record

Stores arbitrary text. Used for verification and email security.

Record:
  Name:  example.com
  Type:  TXT
  TTL:   3600
  Value: "v=spf1 include:_spf.google.com ~all"

Lookup:
$ dig example.com TXT
example.com.    3600    IN    TXT    "v=spf1 include:_spf.google.com ~all"

Common TXT Uses

1. SPF (Sender Policy Framework):
   "v=spf1 include:_spf.google.com ~all"
   Specifies authorized email senders

2. DKIM (DomainKeys Identified Mail):
   selector._domainkey.example.com TXT "v=DKIM1; k=rsa; p=..."
   Public key for email signing

3. DMARC (Domain-based Message Authentication):
   _dmarc.example.com TXT "v=DMARC1; p=reject; rua=mailto:..."
   Policy for handling authentication failures

4. Domain verification:
   example.com TXT "google-site-verification=abc123..."
   Proves domain ownership to services

5. Custom data:
   example.com TXT "contact=admin@example.com"
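
TXT values are just strings, so parsing is straightforward. A minimal split of the SPF example above:

```python
# Split an SPF TXT value into its version tag and mechanisms.
spf = "v=spf1 include:_spf.google.com ~all"
version, *mechanisms = spf.split()
# version: "v=spf1"; mechanisms: the include and the ~all policy
```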

NS Record (Name Server)

Specifies authoritative nameservers for a zone.

Record:
  Name:  example.com
  Type:  NS
  TTL:   86400
  Value: ns1.example.com

Lookup:
$ dig example.com NS
example.com.    86400    IN    NS    ns1.example.com.
example.com.    86400    IN    NS    ns2.example.com.

Delegation

NS records delegate subdomains:

example.com zone has:
  subdomain.example.com.    NS    ns1.subdomain-hosting.com.

Now subdomain.example.com has its own nameservers.
The parent zone "delegates" authority.

SOA Record (Start of Authority)

Contains zone metadata and parameters.

$ dig example.com SOA
example.com.    3600    IN    SOA    ns1.example.com. admin.example.com. (
                            2024011501 ; Serial
                            7200       ; Refresh
                            3600       ; Retry
                            1209600    ; Expire
                            3600 )     ; Minimum TTL

Fields:
  Primary NS:     ns1.example.com
  Admin email:    admin@example.com (@ replaced with .)
  Serial:         Version number (often YYYYMMDDNN)
  Refresh:        How often secondaries check for updates
  Retry:          Retry interval after failed refresh
  Expire:         When secondary data becomes invalid
  Minimum:        Negative caching TTL

PTR Record (Pointer)

Maps IP addresses back to names (reverse DNS).

IP to name lookup:

IP: 93.184.216.34
Reverse zone: 34.216.184.93.in-addr.arpa

Record:
  34.216.184.93.in-addr.arpa.    PTR    example.com.

Lookup:
$ dig -x 93.184.216.34
34.216.184.93.in-addr.arpa. 3600 IN PTR example.com.
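
Python's standard library can derive the reverse-zone name directly, for both address families:

```python
# Build the PTR query name from an address with the ipaddress module.
import ipaddress

rev4 = ipaddress.ip_address("93.184.216.34").reverse_pointer
# "34.216.184.93.in-addr.arpa"

rev6 = ipaddress.ip_address("2606:2800:220:1:248:1893:25c8:1946").reverse_pointer
# IPv6 reverses each hex nibble and ends in ".ip6.arpa"
```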

Reverse DNS Uses

1. Email server verification:
   Receiving servers check if sender IP has valid PTR
   Missing/mismatched PTR → likely spam

2. Logging and auditing:
   Convert IPs to names for readable logs

3. Security analysis:
   Quick identification of attacking IPs

SRV Record (Service)

Specifies location of services with port and priority.

Record format:
  _service._proto.name    SRV    priority weight port target

Example:
  _sip._tcp.example.com.    SRV    10 60 5060 sipserver.example.com.
  _xmpp._tcp.example.com.   SRV    10 50 5222 xmpp.example.com.

Fields:
  Priority: Lower = preferred (like MX)
  Weight:   For load balancing among same priority
  Port:     Service port number
  Target:   Server hostname
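
Target selection can be sketched as: filter to the lowest priority, then pick weighted-random among those. This is a simplification of the RFC 2782 algorithm; the second SIP host and the backup entry below are hypothetical additions to the example records:

```python
# Pick an SRV target: lowest priority wins; weight balances within it.
import random

srv = [
    (10, 60, 5060, "sipserver.example.com"),
    (10, 40, 5060, "sip2.example.com"),       # hypothetical second host
    (20, 0, 5061, "backup.example.com"),      # hypothetical backup
]

best = min(rec[0] for rec in srv)
candidates = [rec for rec in srv if rec[0] == best]
weights = [rec[1] for rec in candidates]
priority, weight, port, target = random.choices(candidates, weights=weights)[0]
```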

SRV Use Cases

1. VoIP/SIP:
   _sip._tcp.example.com → voip.example.com:5060

2. XMPP/Jabber:
   _xmpp-client._tcp.example.com → chat.example.com:5222

3. LDAP:
   _ldap._tcp.example.com → ldap.example.com:389

4. Kubernetes services:
   _http._tcp.myservice.namespace.svc.cluster.local → pod-ip:port

CAA Record (Certificate Authority Authorization)

Controls which Certificate Authorities can issue SSL certificates.

Record:
  example.com.    CAA    0 issue "letsencrypt.org"
  example.com.    CAA    0 issuewild ";"
  example.com.    CAA    0 iodef "mailto:security@example.com"

Meanings:
  issue:     Which CA can issue regular certs
  issuewild: Which CA can issue wildcard certs (";" = none)
  iodef:     Where to report violations

$ dig example.com CAA
example.com.    3600    IN    CAA    0 issue "letsencrypt.org"

Querying Different Record Types

# A record (IPv4)
$ dig example.com A

# AAAA record (IPv6)
$ dig example.com AAAA

# MX record (mail servers)
$ dig example.com MX

# TXT record (text)
$ dig example.com TXT

# All records
$ dig example.com ANY  # Note: Many servers don't support ANY

# Specific nameserver
$ dig @8.8.8.8 example.com A

# Short output
$ dig +short example.com A
93.184.216.34

# Trace resolution path
$ dig +trace example.com

Record TTL Considerations

TTL (Time To Live) controls caching:

Long TTL (86400 = 24 hours):
  + Fewer queries, lower load
  + Faster lookups (cached)
  - Slow to update, changes take time

Short TTL (60 = 1 minute):
  + Quick updates
  + Fast failover
  - More queries
  - Higher load on nameservers

Recommendations:
  Stable records: 3600-86400 (1-24 hours)
  Dynamic/failover: 60-300 (1-5 minutes)
  During migration: Reduce before, restore after

Summary

┌────────┬───────────────────┬─────────────────────────────┐
│ Record │ Purpose           │ Example Value               │
├────────┼───────────────────┼─────────────────────────────┤
│ A      │ IPv4 address      │ 93.184.216.34               │
│ AAAA   │ IPv6 address      │ 2606:2800:220:1::1          │
│ CNAME  │ Alias             │ www → example.com           │
│ MX     │ Mail server       │ 10 mail.example.com         │
│ TXT    │ Text/verification │ "v=spf1 …"                  │
│ NS     │ Nameservers       │ ns1.example.com             │
│ SOA    │ Zone metadata     │ Serial, timers              │
│ PTR    │ Reverse lookup    │ IP → name                   │
│ SRV    │ Service location  │ priority weight port target │
│ CAA    │ CA authorization  │ 0 issue "letsencrypt.org"   │
└────────┴───────────────────┴─────────────────────────────┘

Understanding record types helps you:

  • Configure domains correctly
  • Debug email delivery issues
  • Set up SSL certificates
  • Implement service discovery

Next, we’ll explore DNS caching and how TTLs affect performance.

DNS Caching

Caching is what makes DNS fast. Without it, every web request would require multiple round trips to root servers, TLD servers, and authoritative servers. Understanding caching helps you balance performance against update speed.

The Caching Hierarchy

DNS caches exist at multiple levels:

┌─────────────────────────────────────────────────────────────┐
│                    Caching Layers                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Browser Cache          (seconds to minutes)                │
│       ↓                                                     │
│  Operating System       (minutes to hours)                  │
│       ↓                                                     │
│  Local DNS Server       (minutes to hours)                  │
│  (home router, office)                                      │
│       ↓                                                     │
│  Recursive Resolver     (minutes to days)                   │
│  (8.8.8.8, ISP DNS)                                         │
│       ↓                                                     │
│  Authoritative Server   (source of truth)                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Each level can serve cached responses.
Request only goes further if cache misses.

TTL (Time To Live)

Every DNS record has a TTL that controls how long it can be cached:

Record with TTL:
  example.com.    3600    IN    A    93.184.216.34
                   │
                   └── Cache for 3600 seconds (1 hour)

When cached:
  - Resolver stores record with timestamp
  - Returns cached response for subsequent queries
  - After TTL expires, must re-query authoritative server

How TTL Decrements

Authoritative server returns:
  example.com.    3600    IN    A    93.184.216.34

Resolver caches at T=0:
  TTL remaining: 3600

After 1000 seconds (T=1000):
  Client queries resolver
  Resolver returns from cache with TTL=2600

After 3600 seconds (T=3600):
  TTL=0, entry expired
  Next query goes to authoritative server
  Fresh record cached with new TTL
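
The decrement is just arithmetic against the time the record entered the cache:

```python
# Remaining TTL for a cached record, clamped at zero once expired.
def ttl_remaining(cached_at: float, original_ttl: float, now: float) -> float:
    return max(0.0, original_ttl - (now - cached_at))
```

With the timeline above: `ttl_remaining(0, 3600, 1000)` gives 2600, and any time past T=3600 gives 0.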

Browser DNS Cache

Browsers maintain their own DNS cache:

Chrome: chrome://net-internals/#dns
Firefox: about:networking#dns
Safari: No direct viewer

Typical browser cache TTL: Capped at 1-60 seconds
(Shorter than OS cache to detect changes faster)

Clearing browser cache:
  - Chrome: Settings → Privacy → Clear browsing data
  - Firefox: Settings → Privacy → Clear Data
  - Or restart browser

Operating System Cache

Linux (systemd-resolved)

# View cache statistics
$ resolvectl statistics

# View cached entries (limited)
$ resolvectl query example.com

# Flush cache
$ sudo resolvectl flush-caches

# Alternative (older systems)
$ sudo systemctl restart systemd-resolved

macOS

# Flush DNS cache
$ sudo dscacheutil -flushcache
$ sudo killall -HUP mDNSResponder

# View cached entries (limited visibility)
$ sudo killall -INFO mDNSResponder
# Check Console.app for output

Windows

# View cache
> ipconfig /displaydns

# Flush cache
> ipconfig /flushdns

# Check DNS client service
> Get-Service dnscache

Recursive Resolver Cache

Public resolvers like 8.8.8.8 cache extensively:

Benefits:
  - Single query serves millions of users
  - Popular domains almost always cached
  - Reduced load on authoritative servers

Cache characteristics:
  - Respects TTL from authoritative
  - May apply minimum TTL (typically 60s)
  - May cap maximum TTL (typically 24-48h)
  - Huge cache (millions of entries)

Cache Warming

Large resolvers “warm” their caches:

Popular domain (google.com):
  - Millions of queries per second
  - Always in cache
  - TTL never truly expires (refreshed constantly)

Obscure domain (your-small-site.com):
  - Few queries
  - May fall out of cache between queries
  - Each visitor might trigger fresh lookup

Negative Caching

Failed lookups are also cached:

Query: nonexistent.example.com
Response: NXDOMAIN (doesn't exist)

Cached as negative response:
  - Saves repeated queries for invalid domains
  - TTL from SOA minimum field
  - Typically cached for minutes to hours

RFC 2308 defines negative caching behavior.

Negative Cache Problems

Scenario:
  1. Query new domain before DNS propagates
  2. Get NXDOMAIN (not yet available)
  3. NXDOMAIN cached for 1 hour
  4. Domain IS available 5 minutes later
  5. Still getting NXDOMAIN from cache!

Solution:
  - Wait for negative cache to expire
  - Flush local DNS cache
  - Use different resolver temporarily

TTL Strategies

Long TTL (3600-86400 seconds)

Pros:
  + Fewer queries to authoritative servers
  + Faster lookups (usually cached)
  + Less DNS infrastructure needed

Cons:
  - Slow propagation of changes
  - Failover takes time
  - Users may hit stale data

Best for:
  - Stable infrastructure
  - Rarely-changing records
  - Cost/performance optimization

Short TTL (60-300 seconds)

Pros:
  + Quick propagation of changes
  + Fast failover
  + More control over traffic

Cons:
  - More queries (higher load)
  - Slightly higher latency on cache miss
  - More authoritative server capacity needed

Best for:
  - Dynamic infrastructure
  - Traffic management
  - Disaster recovery scenarios

TTL Strategy by Record Type

┌──────────────────────────────────────────────────────────────┐
│ Record Type     │ Recommended TTL    │ Rationale             │
├─────────────────┼────────────────────┼───────────────────────┤
│ NS              │ 86400 (24 hours)   │ Rarely change         │
│ MX              │ 3600-14400         │ Email can retry       │
│ A/AAAA (stable) │ 3600-86400         │ Usually cached anyway │
│ A/AAAA (dynamic)│ 60-300             │ Need quick updates    │
│ CNAME           │ 3600               │ Depends on target     │
│ TXT (SPF/DKIM)  │ 3600               │ Reasonable balance    │
└──────────────────────────────────────────────────────────────┘

TTL and DNS Migrations

When changing DNS records, manage TTL proactively:

Timeline for IP change:

T-24h: Reduce TTL
  example.com.    300    IN    A    93.184.216.34
  (Old IP, short TTL now)

T-0: Make the change
  example.com.    300    IN    A    198.51.100.50
  (New IP)

T+1h: Verify traffic shifted

T+24h: Restore normal TTL
  example.com.    3600    IN    A    198.51.100.50
  (New IP, normal TTL)

The "reduce before, restore after" pattern minimizes
stale cache impact during changes.

Cache Debugging

Check What’s Cached

# Query specific resolver (bypass local cache)
$ dig @8.8.8.8 example.com

# Check TTL remaining
$ dig example.com | grep -A1 "ANSWER SECTION"
example.com.    2847    IN    A    93.184.216.34
                 │
                 └── 2847 seconds remaining in cache

# Compare different resolvers
$ dig @8.8.8.8 example.com +short
$ dig @1.1.1.1 example.com +short
$ dig @9.9.9.9 example.com +short

# Different results = propagation in progress

Force Fresh Lookup

# Query authoritative directly
$ dig @ns1.example.com example.com

# Trace (bypasses cache, queries authoritatively)
$ dig +trace example.com

# No recursion (only ask one server)
$ dig +norecurse @a.root-servers.net example.com

Caching Issues

Inconsistent Results

Problem:
  dig @8.8.8.8 example.com → 1.2.3.4
  dig @1.1.1.1 example.com → 5.6.7.8

Causes:
  - Recent change, propagation in progress
  - Different servers have different cache ages
  - Anycast resolvers hit different instances

Solution:
  Wait for TTL to expire everywhere
  Typically resolves within max(TTL) time

Cached Failure

Problem:
  DNS change made, but users still see old/error

Causes:
  - Negative caching (NXDOMAIN cached)
  - Old positive record still valid
  - Client-side cache not flushed

Debug:
  1. Check TTL of cached record
  2. Check negative TTL (SOA minimum)
  3. Flush caches at multiple levels
  4. Wait for TTL expiration

Cache Poisoning (Security)

Attack:
  Attacker injects fake record into resolver cache
  Users sent to malicious server

Mitigations:
  - DNSSEC (cryptographic validation)
  - Source port randomization
  - Query ID randomization
  - Response validation (0x20 encoding)
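
Randomization works by inflating the attacker's guessing space. A back-of-envelope sketch:

```python
# An off-path spoofer must match both the 16-bit query ID and the
# randomized source port before the real answer arrives.
import random

query_id = random.randrange(2**16)            # 0..65535
src_port = random.randrange(1024, 65536)      # ephemeral port range
guess_space = 2**16 * (65536 - 1024)          # combinations to guess
# guess_space is over 4 billion, versus 65,536 with a fixed port
```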

Summary

DNS caching is hierarchical and TTL-controlled:

┌────────────────────┬─────────────────────┬──────────────────┐
│ Cache Location     │ Typical TTL Cap     │ Flush Method     │
├────────────────────┼─────────────────────┼──────────────────┤
│ Browser            │ 60s                 │ Restart or clear │
│ OS                 │ varies              │ System-specific  │
│ Local resolver     │ varies              │ Restart service  │
│ Recursive resolver │ Respects record TTL │ Wait             │
└────────────────────┴─────────────────────┴──────────────────┘

TTL guidelines:

  • Stable records: 3600-86400 seconds
  • Dynamic records: 60-300 seconds
  • Before changes: Reduce TTL in advance
  • After changes: Wait for old TTL to expire

Caching makes DNS fast but requires understanding for:

  • Planning DNS changes
  • Debugging resolution issues
  • Balancing freshness vs. performance

Next, we’ll explore DNSSEC—how DNS responses can be cryptographically validated.

DNSSEC

DNSSEC (Domain Name System Security Extensions) adds cryptographic authentication to DNS. It allows resolvers to verify that DNS responses haven’t been tampered with—protecting against attacks like cache poisoning.

The Problem DNSSEC Solves

Traditional DNS has no authentication:

Without DNSSEC:

Client: "What's the IP for bank.com?"

Legitimate response:     OR    Attacker's response:
  bank.com → 1.2.3.4            bank.com → 6.6.6.6 (malicious)

How does client know which is real?
It can't! DNS responses are unsigned.

Attacks possible:
  - Cache poisoning (inject fake records)
  - Man-in-the-middle (intercept and modify)
  - Redirection to phishing sites

How DNSSEC Works

DNSSEC adds digital signatures to DNS records:

With DNSSEC:

Zone operator:
  1. Generates signing keys
  2. Signs each record set
  3. Publishes signatures alongside records

Resolver:
  1. Receives record + signature
  2. Retrieves zone's public key
  3. Verifies signature
  4. If valid → Trust the record
  5. If invalid → Reject (SERVFAIL)

Attacker cannot forge valid signatures without private key.

DNSSEC Record Types

DNSKEY (Public Key)

Zone's public key for signature verification:

example.com.    3600    IN    DNSKEY    257 3 13 (
                                mdsswUyr3DPW132mOi8V9xESWE8jTo0d
                                xCjjnopKl+GqJxpVXckHAeF+KkxLbxIL
                                fDLUT0rAK9iUzy1L53eKGQ==
                              )

Fields:
  257 = Flags: Key Signing Key (KSK); 256 would mean Zone Signing Key (ZSK)
  3   = Protocol (always 3)
  13  = Algorithm (13 = ECDSA P-256)
  Base64 = The public key

RRSIG (Resource Record Signature)

Signature over a record set:

example.com.    3600    IN    A       93.184.216.34
example.com.    3600    IN    RRSIG   A 13 2 3600 (
                                20240215000000
                                20240201000000
                                12345 example.com.
                                oJB1W6WNGv+ldvQ3WDG0MQkg5IEhjRip
                                8WTrPYGv07h108dUKGMeDPKijVCHX3DD
                                Kdfb+v6oB9wfuh3DTJXUAfI= )

Fields:
  A         = Type being signed
  13        = Algorithm
  2         = Labels in name
  3600      = Original TTL
  Dates     = Signature validity period
  12345     = Key tag (identifies signing key)
  Base64    = The signature

DS (Delegation Signer)

Links child zone to parent (chain of trust):

In .com zone:
  example.com.    86400    IN    DS    12345 13 2 (
                                  49FD46E6C4B45C55D4AC69CBD3CD3440
                                  9B20CAC6B08F4E7FAE3F2BDDBF1BB349 )

Fields:
  12345   = Key tag of child's KSK
  13      = Algorithm
  2       = Digest type (2 = SHA-256)
  Hex     = Hash of child's DNSKEY

Parent vouches for child's key.
Enables trust chain from root.

NSEC/NSEC3 (Authenticated Denial)

Proves a name doesn't exist:

Query: nonexistent.example.com
Response: NXDOMAIN + NSEC record

NSEC proves there's no record between two names:
  aaa.example.com.    NSEC    zzz.example.com. A AAAA

"There's nothing between aaa and zzz"
Therefore nonexistent.example.com doesn't exist.

NSEC3: Hashed version (prevents zone enumeration)
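
The covering check can be modeled with plain string ordering. This is a simplification: real DNSSEC compares names in canonical DNS order, but the idea is the same:

```python
# An NSEC record proves non-existence if the queried name sorts
# strictly between its owner name and its "next name" field.
def nsec_covers(owner: str, next_name: str, query: str) -> bool:
    return owner < query < next_name
```

Here `nsec_covers("aaa.example.com", "zzz.example.com", "nonexistent.example.com")` is true, so the NXDOMAIN is provable.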

Chain of Trust

DNSSEC builds a chain from root to leaf:

                    ┌──────────────────┐
                    │   Root Zone (.)  │
                    │                  │
                    │  DNSKEY (root)   │ ← Hardcoded in resolvers
                    └────────┬─────────┘   (trust anchor)
                             │
                    Signed DS record for .com
                             │
                    ┌────────▼─────────┐
                    │   .com TLD       │
                    │                  │
                    │  DNSKEY (.com)   │
                    └────────┬─────────┘
                             │
                    Signed DS record for example.com
                             │
                    ┌────────▼─────────┐
                    │   example.com    │
                    │                  │
                    │  DNSKEY          │
                    │  A record + RRSIG│
                    └──────────────────┘

Each level signs the next level's key hash.
Trust flows from root anchor to leaf records.

Validation Process

Resolver validating example.com A record:

1. Get example.com A + RRSIG
2. Get example.com DNSKEY
3. Verify RRSIG with DNSKEY ✓

4. Get DS for example.com (from .com zone)
5. Verify DS matches DNSKEY hash ✓

6. Get .com DNSKEY
7. Verify DS RRSIG with .com DNSKEY ✓

8. Get DS for .com (from root zone)
9. Verify DS matches .com DNSKEY hash ✓

10. Get root DNSKEY
11. Verify against trust anchor ✓

All checks pass → Record is authenticated!
Any check fails → SERVFAIL (reject response)
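
The DS linkage in steps 4-5 and 8-9 is essentially a hash comparison. A toy model (the key bytes are made up; a real DS also covers the owner name and key fields):

```python
# Toy DS check: the parent publishes a SHA-256 digest of the child's
# key; the resolver verifies the key it fetched hashes to that digest.
import hashlib

def make_ds(dnskey: bytes) -> str:
    return hashlib.sha256(dnskey).hexdigest()   # digest type 2 = SHA-256

child_key = b"example.com KSK public key bytes"  # made-up key material
ds_in_parent = make_ds(child_key)

assert make_ds(child_key) == ds_in_parent        # genuine key: matches
assert make_ds(b"attacker key") != ds_in_parent  # forged key: rejected
```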

Querying DNSSEC Records

# Request DNSSEC records
$ dig example.com +dnssec

# Check if domain is signed
$ dig example.com DNSKEY
$ dig example.com DS

# Verify signature chain step by step
# (+sigchase was removed from modern dig; delv replaces it)
$ delv +vtrace example.com

# Use delv (DNSSEC-aware dig)
$ delv example.com
; fully validated
example.com.    86400    IN    A    93.184.216.34

# Check validation status
$ dig +cd example.com    # CD = Checking Disabled (skip validation)

DNSSEC Status Check

# Online validators:
# https://dnssec-analyzer.verisignlabs.com/
# https://dnsviz.net/

# Command line check:
$ delv @8.8.8.8 example.com
; fully validated         ← DNSSEC working
; unsigned answer         ← Not signed
; validation failed       ← Signature invalid

# Check with drill
$ drill -S example.com

Key Management

Key Types

Zone Signing Key (ZSK):
  - Signs zone records
  - Rotated frequently (monthly to quarterly)
  - Smaller key (faster signing)

Key Signing Key (KSK):
  - Signs the ZSK
  - Rotated less often (yearly)
  - Referenced by parent's DS record
  - Larger key (more security)

Why two keys?
  ZSK rotation doesn't require parent update
  KSK rotation requires new DS in parent zone

Key Rollover

ZSK Rollover (simpler):
  1. Generate new ZSK
  2. Publish both old and new DNSKEY
  3. Sign with new ZSK
  4. After TTL, remove old ZSK

KSK Rollover (complex):
  1. Generate new KSK
  2. Publish both DNSKEYs
  3. Submit new DS to parent
  4. Wait for parent propagation
  5. Sign ZSKs with new KSK
  6. After parent DS TTL, remove old KSK

Automated by most DNS providers.

Deployment Considerations

Enabling DNSSEC

Domain owner must:
  1. Sign zone with DNSSEC keys
  2. Upload DS record to registrar
  3. Registrar submits DS to TLD
  4. Chain of trust established

Many registrars/DNS providers automate this:
  - Cloudflare: One-click DNSSEC
  - Route53: Supports DNSSEC
  - Google Domains: Easy setup

Response Size

DNSSEC adds significant size:

Without DNSSEC:
  example.com A → ~50 bytes

With DNSSEC:
  example.com A + RRSIG + DNSKEY → ~1000+ bytes

Implications:
  - May exceed 512-byte UDP limit
  - Requires EDNS (larger UDP) or TCP
  - More bandwidth usage

Validation Failures

If DNSSEC validation fails:

Validating resolver returns: SERVFAIL
User sees: DNS error / site unreachable

Causes:
  - Expired signatures (operator forgot renewal)
  - Incorrect DS record (misconfiguration)
  - Clock skew (signature timestamps)
  - Key rollover problems

This is a feature, not a bug!
Invalid signatures could mean attack.
But operational errors can cause outages.

Limitations

DNSSEC protects authenticity, not privacy:

DNSSEC provides:
  ✓ Authentication (record from legitimate source)
  ✓ Integrity (record not modified)
  ✓ Authenticated denial (NXDOMAIN is real)

DNSSEC does NOT provide:
  ✗ Confidentiality (queries/responses visible)
  ✗ Protection from DNS operator
  ✗ Protection of last-mile (resolver to client)

For privacy: DNS over HTTPS (DoH) or DNS over TLS (DoT)

DNSSEC Adoption

Adoption varies by TLD:

Signed TLDs: .com, .org, .net (all major TLDs)

Domain signing rates:
  .nl (Netherlands):  ~50%
  .se (Sweden):       ~40%
  .com:               ~3%

Validation by resolvers:
  8.8.8.8 (Google):     Validates
  1.1.1.1 (Cloudflare): Validates
  ISP resolvers:        Varies

Growing but not universal.

Alternatives and Complements

DNS over HTTPS (DoH)

HTTPS encryption for DNS queries:
  - Hides queries from network observers
  - Bypasses some filtering
  - Runs on port 443 (like web traffic)

Complements DNSSEC:
  DoH   = Privacy (encrypted transport)
  DNSSEC = Authenticity (signed records)

Can use both together.

DNS over TLS (DoT)

TLS encryption for DNS:
  - Dedicated port 853
  - Easier to identify/block than DoH
  - Same privacy benefits as DoH

Adoption growing in mobile and resolvers.

Summary

DNSSEC adds cryptographic security to DNS:

┌────────────┬─────────────────────────────────────┐
│ Component  │ Purpose                             │
├────────────┼─────────────────────────────────────┤
│ DNSKEY     │ Zone's public keys                  │
│ RRSIG      │ Signatures on record sets           │
│ DS         │ Links child to parent (trust chain) │
│ NSEC/NSEC3 │ Proves non-existence                │
└────────────┴─────────────────────────────────────┘

Key points:

  • Chain of trust from root to leaf
  • Signatures prevent tampering
  • Validation failures block responses
  • Doesn’t provide privacy (use DoH/DoT)

Considerations:

  • Operational complexity (key management)
  • Larger responses (more bandwidth)
  • Validation failures can cause outages
  • Growing but not universal adoption

This completes our DNS coverage. Next, we’ll explore the evolution of HTTP—from 1.0 to HTTP/3.

HTTP Evolution

HTTP (Hypertext Transfer Protocol) is the foundation of the web. What started as a simple protocol for retrieving hypertext documents has evolved into a sophisticated system powering everything from websites to APIs to real-time applications.

The Journey

┌─────────────────────────────────────────────────────────────────────────┐
│                        HTTP Timeline                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  1991    HTTP/0.9    One-line protocol, GET only                        │
│    │                                                                    │
│  1996    HTTP/1.0    Headers, methods, status codes                     │
│    │                 Problem: One request per connection                │
│    │                                                                    │
│  1997    HTTP/1.1    Persistent connections, pipelining                 │
│    │                 Problem: Head-of-line blocking                     │
│    │                                                                    │
│  2015    HTTP/2      Binary, multiplexing, server push                  │
│    │                 Problem: TCP head-of-line blocking                 │
│    │                                                                    │
│  2022    HTTP/3      QUIC transport, UDP-based                          │
│                      Eliminates transport-level blocking                │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Why HTTP Keeps Evolving

Each HTTP version addressed limitations of its predecessor:

HTTP/1.0 → HTTP/1.1
  Problem: Opening new TCP connection per request is slow
  Solution: Keep connections open (persistent connections)

HTTP/1.1 → HTTP/2
  Problem: Requests must wait in line, even on persistent connections
  Solution: Multiplex requests over single connection

HTTP/2 → HTTP/3
  Problem: TCP packet loss blocks ALL streams
  Solution: Use QUIC (UDP-based), independent stream delivery

Request-Response Model

Despite version differences, HTTP maintains its fundamental model:

┌────────────────────────────────────────────────────────────────┐
│                        HTTP Transaction                        │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Client                                Server                  │
│     │                                     │                    │
│     │─────────── Request ────────────────>│                    │
│     │                                     │                    │
│     │  GET /index.html HTTP/1.1           │                    │
│     │  Host: example.com                  │                    │
│     │  Accept: text/html                  │                    │
│     │                                     │                    │
│     │                                     │                    │
│     │<──────────── Response ──────────────│                    │
│     │                                     │                    │
│     │  HTTP/1.1 200 OK                    │                    │
│     │  Content-Type: text/html            │                    │
│     │  Content-Length: 1234               │                    │
│     │                                     │                    │
│     │  <!DOCTYPE html>...                 │                    │
│     │                                     │                    │
└────────────────────────────────────────────────────────────────┘

Request  = Method + Path + Headers + (optional) Body
Response = Status + Headers + (optional) Body

Key HTTP Concepts

Methods

GET      Retrieve a resource
POST     Submit data, create resource
PUT      Replace a resource
PATCH    Partially modify a resource
DELETE   Remove a resource
HEAD     GET without body (metadata only)
OPTIONS  Describe communication options

Status Codes

1xx  Informational    100 Continue, 101 Switching Protocols
2xx  Success          200 OK, 201 Created, 204 No Content
3xx  Redirection      301 Moved Permanently, 302 Found, 304 Not Modified
4xx  Client Error     400 Bad Request, 401 Unauthorized, 404 Not Found
5xx  Server Error     500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable

Headers

Request headers:
  Host: example.com           (required in HTTP/1.1+)
  Accept: application/json    (preferred response type)
  Authorization: Bearer xyz   (credentials)
  Cookie: session=abc123      (state)

Response headers:
  Content-Type: text/html     (body format)
  Content-Length: 1234        (body size)
  Cache-Control: max-age=3600 (caching rules)
  Set-Cookie: session=abc123  (set state)

What You’ll Learn

In this chapter:

  1. HTTP/1.0 and HTTP/1.1: The text-based foundation
  2. HTTP/2: Binary framing and multiplexing
  3. HTTP/3 and QUIC: The modern, UDP-based protocol

Understanding HTTP evolution helps you:

  • Choose appropriate protocol versions
  • Optimize web performance
  • Debug connection issues
  • Design efficient APIs

HTTP/1.0 and HTTP/1.1

HTTP/1.x established the patterns still used today: request-response over TCP, text-based headers, and the familiar verbs like GET and POST. Understanding these versions explains why later versions were needed.

HTTP/1.0 (1996)

Basic Request-Response

Client connects to server:
  1. TCP handshake (SYN, SYN-ACK, ACK)
  2. Send HTTP request
  3. Receive HTTP response
  4. Close connection

Every request = New TCP connection!

Request Format

GET /index.html HTTP/1.0
User-Agent: Mozilla/5.0
Accept: text/html

That’s it—method, path, version, and optional headers. Blank line ends headers.

Response Format

HTTP/1.0 200 OK
Content-Type: text/html
Content-Length: 1234

<!DOCTYPE html>
<html>
...
</html>

Status line, headers, blank line, body.
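That structure parses mechanically. A minimal sketch, assuming a complete, non-chunked response held in memory:

```python
def parse_response(raw: bytes):
    """Split an HTTP/1.x response into (status_code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")      # blank line ends headers
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body

raw = (b"HTTP/1.0 200 OK\r\n"
       b"Content-Type: text/html\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello")
code, headers, body = parse_response(raw)
print(code, headers["content-type"], body)   # 200 text/html b'hello'
```

A real client would also honor Content-Length and Transfer-Encoding to find the body's end, but the status line / headers / blank line / body framing is exactly this simple.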

The Connection Problem

Loading a webpage with HTTP/1.0:

Page needs:
  - index.html (1 request)
  - style.css (1 request)
  - script.js (1 request)
  - logo.png (1 request)
  - header.png (1 request)

HTTP/1.0 timeline (sequential):
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  ├─TCP─┤├───HTML────┤                                                   │
│                      ├─TCP─┤├───CSS────┤                                │
│                                         ├─TCP─┤├───JS────┤              │
│                                                          ├─TCP─┤├PNG1─┤ │
│                                                                   ...   │
│                                                                         │
│  Total: 5 TCP handshakes + 5 requests = Very slow!                      │
└─────────────────────────────────────────────────────────────────────────┘

Each resource requires:
  - TCP handshake (~1 RTT)
  - Request + response (~1 RTT)
  - TCP teardown

For 10 resources over 100ms RTT: ~2 seconds just for overhead!
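That figure is simple arithmetic; a quick sanity check of the claim, assuming 1 RTT for the handshake and 1 RTT per request/response and ignoring transfer and teardown time:

```python
rtt = 0.100                     # 100 ms round-trip time
resources = 10
rtts_per_resource = 2           # 1 for TCP handshake + 1 for request/response

overhead = resources * rtts_per_resource * rtt
print(f"{overhead:.1f} s")      # 2.0 s of pure connection overhead
```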

HTTP/1.1 (1997)

HTTP/1.1 addressed the connection overhead with several improvements.

Persistent Connections

Connections stay open by default:

HTTP/1.0:
  Connection: close        (default, close after response)
  Connection: keep-alive   (optional, keep open)

HTTP/1.1:
  Connection: keep-alive   (default, keep open)
  Connection: close        (optional, close after response)

Connection Reuse

HTTP/1.1 timeline (persistent connection):
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  ├─TCP─┤├─HTML─┤├─CSS─┤├─JS─┤├─PNG1─┤├─PNG2─┤                           │
│                                                                         │
│  One TCP handshake, multiple requests!                                  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Savings: 4 fewer TCP handshakes = ~400ms on 100ms RTT link

Pipelining

Send multiple requests without waiting for responses:

Without pipelining:
  Request 1 → Response 1 → Request 2 → Response 2

With pipelining:
  Request 1 → Request 2 → Request 3 → Response 1 → Response 2 → Response 3

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  Client:  [Req1][Req2][Req3]                                            │
│  Server:                    [Resp1][Resp2][Resp3]                       │
│                                                                         │
│  Server processes in parallel (potentially)                             │
│  But responses MUST be in request order!                                │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Pipelining’s Fatal Flaw: Head-of-Line Blocking

Pipelining problem:

Requests sent:  [HTML][CSS][JS]
Server ready:   JS(10ms), CSS(20ms), HTML(500ms)

Must respond in order:
  ├────────HTML (500ms)────────┤├CSS┤├JS┤

JS is ready instantly but waits 500ms for HTML!
This is "head-of-line blocking."

Reality: Pipelining rarely used
  - Complex to implement correctly
  - Many proxies don't support it
  - HOL blocking negates benefits
  - Browsers disabled it by default

Multiple Connections Workaround

Browsers work around HTTP/1.1 limitations:

Browser opens 6 parallel connections per domain:

Connection 1: [HTML]─────────[Image5]──────────
Connection 2: [CSS]─────[Image1]──────[Image6]─
Connection 3: [JS1]─────[Image2]───────────────
Connection 4: [JS2]─────[Image3]───────────────
Connection 5: [Font]────[Image4]───────────────
Connection 6: [Icon]────[Image7]───────────────

Parallel downloads without pipelining!

But: 6 TCP connections = 6× overhead
     6× congestion control windows
     Not efficient

Domain Sharding (Historical)

Workaround for 6-connection limit:

Instead of:
  example.com/style.css
  example.com/script.js
  example.com/image1.png

Use:
  example.com/style.css
  static1.example.com/script.js
  static2.example.com/image1.png

Browser sees different domains:
  6 connections to example.com
  6 connections to static1.example.com
  6 connections to static2.example.com
  = 18 parallel connections!

Downsides:
  - More TCP overhead
  - More TLS handshakes (if HTTPS)
  - DNS lookups for each domain
  - Cache fragmentation

Note: Harmful with HTTP/2! (multiplexing is better)

Host Header (Required)

HTTP/1.1 requires the Host header:

GET /page.html HTTP/1.1
Host: www.example.com

Enables virtual hosting—multiple sites on one IP:

Server at 192.168.1.100 hosts:
  - www.example.com
  - www.another-site.com
  - api.example.com

Host header tells server which site is requested.
Without it: Server doesn't know which site you want!
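Server-side, virtual hosting is just a lookup on that header. A minimal sketch with a hypothetical site table (paths are illustrative):

```python
# Hypothetical mapping of Host header values to document roots
sites = {
    "www.example.com": "/var/www/example",
    "api.example.com": "/var/www/api",
}

def route(host_header: str) -> str:
    """Pick a site by Host header; Host may carry a :port suffix."""
    host = host_header.split(":")[0].lower()
    return sites.get(host, "default")

print(route("www.example.com"))       # /var/www/example
print(route("API.example.com:443"))   # /var/www/api
```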

Chunked Transfer Encoding

Send response without knowing size upfront:

HTTP/1.1 200 OK
Transfer-Encoding: chunked

18
This is the first chunk.
19
This is the second chunk.
0

Format: Size (hex) + CRLF + Data + CRLF; a zero-size chunk ends the body

Use cases:

  • Streaming responses
  • Server-generated content
  • Live data feeds
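A minimal encoder/decoder for this framing (a sketch assuming no trailers and complete input in memory):

```python
def encode_chunked(chunks):
    """Frame each chunk as hex-size CRLF data CRLF, then the 0 terminator."""
    out = b""
    for chunk in chunks:
        out += f"{len(chunk):x}".encode("ascii") + b"\r\n" + chunk + b"\r\n"
    return out + b"0\r\n\r\n"

def decode_chunked(data: bytes) -> bytes:
    """Reassemble a chunked body; stops at the zero-size chunk."""
    body = b""
    while True:
        size_line, _, data = data.partition(b"\r\n")
        size = int(size_line, 16)           # chunk sizes are hexadecimal
        if size == 0:
            return body
        body += data[:size]
        data = data[size + 2:]              # skip the chunk's trailing CRLF

msg = encode_chunked([b"This is the first chunk.", b"This is the second chunk."])
print(decode_chunked(msg))
```

Note how the sender never needs to know the total body size up front, which is what makes streaming possible.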

Additional HTTP/1.1 Features

100 Continue

Client: POST /upload HTTP/1.1
        Content-Length: 10000000
        Expect: 100-continue

Server: HTTP/1.1 100 Continue

Client: (sends 10MB body)

Server: HTTP/1.1 200 OK

Avoids sending large body if server will reject it.

Range Requests

GET /large-file.zip HTTP/1.1
Range: bytes=1000-1999

HTTP/1.1 206 Partial Content
Content-Range: bytes 1000-1999/50000

Resume interrupted downloads, video seeking.

Cache Control

Cache-Control: max-age=3600, must-revalidate
ETag: "abc123"
Last-Modified: Wed, 21 Oct 2015 07:28:00 GMT

Sophisticated caching for performance.
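On the server side, ETag revalidation reduces to a comparison. A sketch with a hypothetical helper, not tied to any framework:

```python
def revalidate(request_headers: dict, current_etag: str):
    """Answer a conditional GET: (status_code, body_needed)."""
    client_etag = request_headers.get("if-none-match")
    if client_etag == current_etag:
        return 304, False          # Not Modified: client's cached copy is fresh
    return 200, True               # resource changed: send full response

print(revalidate({"if-none-match": '"abc123"'}, '"abc123"'))  # (304, False)
print(revalidate({}, '"abc123"'))                              # (200, True)
```

A 304 carries no body, so a cache hit costs one small round trip instead of a full transfer.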

HTTP/1.1 Example Session

$ telnet example.com 80
Trying 93.184.216.34...
Connected to example.com.

GET / HTTP/1.1
Host: example.com
Connection: keep-alive

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 1256
Connection: keep-alive
Cache-Control: max-age=604800

<!doctype html>
<html>
<head>
    <title>Example Domain</title>
...

GET /favicon.ico HTTP/1.1
Host: example.com
Connection: close

HTTP/1.1 404 Not Found
Content-Length: 0
Connection: close

Connection closed by foreign host.

HTTP/1.x Limitations Summary

┌─────────────────────────────────────────────────────────────────────────┐
│               HTTP/1.x Limitations                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  1. Head-of-Line Blocking                                               │
│     Responses must be in request order                                  │
│     One slow response blocks all others                                 │
│                                                                         │
│  2. Textual Protocol Overhead                                           │
│     Headers are uncompressed text                                       │
│     Same headers sent repeatedly                                        │
│                                                                         │
│  3. No Request Prioritization                                           │
│     Can't indicate which resources are critical                         │
│     Server processes arbitrarily                                        │
│                                                                         │
│  4. Client-Initiated Only                                               │
│     Server can't push resources proactively                             │
│     Client must request everything explicitly                           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

When HTTP/1.1 Is Still Used

HTTP/1.1 remains common for:
  - Simple APIs (few requests per connection)
  - Internal services (low latency networks)
  - Legacy system compatibility
  - Debugging (human-readable)
  - When HTTP/2 isn't supported

Modern web: HTTP/2 or HTTP/3 preferred
  Better performance with no application changes

Summary

Feature                  HTTP/1.0   HTTP/1.1
───────                  ────────   ────────
Persistent connections   Optional   Default
Host header              Optional   Required
Chunked transfer         No         Yes
Pipelining               No         Yes (rarely used)
Cache-Control            Limited    Full support
Range requests           No         Yes
100 Continue             No         Yes
HTTP/1.1 significantly improved on 1.0 but still suffers from head-of-line blocking. HTTP/2 was designed to solve this—which we’ll explore next.

HTTP/2: Multiplexing Revolution

HTTP/2 (2015) reimagined how HTTP works at the wire level while maintaining full compatibility with HTTP/1.1 semantics. The result: dramatically faster page loads with no application changes required.

The Core Innovation: Multiplexing

HTTP/2’s killer feature is multiplexing—sending multiple requests and responses over a single TCP connection simultaneously:

HTTP/1.1 (head-of-line blocking):
┌─────────────────────────────────────────────────────────────────────────┐
│  Connection 1: [──────Req 1──────][──────Req 2──────][──Req 3──]       │
│  Connection 2: [──────Req 4──────][──Req 5──]                          │
│  Connection 3: [──Req 6──][──────Req 7──────]                          │
│                                                                         │
│  Sequential on each connection. Multiple connections needed.            │
└─────────────────────────────────────────────────────────────────────────┘

HTTP/2 (multiplexed):
┌─────────────────────────────────────────────────────────────────────────┐
│  Single connection:                                                     │
│    [R1][R2][R3][R1][R4][R2][R5][R3][R1][R6][R7]...                      │
│                                                                         │
│  All requests interleaved on one connection!                            │
│  No head-of-line blocking at HTTP level.                                │
└─────────────────────────────────────────────────────────────────────────┘

Binary Framing Layer

HTTP/2 replaces text with binary frames:

HTTP/1.1 (text):
┌────────────────────────────────────────┐
│ GET /page HTTP/1.1\r\n                 │
│ Host: example.com\r\n                  │
│ Accept: text/html\r\n                  │
│ \r\n                                   │
└────────────────────────────────────────┘

HTTP/2 (binary frames):
┌────────────────────────────────────────┐
│ ┌─────────────┐ ┌─────────────┐        │
│ │HEADERS Frame│ │ DATA Frame  │        │
│ │ Stream ID: 1│ │ Stream ID: 1│        │
│ │ (compressed)│ │ (payload)   │        │
│ └─────────────┘ └─────────────┘        │
└────────────────────────────────────────┘

Binary format:
  + Efficient parsing (no text scanning)
  + Compact representation
  + Clear frame boundaries
  - Not human-readable (need tools)

Frame Structure

Every HTTP/2 message is a series of frames:

Frame Format:
┌────────────────────────────────────────────────────────────────────┐
│ Length (24 bits) │ Type (8) │ Flags (8) │ R │ Stream ID (31 bits) │
├────────────────────────────────────────────────────────────────────┤
│                        Frame Payload                               │
└────────────────────────────────────────────────────────────────────┘

Length:    Size of payload (max 16KB default, configurable)
Type:      DATA, HEADERS, PRIORITY, RST_STREAM, SETTINGS, etc.
Flags:     Type-specific flags (END_STREAM, END_HEADERS, etc.)
Stream ID: Which stream this frame belongs to
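This 9-byte header packs cleanly with `struct`. A sketch using frame type and flag values from RFC 7540 (HEADERS = 0x1, END_HEADERS flag = 0x4):

```python
import struct

def pack_frame_header(length: int, ftype: int, flags: int, stream_id: int) -> bytes:
    """24-bit length, 8-bit type, 8-bit flags, reserved bit + 31-bit stream ID."""
    return struct.pack("!BHBBI",
                       length >> 16, length & 0xFFFF,   # 24-bit length as B + H
                       ftype, flags,
                       stream_id & 0x7FFFFFFF)          # clear the reserved bit

def unpack_frame_header(data: bytes):
    hi, lo, ftype, flags, stream = struct.unpack("!BHBBI", data[:9])
    return (hi << 16) | lo, ftype, flags, stream & 0x7FFFFFFF

hdr = pack_frame_header(1234, 0x1, 0x4, 1)   # HEADERS frame, END_HEADERS, stream 1
print(len(hdr))                              # 9
print(unpack_frame_header(hdr))              # (1234, 1, 4, 1)
```

Fixed binary framing is exactly why HTTP/2 parsing avoids the text scanning that HTTP/1.x requires.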

Frame Types

┌───────────────┬────────────────────────────────────────────────┐
│  Type         │ Purpose                                        │
├───────────────┼────────────────────────────────────────────────┤
│  DATA         │ Request/response body data                     │
│  HEADERS      │ Request/response headers (compressed)          │
│  PRIORITY     │ Stream priority information                    │
│  RST_STREAM   │ Terminate a stream                             │
│  SETTINGS     │ Connection configuration                       │
│  PUSH_PROMISE │ Server push notification                       │
│  PING         │ Connection health check                        │
│  GOAWAY       │ Graceful connection shutdown                   │
│  WINDOW_UPDATE│ Flow control window adjustment                 │
│  CONTINUATION │ Continuation of HEADERS                        │
└───────────────┴────────────────────────────────────────────────┘

Streams

A stream is a bidirectional sequence of frames within a connection:

Single HTTP/2 connection with multiple streams:

┌─────────────────────────────────────────────────────────────────────┐
│                         TCP Connection                              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Stream 1: [HEADERS]──[DATA]──[DATA]──[DATA]                 │   │
│  │ Stream 3: [HEADERS]──[DATA]                                 │   │
│  │ Stream 5: [HEADERS]                                         │   │
│  │ Stream 7: [HEADERS]──[DATA]──[DATA]                         │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Stream IDs:
  - Odd numbers: Client-initiated
  - Even numbers: Server-initiated (push)
  - 0: Connection-level messages (SETTINGS, PING, GOAWAY)

Request/Response as Streams

HTTP/2 Request (Stream 1):
┌────────────────────────────────────────────────────────────────────┐
│  HEADERS Frame (Stream 1)                                          │
│    :method = GET                                                   │
│    :path = /index.html                                             │
│    :scheme = https                                                 │
│    :authority = example.com                                        │
│    accept = text/html                                              │
│    END_HEADERS, END_STREAM flags                                   │
└────────────────────────────────────────────────────────────────────┘

HTTP/2 Response (Stream 1):
┌────────────────────────────────────────────────────────────────────┐
│  HEADERS Frame (Stream 1)                                          │
│    :status = 200                                                   │
│    content-type = text/html                                        │
│    END_HEADERS flag                                                │
├────────────────────────────────────────────────────────────────────┤
│  DATA Frame (Stream 1)                                             │
│    [HTML content...]                                               │
│    END_STREAM flag                                                 │
└────────────────────────────────────────────────────────────────────┘

Header Compression (HPACK)

HTTP/2 compresses headers using HPACK:

HTTP/1.1 headers (sent every request):
  Host: example.com                    ~17 bytes
  User-Agent: Mozilla/5.0...           ~70 bytes
  Accept: text/html,application/...    ~100 bytes
  Accept-Language: en-US,en;q=0.5      ~25 bytes
  Accept-Encoding: gzip, deflate       ~25 bytes
  Cookie: session=abc123;...           ~50+ bytes
  ──────────────────────────────────────────────
  Total: ~300 bytes per request!

  10 requests = 3KB just in headers!

HPACK compression:
  1. Static table: 61 common headers (predefined)
  2. Dynamic table: Recently used headers (learned)
  3. Huffman coding: Compress literal values

  First request: ~300 bytes → ~150 bytes (Huffman)
  Second request: Same headers → ~30 bytes (indexed!)

  10 requests ≈ 300 bytes total (vs 3KB)

HPACK Example

First request headers sent:
  :method: GET           → Index 2 (static table)
  :path: /index.html     → Literal, indexed
  :authority: example.com → Literal, indexed, Huffman
  accept: text/html      → Literal, indexed

Dynamic table after request:
  [62] :path: /index.html
  [63] :authority: example.com
  [64] accept: text/html

Second request to same server:
  :method: GET           → Index 2 (static)
  :path: /style.css      → Literal (new path)
  :authority: example.com → Index 63 (dynamic!)
  accept: text/css       → Literal (different)

Much smaller because authority is now indexed!
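The savings come from indexing repeated pairs. A deliberately simplified model of the dynamic table (real HPACK adds a 61-entry static table, size-based eviction, and Huffman coding; the byte costs here are rough):

```python
class ToyHeaderTable:
    """Illustrative only: index repeated name/value pairs like HPACK's dynamic table."""
    def __init__(self):
        self.table = []                       # most recently added first

    def encode(self, headers) -> int:
        """Return approximate bytes on the wire for this header list."""
        cost = 0
        for pair in headers:
            if pair in self.table:
                cost += 1                     # roughly one index byte
            else:
                cost += len(pair[0]) + len(pair[1]) + 2   # literal + framing
                self.table.insert(0, pair)
        return cost

t = ToyHeaderTable()
req = [(":authority", "example.com"), ("accept", "text/html")]
print(t.encode(req))   # 40  (first request: full literals)
print(t.encode(req))   # 2   (repeat request: fully indexed)
```

Even this toy model shows why repeated requests to the same server shrink to a handful of bytes.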

Server Push

Servers can proactively send resources:

Without server push:
  Client: GET /index.html
  Server: (sends HTML)
  Client: (parses, sees style.css link)
  Client: GET /style.css       ← Extra round trip!
  Server: (sends CSS)

With server push:
  Client: GET /index.html
  Server: PUSH_PROMISE /style.css    ← "I'll send this too"
  Server: (sends HTML)
  Server: (sends CSS on separate stream)
  Client: (already has CSS when parsing HTML!)

Saves round trip for critical resources.

Server Push Caveats

Push sounds great but has issues:

1. May push already-cached resources
   Server doesn't know client cache state
   Waste bandwidth pushing what client has

2. Priority problems
   Pushed resources may compete with requested ones
   Can slow down critical content

3. Limited browser support
   Chrome deprecated push support (2022)
   Most CDNs recommend disabling

Alternative: 103 Early Hints
  Server sends hints before full response
  Client can preload without full push complexity

Stream Prioritization

Clients can indicate resource importance:

Priority information:
  - Weight: 1-256 (relative importance)
  - Dependency: Stream this depends on

Example:
  Stream 1 (HTML):      Weight=256 (highest)
  Stream 3 (CSS):       Weight=128, depends on Stream 1
  Stream 5 (JS):        Weight=128, depends on Stream 1
  Stream 7 (image):     Weight=64, depends on Stream 3

Priority tree:
          [Stream 1 - HTML]
              /       \
    [Stream 3-CSS]  [Stream 5-JS]
          |
    [Stream 7-image]

Server should:
  1. Complete Stream 1 first
  2. Then CSS and JS equally
  3. Images last

Reality: Server implementation varies
         Many servers ignore priorities

Flow Control

HTTP/2 has stream-level flow control:

Connection flow control:
  Each side advertises receive window
  Similar to TCP flow control

Stream flow control:
  Each stream has its own window
  Prevents one stream from consuming all bandwidth

WINDOW_UPDATE frame:
  Signals capacity for more data

┌────────────────────────────────────────────────────────────────────┐
│  Stream 1: Window=65535                                            │
│  Stream 3: Window=65535                                            │
│  Connection: Window=1048576                                        │
│                                                                    │
│  Server sends 32768 bytes on Stream 1:                             │
│    Stream 1: Window=32767                                          │
│    Connection: Window=1015808                                      │
│                                                                    │
│  Client sends WINDOW_UPDATE (Stream 1, 32768):                     │
│    Stream 1: Window=65535 (restored)                               │
└────────────────────────────────────────────────────────────────────┘
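The bookkeeping in the diagram can be sketched directly. The 65,535-byte stream window is the RFC 7540 default; the connection window here is an assumption for the example:

```python
class FlowControl:
    """Track per-stream and per-connection send windows (sender's view)."""
    def __init__(self, stream_window=65535, conn_window=1048576):
        self.conn = conn_window
        self.default = stream_window
        self.streams = {}

    def send(self, stream_id: int, n: int):
        win = self.streams.setdefault(stream_id, self.default)
        if n > win or n > self.conn:
            raise RuntimeError("would exceed flow-control window")
        self.streams[stream_id] = win - n     # both windows shrink on send
        self.conn -= n

    def window_update(self, stream_id: int, n: int):
        """Peer's WINDOW_UPDATE restores capacity on one stream."""
        self.streams[stream_id] = self.streams.get(stream_id, self.default) + n

fc = FlowControl()
fc.send(1, 32768)
print(fc.streams[1], fc.conn)    # 32767 1015808
fc.window_update(1, 32768)
print(fc.streams[1])             # 65535
```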

HTTP/2 Connection Setup

HTTP/2 requires TLS in practice (browsers require HTTPS):

1. TCP handshake
2. TLS handshake (ALPN negotiates HTTP/2)
3. HTTP/2 connection preface:
   Client sends: "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
   Both send: SETTINGS frame

4. Ready for requests!

ALPN (Application-Layer Protocol Negotiation):
  Client TLS hello includes: "I support h2, http/1.1"
  Server chooses: "Let's use h2"
  Connection established as HTTP/2

The Remaining Problem: TCP HOL Blocking

HTTP/2 solved HTTP-level head-of-line blocking but TCP has its own:

HTTP/2 over TCP problem:

Stream 1: [Frame 1][Frame 2][Frame 3]
Stream 3: [Frame A][Frame B]
Stream 5: [Frame X][Frame Y]

TCP sees: [1][A][2][X][B][3][Y]

If TCP packet containing [2] is lost:
  TCP retransmits [2]
  ALL subsequent data waits!
  Frames [X][B][3][Y] all blocked

Even though X,B,Y are independent streams!

This is TCP-level head-of-line blocking.
HTTP/3 solves this with QUIC.

HTTP/2 Performance

When HTTP/2 shines:
  - Many small resources (multiplexing wins)
  - High latency connections (fewer round trips)
  - Header-heavy requests (compression helps)
  - HTTPS (required anyway, TLS overhead amortized)

When HTTP/2 helps less:
  - Single large download (one stream anyway)
  - Very low latency networks (overhead matters less)
  - Lossy networks (TCP HOL blocking hurts)

Typical improvements:
  Page load time: 10-50% faster
  Time to first byte: Similar or slightly better
  Number of connections: 1 vs 6+ (simpler)

Debugging HTTP/2

# curl with HTTP/2
$ curl -I --http2 https://example.com
HTTP/2 200
content-type: text/html

# nghttp client
$ nghttp -nv https://example.com

# Chrome DevTools
  Network tab → Protocol column shows "h2"

# Wireshark
  Filter: http2
  Decode TLS with SSLKEYLOGFILE

Summary

HTTP/2’s key innovations:

Feature                 Benefit
───────                 ───────
Binary framing          Efficient parsing, clear boundaries
Multiplexing            Multiple requests on one connection
Header compression      ~85% reduction in header size
Stream prioritization   Better resource loading order
Server push             Proactive resource delivery
Flow control            Per-stream bandwidth management

Limitations:

  • TCP head-of-line blocking remains
  • Server push deprecated in browsers
  • Complexity increased

HTTP/2 is a significant improvement over HTTP/1.1, but TCP’s head-of-line blocking motivated HTTP/3’s move to QUIC—which we’ll cover next.

HTTP/3 and QUIC

HTTP/3 (2022) takes a radical approach: instead of building on TCP, it uses QUIC—a new transport protocol running over UDP. This eliminates TCP’s head-of-line blocking and enables features impossible with TCP.

Why Replace TCP?

HTTP/2’s remaining problem was TCP itself:

TCP Head-of-Line Blocking:

HTTP/2 multiplexes streams:
  Stream 1: [A][B][C]
  Stream 3: [X][Y][Z]

TCP sees single byte stream:
  [A][X][B][Y][C][Z]

TCP packet lost (containing [B]):
  - TCP waits for retransmit
  - ALL data after [B] blocked
  - [Y][C][Z] wait even though they're independent

This defeats HTTP/2's multiplexing benefits on lossy networks.

QUIC: UDP-Based Transport

QUIC (originally an acronym for Quick UDP Internet Connections; in the IETF standard it is simply a name) provides TCP-like reliability over UDP:

┌─────────────────────────────────────────────────────────────────────┐
│                      Protocol Comparison                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  HTTP/1.1, HTTP/2:              HTTP/3:                             │
│  ┌─────────────┐               ┌─────────────┐                      │
│  │    HTTP     │               │   HTTP/3    │                      │
│  ├─────────────┤               ├─────────────┤                      │
│  │    TLS      │               │    QUIC     │  ← Includes TLS!    │
│  ├─────────────┤               ├─────────────┤                      │
│  │    TCP      │               │    UDP      │                      │
│  ├─────────────┤               ├─────────────┤                      │
│  │     IP      │               │     IP      │                      │
│  └─────────────┘               └─────────────┘                      │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

QUIC provides:
  - Reliable delivery (like TCP)
  - Congestion control (like TCP)
  - Encryption (built-in TLS 1.3)
  - Stream multiplexing (independent streams!)
  - Connection migration
  - 0-RTT connection resumption

No Head-of-Line Blocking

QUIC streams are independent:

QUIC stream multiplexing:

Stream 1: [A]──[B]──[C]
Stream 3: [X]──[Y]──[Z]

UDP packets:
  Packet 1: [A][X]
  Packet 2: [B]        ← Lost!
  Packet 3: [C][Y][Z]

What happens:
  Packet 3 arrives, QUIC delivers:
    Stream 3: [Y][Z] delivered immediately!
    Stream 1: [C] buffered until [B] arrives
  Packet 2 retransmitted, then:
    Stream 1: [B][C] delivered

Stream 3 doesn't wait for Stream 1's retransmit!
Loss stalls only the streams carried in the lost packet.
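Per-stream reassembly can be modeled with a toy buffer: each stream delivers its own data in order, and a loss stalls only the stream whose data was in the lost packet. A sketch where small sequence numbers stand in for QUIC's byte offsets:

```python
class StreamBuffer:
    """Deliver one stream's frames in order, buffering any gaps."""
    def __init__(self):
        self.expected = 0
        self.pending = {}
        self.delivered = []

    def receive(self, seq: int, data: str):
        self.pending[seq] = data
        # Deliver everything contiguous from the next expected sequence number
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1

streams = {1: StreamBuffer(), 3: StreamBuffer()}
# Packet 1 arrives carrying stream 1's "A" and stream 3's "X":
streams[1].receive(0, "A"); streams[3].receive(0, "X")
# Packet 2, carrying stream 1's "B", is lost. Packet 3 arrives:
streams[1].receive(2, "C")
streams[3].receive(1, "Y"); streams[3].receive(2, "Z")
print(streams[3].delivered)   # ['X', 'Y', 'Z'] -- stream 3 never stalls
print(streams[1].delivered)   # ['A'] -- only stream 1 waits for the retransmit
streams[1].receive(1, "B")    # retransmit arrives
print(streams[1].delivered)   # ['A', 'B', 'C']
```

With TCP there is a single such buffer for the whole connection, which is exactly why one lost packet stalls every HTTP/2 stream.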

Faster Connection Establishment

TCP+TLS: 2-3 Round Trips

TCP + TLS 1.3 connection:

Client                                    Server
   │                                         │
   │────── TCP SYN ──────────────────────────>│
   │<───── TCP SYN-ACK ───────────────────────│
   │────── TCP ACK ──────────────────────────>│  ← 1 RTT (TCP)
   │                                         │
   │────── TLS ClientHello ──────────────────>│
   │<───── TLS ServerHello + Finished ────────│
   │────── TLS Finished + HTTP Request ──────>│  ← 1 RTT (TLS)
   │<───── HTTP Response ─────────────────────│
   │                                         │

Total: 2 RTT before first HTTP response
       (3 RTT with TLS 1.2)

QUIC: 1 Round Trip (or 0!)

QUIC initial connection:

Client                                    Server
   │                                         │
   │────── QUIC Initial + TLS Hello ─────────>│
   │<───── QUIC Initial + TLS + ACK ──────────│
   │────── QUIC + HTTP Request ──────────────>│  ← 1 RTT total!
   │<───── HTTP Response ─────────────────────│
   │                                         │

QUIC combines transport + crypto handshake!
TLS 1.3 is integrated into QUIC.

0-RTT Connection Resumption

Returning to a previously visited server:

Client                                    Server
   │                                         │
   │────── QUIC 0-RTT + HTTP Request ────────>│  ← No handshake!
   │<───── HTTP Response ─────────────────────│
   │                                         │

How it works:
  - Client cached server's "resumption token"
  - Client sends encrypted request immediately
  - Server validates token, responds immediately

Caveat: 0-RTT data is replayable
  - Attackers can replay the request
  - Safe only for idempotent requests (GET)
  - Server can implement replay protection

Connection Migration

QUIC connections can survive network changes:

Traditional TCP:
  Connection = (Source IP, Source Port, Dest IP, Dest Port)

  Phone switches WiFi → Cellular:
    IP address changes!
    TCP connection breaks
    Must establish new connection
    HTTP request fails/retries

QUIC:
  Connection = Connection ID (random identifier)

  Phone switches WiFi → Cellular:
    IP address changes
    Connection ID unchanged
    QUIC connection continues!
    HTTP request completes seamlessly

Connection Migration Flow

┌─────────────────────────────────────────────────────────────────────┐
│                       Connection Migration                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Client on WiFi:  192.168.1.100                                     │
│  Server:          93.184.216.34                                     │
│  Connection ID:   0xABCD1234                                        │
│                                                                     │
│  Client ────[QUIC 0xABCD1234]──────> Server                         │
│  Server <───[QUIC 0xABCD1234]─────── Client                         │
│                                                                     │
│  --- Client moves to cellular: 10.0.0.50 ---                        │
│                                                                     │
│  Client ────[QUIC 0xABCD1234]──────> Server                         │
│         ↑                             │                             │
│         │ New IP, same connection ID! │                             │
│         │                             ▼                             │
│  Server validates connection ID,                                    │
│  continues same connection!                                         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

QUIC Encryption

All QUIC packets are encrypted (Initial handshake packets use keys any observer can derive):

TCP + TLS:
  TCP header:    Visible (port, seq, etc.)
  TLS record:    Encrypted
  HTTP data:     Encrypted

  Middleboxes can see TCP headers, manipulate connections

QUIC:
  UDP header:    Visible (minimal: ports only)
  QUIC header:   Partially encrypted
  QUIC payload:  Fully encrypted

  Middleboxes see only UDP ports
  Cannot inspect or manipulate QUIC layer

Benefits of Always-On Encryption

1. Privacy: HTTP/3 headers/content always encrypted
2. Security: Harder to inject/modify traffic
3. Ossification prevention: Middleboxes can't break QUIC
4. Future-proofing: Protocol can evolve without breaking

HTTP/3 Frames

HTTP/3 uses frames similar to HTTP/2, but over QUIC streams:

HTTP/3 Frame Types:
  DATA          - Request/response body
  HEADERS       - Headers (QPACK compressed)
  CANCEL_PUSH   - Cancel server push
  SETTINGS      - Connection settings
  PUSH_PROMISE  - Server push notification
  GOAWAY        - Connection shutdown
  MAX_PUSH_ID   - Limit on push streams

Key differences from HTTP/2:
  - Each request/response runs on its own QUIC stream
  - No HTTP-layer stream multiplexing (QUIC provides it)
  - Flow control handled by the QUIC layer

QPACK Header Compression

HTTP/3 uses QPACK (QUIC-aware HPACK variant):

HPACK problem with QUIC:
  HPACK uses dynamic table updated per header
  Headers arrive out of order in QUIC
  Can't update table until all prior updates processed
  → Head-of-line blocking in header compression!

QPACK solution:
  - Separate unidirectional streams for table updates
  - Encoder/decoder can choose blocking behavior
  - Trades compression ratio for lower latency

Result: Slightly less compression than HPACK,
        but no header compression blocking

HTTP/3 Adoption

As of 2024:
  - ~25% of websites support HTTP/3
  - All major browsers support HTTP/3
  - Major CDNs (Cloudflare, Akamai, Fastly) support HTTP/3
  - Google, Facebook, and others use HTTP/3 heavily

Server support:
  - nginx: Experimental
  - Cloudflare: Full support
  - LiteSpeed: Full support
  - Caddy: Full support
  - IIS: Windows Server 2022+ (via HTTP.sys, must be enabled)

Client support:
  - Chrome: Yes
  - Firefox: Yes
  - Safari: Yes
  - Edge: Yes
  - curl: Yes (with HTTP/3 build)

Deploying HTTP/3

Server Configuration

# nginx (experimental)
server {
    listen 443 quic reuseport;
    listen 443 ssl;

    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    # Advertise HTTP/3 support
    add_header Alt-Svc 'h3=":443"; ma=86400';
}

Alt-Svc Header

Browsers discover HTTP/3 via Alt-Svc:

HTTP/2 200
alt-svc: h3=":443"; ma=86400

Meaning:
  h3=":443"  - HTTP/3 available on port 443
  ma=86400   - Cache this for 24 hours
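The parameter syntax above can be illustrated with a small parser. This is a simplified sketch that assumes a single alternative service per header; real Alt-Svc values may list several, comma-separated (RFC 7838):

```python
def parse_alt_svc(header: str) -> dict:
    """Parse a simple Alt-Svc value like 'h3=":443"; ma=86400'.

    Simplified sketch: assumes one alternative service and
    semicolon-separated parameters.
    """
    parts = [p.strip() for p in header.split(";")]
    protocol, authority = parts[0].split("=", 1)
    result = {"protocol": protocol, "authority": authority.strip('"')}
    for param in parts[1:]:
        if "=" in param:
            key, value = param.split("=", 1)
            result[key] = int(value) if value.isdigit() else value
    return result

print(parse_alt_svc('h3=":443"; ma=86400'))
# {'protocol': 'h3', 'authority': ':443', 'ma': 86400}
```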

Browser flow:
  1. Connect via HTTP/2 (known to work)
  2. See Alt-Svc header
  3. Try HTTP/3 for subsequent requests
  4. Fall back to HTTP/2 if QUIC blocked

When HTTP/3 Helps Most

Significant improvement:
  - High latency connections (satellite, intercontinental)
  - Lossy networks (mobile, WiFi congestion)
  - Many small resources (API calls)
  - Users switching networks (mobile)

Moderate improvement:
  - Low latency, reliable networks
  - Large single downloads

May not help:
  - Local/datacenter communication
  - UDP blocked (corporate firewalls)

Debugging HTTP/3

# curl with HTTP/3
$ curl --http3 https://example.com -v
* using HTTP/3
* h3 [:method: GET]
* h3 [:path: /]
...

# Check if site supports HTTP/3
$ curl -sI https://example.com | grep -i alt-svc

# Chrome DevTools
  Network tab → Protocol shows "h3"

# qlog for QUIC debugging
  Standardized logging format for QUIC
  Visualize with qvis (https://qvis.quictools.info/)

Summary

HTTP/3 over QUIC provides:

Feature                 Benefit
UDP-based               Avoids TCP ossification
Independent streams     No transport HOL blocking
0-RTT resumption        Instant subsequent connections
Connection migration    Survives network changes
Built-in encryption     Always secure, anti-ossification
QPACK compression       Efficient headers without blocking

Trade-offs:

  • UDP may be blocked by firewalls
  • More CPU for encryption/decryption
  • Newer, less mature implementations
  • Debugging tools still evolving

HTTP/3 represents the cutting edge of web protocols. For deep dives into QUIC itself, see the next chapter.

QUIC Protocol

QUIC is a general-purpose transport protocol that originated at Google and is now standardized by the IETF. While HTTP/3 is its most visible use, QUIC can transport any application protocol that currently uses TCP.

QUIC at a Glance

┌─────────────────────────────────────────────────────────────────────┐
│                        QUIC Features                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Transport Layer                                                    │
│    ✓ Reliable, ordered delivery per stream                         │
│    ✓ Congestion control                                             │
│    ✓ Flow control (connection and stream level)                     │
│                                                                     │
│  Encryption                                                         │
│    ✓ TLS 1.3 integrated (mandatory)                                 │
│    ✓ Encrypted headers and payload                                  │
│    ✓ Protected from middlebox interference                          │
│                                                                     │
│  Multiplexing                                                       │
│    ✓ Independent streams (no HOL blocking)                          │
│    ✓ Bidirectional and unidirectional streams                       │
│    ✓ Stream priorities                                              │
│                                                                     │
│  Connection                                                         │
│    ✓ Connection IDs (survives IP changes)                           │
│    ✓ 0-RTT resumption                                               │
│    ✓ 1-RTT initial connection                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Why Build on UDP?

Why not improve TCP?

1. Kernel Dependency
   TCP is implemented in OS kernels
   Changes require kernel updates
   Deployment takes years

2. Middlebox Ossification
   Firewalls, NATs inspect TCP headers
   "Unknown" TCP options get dropped
   TCP extensions rarely deploy successfully

3. Head-of-Line Blocking
   TCP's byte stream model is fundamental
   Cannot fix without breaking compatibility

QUIC on UDP:
   - Implemented in userspace (fast iteration)
   - UDP passes through middleboxes unchanged
   - Full control over protocol behavior
   - Can add features without kernel changes

What You’ll Learn

This chapter covers:

  1. Why QUIC Exists: The problems it solves
  2. Connection Establishment: 0-RTT and 1-RTT handshakes
  3. Multiplexing: How streams eliminate HOL blocking
  4. Connection Migration: Surviving network changes

QUIC is the future of transport protocols. Understanding it prepares you for where networking is heading.

Why QUIC Exists

QUIC wasn’t created to replace TCP for its own sake. It addresses specific, persistent problems that couldn’t be solved within TCP’s constraints.

Problem 1: TCP Head-of-Line Blocking

TCP guarantees ordered delivery of a byte stream:

Application sends:
  write(1000 bytes)
  write(500 bytes)
  write(700 bytes)

TCP segments sent:
  Segment 1: bytes 0-999
  Segment 2: bytes 1000-1499
  Segment 3: bytes 1500-2199

If Segment 2 is lost:
  Segment 3 arrives, but TCP buffers it
  Application sees nothing until Segment 2 retransmitted

With HTTP/2:
  Stream A data in Segment 1
  Stream B data in Segment 2 (lost)
  Stream C data in Segment 3

  Stream C waits for Stream B retransmit!
  Even though they're independent streams.

QUIC Solution

QUIC streams are independent:

  Stream A ─────────────────────────> Delivered immediately
  Stream B ─────X────[retransmit]──> Delivered when ready
  Stream C ─────────────────────────> Delivered immediately

Each stream has its own sequence space.
Loss on one stream doesn't block others.
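A toy reassembly model makes the independence concrete. This is a hypothetical sketch, not a real QUIC implementation: each stream keeps its own offset space, so a gap on one stream never delays delivery on another:

```python
class StreamBuffer:
    """Per-stream reassembly: delivers bytes in order for THIS stream only."""
    def __init__(self):
        self.segments = {}   # offset -> data
        self.delivered = 0   # next in-order offset

    def receive(self, offset, data):
        self.segments[offset] = data
        out = b""
        # Deliver any contiguous run starting at the current offset
        while self.delivered in self.segments:
            chunk = self.segments.pop(self.delivered)
            out += chunk
            self.delivered += len(chunk)
        return out

streams = {0: StreamBuffer(), 4: StreamBuffer(), 8: StreamBuffer()}
print(streams[0].receive(0, b"AAAA"))  # b'AAAA': delivered immediately
print(streams[4].receive(4, b"late"))  # b'': gap at offset 0, stream 4 waits
print(streams[8].receive(0, b"CCCC"))  # b'CCCC': unaffected by stream 4's gap
print(streams[4].receive(0, b"BBBB"))  # b'BBBBlate': retransmit fills the gap
```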

Problem 2: Connection Establishment Latency

TCP + TLS requires multiple round trips:

TCP + TLS 1.2: 3 RTT before first byte
┌──────────────────────────────────────────────────────────────────┐
│  TCP SYN       ──────────────────────────────────>               │
│  TCP SYN-ACK   <──────────────────────────────────               │
│  TCP ACK       ──────────────────────────────────>   1 RTT       │
│                                                                  │
│  TLS Hello     ──────────────────────────────────>               │
│  TLS Hello     <──────────────────────────────────               │
│  TLS Finished  ──────────────────────────────────>   2 RTT       │
│  TLS Finished  <──────────────────────────────────               │
│                                                                  │
│  HTTP Request  ──────────────────────────────────>   3 RTT       │
│  HTTP Response <──────────────────────────────────               │
└──────────────────────────────────────────────────────────────────┘

TCP + TLS 1.3: 2 RTT (TLS 1.3 is 1-RTT)

On 100ms RTT: 200-300ms before data flows

QUIC Solution

QUIC: 1 RTT (or 0 RTT for repeat visits)
┌──────────────────────────────────────────────────────────────────┐
│  QUIC Initial + TLS Hello ─────────────────────────>             │
│  QUIC Initial + TLS      <───────────────────────────            │
│  QUIC + Request          ─────────────────────────>   1 RTT      │
│  QUIC + Response         <───────────────────────────            │
└──────────────────────────────────────────────────────────────────┘

0-RTT resumption:
┌──────────────────────────────────────────────────────────────────┐
│  QUIC + TLS ticket + Request ───────────────────────> 0 RTT!     │
│  QUIC + Response            <─────────────────────────           │
└──────────────────────────────────────────────────────────────────┘

QUIC combines transport and crypto handshake.
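The latency arithmetic behind these diagrams, as a small calculator (handshake RTT counts taken from the figures above; illustrative only, not a measurement):

```python
# RTTs spent on handshakes before application data can flow.
HANDSHAKE_RTTS = {
    "TCP + TLS 1.2": 3,   # TCP handshake + 2-RTT TLS handshake
    "TCP + TLS 1.3": 2,   # TCP handshake + 1-RTT TLS handshake
    "QUIC (new)":    1,   # combined transport + crypto handshake
    "QUIC (0-RTT)":  0,   # resumption: request rides the first flight
}

def delay_before_data_ms(stack: str, rtt_ms: float) -> float:
    """Delay before application data can start flowing."""
    return HANDSHAKE_RTTS[stack] * rtt_ms

for stack in HANDSHAKE_RTTS:
    print(f"{stack:14} {delay_before_data_ms(stack, 100):4.0f} ms")
```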

Problem 3: Network Ossification

Middleboxes (firewalls, NATs, load balancers) inspect and sometimes modify traffic:

TCP extension deployment problem:

New TCP option added:
  1. RFC published
  2. OS kernels implement it
  3. Middlebox sees "unknown" option
  4. Middlebox strips it or drops packet!
  5. Feature doesn't work

Real examples:
  - TCP Fast Open: ~50% of paths don't work
  - ECN: Historically broken by many middleboxes
  - Multipath TCP: Often stripped

Result: TCP is effectively frozen.
        Can't add new features reliably.

QUIC Solution

QUIC encrypts everything:

UDP Header:  [Source Port] [Dest Port] [Length] [Checksum]
             ↑ Visible to middleboxes

QUIC Header: [Connection ID] [Packet Number] ...
             ↑ Encrypted (except initial packets)

QUIC Payload: [Encrypted frames]
              ↑ Encrypted

Middleboxes can see:
  - UDP ports
  - That it's QUIC (maybe)

Middleboxes cannot:
  - Inspect QUIC headers
  - Modify QUIC content
  - Apply TCP-specific rules

Result: QUIC can evolve without middlebox interference.

Problem 4: Connection Bound to IP Address

TCP connections are identified by:

(Source IP, Source Port, Destination IP, Destination Port)

Your phone on WiFi:
  192.168.1.100:52000 → 93.184.216.34:443

Phone moves to cellular:
  10.0.0.50:??? → 93.184.216.34:443

TCP: "That's a different connection!"
     Connection reset. Start over.

Mobile users experience this constantly:
  - WiFi to cellular handoff
  - Moving between cell towers
  - VPN connects/disconnects

QUIC Solution

QUIC connections identified by Connection ID:

Connection ID: 0x1A2B3C4D (random, opaque)

WiFi:    192.168.1.100 + CID 0x1A2B3C4D
Cellular: 10.0.0.50    + CID 0x1A2B3C4D

Server sees same Connection ID → Same connection!

Connection survives:
  - Network changes
  - IP address changes
  - NAT rebinding

Seamless for user. No reconnection needed.
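Hypothetical server-side lookup tables show why the Connection ID survives an address change while a 4-tuple does not:

```python
# Two ways a server could index connection state (illustrative only).
tcp_connections = {}   # keyed by (src_ip, src_port, dst_ip, dst_port)
quic_connections = {}  # keyed by Connection ID

tcp_connections[("192.168.1.100", 52000, "93.184.216.34", 443)] = "session-1"
quic_connections[0x1A2B3C4D] = "session-1"

# Phone moves from WiFi to cellular: source address and port change.
new_tuple = ("10.0.0.50", 48000, "93.184.216.34", 443)
print(tcp_connections.get(new_tuple))    # None: TCP sees a stranger
print(quic_connections.get(0x1A2B3C4D))  # 'session-1': QUIC continues
```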

Summary: QUIC’s Value Proposition

┌──────────────────────────────────────────────────────────────────┐
│           Problem             │        QUIC Solution             │
├───────────────────────────────┼──────────────────────────────────┤
│ TCP head-of-line blocking     │ Independent streams              │
│ High connection latency       │ 1-RTT, 0-RTT resumption          │
│ Protocol ossification         │ Encrypted, userspace impl.       │
│ Connections break on move     │ Connection ID migration          │
│ Unencrypted metadata          │ All headers encrypted            │
└───────────────────────────────┴──────────────────────────────────┘

These aren’t theoretical problems—they affect billions of users daily. QUIC provides solutions that TCP cannot, which is why it’s becoming the foundation for modern protocols.

Connection Establishment and 0-RTT

QUIC’s handshake integrates transport and cryptographic establishment, dramatically reducing connection latency.

1-RTT Handshake

A new QUIC connection to a server:

Client                                            Server
   │                                                 │
   │─── Initial[TLS ClientHello, CRYPTO] ───────────>│
   │                                                 │
   │    (1 RTT passes)                               │
   │                                                 │
   │<── Initial[TLS ServerHello, CRYPTO] ────────────│
   │<── Handshake[TLS EncryptedExtensions] ──────────│
   │<── Handshake[TLS Certificate] ──────────────────│
   │<── Handshake[TLS CertVerify, Finished] ─────────│
   │                                                 │
   │─── Handshake[TLS Finished] ────────────────────>│
   │                                                 │
   │    === Connection Established ===               │
   │                                                 │
   │─── Application Data ───────────────────────────>│
   │<── Application Data ────────────────────────────│

Packet Types During Handshake

Initial Packets:
  - First packets sent
  - Protected with Initial Keys (derived from DCID)
  - Contains CRYPTO frames with TLS messages
  - Minimum 1200 bytes (amplification protection)

Handshake Packets:
  - Sent after Initial exchange
  - Protected with Handshake Keys
  - Complete the TLS 1.3 handshake

1-RTT Packets:
  - After handshake completes
  - Protected with Application Keys
  - Used for all application data

0-RTT Resumption

If client has previously connected, it can send data immediately:

First connection:
  - Client receives "session ticket" from server
  - Contains resumption secret and server config
  - Cached for future use

Subsequent connection:
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  Client                                            Server           │
│     │                                                 │             │
│     │─── Initial[TLS ClientHello] ───────────────────>│             │
│     │─── 0-RTT[Application Data] ────────────────────>│ ← Data      │
│     │                                                 │   sent      │
│     │<── Initial[TLS ServerHello] ────────────────────│   before    │
│     │<── Handshake[TLS Finished] ─────────────────────│   handshake │
│     │<── 1-RTT[Application Data Response] ────────────│   completes!│
│     │                                                 │             │
└─────────────────────────────────────────────────────────────────────┘

Client sends request BEFORE receiving server's response!

0-RTT Security Considerations

0-RTT data can be replayed:

Attacker captures:
  [Initial + 0-RTT packets]

Attacker replays:
  [Initial + 0-RTT packets] → Server processes request again!

Safe for 0-RTT:
  ✓ GET requests (idempotent)
  ✓ Read-only operations
  ✓ Operations with other replay protection

NOT safe for 0-RTT:
  ✗ POST/PUT (non-idempotent)
  ✗ Financial transactions
  ✗ Anything with side effects

Server controls:
  - Can reject 0-RTT entirely
  - Can accept but limit to safe operations
  - Can implement replay detection (within limits)
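The "accept but limit to safe operations" policy can be sketched as a hypothetical gate (not from any real QUIC library). It keys on HTTP's safe methods, which is slightly stricter than "idempotent": PUT and DELETE are idempotent but still have side effects:

```python
# HTTP methods defined as safe (no side effects), hence tolerable
# in replayable 0-RTT early data.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

def accept_early_data(method: str, allow_0rtt: bool = True) -> bool:
    """Decide whether a request arriving as 0-RTT early data is
    processed now, or deferred until the handshake completes
    (which makes it replay-proof)."""
    return allow_0rtt and method.upper() in SAFE_METHODS

print(accept_early_data("GET"))   # True: replay of a read is harmless
print(accept_early_data("POST"))  # False: defer until 1-RTT keys confirm
```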

Connection IDs

QUIC connections are identified by Connection IDs, not IP/port tuples:

Connection ID structure:
  - Variable length (0-20 bytes)
  - Chosen by each endpoint
  - Destination CID: What I put in packets TO you
  - Source CID: What you put in packets TO ME

Initial exchange:
  Client → Server: Dest CID = random, Source CID = client's CID
  Server → Client: Dest CID = client's CID, Source CID = server's CID

After handshake:
  Both sides agree on CIDs to use
  Server typically provides multiple CIDs for migration

Connection ID Benefits

1. NAT Rebinding Tolerance
   NAT timeout changes source port
   CID unchanged → Connection continues

2. Load Balancer Routing
   CID can encode server selection
   Any frontend can route to correct backend

3. Privacy (with rotation)
   CID can be changed periodically
   Harder to track connections across time

Amplification Attack Protection

QUIC prevents DDoS amplification:

Attack scenario without protection:
  Attacker: Sends 50-byte Initial with spoofed source IP
  Server: Responds with 10,000 bytes to victim
  Amplification factor: 200x

QUIC protection:
  Before address validation:
    Server can send ≤ 3× what client sent

  Client Initial minimum: 1200 bytes
  Server can send: ≤ 3600 bytes

  For more, server requires address validation:
    - Send Retry packet (stateless)
    - Or use address validation token
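The 3x rule reduces to a one-line budget check, sketched here (the limit and the 1200-byte Initial minimum come from RFC 9000):

```python
def max_reply_bytes(received_bytes: int, address_validated: bool) -> float:
    """Bytes a server may send before the client's address is validated:
    at most 3x what it has received (RFC 9000 anti-amplification limit)."""
    if address_validated:
        return float("inf")  # no limit once the path is proven
    return 3 * received_bytes

print(max_reply_bytes(1200, address_validated=False))  # 3600.0 at most
print(max_reply_bytes(1200, address_validated=True))   # inf
```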

Retry Flow

Client                                            Server
   │                                                 │
   │─── Initial (1200 bytes) ──────────────────────>│
   │                                                 │
   │<── Retry[token, new SCID] ─────────────────────│
   │                                                 │
   │─── Initial[token] ────────────────────────────>│
   │                                                 │
   │    (server validates token, proceeds normally)  │

Connection Termination

Graceful Close

Endpoint sends CONNECTION_CLOSE frame:
  - Error code: NO_ERROR (0x0) for clean close
  - Reason phrase (optional)

Both sides:
  - Stop sending new data
  - Send acknowledgments for received data
  - Enter closing period (3× PTO)
  - Then fully close

Stateless Reset

For when connection state is lost:

Server crashes and restarts:
  - Lost all connection state
  - Client sends packets server doesn't recognize

Server sends Stateless Reset:
  - Looks like a regular packet
  - Contains reset token (derived from CID)
  - Client recognizes token, closes connection

Prevents hanging connections after server restart.

Summary

QUIC connection establishment:

Scenario          Round Trips    Data Delay
TCP + TLS 1.2     3 RTT          3 RTT
TCP + TLS 1.3     2 RTT          2 RTT
QUIC new          1 RTT          1 RTT
QUIC 0-RTT        0 RTT          0 RTT

Key mechanisms:

  • Integrated handshake: Transport + crypto combined
  • Connection IDs: Enable migration, load balancing
  • 0-RTT: Instant resumption (with replay caveats)
  • Amplification protection: Prevents DDoS abuse

Multiplexing Without Head-of-Line Blocking

QUIC’s stream multiplexing is its most impactful feature for application performance. Unlike TCP, QUIC streams are truly independent.

Streams in QUIC

A QUIC connection supports multiple concurrent streams:

┌─────────────────────────────────────────────────────────────────────┐
│                        QUIC Connection                              │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  Stream 0 (bidirectional): [HTTP request/response]             │ │
│  │  Stream 4 (bidirectional): [Another request/response]          │ │
│  │  Stream 8 (bidirectional): [Third request/response]            │ │
│  │  Stream 2 (unidirectional): [Control messages →]               │ │
│  │  Stream 3 (unidirectional): [Server push →]                    │ │
│  └────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Stream types:
  Bidirectional:  Data flows both ways
  Unidirectional: Data flows one way only

Stream IDs:
  0, 4, 8, 12...  Client-initiated bidirectional
  1, 5, 9, 13...  Server-initiated bidirectional
  2, 6, 10, 14... Client-initiated unidirectional
  3, 7, 11, 15... Server-initiated unidirectional
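This numbering is carried entirely in the two low bits of the stream ID (per RFC 9000), which a short decoder makes explicit:

```python
def describe_stream(stream_id: int) -> str:
    """Decode a QUIC stream ID's two low bits (RFC 9000):
    bit 0 = initiator (0 client, 1 server), bit 1 = directionality."""
    initiator = "server" if stream_id & 0x1 else "client"
    direction = "unidirectional" if stream_id & 0x2 else "bidirectional"
    return f"{initiator}-initiated {direction}"

print(describe_stream(0))  # client-initiated bidirectional
print(describe_stream(3))  # server-initiated unidirectional
print(describe_stream(6))  # client-initiated unidirectional
```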

Independence Guarantee

Each stream maintains its own state:

Stream states:
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  Stream 0: offset 0-1000 received, expecting 1001                   │
│  Stream 4: offset 0-500 received, expecting 501                     │
│  Stream 8: offset 0-2000 received, complete                         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

If Stream 4 data at offset 501-1000 is lost:
  - Stream 0: Continues receiving, delivering data
  - Stream 4: Waits for retransmit of 501-1000
  - Stream 8: Already complete, unaffected

NO cross-stream blocking!

Comparison with HTTP/2 over TCP

HTTP/2 over TCP:

Request 1: [----data----]
Request 2: [--data--]
Request 3: [------data------]

TCP sees: [1][2][3][1][2][3][1][3][1][3]

TCP packet 3 lost (contains Request 2 data):
  All subsequent packets buffered by TCP
  Requests 1 and 3 blocked waiting for Request 2!

─────────────────────────────────────────────────────────────────────

HTTP/3 over QUIC:

Request 1: Stream 0
Request 2: Stream 4
Request 3: Stream 8

QUIC packet with Stream 4 data lost:
  Stream 4 data retransmitted
  Streams 0 and 8 continue independently!

Each HTTP request truly independent.

Stream Flow Control

QUIC has two levels of flow control:

1. Stream-level flow control:
   Each stream has its own receive window
   Prevents one stream from consuming all buffer

2. Connection-level flow control:
   Total bytes across all streams
   Prevents connection from overwhelming receiver

┌─────────────────────────────────────────────────────────────────────┐
│  Connection MAX_DATA: 1,000,000 bytes                               │
│  ├── Stream 0 MAX_STREAM_DATA: 100,000 bytes                        │
│  ├── Stream 4 MAX_STREAM_DATA: 100,000 bytes                        │
│  └── Stream 8 MAX_STREAM_DATA: 100,000 bytes                        │
│                                                                     │
│  Stream 0 can use up to 100KB                                       │
│  All streams combined can use up to 1MB                             │
└─────────────────────────────────────────────────────────────────────┘

Flow control frames:
  MAX_DATA: Update connection limit
  MAX_STREAM_DATA: Update stream limit
  DATA_BLOCKED: Signal sender is blocked
  STREAM_DATA_BLOCKED: Signal stream is blocked
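The two-level accounting can be sketched as follows. This is hypothetical bookkeeping, not a real QUIC implementation; a send is permitted only while it fits within both the stream and the connection budget:

```python
class FlowControl:
    """Two-level flow control sketch: a connection-wide budget plus a
    per-stream budget; sending requires room in BOTH."""
    def __init__(self, max_data, max_stream_data):
        self.max_data = max_data                # connection limit (MAX_DATA)
        self.max_stream_data = max_stream_data  # per-stream limit
        self.sent_total = 0
        self.sent_per_stream = {}

    def can_send(self, stream_id, nbytes):
        stream_sent = self.sent_per_stream.get(stream_id, 0)
        return (self.sent_total + nbytes <= self.max_data
                and stream_sent + nbytes <= self.max_stream_data)

    def send(self, stream_id, nbytes):
        if not self.can_send(stream_id, nbytes):
            # A real sender would emit DATA_BLOCKED / STREAM_DATA_BLOCKED here
            raise RuntimeError("flow-control blocked")
        self.sent_total += nbytes
        self.sent_per_stream[stream_id] = (
            self.sent_per_stream.get(stream_id, 0) + nbytes)

fc = FlowControl(max_data=1_000_000, max_stream_data=100_000)
fc.send(0, 90_000)
print(fc.can_send(0, 20_000))  # False: stream 0 would exceed its 100 KB window
print(fc.can_send(4, 20_000))  # True: stream 4 has its own window
```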

Stream Prioritization

Applications can indicate stream importance:

Priority information per stream:
  - Urgency: 0-7 (0 highest)
  - Incremental: true/false (can process partially)

Example (HTTP/3):
  HTML:   Stream 0, urgency=0, incremental=false
  CSS:    Stream 4, urgency=1, incremental=false
  JS:     Stream 8, urgency=1, incremental=false
  Images: Stream 12+, urgency=5, incremental=true

Sender should:
  1. Send all urgency=0 data first
  2. Round-robin among same urgency
  3. Incremental streams can be interleaved

Note: Priority is a hint, not enforced by QUIC itself.
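The sender policy above can be sketched as a toy scheduler (hypothetical; real senders may deviate, since priority is only a hint):

```python
def schedule(streams):
    """streams: {stream_id: (urgency, [chunks])}. Emission order:
    lower urgency values first, round-robin among equal-urgency streams."""
    order = []
    for urgency in sorted({u for u, _ in streams.values()}):
        group = {sid: list(chunks) for sid, (u, chunks) in streams.items()
                 if u == urgency}
        while any(group.values()):        # round-robin within this level
            for sid in sorted(group):
                if group[sid]:
                    order.append((sid, group[sid].pop(0)))
    return order

plan = schedule({
    0:  (0, ["html1", "html2"]),  # HTML: highest urgency
    4:  (1, ["css"]),             # CSS and JS share urgency 1
    8:  (1, ["js"]),
    12: (5, ["img1", "img2"]),    # images last
})
print(plan)
```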

Stream Lifecycle

                    ┌──────────────────┐
                    │       Idle       │
                    └────────┬─────────┘
                             │ Open (send/receive)
                    ┌────────▼─────────┐
                    │       Open       │
                    │                  │
                    │ Send/Receive data│
                    └──┬──────────┬────┘
                       │          │
         Send FIN ─────┘          └───── Receive FIN
                       │          │
              ┌────────▼──┐    ┌──▼─────────┐
              │Half-Closed│    │Half-Closed │
              │  (local)  │    │  (remote)  │
              └─────┬─────┘    └──────┬─────┘
                    │                 │
    Receive FIN ────┘                 └──── Send FIN
                    │                 │
                    └────────┬────────┘
                    ┌────────▼─────────┐
                    │      Closed      │
                    └──────────────────┘

Stream can also be reset (RESET_STREAM) at any point.

Practical Impact

Scenario: Page with 50 resources over lossy network (2% loss)

HTTP/2 over TCP:
  Any lost packet blocks ALL pending responses
  On 2% loss: Significant stalls and delays
  Measured: 3-4x slower on lossy mobile

HTTP/3 over QUIC:
  Lost packet only affects its stream
  Other 49 resources continue loading
  Measured: Near-optimal even with loss

Real-world impact is most visible on:
  - Mobile networks (variable quality)
  - Satellite connections (high latency + loss)
  - Congested WiFi

Stream Limits

Connections limit maximum streams:

MAX_STREAMS frames:
  - MAX_STREAMS (bidi): Max bidirectional streams
  - MAX_STREAMS (uni): Max unidirectional streams

Typical defaults:
  100 bidirectional streams
  100 unidirectional streams

If limit reached:
  Sender must wait for streams to close
  Or wait for MAX_STREAMS increase
  STREAMS_BLOCKED frame signals waiting

Summary

QUIC stream multiplexing provides:

Feature                    Benefit
Independent streams        No head-of-line blocking
Per-stream flow control    Fair resource allocation
Stream priorities          Important content first
Unidirectional streams     Efficient one-way data
Low overhead               Stream creation is cheap

This is QUIC’s key advantage over TCP for multiplexed protocols. Loss on one stream doesn’t impact others, making it ideal for modern web applications with many parallel requests.

Connection Migration

QUIC connections can survive network changes—a game-changer for mobile users who constantly switch between WiFi and cellular networks.

The Problem with TCP

TCP connections are bound to IP addresses:

TCP connection tuple:
  (192.168.1.100, 52000, 93.184.216.34, 443)
    └── Client IP ──┘

When IP changes (WiFi → cellular):
  New tuple: (10.0.0.50, 48000, 93.184.216.34, 443)

Server: "Who are you? I don't have a connection from 10.0.0.50"
Connection dies. Application must reconnect.

Impact:
  - HTTP request fails
  - Download interrupted
  - Streaming buffers
  - User experience degraded

QUIC’s Solution: Connection IDs

QUIC identifies connections by Connection ID, not IP:

QUIC connection:
  Connection ID: 0x1A2B3C4D5E6F

Packets from 192.168.1.100 with CID 0x1A2B3C4D5E6F
Packets from 10.0.0.50 with CID 0x1A2B3C4D5E6F

Server: "Same CID? Same connection! Continue."

Migration Flow

┌─────────────────────────────────────────────────────────────────────┐
│                      Connection Migration                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Time 0: Client on WiFi (192.168.1.100)                             │
│    Client ──[CID: ABC]──> Server                                    │
│    Server <──[CID: XYZ]── Client                                    │
│                                                                     │
│  Time 1: Client switches to cellular (10.0.0.50)                    │
│    Client ──[CID: ABC]──> Server (from new IP!)                     │
│                                                                     │
│  Time 2: Server validates new path                                  │
│    Server ──[PATH_CHALLENGE]──> Client                              │
│    Client ──[PATH_RESPONSE]──> Server                               │
│                                                                     │
│  Time 3: Migration complete                                         │
│    Connection continues seamlessly                                  │
│    In-flight data retransmitted if needed                           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Path Validation

Migration requires validating the new path:

Why validate?
  - Prove client owns new address (anti-spoofing)
  - Verify path works for bidirectional traffic
  - Update RTT estimates for new path

PATH_CHALLENGE:
  Server sends random 8-byte challenge to new address

PATH_RESPONSE:
  Client echoes the 8 bytes back

If response matches: Path validated, migration complete
If no response: Revert to previous path
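The challenge/response exchange reduces to echoing unpredictable data, sketched here (illustrative only; a real implementation sends these inside encrypted QUIC frames):

```python
import os

def make_path_challenge() -> bytes:
    """Server side: 8 unguessable bytes sent to the unvalidated address."""
    return os.urandom(8)

def path_response(challenge: bytes) -> bytes:
    """Client side: echo the challenge data back unchanged."""
    return challenge

challenge = make_path_challenge()
validated = path_response(challenge) == challenge
print(validated)  # True: the new path carries traffic both ways
```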

Connection ID Management

Multiple Connection IDs enable smooth migration:

Server provides multiple CIDs:

NEW_CONNECTION_ID frames:
  CID 1: 0xAAAAAA, Sequence 0
  CID 2: 0xBBBBBB, Sequence 1
  CID 3: 0xCCCCCC, Sequence 2

Client can use any of these CIDs.

On migration:
  Client switches to unused CID
  Server correlates new CID to connection
  Old CID retired for privacy

RETIRE_CONNECTION_ID:
  "I'm done using CID sequence 0"

Privacy Benefits

Without CID rotation:
  Observer: "CID 0xABC on WiFi... same CID on cellular"
  Observer: "This is the same user, tracked!"

With CID rotation:
  WiFi: Uses CID 0xAAA
  Cellular: Uses CID 0xBBB (unused before)

  Observer: "Different CIDs, can't correlate"
  (Connection continues, but linkability reduced)

Probing and Preferred Paths

QUIC can probe multiple paths:

Client has:
  - WiFi connection (reliable, maybe slow)
  - Cellular connection (less reliable, maybe faster)

Client can:
  1. Probe both paths with PATH_CHALLENGE
  2. Measure RTT and loss on each
  3. Choose preferred path
  4. Keep other path as backup

This enables:
  - Seamless handoff
  - Make-before-break migration
  - Multipath in future extensions

NAT Rebinding

Even without physical network change, NAT can disrupt:

NAT timeout scenario:
  Connection idle for 30 minutes
  NAT forgets the mapping
  NAT assigns new external port

TCP: Connection times out or RST

QUIC:
  CID unchanged
  Server validates new path
  Connection continues

Server-Side Considerations

Load Balancer Routing

With TCP:
  Load balancer routes by 4-tuple
  IP change → Different backend → Connection state lost

With QUIC:
  Load balancer can route by CID
  Server encodes routing info in CID

CID format (example):
  [Server ID: 4 bytes][Random: 8 bytes]

Any frontend extracts Server ID from CID
Routes to correct backend regardless of client IP
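
Following the example CID layout above, a frontend could extract the backend ID like this (a sketch only; real deployments define their own CID encodings, often encrypted):

```python
import os

def make_cid(server_id: int) -> bytes:
    # [Server ID: 4 bytes][Random: 8 bytes], as in the example layout
    return server_id.to_bytes(4, "big") + os.urandom(8)

def route(cid: bytes) -> int:
    # Any frontend recovers the backend from the first 4 bytes,
    # regardless of which client IP the packet arrived from.
    return int.from_bytes(cid[:4], "big")

cid = make_cid(7)
assert route(cid) == 7
```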

Connection State

Server must maintain:
  - Connection state by CID (not by IP)
  - Multiple CIDs per connection
  - Token→Connection mapping for 0-RTT

Storage indexed by CID, not IP address.

Mobile Experience Impact

Real-world scenarios improved:

1. Elevator/Subway
   TCP: Connection dies, app reconnects
   QUIC: Brief pause, then continues

2. Walking between access points
   TCP: Each AP change = potential reset
   QUIC: Seamless, user unaware

3. VPN connect/disconnect
   TCP: All connections reset
   QUIC: Continues through VPN changes

4. NAT timeout during idle
   TCP: Silent failure on next request
   QUIC: Automatic path revalidation

Summary

Connection migration enables:

Feature                     User Benefit
IP address change survival  Seamless WiFi/cellular handoff
CID-based identification    Load balancer flexibility
Path validation             Security against spoofing
CID rotation                Privacy from observers
NAT rebinding tolerance     Fewer “connection reset” errors

Migration is one of QUIC’s most user-visible improvements, particularly for mobile users who previously experienced constant interruptions during network transitions.

WebSockets

WebSockets provide full-duplex, bidirectional communication over a single TCP connection. Unlike HTTP’s request-response model, WebSockets allow both client and server to send messages at any time.

Why WebSockets?

HTTP is request-response: client asks, server answers. But many applications need real-time, bidirectional communication:

HTTP Limitations:
  - Client must initiate every exchange
  - Server can't push data spontaneously
  - New request needed for each interaction
  - Header overhead for every message

Workarounds before WebSockets:
  Polling:        Client asks "any updates?" every N seconds
  Long-polling:   Server holds request until data available
  Server-Sent:    Server streams events (one-way only)

All have overhead, latency, or direction limitations.

WebSocket Advantages

┌─────────────────────────────────────────────────────────────────────┐
│                    WebSocket Benefits                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Full-Duplex: Both sides send simultaneously                        │
│  Low Latency: No per-message handshake                              │
│  Low Overhead: 2-10 bytes per frame (vs ~100+ for HTTP)             │
│  Persistent: Single connection for entire session                   │
│  Push: Server sends without client request                          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Perfect for:
  - Chat applications
  - Live notifications
  - Real-time collaboration
  - Live sports/stock updates
  - Online gaming
  - IoT device communication

The Protocol

WebSocket starts as HTTP, then “upgrades” to a different protocol:

┌─────────────────────────────────────────────────────────────────────┐
│                    WebSocket Connection                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. HTTP Request with Upgrade header                                │
│  2. Server responds 101 Switching Protocols                         │
│  3. Connection becomes WebSocket                                    │
│  4. Bidirectional frames flow                                       │
│  5. Either side can close                                           │
│                                                                     │
│  ┌──────────┐                              ┌──────────┐             │
│  │  Client  │                              │  Server  │             │
│  └────┬─────┘                              └────┬─────┘             │
│       │                                         │                   │
│       │─── HTTP Upgrade Request ───────────────>│                   │
│       │<── HTTP 101 Switching ──────────────────│                   │
│       │                                         │                   │
│       │═══ WebSocket Connection ════════════════│                   │
│       │                                         │                   │
│       │─── Message ────────────────────────────>│                   │
│       │<── Message ─────────────────────────────│                   │
│       │<── Message ─────────────────────────────│                   │
│       │─── Message ────────────────────────────>│                   │
│       │                                         │                   │
│       │─── Close Frame ────────────────────────>│                   │
│       │<── Close Frame ─────────────────────────│                   │
│       │                                         │                   │
│       ╳                                         ╳                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

What You’ll Learn

  1. The Upgrade Handshake: How HTTP becomes WebSocket
  2. Full-Duplex Communication: Frame format and messaging
  3. WebSocket Use Cases: When and why to use WebSockets

The Upgrade Handshake

WebSocket connections begin as HTTP requests, then “upgrade” to the WebSocket protocol. This allows WebSockets to work through HTTP infrastructure (proxies, load balancers) while establishing a different communication pattern.

Client Request

GET /chat HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
Origin: http://example.com
Sec-WebSocket-Protocol: chat, superchat

Required Headers

Header                     Purpose
Upgrade: websocket         Request protocol upgrade
Connection: Upgrade        Indicates upgrade requested
Sec-WebSocket-Key          Random base64 value for handshake validation
Sec-WebSocket-Version: 13  WebSocket protocol version

Optional Headers

Header                    Purpose
Origin                    Where request originates (CORS)
Sec-WebSocket-Protocol    Subprotocol preferences (application-defined)
Sec-WebSocket-Extensions  Extension negotiation (e.g., compression)

Server Response

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Sec-WebSocket-Protocol: chat

Sec-WebSocket-Accept Calculation

Server proves it received the client’s key:

1. Take client's Sec-WebSocket-Key:
   "dGhlIHNhbXBsZSBub25jZQ=="

2. Append magic GUID:
   "dGhlIHNhbXBsZSBub25jZQ==" + "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

3. SHA-1 hash the result

4. Base64 encode the hash:
   "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="

The client verifies this matches the value it computed itself.
This proves the server actually speaks WebSocket, preventing
accidental upgrades by non-WebSocket servers and stale cached
responses from proxies.
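
The derivation is easy to reproduce. The key and the result below are the worked example from RFC 6455:

```python
import base64
import hashlib

GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(key: str) -> str:
    # SHA-1 over key + magic GUID, then base64-encode the digest
    digest = hashlib.sha1((key + GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))
# s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```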

After the Handshake

HTTP/1.1 101 Switching Protocols
...

─── HTTP ENDS HERE ───

┌─────────────────────────────────────────┐
│     WebSocket Frames (Binary)           │
│                                         │
│  [Frame Header][Payload]                │
│  [Frame Header][Payload]                │
│  ...                                    │
└─────────────────────────────────────────┘

Same TCP connection, different protocol.
No more HTTP until connection closes.

Handshake Implementation

JavaScript (Browser)

const ws = new WebSocket('wss://server.example.com/chat');

ws.onopen = () => {
  console.log('Connected!');
  ws.send('Hello Server!');
};

ws.onmessage = (event) => {
  console.log('Received:', event.data);
};

ws.onclose = (event) => {
  console.log('Closed:', event.code, event.reason);
};

ws.onerror = (error) => {
  console.error('Error:', error);
};

Python Server (websockets library)

import asyncio
import websockets

# Note: recent versions of the websockets library pass only the
# connection object; older versions also passed the request path.
async def handler(websocket):
    async for message in websocket:
        print(f"Received: {message}")
        await websocket.send(f"Echo: {message}")

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # Run forever

asyncio.run(main())

Node.js Server (ws library)

const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws) => {
  console.log('Client connected');

  ws.on('message', (message) => {
    console.log('Received:', message.toString());
    ws.send(`Echo: ${message}`);
  });

  ws.on('close', () => {
    console.log('Client disconnected');
  });
});

Subprotocols

Subprotocols define application-level meaning:

Client: Sec-WebSocket-Protocol: graphql, json, protobuf
Server: Sec-WebSocket-Protocol: graphql

Agreement: Use GraphQL over WebSocket.

Common subprotocols:
  - graphql-ws (GraphQL subscriptions)
  - mqtt (IoT messaging)
  - wamp (Web Application Messaging)
  - soap (legacy)

Extensions

Extensions modify the protocol (typically for compression):

Client: Sec-WebSocket-Extensions: permessage-deflate
Server: Sec-WebSocket-Extensions: permessage-deflate

permessage-deflate:
  - Compresses message payloads
  - Significant bandwidth savings for text
  - Supported by most implementations

Error Cases

Server doesn't support WebSocket:
  HTTP/1.1 400 Bad Request
  (or 404, or no Upgrade response)

Wrong Sec-WebSocket-Accept:
  Client: Abort connection
  Prevents man-in-the-middle returning wrong accept

Origin not allowed (CORS-like):
  HTTP/1.1 403 Forbidden
  Server rejects based on Origin header

Secure WebSockets (WSS)

ws://  - Unencrypted WebSocket (like HTTP)
wss:// - Encrypted WebSocket over TLS (like HTTPS)

Process for wss://:
  1. TCP connection
  2. TLS handshake (certificate validation)
  3. HTTP Upgrade request (encrypted)
  4. WebSocket frames (encrypted)

Always use wss:// in production.

Summary

The WebSocket handshake:

  1. Client sends HTTP GET with Upgrade: websocket
  2. Server validates and responds 101 Switching Protocols
  3. Server sends Sec-WebSocket-Accept derived from client’s key
  4. Connection becomes bidirectional WebSocket

After upgrade, it’s no longer HTTP—just WebSocket frames on TCP.

Full-Duplex Communication

After the handshake, WebSocket communication happens through frames—small packets that can carry text, binary data, or control messages.

Frame Format

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
│F│R│R│R│  Opcode │M│         Payload Length                   │
│I│S│S│S│  (4)    │A│             (7)                          │
│N│V│V│V│         │S│                                          │
│ │1│2│3│         │K│                                          │
├─┴─┴─┴─┴─────────┴─┴───────────────────────────────────────────┤
│     Extended payload length (16/64 bits, if needed)          │
├───────────────────────────────────────────────────────────────┤
│     Masking key (32 bits, if MASK=1)                         │
├───────────────────────────────────────────────────────────────┤
│     Payload Data                                             │
└───────────────────────────────────────────────────────────────┘

Minimum frame: 2 bytes (header only)
Typical small message: 6-8 bytes overhead

Frame Fields

Field           Bits  Description
FIN             1     Final fragment of message
RSV1-3          3     Reserved for extensions
Opcode          4     Frame type
MASK            1     Payload is masked (required from client)
Payload Length  7+    Size of payload

Opcodes

0x0  Continuation   Part of fragmented message
0x1  Text          UTF-8 text data
0x2  Binary        Binary data
0x8  Close         Connection close request
0x9  Ping          Heartbeat request
0xA  Pong          Heartbeat response
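
As a sketch, an unmasked server-to-client text frame for a small payload (under 126 bytes) can be built in a few lines; extended lengths and masking are omitted:

```python
TEXT = 0x1

def server_text_frame(payload: str) -> bytes:
    data = payload.encode("utf-8")
    assert len(data) < 126           # small-payload case only
    fin_and_opcode = 0x80 | TEXT     # FIN=1, opcode=0x1 (text)
    mask_and_len = len(data)         # MASK=0 for server frames
    return bytes([fin_and_opcode, mask_and_len]) + data

frame = server_text_frame("Hi")
assert frame == b"\x81\x02Hi"        # 2-byte header + payload
```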

Message Types

Text Messages

// Send
ws.send("Hello, World!");

// Frame created:
// FIN=1, Opcode=0x1 (text), MASK=1, Payload="Hello, World!"

// Receive
ws.onmessage = (event) => {
  console.log(event.data);  // "Hello, World!" (string)
};

Binary Messages

// Send ArrayBuffer
const buffer = new ArrayBuffer(8);
const view = new DataView(buffer);
view.setFloat64(0, 3.14159);
ws.send(buffer);

// Send Blob
const blob = new Blob(['Binary data'], {type: 'application/octet-stream'});
ws.send(blob);

// Receive
ws.binaryType = 'arraybuffer';  // or 'blob'
ws.onmessage = (event) => {
  const data = event.data;  // ArrayBuffer or Blob
};

Control Frames

Ping/Pong (Heartbeat)

Ping: "Are you still there?"
Pong: "Yes, I'm here."

Server sends:  Ping (opcode 0x9)
Client sends:  Pong (opcode 0xA) with same payload

Purpose:
  - Detect dead connections
  - Keep NAT mappings alive
  - Measure latency

Client browser handles Pong automatically.

Close Frame

Graceful shutdown:

1. Initiator sends Close frame
   - Opcode 0x8
   - Optional: status code (2 bytes) + reason (text)

2. Recipient sends Close frame back

3. TCP connection closed

Status codes:
  1000  Normal closure
  1001  Endpoint going away
  1002  Protocol error
  1003  Unsupported data type
  1006  Abnormal closure (no close frame)
  1011  Server error
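
When a close frame carries a payload, it is a 2-byte big-endian status code followed by a UTF-8 reason string. A minimal parser:

```python
import struct

def parse_close_payload(payload: bytes):
    if not payload:
        return None, ""               # close frame with no body
    (code,) = struct.unpack("!H", payload[:2])
    return code, payload[2:].decode("utf-8")

assert parse_close_payload(b"\x03\xe8Bye") == (1000, "Bye")
```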

Masking

Client-to-server frames must be masked:

Why masking?
  Cache poisoning attack prevention.
  Proxies might cache WebSocket data as HTTP.
  Masking makes data look random, prevents caching.

Masking process:
  1. Generate random 32-bit masking key
  2. XOR each byte of payload with key (rotating)

  masked[i] = payload[i] XOR key[i % 4]

Server-to-client: No masking required.

Fragmentation

Large messages can be split into fragments:

Large message (1MB):

Fragment 1: FIN=0, Opcode=0x1 (text), data[0:64KB]
Fragment 2: FIN=0, Opcode=0x0 (continuation), data[64KB:128KB]
Fragment 3: FIN=0, Opcode=0x0 (continuation), data[128KB:192KB]
...
Fragment N: FIN=1, Opcode=0x0 (continuation), data[last portion]

Receiver reassembles before delivering to application.
Allows interleaving with control frames (Ping/Pong).
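
The fragmentation rule above (first frame carries the real opcode, the rest are continuations, only the last sets FIN) can be sketched as:

```python
TEXT, CONTINUATION = 0x1, 0x0

def fragment(data: bytes, chunk: int):
    # Yields (fin, opcode, payload) for each fragment.
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    for n, piece in enumerate(pieces):
        fin = (n == len(pieces) - 1)
        opcode = TEXT if n == 0 else CONTINUATION
        yield fin, opcode, piece

frames = list(fragment(b"x" * 10, 4))
assert [(fin, op) for fin, op, _ in frames] == \
    [(False, 0x1), (False, 0x0), (True, 0x0)]
```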

Full-Duplex in Action

┌─────────────────────────────────────────────────────────────────────┐
│                    Simultaneous Communication                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Client                                            Server           │
│     │                                                 │             │
│     │──── "Hello" ────────────────────────────────────│             │
│     │────────────────────────────── "World" ──────────│             │
│     │                        X                        │             │
│     │──── "How are you?" ─────────────────────────────│             │
│     │────────────────────── "Message for you" ────────│             │
│     │                                                 │             │
│     │ Messages cross "in flight"                      │             │
│     │ No waiting for response                         │             │
│     │ Both sides send whenever ready                  │             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Implementation Patterns

Message Protocol

// Define message format
const message = {
  type: 'chat',
  payload: {
    user: 'alice',
    text: 'Hello everyone!'
  },
  timestamp: Date.now()
};

ws.send(JSON.stringify(message));

// Receiver
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  switch (msg.type) {
    case 'chat':
      displayChat(msg.payload);
      break;
    case 'notification':
      showNotification(msg.payload);
      break;
  }
};

Reconnection Logic

let reconnectAttempts = 0;

function connect() {
  const ws = new WebSocket('wss://example.com/socket');

  ws.onopen = () => {
    console.log('Connected');
    reconnectAttempts = 0;
  };

  ws.onclose = (event) => {
    if (event.code !== 1000) {  // Not normal close
      // Exponential backoff
      const delay = Math.min(1000 * 2 ** reconnectAttempts, 30000);
      reconnectAttempts++;
      setTimeout(connect, delay);
    }
  };

  return ws;
}

Summary

WebSocket communication features:

Aspect           Details
Frame overhead   2-14 bytes (vs 100+ for HTTP)
Message types    Text (UTF-8), Binary
Control frames   Ping, Pong, Close
Direction        Full-duplex (simultaneous both ways)
Fragmentation    Large messages split across frames
Masking          Required for client→server

WebSocket Use Cases

WebSockets excel when you need persistent, low-latency, bidirectional communication. Understanding when to use them—and when not to—helps you choose the right tool.

Ideal Use Cases

Real-Time Chat

Chat requirements:
  ✓ Instant message delivery
  ✓ Multiple participants
  ✓ Typing indicators
  ✓ Presence status

WebSocket flow:
  User types ──[typing indicator]──> Server ──broadcast──> All users
  User sends ──[message]──────────> Server ──broadcast──> All users
  User joins ──[presence]─────────> Server ──broadcast──> All users

Single connection handles all message types.
Sub-100ms latency achievable.

Live Notifications

Without WebSocket (polling):
  Client: "Any notifications?" (every 5 seconds)
  Server: "No"
  Client: "Any notifications?"
  Server: "No"
  ... 50 requests later ...
  Server: "Yes, you have a message!"

With WebSocket:
  Server: "New message!" (instant when it happens)

Benefits:
  - Immediate delivery
  - No wasted requests
  - Lower server load

Collaborative Editing

Google Docs / Figma style:

User A types ──> Server ──> User B (sees cursor, text)
User B draws ──> Server ──> User A (sees drawing)

Requirements:
  - Low latency (feels responsive)
  - High frequency (keystroke level)
  - Bidirectional (everyone sees everyone)
  - Reliable (no lost changes)

WebSocket + Operational Transform/CRDT

Online Gaming

Game server sending:
  - Player positions (60 fps)
  - Game events
  - Chat messages

Players sending:
  - Input commands
  - Actions
  - Chat

WebSocket provides:
  - Single connection (efficient)
  - Binary messages (compact)
  - Low latency (responsive)

Note: Competitive games may prefer UDP/QUIC for
      lower latency at cost of reliability.

Financial Data

Stock ticker:
  [AAPL: 150.25] ──> [AAPL: 150.30] ──> [AAPL: 150.28]

Dozens of updates per second.
HTTP request-response unsuitable.
Server-Sent Events work but one-way only.
WebSocket ideal for bidirectional (quotes + orders).

IoT Dashboard

Sensors ──> Server ──WebSocket──> Dashboard

Real-time display of:
  - Temperature
  - Humidity
  - Motion
  - System status

Dashboard can also send commands back:
  Dashboard ──> Server ──> Device (turn on/off)

When NOT to Use WebSockets

Simple CRUD APIs

Creating a user:
  POST /users
  {"name": "Alice"}

Response:
  201 Created
  {"id": 123, "name": "Alice"}

One request, one response, done.
WebSocket is overkill—use REST/HTTP.

Infrequent Updates

Weather data (updates hourly):
  - Polling once per hour is fine
  - SSE (Server-Sent Events) works well
  - WebSocket connection overhead not justified

Rule of thumb:
  Updates > 1/minute: Consider WebSocket
  Updates < 1/minute: HTTP polling or SSE

Public APIs

Public REST API considerations:
  - Stateless (easy to scale)
  - Cacheable
  - Standard tools (curl, Postman)
  - Rate limiting straightforward

WebSocket:
  - Stateful (harder to scale)
  - Not cacheable
  - Fewer debugging tools
  - Rate limiting complex

Alternatives Comparison

┌────────────────────────────────────────────────────────────────────┐
│ Technique          │ Direction    │ Latency │ Best For             │
├────────────────────┼──────────────┼─────────┼──────────────────────┤
│ HTTP Polling       │ Client→Svr   │ High    │ Simple, infrequent   │
│ Long Polling       │ Client→Svr   │ Medium  │ Moderate updates     │
│ Server-Sent Events │ Server→Clt   │ Low     │ One-way streaming    │
│ WebSocket          │ Bidirectional│ Low     │ Real-time, two-way   │
│ WebRTC             │ P2P          │ Lowest  │ Audio/video, gaming  │
└────────────────────────────────────────────────────────────────────┘

Server-Sent Events (SSE)

Good for:
  - Live feeds (news, sports)
  - Notifications (server→client only)
  - Simpler than WebSocket

Limitations:
  - One-way (server to client)
  - Text only (no binary)
  - Fewer connections per browser

// Server
res.setHeader('Content-Type', 'text/event-stream');
res.write('data: Hello\n\n');

// Client
const source = new EventSource('/stream');
source.onmessage = (e) => console.log(e.data);

Scaling WebSockets

Connection Limits

Challenge:
  10,000 concurrent users = 10,000 open connections
  Each connection uses memory and file descriptors

Solutions:
  - Horizontal scaling (multiple servers)
  - Connection limits per server
  - Load balancing by connection ID

State Distribution

User A connected to Server 1
User B connected to Server 2
User A sends message to User B

Server 1 must route to Server 2!

Solutions:
  - Redis pub/sub
  - Dedicated message queue
  - Sticky sessions (same user → same server)
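
The cross-server routing pattern can be illustrated with an in-memory stand-in for Redis pub/sub: each server subscribes on behalf of its locally connected users, and any server can publish (a sketch only; real deployments use Redis, NATS, or similar):

```python
from collections import defaultdict

class Hub:
    """In-memory stand-in for a shared pub/sub broker."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)

hub = Hub()
delivered = []

# Server 2 holds User B's connection, so it subscribes for them.
hub.subscribe("user:B", delivered.append)

# Server 1 receives User A's message and publishes it to the hub.
hub.publish("user:B", "Hi from A")

assert delivered == ["Hi from A"]
```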

Architecture Pattern

                    ┌─────────────────┐
                    │  Load Balancer  │
                    └────────┬────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
  ┌──────▼─────┐      ┌──────▼─────┐      ┌──────▼─────┐
  │ WS Server 1│      │ WS Server 2│      │ WS Server 3│
  └──────┬─────┘      └──────┬─────┘      └──────┬─────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
                    ┌────────▼────────┐
                    │  Redis Pub/Sub  │
                    │  (or similar)   │
                    └─────────────────┘

Messages published to Redis, all servers receive.

Summary

Use WebSockets for:

  • Real-time bidirectional communication
  • High-frequency updates
  • Push from server
  • Interactive applications

Consider alternatives for:

  • Simple request-response (HTTP)
  • One-way server→client (SSE)
  • Infrequent updates (polling)
  • Peer-to-peer (WebRTC)

WebSocket is powerful but adds complexity. Choose based on actual requirements.

TLS/SSL

TLS (Transport Layer Security) encrypts network communication, protecting data from eavesdropping and tampering. It’s what puts the “S” in HTTPS and secures most internet traffic today.

What TLS Provides

┌─────────────────────────────────────────────────────────────────────┐
│                       TLS Security Goals                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Confidentiality                                                    │
│    Data encrypted, only endpoints can read it                       │
│    Eavesdropper sees random bytes                                   │
│                                                                     │
│  Integrity                                                          │
│    Data tampering detected                                          │
│    HMAC ensures message authenticity                                │
│                                                                     │
│  Authentication                                                     │
│    Server proves identity via certificate                           │
│    Client verifies it's talking to real server                      │
│    (Optional: Client can prove identity too)                        │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

TLS in the Stack

┌─────────────────────────────────────────────────────────────────────┐
│                       Protocol Stack                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌───────────────────────────────────────────────────────┐          │
│  │              Application (HTTP, SMTP, etc.)           │          │
│  ├───────────────────────────────────────────────────────┤          │
│  │                        TLS                            │ ← Here   │
│  ├───────────────────────────────────────────────────────┤          │
│  │                        TCP                            │          │
│  ├───────────────────────────────────────────────────────┤          │
│  │                        IP                             │          │
│  └───────────────────────────────────────────────────────┘          │
│                                                                     │
│  TLS sits between application and transport.                        │
│  Application sees plain data.                                       │
│  Network sees encrypted data.                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Brief History

1995: SSL 2.0 (Netscape) - First public version, insecure
1996: SSL 3.0 - Major improvements, still vulnerabilities
1999: TLS 1.0 - Standardized by IETF, based on SSL 3.0
2006: TLS 1.1 - Minor security improvements
2008: TLS 1.2 - Modern cipher suites, still widely used
2018: TLS 1.3 - Simplified, faster, more secure

Today:
  TLS 1.3 preferred
  TLS 1.2 acceptable
  TLS 1.0/1.1 deprecated (should be disabled)
  SSL: Do not use

How TLS Works (Overview)

┌─────────────────────────────────────────────────────────────────────┐
│                        TLS Connection                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Handshake                                                       │
│     - Client and server negotiate parameters                        │
│     - Server presents certificate                                   │
│     - Key exchange establishes shared secret                        │
│     - Derive session keys                                           │
│                                                                     │
│  2. Encrypted Communication                                         │
│     - All data encrypted with session keys                          │
│     - MAC ensures integrity                                         │
│     - Sequence numbers prevent replay                               │
│                                                                     │
│  3. Closure                                                         │
│     - Close notify alert                                            │
│     - Prevents truncation attacks                                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

What You’ll Learn

  1. The TLS Handshake: How secure connections are established
  2. Certificates and PKI: How identity is verified
  3. Cipher Suites: The cryptographic algorithms used
  4. TLS 1.3 Improvements: What makes the latest version better

The TLS Handshake

The TLS handshake establishes a secure connection by negotiating cryptographic parameters and authenticating the server. Understanding it helps debug connection issues and appreciate TLS 1.3’s improvements.

TLS 1.2 Handshake

Client                                               Server
   │                                                    │
   │─────────── ClientHello ───────────────────────────>│
   │  - TLS version                                     │
   │  - Random bytes                                    │
   │  - Cipher suites supported                         │
   │  - Extensions (SNI, etc.)                          │
   │                                                    │
   │<────────── ServerHello ────────────────────────────│
   │  - Chosen TLS version                              │
   │  - Server random                                   │
   │  - Selected cipher suite                           │
   │                                                    │
   │<────────── Certificate ────────────────────────────│
   │  - Server's certificate chain                      │
   │                                                    │
   │<────────── ServerKeyExchange ─────────────────────│
   │  - Key exchange parameters (if needed)             │
   │                                                    │
   │<────────── ServerHelloDone ───────────────────────│
   │                                                    │
   │─────────── ClientKeyExchange ─────────────────────>│
   │  - Pre-master secret (encrypted)                   │
   │                                                    │
   │─────────── ChangeCipherSpec ──────────────────────>│
   │  - "Switching to encrypted"                        │
   │                                                    │
   │─────────── Finished ──────────────────────────────>│
   │  - Encrypted verification                          │
   │                                                    │
   │<────────── ChangeCipherSpec ───────────────────────│
   │<────────── Finished ───────────────────────────────│
   │                                                    │
   │══════════ Encrypted Application Data ══════════════│

TLS 1.2: 2 round trips before application data

TLS 1.3 Handshake (Simplified)

Client                                               Server
   │                                                    │
   │─────────── ClientHello ───────────────────────────>│
   │  - TLS 1.3                                         │
   │  - Supported groups & key shares                   │
   │  - Signature algorithms                            │
   │                                                    │
   │<────────── ServerHello ────────────────────────────│
   │  - Selected key share                              │
   │                                                    │
   │<────────── EncryptedExtensions ────────────────────│
   │<────────── Certificate ────────────────────────────│
   │<────────── CertificateVerify ─────────────────────│
   │<────────── Finished ───────────────────────────────│
   │                                                    │
   │─────────── Finished ──────────────────────────────>│
   │                                                    │
   │══════════ Encrypted Application Data ══════════════│

TLS 1.3: 1 round trip before application data

Key Exchange

How do client and server agree on encryption keys?

Diffie-Hellman Key Exchange

The magic: Agree on a shared secret over an insecure channel.

1. Public parameters: Prime p, Generator g

2. Client picks random a, computes A = g^a mod p
   Server picks random b, computes B = g^b mod p

3. Exchange A and B (visible to eavesdroppers)

4. Client computes: secret = B^a mod p = g^(ab) mod p
   Server computes: secret = A^b mod p = g^(ab) mod p

Both have same secret! Eavesdropper has A and B but
cannot compute g^(ab) without knowing a or b (discrete log problem).

Modern TLS uses Elliptic Curve Diffie-Hellman (ECDHE) for efficiency.
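
The arithmetic above can be run directly with toy numbers (parameters this small are trivially breakable and are for illustration only):

```python
import secrets

p, g = 23, 5                      # toy public parameters

a = secrets.randbelow(p - 2) + 1  # client's private value
b = secrets.randbelow(p - 2) + 1  # server's private value

A = pow(g, a, p)                  # client sends A over the wire
B = pow(g, b, p)                  # server sends B over the wire

# Each side derives the same secret from what it received.
assert pow(B, a, p) == pow(A, b, p) == pow(g, a * b, p)
```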

Perfect Forward Secrecy

Why ephemeral keys matter:

Without PFS (RSA key exchange):
  - Server's long-term key encrypts pre-master secret
  - If key later compromised, all past traffic decryptable

With PFS (ECDHE):
  - New DH keys generated per session
  - Session keys destroyed after use
  - Compromising server key doesn't reveal past sessions

TLS 1.3 requires PFS (ECDHE or DHE only).

Server Name Indication (SNI)

Problem: Single IP hosts multiple HTTPS sites.
         Which certificate should server present?

Solution: SNI extension in ClientHello.

ClientHello includes:
  server_name = "www.example.com"

Server sees hostname BEFORE certificate selection.
Presents correct certificate for that hostname.

Note: SNI is sent unencrypted in both TLS 1.2 and TLS 1.3.
      Encrypted Client Hello (ECH), an emerging extension, hides it.

Session Resumption

Avoiding full handshake for repeat connections:

Session IDs (TLS 1.2)

First connection: Full handshake, server assigns session ID
Subsequent: Client presents session ID, server looks up keys
            Abbreviated handshake (1 RTT instead of 2)

Limitation: Server must store session state (doesn't scale).

Session Tickets (TLS 1.2)

Server encrypts session state into ticket.
Client stores ticket, presents on reconnection.
Server decrypts ticket, recovers session state.

Advantage: Stateless server, better scaling.

0-RTT Resumption (TLS 1.3)

Client sends:
  ClientHello + Early Data (encrypted with previous session key)

Server can respond to early data immediately.
No round trip before application data!

Security caveat: Early data is replayable.
Only safe for idempotent requests.

Handshake Failures

Certificate Errors

ERR_CERT_AUTHORITY_INVALID
  - Certificate not trusted
  - Self-signed or unknown CA
  - Missing intermediate certificate

ERR_CERT_DATE_INVALID
  - Certificate expired
  - Certificate not yet valid
  - System clock wrong

ERR_CERT_COMMON_NAME_INVALID
  - Hostname doesn't match certificate
  - Wrong server or misconfiguration

Protocol Errors

ERR_SSL_VERSION_OR_CIPHER_MISMATCH
  - No common TLS version
  - No common cipher suite
  - Often: Server only supports old protocols

ERR_SSL_PROTOCOL_ERROR
  - Malformed handshake messages
  - Middlebox interference
  - Implementation bugs

Debugging TLS

# OpenSSL client
$ openssl s_client -connect example.com:443 -servername example.com

# Show certificate
$ openssl s_client -connect example.com:443 2>/dev/null | \
    openssl x509 -text -noout

# Test specific TLS version
$ openssl s_client -connect example.com:443 -tls1_2
$ openssl s_client -connect example.com:443 -tls1_3

# curl with verbose TLS info
$ curl -v https://example.com 2>&1 | grep -i ssl

# Test suite (ssllabs.com/ssltest online, or testssl.sh locally)
$ ./testssl.sh example.com

Summary

TLS handshake accomplishes:

Goal                     Mechanism
Version negotiation      ClientHello/ServerHello
Cipher suite selection   ClientHello/ServerHello
Server authentication    Certificate + signature
Key exchange             ECDHE (Diffie-Hellman)
Forward secrecy          Ephemeral keys
Session resumption       Tickets, 0-RTT

TLS 1.3 improvements:

  • 1-RTT handshake (vs 2-RTT)
  • 0-RTT resumption option
  • Only secure cipher suites
  • Encrypted handshake data
  • Simpler, more secure

Certificates and PKI

Certificates prove a server’s identity. The Public Key Infrastructure (PKI) is the trust hierarchy that makes this verification possible.

What’s in a Certificate

X.509 Certificate Structure:
┌─────────────────────────────────────────────────────────────────────┐
│  Version: 3 (X.509v3)                                               │
│  Serial Number: 04:00:00:00:00:01:15:4B:5A:C3:94                     │
│  Signature Algorithm: sha256WithRSAEncryption                       │
│                                                                     │
│  Issuer: CN=DigiCert Global CA, O=DigiCert Inc, C=US                │
│  Validity:                                                          │
│      Not Before: Jan 15 00:00:00 2024 GMT                           │
│      Not After:  Jan 14 23:59:59 2025 GMT                           │
│  Subject: CN=www.example.com, O=Example Inc, C=US                   │
│                                                                     │
│  Subject Public Key Info:                                           │
│      Public Key Algorithm: rsaEncryption                            │
│      RSA Public Key: (2048 bit)                                     │
│          Modulus: 00:c3:9b:...                                      │
│          Exponent: 65537                                            │
│                                                                     │
│  X509v3 Extensions:                                                 │
│      Subject Alternative Name:                                      │
│          DNS:www.example.com, DNS:example.com                       │
│      Key Usage: Digital Signature, Key Encipherment                 │
│      Extended Key Usage: TLS Web Server Authentication              │
│                                                                     │
│  Signature: 3c:b3:4e:...                                            │
└─────────────────────────────────────────────────────────────────────┘

Key Fields

Field              Purpose
Subject            Who the certificate identifies
Issuer             Who signed (issued) the certificate
Validity           When certificate is valid
Public Key         Server’s public key for encryption
Subject Alt Names  Additional valid hostnames
Signature          CA’s signature verifying the certificate

Certificate Chain

Certificates form a chain of trust:

┌─────────────────────────────────────────────────────────────────────┐
│                       Certificate Chain                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────────────────────────┐                            │
│  │          Root CA Certificate        │ ← Trusted by OS/browser   │
│  │  Issuer: DigiCert Root CA           │   (pre-installed)         │
│  │  Subject: DigiCert Root CA          │   Self-signed             │
│  └─────────────────┬───────────────────┘                            │
│                    │ Signs                                          │
│  ┌─────────────────▼───────────────────┐                            │
│  │     Intermediate CA Certificate     │ ← Server sends this       │
│  │  Issuer: DigiCert Root CA           │                            │
│  │  Subject: DigiCert Global CA        │                            │
│  └─────────────────┬───────────────────┘                            │
│                    │ Signs                                          │
│  ┌─────────────────▼───────────────────┐                            │
│  │       End-Entity Certificate        │ ← Server's certificate    │
│  │  Issuer: DigiCert Global CA         │                            │
│  │  Subject: www.example.com           │                            │
│  └─────────────────────────────────────┘                            │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Validation:
1. Server sends leaf + intermediate(s)
2. Client finds trusted root CA
3. Verifies each signature up the chain
4. Checks validity dates, revocation, hostname
5. Trust established!
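The chain-building step can be sketched as a walk from leaf to a trusted root. This toy version tracks only subject/issuer names; real validation also verifies each signature, validity window, and the relevant extensions.

```python
# Toy chain builder: map each subject to its issuer and walk upward.
TRUSTED_ROOTS = {"DigiCert Root CA"}  # stands in for the OS/browser trust store

def build_chain(leaf: str, issued_by: dict[str, str]) -> list[str]:
    chain = [leaf]
    while chain[-1] not in TRUSTED_ROOTS:
        issuer = issued_by.get(chain[-1])
        if issuer is None or issuer == chain[-1]:
            raise ValueError("no path to a trusted root")
        chain.append(issuer)
    return chain

issued_by = {
    "www.example.com": "DigiCert Global CA",     # leaf, signed by intermediate
    "DigiCert Global CA": "DigiCert Root CA",    # intermediate, signed by root
    "DigiCert Root CA": "DigiCert Root CA",      # root is self-signed
}
assert build_chain("www.example.com", issued_by) == [
    "www.example.com", "DigiCert Global CA", "DigiCert Root CA",
]
```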

Certificate Validation

What clients check:

1. Chain of Trust
   Each certificate signed by issuer's key
   Chain leads to trusted root CA

2. Validity Period
   Current date within Not Before / Not After

3. Hostname Match
   Requested hostname in Subject CN or SAN
   www.example.com matches *.example.com (wildcard)

4. Revocation Status
   Certificate not revoked (CRL or OCSP)

5. Key Usage
   Certificate allowed for TLS server authentication

6. Cryptographic Verification
   Signatures mathematically valid
   Key sizes acceptable
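Check 3 (hostname match) can be sketched in a few lines. This is a simplified version of the RFC 6125 rules real clients implement: a wildcard covers exactly one left-most label.

```python
def hostname_matches(hostname: str, pattern: str) -> bool:
    """Simplified RFC 6125-style matching: '*' covers one left-most label."""
    host_labels = hostname.lower().split(".")
    pat_labels = pattern.lower().split(".")
    if len(host_labels) != len(pat_labels):
        return False  # wildcard never spans multiple labels
    head, tail = pat_labels[0], pat_labels[1:]
    if head != "*" and head != host_labels[0]:
        return False
    return tail == host_labels[1:]  # remaining labels must match exactly

assert hostname_matches("www.example.com", "*.example.com")
assert hostname_matches("www.example.com", "www.example.com")
assert not hostname_matches("example.com", "*.example.com")      # needs a label
assert not hostname_matches("a.b.example.com", "*.example.com")  # only one label
```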

Certificate Types

Domain Validation (DV)

Proves: Control of domain
Verification: Email, DNS, or HTTP challenge
Trust level: Low (only domain ownership)
Example: Let's Encrypt certificates

Organization Validation (OV)

Proves: Domain control + organization exists
Verification: Legal documents, phone calls
Trust level: Medium
Example: Business websites

Extended Validation (EV)

Proves: Domain + organization + legal verification
Verification: Extensive background checks
Trust level: High
Example: Banks, financial institutions
Note: Browsers no longer show the green EV address bar

Getting Certificates

Let’s Encrypt (Free, Automated)

# Using certbot
$ sudo certbot certonly --webroot -w /var/www/html -d example.com

# Auto-renewal
$ sudo certbot renew

# Certificates at:
# /etc/letsencrypt/live/example.com/fullchain.pem
# /etc/letsencrypt/live/example.com/privkey.pem

Commercial CAs

  1. Generate CSR (Certificate Signing Request)
  2. Submit to CA with payment
  3. Complete validation
  4. Receive certificate

# Generate private key and CSR
$ openssl req -new -newkey rsa:2048 -nodes \
    -keyout server.key -out server.csr

# Submit server.csr to CA
# Receive server.crt back

Certificate Revocation

When certificates need to be invalidated:

CRL (Certificate Revocation List)

CA publishes list of revoked certificates.
Client downloads CRL, checks if cert is listed.

Problems:
  - CRLs can be large
  - Caching means delayed revocation detection
  - Download can be slow

OCSP (Online Certificate Status Protocol)

Client asks CA: "Is this certificate revoked?"
CA responds: "Valid" or "Revoked"

Better than CRL but:
  - Latency for each connection
  - Privacy (CA sees what sites you visit)

OCSP Stapling

Server fetches OCSP response periodically.
Server "staples" response to TLS handshake.
Client gets proof without contacting CA.

Best practice: Enable OCSP stapling on your server.

Common Issues

Missing Intermediate

Problem:
  Server sends only leaf certificate
  Client can't build chain to root
  "Certificate not trusted" error

Solution:
  Configure server to send full chain:
  ssl_certificate /path/to/fullchain.pem;
  (includes leaf + intermediates)

Expired Certificate

Problem:
  Certificate validity period ended
  Browsers show security warning

Solution:
  Renew certificate before expiration
  Set up automated renewal (Let's Encrypt)
  Monitor certificate expiration

Hostname Mismatch

Problem:
  Request to example.com
  Certificate for www.example.com
  Names don't match

Solution:
  Include all domains in SAN
  Use wildcard (*.example.com) if appropriate
  Redirect to canonical hostname

Summary

Certificate/PKI key concepts:

Component        Purpose
Certificate      Binds public key to identity
Root CA          Trusted anchor (pre-installed)
Intermediate CA  Bridges root to end-entity
Chain            Path from leaf to trusted root
CRL/OCSP         Revocation checking
SAN              Multiple hostnames in one cert

Best practices:

  • Use certificates from trusted CAs
  • Include full certificate chain
  • Enable OCSP stapling
  • Monitor expiration dates
  • Use strong key sizes (RSA 2048+ or ECDSA P-256+)

Cipher Suites

A cipher suite is a combination of cryptographic algorithms used in a TLS connection. Understanding them helps you configure secure connections and debug compatibility issues.

Cipher Suite Components

TLS 1.2 cipher suite name:
  TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
  │   │     │       │   │   │   │
  │   │     │       │   │   │   └── PRF hash
  │   │     │       │   │   └────── Mode (GCM)
  │   │     │       │   └────────── Key size (256-bit)
  │   │     │       └────────────── Encryption (AES)
  │   │     └────────────────────── Authentication (RSA cert)
  │   └──────────────────────────── Key Exchange (ECDHE)
  └──────────────────────────────── Protocol

Components:
  Key Exchange: How to agree on encryption keys
  Authentication: How to verify server identity
  Encryption: How to encrypt data
  MAC/Hash: How to verify integrity
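The naming scheme above can be decomposed mechanically. This illustrative parser handles the common `TLS_<kx>_<auth>_WITH_<cipher>_<hash>` shape, not every registered suite:

```python
def parse_suite(name: str) -> dict:
    """Split a TLS 1.2 suite name like TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384."""
    assert name.startswith("TLS_") and "_WITH_" in name
    kx_auth, rest = name[len("TLS_"):].split("_WITH_")
    kx, _, auth = kx_auth.partition("_")
    cipher_parts = rest.split("_")
    return {
        "key_exchange": kx,                         # how keys are agreed
        "authentication": auth or kx,               # certificate/signature type
        "encryption": "_".join(cipher_parts[:-1]),  # bulk cipher, key size, mode
        "hash": cipher_parts[-1],                   # PRF / MAC hash
    }

assert parse_suite("TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384") == {
    "key_exchange": "ECDHE",
    "authentication": "RSA",
    "encryption": "AES_256_GCM",
    "hash": "SHA384",
}
```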

Algorithm Categories

Key Exchange

RSA           Server's RSA key encrypts pre-master secret
              No forward secrecy
              DO NOT USE (removed in TLS 1.3)

DHE           Diffie-Hellman Ephemeral
              Forward secrecy
              Slower than ECDHE

ECDHE         Elliptic Curve Diffie-Hellman Ephemeral
              Forward secrecy
              Fast, secure
              RECOMMENDED

Authentication

RSA           RSA certificate, RSA signature
              Widely supported

ECDSA         Elliptic Curve DSA
              Smaller keys, faster
              Growing adoption

EdDSA         Ed25519/Ed448
              Modern, fast
              TLS 1.3 support

Bulk Encryption

AES-GCM       AES Galois/Counter Mode (AEAD)
              Fast, secure, authenticated encryption
              RECOMMENDED

AES-CBC       AES Cipher Block Chaining
              Older, requires separate MAC
              Vulnerable to padding oracles
              AVOID

ChaCha20-Poly1305  Stream cipher (AEAD)
              Fast on devices without AES hardware
              Good alternative to AES-GCM

MAC/Hash

SHA-384       For AEAD ciphers, used in PRF
SHA-256       For AEAD ciphers, used in PRF

SHA-1         Old, deprecated
              Only for compatibility
              DO NOT USE if avoidable

Recommended Cipher Suites

Preferred (forward secrecy, AEAD):
  TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
  TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
  TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
  TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
  TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
  TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256

Acceptable (for compatibility):
  TLS_DHE_RSA_WITH_AES_256_GCM_SHA384
  TLS_DHE_RSA_WITH_AES_128_GCM_SHA256

Avoid:
  Anything with RSA key exchange (no PFS)
  Anything with CBC mode (padding attacks)
  Anything with 3DES (slow, weak)
  Anything with RC4 (broken)
  Anything with NULL (no encryption!)
  Anything with EXPORT (intentionally weak)

TLS 1.3 Cipher Suites

TLS 1.3 simplified cipher suites dramatically:

Only 5 cipher suites:
  TLS_AES_256_GCM_SHA384
  TLS_AES_128_GCM_SHA256
  TLS_CHACHA20_POLY1305_SHA256
  TLS_AES_128_CCM_SHA256
  TLS_AES_128_CCM_8_SHA256

Key exchange (ECDHE) and authentication (certificate signature)
are negotiated separately via extensions.

All TLS 1.3 suites provide:
  ✓ Forward secrecy (mandatory)
  ✓ AEAD encryption (mandatory)
  ✓ Strong algorithms (weak ones removed)

Configuring Cipher Suites

nginx

ssl_protocols TLSv1.2 TLSv1.3;
ssl_prefer_server_ciphers on;

# TLS 1.2 ciphers
ssl_ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;

# TLS 1.3 ciphers (usually automatic)
ssl_conf_command Ciphersuites TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256;

Apache

SSLProtocol all -SSLv3 -TLSv1 -TLSv1.1
SSLCipherSuite ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:...
SSLHonorCipherOrder on

Testing Cipher Suites

# Test supported ciphers
$ nmap --script ssl-enum-ciphers -p 443 example.com

# OpenSSL test specific cipher
$ openssl s_client -connect example.com:443 \
    -cipher ECDHE-RSA-AES256-GCM-SHA384

# Show negotiated cipher
$ openssl s_client -connect example.com:443 2>/dev/null | \
    grep "Cipher is"
# New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384

# SSL Labs test (comprehensive)
# https://www.ssllabs.com/ssltest/

Security Levels

┌─────────────────────────────────────────────────────────────────────┐
│  Level    │ Key Exchange  │ Encryption    │ Bits of Security       │
├───────────┼───────────────┼───────────────┼────────────────────────┤
│  Modern   │ ECDHE P-256+  │ AES-128-GCM+  │ 128-bit security       │
│           │               │ ChaCha20      │                        │
├───────────┼───────────────┼───────────────┼────────────────────────┤
│  Compat.  │ ECDHE/DHE     │ AES-128+      │ 112-128 bit            │
│           │ 2048-bit+     │               │                        │
├───────────┼───────────────┼───────────────┼────────────────────────┤
│  Legacy   │ RSA 2048      │ AES/3DES      │ ~80-112 bit            │
│  (avoid)  │               │               │                        │
├───────────┼───────────────┼───────────────┼────────────────────────┤
│  Broken   │ RSA < 2048    │ RC4, DES      │ Effectively none       │
│  (never)  │ EXPORT        │ NULL          │                        │
└───────────┴───────────────┴───────────────┴────────────────────────┘

Summary

Good cipher suite configuration:

  1. Use TLS 1.3 when possible (automatic good choices)
  2. Prefer ECDHE for key exchange (forward secrecy)
  3. Use AEAD encryption (AES-GCM or ChaCha20-Poly1305)
  4. Disable weak ciphers (RSA key exchange, CBC, old algorithms)
  5. Test your configuration (SSL Labs, testssl.sh)

TLS 1.3 removes the complexity—all its cipher suites are secure.

TLS 1.3 Improvements

TLS 1.3 (RFC 8446, 2018) represents a major overhaul, not just incremental improvement. It’s faster, simpler, and more secure than TLS 1.2.

Key Improvements

┌─────────────────────────────────────────────────────────────────────┐
│                    TLS 1.3 vs TLS 1.2                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Performance:                                                       │
│    TLS 1.2: 2 RTT handshake                                         │
│    TLS 1.3: 1 RTT handshake                                         │
│             0 RTT resumption                                        │
│                                                                     │
│  Security:                                                          │
│    Removed: RSA key exchange, CBC ciphers, SHA-1, RC4, 3DES         │
│    Required: Forward secrecy (ECDHE/DHE only)                       │
│    Encrypted: More handshake data hidden                            │
│                                                                     │
│  Simplicity:                                                        │
│    Cipher suites: 37+ → 5                                           │
│    Fewer options = fewer misconfigurations                          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

1-RTT Handshake

TLS 1.3 combines key exchange and authentication:

TLS 1.2:
  ClientHello        →            1st RTT
  ←  ServerHello + Cert
  ClientKeyExchange  →            2nd RTT
  ←  Finished
  Application Data   →            Finally!

TLS 1.3:
  ClientHello + KeyShare  →       1st RTT
  ←  ServerHello + KeyShare
  ←  EncryptedExtensions
  ←  Certificate, Finished
  Finished                →
  Application Data        →       Immediately!

Client sends key share in first message.
Server can compute keys immediately.
Encrypted data flows after 1 RTT.

0-RTT Resumption

For returning clients:

TLS 1.3 0-RTT:
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  First Connection:                                                  │
│    Full handshake + receive session ticket                          │
│                                                                     │
│  Subsequent Connection:                                             │
│    ClientHello + EarlyData  →  (request sent IMMEDIATELY)           │
│    ←  ServerHello + response                                        │
│                                                                     │
│  No waiting! Request in first packet.                               │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Security caveat:
  0-RTT data can be replayed by attacker
  Only safe for idempotent operations (GET, not POST)
  Server can reject 0-RTT for sensitive operations

Removed Features

TLS 1.3 removed insecure and unnecessary features:

Removed entirely:
  ✗ RSA key exchange (no forward secrecy)
  ✗ Static Diffie-Hellman (no forward secrecy)
  ✗ CBC mode ciphers (padding oracle attacks)
  ✗ RC4 (broken)
  ✗ 3DES (slow, small block size)
  ✗ MD5 and SHA-1 in signature algorithms
  ✗ Compression (CRIME attack)
  ✗ Renegotiation
  ✗ Custom DHE groups
  ✗ ChangeCipherSpec message

Result: All TLS 1.3 connections have forward secrecy
        and use authenticated encryption (AEAD).

Encrypted Handshake

More of the handshake is encrypted:

TLS 1.2 visible to eavesdropper:
  - Certificate (server identity)
  - Server extensions
  - Much of handshake

TLS 1.3 encrypted:
  - Certificate
  - Extensions after ServerHello
  - Most handshake messages

Only visible:
  - ClientHello (including SNI)
  - ServerHello

Future: Encrypted Client Hello (ECH) hides SNI too.

Simplified Cipher Suites

TLS 1.2: 37+ cipher suites (many weak/redundant)
TLS 1.3: 5 cipher suites (all secure)

TLS 1.3 cipher suites:
  TLS_AES_128_GCM_SHA256          (required)
  TLS_AES_256_GCM_SHA384          (recommended)
  TLS_CHACHA20_POLY1305_SHA256    (good for non-AES hardware)
  TLS_AES_128_CCM_SHA256          (IoT)
  TLS_AES_128_CCM_8_SHA256        (IoT, constrained)

Key exchange negotiated separately via supported_groups.
Signature algorithms negotiated separately.
Simpler configuration, fewer mistakes.

Downgrade Protection

TLS 1.3 prevents protocol downgrade attacks:

Attack scenario:
  Client supports TLS 1.3
  Server supports TLS 1.3
  Attacker modifies ClientHello to say "TLS 1.2 only"
  Connection uses weaker TLS 1.2

Protection:
  Server random includes special bytes when downgrading
  Client detects this and aborts
  Man-in-the-middle cannot force downgrade
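The "special bytes" are defined in RFC 8446: a TLS 1.3-capable server that negotiates an older version overwrites the last 8 bytes of its ServerHello random with a fixed sentinel. A client-side check looks roughly like this sketch (message parsing omitted):

```python
import os

# RFC 8446 downgrade sentinels: last 8 bytes of ServerHello.random
DOWNGRADE_TLS12 = bytes.fromhex("444f574e47524401")  # "DOWNGRD" + 0x01
DOWNGRADE_TLS11 = bytes.fromhex("444f574e47524400")  # "DOWNGRD" + 0x00 (1.1 and below)

def detect_downgrade(server_random: bytes, client_supports_tls13: bool) -> bool:
    """Return True if a TLS 1.3 client should abort this handshake."""
    tail = server_random[-8:]
    return client_supports_tls13 and tail in (DOWNGRADE_TLS12, DOWNGRADE_TLS11)

# A server honestly limited to TLS 1.2 picks fully random bytes:
honest_random = bytes(32)  # fixed non-sentinel value for the demo
assert not detect_downgrade(honest_random, True)

# A MITM forcing TLS 1.2 cannot forge the random without the sentinel:
downgraded_random = os.urandom(24) + DOWNGRADE_TLS12
assert detect_downgrade(downgraded_random, True)
```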

Migration Considerations

Compatibility

TLS 1.3 designed for compatibility:
  - Uses same port (443)
  - Can negotiate down to TLS 1.2 if needed
  - Works with most proxies/load balancers

Potential issues:
  - Old middleboxes may break 1.3
  - Some intrusion detection fails on 1.3
  - 0-RTT requires application awareness

Server Configuration

# nginx - enable TLS 1.3
ssl_protocols TLSv1.2 TLSv1.3;

# Enable 0-RTT (use with caution)
ssl_early_data on;

# In proxy situations, tell backend about early data
proxy_set_header Early-Data $ssl_early_data;

Application Changes for 0-RTT

# Check if request was 0-RTT
early_data = request.headers.get('Early-Data')

if early_data == '1':
    # This request might be replayed!
    if not is_idempotent(request):
        # Reject or require retry without 0-RTT
        return Response(status=425)  # Too Early

Measuring TLS 1.3 Adoption

As of 2024:
  - ~70% of websites support TLS 1.3
  - All major browsers support TLS 1.3
  - All major CDNs support TLS 1.3

Verify your site:
  $ curl -v https://yoursite.com 2>&1 | grep "SSL connection"
  * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384

Summary

TLS 1.3 advantages:

Feature          Improvement
Handshake        1 RTT (vs 2 RTT)
Resumption       0 RTT possible
Security         Only secure options remain
Configuration    5 ciphers vs 37+
Privacy          More encrypted handshake
Forward secrecy  Mandatory

TLS 1.3 should be enabled on all new deployments. The only reason to stay on TLS 1.2 is legacy client compatibility, and that’s decreasing rapidly.

Application Protocols

Beyond HTTP, many other application protocols power essential internet services. Understanding them provides insight into protocol design and helps when integrating with these systems.

Common Application Protocols

┌─────────────────────────────────────────────────────────────────────┐
│               Major Application Protocols                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Email:                                                             │
│    SMTP (25, 587)     Sending email between servers                 │
│    IMAP (143, 993)    Accessing mailbox, server stores mail         │
│    POP3 (110, 995)    Downloading mail, client stores               │
│                                                                     │
│  File Transfer:                                                     │
│    FTP (21, 20)       Classic file transfer (insecure)              │
│    SFTP (22)          SSH-based file transfer (secure)              │
│    SCP (22)           Secure copy over SSH                          │
│                                                                     │
│  Remote Access:                                                     │
│    SSH (22)           Secure shell, tunneling, file transfer        │
│    Telnet (23)        Insecure remote access (legacy)               │
│    RDP (3389)         Windows remote desktop                        │
│                                                                     │
│  Name Resolution:                                                   │
│    DNS (53)           Domain name → IP address                      │
│    mDNS (5353)        Multicast DNS (local discovery)               │
│                                                                     │
│  Time:                                                              │
│    NTP (123)          Network time synchronization                  │
│                                                                     │
│  Directory:                                                         │
│    LDAP (389, 636)    Directory services (Active Directory)         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Protocol Characteristics

Most application protocols share common traits:

Request-Response:
  Client sends command/request
  Server sends response
  Back and forth until done

Text vs Binary:
  Text:   Human-readable (SMTP, HTTP/1.1)
  Binary: Machine-efficient (HTTP/2, Protocol Buffers)

Stateful vs Stateless:
  Stateful:  Server remembers session (SMTP, FTP)
  Stateless: Each request independent (HTTP, DNS)

What You’ll Learn

  1. SMTP: How email travels across the internet
  2. FTP and Alternatives: File transfer evolution
  3. SSH: Secure remote access and more

SMTP: Email Delivery

SMTP (Simple Mail Transfer Protocol) is how email moves between servers. Despite being from 1982, it remains the backbone of email delivery.

How Email Flows

┌─────────────────────────────────────────────────────────────────────┐
│                     Email Delivery Path                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  alice@gmail.com sends to bob@example.com                           │
│                                                                     │
│  ┌────────────┐                                                     │
│  │   Alice    │                                                     │
│  │  (Gmail)   │                                                     │
│  └─────┬──────┘                                                     │
│        │ 1. Compose & Send                                          │
│        ▼                                                            │
│  ┌────────────┐                                                     │
│  │Gmail Server│                                                     │
│  │    MTA     │                                                     │
│  └─────┬──────┘                                                     │
│        │ 2. DNS lookup: example.com MX                              │
│        │ 3. SMTP to mail.example.com                                │
│        ▼                                                            │
│  ┌────────────┐                                                     │
│  │Example.com │                                                     │
│  │Mail Server │                                                     │
│  └─────┬──────┘                                                     │
│        │ 4. Store in Bob's mailbox                                  │
│        ▼                                                            │
│  ┌────────────┐                                                     │
│  │    Bob     │ 5. Retrieve via IMAP/POP3                           │
│  └────────────┘                                                     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

SMTP Conversation

S: 220 mail.example.com ESMTP ready
C: EHLO gmail.com
S: 250-mail.example.com
S: 250-SIZE 35882577
S: 250-STARTTLS
S: 250 OK

C: STARTTLS
S: 220 Ready to start TLS
   (TLS handshake happens)

C: EHLO gmail.com
S: 250 OK

C: MAIL FROM:<alice@gmail.com>
S: 250 OK

C: RCPT TO:<bob@example.com>
S: 250 OK

C: DATA
S: 354 Start mail input
C: From: alice@gmail.com
C: To: bob@example.com
C: Subject: Hello!
C:
C: Hi Bob, how are you?
C: .
S: 250 OK, message queued

C: QUIT
S: 221 Bye
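The DATA payload in the exchange above is an RFC 5322 message. Python's standard email and smtplib modules build and submit one; the server name and credentials below are placeholders:

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "alice@gmail.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Hello!"
msg.set_content("Hi Bob, how are you?")

# Submission would use port 587 with STARTTLS (placeholder server, not run here):
# import smtplib
# with smtplib.SMTP("mail.example.com", 587) as s:
#     s.starttls()
#     s.login("alice", "app-password")
#     s.send_message(msg)  # issues MAIL FROM / RCPT TO / DATA for you

wire = msg.as_string()
assert "Subject: Hello!" in wire
```

smtplib also handles dot-stuffing, so a body line consisting of a single "." cannot terminate the message early.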

Ports and Security

Port 25:   Server-to-server (MTA to MTA)
           Often blocked by ISPs for end users

Port 587:  Client submission (with authentication)
           Modern email clients use this

Port 465:  SMTPS (implicit TLS)
           Once deprecated, re-standardized for submission (RFC 8314)

Security:
  STARTTLS: Upgrade plain connection to TLS
  AUTH:     Login with username/password
  SPF:      Verify sender IP authorized
  DKIM:     Cryptographic message signature
  DMARC:    Policy for SPF/DKIM failures

Email Authentication (SPF, DKIM, DMARC)

SPF (DNS TXT record):
  example.com TXT "v=spf1 include:_spf.google.com -all"
  "Only Google's servers can send as @example.com"

DKIM (signature in header):
  DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=selector;
    h=from:to:subject:date; bh=...; b=...
  Receiver fetches public key from DNS, verifies signature.

DMARC (policy):
  _dmarc.example.com TXT "v=DMARC1; p=reject; rua=mailto:..."
  "If SPF/DKIM fail, reject the message and report."
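A receiving server evaluates SPF by parsing the record into qualified mechanisms. A minimal tokenizer for records like the one above (evaluation of each mechanism is omitted):

```python
def parse_spf(record: str) -> list[tuple[str, str]]:
    """Split an SPF TXT record into (qualifier, mechanism) pairs."""
    terms = record.split()
    assert terms[0] == "v=spf1", "not an SPF record"
    parsed = []
    for term in terms[1:]:
        qualifier = "+"  # default qualifier means "pass"
        if term[0] in "+-~?":
            qualifier, term = term[0], term[1:]
        parsed.append((qualifier, term))
    return parsed

assert parse_spf("v=spf1 include:_spf.google.com -all") == [
    ("+", "include:_spf.google.com"),  # Google's servers may send
    ("-", "all"),                      # everything else hard-fails
]
```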

Common Issues

Rejected as spam:
  - Missing SPF/DKIM/DMARC
  - IP on blocklist
  - Poor sending reputation

Connection refused:
  - Port 25 blocked (use 587)
  - Firewall rules
  - Server down

Authentication failed:
  - Wrong credentials
  - App-specific password needed
  - TLS required but not enabled

FTP and Secure Alternatives

FTP (File Transfer Protocol) is one of the oldest internet protocols (1971). While still used, security concerns have led to better alternatives.

How FTP Works

FTP uses two connections:

Control Connection (Port 21):
  - Commands and responses
  - Stays open during session
  - Text-based protocol

Data Connection (Port 20 or ephemeral):
  - Actual file transfer
  - Opened per transfer
  - Closed after each file

┌────────────┐                    ┌────────────┐
│   Client   │                    │   Server   │
├────────────┤                    ├────────────┤
│ Control ───┼────── Port 21 ─────┼─── Control │
│            │                    │            │
│   Data  ◄──┼─── Port 20/high ───┼──►  Data   │
└────────────┘                    └────────────┘

Active vs Passive Mode

Active Mode:
  1. Client opens control connection to server:21
  2. Client tells server: "Connect to me on port 5000"
  3. Server connects FROM port 20 TO client:5000

  Problem: Client firewalls block incoming connections

Passive Mode (PASV):
  1. Client opens control connection to server:21
  2. Client: "PASV" (I'll connect to you)
  3. Server: "227 Entering Passive (192,168,1,100,195,149)"
     (Connect to 192.168.1.100 port 50069)
  4. Client connects to server's data port

  Better: Client initiates both connections (firewall-friendly)
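The six numbers in the 227 reply encode the data endpoint: the first four are the IP address and the last two are the high and low bytes of the port (195 * 256 + 149 = 50069). A small parser:

```python
import re

def parse_pasv(reply: str) -> tuple[str, int]:
    """Extract (ip, port) from a '227 Entering Passive Mode' reply."""
    nums = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    h1, h2, h3, h4, p_hi, p_lo = (int(n) for n in nums.groups())
    return f"{h1}.{h2}.{h3}.{h4}", p_hi * 256 + p_lo

ip, port = parse_pasv("227 Entering Passive Mode (192,168,1,100,195,149)")
assert ip == "192.168.1.100"
assert port == 50069
```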

FTP Session Example

$ ftp ftp.example.com
220 Welcome to Example FTP
Name: alice
331 Password required
Password: ********
230 Login successful

ftp> pwd
257 "/" is current directory

ftp> ls
227 Entering Passive Mode (192,168,1,100,195,149)
150 Here comes the directory listing
drwxr-xr-x    2 alice  staff   68 Jan 15 10:00 documents
-rw-r--r--    1 alice  staff 1234 Jan 14 09:00 readme.txt
226 Directory send OK

ftp> get readme.txt
227 Entering Passive Mode (192,168,1,100,195,150)
150 Opening data connection
226 Transfer complete

ftp> quit
221 Goodbye

FTP Security Problems

✗ Passwords sent in plaintext
✗ Data transferred unencrypted
✗ No server authentication
✗ Complex firewall requirements

Anyone on the network can see:
  - Username and password
  - All file contents
  - All commands

Secure Alternatives

SFTP (SSH File Transfer Protocol)

Runs over SSH (port 22):
  ✓ Encrypted connection
  ✓ Strong authentication
  ✓ Single port (firewall-friendly)
  ✓ Widely supported

$ sftp user@server.example.com
sftp> put localfile.txt
sftp> get remotefile.txt
sftp> ls
sftp> exit

SCP (Secure Copy)

Simple file copy over SSH:

# Copy local to remote
$ scp file.txt user@server:/path/

# Copy remote to local
$ scp user@server:/path/file.txt ./

# Copy directory recursively
$ scp -r localdir user@server:/path/

FTPS (FTP over TLS)

FTP with TLS encryption:
  - Implicit FTPS: TLS from start (port 990)
  - Explicit FTPS: STARTTLS upgrade (port 21)

Still has FTP complexity (dual connections).
SFTP generally preferred.

Recommendation

For new deployments:
  1. SFTP      Best overall (secure, firewall-friendly)
  2. SCP       Simple file copies
  3. rsync     Efficient synchronization
  4. HTTPS     API-based file transfer

Avoid:
  - Plain FTP (insecure)
  - TFTP (no authentication at all)

SSH: Secure Shell

SSH (Secure Shell) provides encrypted remote access, replacing insecure protocols like Telnet and rlogin. Beyond shell access, SSH enables secure file transfer, port forwarding, and tunneling.

SSH Capabilities

┌─────────────────────────────────────────────────────────────────────┐
│                      SSH Use Cases                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Remote Shell:     Interactive command line on remote server        │
│  File Transfer:    SFTP, SCP over encrypted channel                 │
│  Port Forwarding:  Tunnel any TCP connection through SSH            │
│  X11 Forwarding:   Run graphical apps remotely                      │
│  Agent Forwarding: Use local keys on remote servers                 │
│  Git Transport:    Secure repository access                         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Authentication Methods

Password Authentication

$ ssh user@server.example.com
user@server.example.com's password: ********

Simple but:
  - Vulnerable to brute force
  - Requires typing password
  - Can't be automated safely

Public Key Authentication

# Generate key pair
$ ssh-keygen -t ed25519 -C "alice@laptop"
# Creates: ~/.ssh/id_ed25519 (private)
#          ~/.ssh/id_ed25519.pub (public)

# Copy public key to server
$ ssh-copy-id user@server.example.com
# Or manually add to ~/.ssh/authorized_keys

# Login (no password!)
$ ssh user@server.example.com

Key Types

Ed25519:     Modern, fast, secure (recommended)
RSA:         Widely compatible (4096-bit minimum)
ECDSA:       Elliptic curve (P-256, P-384)

Avoid:
  DSA:       Deprecated, weak
  RSA <2048: Too short

SSH Configuration

Client Config (~/.ssh/config)

# Default settings
Host *
    AddKeysToAgent yes
    IdentityFile ~/.ssh/id_ed25519

# Named host
Host myserver
    HostName server.example.com
    User alice
    Port 22
    IdentityFile ~/.ssh/work_key

# Now just:
$ ssh myserver

Server Config (/etc/ssh/sshd_config)

# Secure settings
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers alice bob
# ("Protocol 2" is obsolete: modern OpenSSH supports only SSH-2)

Port Forwarding

Local Forwarding (-L)

Access remote service through local port:

$ ssh -L 8080:localhost:80 user@server

  Local:8080  ──────SSH Tunnel──────>  Server ────> localhost:80
     │                                      (server's port 80)
     └── Your browser connects here

Use case: Access web app behind firewall

Remote Forwarding (-R)

Expose local service to remote:

$ ssh -R 9000:localhost:3000 user@server

  Server:9000  <─────SSH Tunnel───────  Local:3000
     │                                      │
     └── Internet can access          Your dev server

Use case: Share local development server

Dynamic Forwarding (-D)

SOCKS proxy through SSH:

$ ssh -D 1080 user@server

Configure browser to use SOCKS proxy localhost:1080
All browser traffic goes through server.

Use case: Bypass network restrictions, privacy

SSH Tunnels for Security

┌─────────────────────────────────────────────────────────────────────┐
│                    SSH Tunnel Example                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Scenario: Connect to database behind firewall                      │
│                                                                     │
│  ┌────────┐        ┌────────────┐        ┌──────────┐              │
│  │ Laptop │──SSH──>│ Jump Host  │        │ Database │              │
│  │        │        │ (bastion)  │──────>│ :5432    │              │
│  └────────┘        └────────────┘        └──────────┘              │
│                                                                     │
│  $ ssh -L 5432:db.internal:5432 user@bastion                        │
│  $ psql -h localhost -p 5432 mydb                                   │
│                                                                     │
│  Database connection encrypted through SSH tunnel.                  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Best Practices

Key Management:
  ✓ Use Ed25519 keys
  ✓ Protect private key with passphrase
  ✓ Use ssh-agent to avoid retyping passphrase
  ✓ Rotate keys periodically

Server Security:
  ✓ Disable password authentication
  ✓ Disable root login
  ✓ Use fail2ban for brute force protection
  ✓ Keep SSH updated
  ✓ Use non-standard port (security through obscurity, minor)

Access Control:
  ✓ Limit allowed users
  ✓ Use bastion/jump hosts
  ✓ Audit authorized_keys regularly

Troubleshooting

# Verbose output
$ ssh -v user@server      # Basic
$ ssh -vvv user@server    # Maximum verbosity

# Check key permissions
$ ls -la ~/.ssh/
# ~/.ssh itself should be 700
# id_ed25519 (private key) should be 600
# authorized_keys should be 600

# Test authentication
$ ssh -T git@github.com

# Check server logs (on server)
$ sudo tail -f /var/log/auth.log

Protocol Design Principles

When building networked systems, you often need custom protocols or must extend existing ones. This chapter covers principles for designing protocols that are robust, evolvable, and performant.

Design Considerations

┌─────────────────────────────────────────────────────────────────────┐
│                    Protocol Design Questions                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Communication Pattern:                                             │
│    Request-response? Streaming? Pub-sub? Full-duplex?               │
│                                                                     │
│  Reliability Requirements:                                          │
│    Every message must arrive? Some loss acceptable?                 │
│                                                                     │
│  Latency Requirements:                                              │
│    Real-time? Best-effort? Batch acceptable?                        │
│                                                                     │
│  Message Size:                                                      │
│    Small fixed? Variable? Very large?                               │
│                                                                     │
│  Security:                                                          │
│    Authentication? Encryption? Integrity?                           │
│                                                                     │
│  Compatibility:                                                     │
│    Must work with existing systems? Future evolution?               │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Key Topics

  1. Versioning Strategies: How to evolve protocols over time
  2. Backwards Compatibility: Supporting old and new clients
  3. Performance Considerations: Optimizing for speed and efficiency

Versioning Strategies

Protocols evolve. New features are added, bugs are fixed, and requirements change. Good versioning makes this evolution manageable.

Why Version?

Without versioning:

  Client (v1): Send message type A
  Server (v2): Expects message type B
  Result: Confusion, errors, failures

With versioning:

  Client (v1): "I speak version 1"
  Server (v2): "I understand v1 and v2, let's use v1"
  Result: Graceful interoperability

Versioning Approaches

Explicit Version Numbers

In protocol header:
┌──────────────────────────────────────────────────────────────┐
│ Version │ Message Type │ Length │ Payload...                 │
│   (1)   │     (2)      │  (4)   │                            │
└──────────────────────────────────────────────────────────────┘

HTTP:
  GET / HTTP/1.1   (version carried in the request line)
  HTTP/2: negotiated via ALPN during the TLS handshake;
          the version never appears in a request line

Pros: Clear, explicit
Cons: Major versions can break compatibility
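
A header like the one sketched above is easy to frame with fixed-width fields. Here is a minimal Python sketch using the byte widths from the diagram (1-byte version, 2-byte type, 4-byte length, network byte order); it is an illustration, not any particular real wire format:

```python
import struct

HEADER = struct.Struct("!BHI")  # version:1 byte, type:2 bytes, length:4 bytes

def encode(version: int, msg_type: int, payload: bytes) -> bytes:
    return HEADER.pack(version, msg_type, len(payload)) + payload

def decode(frame: bytes):
    version, msg_type, length = HEADER.unpack_from(frame)
    return version, msg_type, frame[HEADER.size:HEADER.size + length]

frame = encode(1, 42, b"hello")
print(decode(frame))  # (1, 42, b'hello')
```

A receiver can branch on the version byte before parsing anything else, which is what makes the field useful.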

Feature Negotiation

Instead of single version, negotiate capabilities:

Client: "I support: compression, encryption, batch"
Server: "I support: encryption, streaming"
Both: "We'll use: encryption"

TLS does this with cipher suites.
HTTP/2 does this with SETTINGS frames.

Pros: Granular, flexible
Cons: Complex negotiation
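
At its core, the exchange above is an intersection of advertised capability sets. A toy sketch (real protocols such as TLS also rank the surviving candidates by preference):

```python
def negotiate(client: set, server: set) -> set:
    """Agree on exactly the features both sides advertise."""
    return client & server

agreed = negotiate({"compression", "encryption", "batch"},
                   {"encryption", "streaming"})
print(sorted(agreed))  # ['encryption']
```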

Semantic Versioning

MAJOR.MINOR.PATCH

Major: Breaking changes (v1 → v2)
Minor: New features, backwards compatible (v1.1 → v1.2)
Patch: Bug fixes only (v1.1.0 → v1.1.1)

For APIs:
  v1 clients work with v1.x servers
  v2 might require migration

Pros: Clear expectations
Cons: Major bumps still painful

Wire Format Versioning

Message format evolution:

Version 1:
  { "name": "Alice", "age": 30 }

Version 2 (additive):
  { "name": "Alice", "age": 30, "email": "alice@example.com" }

Old clients ignore unknown fields.
New clients handle missing fields.
No version number needed if done carefully.
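
The tolerant-reader behavior described above can be made explicit in code. A Python sketch of a decoder that ignores unknown fields and defaults missing ones (field names from the example; the defaults are illustrative):

```python
import json

def decode_user(raw: str) -> dict:
    data = json.loads(raw)
    return {
        "name": data.get("name", ""),   # default for missing fields
        "age": data.get("age", 0),
        "email": data.get("email"),     # None when a v1 peer omits it
    }  # anything else in `data` is silently ignored

v1 = decode_user('{"name": "Alice", "age": 30}')
v2 = decode_user('{"name": "Alice", "age": 30, '
                 '"email": "alice@example.com", "extra": true}')
print(v1["email"], v2["email"])  # None alice@example.com
```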

Version in Different Layers

URL versioning (REST APIs):
  /api/v1/users
  /api/v2/users

Header versioning:
  Accept: application/vnd.myapi.v2+json

Query parameter:
  /api/users?version=2

Content negotiation:
  Accept: application/json; version=2

Best Practices

1. Include version from day one
   Adding versioning later is painful.

2. Plan for evolution
   Reserve bits/fields for future use.

3. Support multiple versions
   Don't force immediate upgrades.

4. Deprecation timeline
   v1 supported until 2025-01-01.

5. Version at right granularity
   API version? Message version? Both?

Backwards Compatibility

Maintaining backwards compatibility lets you evolve protocols without breaking existing deployments. It’s often the difference between smooth upgrades and painful migrations.

Compatibility Types

Backwards Compatible:
  New servers work with old clients.
  Client v1 ──────> Server v2 ✓

Forwards Compatible:
  Old servers handle new clients gracefully.
  Client v2 ──────> Server v1 ✓ (degraded)

Full Compatibility:
  Both directions work.
  Ideal but not always achievable.

Techniques for Compatibility

Ignore Unknown Fields

// Client v1 sends:
{ "name": "Alice", "age": 30 }

// Server v2 expects:
{ "name": "Alice", "age": 30, "email": "?" }

// Server should:
// - Accept missing email (use default or null)
// - Not reject the request

// Client v2 sends:
{ "name": "Bob", "age": 25, "email": "bob@example.com" }

// Server v1 should:
// - Ignore unknown "email" field
// - Process name and age normally

Optional Fields with Defaults

// Protocol Buffers example
message User {
  string name = 1;
  int32 age = 2;
  optional string email = 3;  // Added in v2
}

// Missing optional fields get default values.
// Old messages work with new code.
// New messages work with old code (email ignored).

Extensible Enums

Bad: Fixed enum, no room to grow
  enum Status { OK = 0, ERROR = 1 }

Good: Reserve unknown handling
  enum Status {
    UNKNOWN = 0,  // Default for unrecognized
    OK = 1,
    ERROR = 2
    // Future: PENDING = 3
  }

Old code receiving new status → UNKNOWN (handled gracefully)
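
In Python, the UNKNOWN fallback can be wired into the enum itself via `_missing_`, so unrecognized wire values never raise. A sketch mirroring the Status enum above:

```python
from enum import Enum

class Status(Enum):
    UNKNOWN = 0   # default for unrecognized values
    OK = 1
    ERROR = 2

    @classmethod
    def _missing_(cls, value):
        # Called when Status(v) gets a value not in the enum:
        # degrade to UNKNOWN instead of raising ValueError.
        return cls.UNKNOWN

print(Status(1))  # Status.OK
print(Status(3))  # Status.UNKNOWN -- e.g. a future PENDING from a newer peer
```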

Reserved Fields

message User {
  string name = 1;
  reserved 2;  // Was 'age', removed in v3
  string email = 3;
  reserved "age";  // Prevents reuse of the field name
}

// Prevents accidentally reusing field numbers
// which would cause data corruption.

Breaking Changes

Sometimes breaking changes are necessary:

What's Breaking:
  - Removing required fields
  - Changing field types
  - Renaming fields (in JSON)
  - Changing semantics of existing fields
  - Removing supported message types

Mitigation Strategies:
  1. New endpoint/message type (keep old working)
  2. Deprecation period with warnings
  3. Version bump (v1 → v2)
  4. Feature flags during transition

Postel’s Law (Robustness Principle)

"Be conservative in what you send,
 be liberal in what you accept."

Send: Strictly conform to spec
Accept: Handle variations gracefully

This enables interoperability between
implementations with slight differences.
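
Applied to something like header parsing, the principle looks like this (a toy sketch: accept sloppy case and whitespace, emit one canonical form):

```python
def parse_header(line: str):
    """Liberal in what we accept: tolerate case and stray whitespace."""
    name, _, value = line.partition(":")
    return name.strip().lower(), value.strip()

def format_header(name: str, value: str) -> str:
    """Conservative in what we send: one canonical spelling."""
    return f"{name.strip().title()}: {value.strip()}"

print(parse_header("  CONTENT-type :  text/html "))
# ('content-type', 'text/html')
print(format_header("content-type", "text/html"))
# Content-Type: text/html
```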

Testing Compatibility

# Test old client against new server
$ old-client --server=new-server --test-suite

# Test new client against old server
$ new-client --server=old-server --test-suite

# Fuzz testing with version mixing
$ compatibility-fuzzer --versions=v1,v2,v3

# Contract testing
$ pact-verify --provider=server --consumer=client-v1

Real-World Examples

JSON (excellent compatibility):
  - Unknown fields ignored
  - Missing fields → null/default
  - Easy to extend

Protocol Buffers (good compatibility):
  - Field numbers provide stability
  - Unknown fields preserved
  - Wire format stable

HTTP (exceptional compatibility):
  - 20+ years of evolution
  - HTTP/1.1 still works everywhere
  - Headers ignore unknown values

Performance Considerations

Protocol design choices significantly impact performance. Understanding the trade-offs helps you make informed decisions.

Key Performance Factors

┌─────────────────────────────────────────────────────────────────────┐
│                  Performance Dimensions                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Latency:     Time for message round-trip                           │
│               Affected by: RTTs, encoding time, processing          │
│                                                                     │
│  Throughput:  Data volume per unit time                             │
│               Affected by: Message size, connection limits          │
│                                                                     │
│  Overhead:    Wasted bandwidth (headers, framing)                   │
│               Affected by: Protocol verbosity, encoding             │
│                                                                     │
│  Efficiency:  CPU/memory per message                                │
│               Affected by: Parsing, serialization                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Message Format Trade-offs

Text vs Binary

Text (JSON, XML):
  + Human readable
  + Easy debugging
  + Universal parsers
  - Larger messages
  - Slower parsing
  - Ambiguous types

Binary (Protocol Buffers, MessagePack):
  + Compact messages
  + Fast parsing
  + Precise types
  - Requires schema/decoder
  - Harder debugging
  - Versioning complexity

Rule of thumb:
  Internal services: Binary (efficiency)
  Public APIs: JSON (interoperability)
  High-volume: Binary (worth complexity)

Size Comparison

Same data in different formats:

JSON (53 bytes):
  {"id":123,"name":"Alice","email":"alice@example.com"}

Protocol Buffers (~28 bytes):
  [binary encoded, roughly half the size]

MessagePack (~40 bytes):
  [binary JSON, roughly 25% smaller]

For millions of messages, these differences matter!
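
You can reproduce the flavor of this comparison with only the standard library. The sketch below pits compact JSON against a hand-rolled, length-prefixed binary encoding (an illustration, not actual Protocol Buffers output):

```python
import json
import struct

record = {"id": 123, "name": "Alice", "email": "alice@example.com"}

as_json = json.dumps(record, separators=(",", ":")).encode()

# Hand-rolled binary: 4-byte id, then length-prefixed strings.
name, email = record["name"].encode(), record["email"].encode()
as_binary = (struct.pack("!IB", record["id"], len(name)) + name
             + struct.pack("!B", len(email)) + email)

print(len(as_json), len(as_binary))  # binary is roughly half the size
```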

Connection Strategies

Persistent vs Per-Request

Per-request connections:
  Each request: TCP handshake + TLS handshake + request
  Latency: High (multiple RTTs)
  Resources: Connection churn

Persistent connections:
  One connection: Multiple requests
  Latency: Low (no repeated handshakes)
  Resources: Connection management

Always prefer persistent for repeated interactions.

Multiplexing

HTTP/1.1 (no multiplexing):
  Connection 1: Request A ─────> Response A
  Connection 2: Request B ─────> Response B
  (Need multiple connections for parallelism)

HTTP/2 (multiplexing):
  Connection 1: [A][B][C]───>[A][B][C]
  (All requests on one connection)

Multiplexing reduces:
  - Connection overhead
  - Memory usage
  - Head-of-line blocking at the HTTP layer
    (TCP-level blocking remains; QUIC removes that too)

Batching and Pipelining

Individual requests:
  Request 1 → Response 1 → Request 2 → Response 2
  Time: 2 RTTs for 2 requests

Pipelining:
  Request 1 → Request 2 → Response 1 → Response 2
  Time: ~1 RTT for 2 requests

Batching:
  [Request 1, Request 2] → [Response 1, Response 2]
  Time: 1 RTT for 2 requests

Trade-off: Batching adds latency for first item.

Compression

When to compress:
  ✓ Large text payloads (JSON, HTML)
  ✓ Repeated patterns in data
  ✓ Slow/metered networks

When not to compress:
  ✗ Already compressed (images, video)
  ✗ Tiny messages (overhead > savings)
  ✗ CPU-constrained environments

Common algorithms:
  gzip:   Universal, good compression
  br:     Better ratio, slower
  zstd:   Fast, good ratio (emerging)
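
The do/don't lists above are easy to demonstrate with the stdlib gzip module: repetitive JSON shrinks dramatically, while random bytes (standing in for already-compressed media) only grow:

```python
import gzip
import json
import os

text = json.dumps([{"id": i, "status": "ok"} for i in range(1000)]).encode()
noise = os.urandom(len(text))  # stand-in for already-compressed data

print(len(text), "->", len(gzip.compress(text)))    # large win
print(len(noise), "->", len(gzip.compress(noise)))  # slightly larger than input
```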

Caching

Cacheable responses reduce load:

Without caching:
  Every request → Server processing → Response

With caching:
  First request → Server → Response (cached)
  Repeat requests → Cache hit → Immediate response

Design for cacheability:
  - Stable URLs for same content
  - Proper cache headers
  - ETags for validation
  - Separate static/dynamic content
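
ETag validation, for example, lets a repeat request skip the response body entirely. A minimal sketch of the server side (the hash choice and helper names are illustrative):

```python
import hashlib

def make_etag(body: bytes) -> str:
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(body: bytes, if_none_match: str = None):
    tag = make_etag(body)
    if if_none_match == tag:
        return 304, b"", tag   # client's cached copy is still valid
    return 200, body, tag      # full response, with validator attached

status, _, tag = respond(b"<html>hello</html>")
print(status)  # 200 -- first request sends the body
print(respond(b"<html>hello</html>", if_none_match=tag)[0])  # 304 -- no body
```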

Measurement

Measure before optimizing:

Latency metrics:
  - P50, P95, P99 response times
  - Time to first byte (TTFB)
  - Round-trip time

Throughput metrics:
  - Requests per second
  - Bytes per second
  - Messages per connection

Tools:
  - wrk, ab (HTTP benchmarking)
  - tcpdump, wireshark (packet analysis)
  - perf, flamegraphs (CPU profiling)

Summary

Performance optimization priorities:

  1. Reduce round trips (biggest impact)
  2. Use persistent connections
  3. Choose appropriate message format
  4. Enable compression for large text
  5. Implement caching where possible
  6. Batch when latency allows

Measure, then optimize. Premature optimization is the root of all evil, but informed optimization is essential.

Real-World Patterns

Production systems use additional infrastructure beyond basic protocols. This chapter covers common patterns for scaling, reliability, and performance.

Infrastructure Components

┌─────────────────────────────────────────────────────────────────────┐
│                    Modern Web Architecture                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  User ──> CDN ──> Load Balancer ──> App Servers ──> Database        │
│            │           │                 │                          │
│            │           │                 └── Cache (Redis)          │
│            │           │                                            │
│            │           └── WAF (Web Application Firewall)           │
│            │                                                        │
│            └── Edge cache, DDoS protection                          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Each component serves specific purposes:
  CDN:           Cache static content near users
  Load Balancer: Distribute traffic, health checks
  WAF:           Security filtering
  App Servers:   Business logic
  Cache:         Fast data access
  Database:      Persistent storage

Key Topics

  1. Load Balancing: Distributing traffic across servers
  2. Proxies: Forward and reverse proxies
  3. CDNs: Content delivery at scale
  4. Connection Pooling: Efficient resource usage

Load Balancing

Load balancers distribute traffic across multiple servers, improving availability and performance. Understanding load balancing helps you design scalable systems.

Why Load Balance?

Without load balancing:
  All traffic → Single server
  Problems: Single point of failure, limited capacity

With load balancing:
  Traffic → Load Balancer → Multiple servers
  Benefits: Redundancy, scalability, maintenance flexibility

Load Balancing Algorithms

Round Robin

Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (repeat)

Pros: Simple, even distribution
Cons: Ignores server capacity, session state

Weighted Round Robin

Server A (weight 3): Gets 3x traffic
Server B (weight 1): Gets 1x traffic

Request pattern: A, A, A, B, A, A, A, B, ...

Use case: Servers with different capacities

Least Connections

Route to server with fewest active connections.

Server A: 10 connections
Server B: 5 connections
Server C: 8 connections

New request → Server B

Better for variable request durations.

IP Hash

Hash(Client IP) → Server selection

Same client always hits same server.
Useful for session affinity without cookies.

hash("192.168.1.100") % 3 = Server B

Least Response Time

Route to server with fastest response.

Combines: Connection count + response time
Best for: Heterogeneous backends
Requires: Active health monitoring
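
The selection strategies above each fit in a few lines. A Python sketch (server names and connection counts are illustrative):

```python
import itertools

servers = ["A", "B", "C"]

# Round robin: cycle through servers in order.
rr = itertools.cycle(servers)
print([next(rr) for _ in range(4)])  # ['A', 'B', 'C', 'A']

# Weighted round robin: repeat entries according to weight (A:3, B:1).
wrr = itertools.cycle(["A"] * 3 + ["B"])
print([next(wrr) for _ in range(8)])  # ['A', 'A', 'A', 'B', 'A', 'A', 'A', 'B']

# Least connections: pick the backend with the fewest active connections.
active = {"A": 10, "B": 5, "C": 8}
print(min(active, key=active.get))  # B

# IP hash: the same client maps to the same server. A real balancer
# uses a stable hash; Python's hash() varies between processes.
def pick(ip: str) -> str:
    return servers[hash(ip) % len(servers)]

assert pick("192.168.1.100") == pick("192.168.1.100")
```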

Layer 4 vs Layer 7

Layer 4 (Transport):
  - Routes based on IP/port
  - Faster (less inspection)
  - Protocol-agnostic
  - No content-based routing

Layer 7 (Application):
  - Routes based on content (URL, headers, cookies)
  - Can modify requests/responses
  - SSL termination
  - More flexible, more overhead

Example Layer 7 rules:
  /api/*     → API servers
  /static/*  → CDN
  /admin/*   → Admin servers

Health Checks

Load balancer monitors backends:

Active checks:
  - Periodic HTTP requests to /health
  - TCP connection attempts
  - Custom scripts

Passive checks:
  - Monitor real request success/failure
  - Track response times

Unhealthy server:
  - Remove from rotation
  - Continue checking
  - Return when healthy
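
A passive check can be as simple as tripping a backend out of rotation after consecutive failures. A minimal sketch (the threshold of 3 is illustrative):

```python
class BackendHealth:
    """Mark a backend unhealthy after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1

    @property
    def healthy(self) -> bool:
        return self.failures < self.threshold

b = BackendHealth()
for ok in [True, False, False, False]:  # three failures in a row
    b.record(ok)
print(b.healthy)  # False -- removed from rotation

b.record(True)    # one success resets the streak
print(b.healthy)  # True -- back in rotation
```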

Session Persistence

Problem: User state spread across servers
  Login on Server A
  Next request hits Server B
  "Please login again" 😞

Solutions:

Sticky Sessions (affinity):
  Set-Cookie: SERVERID=A
  Load balancer routes by cookie

Shared Session Store:
  All servers use Redis/Memcached for sessions
  Any server can handle any request

Stateless Design:
  JWT tokens contain user state
  No server-side session needed (best!)
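
The stateless option can be sketched with nothing but the stdlib: an HMAC-signed token that any server can verify without a shared session store. This is a toy, not a real JWT (no expiry, hard-coded demo secret):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # in production: a key from a secret store

def issue(payload: dict) -> str:
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def verify(token: str):
    body, _, sig = token.rpartition(".")
    good = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, good):
        return None  # tampered or foreign token
    return json.loads(base64.urlsafe_b64decode(body))

token = issue({"user": "alice"})
print(verify(token))        # {'user': 'alice'}
print(verify(token + "x"))  # None
```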

Common Load Balancers

Software:
  - HAProxy: High performance, Layer 4/7
  - nginx: Web server + load balancer
  - Envoy: Modern, service mesh focused
  - Traefik: Cloud-native, auto-discovery

Cloud:
  - AWS ALB/NLB: Layer 7/4
  - GCP Load Balancing: Global, anycast
  - Azure Load Balancer: Layer 4
  - Cloudflare: CDN + load balancing

Hardware (legacy):
  - F5 BIG-IP
  - Citrix NetScaler

Configuration Example (nginx)

upstream backend {
    least_conn;
    server 10.0.0.1:8080 weight=3;
    server 10.0.0.2:8080 weight=2;
    server 10.0.0.3:8080 backup;

    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }

    location /health {
        return 200 "OK";
    }
}

Proxies and Reverse Proxies

Proxies act as intermediaries in network communication. Understanding them helps you design secure architectures and debug connectivity issues.

Forward Proxy

Client-side proxy: Client → Proxy → Internet

┌─────────────────────────────────────────────────────────────────────┐
│                       Forward Proxy                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌────────┐       ┌───────┐       ┌──────────────┐                │
│   │ Client │──────>│ Proxy │──────>│   Internet   │                │
│   └────────┘       └───────┘       │   (Server)   │                │
│                        │           └──────────────┘                │
│                        │                                           │
│   Proxy hides client identity from server.                         │
│   Server sees proxy's IP, not client's.                            │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Use cases:
  - Corporate content filtering
  - Caching (reduce bandwidth)
  - Anonymity
  - Access control

Reverse Proxy

Server-side proxy: Internet → Reverse Proxy → Servers

┌─────────────────────────────────────────────────────────────────────┐
│                       Reverse Proxy                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌──────────────┐       ┌───────────┐       ┌────────┐            │
│   │   Internet   │──────>│  Reverse  │──────>│ Server │            │
│   │   (Client)   │       │   Proxy   │──────>│ Server │            │
│   └──────────────┘       └───────────┘──────>│ Server │            │
│                               │              └────────┘            │
│                               │                                    │
│   Clients don't know backend servers exist.                        │
│   Single entry point to multiple backends.                         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Use cases:
  - SSL termination
  - Load balancing
  - Caching
  - Compression
  - Security (hide backend)
  - A/B testing

Reverse Proxy Functions

SSL Termination

Client ──HTTPS──> Reverse Proxy ──HTTP──> Backend

Proxy handles TLS:
  - Certificate management in one place
  - Offloads crypto from backends
  - Backends get plain HTTP (simpler)
  - Internal traffic often trusted network

Request Routing

Based on URL path:
  /api/*     → API servers
  /images/*  → Image servers
  /          → Web servers

Based on header:
  Host: api.example.com → API servers
  Host: www.example.com → Web servers

Based on cookie:
  beta_user=true → Beta servers

Caching

Cache responses at proxy level:

Request 1: GET /logo.png
  Proxy → Backend → Response (cached at proxy)

Request 2: GET /logo.png
  Proxy → Cache hit → Response (no backend call)

Reduces backend load significantly.

Common Proxy Headers

X-Forwarded-For: Client IP (through proxy chain)
  X-Forwarded-For: 203.0.113.195, 70.41.3.18, 150.172.238.178

X-Forwarded-Proto: Original protocol
  X-Forwarded-Proto: https

X-Forwarded-Host: Original Host header
  X-Forwarded-Host: www.example.com

X-Real-IP: Single client IP (nginx convention)
  X-Real-IP: 203.0.113.195
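
Reading the client address out of X-Forwarded-For takes care: the client can forge the leftmost entries, so count back from the entries your own proxies appended. A sketch (the trusted-hop count is deployment-specific):

```python
def client_ip(x_forwarded_for: str, trusted_proxies: int = 1) -> str:
    """Entries are appended left-to-right by each proxy, so with N
    trusted proxies in front of you, the Nth entry from the right is
    the first address a trusted hop actually observed."""
    hops = [h.strip() for h in x_forwarded_for.split(",")]
    return hops[-trusted_proxies]

header = "203.0.113.195, 70.41.3.18, 150.172.238.178"
print(client_ip(header, trusted_proxies=1))  # 150.172.238.178
```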

Proxy Protocols

HTTP CONNECT (Forward Proxy)

Client → Proxy: CONNECT example.com:443 HTTP/1.1
Proxy → Client: HTTP/1.1 200 Connection Established
Client → (tunnel) → Server

Proxy creates TCP tunnel.
Used for HTTPS through forward proxies.

PROXY Protocol (Reverse Proxy)

HAProxy-style protocol:
  Passes original client info to backend.
  Binary or text header prepended to connection.

PROXY TCP4 192.168.1.1 10.0.0.1 56789 80\r\n
(Then normal HTTP traffic)

Backend sees real client IP.
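
Parsing the text (v1) form of that header is straightforward. A sketch for a backend that accepts the PROXY protocol:

```python
def parse_proxy_v1(line: bytes) -> dict:
    """Parse a v1 PROXY header: PROXY <proto> <src> <dst> <sport> <dport>."""
    if not line.startswith(b"PROXY "):
        raise ValueError("not a PROXY protocol header")
    _, proto, src, dst, sport, dport = line.rstrip(b"\r\n").decode().split(" ")
    return {"proto": proto, "src": (src, int(sport)), "dst": (dst, int(dport))}

hdr = parse_proxy_v1(b"PROXY TCP4 192.168.1.1 10.0.0.1 56789 80\r\n")
print(hdr["src"], hdr["dst"])  # ('192.168.1.1', 56789) ('10.0.0.1', 80)
```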

nginx Reverse Proxy Config

server {
    listen 443 ssl;
    server_name example.com;

    ssl_certificate /etc/nginx/cert.pem;
    ssl_certificate_key /etc/nginx/key.pem;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Debugging Through Proxies

# Check what proxy sees
$ curl -v -x http://proxy:8080 https://example.com

# See forwarded headers
$ curl -s https://httpbin.org/headers

# Trace proxy chain
$ curl -s https://httpbin.org/ip
# Returns visible IP (proxy's if through proxy)

CDNs

Content Delivery Networks (CDNs) cache content at edge locations worldwide, reducing latency by serving users from nearby servers.

How CDNs Work

┌─────────────────────────────────────────────────────────────────────┐
│                      CDN Architecture                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│           Origin Server (Your Server)                               │
│                    │                                                │
│        ┌──────────┼──────────┐                                      │
│        │          │          │                                      │
│        ▼          ▼          ▼                                      │
│   ┌────────┐ ┌────────┐ ┌────────┐                                  │
│   │  Edge  │ │  Edge  │ │  Edge  │                                  │
│   │  US    │ │  EU    │ │  Asia  │                                  │
│   └────┬───┘ └────┬───┘ └────┬───┘                                  │
│        │          │          │                                      │
│        ▼          ▼          ▼                                      │
│      Users      Users      Users                                    │
│                                                                     │
│   User requests → Nearest Edge → Cached response (fast!)            │
│   Cache miss → Edge fetches from Origin → Caches → Responds         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Benefits

Performance:
  - Reduced latency (content closer to users)
  - Faster page loads
  - Better user experience

Scalability:
  - Offload traffic from origin
  - Handle traffic spikes
  - Global reach without global infrastructure

Availability:
  - DDoS protection
  - Origin failover
  - Always-on edge presence

What to Put on CDN

Ideal for CDN:
  ✓ Static files (JS, CSS, images)
  ✓ Videos and media
  ✓ Downloadable files
  ✓ Public API responses (if cacheable)

Not for CDN (usually):
  ✗ User-specific content
  ✗ Real-time data
  ✗ Authenticated endpoints
  ✗ Frequently changing data

Cache Control

# Cache for 1 day, revalidate after
Cache-Control: public, max-age=86400, must-revalidate

# Cache for 1 year (immutable assets)
Cache-Control: public, max-age=31536000, immutable

# No caching
Cache-Control: no-store

# Private only (not CDN)
Cache-Control: private, max-age=3600

CDN Configuration Concepts

TTL (Time To Live):
  How long edge caches content
  Balance freshness vs. origin load

Cache Keys:
  What makes requests "same" for caching
  URL, headers, cookies, query strings

Purge/Invalidation:
  Force refresh of cached content
  By URL, tag, or entire cache

Edge Functions:
  Run code at edge (Cloudflare Workers, Lambda@Edge)
  Customize responses, A/B testing, auth

CDN Providers

Global CDNs:
  - Cloudflare: Free tier, security focus
  - Fastly: Real-time purging, edge compute
  - Akamai: Enterprise, largest network
  - AWS CloudFront: AWS integration
  - Google Cloud CDN: GCP integration

Specialized:
  - Bunny CDN: Cost-effective
  - KeyCDN: Simple pricing
  - imgix: Image optimization focus

Setting Up CDN

Basic setup:
  1. Sign up with CDN provider
  2. Configure origin (your server)
  3. Get CDN domain (e.g., cdn.example.com)
  4. Update DNS or reference CDN URLs
  5. Configure cache rules

DNS example:
  cdn.example.com CNAME example.cdn-provider.net

Or full site through CDN:
  example.com → CDN → origin.example.com

Debugging CDN

# Check cache status
$ curl -I https://cdn.example.com/image.png
X-Cache: HIT          # Served from edge
X-Cache: MISS         # Fetched from origin
CF-Cache-Status: HIT  # Cloudflare specific

# Check which edge served request
$ curl -I https://cdn.example.com/image.png
CF-RAY: 123abc-SJC    # Cloudflare San Jose

# Bypass cache
$ curl -H "Cache-Control: no-cache" https://cdn.example.com/image.png

Connection Pooling

Connection pooling reuses established connections instead of creating new ones for each request. This is essential for performance in database access, HTTP clients, and service communication.

Why Pool Connections?

Without pooling (connection per request):
  Request 1: [TCP handshake][TLS handshake][Query][Response][Close]
  Request 2: [TCP handshake][TLS handshake][Query][Response][Close]
  Request 3: [TCP handshake][TLS handshake][Query][Response][Close]

  Each request pays full connection overhead!

With pooling (reuse connections):
  [TCP][TLS] ← Once
  Request 1: [Query][Response]
  Request 2: [Query][Response]
  Request 3: [Query][Response]

  Connection overhead paid once, amortized across requests.

Performance Impact

Connection setup costs:
  TCP handshake:  1 RTT (~50ms intercontinental)
  TLS handshake:  1-2 RTT (~50-100ms)
  Auth/setup:     Varies

Without pooling (~150ms setup per connection):
  1000 requests × 150ms setup = 150 seconds in overhead alone!

With pooling:
  10 connections × 150ms setup = 1.5 seconds
  Requests reuse existing connections.

10-100x improvement in connection overhead.
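
The arithmetic above, expressed as a sketch (the function is illustrative; times are in milliseconds):

```python
def setup_overhead_ms(requests, setup_ms, pool_size=None):
    """Total time spent on connection setup; pool_size=None means no pooling."""
    # Without a pool, every request dials its own connection.
    connections = requests if pool_size is None else pool_size
    return connections * setup_ms

no_pool = setup_overhead_ms(1000, 150)               # 150,000 ms = 150 s
pooled = setup_overhead_ms(1000, 150, pool_size=10)  # 1,500 ms = 1.5 s
```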

Pool Configuration

Key parameters:

Min connections:
  Connections kept open even when idle.
  Ready for immediate use.

Max connections:
  Upper limit on concurrent connections.
  Prevents resource exhaustion.

Idle timeout:
  Close connections unused for this long.
  Frees resources, reduces stale connections.

Max lifetime:
  Close connections older than this.
  Prevents issues with long-lived connections.

Connection timeout:
  How long to wait for new connection.
  Fails fast if pool exhausted.
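
A minimal pool wiring some of these parameters together. This is a teaching sketch (min == max connections, no idle timeout or max lifetime); real pools such as SQLAlchemy's or HikariCP's handle far more:

```python
import queue

class Pool:
    """Fixed-size connection pool sketch built on a thread-safe queue."""

    def __init__(self, factory, max_size=5, borrow_timeout=30.0):
        self._borrow_timeout = borrow_timeout
        self._idle = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(factory())      # pre-create the whole pool

    def acquire(self):
        # Blocks up to borrow_timeout, then raises queue.Empty:
        # fail fast on exhaustion instead of queueing forever.
        return self._idle.get(timeout=self._borrow_timeout)

    def release(self, conn):
        self._idle.put(conn)               # return the connection for reuse
```

`factory` is whatever dials a real connection (a DB driver's `connect`, for instance); here it can be any callable.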

Database Connection Pooling

# Python with SQLAlchemy
from sqlalchemy import create_engine, text

engine = create_engine(
    'postgresql://user:pass@host/db',
    pool_size=5,          # Maintained connections
    max_overflow=10,      # Extra connections allowed
    pool_timeout=30,      # Wait for connection
    pool_recycle=3600,    # Recreate after 1 hour
    pool_pre_ping=True    # Test connections before use
)

# Each request borrows from pool
with engine.connect() as conn:
    result = conn.execute(text("SELECT 1"))
# Connection returned to pool automatically

HTTP Connection Pooling

# Python requests with session (pooled)
import requests

# BAD: New connection per request
for url in urls:
    response = requests.get(url)  # New connection each time

# GOOD: Reuse connections via session
session = requests.Session()
adapter = requests.adapters.HTTPAdapter(
    pool_connections=10,
    pool_maxsize=20,
    max_retries=3
)
session.mount('https://', adapter)

for url in urls:
    response = session.get(url)  # Reuses connections

Common Pooling Issues

Pool Exhaustion

All connections in use, new requests must wait.

Symptoms:
  - Requests timeout waiting for connection
  - "Connection pool exhausted" errors
  - Latency spikes

Solutions:
  - Increase pool size
  - Reduce connection hold time
  - Add timeouts for borrowing
  - Monitor pool usage

Connection Leaks

Connections borrowed but never returned.

Causes:
  - Exception before returning connection
  - Forgot to close/return connection
  - Infinite loop holding connection

Solutions:
  - Always use try-finally or context managers
  - Set connection timeouts
  - Monitor active vs. available connections
  - Implement leak detection
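
The context-manager pattern guarantees the return path even when the body raises. A sketch (the `pool` object is assumed to expose `acquire()`/`release()` methods):

```python
from contextlib import contextmanager

@contextmanager
def borrowed(pool):
    """Borrow a connection and guarantee it returns to the pool."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)   # runs even if the body raises

# Usage: the connection cannot leak.
# with borrowed(pool) as conn:
#     conn.query("SELECT 1")
```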

Stale Connections

Connection in pool is dead (server closed it).

Causes:
  - Server timeout (closed idle connection)
  - Network issue
  - Server restart

Solutions:
  - Connection validation before use (pool_pre_ping)
  - Maximum connection lifetime
  - Proper error handling with retry
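
The validate-before-use idea (what SQLAlchemy calls pre-ping) can be sketched like this. `is_alive` and `factory` are stand-ins for a real driver's ping and connect calls:

```python
def checkout_validated(pool, factory):
    """Borrow a connection, replacing it if the server silently dropped it."""
    conn = pool.acquire()
    if not conn.is_alive():   # cheap probe, e.g. "SELECT 1"
        conn = factory()      # discard the stale connection, dial a fresh one
    return conn
```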

Pool Sizing Guidelines

Too few connections:
  - Requests queue up
  - Increased latency
  - Underutilized backend

Too many connections:
  - Memory waste
  - May exceed server limits
  - Connection thrashing

Starting point:
  connections = (requests_per_second × avg_request_duration) × 1.5

Example:
  100 req/s × 0.1s duration = 10 concurrent
  Pool size: 15 (10 × 1.5)

Adjust based on monitoring!
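
The starting-point formula as a one-liner (Little's law for concurrency, plus 50% headroom; the function name is illustrative):

```python
def pool_size(req_per_sec, avg_duration_s, headroom=1.5):
    """Suggested pool size from throughput and request duration."""
    concurrent = req_per_sec * avg_duration_s   # Little's law: L = lambda * W
    return round(concurrent * headroom)
```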

Monitoring Pool Health

Key metrics:
  - Active connections (in use)
  - Idle connections (available)
  - Wait time for connections
  - Connection creation rate
  - Timeout/exhaustion errors

Alerts:
  - Pool utilization > 80% sustained
  - Connection wait time > threshold
  - Pool exhaustion events
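
The alert conditions above can be encoded directly. A sketch using the thresholds suggested in the text (the function and its parameters are illustrative):

```python
def pool_alerts(active, idle, wait_ms, wait_threshold_ms=100.0):
    """Return a list of firing alerts for a pool-health snapshot."""
    alerts = []
    total = active + idle
    if total and active / total > 0.8:          # sustained high utilization
        alerts.append("utilization > 80%")
    if wait_ms > wait_threshold_ms:             # requests queueing for connections
        alerts.append("connection wait above threshold")
    return alerts
```

In practice you would feed these from pool metrics (active/idle counts, wait-time percentiles) and alert only when a condition holds over a window, not on a single sample.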

Conclusion

You’ve journeyed through the layers of network protocols that power the internet. From the fundamentals of OSI and TCP/IP to the cutting edge of HTTP/3 and QUIC, you now have a comprehensive understanding of how computers communicate.

Key Takeaways

The Layered Architecture Works

The genius of network layering:
  - Each layer has a specific job
  - Layers can evolve independently
  - Complexity is manageable
  - Interoperability is possible

Application ──────────────────────────────────
Transport ────────────────────────────────────
Network ──────────────────────────────────────
Link ─────────────────────────────────────────

This structure has served us for 50 years and counting.

Trade-offs Are Everywhere

Reliability vs. Latency:
  TCP: Reliable, higher latency
  UDP: Fast, no guarantees
  QUIC: Best of both (complex)

Simplicity vs. Performance:
  HTTP/1.1: Simple, limited parallelism
  HTTP/2: Complex, highly parallel

Security vs. Speed:
  Full TLS: Secure, connection overhead
  0-RTT: Fast, replay risks

No perfect choice—understand your requirements.

Evolution Never Stops

1991: HTTP/0.9 (simple document retrieval)
2022: HTTP/3 + QUIC (multiplexed, encrypted, mobile-ready)

IPv4 → IPv6 (ongoing)
TLS 1.2 → TLS 1.3 (largely complete)
TCP → QUIC (emerging)

The protocols will continue to evolve.
The fundamentals you've learned will help you adapt.

Applying Your Knowledge

As a Developer

  • Choose the right protocol for your use case
  • Configure connections efficiently (pooling, keep-alive)
  • Implement proper error handling and retries
  • Understand timeout behavior
  • Consider security at every layer

As a Debugger

  • Use tools: tcpdump, Wireshark, curl, dig
  • Understand what each layer provides
  • Know where to look for different problems
  • Read packet captures with confidence

As an Architect

  • Design for resilience (multiple layers of redundancy)
  • Plan for scale (load balancing, CDNs)
  • Consider latency in distributed systems
  • Stay current with protocol evolution

Keep Learning

Protocols not covered in depth:
  - gRPC and Protocol Buffers
  - GraphQL
  - MQTT and IoT protocols
  - BGP and routing details
  - IPsec and VPNs
  - SIP and VoIP

Resources for continued learning:
  - RFCs (the definitive specifications)
  - Wireshark packet analysis
  - Building your own implementations
  - Production system observation

Final Thought

Networks are the invisible infrastructure connecting billions of devices. Every API call, every web page, every video stream relies on the protocols covered in this book. Understanding them makes you a more effective developer—one who can debug the mysterious, optimize the slow, and design the robust.

The internet is a marvel of human collaboration and engineering. Now you understand how it works.

Happy networking!

Appendix: Tools and Debugging

This appendix covers essential tools for network debugging, packet analysis, and protocol troubleshooting.

Command Line Tools

curl - HTTP Client

# Basic GET request
$ curl https://example.com

# Verbose output (see headers, TLS handshake)
$ curl -v https://example.com

# Show only response headers
$ curl -I https://example.com

# POST with JSON
$ curl -X POST https://api.example.com/data \
  -H "Content-Type: application/json" \
  -d '{"key": "value"}'

# Follow redirects
$ curl -L https://example.com

# Save response to file
$ curl -o output.html https://example.com

# Show timing breakdown
$ curl -w "@curl-timing.txt" -o /dev/null -s https://example.com

# Custom timing format
$ curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" \
  -o /dev/null -s https://example.com

dig - DNS Queries

# Query A record
$ dig example.com

# Query specific record type
$ dig example.com AAAA
$ dig example.com MX
$ dig example.com TXT

# Use specific DNS server
$ dig @8.8.8.8 example.com

# Short output
$ dig +short example.com

# Trace resolution path
$ dig +trace example.com

# Reverse lookup
$ dig -x 93.184.216.34

# Show all records (many servers now refuse ANY queries per RFC 8482)
$ dig example.com ANY

nslookup - DNS Lookup (Alternative)

# Basic lookup
$ nslookup example.com

# Specify record type
$ nslookup -type=MX example.com

# Use specific server
$ nslookup example.com 8.8.8.8

netstat / ss - Network Connections

# Show all TCP connections
$ netstat -ant     # Linux/Mac
$ ss -ant          # Linux (faster)

# Show listening ports
$ netstat -tlnp    # Linux
$ ss -tlnp         # Linux

# Show UDP sockets
$ ss -u

# Show process using port
$ ss -tlnp | grep :8080
$ lsof -i :8080    # Mac/Linux

tcpdump - Packet Capture

# Capture all traffic on interface
$ sudo tcpdump -i eth0

# Capture specific port
$ sudo tcpdump -i eth0 port 80

# Capture specific host
$ sudo tcpdump -i eth0 host 192.168.1.100

# Save to file (for Wireshark)
$ sudo tcpdump -i eth0 -w capture.pcap

# Read from file
$ tcpdump -r capture.pcap

# Show packet contents (ASCII)
$ sudo tcpdump -i eth0 -A port 80

# Show packet contents (hex + ASCII)
$ sudo tcpdump -i eth0 -X port 80

# Capture only TCP SYN packets
$ sudo tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0'

# Capture DNS queries
$ sudo tcpdump -i eth0 port 53

ping - Connectivity Test

# Basic ping
$ ping example.com

# Specify count
$ ping -c 4 example.com

# Set interval
$ ping -i 0.5 example.com

# Set packet size
$ ping -s 1000 example.com

# IPv6 ping
$ ping6 example.com

traceroute - Path Discovery

# Trace route to destination
$ traceroute example.com

# Use ICMP (like ping)
$ traceroute -I example.com    # Linux/Mac (UDP probes are the default)

# Use TCP
$ traceroute -T -p 80 example.com

# Use UDP (default on Linux)
$ traceroute -U example.com

mtr - Combined Ping + Traceroute

# Interactive mode
$ mtr example.com

# Report mode (run 10 times, output)
$ mtr -r -c 10 example.com

# Show IP addresses only
$ mtr -n example.com

openssl - TLS/SSL Testing

# Connect and show certificate
$ openssl s_client -connect example.com:443

# Show certificate details
$ openssl s_client -connect example.com:443 </dev/null 2>/dev/null | \
  openssl x509 -text -noout

# Check certificate expiration
$ openssl s_client -connect example.com:443 </dev/null 2>/dev/null | \
  openssl x509 -noout -dates

# Test specific TLS version
$ openssl s_client -connect example.com:443 -tls1_2
$ openssl s_client -connect example.com:443 -tls1_3

# Show supported ciphers
$ openssl s_client -connect example.com:443 -cipher 'ALL' 2>&1 | \
  grep "Cipher is"

nc (netcat) - TCP/UDP Tool

# Connect to port
$ nc example.com 80

# Listen on port
$ nc -l 8080

# Send UDP packet
$ echo "test" | nc -u 192.168.1.1 53

# Port scanning
$ nc -zv example.com 20-25

# Transfer file
$ nc -l 8080 > received.txt       # Receiver
$ nc host 8080 < file.txt          # Sender

Wireshark

Wireshark is the standard GUI tool for packet analysis.

Capture Filters (BPF Syntax)

# Capture specific host
host 192.168.1.100

# Capture specific port
port 80

# Capture range of ports
portrange 8000-9000

# Capture TCP only
tcp

# Combine filters
host 192.168.1.100 and port 443
tcp and not port 22

Display Filters

# Filter by IP
ip.addr == 192.168.1.100
ip.src == 192.168.1.100
ip.dst == 10.0.0.1

# Filter by port
tcp.port == 80
tcp.dstport == 443

# Filter by protocol
http
dns
tls
tcp
udp

# HTTP specific
http.request.method == "GET"
http.response.code == 200

# TCP flags
tcp.flags.syn == 1
tcp.flags.fin == 1
tcp.flags.reset == 1

# TLS specific
tls.handshake.type == 1    # Client Hello
tls.handshake.type == 2    # Server Hello

# DNS specific
dns.qry.name == "example.com"

# Combine filters
ip.addr == 192.168.1.100 && tcp.port == 443
http.request || http.response

Useful Wireshark Features

Follow TCP Stream:
  Right-click packet → Follow → TCP Stream
  Shows complete conversation in readable format

Flow Graph:
  Statistics → Flow Graph
  Visualizes packet flow between hosts

Protocol Hierarchy:
  Statistics → Protocol Hierarchy
  Shows breakdown of protocols in capture

Expert Info:
  Analyze → Expert Information
  Highlights anomalies, retransmissions, errors

I/O Graph:
  Statistics → I/O Graph
  Visualizes traffic over time

HTTP-Specific Tools

httpie - Modern HTTP Client

# GET request
$ http example.com

# POST with JSON (automatic)
$ http POST api.example.com/users name=john age:=25

# Custom headers
$ http example.com Authorization:"Bearer token123"

# Form data
$ http -f POST example.com/login user=john pass=secret

wget - Download Tool

# Download file
$ wget https://example.com/file.zip

# Continue interrupted download
$ wget -c https://example.com/large-file.zip

# Mirror website
$ wget -m https://example.com

# Download with custom filename
$ wget -O output.zip https://example.com/file.zip

ab (Apache Bench) - Load Testing

# 1000 requests, 10 concurrent
$ ab -n 1000 -c 10 https://example.com/

# With keep-alive
$ ab -n 1000 -c 10 -k https://example.com/

wrk - Modern Load Testing

# 30 second test, 12 threads, 400 connections
$ wrk -t12 -c400 -d30s https://example.com/

# With Lua script for custom requests
$ wrk -t12 -c400 -d30s -s script.lua https://example.com/

Debugging Common Issues

Connection Refused

$ curl https://example.com:8080
curl: (7) Failed to connect: Connection refused

Causes:
  - Service not running
  - Wrong port
  - Firewall blocking

Debug:
  $ ss -tlnp | grep 8080         # Is anything listening?
  $ sudo iptables -L -n           # Check firewall
  $ systemctl status service      # Check service

Connection Timeout

$ curl --connect-timeout 5 https://example.com
curl: (28) Connection timed out

Causes:
  - Host unreachable
  - Firewall dropping packets (not rejecting)
  - Network routing issue

Debug:
  $ ping example.com              # Basic connectivity
  $ traceroute example.com        # Where does it stop?
  $ tcpdump -i eth0 host example.com  # See outgoing packets

DNS Resolution Failure

$ curl https://example.com
curl: (6) Could not resolve host: example.com

Debug:
  $ dig example.com               # Query DNS directly
  $ dig @8.8.8.8 example.com      # Try different DNS
  $ cat /etc/resolv.conf          # Check DNS config

TLS/SSL Errors

$ curl https://example.com
curl: (60) SSL certificate problem

Debug:
  $ openssl s_client -connect example.com:443
  # Check for:
  #   - Certificate chain
  #   - Expiration date
  #   - Common name / SAN matching

  $ curl -v https://example.com 2>&1 | grep -i ssl

Slow Connections

Debug with timing:
  $ curl -w "DNS: %{time_namelookup}s
  TCP: %{time_connect}s
  TLS: %{time_appconnect}s
  TTFB: %{time_starttransfer}s
  Total: %{time_total}s\n" -o /dev/null -s https://example.com

High DNS time: DNS resolver issue
High TCP time: Network latency
High TLS time: TLS negotiation slow
High TTFB: Server processing slow
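
Since curl reports these timings cumulatively, a small helper can turn them into per-phase durations so the slow phase stands out. The field names match curl's `-w` variables; the function itself is an illustrative sketch and assumes an HTTPS request (time_appconnect is 0 for plain HTTP):

```python
def phase_durations(t):
    """Convert curl's cumulative -w timings into per-phase durations (seconds)."""
    return {
        "dns":      t["time_namelookup"],
        "tcp":      t["time_connect"] - t["time_namelookup"],
        "tls":      t["time_appconnect"] - t["time_connect"],
        "server":   t["time_starttransfer"] - t["time_appconnect"],
        "transfer": t["time_total"] - t["time_starttransfer"],
    }

timings = {"time_namelookup": 0.02, "time_connect": 0.07,
           "time_appconnect": 0.15, "time_starttransfer": 0.65,
           "time_total": 0.70}
durations = phase_durations(timings)
slowest = max(durations, key=durations.get)   # "server": origin is the bottleneck here
```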

Quick Reference

┌────────────────────────────────────────────────────────────────┐
│                    Tool Quick Reference                        │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  What you need               Tool to use                       │
│  ─────────────────────────────────────────────────────────     │
│  HTTP debugging              curl -v, httpie                   │
│  DNS lookup                  dig, nslookup                     │
│  Connectivity test           ping, nc                          │
│  Path tracing                traceroute, mtr                   │
│  Port checking               ss, netstat, lsof                 │
│  Packet capture              tcpdump, Wireshark                │
│  TLS/Certificate check       openssl s_client                  │
│  Load testing                ab, wrk                           │
│  File download               curl, wget                        │
│                                                                │
└────────────────────────────────────────────────────────────────┘