Introduction
Welcome to Networking Protocols: A Developer’s Guide. This book is designed to give you a deep understanding of how computers communicate over networks—knowledge that will make you a more effective developer, debugger, and system designer.
Why Learn Network Protocols?
Every time you make an API call, load a webpage, or send a message, dozens of protocols work together to make it happen. Yet most developers treat networking as a black box. Understanding what happens beneath the surface gives you:
- Better debugging skills: When things go wrong, you’ll know where to look
- Informed architecture decisions: Choose the right protocol for your use case
- Performance optimization: Understand why things are slow and how to fix them
- Security awareness: Know what protections exist and their limitations
What This Book Covers
┌─────────────────────────────────────────────────────────────┐
│ Your Application │
├─────────────────────────────────────────────────────────────┤
│ HTTP/2 │ WebSocket │ DNS │ SMTP │ Custom Protocol │
├─────────────────────────────────────────────────────────────┤
│ TLS/SSL (Security) │
├─────────────────────────────────────────────────────────────┤
│ TCP │ UDP │
├─────────────────────────────────────────────────────────────┤
│ IP (IPv4 / IPv6) │
├─────────────────────────────────────────────────────────────┤
│ Network Interface │
└─────────────────────────────────────────────────────────────┘
The Protocol Stack
We’ll work through this stack from bottom to top:
- Foundations: The conceptual models that organize network communication
- IP Layer: How data finds its way across the internet
- Transport Layer: TCP and UDP—reliability vs. speed
- Security Layer: TLS and how encryption protects your data
- Application Layer: HTTP, DNS, WebSockets, and more
- Real-World Patterns: Load balancing, CDNs, and production concerns
How to Read This Book
This book is structured to be read sequentially, with each chapter building on previous concepts. However, if you’re already familiar with networking basics, feel free to jump to specific topics that interest you.
Throughout the book, you’ll find:
- ASCII diagrams illustrating packet structures and protocol flows
- Code examples showing practical implementations
- “Deep Dive” sections for those who want extra detail
- “In Practice” sections with real-world tips and gotchas
Prerequisites
You should be comfortable with:
- Basic programming concepts
- Command-line usage
- Reading simple code examples (we use Python and pseudocode)
No prior networking knowledge is required—we’ll build everything from the ground up.
A Note on Diagrams
Network protocols are inherently visual—packets flow, handshakes happen, connections open and close. We use ASCII diagrams extensively because they:
- Work everywhere (including terminals and plain text)
- Force clarity (no hiding complexity behind pretty graphics)
- Are easy to reproduce and modify
For example, here’s a TCP three-way handshake:
Client                                 Server
  │                                     │
  │─────────────── SYN ────────────────>│
  │                                     │
  │<───────────── SYN-ACK ─────────────│
  │                                     │
  │─────────────── ACK ────────────────>│
  │                                     │
  │       Connection Established        │
Get used to reading diagrams like this—they’ll appear throughout the book.
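You can watch the handshake happen from code. In the following sketch, Python's standard socket module connects to a throwaway local listener; `connect()` returns only after the SYN/SYN-ACK/ACK exchange shown above has completed inside the OS:

```python
import socket

# A throwaway local connection: the OS performs the three-way
# handshake during connect()/accept().
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))         # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))     # handshake happens here
server_side, _addr = listener.accept()  # connection is now established

client.sendall(b"hello")
data = server_side.recv(5)
print(data)                             # b'hello'

client.close()
server_side.close()
listener.close()
```

Note that the application never sees the SYN or ACK packets; the handshake is entirely the kernel's job.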
Let’s Begin
Networks are fascinating. They’re the invisible infrastructure that connects billions of devices, enabling everything from casual browsing to global financial systems. Understanding how they work isn’t just academically interesting—it’s practically valuable.
Let’s start with the fundamentals.
Network Fundamentals
Before diving into specific protocols, we need to establish a common vocabulary and conceptual framework. This chapter covers the foundational concepts that everything else builds upon.
What Is a Protocol?
A protocol is a set of rules that govern how two parties communicate. In networking, protocols define:
- Message format: What does the data look like?
- Message semantics: What does each field mean?
- Timing: When should messages be sent?
- Error handling: What happens when things go wrong?
Think of protocols like human languages—they’re agreements that allow parties to understand each other. Just as you can’t have a conversation if one person speaks English and the other speaks Mandarin (without translation), computers can’t communicate without agreeing on a protocol.
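To make the four elements concrete, here is a toy protocol sketch. Format: a 1-byte message type plus a 4-byte big-endian length, then the payload. Semantics: type 1 means "text". Error handling: unknown types are rejected. (The names and layout are invented for illustration.)

```python
import struct

def encode(msg_type: int, payload: bytes) -> bytes:
    # Format: 1-byte type + 4-byte big-endian length + payload
    return struct.pack("!BI", msg_type, len(payload)) + payload

def decode(data: bytes) -> tuple[int, bytes]:
    msg_type, length = struct.unpack("!BI", data[:5])
    if msg_type != 1:                        # error handling: reject unknowns
        raise ValueError("unknown message type")
    return msg_type, data[5:5 + length]

wire = encode(1, b"hello")
print(decode(wire))   # (1, b'hello')
```

Both sides must agree on every detail: byte order, field sizes, and what each type number means. That agreement is the protocol.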
The Need for Layering
Early networks were monolithic—each application had to handle everything from electrical signals to data formatting. This was:
- Inflexible: Changing one thing meant changing everything
- Duplicative: Every application reimplemented the same logic
- Error-prone: More code means more bugs
The solution was layering: dividing responsibilities into distinct layers, each with a specific job. This is the fundamental insight that makes modern networking possible.
┌─────────────────────────────────────────────┐
│ Application Layer │
│ "What data do we want to send?" │
├─────────────────────────────────────────────┤
│ Transport Layer │
│ "How do we ensure reliable delivery?" │
├─────────────────────────────────────────────┤
│ Network Layer │
│ "How does data find its destination?" │
├─────────────────────────────────────────────┤
│ Link Layer │
│ "How do bits travel on the physical wire?"│
└─────────────────────────────────────────────┘
Each layer:
- Provides services to the layer above
- Uses services from the layer below
- Has no knowledge of layers beyond its immediate neighbors
This separation of concerns is powerful. You can change the physical network (switch from Ethernet to WiFi) without touching your application. You can change applications without affecting how packets are routed.
What You’ll Learn
In this chapter, we’ll cover:
- The OSI Model: The theoretical seven-layer reference model
- The TCP/IP Stack: The practical four-layer model the internet actually uses
- Encapsulation: How data is wrapped and unwrapped as it moves through layers
- Ports and Sockets: How multiple applications share a single network connection
These concepts form the foundation for everything that follows.
The OSI Model
The Open Systems Interconnection (OSI) model is a conceptual framework that standardizes how network communication should be organized. Created by the International Organization for Standardization (ISO) in 1984, it divides networking into seven distinct layers.
The Seven Layers
┌───────────────────────────────────────────────────────────────┐
│ Layer 7: Application │ HTTP, FTP, SMTP, DNS │
├───────────────────────────────────────────────────────────────┤
│ Layer 6: Presentation │ Encryption, Compression, Format │
├───────────────────────────────────────────────────────────────┤
│ Layer 5: Session │ Session Management, RPC │
├───────────────────────────────────────────────────────────────┤
│ Layer 4: Transport │ TCP, UDP │
├───────────────────────────────────────────────────────────────┤
│ Layer 3: Network │ IP, ICMP, Routing │
├───────────────────────────────────────────────────────────────┤
│ Layer 2: Data Link │ Ethernet, WiFi, MAC addresses │
├───────────────────────────────────────────────────────────────┤
│ Layer 1: Physical │ Cables, Radio waves, Voltages │
└───────────────────────────────────────────────────────────────┘
Layer 1: Physical Layer
The physical layer deals with the actual transmission of raw bits over a physical medium.
Responsibilities:
- Defining physical connectors and cables
- Encoding bits as electrical signals, light pulses, or radio waves
- Specifying transmission rates (bandwidth)
- Managing physical topology (how devices connect)
Examples:
- Ethernet cables (Cat5, Cat6)
- Fiber optic cables
- WiFi radio signals
- USB connections
What it looks like:
Bit stream: 10110010 01001101 11010010 ...
↓
Physical: ▁▁▔▔▁▔▁▁ ▁▔▁▁▔▔▁▔ ▔▔▁▔▁▁▔▁
(voltage levels on copper wire)
Layer 2: Data Link Layer
The data link layer handles communication between directly connected devices on the same network segment.
Responsibilities:
- Framing: Organizing bits into frames
- MAC (Media Access Control) addressing
- Error detection (not correction)
- Flow control between adjacent nodes
Key Concepts:
- MAC Address: A 48-bit hardware address (e.g., 00:1A:2B:3C:4D:5E)
- Frame: The unit of data at this layer
Ethernet Frame Structure:
┌──────────┬──────────┬──────┬─────────────────┬─────┐
│ Dest MAC │ Src MAC │ Type │ Payload │ FCS │
│ (6 bytes)│ (6 bytes)│(2 B) │ (46-1500 B) │(4 B)│
└──────────┴──────────┴──────┴─────────────────┴─────┘
FCS = Frame Check Sequence (error detection)
Layer 3: Network Layer
The network layer enables communication across different networks—it’s what makes “inter-networking” (the Internet) possible.
Responsibilities:
- Logical addressing (IP addresses)
- Routing packets between networks
- Fragmentation and reassembly
- Quality of Service (QoS)
Key Protocols:
- IP (Internet Protocol): The primary protocol
- ICMP (Internet Control Message Protocol): Error reporting and diagnostics
- ARP (Address Resolution Protocol): Maps IP to MAC addresses
Routing Decision:
Source: 192.168.1.100
Destination: 8.8.8.8
Is destination on local network? NO
↓
Send to default gateway (router)
↓
Router examines destination, forwards to next hop
↓
Process repeats until packet reaches destination
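The "is the destination local?" check above can be sketched with Python's standard ipaddress module; the /24 network here is an assumption based on the 192.168.1.x addresses:

```python
import ipaddress

# The local-or-gateway routing decision, in miniature.
local_net = ipaddress.ip_network("192.168.1.0/24")
src = ipaddress.ip_address("192.168.1.100")
dst = ipaddress.ip_address("8.8.8.8")

if dst in local_net:
    decision = "deliver directly on the local network"
else:
    decision = "send to the default gateway"

print(decision)   # send to the default gateway
```

Real hosts and routers do the same membership test against a routing table with many entries, picking the most specific matching prefix.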
Layer 4: Transport Layer
The transport layer provides end-to-end communication services, handling the complexities of reliable data transfer.
Responsibilities:
- Segmentation and reassembly
- Connection management
- Reliability (for TCP)
- Flow control
- Multiplexing via ports
Key Protocols:
- TCP (Transmission Control Protocol): Reliable, ordered delivery
- UDP (User Datagram Protocol): Fast, connectionless delivery
Port Multiplexing:
Single IP address, multiple applications:
IP: 192.168.1.100
├── Port 80: Web Server
├── Port 443: HTTPS Server
├── Port 22: SSH Server
└── Port 3000: Development Server
Layer 5: Session Layer
The session layer manages sessions—ongoing dialogues between applications.
Responsibilities:
- Establishing, maintaining, and terminating sessions
- Session checkpointing and recovery
- Synchronization
In Practice: This layer is often merged with the application layer in real implementations. Few protocols exist purely at this layer.
Examples:
- NetBIOS
- RPC (Remote Procedure Call)
- Session tokens in web applications (conceptually)
Layer 6: Presentation Layer
The presentation layer handles data representation—how information is formatted, encoded, and encrypted.
Responsibilities:
- Data translation between formats
- Encryption and decryption
- Compression and decompression
- Character encoding (ASCII, UTF-8)
In Practice: Like the session layer, this is often absorbed into the application layer. TLS can be considered a presentation layer protocol.
Examples:
- SSL/TLS (encryption)
- JPEG, GIF (image formatting)
- MIME types
Layer 7: Application Layer
The application layer is where network applications and their protocols operate. This is the layer developers interact with most directly.
Responsibilities:
- Providing network services to applications
- User authentication
- Resource sharing
Examples:
- HTTP/HTTPS (web)
- SMTP, POP3, IMAP (email)
- FTP, SFTP (file transfer)
- DNS (name resolution)
- SSH (secure shell)
How Data Flows Through Layers
When you send data, it travels down the stack on your machine, across the network, and up the stack on the destination:
    Sender                                Receiver
┌─────────────┐                        ┌─────────────┐
│ Application │ ─────────────────────> │ Application │
├─────────────┤                        ├─────────────┤
│Presentation │ ────────────────────── │Presentation │
├─────────────┤                        ├─────────────┤
│   Session   │ ────────────────────── │   Session   │
├─────────────┤                        ├─────────────┤
│  Transport  │ ────────────────────── │  Transport  │
├─────────────┤                        ├─────────────┤
│   Network   │ ────────────────────── │   Network   │
├─────────────┤      ┌─────────┐       ├─────────────┤
│  Data Link  │ ─────│ Router  │────── │  Data Link  │
├─────────────┤      └─────────┘       ├─────────────┤
│  Physical   │ ══════════════════════ │  Physical   │
└─────────────┘    Physical Medium     └─────────────┘
Each layer adds its own header (and sometimes trailer) to the data—a process called encapsulation.
OSI in the Real World
Here’s an important truth: the OSI model is a teaching tool, not a strict blueprint.
The internet wasn’t built on OSI—it was built on TCP/IP, which predates OSI and uses a simpler four-layer model. Real protocols often don’t fit neatly into single layers:
- TLS spans presentation and session layers
- HTTP is application layer but handles some session concerns
- TCP handles some session-layer functions
The OSI model is valuable for:
- Learning and discussing networking concepts
- Troubleshooting (“Is this a Layer 2 or Layer 3 problem?”)
- Understanding where protocols fit conceptually
But don’t expect real-world protocols to follow it rigidly.
Memorization Tricks
Many people use mnemonics to remember the layers. From Layer 1 to 7:
- Please Do Not Throw Sausage Pizza Away
- Physical, Data Link, Network, Transport, Session, Presentation, Application
Or from 7 to 1:
- All People Seem To Need Data Processing
Summary
The OSI model provides a framework for understanding network communication:
| Layer | Name | Key Function | Example |
|---|---|---|---|
| 7 | Application | User interface to network | HTTP, DNS |
| 6 | Presentation | Data formatting | TLS, JPEG |
| 5 | Session | Dialog management | RPC |
| 4 | Transport | End-to-end delivery | TCP, UDP |
| 3 | Network | Routing between networks | IP |
| 2 | Data Link | Local network delivery | Ethernet |
| 1 | Physical | Bit transmission | Cables, WiFi |
In the next section, we’ll look at the TCP/IP model—what the internet actually uses.
The TCP/IP Stack
While the OSI model is a useful teaching framework, the TCP/IP model is what the internet actually runs on. Developed in the 1970s by Vint Cerf and Bob Kahn, it’s simpler, more pragmatic, and battle-tested by decades of real-world use.
Four Layers vs. Seven
The TCP/IP model condenses networking into four layers:
┌─────────────────────────────────────────────────────────────┐
│ TCP/IP Model │ OSI Model │
├─────────────────────────────────────────────────────────────┤
│ │ Application (Layer 7) │
│ Application Layer │ Presentation (Layer 6) │
│ │ Session (Layer 5) │
├─────────────────────────────────────────────────────────────┤
│ Transport Layer │ Transport (Layer 4) │
├─────────────────────────────────────────────────────────────┤
│ Internet Layer │ Network (Layer 3) │
├─────────────────────────────────────────────────────────────┤
│ Network Access Layer │ Data Link (Layer 2) │
│ (Link Layer) │ Physical (Layer 1) │
└─────────────────────────────────────────────────────────────┘
This simplification isn’t accidental—it reflects reality. The top three OSI layers often blend together in practice, and the bottom two are typically handled by the same hardware/drivers.
Layer 1: Network Access Layer
Also called the Link Layer, this combines OSI’s physical and data link layers. It handles everything needed to send packets across a physical network segment.
Responsibilities:
- Physical transmission
- MAC addressing
- Frame formatting
- Local delivery
The TCP/IP model is agnostic about this layer. Whether you’re using:
- Ethernet
- WiFi
- Cellular (4G/5G)
- Satellite
- Carrier pigeon (yes, there’s an RFC for that: RFC 1149)
…the upper layers don’t care. This abstraction is what allows the internet to work across wildly different physical media.
Layer 2: Internet Layer
The internet layer handles logical addressing and routing. Its job is getting packets from source to destination across multiple networks.
Key Protocol: IP (Internet Protocol)
IP's Job: Get this packet from A to B, somehow.
Network A Network B Network C
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Host A │ │ Router │ │ Host B │
│192.168.1.5├──────┤ 1 ║ 2 ├───────┤10.0.0.100 │
└───────────┘ └─────╨─────┘ └───────────┘
IP handles: addressing, routing, fragmentation
IP doesn't handle: reliability, ordering, delivery confirmation
Other Internet Layer Protocols:
- ICMP (Internet Control Message Protocol): Error reporting, ping
- ARP (Address Resolution Protocol): Finds MAC address for an IP
- IGMP (Internet Group Management Protocol): Multicast group membership
Key Characteristics of IP:
- Connectionless: Each packet is independent
- Best-effort: No guarantee of delivery
- Unreliable: Packets can be lost, duplicated, or reordered
This might seem like a weakness, but it’s actually a feature. By keeping IP simple, it can be fast and widely implemented. Reliability can be added at higher layers when needed.
Layer 3: Transport Layer
The transport layer provides end-to-end communication between applications. It’s where we choose between reliability and speed.
TCP (Transmission Control Protocol)
TCP provides reliable, ordered, error-checked delivery.
TCP Provides:
✓ Connection-oriented (explicit setup and teardown)
✓ Reliable delivery (acknowledgments, retransmission)
✓ Ordered delivery (sequence numbers)
✓ Flow control (don't overwhelm the receiver)
✓ Congestion control (don't overwhelm the network)
TCP Costs:
✗ Connection overhead (handshake latency)
✗ Head-of-line blocking (one lost packet stalls everything)
✗ Higher latency than UDP
UDP (User Datagram Protocol)
UDP provides minimal transport services—just multiplexing and checksums.
UDP Provides:
✓ Connectionless (no setup overhead)
✓ Fast (minimal processing)
✓ No head-of-line blocking
✓ Optional checksum
UDP Lacks:
✗ No reliability (packets can be lost)
✗ No ordering (packets can arrive out of order)
✗ No flow control
✗ No congestion control
When to use which?
| Use Case | Protocol | Why |
|---|---|---|
| Web browsing | TCP | Need complete, ordered pages |
| File transfer | TCP | Can’t have missing bytes |
| Video streaming | UDP* | Some loss acceptable, low latency important |
| Online gaming | UDP | Real-time updates, old data worthless |
| DNS queries | UDP | Small, single request/response |
| VoIP | UDP | Real-time, loss preferable to delay |
*Modern streaming often uses TCP or QUIC for adaptive bitrate streaming.
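UDP's "no setup" property is visible in code: there is no handshake and no connection object, just datagrams. A minimal sketch (on the loopback interface loss is effectively nil, so this example behaves deterministically in practice):

```python
import socket

# Minimal UDP exchange: each datagram stands alone.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server.settimeout(5)                   # don't hang forever if a packet is lost
addr = server.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"ping", addr)           # sent immediately; no setup round trip

data, client_addr = server.recvfrom(1024)
print(data)                            # b'ping'

client.close()
server.close()
```

Compare this with the TCP example earlier: no `listen()`, no `accept()`, and no guarantee the datagram arrives on a real network.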
Layer 4: Application Layer
The application layer is where user-facing protocols live. It combines the application, presentation, and session layers from OSI.
Common Application Layer Protocols:
┌─────────────────────────────────────────────────────────────┐
│ Application Layer │
├──────────────┬──────────────┬──────────────┬───────────────┤
│ HTTP │ DNS │ SMTP │ SSH │
│ (Web) │ (Names) │ (Email) │ (Secure │
│ │ │ │ Shell) │
├──────────────┼──────────────┼──────────────┼───────────────┤
│ FTP │ DHCP │ SNMP │ NTP │
│ (Files) │ (Config) │ (Management) │ (Time) │
└──────────────┴──────────────┴──────────────┴───────────────┘
This layer handles:
- Data formatting and encoding
- Session management
- Application-specific protocols
- User authentication (in many protocols)
Putting It All Together
Let’s trace what happens when you request a webpage:
1. APPLICATION LAYER
Your browser creates an HTTP request:
"GET /index.html HTTP/1.1"
2. TRANSPORT LAYER
TCP segments the data, adds:
- Source port (e.g., 52431)
- Destination port (80)
- Sequence number
- Checksum
3. INTERNET LAYER
IP adds:
- Source IP (192.168.1.100)
- Destination IP (93.184.216.34)
- TTL (Time to Live)
4. NETWORK ACCESS LAYER
Ethernet adds:
- Source MAC
- Destination MAC (router's MAC)
- Frame check sequence
5. PHYSICAL
Converted to electrical signals on the wire
On the receiving end, this process reverses—each layer strips its header and passes data up.
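From the application's point of view, all five steps collapse into a few socket calls; the OS adds the TCP, IP, and Ethernet headers. The following sketch plays the trace end to end against a throwaway local server (a stand-in for example.com), with a raw socket in the browser role:

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Hello(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):      # keep the example quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Hello)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Step 1 is ours; steps 2-5 happen inside the OS network stack.
s = socket.create_connection(("127.0.0.1", server.server_port))
s.sendall(b"GET /index.html HTTP/1.1\r\nHost: localhost\r\n\r\n")
status_line = s.makefile("rb").readline()
print(status_line)                     # b'HTTP/1.1 200 OK\r\n'

s.close()
server.shutdown()
```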
The Protocol Graph
Rather than a strict stack, TCP/IP is better visualized as a graph:
┌─────────────────────────────────────┐
│ Applications │
│ HTTP SMTP DNS SSH Custom │
└──────────────┬──────────────────────┘
│
┌─────────────────┴─────────────────┐
│ │
┌────┴────┐ ┌────┴────┐
│ TCP │ │ UDP │
└────┬────┘ └────┬────┘
│ │
└─────────────────┬─────────────────┘
│
┌─────┴─────┐
│ IP │
└─────┬─────┘
│
┌────────────────┬─────┴─────┬────────────────┐
│ │ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│Ethernet │ │ WiFi │ │Cellular │ │ Other │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
Any application can use TCP or UDP. Both use IP. IP can run over any network technology. This flexibility is why the internet works.
Why TCP/IP Won
The OSI model was designed by committee to be complete and correct. TCP/IP was designed by engineers to work. Key differences:
| Aspect | OSI | TCP/IP |
|---|---|---|
| Design approach | Top-down, theoretical | Bottom-up, practical |
| Implementation | Came after spec | Spec described working code |
| Layer count | 7 (sometimes awkward) | 4 (pragmatic) |
| Real-world use | Reference model | Running on billions of devices |
TCP/IP’s success came from:
- Working code first: The spec described implementations that already worked
- Simplicity: Fewer layers, clearer responsibilities
- Flexibility: “Be liberal in what you accept, conservative in what you send”
- Open standards: Anyone could implement it
Summary
The TCP/IP model is the practical foundation of the internet:
| Layer | Function | Key Protocols |
|---|---|---|
| Application | User services | HTTP, DNS, SMTP, SSH |
| Transport | End-to-end delivery | TCP, UDP |
| Internet | Routing between networks | IP, ICMP |
| Network Access | Local delivery | Ethernet, WiFi |
Understanding this model—especially the separation between IP (best-effort routing) and TCP (reliable delivery)—is essential for understanding how the internet works.
Next, we’ll look at how data is wrapped and unwrapped as it moves through these layers: encapsulation.
Encapsulation
Encapsulation is the process by which each layer wraps data with its own header (and sometimes trailer) information. It’s how layers communicate without knowing about each other’s internals.
The Concept
Think of encapsulation like mailing a letter:
1. You write a letter [Your message]
2. Put it in an envelope [+ Your address, recipient address]
3. The post office puts it in a bin [+ Sorting codes, routing info]
4. The bin goes in a truck [+ Truck manifest, destination hub]
Each layer adds information needed for its job, without looking inside what it received.
Layer-by-Layer Encapsulation
Let’s trace a web request through the TCP/IP stack:
┌─────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ HTTP Request (Data) │ │
│ │ "GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n" │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Transport Layer (TCP) │
│ │
│ ┌──────────────┬────────────────────────────────────────────┐ │
│ │ TCP Header │ Data │ │
│ │ (20+ bytes) │ (HTTP Request from above) │ │
│ └──────────────┴────────────────────────────────────────────┘ │
│ │
│ TCP Segment │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Internet Layer (IP) │
│ │
│ ┌──────────────┬────────────────────────────────────────────┐ │
│ │ IP Header │ Data │ │
│ │ (20+ bytes) │ (TCP Segment from above) │ │
│ └──────────────┴────────────────────────────────────────────┘ │
│ │
│ IP Packet (or Datagram) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Network Access Layer (Ethernet) │
│ │
│ ┌──────────────┬────────────────────────────────────┬───────┐ │
│ │Ethernet Hdr │ Data │ FCS │ │
│ │ (14 bytes) │ (IP Packet from above) │(4 B) │ │
│ └──────────────┴────────────────────────────────────┴───────┘ │
│ │
│ Ethernet Frame │
└─────────────────────────────────────────────────────────────────┘
Terminology
Different layers use different names for their data units:
┌────────────────┬──────────────────┐
│ Layer          │ Data Unit Name   │
├────────────────┼──────────────────┤
│ Application    │ Message / Data   │
│ Transport      │ Segment (TCP)    │
│                │ Datagram (UDP)   │
│ Internet       │ Packet           │
│ Network Access │ Frame            │
└────────────────┴──────────────────┘
These terms matter when debugging—if someone mentions “packet loss,” they’re typically talking about the IP layer.
Detailed Header View
Here’s what each header actually contains:
Ethernet Header (14 bytes)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├───────────────────────────────────────────────────────────────┤
│ Destination MAC Address │
│ (6 bytes) │
├───────────────────────────────────────────────────────────────┤
│ Source MAC Address │
│ (6 bytes) │
├───────────────────────────────────────────────────────────────┤
│ EtherType (2 bytes) │
│ (0x0800 = IPv4, 0x86DD = IPv6) │
└───────────────────────────────────────────────────────────────┘
IPv4 Header (20-60 bytes)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├───────┬───────┬───────────────┬───────────────────────────────┤
│Version│ IHL │ DSCP/ECN │ Total Length │
├───────┴───────┴───────────────┼───────┬───────────────────────┤
│ Identification │ Flags │ Fragment Offset │
├───────────────┬───────────────┼───────┴───────────────────────┤
│ TTL │ Protocol │ Header Checksum │
├───────────────┴───────────────┴───────────────────────────────┤
│ Source IP Address │
├───────────────────────────────────────────────────────────────┤
│ Destination IP Address │
├───────────────────────────────────────────────────────────────┤
│ Options (if IHL > 5) │
└───────────────────────────────────────────────────────────────┘
TCP Header (20-60 bytes)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├───────────────────────────────┬───────────────────────────────┤
│ Source Port │ Destination Port │
├───────────────────────────────┴───────────────────────────────┤
│ Sequence Number │
├───────────────────────────────────────────────────────────────┤
│ Acknowledgment Number │
├───────┬───────┬───────────────┬───────────────────────────────┤
│ Data │ │C│E│U│A│P│R│S│F│ │
│ Offset│ Rsrvd │W│C│R│C│S│S│Y│I│ Window Size │
│ │ │R│E│G│K│H│T│N│N│ │
├───────┴───────┴───────────────┼───────────────────────────────┤
│ Checksum │ Urgent Pointer │
├───────────────────────────────┴───────────────────────────────┤
│ Options (if Data Offset > 5) │
└───────────────────────────────────────────────────────────────┘
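Header layouts like these map directly onto byte-packing code. The sketch below builds and reparses the fixed 20-byte TCP header with Python's struct module; the field values are hand-picked for illustration:

```python
import struct

# "!HHIIBBHHH" = network byte order: two 16-bit ports, 32-bit seq and
# ack numbers, the offset and flags bytes, then window/checksum/urgent.
header = struct.pack(
    "!HHIIBBHHH",
    52431,       # source port
    80,          # destination port
    1000,        # sequence number
    0,           # acknowledgment number
    5 << 4,      # data offset (5 x 32-bit words) in the high nibble
    0x02,        # flags byte: SYN bit set
    65535,       # window size
    0,           # checksum (left zero here)
    0,           # urgent pointer
)

src, dst, seq, ack, offset_byte, flags, window, csum, urg = struct.unpack(
    "!HHIIBBHHH", header
)
print(src, dst, seq)           # 52431 80 1000
print((offset_byte >> 4) * 4)  # 20 -> header length in bytes
print(bool(flags & 0x02))      # True -> SYN flag is set
```

This is essentially what packet-capture tools do when they decode a capture for display.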
Overhead Analysis
Each layer adds overhead. For a small HTTP request:
Layer         Header Size    Running Total
─────────────────────────────────────────────
HTTP Data     ~50 bytes       50 bytes
TCP Header     20 bytes       70 bytes
IP Header      20 bytes       90 bytes
Ethernet       18 bytes*     108 bytes
─────────────────────────────────────────────
*14 bytes header + 4 bytes FCS

Efficiency: 50/108 ≈ 46% payload
For small packets, overhead can be significant. This is why protocols often batch multiple operations or use compression.
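The running totals above are simple arithmetic; the ~50-byte request is the example size from the table:

```python
# Overhead for a small request: fixed headers dominate.
payload = 50                      # HTTP data in bytes
headers = 20 + 20 + 18            # TCP + IP + Ethernet (incl. 4-byte FCS)
total = payload + headers
print(total)                           # 108
print(round(payload / total * 100))    # 46 -> percent of bytes that are payload
```

Double the payload and the efficiency jumps to roughly 63%, which is why batching small operations pays off.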
Decapsulation (Receiving)
On the receiving side, each layer strips its header and passes the payload up:
Receiving Host
─────────────────────────────────────────────────────────
Frame arrives → Network Card
┌─────────────────────────────────────────────────────┐
│ Link Layer │
│ │
│ 1. Verify FCS (checksum) │
│ 2. Check destination MAC │
│ 3. Read EtherType → 0x0800 (IPv4) │
│ 4. Strip Ethernet header, pass up │
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Network Layer │
│ │
│ 1. Verify header checksum │
│ 2. Check destination IP │
│ 3. Read Protocol field → 6 (TCP) │
│ 4. Strip IP header, pass up │
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Transport Layer │
│ │
│ 1. Verify checksum │
│ 2. Read destination port → 80 │
│ 3. Find socket listening on port 80 │
│ 4. Process TCP state machine │
│ 5. Strip TCP header, pass up │
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Application Layer │
│ │
│ Web server receives: "GET /index.html HTTP/1.1" │
└─────────────────────────────────────────────────────┘
How Layers Know What’s Inside
Each layer includes a field indicating what’s in the payload:
Ethernet EtherType:
0x0800 = IPv4
0x86DD = IPv6
0x0806 = ARP
IP Protocol:
1 = ICMP
6 = TCP
17 = UDP
47 = GRE
TCP/UDP Port:
80 = HTTP
443 = HTTPS
22 = SSH
53 = DNS
This is how a packet finds its way to the right application.
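A sketch of that dispatch chain in code, using the numbers listed above (the `classify` helper is invented for illustration; the numeric constants are the real IANA-assigned values):

```python
import socket

ETHERTYPES = {0x0800: "IPv4", 0x86DD: "IPv6", 0x0806: "ARP"}
IP_PROTOCOLS = {
    1: "ICMP",
    socket.IPPROTO_TCP: "TCP",   # 6
    socket.IPPROTO_UDP: "UDP",   # 17
    47: "GRE",
}

def classify(ethertype: int, ip_proto: int, dst_port: int) -> str:
    """Follow the chain of 'next protocol' fields toward an application."""
    network = ETHERTYPES.get(ethertype, "unknown")
    transport = IP_PROTOCOLS.get(ip_proto, "unknown")
    return f"{network}/{transport} -> port {dst_port}"

print(classify(0x0800, 6, 80))    # IPv4/TCP -> port 80
print(classify(0x0800, 17, 53))   # IPv4/UDP -> port 53
```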
Encapsulation in Code
Here’s a simplified view of building a packet in Python (conceptual):
# Application layer - your data
http_request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"

# Transport layer - add TCP header
tcp_segment = TCPHeader(
    src_port=52431,
    dst_port=80,
    seq_num=1000,
    ack_num=0,
    flags=SYN,
) + http_request

# Network layer - add IP header
ip_packet = IPHeader(
    src_ip="192.168.1.100",
    dst_ip="93.184.216.34",
    protocol=TCP,
    ttl=64,
) + tcp_segment

# Link layer - add Ethernet header, then append the FCS trailer,
# computed over the frame contents
ethernet_frame = EthernetHeader(
    src_mac="00:11:22:33:44:55",
    dst_mac="aa:bb:cc:dd:ee:ff",
    ethertype=IPv4,
) + ip_packet
ethernet_frame += calculate_fcs(ethernet_frame)

# Send it!
network_card.send(ethernet_frame)
In practice, the operating system’s network stack handles this, but understanding the process helps when debugging.
Practical Implications
MTU (Maximum Transmission Unit)
The link layer limits frame size. For Ethernet, the MTU is typically 1500 bytes:
Ethernet Frame Limit: 1518 bytes total
- Ethernet header: 14 bytes
- Payload: 1500 bytes (MTU)
- FCS: 4 bytes
Available for IP packet: 1500 bytes
- IP header: 20 bytes
- TCP header: 20 bytes
- Application data: 1460 bytes (typical MSS)
If data exceeds this, it must be fragmented—which has performance costs.
Jumbo Frames
Some networks support larger MTUs (up to 9000 bytes):
- Reduces overhead ratio
- Common in data centers
- Not universal—can cause problems if intermediate networks don’t support them
Summary
Encapsulation is the mechanism that makes layered networking work:
- Each layer adds its own header with information needed for its function
- Headers contain “next layer” indicators so receivers know how to decode
- Layers are independent—changes to one don’t affect others
- Overhead accumulates—important for small packet performance
Understanding encapsulation helps you:
- Debug network issues at the right layer
- Understand packet capture output
- Make informed decisions about protocol overhead
Next, we’ll explore ports and sockets—how multiple applications share a single network connection.
Ports and Sockets
A single computer can run dozens of networked applications simultaneously—a web browser, email client, chat application, and more. How does the operating system route incoming data to the right application? The answer lies in ports and sockets.
The Problem
Consider a server with IP address 192.168.1.100 running:
- A web server
- An SSH server
- A database
- An API service
When a packet arrives addressed to 192.168.1.100, which application should receive it?
Incoming Packets
│
▼
┌─────────────────────┐
│ IP: 192.168.1.100 │
│ │
│ ??? Which app ??? │
│ │
│ ┌───┐ ┌───┐ ┌───┐ │
│ │Web│ │SSH│ │DB │ │
│ └───┘ └───┘ └───┘ │
└─────────────────────┘
Ports: Application Addressing
Ports are 16-bit numbers (0-65535) that identify specific applications or services on a host. Combined with an IP address, a port uniquely identifies an application endpoint.
┌─────────────────────────────────────────────────────────────┐
│ Port Number Space │
│ │
│ 0 ─────── 1023 ──────── 49151 ──────── 65535 │
│ │ │ │ │ │
│ │ Well-Known│ Registered │ Dynamic/ │ │
│ │ Ports │ Ports │ Private │ │
│ │ │ │ Ports │ │
│ │ (System) │ (IANA reg) │ (Ephemeral) │ │
└─────────────────────────────────────────────────────────────┘
Port Ranges
| Range | Name | Purpose |
|---|---|---|
| 0-1023 | Well-Known Ports | Reserved for standard services; require root/admin |
| 1024-49151 | Registered Ports | Can be registered with IANA for specific services |
| 49152-65535 | Dynamic/Private | Used for client-side ephemeral ports |
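You can watch ephemeral port assignment happen: binding to port 0 asks the OS to pick one. A small sketch (the exact range is OS-dependent, e.g. Linux commonly uses 32768-60999, so the only safe general claim is that it lies above the well-known range):

```python
import socket

# Port 0 means "OS, pick an ephemeral port for me".
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))
port = s.getsockname()[1]
print(port)    # some port above 1023, chosen by the OS
s.close()
```

Client sockets get the same treatment implicitly: when you call `connect()` without binding first, the OS assigns an ephemeral source port for you.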
Common Well-Known Ports
Port Protocol Service
────────────────────────────
20 TCP FTP Data
21 TCP FTP Control
22 TCP SSH
23 TCP Telnet
25 TCP SMTP
53 TCP/UDP DNS
67/68 UDP DHCP
80 TCP HTTP
110 TCP POP3
143 TCP IMAP
443 TCP HTTPS
465 TCP SMTPS
587 TCP SMTP Submission
993 TCP IMAPS
995 TCP POP3S
3306 TCP MySQL
5432 TCP PostgreSQL
6379 TCP Redis
27017 TCP MongoDB
How Ports Enable Multiplexing
With ports, our server can now direct traffic:
Incoming Packets
│
┌─────────┴─────────┐
│ Check port │
└─────────┬─────────┘
│
┌─────────┼─────────────────┬──────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│Port 80│ │Port 22│ │Port │ │Port │
│ HTTP │ │ SSH │ │ 5432 │ │ 3000 │
│Server │ │Server │ │Postgre│ │ API │
└───────┘ └───────┘ └───────┘ └───────┘
Sockets: The Programming Interface
A socket is an endpoint for network communication. It’s the API that applications use to send and receive data over the network.
The Socket Tuple
A socket is uniquely identified by a 5-tuple:
┌─────────────────────────────────────────────────────────────┐
│ Socket 5-Tuple │
├─────────────────────────────────────────────────────────────┤
│ 1. Protocol (TCP or UDP) │
│ 2. Local IP (192.168.1.100) │
│ 3. Local Port (80) │
│ 4. Remote IP (10.0.0.50) │
│ 5. Remote Port (52431) │
└─────────────────────────────────────────────────────────────┘
This combination uniquely identifies a connection.
Why the Tuple Matters
Multiple connections can share the same local port:
Server listening on port 80 (192.168.1.100:80)
Connection 1: (TCP, 192.168.1.100, 80, 10.0.0.50, 52431)
Connection 2: (TCP, 192.168.1.100, 80, 10.0.0.50, 52432)
Connection 3: (TCP, 192.168.1.100, 80, 10.0.0.99, 41000)
└─┬─┘ └──────┬─────┘ └┬┘ └────┬────┘ └──┬──┘
Proto Local IP Local Remote IP Remote
Port Port
All three connections go to the same server port, but
each is a unique connection due to different remote endpoints.
This is how a web server can handle thousands of simultaneous connections on port 80.
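You can observe this directly from Python. The sketch below (localhost only; the actual port numbers will vary per run) opens one listening socket and accepts two connections to it; each accepted connection reports a different remote ephemeral port, so the 5-tuples differ even though the local port is shared:

```python
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))          # port 0: let the OS pick a free port
server.listen(2)
port = server.getsockname()[1]

def connect():
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.connect(('127.0.0.1', port))
    return c

clients = [connect(), connect()]
remotes = []
for _ in clients:
    conn, addr = server.accept()       # addr is the client's (IP, ephemeral port)
    remotes.append(addr)
    conn.close()

# Same local port on the server side, two different remote ports:
print(remotes)  # e.g. [('127.0.0.1', 51234), ('127.0.0.1', 51235)]

for c in clients:
    c.close()
server.close()
```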
Socket Types
Stream Sockets (SOCK_STREAM)
Used with TCP:
- Connection-oriented
- Reliable, ordered byte stream
- Most common for applications
Client Server
│ │
│────── connect() ─────────────────>│
│ │ accept()
│<──────────────────────────────────│
│ │
│═══════ Bidirectional Stream ══════│
│ │
│────── send(data) ────────────────>│
│<───── send(response) ─────────────│
│ │
│────── close() ───────────────────>│
Datagram Sockets (SOCK_DGRAM)
Used with UDP:
- Connectionless
- Individual messages (datagrams)
- No guarantee of delivery or order
Client Server
│ │
│────── sendto(data, addr) ────────>│
│<───── sendto(response, addr) ─────│
│ │
│ (No connection state) │
│ │
│────── sendto(data, addr) ────────>│
Socket Programming Example
Here’s a simple TCP server and client in Python:
TCP Server
import socket
# Create socket
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Allow address reuse (helpful during development)
server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# Bind to address and port
server_socket.bind(('0.0.0.0', 8080))
# Listen for connections (backlog of 5)
server_socket.listen(5)
print("Server listening on port 8080...")
while True:
    # Accept incoming connection
    client_socket, client_address = server_socket.accept()
    print(f"Connection from {client_address}")
    # Receive data
    data = client_socket.recv(1024)
    print(f"Received: {data.decode()}")
    # Send response
    client_socket.send(b"Hello from server!")
    # Close connection
    client_socket.close()
TCP Client
import socket
# Create socket
client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Connect to server
client_socket.connect(('localhost', 8080))
# Send data
client_socket.send(b"Hello from client!")
# Receive response
response = client_socket.recv(1024)
print(f"Received: {response.decode()}")
# Close connection
client_socket.close()
The Socket Lifecycle
Server Side
┌─────────────────────────────────────────────────────────────┐
│ Server Socket Lifecycle │
└─────────────────────────────────────────────────────────────┘
socket() Create the socket
│
▼
bind() Assign local address and port
│
▼
listen() Mark socket as passive (accepting connections)
│
▼
┌──────────────────────────────────────┐
│ accept() │◄────┐
│ (blocks until client connects) │ │
└──────────────┬───────────────────────┘ │
│ │
▼ │
New connected socket │
│ │
┌──────────┴──────────┐ │
│ │ │
▼ ▼ │
recv()/send() spawn thread/ │
│ handle async │
▼ │ │
close() │ │
│ └──────────────────┘
│
(Handle next connection)
Client Side
┌─────────────────────────────────────────────────────────────┐
│ Client Socket Lifecycle │
└─────────────────────────────────────────────────────────────┘
socket() Create the socket
│
▼
connect() Connect to remote server
│ (OS assigns ephemeral local port)
▼
send()/recv() Exchange data
│
▼
close() Terminate connection
Ephemeral Ports
When a client connects to a server, the OS automatically assigns an ephemeral (temporary) port for the client side:
Client Server
┌─────────────┐ ┌─────────────┐
│ 10.0.0.50 │ │192.168.1.100│
│ │ │ │
│ Port: ??? │── connect() ──────────>│ Port: 80 │
└─────────────┘ └─────────────┘
OS assigns ephemeral port (e.g., 52431)
┌─────────────┐ ┌─────────────┐
│ 10.0.0.50 │ │192.168.1.100│
│ │ │ │
│ Port: 52431 │<═══════════════════════│ Port: 80 │
└─────────────┘ Connection └─────────────┘
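You can watch this assignment happen with a few lines of Python. This is a minimal localhost sketch; the assigned port number will differ on every run:

```python
import socket

# Stand up a throwaway local listener to connect to.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)
server_port = listener.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(client.getsockname())   # ('0.0.0.0', 0): unbound, no port yet
client.connect(('127.0.0.1', server_port))
local_ip, local_port = client.getsockname()
print(local_port)             # e.g. 52431, the OS-assigned ephemeral port

client.close()
listener.close()
```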
Ephemeral Port Range
Different systems use different ranges:
| OS | Default Range |
|---|---|
| Linux | 32768-60999 |
| Windows | 49152-65535 |
| macOS | 49152-65535 |
You can check and modify this on Linux:
$ cat /proc/sys/net/ipv4/ip_local_port_range
32768 60999
$ sudo sysctl -w net.ipv4.ip_local_port_range="10000 65535"
Port Exhaustion
Each outbound connection uses an ephemeral port. If your application makes many outbound connections, you can exhaust available ports:
Problem Scenario:
─────────────────
Application makes 50,000 connections to an API server.
Each connection uses one ephemeral port.
Default range: 32768-60999 = ~28,000 ports
If connections aren't closed properly (lingering in TIME_WAIT),
you run out of ports!
Solutions:
─────────────────
1. Expand ephemeral port range
2. Enable TCP reuse options (SO_REUSEADDR, tcp_tw_reuse)
3. Use connection pooling
4. Properly close connections
Viewing Port Usage
Linux/macOS
# List all listening ports
$ netstat -tlnp
Proto Local Address Foreign Address State PID/Program
tcp 0.0.0.0:22 0.0.0.0:* LISTEN 1234/sshd
tcp 0.0.0.0:80 0.0.0.0:* LISTEN 5678/nginx
# Or with ss (modern replacement)
$ ss -tlnp
# List all connections
$ netstat -anp | grep ESTABLISHED
# Show which process owns a port
$ lsof -i :80
Windows
# List all listening ports
netstat -an | findstr LISTENING
# Show process IDs
netstat -ano | findstr :80
Special Port Behaviors
Binding to 0.0.0.0
Binding to 0.0.0.0 means “all interfaces”:
┌─────────────────────────────────────────────────────────────┐
│ Server with multiple interfaces │
│ │
│ eth0: 192.168.1.100 │
│ eth1: 10.0.0.50 │
│ lo: 127.0.0.1 │
│ │
│ bind('0.0.0.0', 80) → accepts on ALL interfaces │
│ bind('192.168.1.100', 80) → accepts only on eth0 │
│ bind('127.0.0.1', 80) → accepts only on localhost │
└─────────────────────────────────────────────────────────────┘
Port 0
Binding to port 0 asks the OS to assign any available port:
server_socket.bind(('0.0.0.0', 0))
actual_port = server_socket.getsockname()[1]
print(f"Assigned port: {actual_port}") # e.g., 54321
Reserved Ports (< 1024)
On Unix systems, ports below 1024 require root privileges:
$ python -c "import socket; s=socket.socket(); s.bind(('',80))"
PermissionError: [Errno 13] Permission denied
$ sudo python -c "import socket; s=socket.socket(); s.bind(('',80))"
# Works
This prevents unprivileged users from impersonating system services.
Summary
- Ports (0-65535) identify applications on a host
- Sockets are the programming interface for network I/O
- A connection is uniquely identified by the 5-tuple: (protocol, local IP, local port, remote IP, remote port)
- Ephemeral ports are automatically assigned for outbound connections
- Multiple connections can share a server port because remote endpoints differ
Understanding ports and sockets is essential for:
- Writing networked applications
- Debugging connectivity issues
- Understanding firewall rules
- Diagnosing port exhaustion problems
With the fundamentals covered, we’re ready to dive into the IP layer—how data finds its way across the internet.
The IP Layer
The Internet Protocol (IP) is the foundation of the internet. It provides logical addressing and routing—the ability to send packets from any device to any other device, regardless of the physical networks in between.
IP’s Simple Contract
IP makes a simple promise: “I’ll try to get this packet to its destination.”
Notice what IP doesn’t promise:
- Packets will arrive (they might be dropped)
- Packets will arrive in order (they might take different routes)
- Packets will arrive only once (duplicates can happen)
- Packets will arrive intact (corruption is possible, though detected)
This “best-effort” service might seem inadequate, but it’s deliberately minimal. By keeping IP simple, it can be:
- Fast: Minimal processing per packet
- Scalable: Routers don’t maintain connection state
- Universal: Works over any link layer
Higher layers (like TCP) can add reliability when needed.
The Two IP Versions
Today’s internet runs on two versions of IP:
┌─────────────────────────────────────────────────────────────┐
│ │
│ IPv4 (1981) │ IPv6 (1998) │
│ ─────────────────────────│───────────────────────────── │
│ 32-bit addresses │ 128-bit addresses │
│ ~4.3 billion addresses │ ~340 undecillion addresses │
│ Widely deployed │ Growing adoption │
│ NAT commonly used │ NAT generally unnecessary │
│ Variable header │ Fixed header, extensions │
│ │
└─────────────────────────────────────────────────────────────┘
Both are in active use. Your device likely uses both daily.
What You’ll Learn
In this chapter, we’ll cover:
- IPv4 Addressing: The original 32-bit addressing scheme
- IPv6: The next generation with its vastly larger address space
- Subnetting: Dividing networks into smaller segments
- Routing: How packets find their way across networks
- Fragmentation: What happens when packets are too big
Key Concepts Preview
Addresses Identify Interfaces, Not Hosts
A common misconception is that an IP address identifies a computer. Actually, it identifies a network interface. A computer with two network cards has two IP addresses:
┌────────────────────────────────────────────────┐
│ Server │
│ │
│ ┌────────────┐ ┌────────────┐ │
│ │ eth0 │ │ eth1 │ │
│ │192.168.1.10│ │ 10.0.0.10 │ │
│ └──────┬─────┘ └──────┬─────┘ │
└──────────┼───────────────────────┼────────────┘
│ │
┌────┴─────┐ ┌────┴─────┐
│ Network A │ │ Network B │
└──────────┘ └──────────┘
Routing Is Hop-by-Hop
No device knows the complete path to a destination. Each router makes a local decision about the next hop:
Source ──> Router1 ──> Router2 ──> Router3 ──> Destination
Each router:
1. Looks at destination IP
2. Consults routing table
3. Forwards to next hop
4. Forgets about the packet
No router knows the full path. Each just knows "for this
destination, send to that next router."
TTL Prevents Infinite Loops
The Time to Live (TTL) field starts at some value (typically 64 or 128) and decrements at each hop. If it reaches 0, the packet is discarded. This prevents packets from circulating forever if there’s a routing loop.
TTL at source: 64
After router 1: 63
After router 2: 62
...
If routing loop: Eventually hits 0 → packet dropped
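Applications can control the initial TTL through a socket option. Here is a minimal sketch (Linux/macOS; a UDP socket, and nothing is actually sent) showing the same mechanism that traceroute exploits by sending probes with TTL 1, 2, 3, and so on:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, 1)   # discard after one hop
ttl = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TTL)
print(ttl)  # 1
sock.close()
```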
Let’s dive into the details, starting with IPv4.
IPv4 Addressing
IPv4 (Internet Protocol version 4) has been the backbone of the internet since 1981. Despite its age and limitations, it still carries the majority of internet traffic.
The IPv4 Address
An IPv4 address is a 32-bit number, typically written as four decimal numbers separated by dots (dotted-decimal notation):
Binary: 11000000 10101000 00000001 01100100
└──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘
Decimal: 192 . 168 . 1 . 100
Each number (octet) ranges from 0-255 (8 bits)
Total: 4 octets × 8 bits = 32 bits
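Since the dotted-decimal form is just a human-friendly rendering of one 32-bit integer, the conversion takes only a couple of lines of Python:

```python
import socket
import struct

packed = socket.inet_aton('192.168.1.100')          # 4 raw bytes
(as_int,) = struct.unpack('!I', packed)             # network byte order
print(as_int)            # 3232235876
print(f'{as_int:032b}')  # 11000000101010000000000101100100
print(socket.inet_ntoa(struct.pack('!I', as_int)))  # back to '192.168.1.100'
```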
Address Space Size
32 bits gives us 2³² = 4,294,967,296 addresses. Sounds like a lot, but:
- Many are reserved for special purposes
- Allocation was historically wasteful
- Every device needs an address (phones, IoT, servers…)
IANA allocated its last unreserved blocks of IPv4 addresses in 2011.
The IPv4 Header
Every IP packet starts with a header containing routing and handling information:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
│Version│ IHL │ DSCP │ECN│ Total Length │
├───────┴───────┴───────────┴───┼───────┬───────────────────────┤
│ Identification │ Flags │ Fragment Offset │
├───────────────┬───────────────┼───────┴───────────────────────┤
│ TTL │ Protocol │ Header Checksum │
├───────────────┴───────────────┴───────────────────────────────┤
│ Source IP Address │
├───────────────────────────────────────────────────────────────┤
│ Destination IP Address │
├───────────────────────────────────────────────────────────────┤
│ Options (if IHL > 5) │
└───────────────────────────────────────────────────────────────┘
Minimum header size: 20 bytes (no options)
Maximum header size: 60 bytes (with options)
Key Header Fields
| Field | Size | Purpose |
|---|---|---|
| Version | 4 bits | IP version (4 for IPv4) |
| IHL | 4 bits | Header length in 32-bit words |
| DSCP/ECN | 8 bits | Quality of Service hints |
| Total Length | 16 bits | Packet size (header + data) |
| Identification | 16 bits | Unique ID for fragmentation |
| Flags | 3 bits | Fragmentation control |
| Fragment Offset | 13 bits | Position in fragmented packet |
| TTL | 8 bits | Hop limit (prevents loops) |
| Protocol | 8 bits | Upper layer protocol (TCP=6, UDP=17) |
| Header Checksum | 16 bits | Error detection for header |
| Source IP | 32 bits | Sender’s address |
| Destination IP | 32 bits | Receiver’s address |
Address Classes (Historical)
Originally, IPv4 used a classful addressing scheme:
Class A: 0xxxxxxx.xxxxxxxx.xxxxxxxx.xxxxxxxx
│└──────────────┬───────────────────┘
Network (8 bits) Host (24 bits)
Range: 1.0.0.0 - 126.255.255.255
Networks: 126 Hosts/Network: 16 million
Class B: 10xxxxxx.xxxxxxxx.xxxxxxxx.xxxxxxxx
└───────┬────────┘└───────┬────────┘
Network (16 bits) Host (16 bits)
Range: 128.0.0.0 - 191.255.255.255
Networks: 16,384 Hosts/Network: 65,534
Class C: 110xxxxx.xxxxxxxx.xxxxxxxx.xxxxxxxx
└──────────┬──────────────┘└───┬───┘
Network (24 bits) Host (8 bits)
Range: 192.0.0.0 - 223.255.255.255
Networks: 2 million Hosts/Network: 254
Class D: 1110xxxx.xxxxxxxx.xxxxxxxx.xxxxxxxx
Multicast addresses (224.0.0.0 - 239.255.255.255)
Class E: 1111xxxx.xxxxxxxx.xxxxxxxx.xxxxxxxx
Reserved/Experimental (240.0.0.0 - 255.255.255.255)
This system is obsolete. It was too inflexible—an organization needing 300 addresses had to get a Class B (65,534 addresses) because Class C was too small (254). This wasted addresses. Modern networks use CIDR (classless addressing) instead.
Special and Reserved Addresses
Several address ranges have special meanings:
┌──────────────────────────────────────────────────────────────┐
│ Address Range │ Purpose │
├──────────────────────────────────────────────────────────────┤
│ 0.0.0.0/8 │ "This network" / unspecified │
│ 10.0.0.0/8 │ Private network (Class A) │
│ 127.0.0.0/8 │ Loopback (localhost) │
│ 169.254.0.0/16 │ Link-local (auto-config) │
│ 172.16.0.0/12 │ Private network (Class B range) │
│ 192.168.0.0/16 │ Private network (Class C range) │
│ 224.0.0.0/4 │ Multicast │
│ 255.255.255.255 │ Broadcast │
└──────────────────────────────────────────────────────────────┘
Private Addresses (RFC 1918)
Three ranges are designated for private use—they’re not routable on the public internet:
10.0.0.0 - 10.255.255.255 (10.0.0.0/8) 16 million addresses
172.16.0.0 - 172.31.255.255 (172.16.0.0/12) 1 million addresses
192.168.0.0 - 192.168.255.255 (192.168.0.0/16) 65,536 addresses
Your home network almost certainly uses one of these ranges (typically 192.168.x.x). To reach the internet, your router performs NAT (Network Address Translation).
Loopback Address
127.0.0.1 (or any 127.x.x.x) is the loopback address. Traffic sent here never leaves your machine—it’s used for local testing:
$ ping 127.0.0.1
PING 127.0.0.1: 64 bytes, seq=0 time=0.054 ms
# Same as:
$ ping localhost
Broadcast Address
255.255.255.255 is the limited broadcast address. Packets sent here go to all devices on the local network segment.
Each network also has a directed broadcast address (the highest address in the range). For 192.168.1.0/24, the broadcast is 192.168.1.255.
Network vs. Host Portions
An IP address has two parts:
192.168.1.100
└───┬───┘└┬┘
Network Host
Portion Portion
The division is determined by the subnet mask.
The network portion identifies which network a host belongs to. The host portion identifies the specific device on that network.
Subnet Mask
A subnet mask indicates how many bits are network vs. host:
IP Address: 192.168.1.100 = 11000000.10101000.00000001.01100100
Subnet Mask: 255.255.255.0 = 11111111.11111111.11111111.00000000
└────────── Network ─────────────┘└ Host ┘
AND them together to get the network address:
Network: 192.168.1.0 = 11000000.10101000.00000001.00000000
CIDR Notation
CIDR (Classless Inter-Domain Routing) notation appends a slash and the number of network bits:
192.168.1.100/24
└── 24 bits for network = 255.255.255.0 mask
Common CIDR blocks:
/8 = 255.0.0.0 = 16,777,214 hosts
/16 = 255.255.0.0 = 65,534 hosts
/24 = 255.255.255.0 = 254 hosts
/32 = 255.255.255.255 = 1 host (single address)
Determining If Two Hosts Are on the Same Network
Hosts on the same network can communicate directly. Hosts on different networks need a router.
Host A: 192.168.1.100/24
Host B: 192.168.1.200/24
Host C: 192.168.2.50/24
Apply mask to each:
A network: 192.168.1.100 AND 255.255.255.0 = 192.168.1.0
B network: 192.168.1.200 AND 255.255.255.0 = 192.168.1.0
C network: 192.168.2.50 AND 255.255.255.0 = 192.168.2.0
A and B: Same network (192.168.1.0) → Direct communication
A and C: Different networks → Need router
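The same check can be sketched with the standard ipaddress module (strict=False lets us pass a host address rather than a network address):

```python
import ipaddress

def same_subnet(a: str, b: str, prefix: int = 24) -> bool:
    # Mask each address down to its network address and compare.
    net_a = ipaddress.ip_network(f'{a}/{prefix}', strict=False)
    net_b = ipaddress.ip_network(f'{b}/{prefix}', strict=False)
    return net_a.network_address == net_b.network_address

print(same_subnet('192.168.1.100', '192.168.1.200'))  # True  (direct)
print(same_subnet('192.168.1.100', '192.168.2.50'))   # False (need router)
```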
NAT (Network Address Translation)
With private addresses and limited IPv4 space, NAT lets many devices share one public IP:
Private Network (192.168.1.0/24) Internet
┌─────────────────────────────────┐
│ ┌─────────┐ │ ┌─────────────────┐
│ │ Laptop │ │ │ │
│ │ .100 ├──┐ │ │ Web Server │
│ └─────────┘ │ ┌─────────┐ │ │ 93.184.216.34 │
│ ├────┤ Router ├──┼────>│ │
│ ┌─────────┐ │ │ NAT │ │ │ │
│ │ Phone ├──┘ │ │ │ └─────────────────┘
│ │ .101 │ │ Public: │ │
│ └─────────┘ │73.45.2.1│ │
│ └─────────┘ │
└─────────────────────────────────┘
Laptop sends: src=192.168.1.100:52000 dst=93.184.216.34:80
NAT rewrites: src=73.45.2.1:40123 dst=93.184.216.34:80
Response comes back to 73.45.2.1:40123
NAT looks up mapping, forwards to 192.168.1.100:52000
NAT is why billions of devices can use the internet with only ~4 billion addresses.
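To make the rewriting concrete, here is a toy model of a NAT mapping table. This is not a real implementation; the addresses and ports mirror the diagram above, and the starting public port 40123 is arbitrary:

```python
PUBLIC_IP = '73.45.2.1'
nat_table = {}            # (private_ip, private_port) -> public_port
next_public_port = 40123

def outbound(src_ip, src_port, dst):
    """Rewrite a private source endpoint to the shared public one."""
    global next_public_port
    key = (src_ip, src_port)
    if key not in nat_table:
        nat_table[key] = next_public_port
        next_public_port += 1
    return (PUBLIC_IP, nat_table[key], dst)

def inbound(public_port):
    """Look up which private host a returning packet belongs to."""
    for (ip, port), pub in nat_table.items():
        if pub == public_port:
            return (ip, port)
    return None  # no mapping: drop the packet

print(outbound('192.168.1.100', 52000, ('93.184.216.34', 80)))
# ('73.45.2.1', 40123, ('93.184.216.34', 80))
print(inbound(40123))   # ('192.168.1.100', 52000)
```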
Working with IP Addresses in Code
Python
import ipaddress
# Parse an address
ip = ipaddress.ip_address('192.168.1.100')
print(ip.is_private) # True
print(ip.is_loopback) # False
# Work with networks
network = ipaddress.ip_network('192.168.1.0/24')
print(network.num_addresses) # 256
print(network.netmask) # 255.255.255.0
# Check if address is in network
ip = ipaddress.ip_address('192.168.1.100')
print(ip in network) # True
# Iterate over hosts
for host in network.hosts():
    print(host)  # 192.168.1.1 through 192.168.1.254
Bash
# Get your IP addresses
$ ip addr show
# or
$ ifconfig
# Check if you can reach an IP
$ ping -c 3 192.168.1.1
# Trace route to destination
$ traceroute 8.8.8.8
# Look up your public IP
$ curl ifconfig.me
Practical Tips
Finding Your IP Address
# Linux/Mac - local IP
$ hostname -I
192.168.1.100
# Windows - local IP
> ipconfig
# Public IP (what the internet sees)
$ curl ifconfig.me
Common Issues
“Network is unreachable”
- Check if you have an IP (DHCP may have failed)
- Check subnet mask is correct
- Check default gateway is set
“No route to host”
- Destination may be down
- Firewall may be blocking
- ARP resolution may have failed
“Connection refused”
- You reached the host, but no service is listening
- This is a good sign for network debugging—networking works!
Summary
IPv4’s 32-bit addressing scheme, while showing its age, remains the internet’s foundation:
- Addresses are written as four octets (e.g., 192.168.1.100)
- Network and host portions are determined by the subnet mask
- Private ranges (10.x, 172.16-31.x, 192.168.x) are for internal use
- NAT allows address sharing but adds complexity
- CIDR replaced wasteful classful addressing
The address shortage led to IPv6, which we’ll cover next.
IPv6 and the Future
IPv6 was designed to solve IPv4’s address exhaustion problem—and to fix several other shortcomings along the way. With 128-bit addresses, IPv6 provides enough addresses for every grain of sand on Earth to have its own IP… many times over.
Why IPv6?
The primary driver was address space:
IPv4: 2³² = 4.3 billion addresses
IPv6: 2¹²⁸ = 340 undecillion addresses
340,282,366,920,938,463,463,374,607,431,768,211,456
That's 340 trillion trillion trillion addresses.
Or about 50 octillion addresses per human alive today.
But IPv6 also addressed other IPv4 limitations:
- No more NAT required (enough addresses for everyone)
- Simplified header (faster routing)
- Built-in security (IPsec)
- Better multicast support
- Stateless address autoconfiguration
IPv6 Address Format
An IPv6 address is 128 bits, written as eight groups of four hexadecimal digits:
Full form:
2001:0db8:85a3:0000:0000:8a2e:0370:7334
└──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘
│ │ │ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
Each group = 16 bits (4 hex digits)
8 groups × 16 bits = 128 bits
Address Shortening Rules
IPv6 addresses can be shortened for readability:
Rule 1: Remove leading zeros in each group
2001:0db8:0042:0000:0000:0000:0000:0001
↓
2001:db8:42:0:0:0:0:1
Rule 2: Replace one sequence of all-zero groups with ::
2001:db8:42:0:0:0:0:1
↓
2001:db8:42::1
Important: :: can only appear once per address (otherwise it’s ambiguous).
Examples
Full Shortened
────────────────────────────────────────────────────────────
2001:0db8:0000:0000:0000:0000:0000:0001 2001:db8::1
0000:0000:0000:0000:0000:0000:0000:0001 ::1 (loopback)
0000:0000:0000:0000:0000:0000:0000:0000 :: (unspecified)
fe80:0000:0000:0000:0215:5dff:fe00:0000 fe80::215:5dff:fe00:0
2001:0db8:85a3:0000:0000:8a2e:0370:7334 2001:db8:85a3::8a2e:370:7334
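Python's ipaddress module applies both shortening rules for you, which is handy for checking these by hand:

```python
import ipaddress

full = '2001:0db8:0042:0000:0000:0000:0000:0001'
ip = ipaddress.ip_address(full)
print(ip.compressed)   # 2001:db8:42::1
print(ip.exploded)     # 2001:0db8:0042:0000:0000:0000:0000:0001

# Different spellings of the same address compare equal:
print(ipaddress.ip_address('::1') ==
      ipaddress.ip_address('0000:0000:0000:0000:0000:0000:0000:0001'))  # True
```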
The IPv6 Header
IPv6’s header is simpler than IPv4’s—fixed at 40 bytes with no options in the base header:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├───────┬───────────────┬───────────────────────────────────────┤
│Version│ Traffic Class │ Flow Label │
├───────┴───────────────┼───────────────────┬───────────────────┤
│ Payload Length │ Next Header │ Hop Limit │
├───────────────────────┴───────────────────┴───────────────────┤
│ │
│ Source Address │
│ (128 bits) │
│ │
├───────────────────────────────────────────────────────────────┤
│ │
│ Destination Address │
│ (128 bits) │
│ │
└───────────────────────────────────────────────────────────────┘
Key Differences from IPv4
| IPv4 | IPv6 |
|---|---|
| Variable header (20-60 bytes) | Fixed header (40 bytes) |
| Header checksum | No checksum (relies on link layer) |
| Fragmentation in header | Extension headers |
| Options in header | Extension headers |
Extension Headers
IPv6 uses extension headers for optional features. They chain together:
┌──────────────┬────────────────┬──────────────┬─────────────┐
│ IPv6 Header │ Hop-by-Hop │ Destination │ TCP │
│ Next: Hop-by │ Next: Dest │ Next: TCP │ Segment │
│ -Hop │ Options │ │ │
└──────────────┴────────────────┴──────────────┴─────────────┘
Common Extension Headers:
- Hop-by-Hop Options (processed by every router)
- Routing (specify intermediate routers)
- Fragment (for packet fragmentation)
- Authentication Header (IPsec)
- Encapsulating Security Payload (IPsec encryption)
- Destination Options (for destination only)
Address Types
IPv6 has three address types (no broadcast!):
┌─────────────────────────────────────────────────────────────┐
│ IPv6 Address Types │
├─────────────────────────────────────────────────────────────┤
│ Unicast One-to-one communication │
│ Single sender, single receiver │
│ │
│ Multicast One-to-many communication │
│ Single sender, multiple receivers │
│ (Replaces broadcast) │
│ │
│ Anycast One-to-nearest communication │
│ Delivered to closest node in a group │
│ (Same address on multiple nodes) │
└─────────────────────────────────────────────────────────────┘
Special Address Prefixes
Prefix Type Purpose
──────────────────────────────────────────────────────────────
::1/128 Loopback Local host (localhost)
::/128 Unspecified No address assigned
fe80::/10 Link-local Same network only
fc00::/7 Unique local Private addresses
ff00::/8 Multicast Group communication
2000::/3 Global unicast Public internet
::ffff:0:0/96 IPv4-mapped IPv4 in IPv6 format
64:ff9b::/96 IPv4-IPv6 translation NAT64
Link-Local Addresses
Every IPv6 interface automatically gets a link-local address in fe80::/10:
Interface: eth0
Link-local: fe80::1a2b:3c4d:5e6f:7890
These addresses:
- Auto-generated from MAC address (or random)
- Valid only on local network segment
- Not routed beyond local link
- Always present, even without DHCP/manual config
Global Unicast Addresses
Public IPv6 addresses typically start with 2 or 3:
2001:db8:1234:5678:9abc:def0:1234:5678
└───────────┬──────────┘└────────┬──────┘
Routing Prefix Interface ID
(Network portion) (Host portion)
Typical allocation:
/48 - Organization gets this from ISP
/64 - Single subnet (standard recommendation)
Address Autoconfiguration
IPv6 supports Stateless Address Autoconfiguration (SLAAC)—devices can configure their own addresses without DHCP:
1. Interface comes up
↓
2. Generate link-local address (fe80::...)
↓
3. Router sends Router Advertisement (RA)
Contains: Network prefix (e.g., 2001:db8:1::/64)
↓
4. Host generates global address:
Prefix from RA + Interface ID = Global Address
2001:db8:1::1a2b:3c4d:5e6f:7890/64
↓
5. Host verifies uniqueness (DAD - Duplicate Address Detection)
↓
6. Address is ready to use!
DHCPv6 is available for networks needing more control.
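The classic interface-ID derivation used in step 4 ("modified EUI-64") can be sketched in a few lines: flip the universal/local bit of the MAC's first byte and splice ff:fe into the middle. Many modern systems generate random or stable-privacy IDs instead. The MAC below is a made-up example chosen to match the fe80::215:5dff:fe00:0 address shown earlier:

```python
def eui64_interface_id(mac: str) -> str:
    b = bytearray.fromhex(mac.replace(':', ''))
    b[0] ^= 0x02                       # flip the universal/local bit
    b = b[:3] + b'\xff\xfe' + b[3:]    # splice ff:fe into the middle
    # Render as four 16-bit hex groups, leading zeros dropped
    return ':'.join(f'{b[i] << 8 | b[i + 1]:x}' for i in range(0, 8, 2))

print('fe80::' + eui64_interface_id('00:15:5d:00:00:00'))
# fe80::215:5dff:fe00:0
```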
IPv4 to IPv6 Transition
The world is slowly transitioning. Several mechanisms help:
Dual Stack
Devices run both IPv4 and IPv6:
┌─────────────────────────────────────┐
│ Application │
├──────────────┬──────────────────────┤
│ IPv4 │ IPv6 │
├──────────────┼──────────────────────┤
│ Network Interface │
└─────────────────────────────────────┘
Device has both:
IPv4: 192.168.1.100
IPv6: 2001:db8::1234
Tunneling
IPv6 packets wrapped in IPv4 to cross IPv4-only networks:
┌───────────────────────────────────────────────────────┐
│ IPv4 Header │
│ (src: 203.0.113.1, dst: 198.51.100.1) │
├───────────────────────────────────────────────────────┤
│ ┌───────────────────────────────────────────────────┐ │
│ │ IPv6 Header │ │
│ │ (src: 2001:db8::1, dst: 2001:db8::2) │ │
│ ├───────────────────────────────────────────────────┤ │
│ │ Original Data │ │
│ └───────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────┘
NAT64/DNS64
Allows IPv6-only devices to reach IPv4 servers:
IPv6-only Client NAT64 Gateway IPv4 Server
│ │ │
│─────IPv6 packet──────────────>│ │
│ dst: 64:ff9b::93.184.216.34 │ │
│ │──────IPv4 packet────────>│
│ │ dst: 93.184.216.34 │
│ │ │
│ │<─────IPv4 response───────│
│<─────IPv6 response───────────│ │
Working with IPv6
Command Line
# Show IPv6 addresses
$ ip -6 addr show
2: eth0: <BROADCAST,MULTICAST,UP>
inet6 2001:db8::1/64 scope global
inet6 fe80::1/64 scope link
# Ping IPv6
$ ping6 ::1
$ ping -6 google.com
# Trace route
$ traceroute6 google.com
# DNS lookup
$ dig AAAA google.com
In URLs
IPv6 addresses in URLs must be bracketed:
http://[2001:db8::1]:8080/path
└─────────────┘
IPv6 address in brackets
Without brackets, colons are ambiguous:
http://2001:db8::1:8080 ← Is 8080 the port or part of address?
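Python's urllib.parse honors the bracket syntax and strips the brackets when reporting the hostname:

```python
from urllib.parse import urlsplit

parts = urlsplit('http://[2001:db8::1]:8080/path')
print(parts.hostname)  # 2001:db8::1
print(parts.port)      # 8080
print(parts.path)      # /path
```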
Python
import ipaddress
# Parse IPv6
ip = ipaddress.ip_address('2001:db8::1')
print(ip.is_global) # False (2001:db8::/32 is reserved for documentation)
print(ip.is_link_local) # False
print(ip.exploded) # 2001:0db8:0000:0000:0000:0000:0000:0001
# Network operations
net = ipaddress.ip_network('2001:db8::/32')
print(net.num_addresses) # 79228162514264337593543950336
# Socket programming
import socket
sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
sock.connect(('2001:db8::1', 80))
IPv6 Adoption Status
As of recent measurements:
- ~40% of Google traffic is over IPv6
- Major cloud providers fully support IPv6
- Mobile networks often IPv6-primary
- Many ISPs support IPv6 (but not all)
Adoption varies by region:
India: ~70% IPv6
USA: ~50% IPv6
Germany: ~60% IPv6
China: ~30% IPv6
Global: ~40% IPv6 (and growing)
Practical Considerations
When You Need IPv6
- Modern mobile app development
- IoT devices (often IPv6-only)
- Reaching IPv6-only users
- Future-proofing infrastructure
Common Issues
“Network unreachable” to IPv6 addresses
- Your network may not have IPv6 connectivity
- Check: ping6 ::1 (loopback; should always work)
- Check: ping6 google.com (requires IPv6 internet connectivity)
Application doesn’t support IPv6
- Some older software hardcodes IPv4
- Check for IPv6/dual-stack support in dependencies
Firewall not configured for IPv6
- IPv6 rules are often separate from IPv4
- Don’t forget to configure both!
Summary
IPv6 solves IPv4’s address exhaustion with a vastly larger address space:
| Feature | IPv4 | IPv6 |
|---|---|---|
| Address size | 32 bits | 128 bits |
| Address format | Dotted decimal | Colon hexadecimal |
| Header size | Variable (20-60) | Fixed (40) |
| Address config | DHCP or manual | SLAAC, DHCPv6, or manual |
| NAT | Common | Generally unnecessary |
| IPsec | Optional | Built-in |
The transition to IPv6 is ongoing but inevitable. New projects should support both protocols.
Next, we’ll look at subnetting—how to divide networks into smaller, manageable pieces.
Subnetting
Subnetting is the practice of dividing a larger network into smaller, logical segments. It’s a fundamental skill for network design and one of the most practical topics in IP networking.
Why Subnet?
Without subnetting, you’d have one flat network for all devices:
Single Network (No Subnetting):
┌─────────────────────────────────────────────────────────────┐
│ All 65,534 possible hosts on one network segment │
│ │
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ... thousands │
│ │PC1│ │PC2│ │PC3│ │Srv│ │Dev│ │IoT│ │...│ more │
│ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘ │
│ │
│ Problems: │
│ - Broadcast traffic reaches everyone │
│ - No isolation between departments │
│ - Security is harder to manage │
│ - Single failure can affect everyone │
└─────────────────────────────────────────────────────────────┘
With subnetting:
Subnetted Network:
┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐
│ Engineering │ │ Marketing │ │ Servers │
│ 192.168.1.0/26 │ │ 192.168.1.64/26 │ │ 192.168.1.128/26 │
│ │ │ │ │ │
│ ┌───┐ ┌───┐ ┌───┐ │ │ ┌───┐ ┌───┐ │ │ ┌───┐ ┌───┐ │
│ │PC1│ │PC2│ │PC3│ │ │ │PC4│ │PC5│ │ │ │Web│ │DB │ │
│ └───┘ └───┘ └───┘ │ │ └───┘ └───┘ │ │ └───┘ └───┘ │
│ 62 usable hosts │ │ 62 usable hosts │ │ 62 usable hosts │
└─────────┬──────────┘ └─────────┬──────────┘ └─────────┬──────────┘
│ │ │
└──────────────────────┼──────────────────────┘
│
┌────┴────┐
│ Router │
└─────────┘
Benefits:
- Broadcast containment: Broadcasts stay within subnet
- Security: Apply different policies to different subnets
- Organization: Logical grouping of related systems
- Performance: Less broadcast traffic per segment
- Troubleshooting: Easier to isolate issues
Understanding CIDR Notation
CIDR (Classless Inter-Domain Routing) notation specifies how many bits are used for the network portion:
192.168.1.0/24
└── 24 network bits, 8 host bits
Binary breakdown:
Address: 11000000.10101000.00000001.00000000
└────────── 24 bits ──────────┘└ 8 ┘
Network Host
Subnet mask: 255.255.255.0
11111111.11111111.11111111.00000000
Common CIDR Blocks
| CIDR | Subnet Mask | Network Bits | Host Bits | Usable Hosts |
|---|---|---|---|---|
| /8 | 255.0.0.0 | 8 | 24 | 16,777,214 |
| /16 | 255.255.0.0 | 16 | 16 | 65,534 |
| /24 | 255.255.255.0 | 24 | 8 | 254 |
| /25 | 255.255.255.128 | 25 | 7 | 126 |
| /26 | 255.255.255.192 | 26 | 6 | 62 |
| /27 | 255.255.255.224 | 27 | 5 | 30 |
| /28 | 255.255.255.240 | 28 | 4 | 14 |
| /29 | 255.255.255.248 | 29 | 3 | 6 |
| /30 | 255.255.255.252 | 30 | 2 | 2 |
| /31 | 255.255.255.254 | 31 | 1 | 2* |
| /32 | 255.255.255.255 | 32 | 0 | 1 |
/31 is special: Used for point-to-point links (no broadcast needed).
Why “Usable Hosts” Is Less Than 2^n
Two addresses in every subnet are reserved:
- Network address: All host bits = 0 (identifies the subnet)
- Broadcast address: All host bits = 1 (reaches all hosts)
192.168.1.0/24:
Network: 192.168.1.0 (first address)
Hosts: 192.168.1.1 - 192.168.1.254
Broadcast: 192.168.1.255 (last address)
Usable = 2^(host bits) - 2 = 256 - 2 = 254
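You can check this arithmetic with Python's standard ipaddress module; `hosts()` already excludes the network and broadcast addresses:

```python
import ipaddress

net = ipaddress.ip_network("192.168.1.0/24")
hosts = list(net.hosts())        # excludes network and broadcast addresses

print(net.num_addresses)         # 256 total addresses
print(len(hosts))                # 254 usable
print(hosts[0], hosts[-1])       # 192.168.1.1 192.168.1.254
print(net.broadcast_address)     # 192.168.1.255
```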
Calculating Subnets
Method 1: Binary Calculation
Given 192.168.1.0/24, create 4 subnets:
Step 1: Determine bits needed
4 subnets = 2^2, so we need 2 additional network bits
New prefix: /24 + 2 = /26
Step 2: Calculate subnet size
Host bits = 32 - 26 = 6
Addresses per subnet = 2^6 = 64
Usable hosts = 64 - 2 = 62
Step 3: List subnets (increment by 64)
Subnet 0: 192.168.1.0/26 (hosts .1-.62, broadcast .63)
Subnet 1: 192.168.1.64/26 (hosts .65-.126, broadcast .127)
Subnet 2: 192.168.1.128/26 (hosts .129-.190, broadcast .191)
Subnet 3: 192.168.1.192/26 (hosts .193-.254, broadcast .255)
Method 2: The “Magic Number” Method
The “magic number” is 256 minus the last non-zero octet of the subnet mask:
For /26: Mask = 255.255.255.192
Magic number = 256 - 192 = 64
Subnets start at multiples of 64:
192.168.1.0, 192.168.1.64, 192.168.1.128, 192.168.1.192
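The magic-number rule fits in a few lines of Python (a sketch; `subnet_base` is a hypothetical helper name, not a library function):

```python
def subnet_base(octet: int, mask_octet: int) -> int:
    """Round an address octet down to the start of its subnet."""
    magic = 256 - mask_octet          # subnet size within this octet
    return (octet // magic) * magic

# /26 → last mask octet is 192, magic number 64
print([subnet_base(o, 192) for o in (0, 63, 64, 147, 200)])
# [0, 0, 64, 128, 192]
```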
Subnet Calculation Chart
┌──────────────────────────────────────────────────────────────────────┐
│ CIDR   Mask               Magic#   Subnets (from /24)   Usable Hosts │
├──────────────────────────────────────────────────────────────────────┤
│ /25    255.255.255.128    128      2                    126          │
│ /26    255.255.255.192    64       4                    62           │
│ /27    255.255.255.224    32       8                    30           │
│ /28    255.255.255.240    16       16                   14           │
│ /29    255.255.255.248    8        32                   6            │
│ /30    255.255.255.252    4        64                   2            │
└──────────────────────────────────────────────────────────────────────┘
Practical Examples
Example 1: Office Network Design
Requirement: Design a network for a small office with:
- 50 employees (workstations)
- 10 servers
- 5 network devices
- Room for 50% growth
Given: 192.168.10.0/24
Solution:
Department Hosts Needed Subnet Usable Range
────────────────────────────────────────────────────────────────
Workstations 50 (→75) /25 (126) 192.168.10.0/25
.1 - .126
Servers 10 (→15) /27 (30) 192.168.10.128/27
.129 - .158
Network Devices 5 (→8) /28 (14) 192.168.10.160/28
.161 - .174
Future Use - /28 (14) 192.168.10.176/28
Management - /28 (14) 192.168.10.192/28
Remaining: 192.168.10.208 - 192.168.10.255 (three /28 blocks for future use)
Example 2: Finding Subnet for an IP
Question: What subnet does 192.168.1.147/26 belong to?
Step 1: Find the magic number
/26 mask = 255.255.255.192
Magic = 256 - 192 = 64
Step 2: Find which multiple of 64 contains .147
0, 64, 128, 192...
128 ≤ 147 < 192
Step 3: Answer
Network: 192.168.1.128/26
Range: 192.168.1.128 - 192.168.1.191
Broadcast: 192.168.1.191
Example 3: Are Two IPs on Same Subnet?
Question: Are 10.1.1.50/28 and 10.1.1.60/28 on the same subnet?
For /28: Magic number = 256 - 240 = 16
10.1.1.50: Falls in 10.1.1.48/28 (48 ≤ 50 < 64)
10.1.1.60: Falls in 10.1.1.48/28 (48 ≤ 60 < 64)
Answer: Yes, same subnet (10.1.1.48/28)
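The same check can be automated with the ipaddress module; `same_subnet` below is a hypothetical helper for illustration:

```python
import ipaddress

def same_subnet(ip_a: str, ip_b: str, prefix: int) -> bool:
    """True if both addresses fall within the same /prefix network."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix}").network
    return net_a == net_b

print(same_subnet("10.1.1.50", "10.1.1.60", 28))  # True  (both in 10.1.1.48/28)
print(same_subnet("10.1.1.50", "10.1.1.70", 28))  # False (.70 is in 10.1.1.64/28)
```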
VLSM (Variable Length Subnet Mask)
VLSM allows different subnets to have different sizes, optimizing address usage:
Without VLSM (Fixed /26):
┌──────────────────────────────────────────────────────────────┐
│ Dept A: 60 hosts │ Dept B: 10 hosts │ Links: 2 hosts │
│ /26 (62 usable) ✓ │ /26 (62 usable) │ /26 (62 usable)│
│ │ 52 wasted! │ 60 wasted! │
└──────────────────────────────────────────────────────────────┘
With VLSM:
┌──────────────────────────────────────────────────────────────┐
│ Dept A: 60 hosts │ Dept B: 10 hosts │ Links: 2 hosts │
│ /26 (62 usable) ✓ │ /28 (14 usable) ✓ │ /30 (2 usable)✓│
│ │ 4 spare │ 0 wasted │
└──────────────────────────────────────────────────────────────┘
VLSM Planning Process
- List requirements from largest to smallest
- Assign subnets starting with largest
- Use remaining space for smaller subnets
Given: 172.16.0.0/16
Requirements:
- Engineering: 500 hosts
- Sales: 100 hosts
- HR: 50 hosts
- Point-to-point links: 4 (need 2 hosts each)
Allocation:
Engineering: 172.16.0.0/23 (510 hosts) 172.16.0.1 - 172.16.1.254
Sales: 172.16.2.0/25 (126 hosts) 172.16.2.1 - 172.16.2.126
HR: 172.16.2.128/26 (62 hosts) 172.16.2.129 - 172.16.2.190
Link 1: 172.16.2.192/30 (2 hosts) 172.16.2.193 - 172.16.2.194
Link 2: 172.16.2.196/30 (2 hosts) 172.16.2.197 - 172.16.2.198
Link 3: 172.16.2.200/30 (2 hosts) 172.16.2.201 - 172.16.2.202
Link 4: 172.16.2.204/30 (2 hosts) 172.16.2.205 - 172.16.2.206
Remaining: 172.16.2.208 - 172.16.255.255 (available for future)
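The VLSM planning process can be sketched as a small allocator (an illustrative sketch, not production code; `prefix_for` and `allocate` are hypothetical helpers):

```python
import ipaddress

def prefix_for(hosts: int) -> int:
    """Smallest prefix whose usable hosts (2^h - 2) cover the requirement."""
    bits = 2
    while (2 ** bits) - 2 < hosts:
        bits += 1
    return 32 - bits

def allocate(block: str, demands):
    """Carve subnets out of `block`; demands = [(name, hosts)], largest first."""
    base = ipaddress.ip_network(block)
    cursor = int(base.network_address)
    plan = []
    for name, hosts in demands:
        prefix = prefix_for(hosts)
        size = 2 ** (32 - prefix)
        cursor = (cursor + size - 1) // size * size  # align to subnet boundary
        plan.append((name, ipaddress.ip_network((cursor, prefix))))
        cursor += size
    return plan

plan = allocate("172.16.0.0/16", [
    ("Engineering", 500), ("Sales", 100), ("HR", 50),
    ("Link1", 2), ("Link2", 2), ("Link3", 2), ("Link4", 2),
])
for name, net in plan:
    print(name, net)   # matches the allocation above: 172.16.0.0/23, /25, /26, /30s
```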
Supernetting (Route Aggregation)
Supernetting (or CIDR aggregation) combines multiple smaller networks into one larger route:
Before aggregation (4 routes):
192.168.0.0/24
192.168.1.0/24
192.168.2.0/24
192.168.3.0/24
After aggregation (1 route):
192.168.0.0/22
This reduces routing table size and improves router efficiency.
Binary visualization:
192.168.0.0 = 11000000.10101000.000000|00.00000000
192.168.1.0 = 11000000.10101000.000000|01.00000000
192.168.2.0 = 11000000.10101000.000000|10.00000000
192.168.3.0 = 11000000.10101000.000000|11.00000000
└──────┘
These bits vary
Common prefix: 22 bits → /22 covers all four
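Python's ipaddress module can perform this aggregation directly:

```python
import ipaddress

routes = [ipaddress.ip_network(f"192.168.{i}.0/24") for i in range(4)]
summary = list(ipaddress.collapse_addresses(routes))

print(summary)  # [IPv4Network('192.168.0.0/22')]
```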
IPv6 Subnetting
IPv6 subnetting is conceptually similar but the numbers are larger:
Standard allocation:
ISP receives: /32 or /48 from registry
Organization gets: /48 from ISP
Site/Subnet: /64 (standard LAN)
/48 to /64 gives: 16 bits = 65,536 subnets
Each /64 has: 64 bits for hosts = 2^64 addresses
Example:
Organization: 2001:db8:abcd::/48
Subnets:
2001:db8:abcd:0000::/64 - HQ Floor 1
2001:db8:abcd:0001::/64 - HQ Floor 2
2001:db8:abcd:0002::/64 - HQ Servers
...
2001:db8:abcd:ffff::/64 - 65,536th subnet
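The same tooling works for IPv6; here we enumerate the /64s inside the example /48:

```python
import ipaddress

org = ipaddress.ip_network("2001:db8:abcd::/48")
lan_iter = org.subnets(new_prefix=64)   # lazy generator over all 65,536 /64s

first = next(lan_iter)
print(first)                                     # 2001:db8:abcd::/64
print(org.num_addresses // first.num_addresses)  # 65536 subnets
```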
Tools for Subnetting
Command Line
# ipcalc (Linux)
$ ipcalc 192.168.1.0/26
Address: 192.168.1.0 11000000.10101000.00000001.00 000000
Netmask: 255.255.255.192 = 26 11111111.11111111.11111111.11 000000
Wildcard: 0.0.0.63 00000000.00000000.00000000.00 111111
Network: 192.168.1.0/26 11000000.10101000.00000001.00 000000
HostMin: 192.168.1.1 11000000.10101000.00000001.00 000001
HostMax: 192.168.1.62 11000000.10101000.00000001.00 111110
Broadcast: 192.168.1.63 11000000.10101000.00000001.00 111111
Hosts/Net: 62
# sipcalc (more features)
$ sipcalc 192.168.1.0/24 -s 26
Python
import ipaddress
# Create network
network = ipaddress.ip_network('192.168.1.0/24')
# Get subnet info
print(f"Network: {network.network_address}")
print(f"Netmask: {network.netmask}")
print(f"Broadcast: {network.broadcast_address}")
print(f"Hosts: {network.num_addresses - 2}")
# Divide into subnets
subnets = list(network.subnets(new_prefix=26))
for subnet in subnets:
    print(f"  {subnet}")
# Output:
# 192.168.1.0/26
# 192.168.1.64/26
# 192.168.1.128/26
# 192.168.1.192/26
# Check if IP is in network
ip = ipaddress.ip_address('192.168.1.100')
print(ip in network) # True
Common Mistakes
- Forgetting reserved addresses: always subtract 2 from the total for usable hosts
- Overlapping subnets: 192.168.1.0/25 and 192.168.1.64/26 overlap! Plan carefully, especially with VLSM
- Not planning for growth: networks grow; leave room for expansion
- Using /30 for LANs: /30 is for point-to-point links only; LANs need room for multiple hosts
Summary
Subnetting divides networks for better organization, security, and efficiency:
- CIDR notation (/24) indicates network vs. host bits
- Subnet mask shows the network boundary
- Magic number (256 - mask octet) gives subnet size
- VLSM allows different-sized subnets for efficiency
- Supernetting aggregates routes for simpler routing
Practice is key—work through examples until it becomes intuitive.
Next, we’ll explore how packets actually find their way across networks: routing fundamentals.
Routing Fundamentals
Routing is the process of selecting paths for network traffic. When you send a packet to a destination across the internet, it passes through many intermediate devices (routers) that make forwarding decisions. Understanding routing helps you debug connectivity issues and design better network architectures.
The Core Concept
Routing works through a simple, repeated process:
At each router:
┌─────────────────────────────────────────────────────────────┐
│ 1. Receive packet │
│ 2. Examine destination IP address │
│ 3. Consult routing table │
│ 4. Forward packet to next hop (or deliver locally) │
│ 5. Decrement TTL │
│ 6. Forget about the packet │
└─────────────────────────────────────────────────────────────┘
No router knows the complete path. Each makes a local decision.
This hop-by-hop routing is fundamental to the internet’s scalability and resilience.
Direct vs. Indirect Delivery
When a host wants to send a packet, it first determines if the destination is local (same network) or remote (different network):
Source: 192.168.1.100/24
Case 1: Destination 192.168.1.200 (same network)
┌───────────────────────────────────────────────────────────────┐
│ Apply subnet mask: │
│ 192.168.1.100 AND 255.255.255.0 = 192.168.1.0 │
│ 192.168.1.200 AND 255.255.255.0 = 192.168.1.0 │
│ Same network! → Direct delivery via ARP │
└───────────────────────────────────────────────────────────────┘
Case 2: Destination 10.0.0.50 (different network)
┌───────────────────────────────────────────────────────────────┐
│ Apply subnet mask: │
│ 192.168.1.100 AND 255.255.255.0 = 192.168.1.0 │
│ 10.0.0.50 AND 255.255.255.0 = 10.0.0.0 │
│ Different networks! → Send to default gateway │
└───────────────────────────────────────────────────────────────┘
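The mask-and-compare decision can be expressed in Python (a sketch; `is_local` is a hypothetical helper mirroring what the host's IP stack does):

```python
import ipaddress

def is_local(src: str, dst: str, prefix: int) -> bool:
    """True if dst is on src's own network (direct delivery via ARP)."""
    net = ipaddress.ip_interface(f"{src}/{prefix}").network
    return ipaddress.ip_address(dst) in net

print(is_local("192.168.1.100", "192.168.1.200", 24))  # True  → deliver directly
print(is_local("192.168.1.100", "10.0.0.50", 24))      # False → default gateway
```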
The Routing Table
A routing table maps destination networks to next hops. Every device with IP networking has one:
$ ip route show # Linux
$ netstat -rn # Linux/Mac
$ route print # Windows
Example output:
┌──────────────────┬──────────────────┬─────────────┬───────────┐
│ Destination │ Gateway │ Iface │ Metric │
├──────────────────┼──────────────────┼─────────────┼───────────┤
│ 0.0.0.0/0 │ 192.168.1.1 │ eth0 │ 100 │
│ 192.168.1.0/24 │ 0.0.0.0 │ eth0 │ 0 │
│ 10.10.0.0/16 │ 192.168.1.254 │ eth0 │ 100 │
│ 127.0.0.0/8 │ 0.0.0.0 │ lo │ 0 │
└──────────────────┴──────────────────┴─────────────┴───────────┘
Entry meanings:
- 0.0.0.0/0: Default route ("everything else") → send to 192.168.1.1
- 192.168.1.0/24: Local network → deliver directly (0.0.0.0 gateway)
- 10.10.0.0/16: Route to remote network → via 192.168.1.254
- 127.0.0.0/8: Loopback → handled locally
Routing Table Lookup
When forwarding a packet, the router finds the most specific matching route (longest prefix match):
Destination: 10.10.5.100
Routing table entries:
0.0.0.0/0 → Gateway A (default)
10.0.0.0/8 → Gateway B
10.10.0.0/16 → Gateway C
10.10.5.0/24 → Gateway D
Matching process:
0.0.0.0/0 - Matches (but only 0 bits specific)
10.0.0.0/8 - Matches (8 bits specific)
10.10.0.0/16 - Matches (16 bits specific)
10.10.5.0/24 - Matches (24 bits specific) ← WINNER
Result: Forward to Gateway D (most specific match)
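A minimal longest-prefix-match lookup over this example table (a sketch for clarity; real routers use tries or TCAM hardware, not linear scans):

```python
import ipaddress

table = {
    ipaddress.ip_network("0.0.0.0/0"):    "Gateway A",
    ipaddress.ip_network("10.0.0.0/8"):   "Gateway B",
    ipaddress.ip_network("10.10.0.0/16"): "Gateway C",
    ipaddress.ip_network("10.10.5.0/24"): "Gateway D",
}

def lookup(ip: str) -> str:
    """Return the next hop for the most specific matching route."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in table if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
    return table[best]

print(lookup("10.10.5.100"))  # Gateway D
print(lookup("10.10.99.1"))   # Gateway C
print(lookup("8.8.8.8"))      # Gateway A
```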
Static vs. Dynamic Routing
Static Routing
Routes manually configured by an administrator:
# Add a static route (Linux)
$ sudo ip route add 10.20.0.0/16 via 192.168.1.254
# Persistent (varies by distro, often in /etc/network/interfaces or netplan)
Pros:
- Simple, predictable
- No protocol overhead
- Good for small, stable networks
Cons:
- Doesn’t adapt to failures
- Tedious for large networks
- Error-prone at scale
Dynamic Routing
Routes learned automatically via routing protocols:
┌─────────────────────────────────────────────────────────────┐
│ Routing Protocols │
├─────────────────────────────────────────────────────────────┤
│ Interior Gateway Protocols (within organization): │
│ RIP - Simple, distance-vector, limited scale │
│ OSPF - Link-state, widely used, complex │
│ EIGRP - Cisco proprietary, efficient │
│ IS-IS - Link-state, used by large ISPs │
│ │
│ Exterior Gateway Protocol (between organizations): │
│ BGP - Border Gateway Protocol, runs the internet │
└─────────────────────────────────────────────────────────────┘
Pros:
- Automatically adapts to failures
- Scales to large networks
- Finds optimal paths
Cons:
- Protocol overhead
- More complex to configure
- Convergence time during changes
How Routing Protocols Work
Distance-Vector (RIP)
Routers share their entire routing table with neighbors periodically:
Initial state:
┌───────────────┐
Router A ─────── Router B ─────── Router C
Knows: Knows: Knows:
Net 1 Net 2 Net 3
After exchange:
Router A Router B Router C
Knows: Knows: Knows:
Net 1 (direct) Net 1 (via A) Net 1 (via B)
Net 2 (via B) Net 2 (direct) Net 2 (via B)
Net 3 (via B) Net 3 (via C) Net 3 (direct)
Link-State (OSPF)
Each router learns the complete network topology and calculates best paths:
1. Each router discovers neighbors
2. Each router floods link-state info to all routers
3. Every router has identical network map
4. Each router independently calculates best paths (Dijkstra's algorithm)
Advantage: Faster convergence, no routing loops during transition
Disadvantage: More memory and CPU intensive
BGP: The Internet’s Routing Protocol
BGP (Border Gateway Protocol) is how autonomous systems (AS) exchange routing information:
┌──────────────────┐ ┌──────────────────┐
│ AS 65001 │ │ AS 65002 │
│ (Your ISP) │──BGP────│ (Another ISP) │
│ │ │ │
│ Announces: │ │ Announces: │
│ 203.0.113.0/24 │ │ 198.51.100.0/24 │
└──────────────────┘ └──────────────────┘
BGP characteristics:
- Path-vector protocol (tracks AS path)
- Policy-based routing (not just shortest path)
- Slow convergence (stability over speed)
- ~900,000+ routes in global table
BGP Path Selection
BGP chooses routes based on multiple criteria (simplified):
- Highest local preference
- Shortest AS path
- Lowest origin type
- Lowest MED (Multi-Exit Discriminator)
- Prefer eBGP over iBGP
- Lowest IGP metric to next hop
- … (many more tie-breakers)
Routing in Action
Let’s trace a packet from your laptop to a web server:
Your Laptop (192.168.1.100)
│
│ Destination: 93.184.216.34 (example.com)
│ Different network → send to default gateway
▼
Home Router (192.168.1.1)
│
│ Routing table: default route → ISP
▼
ISP Router #1
│
│ BGP table: 93.184.216.0/24 → via AS 15133
│ (Multiple paths available, chooses best)
▼
ISP Router #2
│
│ BGP: Next hop toward destination AS
▼
... (several more hops) ...
│
▼
Destination Router
│
│ 93.184.216.0/24 is directly connected
│ ARP for 93.184.216.34, deliver to server
▼
Web Server (93.184.216.34)
Traceroute: Seeing the Path
Traceroute reveals the path packets take by exploiting TTL:
$ traceroute example.com
1 192.168.1.1 (192.168.1.1) 1.234 ms
2 96.120.92.1 (96.120.92.1) 12.456 ms
3 68.86.90.137 (68.86.90.137) 15.789 ms
4 * * * (no response)
5 be-33651-cr02.nyc (66.109.6.81) 25.123 ms
6 93.184.216.34 (93.184.216.34) 28.456 ms
How it works:
Send packet with TTL=1 → First router replies "TTL exceeded"
Send packet with TTL=2 → Second router replies
Send packet with TTL=3 → Third router replies
... continue until destination reached
Reading Traceroute Output
Hop 4: * * *
This means:
- Router didn't respond to traceroute probes
- Could be: firewall blocking, ICMP rate limiting, high latency
- Packets might still pass through this router fine
- Not necessarily a problem
Multiple times per hop:
3 68.86.90.137 15.789 ms 16.123 ms 14.567 ms
└─────────────┴─ Three separate probes, showing latency variation
Common Routing Issues
Routing Loops
Misconfiguration can cause packets to circle:
Router A: "To reach 10.0.0.0/8, send to B"
Router B: "To reach 10.0.0.0/8, send to C"
Router C: "To reach 10.0.0.0/8, send to A"
Packet bounces: A → B → C → A → B → C → ...
Until TTL reaches 0!
Solutions:
- TTL prevents infinite loops
- Routing protocols have loop prevention (split horizon, etc.)
- BGP uses AS path to detect loops
Asymmetric Routing
Outbound and inbound paths can differ:
Request: A → B → C → D → Server
Response: Server → E → F → A
This is normal and common!
Can complicate:
- Troubleshooting
- Stateful firewalls
- Performance analysis
Black Holes
Traffic enters but doesn’t come out:
Causes:
- Null route (route to nowhere)
- Firewall silently drops
- Network failure with no alternative path
- MTU issues (packets too large)
Debugging:
- Traceroute to find where packets stop
- Check routing tables
- Verify firewall rules
Routing Table Management
Viewing Routes
# Linux
$ ip route show
$ ip -6 route show # IPv6
# macOS
$ netstat -rn
# Windows
> route print
Adding/Removing Routes
# Linux - add route
$ sudo ip route add 10.20.0.0/16 via 192.168.1.254
# Linux - remove route
$ sudo ip route del 10.20.0.0/16
# Linux - change default gateway
$ sudo ip route replace default via 192.168.1.1
# Make persistent (varies by distro)
# Ubuntu/Debian: /etc/netplan/*.yaml
# RHEL/CentOS: /etc/sysconfig/network-scripts/route-eth0
Summary
Routing is the backbone of internet connectivity:
- Hop-by-hop forwarding: Each router makes local decisions
- Routing tables: Map destinations to next hops
- Longest prefix match: Most specific route wins
- Static routing: Manual configuration for simple networks
- Dynamic routing: Protocols (OSPF, BGP) for automatic adaptation
- BGP: The protocol that makes the internet work
Key debugging tools:
- ip route / netstat -rn: View routing tables
- traceroute / tracert: See packet paths
- ping: Test basic connectivity
Understanding routing helps you diagnose why packets aren’t reaching their destination and design networks that are resilient to failures.
Next, we’ll cover IP fragmentation—what happens when packets are too large for a network link.
IP Fragmentation
Different network links have different maximum packet sizes. When a packet is too large for a link, it must be fragmented—split into smaller pieces. Understanding fragmentation helps you diagnose performance problems and configure networks properly.
MTU: Maximum Transmission Unit
The MTU is the largest packet size a link can carry:
┌─────────────────────────────────────────────────────────────┐
│ Common MTU Values │
├─────────────────────────────────────────────────────────────┤
│ Ethernet: 1500 bytes (standard) │
│ Jumbo Frames: 9000 bytes (data centers) │
│ PPPoE (DSL): 1492 bytes │
│ VPN Tunnels: ~1400-1450 bytes (overhead) │
│ IPv6 minimum: 1280 bytes │
│ Dial-up (PPP): 576 bytes (historical) │
└─────────────────────────────────────────────────────────────┘
When a packet exceeds the outgoing link’s MTU, something must happen.
How Fragmentation Works
IPv4 routers can fragment packets when needed:
Original Packet (3000 bytes payload + 20 byte header = 3020 bytes)
┌──────────────────────────────────────────────────────────────┐
│IP Hdr│ Payload (3000 bytes) │
│ 20B │ │
└──────────────────────────────────────────────────────────────┘
Link MTU: 1500 bytes
Max payload per fragment: 1500 - 20 = 1480 bytes
After Fragmentation:
┌────────────────────────────────┐
│IP Hdr│ Fragment 1 (1480 B) │ Offset: 0, MF=1
│ 20B │ ID: 12345 │
└────────────────────────────────┘
┌────────────────────────────────┐
│IP Hdr│ Fragment 2 (1480 B) │ Offset: 1480, MF=1
│ 20B │ ID: 12345 │
└────────────────────────────────┘
┌─────────────────────┐
│IP Hdr│ Fragment 3 │ Offset: 2960, MF=0 (last fragment)
│ 20B │ (40 B) │ ID: 12345
└─────────────────────┘
Fragmentation Header Fields
Three IP header fields manage fragmentation:
┌───────────────────────────────────────────────────────────────┐
│ Identification (16 bits) │
│ Unique ID for the original packet │
│ All fragments share the same ID │
├───────────────────────────────────────────────────────────────┤
│ Flags (3 bits) │
│ Bit 0: Reserved (must be 0) │
│ Bit 1: DF (Don't Fragment) │
│ 0 = May fragment │
│ 1 = Don't fragment (drop if too big) │
│ Bit 2: MF (More Fragments) │
│ 0 = Last fragment (or unfragmented) │
│ 1 = More fragments follow │
├───────────────────────────────────────────────────────────────┤
│ Fragment Offset (13 bits) │
│ Position of this fragment in original packet │
│ Measured in 8-byte units (not bytes!) │
│ Max offset: 8191 × 8 = 65,528 bytes │
└───────────────────────────────────────────────────────────────┘
Fragment Offset Calculation
Fragment offsets must be multiples of 8 bytes:
Original payload: 3000 bytes
MTU: 1500 bytes
Max payload per fragment: 1480 bytes (must be multiple of 8)
Fragment 1:
Offset: 0 (bytes) / 8 = 0
Size: 1480 bytes
MF: 1 (more fragments)
Fragment 2:
Offset: 1480 / 8 = 185
Size: 1480 bytes
MF: 1 (more fragments)
Fragment 3:
Offset: 2960 / 8 = 370
Size: 40 bytes (remaining)
MF: 0 (last fragment)
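The offset arithmetic generalizes to a short function (a sketch; `fragment` is a hypothetical helper returning (offset in 8-byte units, payload size, MF flag) per fragment):

```python
def fragment(payload_len: int, mtu: int, ihl: int = 20):
    """Split an IPv4 payload into fragments that fit the link MTU."""
    max_data = (mtu - ihl) // 8 * 8   # fragment payload must be a multiple of 8
    frags, offset = [], 0
    while offset < payload_len:
        size = min(max_data, payload_len - offset)
        mf = 1 if offset + size < payload_len else 0   # MF=0 on last fragment
        frags.append((offset // 8, size, mf))
        offset += size
    return frags

print(fragment(3000, 1500))  # [(0, 1480, 1), (185, 1480, 1), (370, 40, 0)]
```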
Reassembly
Fragments are reassembled only at the final destination, not at intermediate routers:
Sender → Router1 → Router2 → Router3 → Receiver
│
▼
┌──────────────────┐
│ Reassembly │
│ │
│ Wait for all │
│ fragments with │
│ same ID │
│ │
│ Arrange by │
│ offset │
│ │
│ Check MF=0 for │
│ last piece │
│ │
│ Reconstruct │
│ original packet │
└──────────────────┘
Reassembly Timeout
If fragments don’t all arrive within a timeout (typically 30-120 seconds), the partial packet is discarded:
Fragment 1: ✓ Received
Fragment 2: ✓ Received
Fragment 3: ✗ Lost
After timeout:
All fragments discarded
ICMP "Fragment Reassembly Time Exceeded" may be sent
Upper layer (TCP) must retransmit entire original packet
Problems with Fragmentation
Fragmentation has significant drawbacks:
1. Performance Overhead
Single 3000-byte packet vs. 3 fragments:
Original (1 packet):
Processing: 1 header lookup
Transmission: 1 packet
Fragmented (3 packets):
Processing: 3 header lookups (3x)
Headers: 60 bytes (vs. 20)
Reassembly: Buffer management, timeout tracking
2. Fragment Loss Amplification
If any fragment is lost, the entire packet is lost:
3 fragments, 1% loss rate each:
Probability all arrive = 0.99³ = 97%
Probability of packet loss = 3%
vs. unfragmented: 1% loss
More fragments = higher effective loss rate
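The amplification is easy to compute (a small sketch; `effective_loss` is an illustrative helper):

```python
def effective_loss(fragments: int, per_packet_loss: float) -> float:
    """The datagram survives only if every fragment arrives."""
    return 1 - (1 - per_packet_loss) ** fragments

print(round(effective_loss(1, 0.01), 4))  # 0.01   (unfragmented)
print(round(effective_loss(3, 0.01), 4))  # 0.0297 (~3x effective loss)
```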
3. Security Issues
- Tiny fragment attacks: Malicious fragments too small to contain port numbers
- Overlapping fragment attacks: Crafted to bypass firewalls
- Fragment flood DoS: Exhaust reassembly buffers
Many firewalls drop fragments by default.
4. Stateful Firewall Problems
Firewall examines:
Source/Dest IP: In every fragment ✓
Source/Dest Port: Only in FIRST fragment!
Fragment 2 arrives first:
No port information
Firewall can't apply port-based rules
May drop or pass incorrectly
Path MTU Discovery (PMTUD)
Modern systems avoid fragmentation using Path MTU Discovery:
1. Sender sends packet with DF (Don't Fragment) bit set
2. If packet is too large, router sends ICMP "Fragmentation Needed"
3. Sender reduces packet size and retries
4. Repeat until path MTU is found
┌────────┐ ┌────────┐
│ Sender │ │ Dest │
└───┬────┘ └───┬────┘
│ │
│──── 1500 byte packet, DF=1 ──────────>│
│ │
│ ┌────────┐ │
│ │ Router │ │
│ │MTU=1400│ │
│ └───┬────┘ │
│ │ │
│<── ICMP "Frag Needed, MTU=1400" ──────│
│ │
│──── 1400 byte packet, DF=1 ──────────>│
│ │
│<─────────── Response ─────────────────│
PMTUD Problems
Black Hole Routers: Some routers don’t send ICMP messages (or firewalls block them):
Sender → Router1 → Router2 → Dest
│
└── Has MTU 1400
Drops packet (too big, DF=1)
Doesn't send ICMP (broken/filtered)
Sender keeps trying 1500-byte packets
All silently dropped = "black hole"
Workarounds:
- MSS clamping (TCP)
- Fallback to minimum MTU
- Manual MTU configuration
TCP and MTU
TCP uses the MSS (Maximum Segment Size) to avoid IP fragmentation:
MSS = MTU - IP header - TCP header
MSS = 1500 - 20 - 20 = 1460 bytes (typical)
TCP segments are sized to fit in one IP packet:
┌──────────────────────────────────────────────────┐
│ IP Hdr │ TCP Hdr │ TCP Data (≤ MSS) │
│ 20B │ 20B │ 1460 bytes │
└──────────────────────────────────────────────────┘
Total: ≤ MTU (1500 bytes)
MSS is negotiated during the TCP handshake:
Client → SYN, MSS=1460
Server → SYN-ACK, MSS=1400
(Server's path MTU is smaller)
Connection uses minimum: MSS=1400
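The negotiation outcome is just a minimum over the advertised values (illustrative sketch; `mss_for_mtu` is a hypothetical helper):

```python
def mss_for_mtu(mtu: int, ip_hdr: int = 20, tcp_hdr: int = 20) -> int:
    """MSS that keeps one TCP segment inside one IP packet."""
    return mtu - ip_hdr - tcp_hdr

client_mss = mss_for_mtu(1500)      # 1460 on standard Ethernet
server_mss = 1400                   # server advertises a smaller value
print(min(client_mss, server_mss))  # 1400 — the connection uses the minimum
```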
IPv6 and Fragmentation
IPv6 handles fragmentation differently:
IPv4:
- Routers can fragment
- Sender can fragment
- Minimum reassembly size: 576 bytes (link MTU may be as small as 68)
IPv6:
- Routers CANNOT fragment (must be done at source)
- PMTUD is mandatory
- Minimum MTU: 1280 bytes
- Uses Fragment extension header
IPv6’s approach improves router performance (no fragmentation processing) but requires working PMTUD.
Practical Considerations
Checking MTU
# Linux - show interface MTU
$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500
# macOS
$ ifconfig en0 | grep mtu
# Test path MTU with ping
$ ping -M do -s 1472 example.com # Linux: -M do = DF bit
$ ping -D -s 1472 example.com # macOS: -D = DF bit
# 1472 + 8 (ICMP) + 20 (IP) = 1500
# If it fails, reduce size until it works
Setting MTU
# Linux - temporary
$ sudo ip link set eth0 mtu 1400
# Linux - permanent (varies by distro)
# Netplan (Ubuntu):
# /etc/netplan/01-network.yaml
network:
ethernets:
eth0:
mtu: 1400
# Windows
> netsh interface ipv4 set subinterface "Ethernet" mtu=1400
Common MTU Issues
VPN tunnels:
Original: 1500 MTU
VPN overhead: ~60-80 bytes (encryption, headers)
Effective MTU: ~1420-1440 bytes
If not configured, causes fragmentation or black holes
Docker/containers:
Host MTU: 1500
Container default: 1500
Overlay network: Adds headers
May need: MTU 1450 or lower inside containers
PPPoE (DSL):
Ethernet MTU: 1500
PPPoE overhead: 8 bytes
Effective: 1492 MTU
ISP-provided routers usually handle this
Manual configurations may need adjustment
Debugging Fragmentation Issues
Symptoms
- Large file transfers fail, small requests work
- Connections hang during data transfer
- Works on LAN, fails over VPN/WAN
- PMTUD blackhole (DF packets disappear)
Diagnostics
# Check if fragmentation is occurring
$ netstat -s | grep -i frag # Linux
fragments received
fragments created
# Tcpdump for fragments
$ tcpdump -i eth0 'ip[6:2] & 0x3fff != 0'
# Test specific sizes
$ ping -M do -s SIZE destination
# Traceroute with MTU discovery
$ tracepath example.com
Summary
IP fragmentation handles oversized packets but comes with costs:
| Aspect | Impact |
|---|---|
| Performance | Multiple packets, reassembly overhead |
| Reliability | One lost fragment = lost packet |
| Security | Fragment attacks, firewall issues |
| Modern approach | Avoid via PMTUD, MSS clamping |
Best practices:
- Design for 1500-byte MTU (or smaller if tunneling)
- Use PMTUD where possible
- Configure MSS clamping on border routers
- Test with large packets during deployment
IPv6 eliminates router fragmentation entirely, making PMTUD mandatory but more predictable.
This completes our coverage of the IP layer. Next, we’ll dive deep into TCP—the protocol that provides reliable, ordered delivery on top of IP’s best-effort service.
TCP Deep Dive
TCP (Transmission Control Protocol) transforms IP’s unreliable packet delivery into a reliable, ordered byte stream. It’s the foundation for most internet applications—web browsing, email, file transfer, and API calls all typically use TCP.
What TCP Provides
TCP adds these guarantees on top of IP:
┌─────────────────────────────────────────────────────────────┐
│ TCP Guarantees │
├─────────────────────────────────────────────────────────────┤
│ ✓ Reliable Delivery │
│ Lost packets are detected and retransmitted │
│ │
│ ✓ Ordered Delivery │
│ Data arrives in the order it was sent │
│ │
│ ✓ Error Detection │
│ Corrupted data is detected via checksums │
│ │
│ ✓ Flow Control │
│ Sender doesn't overwhelm receiver │
│ │
│ ✓ Congestion Control │
│ Sender doesn't overwhelm the network │
│ │
│ ✓ Connection-Oriented │
│ Explicit setup and teardown │
│ │
│ ✓ Full-Duplex │
│ Data flows in both directions simultaneously │
└─────────────────────────────────────────────────────────────┘
The Trade-offs
These guarantees come at a cost:
Reliability vs. Latency
──────────────────────────────────────────
TCP must wait for acknowledgments
Lost packet? Wait for retransmission
Connection setup requires round trips
Ordering vs. Throughput
──────────────────────────────────────────
Head-of-line blocking: One lost packet
stalls delivery of everything behind it
Sent:     1 2 3 4 5 6 7
              ↑
             Lost

Received: 1 2 . 4 5 6 7
              └─ 4-7 buffered; can't be delivered until 3 is retransmitted
This is why some applications (real-time video, gaming) use UDP instead.
TCP vs. IP
Think of TCP and IP as two different jobs:
┌─────────────────────────────────────────────────────────────┐
│ IP │
│ "I'll try to get this packet to the destination address" │
│ │
│ - No guarantee of delivery │
│ - No guarantee of order │
│ - Packets are independent │
│ - Fast, simple, stateless │
└─────────────────────────────────────────────────────────────┘
↑
│
┌─────────────────────────────────────────────────────────────┐
│ TCP │
│ "I'll make sure all data arrives correctly and in order" │
│ │
│ - Reliable delivery (detects loss, retransmits) │
│ - Ordered delivery (sequence numbers) │
│ - Connection state (both sides track progress) │
│ - Slower, complex, stateful │
└─────────────────────────────────────────────────────────────┘
Key Concepts Preview
Sequence Numbers
Every byte in a TCP stream has a sequence number:
Application sends: "Hello, World!" (13 bytes)
TCP assigns:
Seq 1000: 'H'
Seq 1001: 'e'
Seq 1002: 'l'
...
Seq 1012: '!'
Segments might be:
Segment 1: Seq=1000, "Hello, "
Segment 2: Seq=1007, "World!"
Acknowledgments
The receiver tells the sender what it’s received:
Sender Receiver
│ │
│──── Seq=1000, "Hello" ───────>│
│ │
│<──── ACK=1005 ────────────────│
│ "I've received up to │
│ byte 1004, send 1005" │
The Window
The receiver advertises how much data it can accept:
"My receive buffer can hold 65535 more bytes"
Window = 65535
Sender can send that much without waiting for ACKs
(Sliding window protocol)
What You’ll Learn
This chapter covers TCP in depth:
- The Three-Way Handshake: How connections are established
- TCP Header and Segments: The packet format and key fields
- Flow Control: Preventing receiver overload
- Congestion Control: Preventing network overload
- Retransmission: How lost data is recovered
- TCP States: The connection lifecycle
Understanding TCP helps you:
- Debug connection problems
- Optimize application performance
- Make informed protocol choices
- Understand why things sometimes feel slow
Let’s start with the handshake—how two systems establish a TCP connection.
The Three-Way Handshake
Before TCP can transfer data, both sides must establish a connection. This happens through a three-message exchange called the three-way handshake.
Why a Handshake?
The handshake serves several purposes:
- Verify both endpoints are reachable and willing
- Exchange initial sequence numbers (ISNs)
- Negotiate connection parameters (MSS, window scaling, etc.)
- Synchronize state between client and server
The Three Steps
Client Server
│ │
│ State: LISTEN │
│ (waiting for connections) │
│ │
┌────┴────┐ │
│ SYN │ │
│ Seq=100 │────────────────────────────────────>
│ │ "I want to connect, │
└─────────┘ my ISN is 100" │
│ │
│ ┌────┴────┐
│ │ SYN-ACK │
<─────────────────────────────────────│Seq=300 │
│ "OK, I acknowledge your SYN │ACK=101 │
│ (expecting byte 101 next), └────────┘
│ and here's my ISN: 300" │
│ │
┌────┴────┐ │
│ ACK │ │
│ACK=301 │────────────────────────────────────>
│ │ "I acknowledge your SYN, │
└─────────┘ expecting byte 301" │
│ │
│ CONNECTION ESTABLISHED │
│ │
Step 1: SYN (Synchronize)
Client initiates the connection:
TCP Header:
┌─────────────────────────────────────────┐
│ Source Port: 52431 │
│ Dest Port: 80 │
│ Sequence Number: 100 (client's ISN) │
│ Acknowledgment: 0 (not yet used) │
│ Flags: SYN=1 │
│ Window: 65535 │
│ Options: MSS=1460, Window Scale, etc. │
└─────────────────────────────────────────┘
The Initial Sequence Number (ISN) is randomized for security reasons (prevents sequence prediction attacks).
Step 2: SYN-ACK
Server acknowledges and synchronizes:
TCP Header:
┌─────────────────────────────────────────┐
│ Source Port: 80 │
│ Dest Port: 52431 │
│ Sequence Number: 300 (server's ISN) │
│ Acknowledgment: 101 (client's ISN + 1) │
│ Flags: SYN=1, ACK=1 │
│ Window: 65535 │
│ Options: MSS=1460, Window Scale, etc. │
└─────────────────────────────────────────┘
The ACK value (101) means “I’ve received everything up to byte 100, send me byte 101 next.”
Step 3: ACK
Client confirms:
TCP Header:
┌─────────────────────────────────────────┐
│ Source Port: 52431 │
│ Dest Port: 80 │
│ Sequence Number: 101 │
│ Acknowledgment: 301 (server's ISN + 1) │
│ Flags: ACK=1 │
│ Window: 65535 │
└─────────────────────────────────────────┘
At this point, both sides have verified connectivity and exchanged initial sequence numbers.
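In practice you never build the handshake yourself: the operating system performs it when an application calls connect(). A minimal sketch on loopback (the OS picks the port; the kernel completes handshakes on behalf of a listening socket, even before accept() is called):

```python
import socket

# A throwaway listener: listen() puts the socket in the LISTEN state;
# the kernel will complete incoming handshakes on its behalf.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

# connect() sends the SYN, waits for the SYN-ACK, and sends the final
# ACK; when it returns, the connection is ESTABLISHED.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
peer = client.getpeername()
print("established with", peer)

client.close()
server.close()
```

If the handshake fails (no listener, packet loss), connect() raises instead of returning.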
Why Three Messages?
Could we do it in two? No—here’s why:
Two-way handshake problem:
Client ──SYN──> Server
Client <──ACK── Server
What if the server's ACK is lost?
- Server thinks connection is established
- Client thinks connection failed
- Server waits forever for data that won't come
Three-way solves this:
- Both sides must acknowledge the other's SYN
- Both sides know the other received their ISN
State Changes During Handshake
Client States Server States
────────────────────────────────────────────────────
CLOSED CLOSED
│ │
│ listen()
│ │
│ ▼
│ LISTEN
│ │
connect() │
│ │
▼ │
SYN_SENT ──────── SYN ─────────────>│
│ │
│ ▼
│ SYN_RCVD
│ │
│<─────────── SYN-ACK ───────────│
│ │
▼ │
ESTABLISHED ────── ACK ────────────>│
│ │
│ ▼
│ ESTABLISHED
Options Negotiated in Handshake
Several important options are exchanged during the SYN and SYN-ACK:
Maximum Segment Size (MSS)
"The largest TCP segment I can receive"
Typical values:
Ethernet: MSS = 1460 (1500 MTU - 20 IP - 20 TCP)
Jumbo: MSS = 8960 (9000 MTU - headers)
Both sides advertise their MSS
Connection uses the minimum of the two
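On Linux you can inspect the MSS in effect for a connection with the TCP_MAXSEG socket option; a loopback sketch (loopback has a large MTU, so expect a value well above Ethernet's 1460):

```python
import socket

# Loopback listener so there is a fully negotiated connection to inspect.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", server.getsockname()[1]))

# TCP_MAXSEG reports the segment size in effect for this connection —
# the minimum of what the two sides advertised, minus option overhead.
mss = client.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
print("effective MSS:", mss)

client.close()
server.close()
```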
Window Scaling (RFC 7323)
Original window field: 16 bits = max 65535 bytes
Too small for high-bandwidth, high-latency networks
Window scaling multiplies by 2^scale:
Scale=7: Window can be 65535 × 128 = 8MB
SYN: Window Scale = 7
SYN-ACK: Window Scale = 8
Enables large windows for high-performance networks
Selective Acknowledgment (SACK)
"I support SACK - I can tell you exactly which
bytes I've received, not just the contiguous ones"
Without SACK: ACK=1000 means "got 1-999"
If 1000 is lost but 1001-2000 arrived,
can only ACK up to 999
With SACK: ACK=1000, SACK: 1001-2000
"Got 1-999 and 1001-2000, missing 1000"
Sender retransmits only byte 1000
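The sender's side of SACK processing is a gap computation: given the cumulative ACK and the SACK blocks, work out which byte ranges are still missing. A small sketch (ranges are end-exclusive, so the text's "SACK 1001-2000" becomes the block (1001, 2001)):

```python
def missing_ranges(cum_ack, sack_blocks, highest_sent):
    """Return (start, end) byte ranges the receiver is missing,
    given the cumulative ACK and SACK blocks (end-exclusive)."""
    gaps = []
    expected = cum_ack          # everything below this is ACKed
    for left, right in sorted(sack_blocks):
        if left > expected:     # a hole before this SACKed block
            gaps.append((expected, left))
        expected = max(expected, right)
    if expected < highest_sent:  # tail not yet reported
        gaps.append((expected, highest_sent))
    return gaps

# The scenario from the text: ACK=1000 ("got 1-999"), SACK says
# 1001-2000 also arrived → only byte 1000 needs retransmission.
print(missing_ranges(1000, [(1001, 2001)], 2001))  # [(1000, 1001)]
```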
Timestamps (RFC 7323)
Used for:
1. RTT measurement (Round Trip Time)
2. PAWS (Protection Against Wrapped Sequences)
TSval: Sender's timestamp
TSecr: Echoed timestamp from peer
Helps with:
- Accurate timeout calculation
- Distinguishing old duplicate packets
Handshake in Action
Here’s a real handshake captured with tcpdump:
$ tcpdump -i eth0 port 80
14:23:15.123456 IP 192.168.1.100.52431 > 93.184.216.34.80:
Flags [S], seq 1823761425, win 65535,
options [mss 1460,sackOK,TS val 1234567 ecr 0,
nop,wscale 7], length 0
14:23:15.156789 IP 93.184.216.34.80 > 192.168.1.100.52431:
Flags [S.], seq 2948572615, ack 1823761426, win 65535,
options [mss 1460,sackOK,TS val 9876543 ecr 1234567,
nop,wscale 8], length 0
14:23:15.156801 IP 192.168.1.100.52431 > 93.184.216.34.80:
Flags [.], ack 2948572616, win 512,
options [nop,nop,TS val 1234568 ecr 9876543], length 0
Reading the flags:
[S]  = SYN
[S.] = SYN-ACK (the dot means ACK is set)
[.]  = ACK only
Connection Latency
The handshake adds latency before data transfer can begin:
Timeline:
────────────────────────────────────────────────────────
0ms Client sends SYN
│
50ms Server receives SYN
Server sends SYN-ACK
│
100ms Client receives SYN-ACK
Client sends ACK
Client can NOW send data!
│
150ms Server receives ACK
Server can NOW send data!
Minimum: 1 RTT before client can send
1.5 RTT before server can send
For a 100ms RTT connection:
100ms before HTTP request can be sent
150ms before HTTP response can begin
This is why connection reuse (HTTP keep-alive, connection pooling) matters for performance.
TCP Fast Open (TFO)
TCP Fast Open allows data in the SYN packet:
First connection (normal):
Client ──SYN──────────────> Server
Client <──SYN-ACK + Cookie── Server
Client ──ACK──────────────> Server
Client ──Data─────────────> Server
Subsequent connections (with TFO cookie):
Client ──SYN + Cookie + Data──> Server
Server can respond immediately!
Client <───────Response───────── Server
Saves 1 RTT on repeat connections!
TFO requires:
- Both client and server support
- Idempotent initial request (retry-safe)
- Not universally deployed due to middlebox issues
Handshake Failures
Connection Refused
Client ──SYN──> Server (no service on port)
Client <──RST── Server
"RST" (Reset) means "I'm not accepting connections on this port"
$ telnet example.com 12345
Connection refused
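The RST surfaces in application code as a refused-connection error. A sketch that provokes one deliberately (it reserves a free port, closes it, then connects to it):

```python
import socket

# Reserve a port, then close it so nothing is listening there.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
dead_port = probe.getsockname()[1]
probe.close()

# The SYN to a closed port is answered with RST, which the OS
# surfaces as ConnectionRefusedError.
try:
    socket.create_connection(("127.0.0.1", dead_port), timeout=2)
    outcome = "connected"
except ConnectionRefusedError:
    outcome = "refused"
print(outcome)
```

Contrast this with a timeout: a refused connection fails immediately (the RST comes back in one RTT), while a timeout means SYNs are disappearing into the void.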
Connection Timeout
Client ──SYN──> (packet lost or server unreachable)
... wait ...
Client ──SYN──> (retry 1)
... wait longer ...
Client ──SYN──> (retry 2)
... give up after multiple attempts
Typical: several retries with exponential backoff
(e.g., Linux defaults to 6 SYN retries over roughly two minutes)
SYN Flood Attack
Attacker sends many SYNs without completing handshake:
Attacker ──SYN (fake source)──> Server
Attacker ──SYN (fake source)──> Server
Attacker ──SYN (fake source)──> Server
... thousands more ...
Server:
- Allocates resources for each half-open connection
- SYN queue fills up
- Can't accept legitimate connections
Mitigations:
- SYN cookies (stateless SYN handling)
- Rate limiting
- Larger SYN queues
Simultaneous Open (Rare)
Both sides can simultaneously send SYN:
Client ──SYN──> <──SYN── Server
↓ ↓
Client ──SYN-ACK──> <──SYN-ACK── Server
Both sides:
1. Receive SYN while in SYN_SENT
2. Send SYN-ACK
3. Move to ESTABLISHED when ACK received
Same result, different path. Rare in practice.
Summary
The three-way handshake establishes TCP connections:
| Step | Direction | Flags | Purpose |
|---|---|---|---|
| 1 | Client → Server | SYN | “I want to connect” |
| 2 | Server → Client | SYN-ACK | “OK, I acknowledge” |
| 3 | Client → Server | ACK | “Confirmed” |
Key points:
- Exchanges initial sequence numbers
- Negotiates options (MSS, window scale, SACK)
- Takes 1-1.5 RTT before data can flow
- Connection reuse avoids repeated handshakes
- SYN cookies protect against SYN floods
Next, we’ll examine the TCP header in detail—understanding each field and how they work together.
TCP Header and Segments
Understanding the TCP header is essential for debugging network issues, interpreting packet captures, and grasping how TCP works. Every TCP segment starts with a header containing control information.
TCP Segment Structure
A TCP segment is the unit of data at the transport layer:
┌─────────────────────────────────────────────────────────────┐
│ TCP Segment │
├──────────────────────────────┬──────────────────────────────┤
│ TCP Header │ TCP Payload │
│ (20-60 bytes) │ (0 to MSS bytes) │
└──────────────────────────────┴──────────────────────────────┘
Segments are encapsulated in IP packets:
┌─────────────────────────────────────────────────────────────┐
│ IP Header │ TCP Header │ TCP Payload │
│ (20 bytes) │ (20-60 bytes) │ (application data) │
└─────────────────────────────────────────────────────────────┘
The TCP Header
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
│ Source Port │ Destination Port │
├───────────────────────────────┴───────────────────────────────┤
│ Sequence Number │
├───────────────────────────────────────────────────────────────┤
│ Acknowledgment Number │
├───────┬───────┬─┬─┬─┬─┬─┬─┬─┬─┬───────────────────────────────┤
│ Data │ │C│E│U│A│P│R│S│F│ │
│ Offset│ Rsrvd │W│C│R│C│S│S│Y│I│ Window │
│ │ │R│E│G│K│H│T│N│N│ │
├───────┴───────┴─┴─┴─┴─┴─┴─┴─┴─┼───────────────────────────────┤
│ Checksum │ Urgent Pointer │
├───────────────────────────────┴───────────────────────────────┤
│ Options (if Data Offset > 5) │
│ ... │
├───────────────────────────────────────────────────────────────┤
│ Payload │
│ ... │
└───────────────────────────────────────────────────────────────┘
Header Fields Explained
Source Port and Destination Port (16 bits each)
Source Port: The sender's port (often ephemeral)
Destination Port: The receiver's port (often well-known)
Together with IP addresses, these form the connection 5-tuple:
(Protocol, Source IP, Source Port, Dest IP, Dest Port)
Example:
Client → Server HTTP request:
Source: 192.168.1.100:52431
Dest: 93.184.216.34:80
Sequence Number (32 bits)
Identifies the position of data in the byte stream.
If Sequence = 1000 and Payload = 100 bytes:
This segment contains bytes 1000-1099
First SYN: Sequence = ISN (Initial Sequence Number)
Subsequent: ISN + bytes sent
32 bits → wraps around after ~4GB
(Timestamps help disambiguate in fast networks)
Acknowledgment Number (32 bits)
"I've received all bytes up to this number - 1"
"Send me this byte next"
If ACK = 1100:
"I have bytes 0-1099, expecting 1100"
Only valid when ACK flag is set.
Data Offset (4 bits)
Length of TCP header in 32-bit words.
Minimum: 5 (5 × 4 = 20 bytes, no options)
Maximum: 15 (15 × 4 = 60 bytes, 40 bytes of options)
Tells receiver where the payload begins.
Reserved (4 bits)
Reserved for future use. Must be zero.
(Historically 6 bits, 2 repurposed for CWR/ECE)
Control Flags (8 bits)
Each flag is 1 bit:
┌─────┬─────────────────────────────────────────────────────────┐
│ CWR │ Congestion Window Reduced │
│ │ Sender reduced congestion window │
├─────┼─────────────────────────────────────────────────────────┤
│ ECE │ ECN-Echo │
│ │ Congestion notification echo │
├─────┼─────────────────────────────────────────────────────────┤
│ URG │ Urgent pointer field is valid │
│ │ (Rarely used today) │
├─────┼─────────────────────────────────────────────────────────┤
│ ACK │ Acknowledgment field is valid │
│ │ Set on almost every segment after SYN │
├─────┼─────────────────────────────────────────────────────────┤
│ PSH │ Push - deliver data immediately to application │
│ │ Don't buffer waiting for more data │
├─────┼─────────────────────────────────────────────────────────┤
│ RST │ Reset - abort the connection immediately │
│ │ Something went wrong │
├─────┼─────────────────────────────────────────────────────────┤
│ SYN │ Synchronize - connection establishment │
│ │ Only set during handshake │
├─────┼─────────────────────────────────────────────────────────┤
│ FIN │ Finish - sender is done sending │
│ │ Graceful connection termination │
└─────┴─────────────────────────────────────────────────────────┘
Common flag combinations:
SYN = Connection request
SYN + ACK = Connection accepted
ACK = Data or acknowledgment
PSH + ACK = Push data (common for requests)
FIN + ACK = Done sending, acknowledging
RST = Connection abort
RST + ACK = Reset with acknowledgment
Window (16 bits)
Receive window size: "I can accept this many more bytes"
Range: 0 - 65535 bytes
With window scaling (negotiated in SYN):
Actual window = Window × 2^scale
Example with scale=7:
Window=512 means 512 × 128 = 65536 bytes
Checksum (16 bits)
Covers header, data, and a pseudo-header:
┌─────────────────────────────────────────────────────────────┐
│ Pseudo-Header │
├─────────────────────────────────────────────────────────────┤
│ Source IP Address (from IP header) │
│ Destination IP Address (from IP header) │
│ Zero | Protocol (6 for TCP) | TCP Length │
└─────────────────────────────────────────────────────────────┘
Why pseudo-header?
- Ensures segment reaches correct destination
- Protects against IP address spoofing
Urgent Pointer (16 bits)
Offset to end of urgent data (if URG flag set).
Largely obsolete - rarely used in modern applications.
Was intended for out-of-band signaling.
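The fixed 20-byte header maps directly onto a struct layout. A sketch of a decoder using Python's struct module (network byte order; field values in the example are illustrative):

```python
import struct

def parse_tcp_header(data):
    """Decode the fixed 20-byte TCP header into a dict."""
    sport, dport, seq, ack, off_rsvd, flags, win, cksum, urg = \
        struct.unpack("!HHIIBBHHH", data[:20])
    return {
        "src_port": sport,
        "dst_port": dport,
        "seq": seq,
        "ack": ack,
        "data_offset": (off_rsvd >> 4) * 4,   # header length in bytes
        "flags": {name: bool(flags & bit) for name, bit in
                  [("FIN", 0x01), ("SYN", 0x02), ("RST", 0x04),
                   ("PSH", 0x08), ("ACK", 0x10), ("URG", 0x20),
                   ("ECE", 0x40), ("CWR", 0x80)]},
        "window": win,
    }

# A hand-built SYN: ports 52431→80, seq 100, offset 5 (no options).
syn = struct.pack("!HHIIBBHHH", 52431, 80, 100, 0, 5 << 4, 0x02,
                  65535, 0, 0)
hdr = parse_tcp_header(syn)
print(hdr["flags"]["SYN"], hdr["data_offset"])  # True 20
```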
TCP Options
Options extend the header beyond 20 bytes:
Option Format:
┌─────────┬────────┬──────────────────────┐
│ Kind │ Length │ Data │
│ (1 byte)│(1 byte)│ (Length-2 bytes) │
└─────────┴────────┴──────────────────────┘
Single-byte options: Kind only (no length/data)
Kind 0: End of Options
Kind 1: NOP (padding)
Common Options
┌──────┬────────┬────────────────────────────────────────────────┐
│ Kind │ Length │ Description │
├──────┼────────┼────────────────────────────────────────────────┤
│ 0 │ - │ End of Options List │
│ 1 │ - │ NOP (No Operation) - padding │
│ 2 │ 4 │ MSS (Maximum Segment Size) │
│ 3 │ 3 │ Window Scale │
│ 4 │ 2 │ SACK Permitted │
│ 5 │ var │ SACK (Selective Acknowledgment) │
│ 8 │ 10 │ Timestamps (TSval, TSecr) │
└──────┴────────┴────────────────────────────────────────────────┘
MSS Option (Kind 2)
Maximum Segment Size - largest payload sender can receive.
┌─────────┬────────┬─────────────────────┐
│ Kind=2 │ Len=4 │ MSS Value (16b) │
└─────────┴────────┴─────────────────────┘
Typical value: 1460 (Ethernet). With the timestamp option
in use, the per-segment payload drops to 1448.
Only in SYN and SYN-ACK segments.
Window Scale Option (Kind 3)
Multiplier for window field: Window × 2^scale
┌─────────┬────────┬────────────┐
│ Kind=3 │ Len=3 │ Shift (8b) │
└─────────┴────────┴────────────┘
Shift range: 0-14
Max window: 65535 × 2^14 = ~1GB
Only in SYN and SYN-ACK.
SACK Option (Kind 5)
Reports non-contiguous received blocks:
┌─────────┬────────┬──────────┬──────────┬─────┐
│ Kind=5 │ Length │ Left Edge│Right Edge│ ... │
└─────────┴────────┴──────────┴──────────┴─────┘
Example: SACK 1001-1500, 2001-3000
"I have bytes 1001-1500 and 2001-3000,
missing 1501-2000"
Timestamps Option (Kind 8)
┌─────────┬────────┬────────────────┬────────────────┐
│ Kind=8 │ Len=10 │ TSval (32 bit) │ TSecr (32 bit) │
└─────────┴────────┴────────────────┴────────────────┘
TSval: Sender's current timestamp
TSecr: Echo of peer's last timestamp
Uses:
1. RTT measurement (TSecr shows when original was sent)
2. PAWS - detect old duplicates by timestamp
Example TCP Segments
SYN Segment
Client initiating connection to web server:
Source Port: 52431
Dest Port: 80
Sequence: 2837465182 (random ISN)
Acknowledgment: 0 (not used)
Data Offset: 10 (40-byte header: 20 bytes of options)
Flags: SYN
Window: 65535
Checksum: 0x1a2b
Urgent: 0
Options:
MSS: 1460
SACK Permitted
Timestamps: TSval=1234567, TSecr=0
NOP (padding)
Window Scale: 7
Data Segment
HTTP request being sent:
Source Port: 52431
Dest Port: 80
Sequence: 2837465183
Acknowledgment: 948271635
Data Offset: 8
Flags: PSH, ACK
Window: 502
Checksum: 0x3c4d
Urgent: 0
Options:
NOP, NOP
Timestamps: TSval=1234590, TSecr=9876543
Payload (95 bytes):
GET / HTTP/1.1\r\n
Host: example.com\r\n
\r\n
ACK-only Segment
Acknowledging received data (no payload):
Source Port: 80
Dest Port: 52431
Sequence: 948272000
Acknowledgment: 2837465278
Data Offset: 8
Flags: ACK
Window: 1024
Checksum: 0x5e6f
Urgent: 0
Options:
NOP, NOP
Timestamps: TSval=9876600, TSecr=1234590
Payload: (empty)
Segment Size Considerations
Maximum Segment Size (MSS)
MSS = MTU - IP Header - TCP Header
MSS = 1500 - 20 - 20 = 1460 bytes (typical Ethernet)
With timestamps (common):
MSS = 1500 - 20 - 32 = 1448 bytes
The actual payload in a segment ≤ MSS
Why Segment Size Matters
Too small:
- More packets = more overhead
- More ACKs needed
- Less efficient
Too large:
- IP fragmentation (bad for performance)
- Higher chance of loss requiring retransmit
Optimal: Just under MTU (Path MTU Discovery helps)
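The overhead argument can be made concrete with a small helper that computes what fraction of on-the-wire bytes is actual application payload (assuming 20-byte IP and TCP headers, no options):

```python
def segment_efficiency(payload, ip_hdr=20, tcp_hdr=20):
    """Fraction of on-the-wire bytes that are application payload."""
    return payload / (payload + ip_hdr + tcp_hdr)

# Full-size Ethernet segment vs a 1-byte segment (the pathological
# case that Silly Window Syndrome prevention exists to avoid):
print(round(segment_efficiency(1460), 3))  # 0.973
print(round(segment_efficiency(1), 3))     # 0.024
```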
Viewing TCP Headers
Using tcpdump
$ tcpdump -i eth0 -nn tcp port 80 -vvX
15:30:45.123456 IP 192.168.1.100.52431 > 93.184.216.34.80:
Flags [S], cksum 0x1a2b (correct),
seq 2837465182, win 65535,
options [mss 1460,sackOK,TS val 1234567 ecr 0,
nop,wscale 7],
length 0
Using Wireshark
Wireshark provides a graphical view with all fields decoded:
Transmission Control Protocol, Src Port: 52431, Dst Port: 80
Source Port: 52431
Destination Port: 80
Sequence Number: 2837465182
Acknowledgment Number: 0
Header Length: 40 bytes (10)
Flags: 0x002 (SYN)
000. .... .... = Reserved: Not set
...0 .... .... = Nonce: Not set
.... 0... .... = Congestion Window Reduced: Not set
.... .0.. .... = ECN-Echo: Not set
.... ..0. .... = Urgent: Not set
.... ...0 .... = Acknowledgment: Not set
.... .... 0... = Push: Not set
.... .... .0.. = Reset: Not set
.... .... ..1. = Syn: Set
.... .... ...0 = Fin: Not set
Window: 65535
Options: (20 bytes)
Summary
The TCP header contains everything needed for reliable, ordered delivery:
| Field | Size | Purpose |
|---|---|---|
| Source/Dest Port | 16 bits each | Identify applications |
| Sequence Number | 32 bits | Track byte position |
| Acknowledgment | 32 bits | Confirm receipt |
| Data Offset | 4 bits | Header length |
| Flags | 8 bits | Control (SYN, ACK, FIN, etc.) |
| Window | 16 bits | Flow control |
| Checksum | 16 bits | Error detection |
| Options | Variable | MSS, SACK, timestamps, etc. |
Understanding these fields helps you:
- Debug connection problems
- Interpret packet captures
- Tune TCP performance
- Recognize attacks (SYN floods, RST attacks)
Next, we’ll explore flow control—how TCP prevents senders from overwhelming receivers.
Flow Control
Flow control prevents a fast sender from overwhelming a slow receiver. Without it, a server could blast data faster than your application can process it, leading to lost data and wasted retransmissions.
The Problem
Consider a file download:
Fast Server Slow Client
(100 Mbps) (processes 1 MB/s)
│──── 1 MB ────────────────────────>│ Buffer: [1MB]
│──── 1 MB ────────────────────────>│ Buffer: [2MB]
│──── 1 MB ────────────────────────>│ Buffer: [3MB]
│──── 1 MB ────────────────────────>│ Buffer: [4MB] ← FULL!
│──── 1 MB ────────────────────────>│ Buffer: OVERFLOW!
│ │
│ Data lost! Must retransmit. │
│ Waste of bandwidth. │
Without flow control, fast senders can:
- Overflow receiver buffers
- Cause packet loss
- Trigger unnecessary retransmissions
The Sliding Window
TCP uses a sliding window mechanism for flow control. The receiver advertises how much buffer space is available, and the sender limits itself accordingly.
Receive Window (rwnd)
Receiver's perspective:
Receive Buffer (size: 65535 bytes)
┌─────────────────────────────────────────────────────────────┐
│ Data waiting │ Application │ Available Space │
│ to be read │ reading... │ (Window) │
│ (ACKed) │ │ │
├──────────────┼────────────────┼─────────────────────────────┤
│ 10000 │ (consuming) │ 55535 │
└──────────────┴────────────────┴─────────────────────────────┘
Window advertised to sender: 55535 bytes
"I can accept 55535 more bytes"
Sender’s View
The sender tracks three pointers:
Sent & ACKed │ Sent, waiting for ACK │ Can send │ Cannot send
│ │ (Window) │ (beyond window)
─────────────┴───────────────────────┴──────────────┴────────────────
1000 1000-5000 5000-10000 10000+
The "window" slides forward as ACKs arrive:
Before ACK:
[=====Sent=====][=====In Flight=====][===Can Send===][ Cannot ]
└──rwnd=5000───┘
After ACK (receiver consumed data):
[===Sent===][===In Flight===][=====Can Send=====][Cannot]
└─────rwnd=8000─────┘
Window "slides" right as data is acknowledged
Window Flow
Let’s trace a file transfer with flow control:
Sender Receiver
│ rwnd = 4000 │
│ │
│──── Seq=1000, 1000 bytes ────────────────────────>│
│──── Seq=2000, 1000 bytes ────────────────────────>│
│──── Seq=3000, 1000 bytes ────────────────────────>│
│──── Seq=4000, 1000 bytes ────────────────────────>│
│ │
│ (Sender has sent rwnd bytes, must wait) │
│ │
│<──── ACK=5000, Win=2000 ──────────────────────────│
│ (App read 2000 bytes, 2000 space freed) │
│ │
│──── Seq=5000, 1000 bytes ────────────────────────>│
│──── Seq=6000, 1000 bytes ────────────────────────>│
│ │
│ (Window full again, wait) │
│ │
│<──── ACK=7000, Win=4000 ──────────────────────────│
│ (App caught up, more space) │
Window Size and Throughput
The window limits throughput based on latency:
Maximum throughput = Window Size / Round Trip Time
Example 1: Window=65535, RTT=10ms
Throughput ≤ 65535 / 0.010 = 6.5 MB/s
Example 2: Window=65535, RTT=100ms
Throughput ≤ 65535 / 0.100 = 655 KB/s
This is why window scaling matters for high-latency links!
Bandwidth-Delay Product (BDP)
For optimal throughput, window should be ≥ BDP:
BDP = Bandwidth × RTT
Example: 100 Mbps link, 50ms RTT
BDP = 100,000,000 bits/s × 0.050 s
= 5,000,000 bits = 625,000 bytes
Need window ≥ 625 KB to fully utilize the link!
Standard 65535-byte window is way too small.
Window scaling essential: 65535 × 2^4 = ~1MB
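Both formulas are one-liners worth keeping at hand; a sketch reproducing the numbers above:

```python
def max_throughput(window_bytes, rtt_s):
    """Window-limited throughput in bytes per second."""
    return window_bytes / rtt_s

def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: the window needed to fill the pipe."""
    return bandwidth_bps * rtt_s / 8

# The examples from the text:
print(max_throughput(65535, 0.100))     # ≈ 655 KB/s on a 100ms link
print(bdp_bytes(100_000_000, 0.050))    # 625000 bytes for 100 Mbps, 50ms
```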
Window Scaling
Window scaling multiplies the 16-bit window field:
Without scaling:
Max window = 65535 bytes
On 100Mbps, 50ms link: 65535/0.050 = 1.3 MB/s (10% utilization)
With scale factor 7:
Max window = 65535 × 128 ≈ 8.4 MB
Window allows 8.4MB/0.050 = 168 MB/s — far more than the
link can carry, so the window is no longer the bottleneck
Negotiated during handshake:
SYN: WScale=7
SYN-ACK: WScale=8
Scale applies to window field in all subsequent segments
Zero Window
When the receiver’s buffer is full, it advertises window = 0:
Sender Receiver
│ │
│<──── ACK=5000, Win=0 ────────────────────────────│
│ "My buffer is full, stop sending!" │
│ │
│ (Sender stops, starts "persist timer") │
│ │
│──── Window Probe (1 byte) ──────────────────────>│
│ │
│<──── ACK=5000, Win=0 ────────────────────────────│
│ (Still full) │
│ │
│ (Wait, probe again) │
│ │
│──── Window Probe (1 byte) ──────────────────────>│
│ │
│<──── ACK=5000, Win=4000 ─────────────────────────│
│ (Space available, resume!) │
Persist Timer
The persist timer prevents deadlock when window = 0:
Without persist timer:
Receiver: Window=0 (buffer full)
Sender: Stops sending, waits for window update
Receiver: Window update packet is lost!
Both sides wait forever → Deadlock
With persist timer:
Sender periodically probes with 1-byte segments
Eventually receives window update
No deadlock possible
Silly Window Syndrome
A pathological condition where tiny windows cause inefficiency:
Problem scenario:
Application reads 1 byte at a time
Receiver advertises 1-byte window
Sender sends 1-byte segments (huge overhead!)
1 byte payload + 20 TCP + 20 IP = 41 bytes
Efficiency: 1/41 = 2.4%
This is "Silly Window Syndrome" (SWS)
Prevention
Receiver side (Clark’s algorithm):
Don't advertise tiny windows.
Wait until either:
- Window ≥ MSS, or
- Window ≥ buffer/2
Until one of those holds, keep advertising Win=0
Sender side (Nagle’s algorithm):
Don't send tiny segments.
If there's unacknowledged data:
- Buffer small writes
- Wait for ACK before sending
Can be disabled with TCP_NODELAY socket option
(Important for latency-sensitive apps)
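Disabling Nagle from application code is a one-line socket option; a sketch:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Nagle's algorithm is on by default; TCP_NODELAY=1 turns it off so
# small writes go out immediately (at the cost of more small packets).
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY:", nodelay)
s.close()
```

Typical candidates: interactive protocols (SSH-like), request/response RPC, and games, where waiting for an ACK to batch bytes adds visible latency.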
Flow Control in Action
Here’s a real-world example captured with tcpdump:
Time Direction Seq ACK Win Len
──────────────────────────────────────────────────
0.000 → 1 1 65535 1460 # Send data
0.001 → 1461 1 65535 1460 # More data
0.050 ← 1 2921 32768 0 # ACK, window shrunk
0.051 → 2921 1 65535 1460 # Continue
0.052 → 4381 1 65535 1460
0.100 ← 1 5841 16384 0 # Window shrinking
0.101 → 5841 1 65535 1460
0.150 ← 1 7301 0 0 # ZERO WINDOW!
0.650 → 7301 1 65535 1 # Window probe
0.700 ← 1 7302 8192 0 # Window opened
0.701 → 7302 1 65535 1460 # Resume
Tuning Flow Control
Receiver Buffer Size
# Linux - check current buffer sizes
$ sysctl net.core.rmem_default
net.core.rmem_default = 212992
$ sysctl net.core.rmem_max
net.core.rmem_max = 212992
# Increase for high-bandwidth applications
$ sudo sysctl -w net.core.rmem_max=16777216
$ sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
# min default max
Application-Level Control
import socket
# Create socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Set receive buffer (affects advertised window)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1048576) # 1MB
# Check actual buffer size (OS may adjust)
actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"Receive buffer: {actual}")
Visualizing the Window
Receiver's buffer over time:
Time=0 (empty buffer, large window)
┌────────────────────────────────────────────────────────────┐
│ Available (Win=64KB) │
└────────────────────────────────────────────────────────────┘
Time=1 (receiving faster than app reads)
┌───────────────────────────┬────────────────────────────────┐
│ Buffered (32KB) │ Available (Win=32KB) │
└───────────────────────────┴────────────────────────────────┘
Time=2 (app not reading, buffer filling)
┌───────────────────────────────────────────┬────────────────┐
│ Buffered (48KB) │ Avail(Win=16KB)│
└───────────────────────────────────────────┴────────────────┘
Time=3 (buffer full!)
┌────────────────────────────────────────────────────────────┐
│ Buffered (64KB) - Win=0! │
└────────────────────────────────────────────────────────────┘
Time=4 (app reads 32KB)
┌───────────────────────────┬────────────────────────────────┐
│ Buffered (32KB) │ Available (Win=32KB) │
└───────────────────────────┴────────────────────────────────┘
Summary
Flow control ensures receivers aren’t overwhelmed:
| Mechanism | Purpose |
|---|---|
| Receive Window (rwnd) | Advertises available buffer space |
| Window Scaling | Enables windows > 65535 bytes |
| Zero Window | Signals “stop sending” |
| Persist Timer | Prevents deadlock on zero window |
| Nagle’s Algorithm | Prevents sending tiny segments |
| Clark’s Algorithm | Prevents advertising tiny windows |
Key formulas:
Max throughput = Window / RTT
BDP = Bandwidth × RTT (optimal window size)
Flow control handles receiver capacity. But what about the network itself? That’s congestion control—our next topic.
Congestion Control
While flow control prevents overwhelming the receiver, congestion control prevents overwhelming the network. Without it, the internet would suffer from congestion collapse—everyone sending as fast as possible, causing massive packet loss and near-zero throughput.
The Congestion Problem
Multiple senders sharing a bottleneck:
Sender A (100 Mbps) ─┐
│
Sender B (100 Mbps) ─┼────[Router]────> 50 Mbps link ────> Internet
│ (bottleneck)
Sender C (100 Mbps) ─┘
If everyone sends at full speed:
Input: 300 Mbps
Capacity: 50 Mbps
Result: Router drops 250 Mbps worth of packets!
Dropped packets → Retransmissions → Even more traffic → More drops
This is "congestion collapse"
TCP’s Solution: Congestion Window
TCP maintains a congestion window (cwnd) that limits how much unacknowledged data can be in flight:
Effective window = min(cwnd, rwnd)
rwnd: What the receiver can accept (flow control)
cwnd: What the network can handle (congestion control)
Even if receiver says "send 1 MB", if cwnd=64KB,
sender only sends 64KB before waiting for ACKs.
The Four Phases
TCP congestion control has four main phases:
┌─────────────────────────────────────────────────────────────┐
│ │
│ cwnd │
│ │ │
│ │ Congestion │ │
│ │ Avoidance │ │
│ │ ____/ │ ssthresh │
│ │ / │←─────── │
│ │ / ______│ │
│ │ / / │ │
│ │ /──────────────────/ │ │
│ │ / │ │
│ │ / │ │
│ │ / Slow Start │ │
│ │ / │ │
│ │ / │ │
│ │ / │ │
│ │────/ │ │
│ └─────────────────────────────────────────────> Time │
│ Loss detected: cwnd cut, │
│ ssthresh lowered │
│ │
└─────────────────────────────────────────────────────────────┘
1. Slow Start
Despite the name, slow start grows cwnd exponentially:
Initial cwnd = 1 MSS (or IW, Initial Window, typically 10 MSS now)
Round 1: Send 1 segment, get 1 ACK → cwnd = 2
Round 2: Send 2 segments, get 2 ACKs → cwnd = 4
Round 3: Send 4 segments, get 4 ACKs → cwnd = 8
Round 4: Send 8 segments, get 8 ACKs → cwnd = 16
cwnd doubles every RTT (exponential growth)
Continues until:
- cwnd reaches ssthresh (slow start threshold)
- Packet loss is detected
Why “slow” start?
Before TCP had congestion control, senders would
immediately blast data at full speed. "Slow" start
is slow compared to that—it probes the network
capacity before going full throttle.
2. Congestion Avoidance
Once cwnd ≥ ssthresh, growth becomes linear:
For each RTT (when all cwnd bytes are acknowledged):
cwnd = cwnd + MSS
Or equivalently, for each ACK:
cwnd = cwnd + MSS × MSS / cwnd
Example (MSS=1000, cwnd=10000):
Each ACK → cwnd = 10000 + 1000×1000/10000 = 10100
One RTT ≈ 10 full segments ≈ 10 ACKs
After 10 ACKs (1 RTT) → cwnd ≈ 11000
Linear growth: +1 MSS per RTT
This is much slower than slow start's doubling
3. Loss Detection and Response
When packet loss is detected, TCP assumes congestion:
Triple Duplicate ACK (Fast Retransmit):
Sender receives 3 duplicate ACKs for same sequence
Interpretation: "One packet lost, but others arriving"
(Mild congestion, some packets getting through)
Response (TCP Reno/NewReno):
ssthresh = cwnd / 2
cwnd = ssthresh + 3 MSS (Fast Recovery)
Retransmit lost segment
Stay in congestion avoidance
Timeout (RTO expiration):
No ACK received within timeout period
Interpretation: "Severe congestion, possibly multiple losses"
(Major congestion, most packets lost)
Response:
ssthresh = cwnd / 2
cwnd = 1 MSS (or IW)
Go back to slow start
4. Fast Recovery
After fast retransmit, enters fast recovery:
During Fast Recovery:
For each duplicate ACK received:
cwnd = cwnd + MSS
(Indicates packets leaving network)
When new ACK arrives (lost packet recovered):
cwnd = ssthresh
Exit fast recovery
Enter congestion avoidance
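The interaction of these phases is easiest to see in a toy simulation. The sketch below is an idealized, RTT-granularity model of Reno (cwnd in MSS units; losses injected at chosen RTTs; fast recovery collapsed into the halving step), not a faithful per-ACK implementation:

```python
def reno_cwnd_trace(rtts, ssthresh=16, loss_at=frozenset()):
    """cwnd (in MSS units) per RTT for an idealized Reno sender:
    exponential below ssthresh, +1 MSS/RTT above, halved on loss."""
    cwnd, trace = 1, []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in loss_at:                 # triple-dup-ACK loss event
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh                # resume at halved window
        elif cwnd < ssthresh:
            cwnd *= 2                      # slow start: double per RTT
        else:
            cwnd += 1                      # congestion avoidance: +1 MSS
    return trace

# Slow start to 16, linear growth, loss at RTT 6, then recovery:
print(reno_cwnd_trace(10, loss_at={6}))
# [1, 2, 4, 8, 16, 17, 18, 9, 10, 11]
```

The output shows the characteristic shape: exponential ramp, linear climb, multiplicative cut, linear climb again — the sawtooth.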
Congestion Control Algorithms
Different algorithms for different scenarios:
TCP Reno (Classic)
The original widely-deployed algorithm
Slow Start: Exponential growth
Cong. Avoid: Linear growth (AIMD - Additive Increase)
Loss Response: Multiplicative Decrease (cwnd/2)
AIMD = Additive Increase, Multiplicative Decrease
- Increase by 1 MSS per RTT
- Decrease by half on loss
- Proven to converge to fair share
TCP NewReno
Improvement over Reno for multiple losses:
Problem with Reno:
Multiple packets lost in one window
Fast retransmit fixes one, then exits fast recovery
Has to wait for timeout for others
NewReno:
Stays in fast recovery until all lost packets recovered
"Partial ACK" triggers retransmit of next lost segment
Much better for high loss environments
TCP CUBIC (Linux Default)
Designed for high-bandwidth, high-latency networks
Key differences:
- cwnd growth is cubic function of time since last loss
- More aggressive than Reno in probing capacity
- Better utilization of fat pipes
cwnd = C × (t - K)³ + Wmax
Where:
C = scaling constant
t = time since last loss
K = time to reach Wmax
Wmax = cwnd at last loss
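The cubic curve can be evaluated directly. A sketch using commonly cited constants (C=0.4, multiplicative-decrease factor β=0.7; K is derived so the curve passes through Wmax×β at t=0 and returns to Wmax at t=K):

```python
def cubic_cwnd(t, w_max, c=0.4, beta=0.7):
    """CUBIC window at time t (seconds) since the last loss event.
    K is the time at which the window climbs back to w_max."""
    k = ((w_max * (1 - beta)) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max

w_max = 100.0            # cwnd (in MSS) when the last loss occurred
print(round(cubic_cwnd(0.0, w_max), 1))   # 70.0 — starts at w_max × beta
print(cubic_cwnd(10.0, w_max) > w_max)    # True — probing beyond w_max
```

The shape is the point: growth is fast right after the cut, flattens as cwnd approaches Wmax (the level where loss last occurred), then accelerates again to probe for new capacity.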
BBR (Bottleneck Bandwidth and RTT)
Google's model-based algorithm (2016)
Revolutionary approach:
- Explicitly measures bottleneck bandwidth
- Explicitly measures minimum RTT
- Doesn't use loss as primary congestion signal
Phases:
Startup: Exponential probing (like slow start)
Drain: Reduce queue after startup
Probe BW: Cycle through bandwidth probing
Probe RTT: Periodically measure minimum RTT
Advantages:
- Better throughput on lossy links
- Lower latency (doesn't fill buffers)
- Fairer bandwidth sharing
Visualizing Congestion Control
TCP Reno behavior over time:
cwnd │
│ *
│ * * *
│ * * * *
│ * * * *
│ * * * *
│ * * * *
│ * * * *
│ * * *
│ * (slow start) Loss! *
│ * ↓ Loss!
│ * ssthresh set ↓
│ *
│*
└────────────────────────────────────────> Time
"Sawtooth" pattern is classic TCP Reno behavior
Congestion Control in Practice
Checking Your System’s Algorithm
# Linux
$ sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = cubic
# See available algorithms
$ sysctl net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_available_congestion_control = reno cubic bbr
# Change algorithm (root)
$ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
Per-Connection Algorithm (Linux)
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Set BBR for this connection
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b'bbr')
# Check what's set
algo = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
print(algo.decode().split('\x00')[0])  # 'bbr' (strip any null padding)
ECN: Explicit Congestion Notification
Instead of dropping packets, routers can mark them:
Traditional congestion signal:
Router overloaded → Drops packets
Sender sees loss → Reduces cwnd
With ECN:
Router overloaded → Sets ECN bits in IP header
Receiver sees ECN → Echoes to sender via TCP
Sender reduces cwnd → No packet loss!
Benefits:
- No lost data
- Faster response
- Lower latency
ECN requires support from:
- Routers (must mark instead of drop)
- Both TCP endpoints (must negotiate)
Fairness
Congestion control isn’t just about throughput—it’s about sharing:
Two flows sharing a bottleneck:
Flow A: 100 Mbps network, long-running download
Flow B: 100 Mbps network, long-running download
Bottleneck: 10 Mbps
Fair outcome: Each gets ~5 Mbps
TCP's AIMD achieves this:
- Both increase at same rate (additive)
- Both decrease proportionally (multiplicative)
- Over time, converges to fair share
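The convergence claim can be illustrated with a toy simulation. This is a sketch under stated assumptions (a 10-unit bottleneck, synchronized loss whenever the combined rate exceeds it, +1 per round, halving on loss), not a packet-level model:

```python
# Toy AIMD model: two flows share a bottleneck of `capacity` units.
# Both add 1 each round; when the sum exceeds capacity, both halve.
def aimd(rounds=1000, capacity=10, a_start=9, b_start=1):
    a, b = a_start, b_start
    for _ in range(rounds):
        if a + b > capacity:
            a, b = a / 2, b / 2   # multiplicative decrease on loss
        else:
            a, b = a + 1, b + 1   # additive increase
    return a, b

a, b = aimd()
print(abs(a - b))  # difference shrinks toward 0: fair share
```

Halving shrinks the gap between the flows while additive increase preserves it, so the gap decays toward zero over repeated loss events. That is the intuition behind AIMD fairness.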
RTT Fairness Problem
Flow A: 10 ms RTT
Flow B: 100 ms RTT
Same bottleneck
Problem: Flow A increases cwnd 10× faster!
A: +1 MSS every 10ms = +100 MSS/second
B: +1 MSS every 100ms = +10 MSS/second
Flow A gets ~10× more bandwidth
This is why CUBIC and BBR were designed
Bufferbloat
Excessive buffering causes latency issues:
Problem:
Router has 100MB buffer
TCP fills buffer to maximize throughput
1000 Mbps link with 100MB buffer:
Buffer delay = 100MB / 125MB/s = 800ms!
Packets wait in queue → High latency
TCP only reacts when buffer overflows → Too late
Solutions:
- Active Queue Management (AQM)
- CoDel, PIE, fq_codel
- BBR (doesn't fill buffers)
Debugging Congestion
Symptoms
- Good bandwidth but high latency = bufferbloat
- Periodic throughput drops = congestion/loss
- Consistently low throughput = bottleneck or small cwnd
Tools
# Linux: view cwnd and ssthresh
$ ss -ti
ESTAB 0 0 192.168.1.100:52431 93.184.216.34:80
cubic wscale:7,7 rto:208 rtt:104/52 ato:40 mss:1448
cwnd:10 ssthresh:7 bytes_sent:1448 bytes_acked:1449
# Trace cwnd over time
$ ss -ti | grep cwnd # repeat or use watch
# tcptrace for analysis
$ tcptrace -l captured.pcap
Summary
Congestion control prevents network overload through self-regulation:
| Phase | Growth | Trigger |
|---|---|---|
| Slow Start | Exponential | cwnd < ssthresh |
| Congestion Avoidance | Linear | cwnd ≥ ssthresh |
| Fast Recovery | +1 MSS per dup ACK | 3 duplicate ACKs |
| Timeout | Reset to 1 | RTO expiration |
Key algorithms:
- Reno: Classic AIMD, good baseline
- CUBIC: Default Linux, better for fat pipes
- BBR: Model-based, good for lossy networks
Effective sending rate:
Rate = min(cwnd, rwnd) / RTT
Congestion control is why the internet works—millions of TCP connections sharing limited bandwidth without centralized coordination. Next, we’ll look at retransmission mechanisms—how TCP actually recovers lost data.
Retransmission Mechanisms
TCP guarantees reliable delivery by detecting lost packets and retransmitting them. This chapter explores how TCP knows when to retransmit and the mechanisms it uses to recover efficiently.
The Challenge
IP provides no delivery guarantee. Packets can be:
- Lost (router overflow, corruption, route failure)
- Duplicated (rare, but possible)
- Reordered (different paths)
- Delayed (congestion, buffering)
TCP must distinguish between these cases and respond appropriately.
How TCP Detects Loss
TCP uses two primary loss detection mechanisms:
1. Timeout (RTO)
If no ACK arrives within the Retransmission Timeout (RTO), assume the packet is lost:
Sender Receiver
│ │
│─── Seq=1000 (data) ───────X │ ← Packet lost!
│ │
│ [Timer starts] │
│ [waiting...] │
│ [RTO expires!] │
│ │
│─── Seq=1000 (retransmit) ─────────>│
│ │
│<── ACK=1500 ───────────────────────│
2. Fast Retransmit (Triple Duplicate ACK)
Three duplicate ACKs indicate a packet was lost but later packets arrived:
Sender Receiver
│ │
│─── Seq=1000 ──────────────────────>│
│─── Seq=1500 ─────────X │ ← Lost!
│─── Seq=2000 ──────────────────────>│
│─── Seq=2500 ──────────────────────>│
│─── Seq=3000 ──────────────────────>│
│ │
│<── ACK=1500 ──────────────────────│ (got 1000, want 1500)
│<── ACK=1500 (dup 1) ──────────────│ (got 2000, still want 1500)
│<── ACK=1500 (dup 2) ──────────────│ (got 2500, still want 1500)
│<── ACK=1500 (dup 3) ──────────────│ (got 3000, still want 1500)
│ │
│ [3 dup ACKs = loss!] │
│ │
│─── Seq=1500 (retransmit) ─────────>│
│ │
│<── ACK=3500 ──────────────────────│ (got everything!)
Fast retransmit is faster than waiting for timeout—often by hundreds of milliseconds.
Retransmission Timeout (RTO) Calculation
RTO must adapt to network conditions:
Too short: Unnecessary retransmissions (network already delivered)
Too long: Slow recovery from actual loss
RTO is calculated from measured RTT:
SRTT = (1 - α) × SRTT + α × RTT_sample
(Smoothed RTT, exponential moving average)
α = 1/8
RTTVAR = (1 - β) × RTTVAR + β × |SRTT - RTT_sample|
(RTT variance)
β = 1/4
RTO = SRTT + max(G, 4 × RTTVAR)
G = clock granularity (typically 1ms)
Example:
SRTT = 100ms, RTTVAR = 25ms
RTO = 100 + 4×25 = 200ms
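The rules above can be sketched as a small estimator. This assumes Linux-like bounds (200 ms floor, 120 s ceiling) and 1 ms granularity; the first-sample initialization follows RFC 6298:

```python
class RtoEstimator:
    G = 0.001                        # clock granularity (assumed 1 ms)
    ALPHA, BETA = 1 / 8, 1 / 4
    RTO_MIN, RTO_MAX = 0.2, 120.0    # assumed Linux-like bounds, in seconds

    def __init__(self):
        self.srtt = None

    def sample(self, rtt):
        if self.srtt is None:
            # First measurement: SRTT = R, RTTVAR = R/2 (RFC 6298 2.2)
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar \
                + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        rto = self.srtt + max(self.G, 4 * self.rttvar)
        return min(max(rto, self.RTO_MIN), self.RTO_MAX)

est = RtoEstimator()
print(est.sample(0.100))  # first sample: 0.1 + 4 * 0.05 = 0.3 s
```

Feeding it a steady stream of identical RTT samples drives RTTVAR down, so the RTO tightens toward SRTT plus the floor term.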
RTO Bounds
Minimum RTO: Typically 200ms on Linux (RFC 6298 recommends 1 second!)
Maximum RTO: Typically 120 seconds
Initial RTO: 1 second (before any measurements)
RTO Backoff
On repeated timeouts, RTO doubles (exponential backoff):
1st timeout: RTO = 200ms
2nd timeout: RTO = 400ms
3rd timeout: RTO = 800ms
4th timeout: RTO = 1600ms
...
Gives up after max retries (typically ~15)
This prevents overwhelming an already congested network.
Selective Acknowledgment (SACK)
SACK dramatically improves retransmission efficiency when multiple packets are lost:
Without SACK
Lost: packets 3 and 5 out of 1,2,3,4,5,6,7
Sender Receiver
│ │
│ Receives ACK=3 │
│ (receiver has 1,2) │
│ │
│ Retransmits 3 │
│ │
│ Receives ACK=5 │
│ (receiver has 1,2,3,4) │
│ │
│ Retransmits 5 │
│ │
│ Finally ACK=8 │
Each loss requires a separate round trip!
With SACK
Lost: packets 3 and 5
Sender Receiver
│ │
│ Receives: │
│    ACK=3, SACK=4-4,6-8             │
│    "Got 1-2 (ack), 4 (sack),       │
│     6-8 (sack). Missing: 3 and 5"  │
│ │
│ Retransmits 3 and 5 together │
│ │
│ Receives ACK=8 │
Both lost packets identified and retransmitted in one round trip!
SACK Format
TCP Option:
┌─────────┬────────┬─────────────┬─────────────┬─────┐
│ Kind=5 │ Length │ Left Edge 1 │ Right Edge 1│ ... │
│ (1 byte)│(1 byte)│ (4 bytes) │ (4 bytes) │ │
└─────────┴────────┴─────────────┴─────────────┴─────┘
Example: SACK 5001-6000, 7001-9000
"I have bytes 5001-6000 and 7001-9000"
"I'm missing 1-5000 and 6001-7000"
Maximum 4 SACK blocks fit in the 40-byte TCP options space;
only 3 when the timestamp option is also present
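On the sender side, a cumulative ACK plus SACK blocks translate into a list of holes to retransmit. A simplified sketch using half-open byte ranges (real stacks track this in a retransmission "scoreboard"):

```python
def missing_ranges(ack, sack_blocks, snd_nxt):
    # ack: all bytes below this are cumulatively ACKed
    # sack_blocks: (left, right) half-open ranges the receiver holds
    # snd_nxt: next new byte the sender would transmit
    holes, cursor = [], ack
    for left, right in sorted(sack_blocks):
        if left > cursor:
            holes.append((cursor, left))   # bytes the receiver is missing
        cursor = max(cursor, right)
    if cursor < snd_nxt:
        holes.append((cursor, snd_nxt))
    return holes

# Mirrors the example above: receiver holds 5001-6000 and 7001-9000
print(missing_ranges(1, [(5001, 6001), (7001, 9001)], 9001))
# [(1, 5001), (6001, 7001)]
```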
Duplicate Detection
TCP must handle duplicate packets (from retransmission or network duplication):
Sequence Number Check
Receiver tracks:
RCV.NXT = next expected sequence number
Incoming sequence < RCV.NXT?
→ Duplicate! Already received. Discard (but still ACK).
Example:
RCV.NXT = 5000
Packet arrives: Seq=3000
Already have this, discard.
PAWS (Protection Against Wrapped Sequences)
For high-speed connections, sequence numbers can wrap:
32-bit sequence: 0 to 4,294,967,295
At 1 Gbps: wraps every ~34 seconds
At 10 Gbps: wraps every ~3.4 seconds
Problem:
Old duplicate from previous wrap could be accepted
as valid data!
Solution: Timestamps
Each segment has timestamp
Old segment has old timestamp
Even if sequence matches, timestamp reveals age
Reject segments with timestamps too old
Spurious Retransmissions
Sometimes TCP retransmits unnecessarily:
Causes:
- RTT suddenly increased (but packet not lost)
- Delay spike on reverse path (ACK delayed)
- RTO calculated too aggressively
Problems:
- Wastes bandwidth
- cwnd reduced unnecessarily
- Triggers congestion response
Mitigations:
- F-RTO: Detect spurious timeout retransmissions
- Eifel algorithm: Use timestamps to detect
- DSACK: Receiver reports duplicate segments received
D-SACK (Duplicate SACK)
Receiver tells sender about duplicate segments:
Sender retransmits Seq=1000 (timeout)
Original arrives late at receiver
Retransmit also arrives
Receiver sends:
ACK=2000, SACK=1000-1500 (D-SACK)
"You already sent this, I got it twice"
Sender learns: My RTO was too aggressive!
Can adjust RTO calculation.
Retransmission in Action
Real-world packet capture showing loss and recovery:
Time Direction Seq ACK Flags Notes
──────────────────────────────────────────────────────────────
0.000 → 1000 PSH Send data
0.001 → 1500 PSH Send more
0.002 → 2000 PSH Lost!
0.003 → 2500 PSH Send more
0.004 → 3000 PSH Send more
0.050 ← 1500 ACK (got 1000)
0.051 ← 2000 ACK (got 1500)
0.052 ← 2000 DUP DupACK 1 (gap!)
0.053 ← 2000 DUP DupACK 2
0.054 ← 2000 DUP DupACK 3
0.055 → 2000 PSH Fast retransmit!
0.105 ← 3500 ACK (recovered!)
Optimizations
Tail Loss Probe (TLP)
Probes for loss when the sender goes idle:
Problem:
Send last segment of request
Segment lost
No more data to send → No duplicate ACKs
Must wait for full RTO
TLP solution:
If no ACK within 2×SRTT after sending:
Retransmit last segment (or send new probe)
Triggers immediate feedback
Reduces tail latency significantly.
Early Retransmit
Allows fast retransmit with fewer than 3 dup ACKs:
Traditional: Need 3 dup ACKs
Problem: What if only 2 packets in flight?
Early retransmit:
Small window (< 4 segments)
Allow fast retransmit with just 1-2 dup ACKs
Better for small transfers
RACK (Recent ACKnowledgment)
Time-based loss detection:
Traditional: Count duplicate ACKs
Problem: Reordering looks like loss
RACK approach:
Track time of most recent ACK
If segment sent before recent ACK hasn't been ACKed:
Probably lost (not reordered)
Better handles reordering vs. loss distinction
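The time-based test can be sketched in a few lines. This shows only the core idea, not the full RACK-TLP algorithm of RFC 8985; `reo_wnd` is the reordering window:

```python
# A segment still unACKed is judged lost if it was sent more than
# reo_wnd before the send time of the most recently ACKed segment.
def rack_lost(unacked_send_times, newest_acked_send_time, reo_wnd):
    return [seq for seq, sent_at in sorted(unacked_send_times.items())
            if sent_at + reo_wnd < newest_acked_send_time]

unacked = {1000: 0.10, 2000: 0.29}          # seq -> send timestamp (s)
print(rack_lost(unacked, newest_acked_send_time=0.30, reo_wnd=0.05))
# [1000]; segment 2000 is recent enough to be mere reordering
```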
Configuration
Linux Tuning
# View retransmission stats
$ netstat -s | grep -i retrans
1234 segments retransmitted
567 fast retransmits
# RTO settings
$ sysctl net.ipv4.tcp_retries1 # Soft threshold
net.ipv4.tcp_retries1 = 3
$ sysctl net.ipv4.tcp_retries2 # Hard maximum
net.ipv4.tcp_retries2 = 15
# Enable SACK (usually default)
$ sysctl net.ipv4.tcp_sack
net.ipv4.tcp_sack = 1
# Enable TLP
$ sysctl net.ipv4.tcp_early_retrans
net.ipv4.tcp_early_retrans = 3 # TLP enabled
Monitoring Retransmissions
# Count retransmits on a connection
$ ss -ti dst example.com
... retrans:5/10 ...
│ └── Total retransmits
└── Unrecovered retransmits
# Watch for retransmissions (tshark flags them during analysis)
$ tshark -i eth0 -Y tcp.analysis.retransmission
Summary
TCP uses multiple mechanisms to recover from loss:
| Mechanism | Detection | Speed | Use Case |
|---|---|---|---|
| Timeout (RTO) | Timer expires | Slow | Last resort |
| Fast Retransmit | 3 dup ACKs | Fast | Most losses |
| SACK | Explicit gaps | Fast | Multiple losses |
| TLP | Probe timeout | Fast | Tail losses |
RTO calculation:
RTO = SRTT + 4 × RTTVAR
Key principles:
- Fast retransmit beats waiting for timeout
- SACK enables efficient multi-loss recovery
- Timestamps help detect spurious retransmissions
- Modern algorithms (RACK) improve reordering tolerance
Understanding retransmission helps you diagnose network issues. High retransmission rates indicate packet loss—which could be congestion, bad hardware, or misconfiguration.
Next, we’ll cover TCP states—the lifecycle of a TCP connection from creation to termination.
TCP States and Lifecycle
A TCP connection progresses through a series of states from creation to termination. Understanding these states helps you debug connection issues, interpret netstat output, and understand why connections sometimes linger.
The State Diagram
┌───────────────────────────────────────┐
│ CLOSED │
└───────────────────┬───────────────────┘
│
┌─────────────────────────────────┼─────────────────────────────────┐
│ │ │
│ Passive Open │ Active Open │
│ (Server: listen()) │ (Client: connect()) │
▼ ▼ │
┌───────────────┐ ┌───────────────┐ │
│ LISTEN │ │ SYN_SENT │ │
│ │ │ │ │
│ Waiting for │ │ SYN sent, │ │
│ connection │ │ waiting for │ │
│ request │ │ SYN-ACK │ │
└───────┬───────┘ └───────┬───────┘ │
│ │ │
│ Receive SYN │ Receive SYN-ACK │
│ Send SYN-ACK │ Send ACK │
▼ ▼ │
┌───────────────┐ ┌───────────────┐ │
│ SYN_RCVD │ │ ESTABLISHED │◄─────────────────────────┘
│ │ │ │
│ SYN received, │ │ Connection │
│ SYN-ACK sent │──────────────>│ is open │
│ waiting ACK │ Receive ACK │ │
└───────────────┘ └───────────────┘
│
┌───────────┴───────────┐
│ │
│ Close requested │
│ │
▼ ▼
(Active Close) (Passive Close)
┌───────────────┐ ┌───────────────┐
│ FIN_WAIT_1 │ │ CLOSE_WAIT │
│ │ │ │
│ FIN sent, │ │ FIN received, │
│ waiting ACK │ │ ACK sent, │
└───────┬───────┘ │ waiting for │
│ │ app to close │
Receive ACK │ └───────┬───────┘
┌───────┴───────┐ │
│ │ │ App calls close()
▼ │ │ Send FIN
┌───────────────┐ │ ▼
│ FIN_WAIT_2 │ │ ┌───────────────┐
│ │ │ │ LAST_ACK │
│ Waiting for │ │ │ │
│ peer's FIN │ │ │ FIN sent, │
└───────┬───────┘ │ │ waiting ACK │
│ │ └───────┬───────┘
Receive FIN │ │ │
Send ACK │ │ Receive FIN │ Receive ACK
│ │ Send ACK │
▼ ▼ ▼
┌───────────────────────────┐ ┌───────────────┐
│ TIME_WAIT │ │ CLOSED │
│ │ │ │
│ Wait 2×MSL before │ │ Connection │
│ fully closing │ │ terminated │
│ (typically 60-120 sec) │ │ │
└─────────────┬─────────────┘ └───────────────┘
│
│ Timeout (2×MSL)
▼
┌───────────────┐
│ CLOSED │
└───────────────┘
State Descriptions
CLOSED
The starting and ending state. No connection exists.
Not actually tracked—absence of connection state.
LISTEN
Server is waiting for incoming connections.
Created by: listen() system call
$ netstat -an | grep LISTEN
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
SYN_SENT
Client has sent SYN, waiting for SYN-ACK.
Created by: connect() system call
Typical duration: 1 RTT (plus retries if lost)
If you see many SYN_SENT:
- Remote server might be down
- Firewall blocking SYN-ACKs
- Network connectivity issues
SYN_RCVD (SYN_RECEIVED)
Server received SYN, sent SYN-ACK, waiting for ACK.
Part of the half-open connection.
Typical duration: 1 RTT
If you see many SYN_RCVD:
- Could be SYN flood attack
- Check SYN backlog settings
- Consider SYN cookies
ESTABLISHED
Three-way handshake complete. Data can flow.
This is the normal "connection open" state.
$ netstat -an | grep ESTABLISHED
tcp 0 0 192.168.1.100:52431 93.184.216.34:80 ESTABLISHED
FIN_WAIT_1
Application called close(), FIN sent.
Waiting for ACK of FIN (or FIN from peer).
Brief transitional state.
FIN_WAIT_2
FIN acknowledged, waiting for peer's FIN.
Peer's application hasn't closed yet.
Can persist if peer doesn't close:
- Application bug (not calling close())
- Half-close intentional
Linux: tcp_fin_timeout controls how long to wait
CLOSE_WAIT
Received FIN from peer, sent ACK.
Waiting for local application to close.
If you see many CLOSE_WAIT:
- Application not calling close()!
- Resource leak / application bug
- Common source of "too many open files"
LAST_ACK
Sent FIN after receiving peer's FIN.
Waiting for final ACK.
Brief transitional state.
TIME_WAIT
Connection fully closed, waiting before reuse.
The "lingering" state that often confuses people.
Duration: 2 × MSL (Maximum Segment Lifetime)
MSL typically 30-60 seconds
TIME_WAIT typically 60-120 seconds
Why it exists: (see below)
CLOSING
Rare state: Both sides sent FIN simultaneously.
Each waiting for ACK of their FIN.
Simultaneous close scenario.
Why TIME_WAIT Exists
TIME_WAIT serves two important purposes:
1. Reliable Connection Termination
Scenario: Final ACK is lost
Client Server
│ │
│──── FIN ──────────────────────────>│
│<─── ACK, FIN ──────────────────────│
│──── ACK ───────────X │ ← Lost!
│ │
│ (Client in TIME_WAIT) │ (Server in LAST_ACK)
│ │
│<─── FIN (retransmit) ──────────────│
│──── ACK ──────────────────────────>│ (Re-ACK)
│ │
│ TIME_WAIT ensures we can │
│ re-acknowledge if needed │
2. Prevent Stale Segments
Old connection: 192.168.1.100:52431 → 93.184.216.34:80
Some segments still in network (delayed)
New connection with same 4-tuple:
If TIME_WAIT didn't exist, could reuse immediately
Old segments might be accepted as valid!
TIME_WAIT (2×MSL) ensures old segments expire:
MSL = Maximum Segment Lifetime in network
2×MSL = round-trip time for any lingering data
TIME_WAIT Problems and Solutions
The Problem
High-traffic servers can accumulate thousands of TIME_WAIT connections:
$ netstat -an | grep TIME_WAIT | wc -l
15234
Each TIME_WAIT:
- Consumes memory (small, but adds up)
- Holds ephemeral port (can exhaust ports!)
- 4-tuple unavailable for new connections
Solutions
1. SO_REUSEADDR
# Allow bind() to reuse an address still in TIME_WAIT
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("0.0.0.0", 8080))
# Server can restart immediately after a crash
# Doesn't allow two live sockets to bind the same port simultaneously
2. tcp_tw_reuse (Linux)
# Allow reusing TIME_WAIT sockets for outgoing connections
$ sysctl -w net.ipv4.tcp_tw_reuse=1
# Safe because timestamps prevent confusion
# Only for outgoing connections (client side)
3. Reduce TIME_WAIT duration (careful!)
# Not recommended - violates TCP specification
# Some systems allow it anyway
# Linux doesn't have direct control
# tcp_fin_timeout only affects FIN_WAIT_2
4. Connection pooling
Reuse established connections
- HTTP Keep-Alive
- Database connection pools
- gRPC persistent connections
Fewer connections = fewer TIME_WAITs
5. Use server-side close
If server closes first → Server gets TIME_WAIT
If client closes first → Client gets TIME_WAIT
For servers with many short-lived connections:
Let clients close first (HTTP/1.1 does this)
Viewing Connection States
Linux/macOS
# All connections with state
$ netstat -an
$ ss -an
# Count by state
$ ss -s
TCP: 2156 (estab 234, closed 1856, orphaned 12, timewait 1844)
# Filter by state
$ ss -t state established
$ ss -t state time-wait
# Show process info
$ ss -tp
$ netstat -tp
State Distribution Check
# Quick state summary
$ ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
1844 TIME-WAIT
234 ESTAB
56 FIN-WAIT-2
23 CLOSE-WAIT
5 LISTEN
3 SYN-SENT
Connection Termination: Normal vs. Abort
Graceful Close (FIN)
Normal termination - all data delivered
Client: close() → sends FIN
Waits for peer's FIN
Both sides agree connection is done
4-way handshake:
FIN →
← ACK
← FIN
ACK →
Can be combined (FIN+ACK together)
Abortive Close (RST)
Immediate termination - may lose data
Triggers:
- SO_LINGER with timeout=0
- Receiving data on closed socket
- Connection to non-listening port
- Firewall timeout/rejection
No TIME_WAIT needed - immediate cleanup
But: any in-flight data is lost
Half-Close
TCP allows closing one direction:
Client: shutdown(SHUT_WR)
- Client can't send more data
- Client can still receive
- Server sees EOF when reading
Use case:
"I'm done sending, but I'll wait for your response"
Example: HTTP request sent, waiting for response
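The pattern can be demonstrated over loopback. A sketch with a throwaway server thread (names are illustrative): the client half-closes with `shutdown(SHUT_WR)`, the server reads until EOF, then replies on the still-open direction:

```python
import socket
import threading

def serve_one(listener):
    # Accept one connection, read until EOF (the peer's half-close),
    # then answer and close.
    conn, _ = listener.accept()
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:                    # peer did shutdown(SHUT_WR)
            break
        chunks.append(data)
    conn.sendall(b"got %d bytes" % len(b"".join(chunks)))
    conn.close()

def request_with_half_close(payload):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))     # ephemeral port
    listener.listen(1)
    t = threading.Thread(target=serve_one, args=(listener,))
    t.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    client.sendall(payload)
    client.shutdown(socket.SHUT_WR)     # "done sending"; server sees EOF
    response = client.recv(4096)        # receive direction still works
    client.close()
    t.join()
    listener.close()
    return response

print(request_with_half_close(b"hello"))  # b'got 5 bytes'
```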
Common Issues
Too Many CLOSE_WAIT
Symptoms:
- Connections stuck in CLOSE_WAIT
- "Too many open files" errors
- Application eventually fails
Cause:
- Application receiving FIN but not calling close()
- Bug in cleanup code
- Exception handling not closing sockets
Fix:
- Fix application to properly close sockets
- Use finally blocks / context managers
- Check for file descriptor leaks
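In Python, a `with` block is the usual fix: `close()` runs even when handling raises, so sockets never linger in CLOSE_WAIT. A minimal sketch using `socketpair()` so it runs without any network setup:

```python
import socket

def handle(sock):
    # The context manager guarantees close() even if recv() or the
    # processing below raises, releasing the FD promptly.
    with sock:
        return sock.recv(4096).upper()

a, b = socket.socketpair()
b.sendall(b"ping")
print(handle(a))    # b'PING'
print(a.fileno())   # -1: closed without an explicit close() call
b.close()
```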
Too Many TIME_WAIT
Symptoms:
- Thousands of TIME_WAIT connections
- Port exhaustion for outgoing connections
- "Cannot assign requested address" errors
Cause:
- Many short-lived outgoing connections
- Server closing connections (gets TIME_WAIT)
Fix:
- Connection pooling
- tcp_tw_reuse (client-side)
- Let clients close first (server-side)
- Longer-lived connections
SYN_RECV Accumulation
Symptoms:
- Many connections in SYN_RCVD
- New connections rejected
- Server appears slow or unresponsive
Cause:
- SYN flood attack
- Slow/lossy network (ACKs not arriving)
Fix:
- Enable SYN cookies
- Increase SYN backlog
- Rate limiting
- DDoS protection
Summary
TCP states track the connection lifecycle:
| State | Side | Meaning |
|---|---|---|
| LISTEN | Server | Waiting for connections |
| SYN_SENT | Client | Handshake in progress |
| SYN_RCVD | Server | Handshake in progress |
| ESTABLISHED | Both | Connection open |
| FIN_WAIT_1 | Closer | Sent FIN, waiting ACK |
| FIN_WAIT_2 | Closer | FIN ACKed, waiting peer FIN |
| CLOSE_WAIT | Receiver | Received FIN, app hasn’t closed |
| LAST_ACK | Receiver | Sent FIN, waiting final ACK |
| TIME_WAIT | Closer | Waiting to ensure clean close |
| CLOSED | Both | No connection |
Key debugging insights:
- CLOSE_WAIT accumulation = application not closing sockets
- TIME_WAIT accumulation = many short connections (may be normal)
- SYN_RCVD accumulation = possible SYN flood attack
This completes our deep dive into TCP. You now understand the protocol that powers most of the internet. Next, we’ll look at UDP—the simpler, faster alternative.
UDP: The Simple Protocol
UDP (User Datagram Protocol) is TCP’s minimalist counterpart. Where TCP provides reliability, ordering, and connection management, UDP provides almost nothing—just a thin wrapper around IP. This simplicity makes it fast, lightweight, and ideal for certain use cases.
What UDP Provides
┌─────────────────────────────────────────────────────────────┐
│ UDP Provides │
├─────────────────────────────────────────────────────────────┤
│ ✓ Multiplexing via ports │
│ (Multiple apps can use the network) │
│ │
│ ✓ Checksum for error detection │
│ (Optional in IPv4, mandatory in IPv6) │
│ │
│ That's it. Really. │
└─────────────────────────────────────────────────────────────┘
What UDP Does NOT Provide
┌─────────────────────────────────────────────────────────────┐
│ UDP Does NOT Provide │
├─────────────────────────────────────────────────────────────┤
│ ✗ Reliability (packets may be lost) │
│ ✗ Ordering (packets may arrive out of order) │
│ ✗ Duplication prevention (same packet may arrive twice) │
│ ✗ Connection state (no handshake, no teardown) │
│ ✗ Flow control (can overwhelm receiver) │
│ ✗ Congestion control (can overwhelm network) │
└─────────────────────────────────────────────────────────────┘
TCP vs. UDP at a Glance
TCP UDP
─────────────────────────────────────────────────────────────
Connection-oriented Connectionless
Reliable delivery Best-effort delivery
Ordered delivery No ordering guarantee
Flow control No flow control
Congestion control No congestion control
Higher latency Lower latency
Higher overhead Lower overhead
Stream-based Message-based
Why Choose UDP?
If UDP lacks so many features, why use it?
1. Lower Latency
TCP connection setup:
1. SYN ────────>
2. <──────── SYN-ACK       (1 RTT spent before any data)
3. ACK + Data ──>
UDP "setup":
1. Data ────────> (immediate!)
For single request-response: UDP saves at least 1 RTT
2. No Head-of-Line Blocking
TCP: Packet 3 lost
Received: 1, 2, [gap], 4, 5, 6, 7
│
└── Can't deliver 4-7 until 3 arrives!
UDP: Packet 3 lost
Received: 1, 2, 4, 5, 6, 7 ← Deliver immediately!
│
└── Application decides what to do
For real-time apps, old data is often worthless anyway.
3. Message Boundaries Preserved
TCP is a byte stream:
send("Hello")
send("World")
Receiver might get: "HelloWorld" or "Hell" + "oWorld"
(No message boundaries)
UDP is message-based:
sendto("Hello")
sendto("World")
Receiver gets: "Hello" then "World"
(Each datagram is discrete)
4. Application Control
TCP decides:
- When to retransmit
- How fast to send
- How to react to loss
UDP lets the application decide:
- Custom retransmission logic
- Application-specific rate control
- Skip old data, prioritize new
When to Use UDP
┌─────────────────────────────────────────────────────────────┐
│ UDP Is Good For: │
├─────────────────────────────────────────────────────────────┤
│ │
│ Real-time Applications │
│ • Voice/Video calls (VoIP) │
│ • Live streaming │
│ • Online gaming │
│ • Real-time sensor data │
│ │
│ Simple Request-Response │
│ • DNS queries │
│ • NTP (time sync) │
│ • DHCP │
│ │
│ Broadcast/Multicast │
│ • Service discovery │
│ • Network announcements │
│ • LAN games │
│ │
│ Custom Protocols │
│ • QUIC (UDP-based, adds reliability) │
│ • Custom game protocols │
│ • IoT protocols │
│ │
└─────────────────────────────────────────────────────────────┘
What You’ll Learn
In this chapter:
- UDP Header and Datagrams: The simple packet format
- When to Use UDP: Detailed use cases and examples
- UDP vs TCP Trade-offs: Making the right choice
UDP’s simplicity is its strength. By providing just enough transport-layer functionality, it enables applications to build exactly what they need on top—nothing more, nothing less.
UDP Header and Datagrams
The UDP header is remarkably simple—just 8 bytes compared to TCP’s minimum of 20. This minimalism is by design, providing just enough functionality to multiplex applications and detect corruption.
UDP Datagram Structure
┌─────────────────────────────────────────────────────────────┐
│ UDP Datagram │
├─────────────────────────────────────────────────────────────┤
│ UDP Header │ UDP Payload │
│ (8 bytes) │ (0 to 65,507 bytes) │
└─────────────────────────────────────────────────────────────┘
Maximum payload: 65,535 (IP max) - 20 (IP header) - 8 (UDP header) = 65,507 bytes
But practical limit is usually much smaller due to MTU.
The UDP Header
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
│ Source Port │ Destination Port │
├───────────────────────────────┼───────────────────────────────┤
│ Length │ Checksum │
├───────────────────────────────┴───────────────────────────────┤
│ │
│ Payload │
│ │
└───────────────────────────────────────────────────────────────┘
That’s it. Four 16-bit fields. Compare to TCP’s 10+ fields!
Header Fields
Source Port (16 bits)
The sender's port number.
Optional: Can be 0 if no reply is expected
(Though this is rarely done in practice)
Used by receiver to send responses back.
Destination Port (16 bits)
The receiver's port number.
Identifies the application/service.
Well-known ports same as TCP:
53 = DNS
67 = DHCP Server
68 = DHCP Client
69 = TFTP
123 = NTP
161 = SNMP
500 = IKE (IPsec)
Length (16 bits)
Total length of UDP datagram (header + payload).
Minimum: 8 (header only, no payload)
Maximum: 65535 (theoretical, rarely practical)
Length = 8 + payload_size
Checksum (16 bits)
Error detection for header and data.
IPv4: Optional (0 = disabled)
IPv6: Mandatory
Calculated over:
- UDP pseudo-header (from IP)
- UDP header
- UDP payload (padded if odd length)
Pseudo-Header
Like TCP, UDP checksum covers IP addresses:
IPv4 Pseudo-Header:
┌───────────────────────────────────────────────────────────────┐
│ Source IP Address │
├───────────────────────────────────────────────────────────────┤
│ Destination IP Address │
├───────────────┬───────────────┬───────────────────────────────┤
│ Zero (8) │ Protocol (17) │ UDP Length │
└───────────────┴───────────────┴───────────────────────────────┘
This ensures the datagram reaches the intended destination.
If IP addresses were modified, checksum fails.
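The checksum computation can be sketched in Python: a one's-complement sum over the pseudo-header, the UDP header with its checksum field zeroed, and the payload. Field layout follows the diagrams above; the addresses and ports are illustrative:

```python
import socket
import struct

def ones_complement_sum(data):
    # Pad to even length, then sum 16-bit words with end-around carry.
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)
    return total

def udp_checksum(src_ip, dst_ip, src_port, dst_port, payload):
    length = 8 + len(payload)
    pseudo = struct.pack("!4s4sBBH", src_ip, dst_ip, 0, 17, length)  # proto 17 = UDP
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)     # checksum zeroed
    csum = 0xFFFF & ~ones_complement_sum(pseudo + header + payload)
    return csum or 0xFFFF   # a computed 0 is transmitted as 0xFFFF (0 = "no checksum")

c = udp_checksum(socket.inet_aton("192.168.1.100"),
                 socket.inet_aton("8.8.8.8"), 52431, 53, b"hello")
print(f"0x{c:04x}")
```

A receiver verifies by summing the same bytes with the checksum field filled in; a valid datagram sums to 0xFFFF.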
Comparing Headers
TCP Header (minimum): UDP Header:
┌───────────────────────────┐ ┌───────────────────────────┐
│ Source Port (16) │ │ Source Port (16) │
│ Destination Port (16) │ │ Destination Port (16) │
│ Sequence Number (32) │ │ Length (16) │
│ Acknowledgment (32) │ │ Checksum (16) │
│ Data Offset/Flags (16) │ └───────────────────────────┘
│ Window (16) │ 8 bytes total
│ Checksum (16) │
│ Urgent Pointer (16) │
│ [Options...] │
└───────────────────────────┘
20+ bytes
TCP overhead: 20+ bytes
UDP overhead: 8 bytes
For small messages, the difference matters!
UDP Socket Programming
Sending a Datagram
import socket
# Create UDP socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# No connect() needed - just send!
message = b"Hello, UDP!"
sock.sendto(message, ("192.168.1.100", 12345))
# Can send to different destinations with same socket
sock.sendto(b"Hello A", ("192.168.1.101", 12345))
sock.sendto(b"Hello B", ("192.168.1.102", 12345))
sock.close()
Receiving Datagrams
import socket
# Create and bind UDP socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 12345))
print("Listening on port 12345...")
while True:
    # recvfrom returns data AND sender address
    data, addr = sock.recvfrom(65535)  # Buffer size
    print(f"Received from {addr}: {data.decode()}")
    # Can reply directly
    sock.sendto(b"Got it!", addr)
Connected UDP (Optional)
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Can "connect" UDP socket - sets default destination
sock.connect(("192.168.1.100", 12345))
# Now can use send() instead of sendto()
sock.send(b"Hello!")
# recv() instead of recvfrom()
response = sock.recv(1024)
# Also enables receiving ICMP errors
# (Unconnected UDP sockets don't see them)
Message Boundaries
Unlike TCP’s byte stream, UDP preserves message boundaries:
Sender:
sendto(b"Message 1")
sendto(b"Message 2")
sendto(b"Message 3")
Receiver:
recvfrom() → b"Message 1"
recvfrom() → b"Message 2"
recvfrom() → b"Message 3"
Each datagram is delivered as a complete unit (or not at all).
Never get partial messages or merged messages.
Datagram Size Considerations
Practical Limits
Maximum theoretical: 65,507 bytes
Maximum without fragmentation: MTU - IP header - UDP header
Ethernet: 1500 - 20 - 8 = 1472 bytes
Jumbo: 9000 - 20 - 8 = 8972 bytes
Recommended maximum for reliability:
512-1400 bytes (avoids fragmentation)
DNS uses 512 bytes historically (EDNS allows larger)
Fragmentation Problem
UDP datagram > MTU gets fragmented at IP layer:
Original: 3000-byte UDP datagram
│
▼
┌─────────────────┐ ┌─────────────────┐ ┌────────────┐
│ IP Frag 1 │ │ IP Frag 2 │ │ IP Frag 3 │
│ UDP hdr + 1472B │ │ Data (1480B) │ │ Data (40B) │
└─────────────────┘ └─────────────────┘ └────────────┘
Problems:
- Any fragment lost = entire datagram lost
- No automatic retransmission
- Higher effective loss rate
Best practice: Keep datagrams under MTU
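A sender can enforce the best practice with a small guard. The constants are assumptions for plain Ethernet IPv4 (1500-byte MTU, 20-byte IP header, 8-byte UDP header):

```python
import socket

MTU, IP_HDR, UDP_HDR = 1500, 20, 8
MAX_SAFE_PAYLOAD = MTU - IP_HDR - UDP_HDR   # 1472 bytes

def safe_sendto(sock, payload, addr):
    # Refuse datagrams that would fragment at the IP layer.
    if len(payload) > MAX_SAFE_PAYLOAD:
        raise ValueError(f"{len(payload)}-byte payload exceeds "
                         f"{MAX_SAFE_PAYLOAD}-byte no-fragment limit")
    return sock.sendto(payload, addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sent = safe_sendto(sock, b"ping", ("127.0.0.1", 9))  # discard port, demo only
sock.close()
```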
No Connection State
UDP sockets don’t track connections:
TCP Server:
    listen()
    while True:
        client = accept()        ← New socket per connection
        handle(client)
        client.close()
UDP Server:
    bind()
    while True:
        data, addr = recvfrom()  ← All messages on same socket
        # addr tells you who sent it
        handle(data, addr)
        sendto(response, addr)
UDP has no notion of "accepted connections"
Just receives datagrams with source addresses.
Common UDP Patterns
Request-Response
Client Server
│ │
│─── Request (with client port) ───>│
│ │
│<── Response (to client port) ─────│
│ │
Simple: One datagram each direction.
DNS works this way.
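Because nothing retransmits for you, a UDP request-response client typically adds a receive timeout and a bounded retry loop (DNS resolvers behave roughly like this). A sketch, with illustrative defaults:

```python
import socket

def udp_request(server, payload, timeout=1.0, retries=3):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for _ in range(retries):
            sock.sendto(payload, server)
            try:
                data, _ = sock.recvfrom(65535)
                return data
            except socket.timeout:
                continue            # request or reply lost: send again
        raise TimeoutError(f"no reply after {retries} attempts")
    finally:
        sock.close()
```

Production clients usually also match replies to requests (e.g. DNS transaction IDs), since a late reply to an earlier retry can otherwise be mistaken for the current one.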
Streaming
Source Destination
│ │
│─── Packet 1 (seq=1) ───────────────>│
│─── Packet 2 (seq=2) ───────────────>│
│─── Packet 3 (seq=3) ──X │ Lost!
│─── Packet 4 (seq=4) ───────────────>│
│─── Packet 5 (seq=5) ───────────────>│
│ │
│ Receiver notices gap, may request │
│ retransmit or skip packet 3 │
Application implements sequencing/recovery as needed.
Video streaming, gaming use this pattern.
Multicast
One sender, multiple receivers:
Source ─────────────────┬───────────────> Receiver A
│ │
│ Multicast ├───────────────> Receiver B
│ Group │
│ └───────────────> Receiver C
UDP is required for multicast (TCP is point-to-point only).
Used for IPTV, service discovery, LAN gaming.
Viewing UDP Traffic
# Linux: Show UDP sockets
$ ss -u -a
State Recv-Q Send-Q Local Address:Port Peer Address:Port
UNCONN 0 0 0.0.0.0:68 0.0.0.0:*
UNCONN 0 0 127.0.0.1:323 0.0.0.0:*
# Capture UDP packets
$ tcpdump -i eth0 udp port 53
14:23:15.123 IP 192.168.1.100.52431 > 8.8.8.8.53: UDP, length 32
# Show UDP statistics
$ netstat -su
Udp:
1234567 packets received
12 packets to unknown port received
0 packet receive errors
1234560 packets sent
Summary
The UDP header is minimal by design:
| Field | Size | Purpose |
|---|---|---|
| Source Port | 16 bits | Reply address |
| Destination Port | 16 bits | Target application |
| Length | 16 bits | Datagram size |
| Checksum | 16 bits | Error detection |
Key characteristics:
- 8-byte header (vs TCP’s 20+)
- Message-oriented (boundaries preserved)
- Connectionless (no state to manage)
- No fragmentation at UDP level (handled by IP)
UDP provides just enough to identify applications and detect corruption. Everything else—reliability, ordering, flow control—is the application’s responsibility (if needed at all).
Next, we’ll explore when UDP is the right choice and common use cases.
When to Use UDP
Choosing UDP over TCP is a significant architectural decision. UDP shines in specific scenarios where its characteristics—low latency, no connection overhead, and application control—outweigh the lack of built-in reliability.
Primary Use Cases
Real-Time Communication
Voice over IP (VoIP)
Why UDP?
TCP behavior on packet loss:
Packet lost → Retransmit → Arrives 200ms later
Audio: "Hello, how are--[200ms pause]--you?"
UDP behavior:
Packet lost → Move on
Audio: "Hello, how are [brief glitch] you?"
Humans tolerate small audio gaps.
Humans hate delays in conversation.
VoIP typically tolerates 1-5% packet loss gracefully.
Delay > 150ms makes conversation awkward.
Video Streaming (Live)
Live video constraints:
- Frame every 33ms (30 fps)
- Old frames are worthless
- Viewer can't wait for retransmit
UDP approach:
Lost packet? Skip it, show next frame.
Minor visual artifact better than frozen video.
Note: Buffered streaming (Netflix) often uses TCP.
TCP's reliability works when you can buffer ahead.
Online Gaming
Game server sends world state 60 times/second:
Frame 1: Player at (100, 200)
Frame 2: Player at (102, 201) ← Lost!
Frame 3: Player at (104, 202)
Frame 4: Player at (106, 203)
With TCP: Wait for Frame 2 retransmit
Game stutters, all updates delayed
With UDP: Skip Frame 2
Frame 3 has newer position anyway!
Smooth gameplay
Games implement their own:
- Sequence numbers (detect loss)
- Interpolation (smooth missing frames)
- Prediction (guess missing data)
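The sequence-number technique above can be sketched in a few lines. This is a minimal loopback demo (the frame format of sequence number plus x/y position is hypothetical, not any particular engine's protocol) showing why a lost datagram never stalls a UDP receiver:

```python
import socket
import struct

# Receiver: bind to an ephemeral loopback port
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(("127.0.0.1", 0))
addr = recv_sock.getsockname()

send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Frame 2 is "lost": we simply never send it
for seq, (x, y) in [(1, (100, 200)), (3, (104, 202)), (4, (106, 203))]:
    send_sock.sendto(struct.pack("!Iii", seq, x, y), addr)

latest_seq, latest_pos = 0, None
for _ in range(3):
    data, _ = recv_sock.recvfrom(64)
    seq, x, y = struct.unpack("!Iii", data)
    if seq > latest_seq:                 # stale or duplicate? drop it
        latest_seq, latest_pos = seq, (x, y)

print(latest_seq, latest_pos)            # 4 (106, 203)
```

Because each datagram carries its own sequence number, the receiver always keeps the newest state and silently discards anything older — no waiting, no head-of-line blocking.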
Simple Request-Response
DNS (Domain Name System)
DNS query:
Client: "What's the IP for example.com?"
Server: "93.184.216.34"
Why UDP (historically)?
- Single small request (<512 bytes)
- Single small response
- No connection state needed
- Low latency critical (affects every web request)
UDP saves: 1 RTT of TCP handshake overhead
Modern note: DNS over TCP exists and is growing
- Large responses (DNSSEC)
- DNS over HTTPS/TLS (encrypted, uses TCP)
NTP (Network Time Protocol)
Time sync:
Client: "What time is it?"
Server: "2024-01-15 14:23:45.123456"
Latency matters for accuracy!
Every ms of delay affects time calculation
UDP request-response: ~1 RTT
TCP setup + request: ~2 RTT (handshake adds a full round trip)
DHCP (Dynamic Host Configuration Protocol)
Network bootstrapping:
Client: "I need an IP address!" (broadcast)
Server: "You can use 192.168.1.100"
Special challenge: Client has NO IP address yet!
TCP requires an IP to establish connection
UDP can broadcast from source address 0.0.0.0
A connection-oriented protocol can't bootstrap this way
Broadcast and Multicast
Service Discovery
Finding services on local network:
Option 1 (TCP): Connect to every device, ask "Are you a printer?"
Slow, inefficient, doesn't scale
Option 2 (UDP multicast):
Send to multicast address: "Who's a printer?"
All printers respond: "Me! I'm at 192.168.1.50"
mDNS/Bonjour uses this (224.0.0.251, port 5353)
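A sketch of the sender side in Python — the group address and port are the mDNS values from the text, but actually transmitting needs a multicast-capable interface, so the send itself is left commented out:

```python
import socket

MCAST_GRP, MCAST_PORT = "224.0.0.251", 5353  # mDNS group and port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# TTL 1 keeps multicast traffic on the local network segment
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)

ttl = sock.getsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL)
print(ttl)  # 1

# With a real interface, one datagram reaches every group member:
# sock.sendto(b"who is a printer?", (MCAST_GRP, MCAST_PORT))
```

Note this is an ordinary UDP socket; multicast is just a destination address plus a few socket options, which is why TCP has no equivalent.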
IPTV / Live TV Distribution
Sending same video to 10,000 viewers:
TCP: 10,000 separate connections
10,000 copies of each packet
Massive server load
UDP Multicast: 1 stream
Network duplicates as needed
Server load independent of viewer count
Multicast REQUIRES UDP (TCP is point-to-point).
Custom Protocols
QUIC (HTTP/3)
QUIC is a custom protocol over UDP:
- Implements reliability (like TCP)
- Implements congestion control (like TCP)
- But with multiplexing, 0-RTT, migration
Why not just improve TCP?
- TCP is in operating system kernels
- Kernel changes take years to deploy
- Middleboxes (firewalls, NAT) expect TCP behavior
UDP is a blank slate:
- Implement in userspace (fast iteration)
- Passes through middleboxes (they don't second-guess UDP semantics)
- Customize behavior completely
Custom Game Protocols
Games often need:
- Reliable delivery for some messages (chat, purchases)
- Unreliable for others (position updates)
- Priority levels
- Custom congestion handling
TCP: One-size-fits-all, no customization
UDP: Build exactly what you need
Many game engines implement hybrid:
- Reliable ordered channel (mimics TCP)
- Reliable unordered channel
- Unreliable channel
All over single UDP socket.
Lightweight IoT
Sensor Networks
Thousands of sensors reporting temperature:
- Small messages (few bytes)
- Frequent updates
- Individual readings not critical
- Network/power constrained
TCP overhead per reading:
20-byte header (often > payload!)
Connection state on server
UDP overhead:
8-byte header
No state
Fire and forget
CoAP (Constrained Application Protocol) uses UDP.
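A fire-and-forget reading really can be this small. The sketch below (loopback demo; the sensor id and centi-degree encoding are made up for illustration) sends one 8-byte payload with no connection setup or teardown:

```python
import socket
import struct

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))

sensor = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# 8-byte payload: sensor id (uint32) + temperature in centi-degrees (int32)
sensor.sendto(struct.pack("!Ii", 7, 2150), server.getsockname())
sensor.close()  # nothing to tear down -- no connection ever existed

data, _ = server.recvfrom(64)
sensor_id, centi_c = struct.unpack("!Ii", data)
print(sensor_id, centi_c / 100)  # 7 21.5
```

The entire datagram on the wire is 16 bytes (8-byte UDP header + 8-byte payload); the TCP equivalent would more than double that before counting SYN/ACK traffic.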
When NOT to Use UDP
File Transfer
File transfer requirements:
✓ Complete delivery (every byte matters)
✓ Correct order
✓ Error detection
UDP would require implementing:
- Sequence numbers
- Acknowledgments
- Retransmission
- Congestion control
...basically reimplementing TCP poorly.
Use TCP for file transfer. Or QUIC.
Web APIs / HTTP
HTTP requires:
✓ Reliable delivery (incomplete JSON is useless)
✓ Request-response matching
✓ Large responses
TCP is the right choice.
(HTTP/3 uses QUIC over UDP, but QUIC handles reliability)
Anything Through Firewalls
Many corporate firewalls:
- Allow TCP 80, 443
- Block most UDP
- May even block all UDP
If targeting corporate networks:
Consider TCP for better connectivity.
WebSocket (TCP) often works where custom UDP doesn't.
Decision Framework
┌─────────────────────────────────────────────────────────────┐
│ Should I Use UDP? │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ Is low latency critical? │
│ (Real-time, interactive) │
└──────────────┬───────────────┘
│
┌───────────┴───────────┐
│ │
Yes No
│ │
▼ ▼
┌────────────────────┐ ┌────────────────────┐
│ Can you tolerate │ │ Do you need │
│ some data loss? │ │ broadcast/multicast?│
└─────────┬──────────┘ └─────────┬──────────┘
│ │
┌───────┴───────┐ ┌───────┴───────┐
│ │ │ │
Yes No Yes No
│ │ │ │
▼ ▼ ▼ ▼
┌──────┐ ┌───────┐ ┌──────┐ ┌───────┐
│ UDP! │ │QUIC or│ │ UDP! │ │ TCP │
│ │ │ TCP │ │ │ │ │
└──────┘ └───────┘ └──────┘ └───────┘
Real-World Examples
Discord Voice
Text chat: TCP (reliable)
Voice chat: UDP (low latency)
Voice handling:
- Opus codec (tolerates loss)
- Packet loss concealment
- Jitter buffer
- Falls back to TCP if UDP blocked
Zoom
Video: UDP preferred, TCP fallback
Audio: UDP preferred, TCP fallback
Screen share: UDP preferred
Quality adapts to conditions:
- High loss? Reduce quality
- UDP blocked? Switch to TCP
- Still works, but with higher latency
DNS
Traditional: UDP port 53
- Fast, simple
- Limited to 512 bytes (without EDNS)
DNS over TCP: Port 53
- Large responses (DNSSEC)
- Zone transfers
DNS over HTTPS: TCP port 443
- Encrypted
- Privacy focused
- More overhead
Online Games
Fortnite, Valorant, etc.:
Position updates: UDP (unreliable, frequent)
Game events: UDP (reliable channel)
Chat: UDP or TCP
Downloads/patches: TCP
Hybrid approach is common.
Summary
Use UDP when:
- Latency matters more than reliability
- Data has short lifespan (real-time)
- Some loss is acceptable
- Broadcast/multicast needed
- Building custom protocol (like QUIC)
- Extreme resource constraints (IoT)
Use TCP when:
- Every byte must arrive
- Order matters
- Simplicity preferred (let TCP handle complexity)
- Firewall traversal important
- Building on existing TCP-based protocols
The key question: Is old data still valuable?
- Yes → TCP (file, web page, API)
- No → Consider UDP (voice, video, game state)
Next, we’ll do a detailed comparison of UDP vs TCP trade-offs.
UDP vs TCP Trade-offs
Choosing between UDP and TCP isn’t about which is “better”—it’s about understanding the trade-offs and matching them to your requirements. This chapter provides a detailed comparison to help you make informed decisions.
Feature Comparison
┌────────────────────────────────────────┬───────────┬───────────┐
│ Feature                                │ TCP       │ UDP       │
├────────────────────────────────────────┼───────────┼───────────┤
│ Reliable delivery                      │ ✓         │ ✗         │
│ Ordered delivery                       │ ✓         │ ✗         │
│ Error detection                        │ ✓         │ ✓*        │
│ Flow control                           │ ✓         │ ✗         │
│ Congestion control                     │ ✓         │ ✗         │
│ Connection-oriented                    │ ✓         │ ✗         │
│ Message boundaries                     │ ✗         │ ✓         │
│ Broadcast/Multicast                    │ ✗         │ ✓         │
│ NAT traversal friendly                 │ ✓         │ varies    │
│ Firewall friendly                      │ ✓         │ ✗         │
└────────────────────────────────────────┴───────────┴───────────┘
* The UDP checksum is optional in IPv4, mandatory in IPv6.
Latency Analysis
Connection Establishment
TCP - New Connection:
┌────────────────────────────────────────────────────────────┐
│ 0ms Client sends SYN │
│ 50ms Server receives, sends SYN-ACK │
│ 100ms Client receives, sends ACK + first data │
│ 150ms Server receives first data │
│ │
│ Minimum latency to first data: 1.5 RTT │
└────────────────────────────────────────────────────────────┘
TCP - Established Connection:
┌────────────────────────────────────────────────────────────┐
│ 0ms Client sends data │
│ 50ms Server receives data │
│ │
│ Latency: 0.5 RTT (one-way) │
└────────────────────────────────────────────────────────────┘
UDP:
┌────────────────────────────────────────────────────────────┐
│ 0ms Client sends datagram │
│ 50ms Server receives datagram │
│ │
│ Latency: 0.5 RTT (always) │
│ No connection overhead! │
└────────────────────────────────────────────────────────────┘
Request-Response Latency
Single request, single response:
TCP (new connection):
Handshake: 1 RTT
Request: 0.5 RTT
Response: 0.5 RTT
Total: 2 RTT
TCP (existing connection):
Request: 0.5 RTT
Response: 0.5 RTT
Total: 1 RTT
UDP:
Request: 0.5 RTT
Response: 0.5 RTT
Total: 1 RTT
For one-shot interactions, UDP saves 1 RTT.
For repeated interactions, TCP connection reuse matches UDP.
Latency Under Loss
5% packet loss scenario:
TCP:
Packet lost → Detected (3 dup ACKs or timeout)
Fast retransmit: ~1 RTT additional delay
Timeout: Several seconds delay!
Also: Congestion window reduced
Subsequent packets slowed
UDP:
Packet lost → Application decides:
- Ignore it (real-time)
- Request retransmit (application-level)
- Interpolate from adjacent data
No cascading effects on other packets.
Throughput Analysis
Header Overhead
Per-packet overhead:
TCP: 20-60 bytes (typically 32 with timestamps)
UDP: 8 bytes
Efficiency for 100-byte payload:
TCP: 100 / 132 = 76%
UDP: 100 / 108 = 93%
Efficiency for 1400-byte payload:
TCP: 1400 / 1432 = 98%
UDP: 1400 / 1408 = 99%
UDP's advantage shrinks with larger payloads.
Matters most for small messages.
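These efficiency figures are simple arithmetic; a quick Python check reproduces the numbers above (assuming the 32-byte TCP header with the common timestamp option, as in the text):

```python
TCP_HEADER = 32  # bytes: 20 base + 12 for the timestamp option
UDP_HEADER = 8   # bytes, fixed

def efficiency(payload: int, header: int) -> float:
    """Fraction of each packet that is actual payload."""
    return payload / (payload + header)

for payload in (100, 1400):
    print(f"{payload:>5}B  TCP {efficiency(payload, TCP_HEADER):.0%}"
          f"  UDP {efficiency(payload, UDP_HEADER):.0%}")
```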
Maximum Throughput
TCP:
Limited by: min(cwnd, rwnd) / RTT
Congestion control prevents network overload
Fair sharing with other flows
Example: 64KB window, 50ms RTT
Max: 64KB / 50ms = 1.28 MB/s
UDP:
Limited by: Application send rate
No built-in limits!
Can overwhelm network
Can achieve wire speed... if network allows.
But may cause massive loss and collateral damage.
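The window-limited ceiling above is just window divided by RTT. A one-line Python check reproduces the 64 KB / 50 ms example (using decimal KB, as the text does):

```python
def window_limited_throughput(window_bytes: float, rtt_seconds: float) -> float:
    """TCP can have at most one window of data in flight per round trip."""
    return window_bytes / rtt_seconds

# 64 KB window, 50 ms RTT
rate = window_limited_throughput(64_000, 0.050)
print(f"{rate / 1e6:.2f} MB/s")  # 1.28 MB/s
```

The same formula explains why raising the window (or lowering the RTT) is the only way to push TCP throughput higher on a clean link.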
Behavior Under Congestion
Network congested:
TCP:
Detects loss → Reduces cwnd
Backs off → Congestion clears
Gradually increases again
"Good citizen" - shares fairly
UDP:
No awareness of congestion
Keeps sending at same rate
Causes more congestion
Other TCP flows suffer
This is why uncontrolled UDP can harm the network.
Responsible UDP apps implement their own congestion control.
Reliability Implications
Handling Loss
TCP handles loss automatically:
1. Detects via ACK timeout or dup ACKs
2. Retransmits lost segment
3. Adjusts congestion window
4. Application sees reliable byte stream
UDP loss is application's problem:
1. Application must detect (if it cares)
2. Application must request retransmit (if needed)
3. Application decides what to do
Sometimes that's a feature:
Video codec can mask lost frame
Game can interpolate missing position
Voice can use error concealment
Ordering Implications
TCP guarantees order:
Sent: A B C D E
Received: A B C D E (always)
If C is lost:
B arrives, delivered
D arrives, buffered
E arrives, buffered
C retransmitted, arrives
C D E delivered in order
UDP makes no guarantee:
Sent: A B C D E
Received: A B D C E (possible)
A B D E (C lost)
A D B C E (reordered)
Application must handle or ignore.
Resource Usage
Server Memory
TCP Server (10,000 connections):
Per connection:
- Socket structure
- Send buffer (~16KB)
- Receive buffer (~16KB)
- TCP control block
Total: ~320MB for buffers alone
Plus connection tracking overhead
UDP Server (10,000 "clients"):
Single socket:
- One send buffer
- One receive buffer
- No connection state!
Total: ~32KB
Applications track state if needed
UDP scales better for many ephemeral interactions.
CPU Usage
TCP per packet:
- Checksum calculation
- Sequence number tracking
- ACK generation
- Window management
- Congestion control
- Timer management
UDP per packet:
- Checksum calculation (optional in IPv4)
- That's it
UDP has lower CPU overhead per packet.
But if you implement reliability, you add CPU work.
NAT and Firewall Behavior
NAT Traversal
TCP through NAT:
1. Client connects out
2. NAT creates mapping
3. Server responses follow mapping
4. Works reliably
UDP through NAT:
1. Client sends datagram out
2. NAT creates mapping
3. Mapping may timeout quickly!
4. Need keepalive packets
UDP NAT mappings often time out after 30-120 seconds.
Long-lived UDP "connections" need periodic keepalives.
Firewall Policies
Common firewall behavior:
Corporate firewalls:
TCP 80 (HTTP): Usually allowed
TCP 443 (HTTPS): Usually allowed
UDP 53 (DNS): Often allowed
UDP 123 (NTP): Sometimes allowed
Other UDP: Often blocked!
If targeting corporate networks:
UDP may not work
TCP or WebSocket more reliable
HTTPS most reliable
When to Choose What
Choose TCP When:
✓ Data integrity critical (files, transactions)
✓ Simple implementation preferred
✓ Operating through corporate firewalls
✓ Long-lived connections
✓ Need reliable delivery without custom code
✓ Building on HTTP, TLS, or other TCP protocols
Choose UDP When:
✓ Real-time requirements (voice, video, gaming)
✓ Broadcast or multicast needed
✓ Small, independent messages
✓ Custom reliability acceptable
✓ Willing to implement congestion control
✓ Protocol requires it (DNS, DHCP, QUIC)
Consider QUIC When:
✓ Want UDP benefits with reliability
✓ Need multiple streams without HoL blocking
✓ Want 0-RTT connection resumption
✓ Willing to use a more complex library
✓ Building modern web services
Performance Comparison Summary
┌──────────────────────────┬─────────────────────┬───────────────────────┐
│ Metric                   │ TCP                 │ UDP                   │
├──────────────────────────┼─────────────────────┼───────────────────────┤
│ Initial latency          │ 1-1.5 RTT overhead  │ No overhead           │
│ Steady-state latency     │ Similar             │ Similar               │
│ Latency under loss       │ High (retransmit)   │ Low (skip if desired) │
│ Throughput (clean)       │ Good                │ Can exceed            │
│ Throughput (lossy)       │ Degrades gracefully │ Application-dependent │
│ Header overhead          │ 20-60 bytes         │ 8 bytes               │
│ Server memory            │ High                │ Low                   │
│ Server CPU               │ Moderate            │ Low                   │
│ Implementation effort    │ Low (OS handles)    │ High (if reliability) │
└──────────────────────────┴─────────────────────┴───────────────────────┘
Hybrid Approaches
Many applications use both:
Example: Online Game
TCP for:
- Authentication
- Chat messages
- Purchases/transactions
- Downloading updates
UDP for:
- Player positions
- World state
- Audio chat
- Time-sensitive events
Single codebase, two transports, best of both worlds.
Summary
The choice between TCP and UDP depends on your specific requirements:
| Requirement | Prefer |
|---|---|
| Simplicity | TCP |
| Reliability built-in | TCP |
| Lowest latency | UDP |
| Real-time tolerance for loss | UDP |
| Broadcast/multicast | UDP |
| Corporate firewall traversal | TCP |
| Custom protocol over UDP | Consider QUIC |
Neither protocol is universally “better.” Understanding the trade-offs lets you make the right choice for your application—or use both where appropriate.
This completes our coverage of UDP. Next, we’ll explore DNS—the internet’s naming system that typically uses UDP for queries.
DNS: The Internet’s Directory
DNS (Domain Name System) is the internet’s phone book. It translates human-readable domain names like example.com into IP addresses like 93.184.216.34. Without DNS, we’d have to memorize IP addresses for every website—the internet would be unusable.
Why DNS Matters
Every network connection starts with DNS:
You type: https://github.com
Browser needs: IP address
1. Browser → DNS: "What's the IP for github.com?"
2. DNS → Browser: "140.82.114.3"
3. Browser → 140.82.114.3: "GET / HTTP/1.1"
4. Server → Browser: "Here's the page!"
DNS lookup happens before any connection.
DNS performance affects EVERY request.
The Hierarchical Design
DNS is a distributed database organized as a tree:
. (root)
│
┌────────────────────┼────────────────────┐
│ │ │
com org net
│ │ │
┌────┼────┐ ┌────┼────┐ ...
│ │ │ │ │
example google ... wikipedia ...
│
www
Domain: www.example.com
- "." is the root (usually implicit)
- "com" is the Top-Level Domain (TLD)
- "example" is the Second-Level Domain
- "www" is a subdomain
Key Concepts
Domain Names
Fully Qualified Domain Name (FQDN):
www.example.com.
└── Trailing dot means "this is complete"
(Usually omitted in browsers)
Labels: Separated by dots
- Each label: 1-63 characters
- Total FQDN: max 253 characters
- Case-insensitive (Example.COM = example.com)
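The rules above are easy to encode. A small Python sketch (illustrative only — it checks lengths, not the full hostname character rules):

```python
def is_valid_fqdn(name: str) -> bool:
    """Labels of 1-63 characters, total name at most 253 characters."""
    name = name.rstrip(".")          # trailing dot = explicit root
    if not name or len(name) > 253:
        return False
    return all(1 <= len(label) <= 63 for label in name.split("."))

print(is_valid_fqdn("www.example.com."))       # True
print(is_valid_fqdn("a" * 64 + ".com"))        # False (label too long)
print("Example.COM".lower() == "example.com")  # True: comparison is case-insensitive
```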
DNS Servers
┌─────────────────────────────────────────────────────────────┐
│ Types of DNS Servers │
├─────────────────────────────────────────────────────────────┤
│ │
│ Recursive Resolver (Caching Nameserver) │
│ - What your computer typically talks to │
│ - Does the heavy lifting of finding answers │
│ - Caches results for faster subsequent queries │
│ - Examples: 8.8.8.8 (Google), 1.1.1.1 (Cloudflare) │
│ │
│ Authoritative Nameserver │
│ - Holds the actual DNS records for a zone │
│ - Is the "source of truth" for that domain │
│ - Responds to queries about its zones │
│ │
│ Root Nameservers │
│ - 13 root server clusters (a.root-servers.net, etc.) │
│ - Know where to find TLD servers │
│ - Foundation of the entire DNS system │
│ │
│ TLD Nameservers │
│ - Manage .com, .org, .net, country codes, etc. │
│ - Know authoritative servers for each domain │
│ │
└─────────────────────────────────────────────────────────────┘
The Resolution Process (Preview)
Query: "What's the IP for www.example.com?"
Your Computer → Recursive Resolver
│
├──> Root Server: "Who handles .com?"
│ "Go ask a.gtld-servers.net"
│
├──> .com TLD: "Who handles example.com?"
│ "Go ask ns1.example.com"
│
├──> example.com NS: "What's www.example.com?"
│ "It's 93.184.216.34"
│
└──> Returns answer to your computer
Multiple round trips, but caching makes it fast.
Why DNS Uses UDP (Mostly)
Traditional DNS:
- Small queries (~50 bytes)
- Small responses (~100-500 bytes)
- Single request-response
- Speed matters (affects every page load)
UDP advantages:
- No connection overhead
- Faster resolution
- Lower server load
When TCP is used:
- Responses > 512 bytes (EDNS extends this)
- Zone transfers between servers
- DNS over TLS (DoT)
- DNS over HTTPS (DoH)
What You’ll Learn
In this chapter:
- DNS Resolution Process: How lookups actually work
- Record Types: A, AAAA, CNAME, MX, and more
- DNS Caching: How TTLs and caching improve performance
- DNSSEC: Securing DNS against tampering
Understanding DNS helps you:
- Debug “cannot resolve hostname” errors
- Configure domains correctly
- Understand CDN and load balancing behavior
- Recognize DNS-based attacks
DNS Resolution Process
When your browser needs to find example.com, a complex but elegant process unfolds. Understanding this process helps you debug DNS issues and optimize performance.
The Query Journey
A full DNS resolution involves multiple servers:
┌─────────────┐ ┌───────────────┐ ┌───────────────┐
│Your Computer│───>│ Recursive │───>│ Root Servers │
│ │ │ Resolver │ │ (13 clusters) │
│ (Stub │ │ (8.8.8.8) │ │ │
│ Resolver) │ │ │ └───────┬───────┘
└─────────────┘ │ │ │
│ │<───────────┘
│ │ ┌───────────────┐
│ │───>│ TLD Servers │
│ │ │ (.com, .org) │
│ │ │ │
│ │<───┴───────────────┘
│ │ ┌───────────────┐
│ │───>│ Authoritative │
│ │ │ Nameserver │
└───────┬───────┘ │ (example.com) │
│ └───────────────┘
│
Answer returned
to your computer
Step-by-Step Resolution
Let’s trace a query for www.example.com:
Step 1: Local Stub Resolver
Your computer checks (in order):
1. Local cache (recently resolved names)
2. /etc/hosts file (manual overrides)
3. If not found → Query configured DNS server
$ cat /etc/hosts
127.0.0.1 localhost
192.168.1.10 myserver.local
$ cat /etc/resolv.conf # Linux
nameserver 8.8.8.8
nameserver 8.8.4.4
If not in cache or hosts → Send UDP query to 8.8.8.8
Step 2: Recursive Resolver Check Cache
Recursive resolver (8.8.8.8) checks its cache:
Cache might have:
- www.example.com → 93.184.216.34 (exact match!)
- example.com NS → ns1.example.com (partial help)
- .com NS → a.gtld-servers.net (partial help)
Cache hit? Return immediately!
Cache miss? Start the recursive lookup.
Step 3: Query Root Servers
Resolver → Root Server (a.root-servers.net)
Q: "What's the IP for www.example.com?"
Root server response:
"I don't know www.example.com, but .com is handled by:
a.gtld-servers.net (192.5.6.30)
b.gtld-servers.net (192.33.14.30)
... (and others)
This is a REFERRAL, not an answer.
Go ask them."
Type: NS (Name Server) referral
Step 4: Query TLD Servers
Resolver → .com TLD Server (a.gtld-servers.net)
Q: "What's the IP for www.example.com?"
TLD server response:
"I don't know www.example.com, but example.com is handled by:
ns1.example.com (93.184.216.34)
ns2.example.com (93.184.216.34)
Go ask them."
Type: NS referral + glue records (IPs of nameservers)
Step 5: Query Authoritative Server
Resolver → Authoritative NS (ns1.example.com)
Q: "What's the IP for www.example.com?"
Authoritative response:
"www.example.com has address 93.184.216.34"
Type: A record (the actual answer!)
This server IS authoritative for example.com.
The answer is definitive, not a referral.
Step 6: Return to Client
Recursive resolver:
1. Caches the answer (and intermediate results)
2. Returns answer to your computer
Your computer:
1. Caches the answer
2. Uses IP to connect
Total time: 50-200ms (uncached)
Cached lookup: <1ms
Query Types
Recursive Query
Client → Recursive Resolver:
"Get me the answer, do whatever it takes"
Resolver must:
- Return the answer, OR
- Return an error
Client doesn't do iterative lookups itself.
Iterative Query
Resolver → Authoritative Servers:
"Tell me what you know"
Server response can be:
- The answer (if authoritative)
- A referral (try somewhere else)
- Error (doesn't exist)
Resolver follows referrals iteratively.
DNS Message Format
DNS Query/Response Structure:
┌────────────────────────────────────────────────────────────┐
│ Header │
│ - Query ID (match responses to queries) │
│ - Flags (query/response, recursion desired, etc.) │
│ - Question count, Answer count, Authority count, etc. │
├────────────────────────────────────────────────────────────┤
│ Question │
│ - Name: www.example.com │
│ - Type: A (or AAAA, MX, etc.) │
│ - Class: IN (Internet) │
├────────────────────────────────────────────────────────────┤
│ Answer │
│ - Name: www.example.com │
│ - Type: A │
│ - TTL: 3600 │
│ - Data: 93.184.216.34 │
├────────────────────────────────────────────────────────────┤
│ Authority │
│ (Nameservers for the zone) │
├────────────────────────────────────────────────────────────┤
│ Additional │
│ (Extra helpful records, like NS IP addresses) │
└────────────────────────────────────────────────────────────┘
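To make the layout concrete, here is a Python sketch that hand-encodes a query for www.example.com (the query ID 0x1234 is arbitrary). The resulting bytes are the kind of payload a stub resolver puts in a single UDP datagram to port 53:

```python
import struct

def encode_qname(name: str) -> bytes:
    """Each label is length-prefixed; a zero byte terminates the name."""
    out = b""
    for label in name.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

# Header: ID=0x1234, flags=0x0100 (Recursion Desired), 1 question, 0 answers
header = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
# Question: name + QTYPE=1 (A) + QCLASS=1 (IN)
question = encode_qname("www.example.com") + struct.pack("!HH", 1, 1)

query = header + question
print(len(query), "bytes")  # well under the traditional 512-byte UDP limit
```

The response comes back with the same query ID, which is how the resolver matches answers to outstanding questions.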
DNS Query in Action
Using dig to see the resolution:
$ dig www.example.com +trace
; <<>> DiG 9.16.1 <<>> www.example.com +trace
;; global options: +cmd
. 518400 IN NS a.root-servers.net.
. 518400 IN NS b.root-servers.net.
;; Received 262 bytes from 8.8.8.8#53(8.8.8.8) in 12 ms
com. 172800 IN NS a.gtld-servers.net.
com. 172800 IN NS b.gtld-servers.net.
;; Received 828 bytes from 198.41.0.4#53(a.root-servers.net) in 24 ms
example.com. 172800 IN NS ns1.example.com.
example.com. 172800 IN NS ns2.example.com.
;; Received 268 bytes from 192.5.6.30#53(a.gtld-servers.net) in 32 ms
www.example.com. 3600 IN A 93.184.216.34
;; Received 56 bytes from 93.184.216.34#53(ns1.example.com) in 16 ms
Negative Responses
What if the domain doesn’t exist?
NXDOMAIN
Query: thisdomaindoesnotexist.com
Response:
Status: NXDOMAIN (Non-Existent Domain)
Meaning: Domain doesn't exist at all
This is authoritative - the domain really doesn't exist.
Can be cached (negative caching).
NODATA
Query: example.com (type AAAA for IPv6)
Response:
Status: NODATA
Meaning: Domain exists but no record of this type
example.com has A records but no AAAA records.
Also cached negatively.
Resolver Behavior
Timeouts and Retries
Resolver query to server times out:
Default behavior:
Timeout: ~2 seconds
Retries: 2-3 attempts
Tries alternate servers in list
Total resolution might take:
Best case: <50ms (cached)
Typical: 50-200ms (uncached)
Worst case: Several seconds (timeouts)
Server Selection
Multiple nameservers for redundancy:
ns1.example.com
ns2.example.com
Resolver tracks:
- Response times per server
- Failure counts
- Prefers faster/more reliable servers
"Smoothed Round Trip Time" (SRTT) helps pick fastest.
Common Resolution Issues
“Could not resolve hostname”
Causes:
1. DNS server unreachable (network issue)
2. Domain doesn't exist (NXDOMAIN)
3. DNS server returning errors
4. Local resolver misconfigured
Debug:
$ nslookup example.com
$ dig example.com
$ ping 8.8.8.8 # Can you reach DNS server?
Slow Resolution
Causes:
1. Cache empty (first lookup is slow)
2. DNS server far away
3. DNS server overloaded
4. Network latency
Solutions:
- Use closer DNS server
- Increase local cache size/TTL
- Pre-resolve critical domains
Stale Cache
Situation:
Website changed IP
Your cache still has old IP
Connection fails
Solutions:
$ sudo systemd-resolve --flush-caches # Linux systemd
$ sudo dscacheutil -flushcache # macOS
$ ipconfig /flushdns # Windows
Or wait for TTL to expire.
Programming with DNS
Basic Lookup (Python)
import socket
# Simple lookup
ip = socket.gethostbyname('example.com')
print(ip) # 93.184.216.34
# Get all addresses (IPv4 + IPv6)
infos = socket.getaddrinfo('example.com', 80)
for info in infos:
family, socktype, proto, canonname, sockaddr = info
print(f"{family.name}: {sockaddr[0]}")
Using dnspython Library
import dns.resolver
# A record lookup
answers = dns.resolver.resolve('example.com', 'A')
for rdata in answers:
    print(f"IP: {rdata}")
# MX record lookup
answers = dns.resolver.resolve('example.com', 'MX')
for rdata in answers:
    print(f"Mail server: {rdata.exchange} (priority {rdata.preference})")
# Tracing (like dig +trace)
import dns.query
import dns.zone
# ... more advanced queries
Summary
DNS resolution follows a hierarchical pattern:
Your Computer
│
▼
Recursive Resolver (does the work)
│
├──> Root Servers (.com? .org? .net?)
│
├──> TLD Servers (example.com? github.com?)
│
└──> Authoritative Servers (www? mail? api?)
│
▼
Answer!
Key points:
- Stub resolvers on your computer do minimal work
- Recursive resolvers (like 8.8.8.8) do the heavy lifting
- Caching at every level makes it fast
- Authoritative servers are the source of truth
- TTL values control cache duration
Next, we’ll explore the different types of DNS records and their uses.
DNS Record Types
DNS stores more than just IP addresses. Different record types serve different purposes—from pointing domain names to servers, to routing email, to verifying domain ownership.
Common Record Types
┌───────────────────────────────────────────────────────────────────────┐
│ Type │ Name │ Purpose │
├───────┼────────────────────────┼──────────────────────────────────────┤
│ A │ Address │ Maps name to IPv4 address │
│ AAAA │ IPv6 Address │ Maps name to IPv6 address │
│ CNAME │ Canonical Name │ Alias to another name │
│ MX │ Mail Exchange │ Email server for domain │
│ TXT │ Text │ Arbitrary text (verification, SPF) │
│ NS │ Name Server │ Authoritative servers for zone │
│ SOA │ Start of Authority │ Zone metadata and parameters │
│ PTR │ Pointer │ Reverse DNS (IP to name) │
│ SRV │ Service │ Service location (port, priority) │
│ CAA │ Cert. Authority Auth. │ Which CAs can issue certificates │
└───────┴────────────────────────┴──────────────────────────────────────┘
A Record (Address)
Maps a domain name to an IPv4 address.
Record:
Name: example.com
Type: A
TTL: 3600
Value: 93.184.216.34
Lookup:
$ dig example.com A
example.com. 3600 IN A 93.184.216.34
Multiple A Records
Load balancing via DNS:
example.com. 300 IN A 192.0.2.1
example.com. 300 IN A 192.0.2.2
example.com. 300 IN A 192.0.2.3
Client picks one (often randomly or round-robin).
Simple load distribution, no dedicated load balancer.
Drawbacks:
- No health checking
- Uneven distribution possible
- Cached entries persist after server failure
AAAA Record (IPv6 Address)
Maps a domain name to an IPv6 address.
Record:
Name: example.com
Type: AAAA
TTL: 3600
Value: 2606:2800:220:1:248:1893:25c8:1946
Lookup:
$ dig example.com AAAA
example.com. 3600 IN AAAA 2606:2800:220:1:248:1893:25c8:1946
"AAAA" = four times "A" = four times the address size (32 → 128 bits)
Dual Stack
Many domains have both A and AAAA:
example.com. A 93.184.216.34
example.com. AAAA 2606:2800:220:1:248:1893:25c8:1946
Client chooses based on connectivity:
- Happy Eyeballs algorithm prefers IPv6
- Falls back to IPv4 if IPv6 fails
CNAME Record (Canonical Name)
Creates an alias pointing to another domain name.
Record:
Name: www.example.com
Type: CNAME
TTL: 3600
Value: example.com
Lookup:
$ dig www.example.com
www.example.com. 3600 IN CNAME example.com.
example.com. 3600 IN A 93.184.216.34
Resolver follows the chain:
www.example.com → example.com → 93.184.216.34
CNAME Use Cases
1. WWW alias:
www.example.com → example.com
2. CDN integration:
cdn.example.com → d1234.cloudfront.net
3. Service endpoints:
api.example.com → api-prod.company.internal
4. Environment switching:
app.example.com → staging.example.com (during testing)
app.example.com → production.example.com (in production)
CNAME Restrictions
Cannot coexist with other records at same name:
INVALID:
example.com CNAME other.com
example.com A 1.2.3.4 ← Conflict!
INVALID:
example.com CNAME other.com
example.com MX mail.example.com ← Conflict!
Therefore: Cannot use CNAME at zone apex (example.com)
Must use A/AAAA records there
Workarounds:
- ALIAS records (provider-specific, not standard DNS)
- ANAME records (draft standard)
MX Record (Mail Exchange)
Specifies email servers for a domain.
Record:
Name: example.com
Type: MX
TTL: 3600
Priority: 10
Value: mail.example.com
Lookup:
$ dig example.com MX
example.com. 3600 IN MX 10 mail.example.com.
example.com. 3600 IN MX 20 backup.example.com.
MX Priority
Lower number = higher priority
example.com. MX 10 primary.mail.example.com.
example.com. MX 20 secondary.mail.example.com.
example.com. MX 30 backup.mail.example.com.
Email delivery attempts:
1. Try primary (priority 10)
2. If unavailable, try secondary (priority 20)
3. Last resort: backup (priority 30)
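The selection logic is a simple sort by priority. A Python sketch using the hypothetical records above:

```python
# (priority, host) pairs, as returned by an MX lookup
mx_records = [
    (20, "secondary.mail.example.com"),
    (10, "primary.mail.example.com"),
    (30, "backup.mail.example.com"),
]

# Lower priority number = try first; equal priorities would be load-balanced
delivery_order = [host for _, host in sorted(mx_records)]
print(delivery_order)
```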
Email Flow with MX
Sending email to user@example.com:
1. Sender's MTA queries: example.com MX
2. Gets: mail.example.com (priority 10)
3. Queries: mail.example.com A
4. Gets: 93.184.216.100
5. Connects to 93.184.216.100:25 (SMTP)
6. Delivers email
TXT Record
Stores arbitrary text. Used for verification and email security.
Record:
Name: example.com
Type: TXT
TTL: 3600
Value: "v=spf1 include:_spf.google.com ~all"
Lookup:
$ dig example.com TXT
example.com. 3600 IN TXT "v=spf1 include:_spf.google.com ~all"
Common TXT Uses
1. SPF (Sender Policy Framework):
"v=spf1 include:_spf.google.com ~all"
Specifies authorized email senders
2. DKIM (DomainKeys Identified Mail):
selector._domainkey.example.com TXT "v=DKIM1; k=rsa; p=..."
Public key for email signing
3. DMARC (Domain-based Message Authentication):
_dmarc.example.com TXT "v=DMARC1; p=reject; rua=mailto:..."
Policy for handling authentication failures
4. Domain verification:
example.com TXT "google-site-verification=abc123..."
Proves domain ownership to services
5. Custom data:
example.com TXT "contact=admin@example.com"
NS Record (Name Server)
Specifies authoritative nameservers for a zone.
Record:
Name: example.com
Type: NS
TTL: 86400
Value: ns1.example.com
Lookup:
$ dig example.com NS
example.com. 86400 IN NS ns1.example.com.
example.com. 86400 IN NS ns2.example.com.
Delegation
NS records delegate subdomains:
example.com zone has:
subdomain.example.com. NS ns1.subdomain-hosting.com.
Now subdomain.example.com has its own nameservers.
The parent zone "delegates" authority.
SOA Record (Start of Authority)
Contains zone metadata and parameters.
$ dig example.com SOA
example.com. 3600 IN SOA ns1.example.com. admin.example.com. (
2024011501 ; Serial
7200 ; Refresh
3600 ; Retry
1209600 ; Expire
3600 ) ; Minimum TTL
Fields:
Primary NS: ns1.example.com
Admin email: admin@example.com (@ replaced with .)
Serial: Version number (often YYYYMMDDNN)
Refresh: How often secondaries check for updates
Retry: Retry interval after failed refresh
Expire: When secondary data becomes invalid
Minimum: Negative caching TTL
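The YYYYMMDDNN serial convention can be bumped automatically when editing a zone. A sketch assuming that convention (real tooling like BIND's `named` does this for dynamic zones):

```python
def next_serial(current, today):
    """Compute the next zone serial in YYYYMMDDNN form.

    current: existing serial as an int; today: date string "YYYYMMDD".
    If the serial is already at or past today's date, increment the
    NN revision; otherwise restart at today's date with revision 01.
    """
    candidate = int(today) * 100 + 1          # e.g. 2024011501
    if current >= candidate:
        return current + 1                    # same day: bump NN
    return candidate

s = next_serial(2024011501, "20240115")
# s == 2024011502 (same day, revision bumped)
```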
PTR Record (Pointer)
Maps IP addresses back to names (reverse DNS).
IP to name lookup:
IP: 93.184.216.34
Reverse zone: 34.216.184.93.in-addr.arpa
Record:
34.216.184.93.in-addr.arpa. PTR example.com.
Lookup:
$ dig -x 93.184.216.34
34.216.184.93.in-addr.arpa. 3600 IN PTR example.com.
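Constructing the reverse zone name is purely mechanical: reverse the octets and append the in-addr.arpa suffix. A small sketch:

```python
def reverse_name(ipv4):
    """Build the in-addr.arpa name used for a PTR lookup.

    Octets are reversed because DNS names go from most specific
    (left) to least specific (right), while IPs go the other way.
    """
    octets = ipv4.split(".")
    return ".".join(reversed(octets)) + ".in-addr.arpa"

name = reverse_name("93.184.216.34")
# name == "34.216.184.93.in-addr.arpa"
```

This is exactly the name `dig -x` queries on your behalf.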
Reverse DNS Uses
1. Email server verification:
Receiving servers check if sender IP has valid PTR
Missing/mismatched PTR → likely spam
2. Logging and auditing:
Convert IPs to names for readable logs
3. Security analysis:
Quick identification of attacking IPs
SRV Record (Service)
Specifies location of services with port and priority.
Record format:
_service._proto.name SRV priority weight port target
Example:
_sip._tcp.example.com. SRV 10 60 5060 sipserver.example.com.
_xmpp._tcp.example.com. SRV 10 50 5222 xmpp.example.com.
Fields:
Priority: Lower = preferred (like MX)
Weight: For load balancing among same priority
Port: Service port number
Target: Server hostname
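The priority/weight rules can be sketched as a selection function: lowest priority wins, and ties are broken by weighted random choice. This is a simplified sketch of RFC 2782's algorithm with hypothetical record data:

```python
import random

def pick_srv(records, rng=random):
    """Pick one SRV target: lowest priority group first, then a
    weighted random choice within that group.

    records: list of (priority, weight, port, target) tuples.
    """
    best = min(r[0] for r in records)
    group = [r for r in records if r[0] == best]
    weights = [r[1] for r in group]
    choice = rng.choices(group, weights=weights, k=1)[0]
    return choice[3], choice[2]  # (target, port)

records = [
    (10, 60, 5060, "sip1.example.com"),
    (10, 40, 5060, "sip2.example.com"),
    (20, 100, 5060, "backup.example.com"),
]
target, port = pick_srv(records)
# target is sip1 or sip2 (priority 10); backup is only a fallback
```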
SRV Use Cases
1. VoIP/SIP:
_sip._tcp.example.com → voip.example.com:5060
2. XMPP/Jabber:
_xmpp-client._tcp.example.com → chat.example.com:5222
3. LDAP:
_ldap._tcp.example.com → ldap.example.com:389
4. Kubernetes services:
_http._tcp.myservice.namespace → pod-ip:port
CAA Record (Certificate Authority Authorization)
Controls which Certificate Authorities can issue SSL certificates.
Record:
example.com. CAA 0 issue "letsencrypt.org"
example.com. CAA 0 issuewild ";"
example.com. CAA 0 iodef "mailto:security@example.com"
Meanings:
issue: Which CA can issue regular certs
issuewild: Which CA can issue wildcard certs (";" = none)
iodef: Where to report violations
$ dig example.com CAA
example.com. 3600 IN CAA 0 issue "letsencrypt.org"
Querying Different Record Types
# A record (IPv4)
$ dig example.com A
# AAAA record (IPv6)
$ dig example.com AAAA
# MX record (mail servers)
$ dig example.com MX
# TXT record (text)
$ dig example.com TXT
# All records
$ dig example.com ANY # Note: Many servers don't support ANY
# Specific nameserver
$ dig @8.8.8.8 example.com A
# Short output
$ dig +short example.com A
93.184.216.34
# Trace resolution path
$ dig +trace example.com
Record TTL Considerations
TTL (Time To Live) controls caching:
Long TTL (86400 = 24 hours):
+ Fewer queries, lower load
+ Faster lookups (cached)
- Slow to update, changes take time
Short TTL (60 = 1 minute):
+ Quick updates
+ Fast failover
- More queries
- Higher load on nameservers
Recommendations:
Stable records: 3600-86400 (1-24 hours)
Dynamic/failover: 60-300 (1-5 minutes)
During migration: Reduce before, restore after
Summary
| Record | Purpose | Example Value |
|---|---|---|
| A | IPv4 address | 93.184.216.34 |
| AAAA | IPv6 address | 2606:2800:220:1::1 |
| CNAME | Alias | www → example.com |
| MX | Mail server | 10 mail.example.com |
| TXT | Text/verification | “v=spf1 …” |
| NS | Nameservers | ns1.example.com |
| SOA | Zone metadata | Serial, timers |
| PTR | Reverse lookup | IP → name |
| SRV | Service location | priority weight port target |
| CAA | CA authorization | 0 issue “letsencrypt.org” |
Understanding record types helps you:
- Configure domains correctly
- Debug email delivery issues
- Set up SSL certificates
- Implement service discovery
Next, we’ll explore DNS caching and how TTLs affect performance.
DNS Caching
Caching is what makes DNS fast. Without it, every web request would require multiple round trips to root servers, TLD servers, and authoritative servers. Understanding caching helps you balance performance against update speed.
The Caching Hierarchy
DNS caches exist at multiple levels:
┌─────────────────────────────────────────────────────────────┐
│ Caching Layers │
├─────────────────────────────────────────────────────────────┤
│ │
│ Browser Cache (seconds to minutes) │
│ ↓ │
│ Operating System (minutes to hours) │
│ ↓ │
│ Local DNS Server (minutes to hours) │
│ (home router, office) │
│ ↓ │
│ Recursive Resolver (minutes to days) │
│ (8.8.8.8, ISP DNS) │
│ ↓ │
│ Authoritative Server (source of truth) │
│ │
└─────────────────────────────────────────────────────────────┘
Each level can serve cached responses.
Request only goes further if cache misses.
TTL (Time To Live)
Every DNS record has a TTL that controls how long it can be cached:
Record with TTL:
example.com. 3600 IN A 93.184.216.34
│
└── Cache for 3600 seconds (1 hour)
When cached:
- Resolver stores record with timestamp
- Returns cached response for subsequent queries
- After TTL expires, must re-query authoritative server
How TTL Decrements
Authoritative server returns:
example.com. 3600 IN A 93.184.216.34
Resolver caches at T=0:
TTL remaining: 3600
After 1000 seconds (T=1000):
Client queries resolver
Resolver returns from cache with TTL=2600
After 3600 seconds (T=3600):
TTL=0, entry expired
Next query goes to authoritative server
Fresh record cached with new TTL
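The decrement behavior above can be modeled with a small cache that stores an absolute expiry per record. A minimal sketch (a fake clock makes the countdown visible without waiting):

```python
import time

class DnsCache:
    """Minimal TTL-honoring cache sketch for DNS answers.

    get() returns (value, remaining_ttl), or None once expired,
    mirroring how a resolver serves decremented TTLs from cache.
    """
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}

    def put(self, name, value, ttl):
        self.entries[name] = (value, self.clock() + ttl)

    def get(self, name):
        entry = self.entries.get(name)
        if entry is None:
            return None
        value, expiry = entry
        remaining = expiry - self.clock()
        if remaining <= 0:
            del self.entries[name]   # expired: forces a re-query
            return None
        return value, int(remaining)

now = [0]
cache = DnsCache(clock=lambda: now[0])
cache.put("example.com", "93.184.216.34", ttl=3600)
now[0] = 1000
hit = cache.get("example.com")   # ("93.184.216.34", 2600)
now[0] = 3600
miss = cache.get("example.com")  # None: expired at T=3600
```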
Browser DNS Cache
Browsers maintain their own DNS cache:
Chrome: chrome://net-internals/#dns
Firefox: about:networking#dns
Safari: No direct viewer
Typical browser cache TTL: Capped at 1-60 seconds
(Shorter than OS cache to detect changes faster)
Clearing browser cache:
- Chrome: Settings → Privacy → Clear browsing data
- Firefox: Settings → Privacy → Clear Data
- Or restart browser
Operating System Cache
Linux (systemd-resolved)
# View cache statistics
$ resolvectl statistics
# View cached entries (limited)
$ resolvectl query example.com
# Flush cache
$ sudo resolvectl flush-caches
# Alternative (older systems)
$ sudo systemctl restart systemd-resolved
macOS
# Flush DNS cache
$ sudo dscacheutil -flushcache
$ sudo killall -HUP mDNSResponder
# View cached entries (limited visibility)
$ sudo killall -INFO mDNSResponder
# Check Console.app for output
Windows
# View cache
> ipconfig /displaydns
# Flush cache
> ipconfig /flushdns
# Check DNS client service
> Get-Service dnscache
Recursive Resolver Cache
Public resolvers like 8.8.8.8 cache extensively:
Benefits:
- Single query serves millions of users
- Popular domains almost always cached
- Reduced load on authoritative servers
Cache characteristics:
- Respects TTL from authoritative
- May apply minimum TTL (typically 60s)
- May cap maximum TTL (typically 24-48h)
- Huge cache (millions of entries)
Cache Warming
Large resolvers “warm” their caches:
Popular domain (google.com):
- Millions of queries per second
- Always in cache
- TTL never truly expires (refreshed constantly)
Obscure domain (your-small-site.com):
- Few queries
- May fall out of cache between queries
- Each visitor might trigger fresh lookup
Negative Caching
Failed lookups are also cached:
Query: nonexistent.example.com
Response: NXDOMAIN (doesn't exist)
Cached as negative response:
- Saves repeated queries for invalid domains
- TTL from SOA minimum field
- Typically cached for minutes to hours
RFC 2308 defines negative caching behavior.
Negative Cache Problems
Scenario:
1. Query new domain before DNS propagates
2. Get NXDOMAIN (not yet available)
3. NXDOMAIN cached for 1 hour
4. Domain IS available 5 minutes later
5. Still getting NXDOMAIN from cache!
Solution:
- Wait for negative cache to expire
- Flush local DNS cache
- Use different resolver temporarily
TTL Strategies
Long TTL (3600-86400 seconds)
Pros:
+ Fewer queries to authoritative servers
+ Faster lookups (usually cached)
+ Less DNS infrastructure needed
Cons:
- Slow propagation of changes
- Failover takes time
- Users may hit stale data
Best for:
- Stable infrastructure
- Rarely-changing records
- Cost/performance optimization
Short TTL (60-300 seconds)
Pros:
+ Quick propagation of changes
+ Fast failover
+ More control over traffic
Cons:
- More queries (higher load)
- Slightly higher latency on cache miss
- More authoritative server capacity needed
Best for:
- Dynamic infrastructure
- Traffic management
- Disaster recovery scenarios
TTL Strategy by Record Type
┌──────────────────────────────────────────────────────────────┐
│ Record Type │ Recommended TTL │ Rationale │
├─────────────────┼────────────────────┼───────────────────────┤
│ NS │ 86400 (24 hours) │ Rarely change │
│ MX │ 3600-14400 │ Email can retry │
│ A/AAAA (stable) │ 3600-86400 │ Usually cached anyway │
│ A/AAAA (dynamic)│ 60-300 │ Need quick updates │
│ CNAME │ 3600 │ Depends on target │
│ TXT (SPF/DKIM) │ 3600 │ Reasonable balance │
└──────────────────────────────────────────────────────────────┘
TTL and DNS Migrations
When changing DNS records, manage TTL proactively:
Timeline for IP change:
T-24h: Reduce TTL
example.com. 300 IN A 93.184.216.34
(Old IP, short TTL now)
T-0: Make the change
example.com. 300 IN A 198.51.100.50
(New IP)
T+1h: Verify traffic shifted
T+24h: Restore normal TTL
example.com. 3600 IN A 198.51.100.50
(New IP, normal TTL)
The "reduce before, restore after" pattern minimizes
stale cache impact during changes.
Cache Debugging
Check What’s Cached
# Query specific resolver (bypass local cache)
$ dig @8.8.8.8 example.com
# Check TTL remaining
$ dig example.com | grep -A1 "ANSWER SECTION"
example.com. 2847 IN A 93.184.216.34
│
└── 2847 seconds remaining in cache
# Compare different resolvers
$ dig @8.8.8.8 example.com +short
$ dig @1.1.1.1 example.com +short
$ dig @9.9.9.9 example.com +short
# Different results = propagation in progress
Force Fresh Lookup
# Query authoritative directly
$ dig @ns1.example.com example.com
# Trace (bypasses cache, queries authoritatively)
$ dig +trace example.com
# No recursion (only ask one server)
$ dig +norecurse @a.root-servers.net example.com
Caching Issues
Inconsistent Results
Problem:
dig @8.8.8.8 example.com → 1.2.3.4
dig @1.1.1.1 example.com → 5.6.7.8
Causes:
- Recent change, propagation in progress
- Different servers have different cache ages
- Anycast resolvers hit different instances
Solution:
Wait for TTL to expire everywhere
Typically resolves within max(TTL) time
Cached Failure
Problem:
DNS change made, but users still see old/error
Causes:
- Negative caching (NXDOMAIN cached)
- Old positive record still valid
- Client-side cache not flushed
Debug:
1. Check TTL of cached record
2. Check negative TTL (SOA minimum)
3. Flush caches at multiple levels
4. Wait for TTL expiration
Cache Poisoning (Security)
Attack:
Attacker injects fake record into resolver cache
Users sent to malicious server
Mitigations:
- DNSSEC (cryptographic validation)
- Source port randomization
- Query ID randomization
- Response validation (0x20 encoding)
Summary
DNS caching is hierarchical and TTL-controlled:
| Cache Location | Typical TTL Cap | Flush Method |
|---|---|---|
| Browser | 60s | Restart or clear |
| OS | varies | System-specific |
| Local resolver | varies | Restart service |
| Recursive resolver | Respects record TTL | Wait |
TTL guidelines:
- Stable records: 3600-86400 seconds
- Dynamic records: 60-300 seconds
- Before changes: Reduce TTL in advance
- After changes: Wait for old TTL to expire
Caching makes DNS fast but requires understanding for:
- Planning DNS changes
- Debugging resolution issues
- Balancing freshness vs. performance
Next, we’ll explore DNSSEC—how DNS responses can be cryptographically validated.
DNSSEC
DNSSEC (Domain Name System Security Extensions) adds cryptographic authentication to DNS. It allows resolvers to verify that DNS responses haven’t been tampered with—protecting against attacks like cache poisoning.
The Problem DNSSEC Solves
Traditional DNS has no authentication:
Without DNSSEC:
Client: "What's the IP for bank.com?"
Legitimate response: OR Attacker's response:
bank.com → 1.2.3.4 bank.com → 6.6.6.6 (malicious)
How does client know which is real?
It can't! DNS responses are unsigned.
Attacks possible:
- Cache poisoning (inject fake records)
- Man-in-the-middle (intercept and modify)
- Redirection to phishing sites
How DNSSEC Works
DNSSEC adds digital signatures to DNS records:
With DNSSEC:
Zone operator:
1. Generates signing keys
2. Signs each record set
3. Publishes signatures alongside records
Resolver:
1. Receives record + signature
2. Retrieves zone's public key
3. Verifies signature
4. If valid → Trust the record
5. If invalid → Reject (SERVFAIL)
Attacker cannot forge valid signatures without private key.
DNSSEC Record Types
DNSKEY (Public Key)
Zone's public key for signature verification:
example.com. 3600 IN DNSKEY 257 3 13 (
mdsswUyr3DPW132mOi8V9xESWE8jTo0d
xCjjnopKl+GqJxpVXckHAeF+KkxLbxIL
fDLUT0rAK9iUzy1L53eKGQ==
)
Fields:
257 = Flags (257 = Key Signing Key; 256 = Zone Signing Key)
3 = Protocol (always 3)
13 = Algorithm (13 = ECDSA P-256)
Base64 = The public key
RRSIG (Resource Record Signature)
Signature over a record set:
example.com. 3600 IN A 93.184.216.34
example.com. 3600 IN RRSIG A 13 2 3600 (
20240215000000
20240201000000
12345 example.com.
oJB1W6WNGv+ldvQ3WDG0MQkg5IEhjRip
8WTrPYGv07h108dUKGMeDPKijVCHX3DD
Kdfb+v6oB9wfuh3DTJXUAfI= )
Fields:
A = Type being signed
13 = Algorithm
2 = Labels in name
3600 = Original TTL
Dates = Validity window (expiration, then inception)
12345 = Key tag (identifies signing key)
Base64 = The signature
DS (Delegation Signer)
Links child zone to parent (chain of trust):
In .com zone:
example.com. 86400 IN DS 12345 13 2 (
49FD46E6C4B45C55D4AC69CBD3CD3440
9B20CAC6B08F4E7FAE3F2BDDBF1BB349 )
Fields:
12345 = Key tag of child's KSK
13 = Algorithm
2 = Digest type (2 = SHA-256)
Hex = Hash of child's DNSKEY
Parent vouches for child's key.
Enables trust chain from root.
NSEC/NSEC3 (Authenticated Denial)
Proves a name doesn't exist:
Query: nonexistent.example.com
Response: NXDOMAIN + NSEC record
NSEC proves there's no record between two names:
aaa.example.com. NSEC zzz.example.com. A AAAA
"There's nothing between aaa and zzz"
Therefore nonexistent.example.com doesn't exist.
NSEC3: Hashed version (hinders zone enumeration)
Chain of Trust
DNSSEC builds a chain from root to leaf:
┌──────────────────┐
│ Root Zone (.) │
│ │
│ DNSKEY (root) │ ← Hardcoded in resolvers
└────────┬─────────┘ (trust anchor)
│
Signed DS record for .com
│
┌────────▼─────────┐
│ .com TLD │
│ │
│ DNSKEY (.com) │
└────────┬─────────┘
│
Signed DS record for example.com
│
┌────────▼─────────┐
│ example.com │
│ │
│ DNSKEY │
│ A record + RRSIG│
└──────────────────┘
Each level signs the next level's key hash.
Trust flows from root anchor to leaf records.
Validation Process
Resolver validating example.com A record:
1. Get example.com A + RRSIG
2. Get example.com DNSKEY
3. Verify RRSIG with DNSKEY ✓
4. Get DS for example.com (from .com zone)
5. Verify DS matches DNSKEY hash ✓
6. Get .com DNSKEY
7. Verify DS RRSIG with .com DNSKEY ✓
8. Get DS for .com (from root zone)
9. Verify DS matches .com DNSKEY hash ✓
10. Get root DNSKEY
11. Verify against trust anchor ✓
All checks pass → Record is authenticated!
Any check fails → SERVFAIL (reject response)
Querying DNSSEC Records
# Request DNSSEC records
$ dig example.com +dnssec
# Check if domain is signed
$ dig example.com DNSKEY
$ dig example.com DS
# Verify signature chain (+sigchase requires an older dig build; prefer delv)
$ dig +sigchase example.com
# Use delv (DNSSEC-aware dig)
$ delv example.com
; fully validated
example.com. 86400 IN A 93.184.216.34
# Skip validation (CD = Checking Disabled)
$ dig +cd example.com
DNSSEC Status Check
# Online validators:
# https://dnssec-analyzer.verisignlabs.com/
# https://dnsviz.net/
# Command line check:
$ delv @8.8.8.8 example.com
; fully validated ← DNSSEC working
; unsigned answer ← Not signed
; validation failed ← Signature invalid
# Check with drill
$ drill -S example.com
Key Management
Key Types
Zone Signing Key (ZSK):
- Signs zone records
- Rotated frequently (monthly to quarterly)
- Smaller key (faster signing)
Key Signing Key (KSK):
- Signs the ZSK
- Rotated less often (yearly)
- Referenced by parent's DS record
- Larger key (more security)
Why two keys?
ZSK rotation doesn't require parent update
KSK rotation requires new DS in parent zone
Key Rollover
ZSK Rollover (simpler):
1. Generate new ZSK
2. Publish both old and new DNSKEY
3. Sign with new ZSK
4. After TTL, remove old ZSK
KSK Rollover (complex):
1. Generate new KSK
2. Publish both DNSKEYs
3. Submit new DS to parent
4. Wait for parent propagation
5. Sign ZSKs with new KSK
6. After parent DS TTL, remove old KSK
Automated by most DNS providers.
Deployment Considerations
Enabling DNSSEC
Domain owner must:
1. Sign zone with DNSSEC keys
2. Upload DS record to registrar
3. Registrar submits DS to TLD
4. Chain of trust established
Many registrars/DNS providers automate this:
- Cloudflare: One-click DNSSEC
- Route53: Supports DNSSEC
- Google Domains: Easy setup
Response Size
DNSSEC adds significant size:
Without DNSSEC:
example.com A → ~50 bytes
With DNSSEC:
example.com A + RRSIG + DNSKEY → ~1000+ bytes
Implications:
- May exceed 512-byte UDP limit
- Requires EDNS (larger UDP) or TCP
- More bandwidth usage
Validation Failures
If DNSSEC validation fails:
Validating resolver returns: SERVFAIL
User sees: DNS error / site unreachable
Causes:
- Expired signatures (operator forgot renewal)
- Incorrect DS record (misconfiguration)
- Clock skew (signature timestamps)
- Key rollover problems
This is a feature, not a bug!
Invalid signatures could mean attack.
But operational errors can cause outages.
Limitations
DNSSEC protects authenticity, not privacy:
DNSSEC provides:
✓ Authentication (record from legitimate source)
✓ Integrity (record not modified)
✓ Authenticated denial (NXDOMAIN is real)
DNSSEC does NOT provide:
✗ Confidentiality (queries/responses visible)
✗ Protection from DNS operator
✗ Protection of last-mile (resolver to client)
For privacy: DNS over HTTPS (DoH) or DNS over TLS (DoT)
DNSSEC Adoption
Adoption varies by TLD:
Signed TLDs: .com, .org, .net (all major TLDs)
Domain signing rates:
.nl (Netherlands): ~50%
.se (Sweden): ~40%
.com: ~3%
Validation by resolvers:
8.8.8.8 (Google): Validates
1.1.1.1 (Cloudflare): Validates
ISP resolvers: Varies
Growing but not universal.
Alternatives and Complements
DNS over HTTPS (DoH)
HTTPS encryption for DNS queries:
- Hides queries from network observers
- Bypasses some filtering
- Runs on port 443 (like web traffic)
Complements DNSSEC:
DoH = Privacy (encrypted transport)
DNSSEC = Authenticity (signed records)
Can use both together.
DNS over TLS (DoT)
TLS encryption for DNS:
- Dedicated port 853
- Easier to identify/block than DoH
- Same privacy benefits as DoH
Adoption growing in mobile and resolvers.
Summary
DNSSEC adds cryptographic security to DNS:
| Component | Purpose |
|---|---|
| DNSKEY | Zone’s public keys |
| RRSIG | Signatures on record sets |
| DS | Links child to parent (trust chain) |
| NSEC/NSEC3 | Proves non-existence |
Key points:
- Chain of trust from root to leaf
- Signatures prevent tampering
- Validation failures block responses
- Doesn’t provide privacy (use DoH/DoT)
Considerations:
- Operational complexity (key management)
- Larger responses (more bandwidth)
- Validation failures can cause outages
- Growing but not universal adoption
This completes our DNS coverage. Next, we’ll explore the evolution of HTTP—from 1.0 to HTTP/3.
HTTP Evolution
HTTP (Hypertext Transfer Protocol) is the foundation of the web. What started as a simple protocol for retrieving hypertext documents has evolved into a sophisticated system powering everything from websites to APIs to real-time applications.
The Journey
┌─────────────────────────────────────────────────────────────────────────┐
│ HTTP Timeline │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1991 HTTP/0.9 One-line protocol, GET only │
│ │ │
│ 1996 HTTP/1.0 Headers, methods, status codes │
│ │ Problem: One request per connection │
│ │ │
│ 1997 HTTP/1.1 Persistent connections, pipelining │
│ │ Problem: Head-of-line blocking │
│ │ │
│ 2015 HTTP/2 Binary, multiplexing, server push │
│ │ Problem: TCP head-of-line blocking │
│ │ │
│ 2022 HTTP/3 QUIC transport, UDP-based │
│ Eliminates transport-level blocking │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Why HTTP Keeps Evolving
Each HTTP version addressed limitations of its predecessor:
HTTP/1.0 → HTTP/1.1
Problem: Opening new TCP connection per request is slow
Solution: Keep connections open (persistent connections)
HTTP/1.1 → HTTP/2
Problem: Requests must wait in line, even on persistent connections
Solution: Multiplex requests over single connection
HTTP/2 → HTTP/3
Problem: TCP packet loss blocks ALL streams
Solution: Use QUIC (UDP-based), independent stream delivery
Request-Response Model
Despite version differences, HTTP maintains its fundamental model:
┌────────────────────────────────────────────────────────────────┐
│ HTTP Transaction │
├────────────────────────────────────────────────────────────────┤
│ │
│ Client Server │
│ │ │ │
│ │─────────── Request ────────────────>│ │
│ │ │ │
│ │ GET /index.html HTTP/1.1 │ │
│ │ Host: example.com │ │
│ │ Accept: text/html │ │
│ │ │ │
│ │ │ │
│ │<──────────── Response ──────────────│ │
│ │ │ │
│ │ HTTP/1.1 200 OK │ │
│ │ Content-Type: text/html │ │
│ │ Content-Length: 1234 │ │
│ │ │ │
│ │ <!DOCTYPE html>... │ │
│ │ │ │
└────────────────────────────────────────────────────────────────┘
Request = Method + Path + Headers + (optional) Body
Response = Status + Headers + (optional) Body
Key HTTP Concepts
Methods
GET Retrieve a resource
POST Submit data, create resource
PUT Replace a resource
PATCH Partially modify a resource
DELETE Remove a resource
HEAD GET without body (metadata only)
OPTIONS Describe communication options
Status Codes
1xx Informational 100 Continue, 101 Switching Protocols
2xx Success 200 OK, 201 Created, 204 No Content
3xx Redirection 301 Moved, 302 Found, 304 Not Modified
4xx Client Error 400 Bad Request, 401 Unauthorized, 404 Not Found
5xx Server Error 500 Internal Error, 502 Bad Gateway, 503 Unavailable
Headers
Request headers:
Host: example.com (required in HTTP/1.1+)
Accept: application/json (preferred response type)
Authorization: Bearer xyz (credentials)
Cookie: session=abc123 (state)
Response headers:
Content-Type: text/html (body format)
Content-Length: 1234 (body size)
Cache-Control: max-age=3600 (caching rules)
Set-Cookie: session=abc123 (set state)
What You’ll Learn
In this chapter:
- HTTP/1.0 and HTTP/1.1: The text-based foundation
- HTTP/2: Binary framing and multiplexing
- HTTP/3 and QUIC: The modern, UDP-based protocol
Understanding HTTP evolution helps you:
- Choose appropriate protocol versions
- Optimize web performance
- Debug connection issues
- Design efficient APIs
HTTP/1.0 and HTTP/1.1
HTTP/1.x established the patterns still used today: request-response over TCP, text-based headers, and the familiar verbs like GET and POST. Understanding these versions explains why later versions were needed.
HTTP/1.0 (1996)
Basic Request-Response
Client connects to server:
1. TCP handshake (SYN, SYN-ACK, ACK)
2. Send HTTP request
3. Receive HTTP response
4. Close connection
Every request = New TCP connection!
Request Format
GET /index.html HTTP/1.0
User-Agent: Mozilla/5.0
Accept: text/html
That’s it—method, path, version, and optional headers. Blank line ends headers.
Response Format
HTTP/1.0 200 OK
Content-Type: text/html
Content-Length: 1234
<!DOCTYPE html>
<html>
...
</html>
Status line, headers, blank line, body.
The Connection Problem
Loading a webpage with HTTP/1.0:
Page needs:
- index.html (1 request)
- style.css (1 request)
- script.js (1 request)
- logo.png (1 request)
- header.png (1 request)
HTTP/1.0 timeline (sequential):
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ ├─TCP─┤├───HTML────┤ │
│ ├─TCP─┤├───CSS────┤ │
│ ├─TCP─┤├───JS────┤ │
│ ├─TCP─┤├PNG1─┤ │
│ ... │
│ │
│ Total: 5 TCP handshakes + 5 requests = Very slow! │
└─────────────────────────────────────────────────────────────────────────┘
Each resource requires:
- TCP handshake (~1 RTT)
- Request + response (~1 RTT)
- TCP teardown
For 10 resources over 100ms RTT: ~2 seconds just for overhead!
HTTP/1.1 (1997)
HTTP/1.1 addressed the connection overhead with several improvements.
Persistent Connections
Connections stay open by default:
HTTP/1.0:
Connection: close (default, close after response)
Connection: keep-alive (optional, keep open)
HTTP/1.1:
Connection: keep-alive (default, keep open)
Connection: close (optional, close after response)
Connection Reuse
HTTP/1.1 timeline (persistent connection):
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ ├─TCP─┤├─HTML─┤├─CSS─┤├─JS─┤├─PNG1─┤├─PNG2─┤ │
│ │
│ One TCP handshake, multiple requests! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Savings: 4 fewer TCP handshakes = ~400ms on 100ms RTT link
Pipelining
Send multiple requests without waiting for responses:
Without pipelining:
Request 1 → Response 1 → Request 2 → Response 2
With pipelining:
Request 1 → Request 2 → Request 3 → Response 1 → Response 2 → Response 3
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ Client: [Req1][Req2][Req3] │
│ Server: [Resp1][Resp2][Resp3] │
│ │
│ Server processes in parallel (potentially) │
│ But responses MUST be in request order! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Pipelining’s Fatal Flaw: Head-of-Line Blocking
Pipelining problem:
Requests sent: [HTML][CSS][JS]
Server ready: JS(10ms), CSS(20ms), HTML(500ms)
Must respond in order:
├────────HTML (500ms)────────┤├CSS┤├JS┤
JS is ready instantly but waits 500ms for HTML!
This is "head-of-line blocking."
Reality: Pipelining rarely used
- Complex to implement correctly
- Many proxies don't support it
- HOL blocking negates benefits
- Browsers disabled it by default
Multiple Connections Workaround
Browsers work around HTTP/1.1 limitations:
Browser opens 6 parallel connections per domain:
Connection 1: [HTML]─────────[Image5]──────────
Connection 2: [CSS]─────[Image1]──────[Image6]─
Connection 3: [JS1]─────[Image2]───────────────
Connection 4: [JS2]─────[Image3]───────────────
Connection 5: [Font]────[Image4]───────────────
Connection 6: [Icon]────[Image7]───────────────
Parallel downloads without pipelining!
But: 6 TCP connections = 6× overhead
6× congestion control windows
Not efficient
Domain Sharding (Historical)
Workaround for 6-connection limit:
Instead of:
example.com/style.css
example.com/script.js
example.com/image1.png
Use:
example.com/style.css
static1.example.com/script.js
static2.example.com/image1.png
Browser sees different domains:
6 connections to example.com
6 connections to static1.example.com
6 connections to static2.example.com
= 18 parallel connections!
Downsides:
- More TCP overhead
- More TLS handshakes (if HTTPS)
- DNS lookups for each domain
- Cache fragmentation
Note: Harmful with HTTP/2! (multiplexing is better)
Host Header (Required)
HTTP/1.1 requires the Host header:
GET /page.html HTTP/1.1
Host: www.example.com
Enables virtual hosting—multiple sites on one IP:
Server at 192.168.1.100 hosts:
- www.example.com
- www.another-site.com
- api.example.com
Host header tells server which site is requested.
Without it: Server doesn't know which site you want!
Chunked Transfer Encoding
Send response without knowing size upfront:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
18
This is the first chunk.
19
This is the second chunk.
0
Format: Size (hex) + CRLF + Data + CRLF, ending with 0
Use cases:
- Streaming responses
- Server-generated content
- Live data feeds
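The framing above is simple enough to implement directly. A minimal encoder/decoder sketch (no trailers, and the decoder assumes a well-formed stream):

```python
def encode_chunked(chunks):
    """Encode byte chunks using HTTP/1.1 chunked transfer encoding:
    hex size, CRLF, data, CRLF, terminated by a zero-size chunk."""
    out = b""
    for chunk in chunks:
        out += f"{len(chunk):x}".encode() + b"\r\n" + chunk + b"\r\n"
    return out + b"0\r\n\r\n"

def decode_chunked(data):
    """Inverse: reassemble the body from a chunked stream."""
    body, pos = b"", 0
    while True:
        eol = data.index(b"\r\n", pos)
        size = int(data[pos:eol], 16)    # size line is hex
        if size == 0:
            return body
        body += data[eol + 2:eol + 2 + size]
        pos = eol + 2 + size + 2         # skip data + trailing CRLF

body = encode_chunked([b"This is the first chunk.",
                       b"This is the second chunk."])
# body starts with b"18\r\n" (0x18 = 24 bytes)
```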
Additional HTTP/1.1 Features
100 Continue
Client: POST /upload HTTP/1.1
Content-Length: 10000000
Expect: 100-continue
Server: HTTP/1.1 100 Continue
Client: (sends 10MB body)
Server: HTTP/1.1 200 OK
Avoids sending large body if server will reject it.
Range Requests
GET /large-file.zip HTTP/1.1
Range: bytes=1000-1999
HTTP/1.1 206 Partial Content
Content-Range: bytes 1000-1999/50000
Resume interrupted downloads, video seeking.
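Server-side handling of a single-range request can be sketched briefly. This is a simplified sketch: it handles only `bytes=start-end`, not suffix ranges, open-ended ranges, or multipart ranges.

```python
def serve_range(data, range_header):
    """Handle a simple single-range request like "bytes=1000-1999".

    Returns (status, payload, content_range_header_value).
    """
    unit, _, spec = range_header.partition("=")
    if unit != "bytes" or "-" not in spec:
        return 400, b"", None
    start_s, end_s = spec.split("-", 1)
    start, end = int(start_s), int(end_s)
    if start >= len(data) or end < start:
        return 416, b"", None            # Range Not Satisfiable
    end = min(end, len(data) - 1)        # clamp to resource size
    part = data[start:end + 1]
    return 206, part, f"bytes {start}-{end}/{len(data)}"

data = bytes(50000)
status, part, cr = serve_range(data, "bytes=1000-1999")
# status == 206, len(part) == 1000, cr == "bytes 1000-1999/50000"
```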
Cache Control
Cache-Control: max-age=3600, must-revalidate
ETag: "abc123"
Last-Modified: Wed, 21 Oct 2015 07:28:00 GMT
Sophisticated caching for performance.
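The ETag revalidation flow can be sketched as a small decision function: if the client's If-None-Match matches the current ETag, the server answers 304 with no body. Simplified sketch; real servers also distinguish strong from weak validators.

```python
def conditional_get(etag, if_none_match):
    """Decide between 200 and 304 for a conditional GET.

    etag: the resource's current ETag, e.g. '"abc123"'.
    if_none_match: the If-None-Match header value, or None.
    Handles '*' and comma-separated lists of tags.
    """
    if if_none_match is None:
        return 200
    candidates = [t.strip() for t in if_none_match.split(",")]
    if "*" in candidates or etag in candidates:
        return 304   # Not Modified: client's cached copy is fresh
    return 200

status = conditional_get('"abc123"', '"abc123"')
# status == 304
```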
HTTP/1.1 Example Session
$ telnet example.com 80
Trying 93.184.216.34...
Connected to example.com.
GET / HTTP/1.1
Host: example.com
Connection: keep-alive
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 1256
Connection: keep-alive
Cache-Control: max-age=604800
<!doctype html>
<html>
<head>
<title>Example Domain</title>
...
GET /favicon.ico HTTP/1.1
Host: example.com
Connection: close
HTTP/1.1 404 Not Found
Content-Length: 0
Connection: close
Connection closed by foreign host.
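Responses like the ones in the telnet session are plain text, so parsing them is straightforward. A simplified sketch: it assumes CRLF line endings and a non-chunked body.

```python
def parse_response(raw):
    """Split a raw HTTP/1.1 response into (status, headers, body).

    Header names are lowercased, since HTTP headers are
    case-insensitive.
    """
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    status = int(lines[0].split(" ")[1])   # "HTTP/1.1 200 OK" -> 200
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(": ")
        headers[name.lower()] = value
    return status, headers, body

raw = ("HTTP/1.1 200 OK\r\n"
       "Content-Type: text/html; charset=UTF-8\r\n"
       "Content-Length: 5\r\n"
       "\r\n"
       "hello")
status, headers, body = parse_response(raw)
# status == 200, headers["content-length"] == "5", body == "hello"
```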
HTTP/1.x Limitations Summary
┌─────────────────────────────────────────────────────────────────────────┐
│ HTTP/1.x Limitations │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Head-of-Line Blocking │
│ Responses must be in request order │
│ One slow response blocks all others │
│ │
│ 2. Textual Protocol Overhead │
│ Headers are uncompressed text │
│ Same headers sent repeatedly │
│ │
│ 3. No Request Prioritization │
│ Can't indicate which resources are critical │
│ Server processes arbitrarily │
│ │
│ 4. Client-Initiated Only │
│ Server can't push resources proactively │
│ Client must request everything explicitly │
│ │
└─────────────────────────────────────────────────────────────────────────┘
When HTTP/1.1 Is Still Used
HTTP/1.1 remains common for:
- Simple APIs (few requests per connection)
- Internal services (low latency networks)
- Legacy system compatibility
- Debugging (human-readable)
- When HTTP/2 isn't supported
Modern web: HTTP/2 or HTTP/3 preferred
Better performance with no application changes
Summary
| Feature | HTTP/1.0 | HTTP/1.1 |
|---|---|---|
| Persistent connections | Optional | Default |
| Host header | Optional | Required |
| Chunked transfer | No | Yes |
| Pipelining | No | Yes (rarely used) |
| Cache-Control | Limited | Full support |
| Range requests | No | Yes |
| 100 Continue | No | Yes |
HTTP/1.1 significantly improved on 1.0 but still suffers from head-of-line blocking. HTTP/2 was designed to solve this—which we’ll explore next.
HTTP/2: Multiplexing Revolution
HTTP/2 (2015) reimagined how HTTP works at the wire level while maintaining full compatibility with HTTP/1.1 semantics. The result: dramatically faster page loads with no application changes required.
The Core Innovation: Multiplexing
HTTP/2’s killer feature is multiplexing—sending multiple requests and responses over a single TCP connection simultaneously:
HTTP/1.1 (head-of-line blocking):
┌─────────────────────────────────────────────────────────────────────────┐
│ Connection 1: [──────Req 1──────][──────Req 2──────][──Req 3──] │
│ Connection 2: [──────Req 4──────][──Req 5──] │
│ Connection 3: [──Req 6──][──────Req 7──────] │
│ │
│ Sequential on each connection. Multiple connections needed. │
└─────────────────────────────────────────────────────────────────────────┘
HTTP/2 (multiplexed):
┌─────────────────────────────────────────────────────────────────────────┐
│ Single connection: │
│ [R1][R2][R3][R1][R4][R2][R5][R3][R1][R6][R7]... │
│ │
│ All requests interleaved on one connection! │
│ No head-of-line blocking at HTTP level. │
└─────────────────────────────────────────────────────────────────────────┘
Binary Framing Layer
HTTP/2 replaces text with binary frames:
HTTP/1.1 (text):
┌────────────────────────────────────────┐
│ GET /page HTTP/1.1\r\n │
│ Host: example.com\r\n │
│ Accept: text/html\r\n │
│ \r\n │
└────────────────────────────────────────┘
HTTP/2 (binary frames):
┌────────────────────────────────────────┐
│ ┌─────────────┐ ┌─────────────┐ │
│ │HEADERS Frame│ │ DATA Frame │ │
│ │ Stream ID: 1│ │ Stream ID: 1│ │
│ │ (compressed)│ │ (payload) │ │
│ └─────────────┘ └─────────────┘ │
└────────────────────────────────────────┘
Binary format:
+ Efficient parsing (no text scanning)
+ Compact representation
+ Clear frame boundaries
- Not human-readable (need tools)
Frame Structure
Every HTTP/2 message is a series of frames:
Frame Format:
┌────────────────────────────────────────────────────────────────────┐
│ Length (24 bits) │ Type (8) │ Flags (8) │ R │ Stream ID (31 bits) │
├────────────────────────────────────────────────────────────────────┤
│ Frame Payload │
└────────────────────────────────────────────────────────────────────┘
Length: Size of payload (default max 16,384 bytes; raisable via SETTINGS_MAX_FRAME_SIZE)
Type: DATA, HEADERS, PRIORITY, RST_STREAM, SETTINGS, etc.
Flags: Type-specific flags (END_STREAM, END_HEADERS, etc.)
Stream ID: Which stream this frame belongs to
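Given a raw byte buffer, the 9-byte header above can be unpacked in a few lines of Python (a sketch of header parsing only, not a full HTTP/2 frame parser):

```python
import struct

def parse_frame_header(data: bytes):
    """Parse the 9-byte HTTP/2 frame header (RFC 7540, Section 4.1)."""
    if len(data) < 9:
        raise ValueError("need at least 9 bytes")
    # Length is a 24-bit integer; prepend a zero byte to unpack as 32-bit.
    length = struct.unpack(">I", b"\x00" + data[0:3])[0]
    frame_type = data[3]
    flags = data[4]
    # The high bit of the stream ID field is reserved (R); mask it off.
    stream_id = struct.unpack(">I", data[5:9])[0] & 0x7FFFFFFF
    return length, frame_type, flags, stream_id

# A SETTINGS frame header: length=0, type=0x4, flags=ACK (0x1), stream 0
header = bytes([0, 0, 0, 0x4, 0x1, 0, 0, 0, 0])
print(parse_frame_header(header))  # (0, 4, 1, 0)
```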
Frame Types
┌──────────────────────────────────────────────────────────────────┐
│ Type │ Purpose │
├──────────────┼───────────────────────────────────────────────────┤
│ DATA │ Request/response body data │
│ HEADERS │ Request/response headers (compressed) │
│ PRIORITY │ Stream priority information │
│ RST_STREAM │ Terminate a stream │
│ SETTINGS │ Connection configuration │
│ PUSH_PROMISE│ Server push notification │
│ PING │ Connection health check │
│ GOAWAY │ Graceful connection shutdown │
│ WINDOW_UPDATE│ Flow control window adjustment │
│ CONTINUATION│ Continuation of HEADERS │
└──────────────┴───────────────────────────────────────────────────┘
Streams
A stream is a bidirectional sequence of frames within a connection:
Single HTTP/2 connection with multiple streams:
┌─────────────────────────────────────────────────────────────────────┐
│ TCP Connection │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Stream 1: [HEADERS]──[DATA]──[DATA]──[DATA] │ │
│ │ Stream 3: [HEADERS]──[DATA] │ │
│ │ Stream 5: [HEADERS] │ │
│ │ Stream 7: [HEADERS]──[DATA]──[DATA] │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Stream IDs:
- Odd numbers: Client-initiated
- Even numbers: Server-initiated (push)
- 0: Connection-level messages (SETTINGS, PING, GOAWAY)
Request/Response as Streams
HTTP/2 Request (Stream 1):
┌────────────────────────────────────────────────────────────────────┐
│ HEADERS Frame (Stream 1) │
│ :method = GET │
│ :path = /index.html │
│ :scheme = https │
│ :authority = example.com │
│ accept = text/html │
│ END_HEADERS, END_STREAM flags │
└────────────────────────────────────────────────────────────────────┘
HTTP/2 Response (Stream 1):
┌────────────────────────────────────────────────────────────────────┐
│ HEADERS Frame (Stream 1) │
│ :status = 200 │
│ content-type = text/html │
│ END_HEADERS flag │
├────────────────────────────────────────────────────────────────────┤
│ DATA Frame (Stream 1) │
│ [HTML content...] │
│ END_STREAM flag │
└────────────────────────────────────────────────────────────────────┘
Header Compression (HPACK)
HTTP/2 compresses headers using HPACK:
HTTP/1.1 headers (sent every request):
Host: example.com ~17 bytes
User-Agent: Mozilla/5.0... ~70 bytes
Accept: text/html,application/... ~100 bytes
Accept-Language: en-US,en;q=0.5 ~25 bytes
Accept-Encoding: gzip, deflate ~25 bytes
Cookie: session=abc123;... ~50+ bytes
──────────────────────────────────────────────
Total: ~300 bytes per request!
10 requests = 3KB just in headers!
HPACK compression:
1. Static table: 61 common headers (predefined)
2. Dynamic table: Recently used headers (learned)
3. Huffman coding: Compress literal values
First request: ~300 bytes → ~150 bytes (Huffman)
Second request: Same headers → ~30 bytes (indexed!)
10 requests ≈ 300 bytes total (vs 3KB)
HPACK Example
First request headers sent:
:method: GET → Index 2 (static table)
:path: /index.html → Literal, indexed
:authority: example.com → Literal, indexed, Huffman
accept: text/html → Literal, indexed
Dynamic table after request:
[62] :path: /index.html
[63] :authority: example.com
[64] accept: text/html
Second request to same server:
:method: GET → Index 2 (static)
:path: /style.css → Literal (new path)
:authority: example.com → Index 63 (dynamic!)
accept: text/css → Literal (different)
Much smaller because authority is now indexed!
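One concrete piece of HPACK is its variable-length integer encoding (RFC 7541, Section 5.1), used for the table indices above. A minimal Python sketch:

```python
def hpack_encode_int(value: int, prefix_bits: int) -> bytes:
    """Encode an integer with an N-bit prefix (RFC 7541, Section 5.1)."""
    limit = (1 << prefix_bits) - 1
    if value < limit:
        return bytes([value])          # fits entirely in the prefix
    out = [limit]                      # prefix filled with all ones
    value -= limit
    while value >= 128:
        out.append((value % 128) + 128)  # 7-bit group, continuation bit set
        value //= 128
    out.append(value)
    return bytes(out)

# RFC 7541 Appendix C.1.2: encoding 1337 with a 5-bit prefix
print(hpack_encode_int(1337, 5).hex())  # 1f9a0a
```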
Server Push
Servers can proactively send resources:
Without server push:
Client: GET /index.html
Server: (sends HTML)
Client: (parses, sees style.css link)
Client: GET /style.css ← Extra round trip!
Server: (sends CSS)
With server push:
Client: GET /index.html
Server: PUSH_PROMISE /style.css ← "I'll send this too"
Server: (sends HTML)
Server: (sends CSS on separate stream)
Client: (already has CSS when parsing HTML!)
Saves round trip for critical resources.
Server Push Caveats
Push sounds great but has issues:
1. May push already-cached resources
Server doesn't know client cache state
Waste bandwidth pushing what client has
2. Priority problems
Pushed resources may compete with requested ones
Can slow down critical content
3. Limited browser support
Chrome deprecated push support (2022)
Most CDNs recommend disabling
Alternative: 103 Early Hints
Server sends hints before full response
Client can preload without full push complexity
Stream Prioritization
Clients can indicate resource importance:
Priority information:
- Weight: 1-256 (relative importance)
- Dependency: Stream this depends on
Example:
Stream 1 (HTML): Weight=256 (highest)
Stream 3 (CSS): Weight=128, depends on Stream 1
Stream 5 (JS): Weight=128, depends on Stream 1
Stream 7 (image): Weight=64, depends on Stream 3
Priority tree:
            [Stream 1 - HTML]
             /            \
    [Stream 3-CSS]    [Stream 5-JS]
          |
   [Stream 7-image]
Server should:
1. Complete Stream 1 first
2. Then CSS and JS equally
3. Images last
Reality: Server implementation varies
Many servers ignore priorities
Flow Control
HTTP/2 has stream-level flow control:
Connection flow control:
Each side advertises receive window
Similar to TCP flow control
Stream flow control:
Each stream has its own window
Prevents one stream from consuming all bandwidth
WINDOW_UPDATE frame:
Signals capacity for more data
┌────────────────────────────────────────────────────────────────────┐
│ Stream 1: Window=65535 │
│ Stream 3: Window=65535 │
│ Connection: Window=1048576 │
│ │
│ Server sends 32768 bytes on Stream 1: │
│ Stream 1: Window=32767 │
│ Connection: Window=1015808 │
│ │
│ Client sends WINDOW_UPDATE (Stream 1, 32768): │
│ Stream 1: Window=65535 (restored) │
└────────────────────────────────────────────────────────────────────┘
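The window accounting in the diagram above can be sketched in Python (simplified: a single stream plus the connection window, no frame parsing; FlowWindow is a name invented for this sketch):

```python
class FlowWindow:
    """Track a stream-level and connection-level send window."""
    def __init__(self, stream_window=65535, conn_window=1048576):
        self.stream = stream_window
        self.conn = conn_window

    def send(self, n: int) -> bool:
        # A sender may only transmit what BOTH windows allow.
        if n > self.stream or n > self.conn:
            return False
        self.stream -= n
        self.conn -= n
        return True

    def window_update(self, n: int):
        # Peer's WINDOW_UPDATE advertises capacity for n more bytes.
        self.stream += n

w = FlowWindow()
assert w.send(32768)
print(w.stream, w.conn)  # 32767 1015808
w.window_update(32768)
print(w.stream)          # 65535
```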
HTTP/2 Connection Setup
The HTTP/2 spec allows cleartext (h2c), but every major browser requires TLS, so in practice HTTP/2 means HTTPS:
1. TCP handshake
2. TLS handshake (ALPN negotiates HTTP/2)
3. HTTP/2 connection preface:
Client sends: "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
Both send: SETTINGS frame
4. Ready for requests!
ALPN (Application-Layer Protocol Negotiation):
Client TLS hello includes: "I support h2, http/1.1"
Server chooses: "Let's use h2"
Connection established as HTTP/2
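Using Python's standard ssl module, a client can offer h2 via ALPN like this (a sketch; make_h2_client_context is a name chosen here, not a standard API):

```python
import ssl

def make_h2_client_context() -> ssl.SSLContext:
    """Client-side TLS context that offers HTTP/2 via ALPN."""
    ctx = ssl.create_default_context()
    # Offer h2 first; the server picks one during the TLS handshake.
    ctx.set_alpn_protocols(["h2", "http/1.1"])
    return ctx

ctx = make_h2_client_context()
# After wrapping a socket and handshaking, the negotiated protocol is
# available via sock.selected_alpn_protocol() ("h2" if the server agreed).
```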
The Remaining Problem: TCP HOL Blocking
HTTP/2 solved HTTP-level head-of-line blocking but TCP has its own:
HTTP/2 over TCP problem:
Stream 1: [Frame 1][Frame 2][Frame 3]
Stream 3: [Frame A][Frame B]
Stream 5: [Frame X][Frame Y]
TCP sees: [1][A][2][X][B][3][Y]
If TCP packet containing [2] is lost:
TCP retransmits [2]
ALL subsequent data waits!
Frames [X][B][3][Y] all blocked
Even though X,B,Y are independent streams!
This is TCP-level head-of-line blocking.
HTTP/3 solves this with QUIC.
HTTP/2 Performance
When HTTP/2 shines:
- Many small resources (multiplexing wins)
- High latency connections (fewer round trips)
- Header-heavy requests (compression helps)
- HTTPS (required anyway, TLS overhead amortized)
When HTTP/2 helps less:
- Single large download (one stream anyway)
- Very low latency networks (overhead matters less)
- Lossy networks (TCP HOL blocking hurts)
Typical improvements:
Page load time: 10-50% faster
Time to first byte: Similar or slightly better
Number of connections: 1 vs 6+ (simpler)
Debugging HTTP/2
# curl with HTTP/2
$ curl -I --http2 https://example.com
HTTP/2 200
content-type: text/html
# nghttp client
$ nghttp -nv https://example.com
# Chrome DevTools
Network tab → Protocol column shows "h2"
# Wireshark
Filter: http2
Decode TLS with SSLKEYLOGFILE
Summary
HTTP/2’s key innovations:
| Feature | Benefit |
|---|---|
| Binary framing | Efficient parsing, clear boundaries |
| Multiplexing | Multiple requests on one connection |
| Header compression | ~85% reduction in header size |
| Stream prioritization | Better resource loading order |
| Server push | Proactive resource delivery |
| Flow control | Per-stream bandwidth management |
Limitations:
- TCP head-of-line blocking remains
- Server push deprecated in browsers
- Complexity increased
HTTP/2 is a significant improvement over HTTP/1.1, but TCP’s head-of-line blocking motivated HTTP/3’s move to QUIC—which we’ll cover next.
HTTP/3 and QUIC
HTTP/3 (2022) takes a radical approach: instead of building on TCP, it uses QUIC—a new transport protocol running over UDP. This eliminates TCP’s head-of-line blocking and enables features impossible with TCP.
Why Replace TCP?
HTTP/2’s remaining problem was TCP itself:
TCP Head-of-Line Blocking:
HTTP/2 multiplexes streams:
Stream 1: [A][B][C]
Stream 3: [X][Y][Z]
TCP sees single byte stream:
[A][X][B][Y][C][Z]
TCP packet lost (containing [B]):
- TCP waits for retransmit
- ALL data after [B] blocked
- [Y][C][Z] wait even though they're independent
This defeats HTTP/2's multiplexing benefits on lossy networks.
QUIC: UDP-Based Transport
QUIC (originally Google's "Quick UDP Internet Connections"; in the IETF standard, QUIC is simply a name, not an acronym) provides TCP-like reliability over UDP:
┌─────────────────────────────────────────────────────────────────────┐
│ Protocol Comparison │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ HTTP/1.1, HTTP/2: HTTP/3: │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ HTTP │ │ HTTP/3 │ │
│ ├─────────────┤ ├─────────────┤ │
│ │ TLS │ │ QUIC │ ← Includes TLS! │
│ ├─────────────┤ ├─────────────┤ │
│ │ TCP │ │ UDP │ │
│ ├─────────────┤ ├─────────────┤ │
│ │ IP │ │ IP │ │
│ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
QUIC provides:
- Reliable delivery (like TCP)
- Congestion control (like TCP)
- Encryption (built-in TLS 1.3)
- Stream multiplexing (independent streams!)
- Connection migration
- 0-RTT connection resumption
No Head-of-Line Blocking
QUIC streams are independent:
QUIC stream multiplexing:
Stream 1: [A]──[B]──[C]
Stream 3: [X]──[Y]──[Z]
UDP packets:
Packet 1: [A][X]
Packet 2: [B][Y] ← Lost!
Packet 3: [C][Z]
What happens:
Packet 3 arrives, QUIC delivers:
Stream 3: [Z] delivered immediately!
Packet 2 retransmitted, then:
Stream 1: [B] delivered
Stream 3: [Y] delivered
Stream 3 doesn't wait for Stream 1's retransmit!
Each stream has independent delivery.
Faster Connection Establishment
TCP+TLS: 2-3 Round Trips
TCP + TLS 1.3 connection:
Client Server
│ │
│────── TCP SYN ──────────────────────────>│
│<───── TCP SYN-ACK ───────────────────────│
│────── TCP ACK ──────────────────────────>│ ← 1 RTT (TCP)
│ │
│────── TLS ClientHello ──────────────────>│
│<───── TLS ServerHello + Finished ────────│
│────── TLS Finished + HTTP Request ──────>│ ← 1 RTT (TLS)
│<───── HTTP Response ─────────────────────│
│ │
Total: 2 RTT before first HTTP response
(3 RTT with TLS 1.2)
QUIC: 1 Round Trip (or 0!)
QUIC initial connection:
Client Server
│ │
│────── QUIC Initial + TLS Hello ─────────>│
│<───── QUIC Initial + TLS + ACK ──────────│
│────── QUIC + HTTP Request ──────────────>│ ← 1 RTT total!
│<───── HTTP Response ─────────────────────│
│ │
QUIC combines transport + crypto handshake!
TLS 1.3 is integrated into QUIC.
0-RTT Connection Resumption
Returning to a previously visited server:
Client Server
│ │
│────── QUIC 0-RTT + HTTP Request ────────>│ ← No handshake!
│<───── HTTP Response ─────────────────────│
│ │
How it works:
- Client cached server's "resumption token"
- Client sends encrypted request immediately
- Server validates token, responds immediately
Caveat: 0-RTT data is replayable
- Attackers can replay the request
- Safe only for idempotent requests (GET)
- Server can implement replay protection
Connection Migration
QUIC connections can survive network changes:
Traditional TCP:
Connection = (Source IP, Source Port, Dest IP, Dest Port)
Phone switches WiFi → Cellular:
IP address changes!
TCP connection breaks
Must establish new connection
HTTP request fails/retries
QUIC:
Connection = Connection ID (random identifier)
Phone switches WiFi → Cellular:
IP address changes
Connection ID unchanged
QUIC connection continues!
HTTP request completes seamlessly
Connection Migration Flow
┌─────────────────────────────────────────────────────────────────────┐
│ Connection Migration │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Client on WiFi: 192.168.1.100 │
│ Server: 93.184.216.34 │
│ Connection ID: 0xABCD1234 │
│ │
│ Client ────[QUIC 0xABCD1234]──────> Server │
│ Client <───[QUIC 0xABCD1234]─────── Server │
│ │
│ --- Client moves to cellular: 10.0.0.50 --- │
│ │
│ Client ────[QUIC 0xABCD1234]──────> Server │
│ ↑ │ │
│ │ New IP, same connection ID! │ │
│ │ ▼ │
│ Server validates connection ID, │
│ continues same connection! │
│ │
└─────────────────────────────────────────────────────────────────────┘
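The lookup difference can be illustrated with a toy Python sketch (quic_sessions and handle_packet are hypothetical names): sessions are keyed by connection ID, so a changed source address maps to the same session:

```python
# Sessions keyed by connection ID, not by the (IP, port) 4-tuple.
quic_sessions = {}

def handle_packet(conn_id: str, client_addr: tuple, payload: str):
    # The lookup ignores the source address entirely.
    session = quic_sessions.setdefault(conn_id, {"history": []})
    session["last_addr"] = client_addr   # the path may change; the session does not
    session["history"].append(payload)
    return session

handle_packet("0xABCD1234", ("192.168.1.100", 52000), "request part 1")
# Client migrates from WiFi to cellular: new address, same connection ID.
s = handle_packet("0xABCD1234", ("10.0.0.50", 40112), "request part 2")
print(len(quic_sessions), s["history"])  # 1 ['request part 1', 'request part 2']
```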
QUIC Encryption
All QUIC packets are encrypted (except initial handshake):
TCP + TLS:
TCP header: Visible (port, seq, etc.)
TLS record: Encrypted
HTTP data: Encrypted
Middleboxes can see TCP headers, manipulate connections
QUIC:
UDP header: Visible (minimal: ports only)
QUIC header: Partially encrypted
QUIC payload: Fully encrypted
Middleboxes see only UDP ports
Cannot inspect or manipulate QUIC layer
Benefits of Always-On Encryption
1. Privacy: HTTP/3 headers/content always encrypted
2. Security: Harder to inject/modify traffic
3. Ossification prevention: Middleboxes can't break QUIC
4. Future-proofing: Protocol can evolve without breaking
HTTP/3 Frames
HTTP/3 uses frames similar to HTTP/2, but over QUIC streams:
HTTP/3 Frame Types:
DATA - Request/response body
HEADERS - Headers (QPACK compressed)
CANCEL_PUSH - Cancel server push
SETTINGS - Connection settings
PUSH_PROMISE - Server push notification
GOAWAY - Connection shutdown
MAX_PUSH_ID - Limit on push streams
Key difference from HTTP/2:
- Each request/response on separate QUIC stream
- No stream multiplexing layer in HTTP/3 itself (QUIC provides it)
- Flow control handled by QUIC layer
QPACK Header Compression
HTTP/3 uses QPACK (QUIC-aware HPACK variant):
HPACK problem with QUIC:
HPACK uses dynamic table updated per header
Headers arrive out of order in QUIC
Can't update table until all prior updates processed
→ Head-of-line blocking in header compression!
QPACK solution:
- Separate unidirectional streams for table updates
- Encoder/decoder can choose blocking behavior
- Trades compression ratio for lower latency
Result: Slightly less compression than HPACK,
but no header compression blocking
HTTP/3 Adoption
As of 2024:
- ~25% of websites support HTTP/3
- All major browsers support HTTP/3
- Major CDNs (Cloudflare, Akamai, Fastly) support HTTP/3
- Google, Facebook, and others use HTTP/3 heavily
Server support:
- nginx: Experimental
- Cloudflare: Full support
- LiteSpeed: Full support
- Caddy: Full support
- IIS: Not yet
Client support:
- Chrome: Yes
- Firefox: Yes
- Safari: Yes
- Edge: Yes
- curl: Yes (with HTTP/3 build)
Deploying HTTP/3
Server Configuration
# nginx (experimental)
server {
listen 443 quic reuseport;
listen 443 ssl;
ssl_certificate /path/to/cert.pem;
ssl_certificate_key /path/to/key.pem;
# Advertise HTTP/3 support
add_header Alt-Svc 'h3=":443"; ma=86400';
}
Alt-Svc Header
Browsers discover HTTP/3 via Alt-Svc:
HTTP/2 200
alt-svc: h3=":443"; ma=86400
Meaning:
h3=":443" - HTTP/3 available on port 443
ma=86400 - Cache this for 24 hours
Browser flow:
1. Connect via HTTP/2 (known to work)
2. See Alt-Svc header
3. Try HTTP/3 for subsequent requests
4. Fall back to HTTP/2 if QUIC blocked
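A simplified parser for the Alt-Svc value above (a sketch: it ignores quoting edge cases and the full header grammar):

```python
def parse_alt_svc(header: str):
    """Parse an Alt-Svc header value into (protocol, authority, params) entries."""
    entries = []
    for entry in header.split(","):
        parts = [p.strip() for p in entry.split(";")]
        proto, authority = parts[0].split("=", 1)
        params = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        entries.append((proto, authority.strip('"'), params))
    return entries

print(parse_alt_svc('h3=":443"; ma=86400'))
# [('h3', ':443', {'ma': '86400'})]
```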
When HTTP/3 Helps Most
Significant improvement:
- High latency connections (satellite, intercontinental)
- Lossy networks (mobile, WiFi congestion)
- Many small resources (API calls)
- Users switching networks (mobile)
Moderate improvement:
- Low latency, reliable networks
- Large single downloads
May not help:
- Local/datacenter communication
- UDP blocked (corporate firewalls)
Debugging HTTP/3
# curl with HTTP/3
$ curl --http3 https://example.com -v
* using HTTP/3
* h3 [:method: GET]
* h3 [:path: /]
...
# Check if site supports HTTP/3
$ curl -sI https://example.com | grep -i alt-svc
# Chrome DevTools
Network tab → Protocol shows "h3"
# qlog for QUIC debugging
Standardized logging format for QUIC
Visualize with qvis (https://qvis.quictools.info/)
Summary
HTTP/3 over QUIC provides:
| Feature | Benefit |
|---|---|
| UDP-based | Avoids TCP ossification |
| Independent streams | No transport HOL blocking |
| 0-RTT resumption | Instant subsequent connections |
| Connection migration | Survives network changes |
| Built-in encryption | Always secure, anti-ossification |
| QPACK compression | Efficient headers without blocking |
Trade-offs:
- UDP may be blocked by firewalls
- More CPU for encryption/decryption
- Newer, less mature implementations
- Debugging tools still evolving
HTTP/3 represents the cutting edge of web protocols. For deep dives into QUIC itself, see the next chapter.
QUIC Protocol
QUIC is a general-purpose transport protocol that originated at Google and is now standardized by the IETF. While HTTP/3 is its most visible use, QUIC can transport any application protocol that currently uses TCP.
QUIC at a Glance
┌─────────────────────────────────────────────────────────────────────┐
│ QUIC Features │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Transport Layer │
│ ✓ Reliable, ordered delivery per stream │
│ ✓ Congestion control │
│ ✓ Flow control (connection and stream level) │
│ │
│ Encryption │
│ ✓ TLS 1.3 integrated (mandatory) │
│ ✓ Encrypted headers and payload │
│ ✓ Protected from middlebox interference │
│ │
│ Multiplexing │
│ ✓ Independent streams (no HOL blocking) │
│ ✓ Bidirectional and unidirectional streams │
│ ✓ Stream priorities │
│ │
│ Connection │
│ ✓ Connection IDs (survives IP changes) │
│ ✓ 0-RTT resumption │
│ ✓ 1-RTT initial connection │
│ │
└─────────────────────────────────────────────────────────────────────┘
Why Build on UDP?
Why not improve TCP?
1. Kernel Dependency
TCP is implemented in OS kernels
Changes require kernel updates
Deployment takes years
2. Middlebox Ossification
Firewalls, NATs inspect TCP headers
"Unknown" TCP options get dropped
TCP extensions rarely deploy successfully
3. Head-of-Line Blocking
TCP's byte stream model is fundamental
Cannot fix without breaking compatibility
QUIC on UDP:
- Implemented in userspace (fast iteration)
- UDP passes through middleboxes unchanged
- Full control over protocol behavior
- Can add features without kernel changes
What You’ll Learn
This chapter covers:
- Why QUIC Exists: The problems it solves
- Connection Establishment: 0-RTT and 1-RTT handshakes
- Multiplexing: How streams eliminate HOL blocking
- Connection Migration: Surviving network changes
QUIC is the future of transport protocols. Understanding it prepares you for where networking is heading.
Why QUIC Exists
QUIC wasn’t created to replace TCP for its own sake. It addresses specific, persistent problems that couldn’t be solved within TCP’s constraints.
Problem 1: TCP Head-of-Line Blocking
TCP guarantees ordered delivery of a byte stream:
Application sends:
write(1000 bytes)
write(500 bytes)
write(700 bytes)
TCP segments sent:
Segment 1: bytes 0-999
Segment 2: bytes 1000-1499
Segment 3: bytes 1500-2199
If Segment 2 is lost:
Segment 3 arrives, but TCP buffers it
Application sees nothing until Segment 2 retransmitted
With HTTP/2:
Stream A data in Segment 1
Stream B data in Segment 2 (lost)
Stream C data in Segment 3
Stream C waits for Stream B retransmit!
Even though they're independent streams.
QUIC Solution
QUIC streams are independent:
Stream A ─────────────────────────> Delivered immediately
Stream B ─────X────[retransmit]──> Delivered when ready
Stream C ─────────────────────────> Delivered immediately
Each stream has its own sequence space.
Loss on one stream doesn't block others.
Problem 2: Connection Establishment Latency
TCP + TLS requires multiple round trips:
TCP + TLS 1.2: 3 RTT before first byte
┌──────────────────────────────────────────────────────────────────┐
│ TCP SYN ──────────────────────────────────> │
│ TCP SYN-ACK <────────────────────────────────── │
│ TCP ACK ──────────────────────────────────> 1 RTT │
│ │
│ TLS Hello ──────────────────────────────────> │
│ TLS Hello <────────────────────────────────── │
│ TLS Finished ──────────────────────────────────> 2 RTT │
│ TLS Finished <────────────────────────────────── │
│ │
│ HTTP Request ──────────────────────────────────> 3 RTT │
│ HTTP Response <────────────────────────────────── │
└──────────────────────────────────────────────────────────────────┘
TCP + TLS 1.3: 2 RTT (TLS 1.3 is 1-RTT)
On 100ms RTT: 200-300ms before data flows
QUIC Solution
QUIC: 1 RTT (or 0 RTT for repeat visits)
┌──────────────────────────────────────────────────────────────────┐
│ QUIC Initial + TLS Hello ─────────────────────────> │
│ QUIC Initial + TLS <─────────────────────────── │
│ QUIC + Request ─────────────────────────> 1 RTT │
│ QUIC + Response <─────────────────────────── │
└──────────────────────────────────────────────────────────────────┘
0-RTT resumption:
┌──────────────────────────────────────────────────────────────────┐
│ QUIC + TLS ticket + Request ───────────────────────> 0 RTT! │
│ QUIC + Response <───────────────────────── │
└──────────────────────────────────────────────────────────────────┘
QUIC combines transport and crypto handshake.
Problem 3: Network Ossification
Middleboxes (firewalls, NATs, load balancers) inspect and sometimes modify traffic:
TCP extension deployment problem:
New TCP option added:
1. RFC published
2. OS kernels implement it
3. Middlebox sees "unknown" option
4. Middlebox strips it or drops packet!
5. Feature doesn't work
Real examples:
- TCP Fast Open: ~50% of paths don't work
- ECN: Historically broken by many middleboxes
- Multipath TCP: Often stripped
Result: TCP is effectively frozen.
Can't add new features reliably.
QUIC Solution
QUIC encrypts everything:
UDP Header: [Source Port] [Dest Port] [Length] [Checksum]
↑ Visible to middleboxes
QUIC Header: [Connection ID] [Packet Number] ...
↑ Encrypted (except initial packets)
QUIC Payload: [Encrypted frames]
↑ Encrypted
Middleboxes can see:
- UDP ports
- That it's QUIC (maybe)
Middleboxes cannot:
- Inspect QUIC headers
- Modify QUIC content
- Apply TCP-specific rules
Result: QUIC can evolve without middlebox interference.
Problem 4: Connection Bound to IP Address
TCP connections are identified by:
(Source IP, Source Port, Destination IP, Destination Port)
Your phone on WiFi:
192.168.1.100:52000 → 93.184.216.34:443
Phone moves to cellular:
10.0.0.50:??? → 93.184.216.34:443
TCP: "That's a different connection!"
Connection reset. Start over.
Mobile users experience this constantly:
- WiFi to cellular handoff
- Moving between cell towers
- VPN connects/disconnects
QUIC Solution
QUIC connections identified by Connection ID:
Connection ID: 0x1A2B3C4D (random, opaque)
WiFi: 192.168.1.100 + CID 0x1A2B3C4D
Cellular: 10.0.0.50 + CID 0x1A2B3C4D
Server sees same Connection ID → Same connection!
Connection survives:
- Network changes
- IP address changes
- NAT rebinding
Seamless for user. No reconnection needed.
Summary: QUIC’s Value Proposition
┌──────────────────────────────────────────────────────────────────┐
│ Problem │ QUIC Solution │
├───────────────────────────────┼──────────────────────────────────┤
│ TCP head-of-line blocking │ Independent streams │
│ High connection latency │ 1-RTT, 0-RTT resumption │
│ Protocol ossification │ Encrypted, userspace impl. │
│ Connections break on move │ Connection ID migration │
│ Unencrypted metadata │ All headers encrypted │
└───────────────────────────────┴──────────────────────────────────┘
These aren’t theoretical problems—they affect billions of users daily. QUIC provides solutions that TCP cannot, which is why it’s becoming the foundation for modern protocols.
Connection Establishment and 0-RTT
QUIC’s handshake integrates transport and cryptographic establishment, dramatically reducing connection latency.
1-RTT Handshake
A new QUIC connection to a server:
Client Server
│ │
│─── Initial[TLS ClientHello, CRYPTO] ───────────>│
│ │
│ (1 RTT passes) │
│ │
│<── Initial[TLS ServerHello, CRYPTO] ────────────│
│<── Handshake[TLS EncryptedExtensions] ──────────│
│<── Handshake[TLS Certificate] ──────────────────│
│<── Handshake[TLS CertVerify, Finished] ─────────│
│ │
│─── Handshake[TLS Finished] ────────────────────>│
│ │
│ === Connection Established === │
│ │
│─── Application Data ───────────────────────────>│
│<── Application Data ────────────────────────────│
Packet Types During Handshake
Initial Packets:
- First packets sent
- Protected with Initial keys (derived from the client's initial Destination Connection ID)
- Contains CRYPTO frames with TLS messages
- Minimum 1200 bytes (amplification protection)
Handshake Packets:
- Sent after Initial exchange
- Protected with Handshake Keys
- Complete the TLS 1.3 handshake
1-RTT Packets:
- After handshake completes
- Protected with Application Keys
- Used for all application data
0-RTT Resumption
If client has previously connected, it can send data immediately:
First connection:
- Client receives "session ticket" from server
- Contains resumption secret and server config
- Cached for future use
Subsequent connection:
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Client Server │
│ │ │ │
│ │─── Initial[TLS ClientHello] ───────────────────>│ │
│ │─── 0-RTT[Application Data] ────────────────────>│ ← Data │
│ │ │ sent │
│ │<── Initial[TLS ServerHello] ────────────────────│ before │
│ │<── Handshake[TLS Finished] ─────────────────────│ handshake │
│ │<── 1-RTT[Application Data Response] ────────────│ completes!│
│ │ │ │
└─────────────────────────────────────────────────────────────────────┘
Client sends request BEFORE receiving server's response!
0-RTT Security Considerations
0-RTT data can be replayed:
Attacker captures:
[Initial + 0-RTT packets]
Attacker replays:
[Initial + 0-RTT packets] → Server processes request again!
Safe for 0-RTT:
✓ GET requests (idempotent)
✓ Read-only operations
✓ Operations with other replay protection
NOT safe for 0-RTT:
✗ POST/PUT (non-idempotent)
✗ Financial transactions
✗ Anything with side effects
Server controls:
- Can reject 0-RTT entirely
- Can accept but limit to safe operations
- Can implement replay detection (within limits)
Connection IDs
QUIC connections are identified by Connection IDs, not IP/port tuples:
Connection ID structure:
- Variable length (0-20 bytes)
- Chosen by each endpoint
- Destination CID: What I put in packets TO you
- Source CID: What you put in packets TO ME
Initial exchange:
Client → Server: Dest CID = random, Source CID = client's CID
Server → Client: Dest CID = client's CID, Source CID = server's CID
After handshake:
Both sides agree on CIDs to use
Server typically provides multiple CIDs for migration
Connection ID Benefits
1. NAT Rebinding Tolerance
NAT timeout changes source port
CID unchanged → Connection continues
2. Load Balancer Routing
CID can encode server selection
Any frontend can route to correct backend
3. Privacy (with rotation)
CID can be changed periodically
Harder to track connections across time
Amplification Attack Protection
QUIC prevents DDoS amplification:
Attack scenario without protection:
Attacker: Sends 50-byte Initial with spoofed source IP
Server: Responds with 10,000 bytes to victim
Amplification factor: 200x
QUIC protection:
Before address validation:
Server can send ≤ 3× what client sent
Client Initial minimum: 1200 bytes
Server can send: ≤ 3600 bytes
For more, server requires address validation:
- Send Retry packet (stateless)
- Or use address validation token
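The pre-validation budget is simple accounting, sketched below (AmplificationGuard is a name invented for this sketch):

```python
class AmplificationGuard:
    """Sketch of QUIC's 3x anti-amplification limit before address validation."""
    def __init__(self):
        self.received = 0
        self.sent = 0
        self.validated = False

    def on_receive(self, n: int):
        self.received += n

    def can_send(self, n: int) -> bool:
        if self.validated:           # after validation, no amplification cap
            return True
        return self.sent + n <= 3 * self.received

    def on_send(self, n: int):
        self.sent += n

g = AmplificationGuard()
g.on_receive(1200)            # client Initial (minimum size)
print(g.can_send(3600))       # True  (exactly 3x)
print(g.can_send(3601))       # False (would exceed 3x)
```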
Retry Flow
Client Server
│ │
│─── Initial (1200 bytes) ──────────────────────>│
│ │
│<── Retry[token, new SCID] ─────────────────────│
│ │
│─── Initial[token] ────────────────────────────>│
│ │
│ (server validates token, proceeds normally) │
Connection Termination
Graceful Close
Endpoint sends CONNECTION_CLOSE frame:
- Error code: NO_ERROR (0x0) for clean close
- Reason phrase (optional)
Both sides:
- Stop sending new data
- Send acknowledgments for received data
- Enter closing period (3× PTO)
- Then fully close
Stateless Reset
For when connection state is lost:
Server crashes and restarts:
- Lost all connection state
- Client sends packets server doesn't recognize
Server sends Stateless Reset:
- Looks like a regular packet
- Contains reset token (derived from CID)
- Client recognizes token, closes connection
Prevents hanging connections after server restart.
Summary
QUIC connection establishment:
| Scenario | Round Trips | Data Delay |
|---|---|---|
| TCP + TLS 1.2 | 3 RTT | 3 RTT |
| TCP + TLS 1.3 | 2 RTT | 2 RTT |
| QUIC new | 1 RTT | 1 RTT |
| QUIC 0-RTT | 0 RTT | 0 RTT |
Key mechanisms:
- Integrated handshake: Transport + crypto combined
- Connection IDs: Enable migration, load balancing
- 0-RTT: Instant resumption (with replay caveats)
- Amplification protection: Prevents DDoS abuse
Multiplexing Without Head-of-Line Blocking
QUIC’s stream multiplexing is its most impactful feature for application performance. Unlike TCP, QUIC streams are truly independent.
Streams in QUIC
A QUIC connection supports multiple concurrent streams:
┌─────────────────────────────────────────────────────────────────────┐
│ QUIC Connection │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Stream 0 (bidirectional): [HTTP request/response] │ │
│ │ Stream 4 (bidirectional): [Another request/response] │ │
│ │ Stream 8 (bidirectional): [Third request/response] │ │
│ │ Stream 2 (unidirectional): [Control messages →] │ │
│ │ Stream 3 (unidirectional): [Server push →] │ │
│ └────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Stream types:
Bidirectional: Data flows both ways
Unidirectional: Data flows one way only
Stream IDs:
0, 4, 8, 12... Client-initiated bidirectional
1, 5, 9, 13... Server-initiated bidirectional
2, 6, 10, 14... Client-initiated unidirectional
3, 7, 11, 15... Server-initiated unidirectional
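The two low-order bits of the stream ID encode initiator and direction (RFC 9000, Section 2.1), which is easy to check in code:

```python
def classify_stream(stream_id: int) -> str:
    """Decode a QUIC stream ID's two low bits (RFC 9000, Section 2.1)."""
    initiator = "client" if stream_id & 0x1 == 0 else "server"
    direction = "bidirectional" if stream_id & 0x2 == 0 else "unidirectional"
    return f"{initiator}-initiated {direction}"

print(classify_stream(0))   # client-initiated bidirectional
print(classify_stream(3))   # server-initiated unidirectional
print(classify_stream(6))   # client-initiated unidirectional
```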
Independence Guarantee
Each stream maintains its own state:
Stream states:
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Stream 0: offset 0-1000 received, expecting 1001 │
│ Stream 4: offset 0-500 received, expecting 501 │
│ Stream 8: offset 0-2000 received, complete │
│ │
└─────────────────────────────────────────────────────────────────────┘
If Stream 4 data at offset 501-1000 is lost:
- Stream 0: Continues receiving, delivering data
- Stream 4: Waits for retransmit of 501-1000
- Stream 8: Already complete, unaffected
NO cross-stream blocking!
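A toy per-stream reassembler shows why a gap on one stream leaves the others untouched (a sketch of the delivery rule only, not real QUIC loss recovery):

```python
class StreamReassembler:
    """Per-stream in-order delivery; loss on one stream never blocks another."""
    def __init__(self):
        self.expected = {}   # stream_id -> next offset to deliver
        self.buffered = {}   # stream_id -> {offset: data}
        self.delivered = {}  # stream_id -> bytes delivered so far

    def receive(self, stream_id: int, offset: int, data: bytes):
        self.buffered.setdefault(stream_id, {})[offset] = data
        out = self.delivered.setdefault(stream_id, b"")
        # Deliver any now-contiguous data for THIS stream only.
        nxt = self.expected.get(stream_id, 0)
        buf = self.buffered[stream_id]
        while nxt in buf:
            chunk = buf.pop(nxt)
            out += chunk
            nxt += len(chunk)
        self.expected[stream_id] = nxt
        self.delivered[stream_id] = out

r = StreamReassembler()
r.receive(0, 0, b"AB")          # Stream 0: delivered immediately
r.receive(4, 2, b"YZ")          # Stream 4: gap at offsets 0-1, buffered
r.receive(8, 0, b"QR")          # Stream 8: delivered despite Stream 4's gap
print(r.delivered[0], r.delivered[8], r.delivered[4])  # b'AB' b'QR' b''
r.receive(4, 0, b"WX")          # gap filled; Stream 4 catches up
print(r.delivered[4])           # b'WXYZ'
```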
Comparison with HTTP/2 over TCP
HTTP/2 over TCP:
Request 1: [----data----]
Request 2: [--data--]
Request 3: [------data------]
TCP sees: [1][2][3][1][2][3][1][3][1][3]
TCP packet 3 lost (contains Request 2 data):
All subsequent packets buffered by TCP
Requests 1 and 3 blocked waiting for Request 2!
─────────────────────────────────────────────────────────────────────
HTTP/3 over QUIC:
Request 1: Stream 0
Request 2: Stream 4
Request 3: Stream 8
QUIC packet with Stream 4 data lost:
Stream 4 data retransmitted
Streams 0 and 8 continue independently!
Each HTTP request truly independent.
Stream Flow Control
QUIC has two levels of flow control:
1. Stream-level flow control:
Each stream has its own receive window
Prevents one stream from consuming all buffer
2. Connection-level flow control:
Total bytes across all streams
Prevents connection from overwhelming receiver
┌─────────────────────────────────────────────────────────────────────┐
│ Connection MAX_DATA: 1,000,000 bytes │
│ ├── Stream 0 MAX_STREAM_DATA: 100,000 bytes │
│ ├── Stream 4 MAX_STREAM_DATA: 100,000 bytes │
│ └── Stream 8 MAX_STREAM_DATA: 100,000 bytes │
│ │
│ Stream 0 can use up to 100KB │
│ All streams combined can use up to 1MB │
└─────────────────────────────────────────────────────────────────────┘
Flow control frames:
MAX_DATA: Update connection limit
MAX_STREAM_DATA: Update stream limit
DATA_BLOCKED: Signal sender is blocked
STREAM_DATA_BLOCKED: Signal stream is blocked
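The two-level check can be modeled with simple counters — a minimal sketch, assuming the limits from the diagram above; real QUIC stacks also track limit updates from MAX_DATA and MAX_STREAM_DATA frames.

```python
# Sketch: QUIC's two levels of flow control as plain counters.
# A sender may write to a stream only if BOTH the stream's limit
# (MAX_STREAM_DATA) and the connection's limit (MAX_DATA) allow it.

MAX_DATA = 1_000_000          # connection-wide limit (bytes)
MAX_STREAM_DATA = 100_000     # per-stream limit (bytes)

conn_sent = 0
stream_sent = {0: 0, 4: 0, 8: 0}

def can_send(stream_id, nbytes):
    """True if nbytes fit under both the stream and connection limits."""
    return (stream_sent[stream_id] + nbytes <= MAX_STREAM_DATA
            and conn_sent + nbytes <= MAX_DATA)

def send(stream_id, nbytes):
    global conn_sent
    if not can_send(stream_id, nbytes):
        # A real sender would emit DATA_BLOCKED / STREAM_DATA_BLOCKED here.
        raise BlockingIOError("flow control limit reached")
    stream_sent[stream_id] += nbytes
    conn_sent += nbytes

send(0, 100_000)              # stream 0 uses its full stream window
assert not can_send(0, 1)     # stream 0 blocked (STREAM_DATA_BLOCKED)
assert can_send(4, 50_000)    # other streams still have room
```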
Stream Prioritization
Applications can indicate stream importance:
Priority information per stream:
- Urgency: 0-7 (0 highest)
- Incremental: true/false (can process partially)
Example (HTTP/3):
HTML: Stream 0, urgency=0, incremental=false
CSS: Stream 4, urgency=1, incremental=false
JS: Stream 8, urgency=1, incremental=false
Images: Stream 12+, urgency=5, incremental=true
Sender should:
1. Send all urgency=0 data first
2. Round-robin among same urgency
3. Incremental streams can be interleaved
Note: Priority is a hint, not enforced by QUIC itself.
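The sender policy above — highest urgency first, round-robin within a level — can be sketched as a small scheduler. The stream table mirrors the HTTP/3 example; the `rounds` parameter is a stand-in for "how many sending opportunities each stream gets" and is purely illustrative.

```python
# Sketch: a sender-side scheduler following urgency hints.
# Urgency 0 is highest; streams at the same urgency are served
# round-robin. Not QUIC itself -- priority is an application hint.

from collections import deque
from itertools import groupby

streams = [
    {"id": 0,  "urgency": 0, "name": "HTML"},
    {"id": 4,  "urgency": 1, "name": "CSS"},
    {"id": 8,  "urgency": 1, "name": "JS"},
    {"id": 12, "urgency": 5, "name": "Image"},
]

def schedule(streams, rounds=2):
    """Serve lower urgency values first; round-robin within each level."""
    order = []
    by_urgency = sorted(streams, key=lambda s: s["urgency"])
    for _, group in groupby(by_urgency, key=lambda s: s["urgency"]):
        ring = deque(group)
        for _ in range(rounds):
            for s in ring:
                order.append(s["name"])
    return order

print(schedule(streams))
# HTML served first, then CSS/JS alternating, images last
```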
Stream Lifecycle
┌──────────────────┐
│ Idle │
└────────┬─────────┘
│ Open (send/receive)
┌────────▼─────────┐
│ Open │
│ │
│ Send/Receive data│
└──┬──────────┬────┘
│ │
Send FIN ─────┘ └───── Receive FIN
│ │
┌────────▼──┐ ┌──▼─────────┐
│Half-Closed│ │Half-Closed │
│ (local) │ │ (remote) │
└─────┬─────┘ └──────┬─────┘
│ │
Receive FIN ────┘ └──── Send FIN
│ │
└────────┬────────┘
┌────────▼─────────┐
│ Closed │
└──────────────────┘
Stream can also be reset (RST_STREAM) at any point.
Practical Impact
Scenario: Page with 50 resources over lossy network (2% loss)
HTTP/2 over TCP:
Any lost packet blocks ALL pending responses
On 2% loss: Significant stalls and delays
Measured: 3-4x slower on lossy mobile
HTTP/3 over QUIC:
Lost packet only affects its stream
Other 49 resources continue loading
Measured: Near-optimal even with loss
Real-world impact is most visible on:
- Mobile networks (variable quality)
- Satellite connections (high latency + loss)
- Congested WiFi
Stream Limits
Connections limit maximum streams:
MAX_STREAMS frames:
- MAX_STREAMS (bidi): Max bidirectional streams
- MAX_STREAMS (uni): Max unidirectional streams
Typical defaults:
100 bidirectional streams
100 unidirectional streams
If limit reached:
Sender must wait for streams to close
Or wait for MAX_STREAMS increase
STREAMS_BLOCKED frame signals waiting
Summary
QUIC stream multiplexing provides:
| Feature | Benefit |
|---|---|
| Independent streams | No head-of-line blocking |
| Per-stream flow control | Fair resource allocation |
| Stream priorities | Important content first |
| Unidirectional streams | Efficient one-way data |
| Low overhead | Stream creation is cheap |
This is QUIC’s key advantage over TCP for multiplexed protocols. Loss on one stream doesn’t impact others, making it ideal for modern web applications with many parallel requests.
Connection Migration
QUIC connections can survive network changes—a game-changer for mobile users who constantly switch between WiFi and cellular networks.
The Problem with TCP
TCP connections are bound to IP addresses:
TCP connection tuple:
(192.168.1.100, 52000, 93.184.216.34, 443)
└── Client IP ──┘
When IP changes (WiFi → cellular):
New tuple: (10.0.0.50, 48000, 93.184.216.34, 443)
Server: "Who are you? I don't have a connection from 10.0.0.50"
Connection dies. Application must reconnect.
Impact:
- HTTP request fails
- Download interrupted
- Streaming buffers
- User experience degraded
QUIC’s Solution: Connection IDs
QUIC identifies connections by Connection ID, not IP:
QUIC connection:
Connection ID: 0x1A2B3C4D5E6F
Packets from 192.168.1.100 with CID 0x1A2B3C4D5E6F
Packets from 10.0.0.50 with CID 0x1A2B3C4D5E6F
Server: "Same CID? Same connection! Continue."
Migration Flow
┌─────────────────────────────────────────────────────────────────────┐
│ Connection Migration │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Time 0: Client on WiFi (192.168.1.100) │
│ Client ──[CID: ABC]──> Server │
│ Server <──[CID: XYZ]── Client │
│ │
│ Time 1: Client switches to cellular (10.0.0.50) │
│ Client ──[CID: ABC]──> Server (from new IP!) │
│ │
│ Time 2: Server validates new path │
│ Server ──[PATH_CHALLENGE]──> Client │
│ Client ──[PATH_RESPONSE]──> Server │
│ │
│ Time 3: Migration complete │
│ Connection continues seamlessly │
│ In-flight data retransmitted if needed │
│ │
└─────────────────────────────────────────────────────────────────────┘
Path Validation
Migration requires validating the new path:
Why validate?
- Prove client owns new address (anti-spoofing)
- Verify path works for bidirectional traffic
- Update RTT estimates for new path
PATH_CHALLENGE:
Server sends random 8-byte challenge to new address
PATH_RESPONSE:
Client echoes the 8 bytes back
If response matches: Path validated, migration complete
If no response: Revert to previous path
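Reduced to its essentials, path validation is a random challenge and an exact echo — a toy sketch, omitting the encrypted framing that carries the real PATH_CHALLENGE and PATH_RESPONSE frames.

```python
# Sketch: PATH_CHALLENGE / PATH_RESPONSE at its core.
# The server sends 8 random bytes to the new address; the peer must
# echo them back exactly for the new path to be considered valid.

import os

def make_challenge() -> bytes:
    return os.urandom(8)        # PATH_CHALLENGE payload

def respond(challenge: bytes) -> bytes:
    return challenge            # PATH_RESPONSE echoes the bytes

def validate(challenge: bytes, response: bytes) -> bool:
    return response == challenge

challenge = make_challenge()
assert validate(challenge, respond(challenge))   # path validated

# A wrong echo (every byte flipped) fails: revert to previous path.
wrong = bytes(b ^ 0xFF for b in challenge)
assert not validate(challenge, wrong)
```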
Connection ID Management
Multiple Connection IDs enable smooth migration:
Server provides multiple CIDs:
NEW_CONNECTION_ID frames:
CID 1: 0xAAAAAA, Sequence 0
CID 2: 0xBBBBBB, Sequence 1
CID 3: 0xCCCCCC, Sequence 2
Client can use any of these CIDs.
On migration:
Client switches to unused CID
Server correlates new CID to connection
Old CID retired for privacy
RETIRE_CONNECTION_ID:
"I'm done using CID sequence 0"
Privacy Benefits
Without CID rotation:
Observer: "CID 0xABC on WiFi... same CID on cellular"
Observer: "This is the same user, tracked!"
With CID rotation:
WiFi: Uses CID 0xAAA
Cellular: Uses CID 0xBBB (unused before)
Observer: "Different CIDs, can't correlate"
(Connection continues, but linkability reduced)
Probing and Preferred Paths
QUIC can probe multiple paths:
Client has:
- WiFi connection (reliable, maybe slow)
- Cellular connection (less reliable, maybe faster)
Client can:
1. Probe both paths with PATH_CHALLENGE
2. Measure RTT and loss on each
3. Choose preferred path
4. Keep other path as backup
This enables:
- Seamless handoff
- Make-before-break migration
- Multipath in future extensions
NAT Rebinding
Even without physical network change, NAT can disrupt:
NAT timeout scenario:
Connection idle for 30 minutes
NAT forgets the mapping
NAT assigns new external port
TCP: Connection times out or RST
QUIC:
CID unchanged
Server validates new path
Connection continues
Server-Side Considerations
Load Balancer Routing
With TCP:
Load balancer routes by 4-tuple
IP change → Different backend → Connection state lost
With QUIC:
Load balancer can route by CID
Server encodes routing info in CID
CID format (example):
[Server ID: 4 bytes][Random: 8 bytes]
Any frontend extracts Server ID from CID
Routes to correct backend regardless of client IP
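Using the example CID layout above ([Server ID: 4 bytes][Random: 8 bytes]), any frontend can recover the backend without looking at the client's IP. The encoding below is a sketch of that layout, not a standardized format.

```python
# Sketch: load-balancer routing by Connection ID, assuming the
# example layout [Server ID: 4 bytes][Random: 8 bytes].

import os

def make_cid(server_id: int) -> bytes:
    """Server encodes its routing ID into the CIDs it issues."""
    return server_id.to_bytes(4, "big") + os.urandom(8)

def route(cid: bytes) -> int:
    """Any frontend extracts the backend ID, ignoring client IP."""
    return int.from_bytes(cid[:4], "big")

cid = make_cid(7)
assert route(cid) == 7   # same backend before and after client migration
```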
Connection State
Server must maintain:
- Connection state by CID (not by IP)
- Multiple CIDs per connection
- Token→Connection mapping for 0-RTT
Storage indexed by CID, not IP address.
Mobile Experience Impact
Real-world scenarios improved:
1. Elevator/Subway
TCP: Connection dies, app reconnects
QUIC: Brief pause, then continues
2. Walking between access points
TCP: Each AP change = potential reset
QUIC: Seamless, user unaware
3. VPN connect/disconnect
TCP: All connections reset
QUIC: Continues through VPN changes
4. NAT timeout during idle
TCP: Silent failure on next request
QUIC: Automatic path revalidation
Summary
Connection migration enables:
| Feature | User Benefit |
|---|---|
| IP address change survival | Seamless WiFi/cellular handoff |
| CID-based identification | Load balancer flexibility |
| Path validation | Security against spoofing |
| CID rotation | Privacy from observers |
| NAT rebinding tolerance | Fewer “connection reset” errors |
Migration is one of QUIC’s most user-visible improvements, particularly for mobile users who previously experienced constant interruptions during network transitions.
WebSockets
WebSockets provide full-duplex communication over a single TCP connection. Unlike HTTP’s request-response model, WebSockets allow both client and server to send messages at any time.
Why WebSockets?
HTTP is request-response: client asks, server answers. But many applications need real-time, bidirectional communication:
HTTP Limitations:
- Client must initiate every exchange
- Server can't push data spontaneously
- New request needed for each interaction
- Header overhead for every message
Workarounds before WebSockets:
Polling: Client asks "any updates?" every N seconds
Long-polling: Server holds request until data available
Server-Sent: Server streams events (one-way only)
All have overhead, latency, or direction limitations.
WebSocket Advantages
┌─────────────────────────────────────────────────────────────────────┐
│ WebSocket Benefits │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Full-Duplex: Both sides send simultaneously │
│ Low Latency: No per-message handshake │
│ Low Overhead: 2-14 bytes per frame (vs ~100+ for HTTP) │
│ Persistent: Single connection for entire session │
│ Push: Server sends without client request │
│ │
└─────────────────────────────────────────────────────────────────────┘
Perfect for:
- Chat applications
- Live notifications
- Real-time collaboration
- Live sports/stock updates
- Online gaming
- IoT device communication
The Protocol
WebSocket starts as HTTP, then “upgrades” to a different protocol:
┌─────────────────────────────────────────────────────────────────────┐
│ WebSocket Connection │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. HTTP Request with Upgrade header │
│ 2. Server responds 101 Switching Protocols │
│ 3. Connection becomes WebSocket │
│ 4. Bidirectional frames flow │
│ 5. Either side can close │
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Client │ │ Server │ │
│ └────┬─────┘ └────┬─────┘ │
│ │ │ │
│ │─── HTTP Upgrade Request ───────────────>│ │
│ │<── HTTP 101 Switching ──────────────────│ │
│ │ │ │
│ │═══ WebSocket Connection ════════════════│ │
│ │ │ │
│ │─── Message ────────────────────────────>│ │
│ │<── Message ─────────────────────────────│ │
│ │<── Message ─────────────────────────────│ │
│ │─── Message ────────────────────────────>│ │
│ │ │ │
│ │─── Close Frame ────────────────────────>│ │
│ │<── Close Frame ─────────────────────────│ │
│ │ │ │
│ ╳ ╳ │
│ │
└─────────────────────────────────────────────────────────────────────┘
What You’ll Learn
- The Upgrade Handshake: How HTTP becomes WebSocket
- Full-Duplex Communication: Frame format and messaging
- WebSocket Use Cases: When and why to use WebSockets
The Upgrade Handshake
WebSocket connections begin as HTTP requests, then “upgrade” to the WebSocket protocol. This allows WebSockets to work through HTTP infrastructure (proxies, load balancers) while establishing a different communication pattern.
Client Request
GET /chat HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
Origin: http://example.com
Sec-WebSocket-Protocol: chat, superchat
Required Headers
| Header | Purpose |
|---|---|
| Upgrade: websocket | Request protocol upgrade |
| Connection: Upgrade | Indicates upgrade requested |
| Sec-WebSocket-Key | Random base64 value for handshake validation |
| Sec-WebSocket-Version: 13 | WebSocket protocol version |
Optional Headers
| Header | Purpose |
|---|---|
| Origin | Where request originates (CORS) |
| Sec-WebSocket-Protocol | Subprotocol preferences (application-defined) |
| Sec-WebSocket-Extensions | Extension negotiation (e.g., compression) |
Server Response
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Sec-WebSocket-Protocol: chat
Sec-WebSocket-Accept Calculation
Server proves it received the client’s key:
1. Take client's Sec-WebSocket-Key:
"dGhlIHNhbXBsZSBub25jZQ=="
2. Append magic GUID:
"dGhlIHNhbXBsZSBub25jZQ==" + "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"
3. SHA-1 hash the result
4. Base64 encode the hash:
"s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
Client verifies this matches expected value.
Prevents accidental connections or caching issues.
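The calculation is easy to verify in code; the key and accept value below are the worked example from RFC 6455.

```python
# Compute Sec-WebSocket-Accept: SHA-1 of key + magic GUID, base64-encoded.

import base64
import hashlib

GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(key: str) -> str:
    digest = hashlib.sha1((key + GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))
# s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```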
After the Handshake
HTTP/1.1 101 Switching Protocols
...
─── HTTP ENDS HERE ───
┌─────────────────────────────────────────┐
│ WebSocket Frames (Binary) │
│ │
│ [Frame Header][Payload] │
│ [Frame Header][Payload] │
│ ... │
└─────────────────────────────────────────┘
Same TCP connection, different protocol.
No more HTTP until connection closes.
Handshake Implementation
JavaScript (Browser)
const ws = new WebSocket('wss://server.example.com/chat');
ws.onopen = () => {
console.log('Connected!');
ws.send('Hello Server!');
};
ws.onmessage = (event) => {
console.log('Received:', event.data);
};
ws.onclose = (event) => {
console.log('Closed:', event.code, event.reason);
};
ws.onerror = (error) => {
console.error('Error:', error);
};
Python Server (websockets library)
import asyncio
import websockets
async def handler(websocket):  # websockets >= 10.1; older versions also passed a path argument
async for message in websocket:
print(f"Received: {message}")
await websocket.send(f"Echo: {message}")
async def main():
async with websockets.serve(handler, "localhost", 8765):
await asyncio.Future() # Run forever
asyncio.run(main())
Node.js Server (ws library)
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (ws) => {
console.log('Client connected');
ws.on('message', (message) => {
console.log('Received:', message.toString());
ws.send(`Echo: ${message}`);
});
ws.on('close', () => {
console.log('Client disconnected');
});
});
Subprotocols
Subprotocols define application-level meaning:
Client: Sec-WebSocket-Protocol: graphql, json, protobuf
Server: Sec-WebSocket-Protocol: graphql
Agreement: Use GraphQL over WebSocket.
Common subprotocols:
- graphql-ws (GraphQL subscriptions)
- mqtt (IoT messaging)
- wamp (Web Application Messaging)
- soap (legacy)
Extensions
Extensions modify the protocol (typically for compression):
Client: Sec-WebSocket-Extensions: permessage-deflate
Server: Sec-WebSocket-Extensions: permessage-deflate
permessage-deflate:
- Compresses message payloads
- Significant bandwidth savings for text
- Supported by most implementations
Error Cases
Server doesn't support WebSocket:
HTTP/1.1 400 Bad Request
(or 404, or no Upgrade response)
Wrong Sec-WebSocket-Accept:
Client: Abort connection
Prevents man-in-the-middle returning wrong accept
Origin not allowed (CORS-like):
HTTP/1.1 403 Forbidden
Server rejects based on Origin header
Secure WebSockets (WSS)
ws:// - Unencrypted WebSocket (like HTTP)
wss:// - Encrypted WebSocket over TLS (like HTTPS)
Process for wss://:
1. TCP connection
2. TLS handshake (certificate validation)
3. HTTP Upgrade request (encrypted)
4. WebSocket frames (encrypted)
Always use wss:// in production.
Summary
The WebSocket handshake:
- Client sends HTTP GET with Upgrade: websocket
- Server validates and responds 101 Switching Protocols
- Server sends Sec-WebSocket-Accept derived from client’s key
- Connection becomes bidirectional WebSocket
After upgrade, it’s no longer HTTP—just WebSocket frames on TCP.
Full-Duplex Communication
After the handshake, WebSocket communication happens through frames—small packets that can carry text, binary data, or control messages.
Frame Format
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
│F│R│R│R│ Opcode │M│ Payload Length │
│I│S│S│S│ (4) │A│ (7) │
│N│V│V│V│ │S│ │
│ │1│2│3│ │K│ │
├─┴─┴─┴─┴─────────┴─┴───────────────────────────────────────────┤
│ Extended payload length (16/64 bits, if needed) │
├───────────────────────────────────────────────────────────────┤
│ Masking key (32 bits, if MASK=1) │
├───────────────────────────────────────────────────────────────┤
│ Payload Data │
└───────────────────────────────────────────────────────────────┘
Minimum frame: 2 bytes (header only)
Typical small message: 6-8 bytes overhead
Frame Fields
| Field | Bits | Description |
|---|---|---|
| FIN | 1 | Final fragment of message |
| RSV1-3 | 3 | Reserved for extensions |
| Opcode | 4 | Frame type |
| MASK | 1 | Payload is masked (required from client) |
| Payload Length | 7+ | Size of payload |
Opcodes
0x0 Continuation Part of fragmented message
0x1 Text UTF-8 text data
0x2 Binary Binary data
0x8 Close Connection close request
0x9 Ping Heartbeat request
0xA Pong Heartbeat response
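The frame layout above can be exercised directly. This sketch encodes a minimal server-to-client text frame (FIN=1, opcode 0x1, no masking) and handles the three payload-length encodings; a real server would also implement masking checks, fragmentation, and control frames.

```python
# Sketch: encoding an unmasked server-to-client text frame
# following the layout above (FIN=1, opcode=0x1).

import struct

def encode_text_frame(text: str) -> bytes:
    payload = text.encode("utf-8")
    header = bytes([0x80 | 0x1])              # FIN=1, opcode=0x1 (text)
    n = len(payload)
    if n < 126:
        header += bytes([n])                  # 7-bit length, MASK=0
    elif n < 2**16:
        header += bytes([126]) + struct.pack("!H", n)   # 16-bit length
    else:
        header += bytes([127]) + struct.pack("!Q", n)   # 64-bit length
    return header + payload

frame = encode_text_frame("Hello")
assert frame == b"\x81\x05Hello"              # 2-byte header + payload
```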
Message Types
Text Messages
// Send
ws.send("Hello, World!");
// Frame created:
// FIN=1, Opcode=0x1 (text), MASK=1, Payload="Hello, World!"
// Receive
ws.onmessage = (event) => {
console.log(event.data); // "Hello, World!" (string)
};
Binary Messages
// Send ArrayBuffer
const buffer = new ArrayBuffer(8);
const view = new DataView(buffer);
view.setFloat64(0, 3.14159);
ws.send(buffer);
// Send Blob
const blob = new Blob(['Binary data'], {type: 'application/octet-stream'});
ws.send(blob);
// Receive
ws.binaryType = 'arraybuffer'; // or 'blob'
ws.onmessage = (event) => {
const data = event.data; // ArrayBuffer or Blob
};
Control Frames
Ping/Pong (Heartbeat)
Ping: "Are you still there?"
Pong: "Yes, I'm here."
Server sends: Ping (opcode 0x9)
Client sends: Pong (opcode 0xA) with same payload
Purpose:
- Detect dead connections
- Keep NAT mappings alive
- Measure latency
Browsers reply to Ping with Pong automatically (the JavaScript API does not expose ping/pong).
Close Frame
Graceful shutdown:
1. Initiator sends Close frame
- Opcode 0x8
- Optional: status code (2 bytes) + reason (text)
2. Recipient sends Close frame back
3. TCP connection closed
Status codes:
1000 Normal closure
1001 Endpoint going away
1002 Protocol error
1003 Unsupported data type
1006 Abnormal closure (no close frame)
1011 Server error
Masking
Client-to-server frames must be masked:
Why masking?
To prevent cache-poisoning attacks on intermediaries.
A malicious script could otherwise craft payloads that look like
HTTP requests, and a broken proxy might cache the fake response.
Masking with a random key makes the bytes on the wire unpredictable.
Masking process:
1. Generate random 32-bit masking key
2. XOR each byte of payload with key (rotating)
masked[i] = payload[i] XOR key[i % 4]
Server-to-client: No masking required.
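The masking step is a rotating XOR, which means applying it twice recovers the original payload:

```python
# Masking: XOR each payload byte with the 4-byte key, rotating.
# XOR is its own inverse, so the same function also unmasks.

import os

def mask(payload: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % 4] for i, b in enumerate(payload))

key = os.urandom(4)                       # random 32-bit masking key
masked = mask(b"Hello, World!", key)
assert mask(masked, key) == b"Hello, World!"
```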
Fragmentation
Large messages can be split into fragments:
Large message (1MB):
Fragment 1: FIN=0, Opcode=0x1 (text), data[0:64KB]
Fragment 2: FIN=0, Opcode=0x0 (continuation), data[64KB:128KB]
Fragment 3: FIN=0, Opcode=0x0 (continuation), data[128KB:192KB]
...
Fragment N: FIN=1, Opcode=0x0 (continuation), data[last portion]
Receiver reassembles before delivering to application.
Allows interleaving with control frames (Ping/Pong).
Full-Duplex in Action
┌─────────────────────────────────────────────────────────────────────┐
│ Simultaneous Communication │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Client Server │
│ │ │ │
│ │──── "Hello" ────────────────────────────────────│ │
│ │────────────────────────────── "World" ──────────│ │
│ │ X │ │
│ │──── "How are you?" ─────────────────────────────│ │
│ │────────────────────── "Message for you" ────────│ │
│ │ │ │
│ │ Messages cross "in flight" │ │
│ │ No waiting for response │ │
│ │ Both sides send whenever ready │ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Implementation Patterns
Message Protocol
// Define message format
const message = {
type: 'chat',
payload: {
user: 'alice',
text: 'Hello everyone!'
},
timestamp: Date.now()
};
ws.send(JSON.stringify(message));
// Receiver
ws.onmessage = (event) => {
const msg = JSON.parse(event.data);
switch (msg.type) {
case 'chat':
displayChat(msg.payload);
break;
case 'notification':
showNotification(msg.payload);
break;
}
};
Reconnection Logic
let reconnectAttempts = 0;

function connect() {
  const ws = new WebSocket('wss://example.com/socket');
  ws.onopen = () => {
    console.log('Connected');
    reconnectAttempts = 0;
  };
  ws.onclose = (event) => {
    if (event.code !== 1000) { // Not a normal close
      // Exponential backoff, capped at 30 seconds
      const delay = Math.min(1000 * 2 ** reconnectAttempts, 30000);
      reconnectAttempts++;
      setTimeout(connect, delay);
    }
  };
  return ws;
}
Summary
WebSocket communication features:
| Aspect | Details |
|---|---|
| Frame overhead | 2-14 bytes (vs 100+ for HTTP) |
| Message types | Text (UTF-8), Binary |
| Control frames | Ping, Pong, Close |
| Direction | Full-duplex (simultaneous both ways) |
| Fragmentation | Large messages split across frames |
| Masking | Required for client→server |
WebSocket Use Cases
WebSockets excel when you need persistent, low-latency, bidirectional communication. Understanding when to use them—and when not to—helps you choose the right tool.
Ideal Use Cases
Real-Time Chat
Chat requirements:
✓ Instant message delivery
✓ Multiple participants
✓ Typing indicators
✓ Presence status
WebSocket flow:
User types ──[typing indicator]──> Server ──broadcast──> All users
User sends ──[message]──────────> Server ──broadcast──> All users
User joins ──[presence]─────────> Server ──broadcast──> All users
Single connection handles all message types.
Sub-100ms latency achievable.
Live Notifications
Without WebSocket (polling):
Client: "Any notifications?" (every 5 seconds)
Server: "No"
Client: "Any notifications?"
Server: "No"
... 50 requests later ...
Server: "Yes, you have a message!"
With WebSocket:
Server: "New message!" (instant when it happens)
Benefits:
- Immediate delivery
- No wasted requests
- Lower server load
Collaborative Editing
Google Docs / Figma style:
User A types ──> Server ──> User B (sees cursor, text)
User B draws ──> Server ──> User A (sees drawing)
Requirements:
- Low latency (feels responsive)
- High frequency (keystroke level)
- Bidirectional (everyone sees everyone)
- Reliable (no lost changes)
WebSocket + Operational Transform/CRDT
Online Gaming
Game server sending:
- Player positions (60 fps)
- Game events
- Chat messages
Players sending:
- Input commands
- Actions
- Chat
WebSocket provides:
- Single connection (efficient)
- Binary messages (compact)
- Low latency (responsive)
Note: Competitive games may prefer UDP/QUIC for
lower latency at cost of reliability.
Financial Data
Stock ticker:
[AAPL: 150.25] ──> [AAPL: 150.30] ──> [AAPL: 150.28]
Dozens of updates per second.
HTTP request-response unsuitable.
Server-Sent Events work but one-way only.
WebSocket ideal for bidirectional (quotes + orders).
IoT Dashboard
Sensors ──> Server ──WebSocket──> Dashboard
Real-time display of:
- Temperature
- Humidity
- Motion
- System status
Dashboard can also send commands back:
Dashboard ──> Server ──> Device (turn on/off)
When NOT to Use WebSockets
Simple CRUD APIs
Creating a user:
POST /users
{"name": "Alice"}
Response:
201 Created
{"id": 123, "name": "Alice"}
One request, one response, done.
WebSocket is overkill—use REST/HTTP.
Infrequent Updates
Weather data (updates hourly):
- Polling once per hour is fine
- SSE (Server-Sent Events) works well
- WebSocket connection overhead not justified
Rule of thumb:
Updates > 1/minute: Consider WebSocket
Updates < 1/minute: HTTP polling or SSE
Public APIs
Public REST API considerations:
- Stateless (easy to scale)
- Cacheable
- Standard tools (curl, Postman)
- Rate limiting straightforward
WebSocket:
- Stateful (harder to scale)
- Not cacheable
- Fewer debugging tools
- Rate limiting complex
Alternatives Comparison
┌────────────────────────────────────────────────────────────────────┐
│ Technique │ Direction │ Latency │ Best For │
├────────────────────┼──────────────┼─────────┼──────────────────────┤
│ HTTP Polling │ Client→Svr │ High │ Simple, infrequent │
│ Long Polling │ Client→Svr │ Medium │ Moderate updates │
│ Server-Sent Events │ Server→Clt │ Low │ One-way streaming │
│ WebSocket │ Bidirectional│ Low │ Real-time, two-way │
│ WebRTC │ P2P │ Lowest │ Audio/video, gaming │
└────────────────────────────────────────────────────────────────────┘
Server-Sent Events (SSE)
Good for:
- Live feeds (news, sports)
- Notifications (server→client only)
- Simpler than WebSocket
Limitations:
- One-way (server to client)
- Text only (no binary)
- Fewer connections per browser
// Server
res.setHeader('Content-Type', 'text/event-stream');
res.write('data: Hello\n\n');
// Client
const source = new EventSource('/stream');
source.onmessage = (e) => console.log(e.data);
Scaling WebSockets
Connection Limits
Challenge:
10,000 concurrent users = 10,000 open connections
Each connection uses memory and file descriptors
Solutions:
- Horizontal scaling (multiple servers)
- Connection limits per server
- Load balancing by connection ID
State Distribution
User A connected to Server 1
User B connected to Server 2
User A sends message to User B
Server 1 must route to Server 2!
Solutions:
- Redis pub/sub
- Dedicated message queue
- Sticky sessions (same user → same server)
Architecture Pattern
┌─────────────────┐
│ Load Balancer │
└────────┬────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌──────▼─────┐ ┌──────▼─────┐ ┌──────▼─────┐
│ WS Server 1│ │ WS Server 2│ │ WS Server 3│
└──────┬─────┘ └──────┬─────┘ └──────┬─────┘
│ │ │
└───────────────────┼───────────────────┘
│
┌────────▼────────┐
│ Redis Pub/Sub │
│ (or similar) │
└─────────────────┘
Messages published to Redis, all servers receive.
Summary
Use WebSockets for:
- Real-time bidirectional communication
- High-frequency updates
- Push from server
- Interactive applications
Consider alternatives for:
- Simple request-response (HTTP)
- One-way server→client (SSE)
- Infrequent updates (polling)
- Peer-to-peer (WebRTC)
WebSocket is powerful but adds complexity. Choose based on actual requirements.
TLS/SSL
TLS (Transport Layer Security) encrypts network communication, protecting data from eavesdropping and tampering. It’s what puts the “S” in HTTPS and secures most internet traffic today.
What TLS Provides
┌─────────────────────────────────────────────────────────────────────┐
│ TLS Security Goals │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Confidentiality │
│ Data encrypted, only endpoints can read it │
│ Eavesdropper sees random bytes │
│ │
│ Integrity │
│ Data tampering detected │
│ HMAC ensures message authenticity │
│ │
│ Authentication │
│ Server proves identity via certificate │
│ Client verifies it's talking to real server │
│ (Optional: Client can prove identity too) │
│ │
└─────────────────────────────────────────────────────────────────────┘
TLS in the Stack
┌─────────────────────────────────────────────────────────────────────┐
│ Protocol Stack │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Application (HTTP, SMTP, etc.) │ │
│ ├───────────────────────────────────────────────────────┤ │
│ │ TLS │ ← Here │
│ ├───────────────────────────────────────────────────────┤ │
│ │ TCP │ │
│ ├───────────────────────────────────────────────────────┤ │
│ │ IP │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ TLS sits between application and transport. │
│ Application sees plain data. │
│ Network sees encrypted data. │
│ │
└─────────────────────────────────────────────────────────────────────┘
Brief History
1995: SSL 2.0 (Netscape) - First public version, insecure
1996: SSL 3.0 - Major improvements, still vulnerabilities
1999: TLS 1.0 - Standardized by IETF, based on SSL 3.0
2006: TLS 1.1 - Minor security improvements
2008: TLS 1.2 - Modern cipher suites, still widely used
2018: TLS 1.3 - Simplified, faster, more secure
Today:
TLS 1.3 preferred
TLS 1.2 acceptable
TLS 1.0/1.1 deprecated (should be disabled)
SSL: Do not use
How TLS Works (Overview)
┌─────────────────────────────────────────────────────────────────────┐
│ TLS Connection │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Handshake │
│ - Client and server negotiate parameters │
│ - Server presents certificate │
│ - Key exchange establishes shared secret │
│ - Derive session keys │
│ │
│ 2. Encrypted Communication │
│ - All data encrypted with session keys │
│ - MAC ensures integrity │
│ - Sequence numbers prevent replay │
│ │
│ 3. Closure │
│ - Close notify alert │
│ - Prevents truncation attacks │
│ │
└─────────────────────────────────────────────────────────────────────┘
What You’ll Learn
- The TLS Handshake: How secure connections are established
- Certificates and PKI: How identity is verified
- Cipher Suites: The cryptographic algorithms used
- TLS 1.3 Improvements: What makes the latest version better
The TLS Handshake
The TLS handshake establishes a secure connection by negotiating cryptographic parameters and authenticating the server. Understanding it helps debug connection issues and appreciate TLS 1.3’s improvements.
TLS 1.2 Handshake
Client Server
│ │
│─────────── ClientHello ───────────────────────────>│
│ - TLS version │
│ - Random bytes │
│ - Cipher suites supported │
│ - Extensions (SNI, etc.) │
│ │
│<────────── ServerHello ────────────────────────────│
│ - Chosen TLS version │
│ - Server random │
│ - Selected cipher suite │
│ │
│<────────── Certificate ────────────────────────────│
│ - Server's certificate chain │
│ │
│<────────── ServerKeyExchange ─────────────────────│
│ - Key exchange parameters (if needed) │
│ │
│<────────── ServerHelloDone ───────────────────────│
│ │
│─────────── ClientKeyExchange ─────────────────────>│
│ - Pre-master secret (encrypted) │
│ │
│─────────── ChangeCipherSpec ──────────────────────>│
│ - "Switching to encrypted" │
│ │
│─────────── Finished ──────────────────────────────>│
│ - Encrypted verification │
│ │
│<────────── ChangeCipherSpec ───────────────────────│
│<────────── Finished ───────────────────────────────│
│ │
│══════════ Encrypted Application Data ══════════════│
TLS 1.2: 2 round trips before application data
TLS 1.3 Handshake (Simplified)
Client Server
│ │
│─────────── ClientHello ───────────────────────────>│
│ - TLS 1.3 │
│ - Supported groups & key shares │
│ - Signature algorithms │
│ │
│<────────── ServerHello ────────────────────────────│
│ - Selected key share │
│ │
│<────────── EncryptedExtensions ────────────────────│
│<────────── Certificate ────────────────────────────│
│<────────── CertificateVerify ─────────────────────│
│<────────── Finished ───────────────────────────────│
│ │
│─────────── Finished ──────────────────────────────>│
│ │
│══════════ Encrypted Application Data ══════════════│
TLS 1.3: 1 round trip before application data
Key Exchange
How do client and server agree on encryption keys?
Diffie-Hellman Key Exchange
The magic: Agree on a shared secret over an insecure channel.
1. Public parameters: Prime p, Generator g
2. Client picks random a, computes A = g^a mod p
Server picks random b, computes B = g^b mod p
3. Exchange A and B (visible to eavesdroppers)
4. Client computes: secret = B^a mod p = g^(ab) mod p
Server computes: secret = A^b mod p = g^(ab) mod p
Both have same secret! Eavesdropper has A and B but
cannot compute g^(ab) without knowing a or b (discrete log problem).
Modern TLS uses Elliptic Curve Diffie-Hellman (ECDHE) for efficiency.
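The exchange above works with any modular arithmetic, so it can be demonstrated with deliberately tiny (and therefore insecure) numbers — real TLS uses large groups or elliptic curves:

```python
# Toy Diffie-Hellman with tiny numbers, just to show the arithmetic.
# NEVER use parameters this small in practice.

p, g = 23, 5                      # public: prime modulus and generator
a, b = 6, 15                      # private values (normally random)

A = pow(g, a, p)                  # client computes and sends A = g^a mod p
B = pow(g, b, p)                  # server computes and sends B = g^b mod p

client_secret = pow(B, a, p)      # B^a mod p = g^(ab) mod p
server_secret = pow(A, b, p)      # A^b mod p = g^(ab) mod p

assert client_secret == server_secret   # both sides derived the same secret
```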
Perfect Forward Secrecy
Why ephemeral keys matter:
Without PFS (RSA key exchange):
- Server's long-term key encrypts pre-master secret
- If key later compromised, all past traffic decryptable
With PFS (ECDHE):
- New DH keys generated per session
- Session keys destroyed after use
- Compromising server key doesn't reveal past sessions
TLS 1.3 requires PFS (ECDHE or DHE only).
Server Name Indication (SNI)
Problem: Single IP hosts multiple HTTPS sites.
Which certificate should server present?
Solution: SNI extension in ClientHello.
ClientHello includes:
server_name = "www.example.com"
Server sees hostname BEFORE certificate selection.
Presents correct certificate for that hostname.
Note: SNI is sent in cleartext in TLS 1.2 and, by default, in TLS 1.3 as well.
The Encrypted Client Hello (ECH) extension for TLS 1.3 hides it.
Session Resumption
Avoiding full handshake for repeat connections:
Session IDs (TLS 1.2)
First connection: Full handshake, server assigns session ID
Subsequent: Client presents session ID, server looks up keys
Abbreviated handshake (1 RTT instead of 2)
Limitation: Server must store session state (doesn't scale).
Session Tickets (TLS 1.2)
Server encrypts session state into ticket.
Client stores ticket, presents on reconnection.
Server decrypts ticket, recovers session state.
Advantage: Stateless server, better scaling.
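The stateless-server idea can be sketched as "seal state with a key only the server knows." This toy uses an HMAC tag for integrity only; real session tickets use authenticated encryption so the state is also confidential, and the ticket key shown is a made-up placeholder.

```python
# Sketch: stateless session tickets. The server seals session state,
# hands it to the client, and later verifies it instead of storing it.
# Toy version: HMAC for integrity only (real tickets are encrypted).

import hashlib
import hmac
import json

TICKET_KEY = b"server-secret-key"          # hypothetical ticket key

def issue_ticket(state: dict) -> bytes:
    blob = json.dumps(state, sort_keys=True).encode()
    tag = hmac.new(TICKET_KEY, blob, hashlib.sha256).digest()
    return blob + b"." + tag.hex().encode()

def redeem_ticket(ticket: bytes):
    blob, _, tag = ticket.rpartition(b".")
    expected = hmac.new(TICKET_KEY, blob, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected.hex().encode()):
        return None                        # bad ticket: fall back to full handshake
    return json.loads(blob)

state = {"cipher": "TLS_AES_128_GCM_SHA256", "resumable": True}
assert redeem_ticket(issue_ticket(state)) == state

tampered = issue_ticket(state).replace(b"TLS_", b"XLS_")
assert redeem_ticket(tampered) is None     # tampering detected
```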
0-RTT Resumption (TLS 1.3)
Client sends:
ClientHello + Early Data (encrypted with previous session key)
Server can respond to early data immediately.
No round trip before application data!
Security caveat: Early data is replayable.
Only safe for idempotent requests.
Handshake Failures
Certificate Errors
ERR_CERT_AUTHORITY_INVALID
- Certificate not trusted
- Self-signed or unknown CA
- Missing intermediate certificate
ERR_CERT_DATE_INVALID
- Certificate expired
- Certificate not yet valid
- System clock wrong
ERR_CERT_COMMON_NAME_INVALID
- Hostname doesn't match certificate
- Wrong server or misconfiguration
Protocol Errors
ERR_SSL_VERSION_OR_CIPHER_MISMATCH
- No common TLS version
- No common cipher suite
- Often: Server only supports old protocols
ERR_SSL_PROTOCOL_ERROR
- Malformed handshake messages
- Middlebox interference
- Implementation bugs
Debugging TLS
# OpenSSL client
$ openssl s_client -connect example.com:443 -servername example.com
# Show certificate
$ openssl s_client -connect example.com:443 2>/dev/null | \
openssl x509 -text -noout
# Test specific TLS version
$ openssl s_client -connect example.com:443 -tls1_2
$ openssl s_client -connect example.com:443 -tls1_3
# curl with verbose TLS info
$ curl -v https://example.com 2>&1 | grep -i ssl
# Test suite (ssllabs.com/ssltest online, or testssl.sh locally)
$ ./testssl.sh example.com
Summary
TLS handshake accomplishes:
| Goal | Mechanism |
|---|---|
| Version negotiation | ClientHello/ServerHello |
| Cipher suite selection | ClientHello/ServerHello |
| Server authentication | Certificate + signature |
| Key exchange | ECDHE (Diffie-Hellman) |
| Forward secrecy | Ephemeral keys |
| Session resumption | Tickets, 0-RTT |
TLS 1.3 improvements:
- 1-RTT handshake (vs 2-RTT)
- 0-RTT resumption option
- Only secure cipher suites
- Encrypted handshake data
- Simpler, more secure
Certificates and PKI
Certificates prove a server’s identity. The Public Key Infrastructure (PKI) is the trust hierarchy that makes this verification possible.
What’s in a Certificate
X.509 Certificate Structure:
┌─────────────────────────────────────────────────────────────────────┐
│ Version: 3 (X.509v3) │
│ Serial Number: 04:00:00:00:00:01:15:4B:5A:C3:94 │
│ Signature Algorithm: sha256WithRSAEncryption │
│ │
│ Issuer: CN=DigiCert Global CA, O=DigiCert Inc, C=US │
│ Validity: │
│ Not Before: Jan 15 00:00:00 2024 GMT │
│ Not After: Jan 14 23:59:59 2025 GMT │
│ Subject: CN=www.example.com, O=Example Inc, C=US │
│ │
│ Subject Public Key Info: │
│ Public Key Algorithm: rsaEncryption │
│ RSA Public Key: (2048 bit) │
│ Modulus: 00:c3:9b:... │
│ Exponent: 65537 │
│ │
│ X509v3 Extensions: │
│ Subject Alternative Name: │
│ DNS:www.example.com, DNS:example.com │
│ Key Usage: Digital Signature, Key Encipherment │
│ Extended Key Usage: TLS Web Server Authentication │
│ │
│ Signature: 3c:b3:4e:... │
└─────────────────────────────────────────────────────────────────────┘
Key Fields
| Field | Purpose |
|---|---|
| Subject | Who the certificate identifies |
| Issuer | Who signed (issued) the certificate |
| Validity | When certificate is valid |
| Public Key | Server’s public key for encryption |
| Subject Alt Names | Additional valid hostnames |
| Signature | CA’s signature verifying the certificate |
Certificate Chain
Certificates form a chain of trust:
┌─────────────────────────────────────────────────────────────────────┐
│ Certificate Chain │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────┐ │
│ │ Root CA Certificate │ ← Trusted by OS/browser │
│ │ Issuer: DigiCert Root CA │ (pre-installed) │
│ │ Subject: DigiCert Root CA │ Self-signed │
│ └─────────────────┬───────────────────┘ │
│ │ Signs │
│ ┌─────────────────▼───────────────────┐ │
│ │ Intermediate CA Certificate │ ← Server sends this │
│ │ Issuer: DigiCert Root CA │ │
│ │ Subject: DigiCert Global CA │ │
│ └─────────────────┬───────────────────┘ │
│ │ Signs │
│ ┌─────────────────▼───────────────────┐ │
│ │ End-Entity Certificate │ ← Server's certificate │
│ │ Issuer: DigiCert Global CA │ │
│ │ Subject: www.example.com │ │
│ └─────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Validation:
1. Server sends leaf + intermediate(s)
2. Client finds trusted root CA
3. Verifies each signature up the chain
4. Checks validity dates, revocation, hostname
5. Trust established!
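The chain walk in steps 1-4 can be sketched with plain dicts standing in for parsed certificates. This is a structural sketch only: a real client also verifies each signature cryptographically with the issuer's public key and checks revocation, which is omitted here.

```python
# Sketch of chain validation: issuer linkage, validity dates, trusted root.
# Dicts stand in for X.509 certificates; signature checks are omitted.
from datetime import date

TRUSTED_ROOTS = {"DigiCert Root CA"}   # the pre-installed trust store

chain = [                              # leaf first, as the server sends it
    {"subject": "www.example.com",    "issuer": "DigiCert Global CA",
     "not_before": date(2024, 1, 15), "not_after": date(2025, 1, 14)},
    {"subject": "DigiCert Global CA", "issuer": "DigiCert Root CA",
     "not_before": date(2020, 1, 1),  "not_after": date(2030, 1, 1)},
]

def validate(chain, today):
    # each certificate must be issued by the next one up
    for cert, issuer in zip(chain, chain[1:]):
        if cert["issuer"] != issuer["subject"]:
            return False               # broken link in the chain
    # every certificate must be within its validity window
    for cert in chain:
        if not (cert["not_before"] <= today <= cert["not_after"]):
            return False               # expired or not yet valid
    # the top of the chain must be signed by a trusted root
    return chain[-1]["issuer"] in TRUSTED_ROOTS

print(validate(chain, date(2024, 6, 1)))   # True
```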
Certificate Validation
What clients check:
1. Chain of Trust
Each certificate signed by issuer's key
Chain leads to trusted root CA
2. Validity Period
Current date within Not Before / Not After
3. Hostname Match
Requested hostname in Subject CN or SAN
www.example.com matches *.example.com (wildcard)
4. Revocation Status
Certificate not revoked (CRL or OCSP)
5. Key Usage
Certificate allowed for TLS server authentication
6. Cryptographic Verification
Signatures mathematically valid
Key sizes acceptable
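The hostname-match check (step 3) has one subtle rule: a wildcard covers exactly one DNS label. Here is a toy matcher showing that rule; real clients follow the fuller logic of RFC 6125 (e.g. wildcards only in the leftmost label).

```python
# Toy hostname matcher: "*" in the certificate name matches exactly
# one DNS label, so *.example.com does NOT cover example.com itself
# or a.b.example.com.
def hostname_matches(hostname: str, pattern: str) -> bool:
    host_labels = hostname.lower().split(".")
    pat_labels = pattern.lower().split(".")
    if len(host_labels) != len(pat_labels):
        return False                   # wildcard never spans labels
    for h, p in zip(host_labels, pat_labels):
        if p != "*" and p != h:
            return False
    return True

assert hostname_matches("www.example.com", "*.example.com")
assert not hostname_matches("example.com", "*.example.com")
assert not hostname_matches("a.b.example.com", "*.example.com")
```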
Certificate Types
Domain Validation (DV)
Proves: Control of domain
Verification: Email, DNS, or HTTP challenge
Trust level: Low (only domain ownership)
Example: Let's Encrypt certificates
Organization Validation (OV)
Proves: Domain control + organization exists
Verification: Legal documents, phone calls
Trust level: Medium
Example: Business websites
Extended Validation (EV)
Proves: Domain + organization + legal verification
Verification: Extensive background checks
Trust level: High
Example: Banks, financial institutions
Note: Browsers no longer show a distinct EV indicator (the old green address bar)
Getting Certificates
Let’s Encrypt (Free, Automated)
# Using certbot
$ sudo certbot certonly --webroot -w /var/www/html -d example.com
# Auto-renewal
$ sudo certbot renew
# Certificates at:
# /etc/letsencrypt/live/example.com/fullchain.pem
# /etc/letsencrypt/live/example.com/privkey.pem
Commercial CAs
- Generate CSR (Certificate Signing Request)
- Submit to CA with payment
- Complete validation
- Receive certificate
# Generate private key and CSR
$ openssl req -new -newkey rsa:2048 -nodes \
-keyout server.key -out server.csr
# Submit server.csr to CA
# Receive server.crt back
Certificate Revocation
When certificates need to be invalidated:
CRL (Certificate Revocation List)
CA publishes list of revoked certificates.
Client downloads CRL, checks if cert is listed.
Problems:
- CRLs can be large
- Caching means delayed revocation detection
- Download can be slow
OCSP (Online Certificate Status Protocol)
Client asks CA: "Is this certificate revoked?"
CA responds: "Valid" or "Revoked"
Better than CRL but:
- Latency for each connection
- Privacy (CA sees what sites you visit)
OCSP Stapling
Server fetches OCSP response periodically.
Server "staples" response to TLS handshake.
Client gets proof without contacting CA.
Best practice: Enable OCSP stapling on your server.
Common Issues
Missing Intermediate
Problem:
Server sends only leaf certificate
Client can't build chain to root
"Certificate not trusted" error
Solution:
Configure server to send full chain:
ssl_certificate /path/to/fullchain.pem;
(includes leaf + intermediates)
Expired Certificate
Problem:
Certificate validity period ended
Browsers show security warning
Solution:
Renew certificate before expiration
Set up automated renewal (Let's Encrypt)
Monitor certificate expiration
Hostname Mismatch
Problem:
Request to example.com
Certificate for www.example.com
Names don't match
Solution:
Include all domains in SAN
Use wildcard (*.example.com) if appropriate
Redirect to canonical hostname
Summary
Certificate/PKI key concepts:
| Component | Purpose |
|---|---|
| Certificate | Binds public key to identity |
| Root CA | Trusted anchor (pre-installed) |
| Intermediate CA | Bridges root to end-entity |
| Chain | Path from leaf to trusted root |
| CRL/OCSP | Revocation checking |
| SAN | Multiple hostnames in one cert |
Best practices:
- Use certificates from trusted CAs
- Include full certificate chain
- Enable OCSP stapling
- Monitor expiration dates
- Use strong key sizes (RSA 2048+ or ECDSA P-256+)
Cipher Suites
A cipher suite is a combination of cryptographic algorithms used in a TLS connection. Understanding them helps you configure secure connections and debug compatibility issues.
Cipher Suite Components
TLS 1.2 cipher suite name:
TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
│ │ │ │ │ │ │
│ │ │ │ │ │ └── PRF hash
│ │ │ │ │ └────── Mode (GCM)
│ │ │ │ └────────── Key size (256-bit)
│ │ │ └────────────── Encryption (AES)
│ │ └────────────────────── Authentication (RSA cert)
│ └──────────────────────────── Key Exchange (ECDHE)
└──────────────────────────────── Protocol
Components:
Key Exchange: How to agree on encryption keys
Authentication: How to verify server identity
Encryption: How to encrypt data
MAC/Hash: How to verify integrity
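The naming convention diagrammed above is regular enough that the components can be pulled apart mechanically. This is a rough sketch for standard `TLS_<kex>_<auth>_WITH_<cipher>_<hash>` names; some historical names (export suites, NULL ciphers) don't fit the pattern.

```python
# Split a TLS 1.2 cipher suite name into its four components,
# following the boundaries shown in the diagram above.
def parse_suite(name: str) -> dict:
    # e.g. TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    kex_auth, _, cipher_hash = name.partition("_WITH_")
    parts = kex_auth.split("_")          # ["TLS", "ECDHE", "RSA"]
    cipher_parts = cipher_hash.split("_")
    return {
        "key_exchange":   parts[1],
        "authentication": parts[2],
        "encryption":     "_".join(cipher_parts[:-1]),  # AES_256_GCM
        "hash":           cipher_parts[-1],             # SHA384
    }

suite = parse_suite("TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384")
assert suite["key_exchange"] == "ECDHE"
assert suite["encryption"] == "AES_256_GCM"
```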
Algorithm Categories
Key Exchange
RSA Server's RSA key encrypts pre-master secret
No forward secrecy
DO NOT USE (TLS 1.3 removed)
DHE Diffie-Hellman Ephemeral
Forward secrecy
Slower than ECDHE
ECDHE Elliptic Curve Diffie-Hellman Ephemeral
Forward secrecy
Fast, secure
RECOMMENDED
Authentication
RSA RSA certificate, RSA signature
Widely supported
ECDSA Elliptic Curve DSA
Smaller keys, faster
Growing adoption
EdDSA Ed25519/Ed448
Modern, fast
TLS 1.3 support
Bulk Encryption
AES-GCM AES Galois/Counter Mode (AEAD)
Fast, secure, authenticated encryption
RECOMMENDED
AES-CBC AES Cipher Block Chaining
Older, requires separate MAC
Vulnerable to padding oracles
AVOID
ChaCha20-Poly1305 Stream cipher (AEAD)
Fast on devices without AES hardware
Good alternative to AES-GCM
MAC/Hash
SHA-384 For AEAD ciphers, used in PRF
SHA-256 For AEAD ciphers, used in PRF
SHA-1 Old, deprecated
Only for compatibility
DO NOT USE if avoidable
TLS 1.2 Recommended Suites
Preferred (forward secrecy, AEAD):
TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
Acceptable (for compatibility):
TLS_DHE_RSA_WITH_AES_256_GCM_SHA384
TLS_DHE_RSA_WITH_AES_128_GCM_SHA256
Avoid:
Anything with RSA key exchange (no PFS)
Anything with CBC mode (padding attacks)
Anything with 3DES (slow, weak)
Anything with RC4 (broken)
Anything with NULL (no encryption!)
Anything with EXPORT (intentionally weak)
TLS 1.3 Cipher Suites
TLS 1.3 simplified cipher suites dramatically:
Only 5 cipher suites:
TLS_AES_256_GCM_SHA384
TLS_AES_128_GCM_SHA256
TLS_CHACHA20_POLY1305_SHA256
TLS_AES_128_CCM_SHA256
TLS_AES_128_CCM_8_SHA256
Key exchange (ECDHE) and authentication (certificate signature)
are negotiated separately via extensions.
All TLS 1.3 suites provide:
✓ Forward secrecy (mandatory)
✓ AEAD encryption (mandatory)
✓ Strong algorithms (weak ones removed)
Configuring Cipher Suites
nginx
ssl_protocols TLSv1.2 TLSv1.3;
ssl_prefer_server_ciphers on;
# TLS 1.2 ciphers
ssl_ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
# TLS 1.3 ciphers (usually automatic)
ssl_conf_command Ciphersuites TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256;
Apache
SSLProtocol all -SSLv3 -TLSv1 -TLSv1.1
SSLCipherSuite ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:...
SSLHonorCipherOrder on
Testing Cipher Suites
# Test supported ciphers
$ nmap --script ssl-enum-ciphers -p 443 example.com
# OpenSSL test specific cipher
$ openssl s_client -connect example.com:443 \
-cipher ECDHE-RSA-AES256-GCM-SHA384
# Show negotiated cipher
$ openssl s_client -connect example.com:443 2>/dev/null | \
grep "Cipher is"
# New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384
# SSL Labs test (comprehensive)
# https://www.ssllabs.com/ssltest/
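You can also inspect which cipher suites your local OpenSSL build enables by default, without contacting any server, using Python's `ssl` module:

```python
# List the cipher suites enabled in the default client context.
# Each entry is a dict with keys like "name" and "protocol".
import ssl

ctx = ssl.create_default_context()
for cipher in ctx.get_ciphers()[:5]:      # first few entries
    print(cipher["name"], cipher["protocol"])
```

On a build with TLS 1.3 support, the `TLS_AES_*` and `TLS_CHACHA20_*` suites appear at the top of the list.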
Security Levels
┌─────────────────────────────────────────────────────────────────────┐
│ Level │ Key Exchange │ Encryption │ Bits of Security │
├───────────┼───────────────┼───────────────┼────────────────────────┤
│ Modern │ ECDHE P-256+ │ AES-128-GCM+ │ 128-bit security │
│ │ │ ChaCha20 │ │
├───────────┼───────────────┼───────────────┼────────────────────────┤
│ Compat. │ ECDHE/DHE │ AES-128+ │ 112-128 bit │
│ │ 2048-bit+ │ │ │
├───────────┼───────────────┼───────────────┼────────────────────────┤
│ Legacy │ RSA 2048 │ AES/3DES │ ~80-112 bit │
│ (avoid) │ │ │ │
├───────────┼───────────────┼───────────────┼────────────────────────┤
│ Broken │ RSA < 2048 │ RC4, DES │ Effectively none │
│ (never) │ EXPORT │ NULL │ │
└───────────┴───────────────┴───────────────┴────────────────────────┘
Summary
Good cipher suite configuration:
- Use TLS 1.3 when possible (automatic good choices)
- Prefer ECDHE for key exchange (forward secrecy)
- Use AEAD encryption (AES-GCM or ChaCha20-Poly1305)
- Disable weak ciphers (RSA key exchange, CBC, old algorithms)
- Test your configuration (SSL Labs, testssl.sh)
TLS 1.3 removes the complexity—all its cipher suites are secure.
TLS 1.3 Improvements
TLS 1.3 (RFC 8446, 2018) represents a major overhaul, not just incremental improvement. It’s faster, simpler, and more secure than TLS 1.2.
Key Improvements
┌─────────────────────────────────────────────────────────────────────┐
│ TLS 1.3 vs TLS 1.2 │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Performance: │
│ TLS 1.2: 2 RTT handshake │
│ TLS 1.3: 1 RTT handshake │
│ 0 RTT resumption │
│ │
│ Security: │
│ Removed: RSA key exchange, CBC ciphers, SHA-1, RC4, 3DES │
│ Required: Forward secrecy (ECDHE/DHE only) │
│ Encrypted: More handshake data hidden │
│ │
│ Simplicity: │
│ Cipher suites: 37+ → 5 │
│ Fewer options = fewer misconfigurations │
│ │
└─────────────────────────────────────────────────────────────────────┘
1-RTT Handshake
TLS 1.3 combines key exchange and authentication:
TLS 1.2:
ClientHello → 1st RTT
← ServerHello + Cert
ClientKeyExchange → 2nd RTT
← Finished
Application Data → Finally!
TLS 1.3:
ClientHello + KeyShare → 1st RTT
← ServerHello + KeyShare
← EncryptedExtensions
← Certificate, Finished
Finished →
Application Data → Immediately!
Client sends key share in first message.
Server can compute keys immediately.
Encrypted data flows after 1 RTT.
0-RTT Resumption
For returning clients:
TLS 1.3 0-RTT:
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ First Connection: │
│ Full handshake + receive session ticket │
│ │
│ Subsequent Connection: │
│ ClientHello + EarlyData → (request sent IMMEDIATELY) │
│ ← ServerHello + response │
│ │
│ No waiting! Request in first packet. │
│ │
└─────────────────────────────────────────────────────────────────────┘
Security caveat:
0-RTT data can be replayed by attacker
Only safe for idempotent operations (GET, not POST)
Server can reject 0-RTT for sensitive operations
Removed Features
TLS 1.3 removed insecure and unnecessary features:
Removed entirely:
✗ RSA key exchange (no forward secrecy)
✗ Static Diffie-Hellman (no forward secrecy)
✗ CBC mode ciphers (padding oracle attacks)
✗ RC4 (broken)
✗ 3DES (slow, small block size)
✗ MD5 and SHA-1 in signature algorithms
✗ Compression (CRIME attack)
✗ Renegotiation
✗ Custom DHE groups
✗ ChangeCipherSpec message
Result: All TLS 1.3 connections have forward secrecy
and use authenticated encryption (AEAD).
Encrypted Handshake
More of the handshake is encrypted:
TLS 1.2 visible to eavesdropper:
- Certificate (server identity)
- Server extensions
- Much of handshake
TLS 1.3 encrypted:
- Certificate
- Extensions after ServerHello
- Most handshake messages
Only visible:
- ClientHello (including SNI)
- ServerHello
Future: Encrypted Client Hello (ECH) hides SNI too.
Simplified Cipher Suites
TLS 1.2: 37+ cipher suites (many weak/redundant)
TLS 1.3: 5 cipher suites (all secure)
TLS 1.3 cipher suites:
TLS_AES_128_GCM_SHA256 (required)
TLS_AES_256_GCM_SHA384 (recommended)
TLS_CHACHA20_POLY1305_SHA256 (good for non-AES hardware)
TLS_AES_128_CCM_SHA256 (IoT)
TLS_AES_128_CCM_8_SHA256 (IoT, constrained)
Key exchange negotiated separately via supported_groups.
Signature algorithms negotiated separately.
Simpler configuration, fewer mistakes.
Downgrade Protection
TLS 1.3 prevents protocol downgrade attacks:
Attack scenario:
Client supports TLS 1.3
Server supports TLS 1.3
Attacker modifies ClientHello to say "TLS 1.2 only"
Connection uses weaker TLS 1.2
Protection:
Server random includes special bytes when downgrading
Client detects this and aborts
Man-in-the-middle cannot force downgrade
Migration Considerations
Compatibility
TLS 1.3 designed for compatibility:
- Uses same port (443)
- Can negotiate down to TLS 1.2 if needed
- Works with most proxies/load balancers
Potential issues:
- Old middleboxes may break 1.3
- Some intrusion detection fails on 1.3
- 0-RTT requires application awareness
Server Configuration
# nginx - enable TLS 1.3
ssl_protocols TLSv1.2 TLSv1.3;
# Enable 0-RTT (use with caution)
ssl_early_data on;
# In proxy situations, tell backend about early data
proxy_set_header Early-Data $ssl_early_data;
Application Changes for 0-RTT
# Check whether the request arrived as 0-RTT early data.
# Framework-agnostic sketch: `request`, `Response`, and is_idempotent()
# stand in for your web framework's objects and your own helper.
early_data = request.headers.get('Early-Data')
if early_data == '1':
    # This request might be replayed!
    if not is_idempotent(request):
        # Reject or require retry without 0-RTT
        return Response(status=425)  # 425 Too Early
Measuring TLS 1.3 Adoption
As of 2024:
- ~70% of websites support TLS 1.3
- All major browsers support TLS 1.3
- All major CDNs support TLS 1.3
Verify your site:
$ curl -v https://yoursite.com 2>&1 | grep "SSL connection"
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
Summary
TLS 1.3 advantages:
| Feature | Improvement |
|---|---|
| Handshake | 1 RTT (vs 2 RTT) |
| Resumption | 0 RTT possible |
| Security | Only secure options remain |
| Configuration | 5 ciphers vs 37+ |
| Privacy | More encrypted handshake |
| Forward secrecy | Mandatory |
TLS 1.3 should be enabled on all new deployments. The only reason to stay on TLS 1.2 is legacy client compatibility, and that’s decreasing rapidly.
Application Protocols
Beyond HTTP, many other application protocols power essential internet services. Understanding them provides insight into protocol design and helps when integrating with these systems.
Common Application Protocols
┌─────────────────────────────────────────────────────────────────────┐
│ Major Application Protocols │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Email: │
│ SMTP (25, 587) Sending email between servers │
│ IMAP (143, 993) Accessing mailbox, server stores mail │
│ POP3 (110, 995) Downloading mail, client stores │
│ │
│ File Transfer: │
│ FTP (21, 20) Classic file transfer (insecure) │
│ SFTP (22) SSH-based file transfer (secure) │
│ SCP (22) Secure copy over SSH │
│ │
│ Remote Access: │
│ SSH (22) Secure shell, tunneling, file transfer │
│ Telnet (23) Insecure remote access (legacy) │
│ RDP (3389) Windows remote desktop │
│ │
│ Name Resolution: │
│ DNS (53) Domain name → IP address │
│ mDNS (5353) Multicast DNS (local discovery) │
│ │
│ Time: │
│ NTP (123) Network time synchronization │
│ │
│ Directory: │
│ LDAP (389, 636) Directory services (Active Directory) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Protocol Characteristics
Most application protocols share common traits:
Request-Response:
Client sends command/request
Server sends response
Back and forth until done
Text vs Binary:
Text: Human-readable (SMTP, HTTP/1.1)
Binary: Machine-efficient (HTTP/2, Protocol Buffers)
Stateful vs Stateless:
Stateful: Server remembers session (SMTP, FTP)
Stateless: Each request independent (HTTP, DNS)
What You’ll Learn
- SMTP: How email travels across the internet
- FTP and Alternatives: File transfer evolution
- SSH: Secure remote access and more
SMTP: Email Delivery
SMTP (Simple Mail Transfer Protocol) is how email moves between servers. Despite being from 1982, it remains the backbone of email delivery.
How Email Flows
┌─────────────────────────────────────────────────────────────────────┐
│ Email Delivery Path │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ alice@gmail.com sends to bob@example.com │
│ │
│ ┌────────────┐ │
│ │ Alice │ │
│ │ (Gmail) │ │
│ └─────┬──────┘ │
│ │ 1. Compose & Send │
│ ▼ │
│ ┌────────────┐ │
│ │Gmail Server│ │
│ │ MTA │ │
│ └─────┬──────┘ │
│ │ 2. DNS lookup: example.com MX │
│ │ 3. SMTP to mail.example.com │
│ ▼ │
│ ┌────────────┐ │
│ │Example.com │ │
│ │Mail Server │ │
│ └─────┬──────┘ │
│ │ 4. Store in Bob's mailbox │
│ ▼ │
│ ┌────────────┐ │
│ │ Bob │ 5. Retrieve via IMAP/POP3 │
│ └────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
SMTP Conversation
S: 220 mail.example.com ESMTP ready
C: EHLO gmail.com
S: 250-mail.example.com
S: 250-SIZE 35882577
S: 250-STARTTLS
S: 250 OK
C: STARTTLS
S: 220 Ready to start TLS
(TLS handshake happens)
C: EHLO gmail.com
S: 250 OK
C: MAIL FROM:<alice@gmail.com>
S: 250 OK
C: RCPT TO:<bob@example.com>
S: 250 OK
C: DATA
S: 354 Start mail input
C: From: alice@gmail.com
C: To: bob@example.com
C: Subject: Hello!
C:
C: Hi Bob, how are you?
C: .
S: 250 OK, message queued
C: QUIT
S: 221 Bye
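The message that travels inside the DATA section can be built with the standard library's `email` package; `smtplib.SMTP(...).send_message(msg)` would then run the dialogue above (EHLO, STARTTLS, MAIL FROM, RCPT TO, DATA) for you. The addresses are the illustrative ones from the transcript.

```python
# Build the RFC 5322 message carried inside DATA.
# smtplib handles the SMTP dialogue itself, including the
# final "." that terminates the message body on the wire.
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "alice@gmail.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Hello!"
msg.set_content("Hi Bob, how are you?")

print(msg.as_string())
```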
Ports and Security
Port 25: Server-to-server (MTA to MTA)
Often blocked by ISPs for end users
Port 587: Client submission (with authentication)
Modern email clients use this
Port 465: SMTPS (implicit TLS)
Once deprecated, re-standardized for submission over implicit TLS (RFC 8314)
Security:
STARTTLS: Upgrade plain connection to TLS
AUTH: Login with username/password
SPF: Verify sender IP authorized
DKIM: Cryptographic message signature
DMARC: Policy for SPF/DKIM failures
Email Authentication (SPF, DKIM, DMARC)
SPF (DNS TXT record):
example.com TXT "v=spf1 include:_spf.google.com -all"
"Only Google's servers can send as @example.com"
DKIM (signature in header):
DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=selector;
h=from:to:subject:date; bh=...; b=...
Receiver fetches public key from DNS, verifies signature.
DMARC (policy):
_dmarc.example.com TXT "v=DMARC1; p=reject; rua=mailto:..."
"If SPF/DKIM fail, reject the message and report."
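The SPF record shown above is just a space-separated list of mechanisms ending in an `all` policy, so a rough tokenizer is short. Real SPF evaluation (RFC 7208) resolves `include:` via DNS and supports many more mechanisms; this only illustrates the record's shape.

```python
# Rough tokenizer for an SPF TXT record: version tag, mechanisms,
# and the trailing "all" policy (-all = fail, ~all = softfail).
def parse_spf(record: str):
    tokens = record.split()
    if tokens[0] != "v=spf1":
        raise ValueError("not an SPF record")
    mechanisms, policy = tokens[1:-1], tokens[-1]
    return mechanisms, policy

mechs, policy = parse_spf("v=spf1 include:_spf.google.com -all")
assert mechs == ["include:_spf.google.com"]
assert policy == "-all"
```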
Common Issues
Rejected as spam:
- Missing SPF/DKIM/DMARC
- IP on blocklist
- Poor sending reputation
Connection refused:
- Port 25 blocked (use 587)
- Firewall rules
- Server down
Authentication failed:
- Wrong credentials
- App-specific password needed
- TLS required but not enabled
FTP and Secure Alternatives
FTP (File Transfer Protocol) is one of the oldest internet protocols (1971). While still used, security concerns have led to better alternatives.
How FTP Works
FTP uses two connections:
Control Connection (Port 21):
- Commands and responses
- Stays open during session
- Text-based protocol
Data Connection (Port 20 or ephemeral):
- Actual file transfer
- Opened per transfer
- Closed after each file
┌────────────┐ ┌────────────┐
│ Client │ │ Server │
├────────────┤ ├────────────┤
│ Control ───┼────── Port 21 ─────┼─── Control │
│ │ │ │
│ Data ◄──┼─── Port 20/high ───┼──► Data │
└────────────┘ └────────────┘
Active vs Passive Mode
Active Mode:
1. Client opens control connection to server:21
2. Client tells server: "Connect to me on port 5000"
3. Server connects FROM port 20 TO client:5000
Problem: Client firewalls block incoming connections
Passive Mode (PASV):
1. Client opens control connection to server:21
2. Client: "PASV" (I'll connect to you)
3. Server: "227 Entering Passive (192,168,1,100,195,149)"
(Connect to 192.168.1.100 port 50069)
4. Client connects to server's data port
Better: Client initiates both connections (firewall-friendly)
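The 227 reply above encodes the data endpoint as six decimal numbers: four for the IP address, then the port's high and low bytes. A short sketch of the client-side parsing:

```python
# Parse a PASV (227) reply: the last two numbers are the
# high and low bytes of the data port.
import re

reply = "227 Entering Passive Mode (192,168,1,100,195,149)"
nums = [int(n) for n in re.findall(r"\d+", reply)[1:]]  # skip the 227 code
host = ".".join(map(str, nums[:4]))
port = nums[4] * 256 + nums[5]
print(host, port)   # 192.168.1.100 50069
```

This is why the transcript's `(192,168,1,100,195,149)` means "connect to 192.168.1.100 port 50069": 195 × 256 + 149 = 50069.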
FTP Session Example
$ ftp ftp.example.com
220 Welcome to Example FTP
Name: alice
331 Password required
Password: ********
230 Login successful
ftp> pwd
257 "/" is current directory
ftp> ls
227 Entering Passive Mode (192,168,1,100,195,149)
150 Here comes the directory listing
drwxr-xr-x 2 alice staff 68 Jan 15 10:00 documents
-rw-r--r-- 1 alice staff 1234 Jan 14 09:00 readme.txt
226 Directory send OK
ftp> get readme.txt
227 Entering Passive Mode (192,168,1,100,195,150)
150 Opening data connection
226 Transfer complete
ftp> quit
221 Goodbye
FTP Security Problems
✗ Passwords sent in plaintext
✗ Data transferred unencrypted
✗ No server authentication
✗ Complex firewall requirements
Anyone on the network can see:
- Username and password
- All file contents
- All commands
Secure Alternatives
SFTP (SSH File Transfer Protocol)
Runs over SSH (port 22):
✓ Encrypted connection
✓ Strong authentication
✓ Single port (firewall-friendly)
✓ Widely supported
$ sftp user@server.example.com
sftp> put localfile.txt
sftp> get remotefile.txt
sftp> ls
sftp> exit
SCP (Secure Copy)
Simple file copy over SSH:
# Copy local to remote
$ scp file.txt user@server:/path/
# Copy remote to local
$ scp user@server:/path/file.txt ./
# Copy directory recursively
$ scp -r localdir user@server:/path/
FTPS (FTP over TLS)
FTP with TLS encryption:
- Implicit FTPS: TLS from start (port 990)
- Explicit FTPS: upgrade via the AUTH TLS command (port 21)
Still has FTP complexity (dual connections).
SFTP generally preferred.
Recommendation
For new deployments:
1. SFTP Best overall (secure, firewall-friendly)
2. SCP Simple file copies
3. rsync Efficient synchronization
4. HTTPS API-based file transfer
Avoid:
- Plain FTP (insecure)
- TFTP (no authentication at all)
SSH: Secure Shell
SSH (Secure Shell) provides encrypted remote access, replacing insecure protocols like Telnet and rlogin. Beyond shell access, SSH enables secure file transfer, port forwarding, and tunneling.
SSH Capabilities
┌─────────────────────────────────────────────────────────────────────┐
│ SSH Use Cases │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Remote Shell: Interactive command line on remote server │
│ File Transfer: SFTP, SCP over encrypted channel │
│ Port Forwarding: Tunnel any TCP connection through SSH │
│ X11 Forwarding: Run graphical apps remotely │
│ Agent Forwarding: Use local keys on remote servers │
│ Git Transport: Secure repository access │
│ │
└─────────────────────────────────────────────────────────────────────┘
Authentication Methods
Password Authentication
$ ssh user@server.example.com
user@server.example.com's password: ********
Simple but:
- Vulnerable to brute force
- Requires typing password
- Can't be automated safely
Public Key Authentication
# Generate key pair
$ ssh-keygen -t ed25519 -C "alice@laptop"
# Creates: ~/.ssh/id_ed25519 (private)
# ~/.ssh/id_ed25519.pub (public)
# Copy public key to server
$ ssh-copy-id user@server.example.com
# Or manually add to ~/.ssh/authorized_keys
# Login (no password!)
$ ssh user@server.example.com
Key Types
Ed25519: Modern, fast, secure (recommended)
RSA: Widely compatible (4096-bit minimum)
ECDSA: Elliptic curve (P-256, P-384)
Avoid:
DSA: Deprecated, weak
RSA <2048: Too short
SSH Configuration
Client Config (~/.ssh/config)
# Default settings
Host *
AddKeysToAgent yes
IdentityFile ~/.ssh/id_ed25519
# Named host
Host myserver
HostName server.example.com
User alice
Port 22
IdentityFile ~/.ssh/work_key
# Now just:
$ ssh myserver
Server Config (/etc/ssh/sshd_config)
# Secure settings
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers alice bob
Protocol 2   # default; protocol 1 is removed in modern OpenSSH
Port Forwarding
Local Forwarding (-L)
Access remote service through local port:
$ ssh -L 8080:localhost:80 user@server
Local:8080 ──────SSH Tunnel──────> Server ────> localhost:80
│ (server's port 80)
└── Your browser connects here
Use case: Access web app behind firewall
Remote Forwarding (-R)
Expose local service to remote:
$ ssh -R 9000:localhost:3000 user@server
Server:9000 <─────SSH Tunnel─────── Local:3000
│ │
└── Internet can access Your dev server
Use case: Share local development server
Dynamic Forwarding (-D)
SOCKS proxy through SSH:
$ ssh -D 1080 user@server
Configure browser to use SOCKS proxy localhost:1080
All browser traffic goes through server.
Use case: Bypass network restrictions, privacy
SSH Tunnels for Security
┌─────────────────────────────────────────────────────────────────────┐
│ SSH Tunnel Example │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Scenario: Connect to database behind firewall │
│ │
│ ┌────────┐ ┌────────────┐ ┌──────────┐ │
│ │ Laptop │──SSH──>│ Jump Host │ │ Database │ │
│ │ │ │ (bastion) │──────>│ :5432 │ │
│ └────────┘ └────────────┘ └──────────┘ │
│ │
│ $ ssh -L 5432:db.internal:5432 user@bastion │
│ $ psql -h localhost -p 5432 mydb │
│ │
│ Database connection encrypted through SSH tunnel. │
│ │
└─────────────────────────────────────────────────────────────────────┘
Best Practices
Key Management:
✓ Use Ed25519 keys
✓ Protect private key with passphrase
✓ Use ssh-agent to avoid retyping passphrase
✓ Rotate keys periodically
Server Security:
✓ Disable password authentication
✓ Disable root login
✓ Use fail2ban for brute force protection
✓ Keep SSH updated
✓ Optionally use a non-standard port (obscurity only; cuts scanner noise, not a real control)
Access Control:
✓ Limit allowed users
✓ Use bastion/jump hosts
✓ Audit authorized_keys regularly
Troubleshooting
# Verbose output
$ ssh -v user@server # Basic
$ ssh -vvv user@server # Maximum verbosity
# Check key permissions
$ ls -la ~/.ssh/
# id_ed25519 should be 600
# authorized_keys should be 600
# Test authentication
$ ssh -T git@github.com
# Check server logs (on server)
$ sudo tail -f /var/log/auth.log
Protocol Design Principles
When building networked systems, you often need custom protocols or must extend existing ones. This chapter covers principles for designing protocols that are robust, evolvable, and performant.
Design Considerations
┌─────────────────────────────────────────────────────────────────────┐
│ Protocol Design Questions │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Communication Pattern: │
│ Request-response? Streaming? Pub-sub? Full-duplex? │
│ │
│ Reliability Requirements: │
│ Every message must arrive? Some loss acceptable? │
│ │
│ Latency Requirements: │
│ Real-time? Best-effort? Batch acceptable? │
│ │
│ Message Size: │
│ Small fixed? Variable? Very large? │
│ │
│ Security: │
│ Authentication? Encryption? Integrity? │
│ │
│ Compatibility: │
│ Must work with existing systems? Future evolution? │
│ │
└─────────────────────────────────────────────────────────────────────┘
Key Topics
- Versioning Strategies: How to evolve protocols over time
- Backwards Compatibility: Supporting old and new clients
- Performance Considerations: Optimizing for speed and efficiency
Versioning Strategies
Protocols evolve. New features are added, bugs are fixed, and requirements change. Good versioning makes this evolution manageable.
Why Version?
Without versioning:
Client (v1): Send message type A
Server (v2): Expects message type B
Result: Confusion, errors, failures
With versioning:
Client (v1): "I speak version 1"
Server (v2): "I understand v1 and v2, let's use v1"
Result: Graceful interoperability
Versioning Approaches
Explicit Version Numbers
In protocol header:
┌──────────────────────────────────────────────────────────────┐
│ Version │ Message Type │ Length │ Payload... │
│ (1) │ (2) │ (4) │ │
└──────────────────────────────────────────────────────────────┘
HTTP:
GET / HTTP/1.1
GET / HTTP/2
Pros: Clear, explicit
Cons: Major versions can break compatibility
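The fixed header sketched above maps directly onto the `struct` module: one byte of version, two bytes of message type, four bytes of payload length, in network byte order. A minimal encoder/decoder:

```python
# Pack/unpack the header from the diagram: version (1 byte),
# message type (2 bytes), payload length (4 bytes), big-endian.
import struct

HEADER = struct.Struct("!BHI")   # "!" = network byte order, size 7

def encode(version: int, msg_type: int, payload: bytes) -> bytes:
    return HEADER.pack(version, msg_type, len(payload)) + payload

def decode(data: bytes):
    version, msg_type, length = HEADER.unpack_from(data)
    payload = data[HEADER.size : HEADER.size + length]
    return version, msg_type, payload

frame = encode(1, 42, b"hello")
assert decode(frame) == (1, 42, b"hello")
```

A receiver can branch on the version byte before touching the rest of the frame, which is exactly what makes explicit versioning robust.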
Feature Negotiation
Instead of single version, negotiate capabilities:
Client: "I support: compression, encryption, batch"
Server: "I support: encryption, streaming"
Both: "We'll use: encryption"
TLS does this with cipher suites.
HTTP/2 does this with SETTINGS frames.
Pros: Granular, flexible
Cons: Complex negotiation
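At its core the exchange above is a set intersection, plus some rule (often "client preference order") for choosing among the overlap. A minimal sketch:

```python
# Capability negotiation as set intersection, picking the first
# overlapping feature in the client's preference order.
client_prefs = ["compression", "encryption", "batch"]  # ordered by preference
server_caps = {"encryption", "streaming"}

agreed = [f for f in client_prefs if f in server_caps]
print(agreed)   # ['encryption']
```

TLS cipher suite selection follows the same shape, though servers often impose their own preference order instead.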
Semantic Versioning
MAJOR.MINOR.PATCH
Major: Breaking changes (v1 → v2)
Minor: New features, backwards compatible (v1.1 → v1.2)
Patch: Bug fixes only (v1.1.0 → v1.1.1)
For APIs:
v1 clients work with v1.x servers
v2 might require migration
Pros: Clear expectations
Cons: Major bumps still painful
Wire Format Versioning
Message format evolution:
Version 1:
{ "name": "Alice", "age": 30 }
Version 2 (additive):
{ "name": "Alice", "age": 30, "email": "alice@example.com" }
Old clients ignore unknown fields.
New clients handle missing fields.
No version number needed if done carefully.
Version in Different Layers
URL versioning (REST APIs):
/api/v1/users
/api/v2/users
Header versioning:
Accept: application/vnd.myapi.v2+json
Query parameter:
/api/users?version=2
Content negotiation:
Accept: application/json; version=2
Best Practices
1. Include version from day one
Adding versioning later is painful.
2. Plan for evolution
Reserve bits/fields for future use.
3. Support multiple versions
Don't force immediate upgrades.
4. Deprecation timeline
v1 supported until 2025-01-01.
5. Version at right granularity
API version? Message version? Both?
Backwards Compatibility
Maintaining backwards compatibility lets you evolve protocols without breaking existing deployments. It’s often the difference between smooth upgrades and painful migrations.
Compatibility Types
Backwards Compatible:
New servers work with old clients.
Client v1 ──────> Server v2 ✓
Forwards Compatible:
Old servers handle new clients gracefully.
Client v2 ──────> Server v1 ✓ (degraded)
Full Compatibility:
Both directions work.
Ideal but not always achievable.
Techniques for Compatibility
Ignore Unknown Fields
// Client v1 sends:
{ "name": "Alice", "age": 30 }
// Server v2 expects:
{ "name": "Alice", "age": 30, "email": "?" }
// Server should:
// - Accept missing email (use default or null)
// - Not reject the request
// Client v2 sends:
{ "name": "Bob", "age": 25, "email": "bob@example.com" }
// Server v1 should:
// - Ignore unknown "email" field
// - Process name and age normally
Optional Fields with Defaults
// Protocol Buffers example
message User {
    string name = 1;
    int32 age = 2;
    optional string email = 3;  // Added in v2
}
// Missing optional fields get default values.
// Old messages work with new code.
// New messages work with old code (email ignored).
Extensible Enums
Bad: Fixed enum, no room to grow
enum Status { OK = 0, ERROR = 1 }
Good: Reserve unknown handling
enum Status {
    UNKNOWN = 0,  // Default for unrecognized
    OK = 1,
    ERROR = 2
    // Future: PENDING = 3
}
Old code receiving new status → UNKNOWN (handled gracefully)
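The same fallback pattern looks like this in Python (a sketch mirroring the enum above, not Protocol Buffers' actual unknown-value handling):

```python
from enum import IntEnum

class Status(IntEnum):
    UNKNOWN = 0   # default for unrecognized values
    OK = 1
    ERROR = 2

def parse_status(value: int) -> Status:
    # A wire value added in a newer version (e.g. PENDING = 3)
    # degrades to UNKNOWN instead of crashing old code.
    try:
        return Status(value)
    except ValueError:
        return Status.UNKNOWN

assert parse_status(1) is Status.OK
assert parse_status(3) is Status.UNKNOWN
```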
Reserved Fields
message User {
    string name = 1;
    reserved 2;       // Was 'age', removed in v3
    string email = 3;
    reserved "age";   // Prevent reuse of the name
}
// Prevents accidentally reusing field numbers
// which would cause data corruption.
Breaking Changes
Sometimes breaking changes are necessary:
What's Breaking:
- Removing required fields
- Changing field types
- Renaming fields (in JSON)
- Changing semantics of existing fields
- Removing supported message types
Mitigation Strategies:
1. New endpoint/message type (keep old working)
2. Deprecation period with warnings
3. Version bump (v1 → v2)
4. Feature flags during transition
Postel’s Law (Robustness Principle)
"Be conservative in what you send,
be liberal in what you accept."
Send: Strictly conform to spec
Accept: Handle variations gracefully
This enables interoperability between
implementations with slight differences.
Testing Compatibility
# Test old client against new server
$ old-client --server=new-server --test-suite
# Test new client against old server
$ new-client --server=old-server --test-suite
# Fuzz testing with version mixing
$ compatibility-fuzzer --versions=v1,v2,v3
# Contract testing
$ pact-verify --provider=server --consumer=client-v1
Real-World Examples
JSON (excellent compatibility):
- Unknown fields ignored
- Missing fields → null/default
- Easy to extend
Protocol Buffers (good compatibility):
- Field numbers provide stability
- Unknown fields preserved
- Wire format stable
HTTP (exceptional compatibility):
- 30+ years of evolution
- HTTP/1.1 still works everywhere
- Unknown headers are ignored
Performance Considerations
Protocol design choices significantly impact performance. Understanding the trade-offs helps you make informed decisions.
Key Performance Factors
┌─────────────────────────────────────────────────────────────────────┐
│ Performance Dimensions │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Latency: Time for message round-trip │
│ Affected by: RTTs, encoding time, processing │
│ │
│ Throughput: Data volume per unit time │
│ Affected by: Message size, connection limits │
│ │
│ Overhead: Wasted bandwidth (headers, framing) │
│ Affected by: Protocol verbosity, encoding │
│ │
│ Efficiency: CPU/memory per message │
│ Affected by: Parsing, serialization │
│ │
└─────────────────────────────────────────────────────────────────────┘
Message Format Trade-offs
Text vs Binary
Text (JSON, XML):
+ Human readable
+ Easy debugging
+ Universal parsers
- Larger messages
- Slower parsing
- Ambiguous types
Binary (Protocol Buffers, MessagePack):
+ Compact messages
+ Fast parsing
+ Precise types
- Requires schema/decoder
- Harder debugging
- Versioning complexity
Rule of thumb:
Internal services: Binary (efficiency)
Public APIs: JSON (interoperability)
High-volume: Binary (worth complexity)
Size Comparison
Same data in different formats:
JSON (53 bytes):
{"id":123,"name":"Alice","email":"alice@example.com"}
Protocol Buffers (~28 bytes):
[binary encoded, roughly half the size]
MessagePack (~40 bytes):
[binary JSON, ~25% smaller]
For millions of messages, these differences matter!
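You can measure the gap yourself with only the standard library. The binary layout below is invented for illustration (a 4-byte id plus length-prefixed strings), standing in for what Protocol Buffers or MessagePack would produce:

```python
import json
import struct

record = {"id": 123, "name": "Alice", "email": "alice@example.com"}

# Compact JSON (no spaces)
json_bytes = json.dumps(record, separators=(",", ":")).encode()

# Hand-rolled binary: 4-byte id, then length-prefixed UTF-8 strings
name = record["name"].encode()
email = record["email"].encode()
binary = (struct.pack("!IB", record["id"], len(name)) + name
          + struct.pack("!B", len(email)) + email)

assert len(json_bytes) == 53
assert len(binary) == 28          # roughly half the size
```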
Connection Strategies
Persistent vs Per-Request
Per-request connections:
Each request: TCP handshake + TLS handshake + request
Latency: High (multiple RTTs)
Resources: Connection churn
Persistent connections:
One connection: Multiple requests
Latency: Low (no repeated handshakes)
Resources: Connection management
Always prefer persistent for repeated interactions.
Multiplexing
HTTP/1.1 (no multiplexing):
Connection 1: Request A ─────> Response A
Connection 2: Request B ─────> Response B
(Need multiple connections for parallelism)
HTTP/2 (multiplexing):
Connection 1: [A][B][C]───>[A][B][C]
(All requests on one connection)
Multiplexing reduces:
- Connection overhead
- Memory usage
- Head-of-line blocking at the HTTP level
  (HTTP/2 still suffers TCP-level head-of-line blocking;
  QUIC removes that too)
Batching and Pipelining
Individual requests:
Request 1 → Response 1 → Request 2 → Response 2
Time: 2 RTT for 2 requests
Pipelining:
Request 1 → Request 2 → Response 1 → Response 2
Time: ~1 RTT for 2 requests (requests sent back-to-back)
Batching:
[Request 1, Request 2] → [Response 1, Response 2]
Time: 1 RTT for 2 requests, one message each way
Trade-off: Batching adds latency for the first item while the batch fills.
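A batching sketch in Python (the request/response shapes are invented; the point is that N logical requests become one wire message, so the batch costs a single round trip):

```python
import json

def send_batch(requests: list) -> bytes:
    # One wire message carries the whole batch.
    return json.dumps(requests).encode()

def handle_batch(wire: bytes) -> list:
    # The server answers every request in one response message.
    return [{"id": r["id"], "ok": True} for r in json.loads(wire)]

wire = send_batch([{"id": 1}, {"id": 2}])
assert handle_batch(wire) == [{"id": 1, "ok": True}, {"id": 2, "ok": True}]
```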
Compression
When to compress:
✓ Large text payloads (JSON, HTML)
✓ Repeated patterns in data
✓ Slow/metered networks
When not to compress:
✗ Already compressed (images, video)
✗ Tiny messages (overhead > savings)
✗ CPU-constrained environments
Common algorithms:
gzip: Universal, good compression
br: Better ratio, slower
zstd: Fast, good ratio (emerging)
Caching
Cacheable responses reduce load:
Without caching:
Every request → Server processing → Response
With caching:
First request → Server → Response (cached)
Repeat requests → Cache hit → Immediate response
Design for cacheability:
- Stable URLs for same content
- Proper cache headers
- ETags for validation
- Separate static/dynamic content
Measurement
Measure before optimizing:
Latency metrics:
- P50, P95, P99 response times
- Time to first byte (TTFB)
- Round-trip time
Throughput metrics:
- Requests per second
- Bytes per second
- Messages per connection
Tools:
- wrk, ab (HTTP benchmarking)
- tcpdump, wireshark (packet analysis)
- perf, flamegraphs (CPU profiling)
Summary
Performance optimization priorities:
- Reduce round trips (biggest impact)
- Use persistent connections
- Choose appropriate message format
- Enable compression for large text
- Implement caching where possible
- Batch when latency allows
Measure, then optimize. Premature optimization is the root of all evil, but informed optimization is essential.
Real-World Patterns
Production systems use additional infrastructure beyond basic protocols. This chapter covers common patterns for scaling, reliability, and performance.
Infrastructure Components
┌─────────────────────────────────────────────────────────────────────┐
│ Modern Web Architecture │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ User ──> CDN ──> Load Balancer ──> App Servers ──> Database │
│ │ │ │ │
│ │ │ └── Cache (Redis) │
│ │ │ │
│ │ └── WAF (Web Application Firewall) │
│ │ │
│ └── Edge cache, DDoS protection │
│ │
└─────────────────────────────────────────────────────────────────────┘
Each component serves specific purposes:
CDN: Cache static content near users
Load Balancer: Distribute traffic, health checks
WAF: Security filtering
App Servers: Business logic
Cache: Fast data access
Database: Persistent storage
Key Topics
- Load Balancing: Distributing traffic across servers
- Proxies: Forward and reverse proxies
- CDNs: Content delivery at scale
- Connection Pooling: Efficient resource usage
Load Balancing
Load balancers distribute traffic across multiple servers, improving availability and performance. Understanding load balancing helps you design scalable systems.
Why Load Balance?
Without load balancing:
All traffic → Single server
Problems: Single point of failure, limited capacity
With load balancing:
Traffic → Load Balancer → Multiple servers
Benefits: Redundancy, scalability, maintenance flexibility
Load Balancing Algorithms
Round Robin
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (repeat)
Pros: Simple, even distribution
Cons: Ignores server capacity, session state
Weighted Round Robin
Server A (weight 3): Gets 3x traffic
Server B (weight 1): Gets 1x traffic
Request pattern: A, A, A, B, A, A, A, B, ...
Use case: Servers with different capacities
Least Connections
Route to server with fewest active connections.
Server A: 10 connections
Server B: 5 connections
Server C: 8 connections
New request → Server B
Better for variable request durations.
IP Hash
Hash(Client IP) → Server selection
Same client always hits same server.
Useful for session affinity without cookies.
hash("192.168.1.100") % 3 = Server B
Least Response Time
Route to server with fastest response.
Combines: Connection count + response time
Best for: Heterogeneous backends
Requires: Active health monitoring
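The selection logic behind several of these algorithms fits in a few lines. A sketch using the example numbers above (hashlib stands in for Python's built-in hash(), which is randomized per process and would break stickiness across restarts):

```python
import hashlib

connections = {"A": 10, "B": 5, "C": 8}  # active connections per server

def least_connections(conns: dict) -> str:
    return min(conns, key=conns.get)

def ip_hash(client_ip: str, backends: list) -> str:
    # Stable hash so the same client maps to the same server
    # across processes and restarts.
    digest = int(hashlib.sha256(client_ip.encode()).hexdigest(), 16)
    return backends[digest % len(backends)]

assert least_connections(connections) == "B"  # fewest active connections
# Same client IP always maps to the same backend:
assert (ip_hash("192.168.1.100", ["A", "B", "C"])
        == ip_hash("192.168.1.100", ["A", "B", "C"]))
```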
Layer 4 vs Layer 7
Layer 4 (Transport):
- Routes based on IP/port
- Faster (less inspection)
- Protocol-agnostic
- No content-based routing
Layer 7 (Application):
- Routes based on content (URL, headers, cookies)
- Can modify requests/responses
- SSL termination
- More flexible, more overhead
Example Layer 7 rules:
/api/* → API servers
/static/* → CDN
/admin/* → Admin servers
Health Checks
Load balancer monitors backends:
Active checks:
- Periodic HTTP requests to /health
- TCP connection attempts
- Custom scripts
Passive checks:
- Monitor real request success/failure
- Track response times
Unhealthy server:
- Remove from rotation
- Continue checking
- Return when healthy
Session Persistence
Problem: User state spread across servers
Login on Server A
Next request hits Server B
"Please login again" 😞
Solutions:
Sticky Sessions (affinity):
Set-Cookie: SERVERID=A
Load balancer routes by cookie
Shared Session Store:
All servers use Redis/Memcached for sessions
Any server can handle any request
Stateless Design:
JWT tokens contain user state
No server-side session needed (best!)
Common Load Balancers
Software:
- HAProxy: High performance, Layer 4/7
- nginx: Web server + load balancer
- Envoy: Modern, service mesh focused
- Traefik: Cloud-native, auto-discovery
Cloud:
- AWS ALB/NLB: Layer 7/4
- GCP Load Balancing: Global, anycast
- Azure Load Balancer: Layer 4
- Cloudflare: CDN + load balancing
Hardware (legacy):
- F5 BIG-IP
- Citrix NetScaler
Configuration Example (nginx)
upstream backend {
    least_conn;
    server 10.0.0.1:8080 weight=3;
    server 10.0.0.2:8080 weight=2;
    server 10.0.0.3:8080 backup;
    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }

    location /health {
        return 200 "OK";
    }
}
Proxies and Reverse Proxies
Proxies act as intermediaries in network communication. Understanding them helps you design secure architectures and debug connectivity issues.
Forward Proxy
Client-side proxy: Client → Proxy → Internet
┌─────────────────────────────────────────────────────────────────────┐
│ Forward Proxy │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────┐ ┌───────┐ ┌──────────────┐ │
│ │ Client │──────>│ Proxy │──────>│ Internet │ │
│ └────────┘ └───────┘ │ (Server) │ │
│ │ └──────────────┘ │
│ │ │
│ Proxy hides client identity from server. │
│ Server sees proxy's IP, not client's. │
│ │
└─────────────────────────────────────────────────────────────────────┘
Use cases:
- Corporate content filtering
- Caching (reduce bandwidth)
- Anonymity
- Access control
Reverse Proxy
Server-side proxy: Internet → Reverse Proxy → Servers
┌─────────────────────────────────────────────────────────────────────┐
│ Reverse Proxy │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌───────────┐ ┌────────┐ │
│ │ Internet │──────>│ Reverse │──────>│ Server │ │
│ │ (Client) │ │ Proxy │──────>│ Server │ │
│ └──────────────┘ └───────────┘──────>│ Server │ │
│ │ └────────┘ │
│ │ │
│ Clients don't know backend servers exist. │
│ Single entry point to multiple backends. │
│ │
└─────────────────────────────────────────────────────────────────────┘
Use cases:
- SSL termination
- Load balancing
- Caching
- Compression
- Security (hide backend)
- A/B testing
Reverse Proxy Functions
SSL Termination
Client ──HTTPS──> Reverse Proxy ──HTTP──> Backend
Proxy handles TLS:
- Certificate management in one place
- Offloads crypto from backends
- Backends get plain HTTP (simpler)
- Internal traffic often trusted network
Request Routing
Based on URL path:
/api/* → API servers
/images/* → Image servers
/ → Web servers
Based on header:
Host: api.example.com → API servers
Host: www.example.com → Web servers
Based on cookie:
beta_user=true → Beta servers
Caching
Cache responses at proxy level:
Request 1: GET /logo.png
Proxy → Backend → Response (cached at proxy)
Request 2: GET /logo.png
Proxy → Cache hit → Response (no backend call)
Reduces backend load significantly.
Common Proxy Headers
X-Forwarded-For: Client IP (through proxy chain)
X-Forwarded-For: 203.0.113.195, 70.41.3.18, 150.172.238.178
X-Forwarded-Proto: Original protocol
X-Forwarded-Proto: https
X-Forwarded-Host: Original Host header
X-Forwarded-Host: www.example.com
X-Real-IP: Single client IP (nginx convention)
X-Real-IP: 203.0.113.195
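Backends should parse X-Forwarded-For defensively: clients can forge the leftmost entries, so walk the chain from the right, skipping proxies you trust. A sketch (the trusted-proxy set is assumed deployment configuration):

```python
def client_ip_from_xff(header: str, trusted_proxies: set) -> str:
    """Rightmost-untrusted strategy: the last hop not in our own proxy
    fleet is the best guess at the real client address."""
    hops = [h.strip() for h in header.split(",")]
    for ip in reversed(hops):
        if ip not in trusted_proxies:
            return ip
    return hops[0]  # every hop is ours; fall back to the first entry

xff = "203.0.113.195, 70.41.3.18, 150.172.238.178"
assert client_ip_from_xff(
    xff, {"70.41.3.18", "150.172.238.178"}) == "203.0.113.195"
```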
Proxy Protocols
HTTP CONNECT (Forward Proxy)
Client → Proxy: CONNECT example.com:443 HTTP/1.1
Proxy → Client: HTTP/1.1 200 Connection Established
Client → (tunnel) → Server
Proxy creates TCP tunnel.
Used for HTTPS through forward proxies.
PROXY Protocol (Reverse Proxy)
HAProxy-style protocol:
Passes original client info to backend.
Binary or text header prepended to connection.
PROXY TCP4 192.168.1.1 10.0.0.1 56789 80\r\n
(Then normal HTTP traffic)
Backend sees real client IP.
nginx Reverse Proxy Config
server {
    listen 443 ssl;
    server_name example.com;

    ssl_certificate /etc/nginx/cert.pem;
    ssl_certificate_key /etc/nginx/key.pem;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
Debugging Through Proxies
# Check what proxy sees
$ curl -v -x http://proxy:8080 https://example.com
# See forwarded headers
$ curl -s https://httpbin.org/headers
# Trace proxy chain
$ curl -s https://httpbin.org/ip
# Returns visible IP (proxy's if through proxy)
CDNs
Content Delivery Networks (CDNs) cache content at edge locations worldwide, reducing latency by serving users from nearby servers.
How CDNs Work
┌─────────────────────────────────────────────────────────────────────┐
│ CDN Architecture │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Origin Server (Your Server) │
│ │ │
│ ┌──────────┼──────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Edge │ │ Edge │ │ Edge │ │
│ │ US │ │ EU │ │ Asia │ │
│ └────┬───┘ └────┬───┘ └────┬───┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Users Users Users │
│ │
│ User requests → Nearest Edge → Cached response (fast!) │
│ Cache miss → Edge fetches from Origin → Caches → Responds │
│ │
└─────────────────────────────────────────────────────────────────────┘
Benefits
Performance:
- Reduced latency (content closer to users)
- Faster page loads
- Better user experience
Scalability:
- Offload traffic from origin
- Handle traffic spikes
- Global reach without global infrastructure
Availability:
- DDoS protection
- Origin failover
- Always-on edge presence
What to Put on CDN
Ideal for CDN:
✓ Static files (JS, CSS, images)
✓ Videos and media
✓ Downloadable files
✓ Public API responses (if cacheable)
Not for CDN (usually):
✗ User-specific content
✗ Real-time data
✗ Authenticated endpoints
✗ Frequently changing data
Cache Control
# Cache for 1 day, revalidate after
Cache-Control: public, max-age=86400, must-revalidate
# Cache for 1 year (immutable assets)
Cache-Control: public, max-age=31536000, immutable
# No caching
Cache-Control: no-store
# Private only (not CDN)
Cache-Control: private, max-age=3600
CDN Configuration Concepts
TTL (Time To Live):
How long edge caches content
Balance freshness vs. origin load
Cache Keys:
What makes requests "same" for caching
URL, headers, cookies, query strings
Purge/Invalidation:
Force refresh of cached content
By URL, tag, or entire cache
Edge Functions:
Run code at edge (Cloudflare Workers, Lambda@Edge)
Customize responses, A/B testing, auth
Popular CDNs
Global CDNs:
- Cloudflare: Free tier, security focus
- Fastly: Real-time purging, edge compute
- Akamai: Enterprise, largest network
- AWS CloudFront: AWS integration
- Google Cloud CDN: GCP integration
Specialized:
- Bunny CDN: Cost-effective
- KeyCDN: Simple pricing
- imgix: Image optimization focus
Setting Up CDN
Basic setup:
1. Sign up with CDN provider
2. Configure origin (your server)
3. Get CDN domain (e.g., cdn.example.com)
4. Update DNS or reference CDN URLs
5. Configure cache rules
DNS example:
cdn.example.com CNAME example.cdn-provider.net
Or full site through CDN:
example.com → CDN → origin.example.com
Debugging CDN
# Check cache status
$ curl -I https://cdn.example.com/image.png
X-Cache: HIT # Served from edge
X-Cache: MISS # Fetched from origin
CF-Cache-Status: HIT # Cloudflare specific
# Check which edge served request
$ curl -I https://cdn.example.com/image.png
CF-RAY: 123abc-SJC # Cloudflare San Jose
# Bypass cache
$ curl -H "Cache-Control: no-cache" https://cdn.example.com/image.png
Connection Pooling
Connection pooling reuses established connections instead of creating new ones for each request. This is essential for performance in database access, HTTP clients, and service communication.
Why Pool Connections?
Without pooling (connection per request):
Request 1: [TCP handshake][TLS handshake][Query][Response][Close]
Request 2: [TCP handshake][TLS handshake][Query][Response][Close]
Request 3: [TCP handshake][TLS handshake][Query][Response][Close]
Each request pays full connection overhead!
With pooling (reuse connections):
[TCP][TLS] ← Once
Request 1: [Query][Response]
Request 2: [Query][Response]
Request 3: [Query][Response]
Connection overhead paid once, amortized across requests.
Performance Impact
Connection setup costs:
TCP handshake: 1 RTT (~50ms intercontinental)
TLS handshake: 1-2 RTT (~50-100ms)
Auth/setup: Varies
Without pooling (~150ms setup: TCP + TLS handshakes):
1000 requests × 150ms setup = 150 seconds in overhead alone!
With pooling:
10 connections × 150ms setup = 1.5 seconds
Requests reuse existing connections.
10-100x improvement in connection overhead.
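The arithmetic above as executable numbers (the ~150ms figure assumes TCP plus TLS handshakes at roughly 50ms per round trip):

```python
setup_ms = 150      # TCP + TLS handshake cost per new connection
requests = 1000

no_pool = requests * setup_ms   # every request pays setup
pooled = 10 * setup_ms          # ten pooled connections pay it once

assert no_pool == 150_000       # 150 seconds of pure overhead
assert pooled == 1_500          # 1.5 seconds
assert no_pool // pooled == 100 # 100x less handshake overhead
```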
Pool Configuration
Key parameters:
Min connections:
Connections kept open even when idle.
Ready for immediate use.
Max connections:
Upper limit on concurrent connections.
Prevents resource exhaustion.
Idle timeout:
Close connections unused for this long.
Frees resources, reduces stale connections.
Max lifetime:
Close connections older than this.
Prevents issues with long-lived connections.
Connection timeout:
How long to wait for new connection.
Fails fast if pool exhausted.
Database Connection Pooling
# Python with SQLAlchemy
from sqlalchemy import create_engine, text

engine = create_engine(
    'postgresql://user:pass@host/db',
    pool_size=5,        # Maintained connections
    max_overflow=10,    # Extra connections allowed
    pool_timeout=30,    # Wait for connection
    pool_recycle=3600,  # Recreate after 1 hour
    pool_pre_ping=True  # Test connections before use
)

# Each request borrows from pool
with engine.connect() as conn:
    result = conn.execute(text("SELECT 1"))
# Connection returned to pool automatically when the block exits
HTTP Connection Pooling
# Python requests with session (pooled)
import requests

# BAD: New connection per request
for url in urls:
    response = requests.get(url)  # New connection each time

# GOOD: Reuse connections via session
session = requests.Session()
adapter = requests.adapters.HTTPAdapter(
    pool_connections=10,
    pool_maxsize=20,
    max_retries=3
)
session.mount('https://', adapter)

for url in urls:
    response = session.get(url)  # Reuses connections
Common Pooling Issues
Pool Exhaustion
All connections in use, new requests must wait.
Symptoms:
- Requests timeout waiting for connection
- "Connection pool exhausted" errors
- Latency spikes
Solutions:
- Increase pool size
- Reduce connection hold time
- Add timeouts for borrowing
- Monitor pool usage
Connection Leaks
Connections borrowed but never returned.
Causes:
- Exception before returning connection
- Forgot to close/return connection
- Infinite loop holding connection
Solutions:
- Always use try-finally or context managers
- Set connection timeouts
- Monitor active vs. available connections
- Implement leak detection
Stale Connections
Connection in pool is dead (server closed it).
Causes:
- Server timeout (closed idle connection)
- Network issue
- Server restart
Solutions:
- Connection validation before use (pool_pre_ping)
- Maximum connection lifetime
- Proper error handling with retry
Pool Sizing Guidelines
Too few connections:
- Requests queue up
- Increased latency
- Underutilized backend
Too many connections:
- Memory waste
- May exceed server limits
- Connection thrashing
Starting point:
connections = (requests_per_second × avg_request_duration) × 1.5
Example:
100 req/s × 0.1s duration = 10 concurrent
Pool size: 15 (10 × 1.5)
Adjust based on monitoring!
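The starting-point formula as a function (the 1.5 headroom factor is the rule of thumb above; tune it from monitoring):

```python
def pool_size(req_per_sec: float, avg_duration_s: float,
              headroom: float = 1.5) -> int:
    # Little's law: average concurrency = arrival rate x duration.
    # Headroom absorbs bursts; round to a whole connection, minimum 1.
    concurrent = req_per_sec * avg_duration_s
    return max(1, round(concurrent * headroom))

assert pool_size(100, 0.1) == 15   # the worked example above
```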
Monitoring Pool Health
Key metrics:
- Active connections (in use)
- Idle connections (available)
- Wait time for connections
- Connection creation rate
- Timeout/exhaustion errors
Alerts:
- Pool utilization > 80% sustained
- Connection wait time > threshold
- Pool exhaustion events
Conclusion
You’ve journeyed through the layers of network protocols that power the internet. From the fundamentals of OSI and TCP/IP to the cutting edge of HTTP/3 and QUIC, you now have a comprehensive understanding of how computers communicate.
Key Takeaways
The Layered Architecture Works
The genius of network layering:
- Each layer has a specific job
- Layers can evolve independently
- Complexity is manageable
- Interoperability is possible
Application ──────────────────────────────────
Transport ────────────────────────────────────
Network ──────────────────────────────────────
Link ─────────────────────────────────────────
This structure has served us for 50 years and counting.
Trade-offs Are Everywhere
Reliability vs. Latency:
TCP: Reliable, higher latency
UDP: Fast, no guarantees
QUIC: Best of both (complex)
Simplicity vs. Performance:
HTTP/1.1: Simple, limited parallelism
HTTP/2: Complex, highly parallel
Security vs. Speed:
Full TLS: Secure, connection overhead
0-RTT: Fast, replay risks
No perfect choice—understand your requirements.
Evolution Never Stops
1991: HTTP/0.9 (simple document retrieval)
2024: HTTP/3 + QUIC (multiplexed, encrypted, mobile-ready)
IPv4 → IPv6 (ongoing)
TLS 1.2 → TLS 1.3 (complete)
TCP → QUIC (emerging)
The protocols will continue to evolve.
The fundamentals you've learned will help you adapt.
Applying Your Knowledge
As a Developer
- Choose the right protocol for your use case
- Configure connections efficiently (pooling, keep-alive)
- Implement proper error handling and retries
- Understand timeout behavior
- Consider security at every layer
As a Debugger
- Use tools: tcpdump, Wireshark, curl, dig
- Understand what each layer provides
- Know where to look for different problems
- Read packet captures with confidence
As an Architect
- Design for resilience (multiple layers of redundancy)
- Plan for scale (load balancing, CDNs)
- Consider latency in distributed systems
- Stay current with protocol evolution
Keep Learning
Protocols not covered in depth:
- gRPC and Protocol Buffers
- GraphQL
- MQTT and IoT protocols
- BGP and routing details
- IPsec and VPNs
- SIP and VoIP
Resources for continued learning:
- RFCs (the definitive specifications)
- Wireshark packet analysis
- Building your own implementations
- Production system observation
Final Thought
Networks are the invisible infrastructure connecting billions of devices. Every API call, every web page, every video stream relies on the protocols covered in this book. Understanding them makes you a more effective developer—one who can debug the mysterious, optimize the slow, and design the robust.
The internet is a marvel of human collaboration and engineering. Now you understand how it works.
Happy networking!
Appendix: Tools and Debugging
This appendix covers essential tools for network debugging, packet analysis, and protocol troubleshooting.
Command Line Tools
curl - HTTP Client
# Basic GET request
$ curl https://example.com
# Verbose output (see headers, TLS handshake)
$ curl -v https://example.com
# Show only response headers
$ curl -I https://example.com
# POST with JSON
$ curl -X POST https://api.example.com/data \
-H "Content-Type: application/json" \
-d '{"key": "value"}'
# Follow redirects
$ curl -L https://example.com
# Save response to file
$ curl -o output.html https://example.com
# Show timing breakdown
$ curl -w "@curl-timing.txt" -o /dev/null -s https://example.com
# Custom timing format
$ curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" \
-o /dev/null -s https://example.com
dig - DNS Queries
# Query A record
$ dig example.com
# Query specific record type
$ dig example.com AAAA
$ dig example.com MX
$ dig example.com TXT
# Use specific DNS server
$ dig @8.8.8.8 example.com
# Short output
$ dig +short example.com
# Trace resolution path
$ dig +trace example.com
# Reverse lookup
$ dig -x 93.184.216.34
# Show all records (note: many servers now refuse ANY queries, per RFC 8482)
$ dig example.com ANY
nslookup - DNS Lookup (Alternative)
# Basic lookup
$ nslookup example.com
# Specify record type
$ nslookup -type=MX example.com
# Use specific server
$ nslookup example.com 8.8.8.8
netstat / ss - Network Connections
# Show all TCP connections
$ netstat -ant # Linux/Mac
$ ss -ant # Linux (faster)
# Show listening ports
$ netstat -tlnp # Linux
$ ss -tlnp # Linux
# Show UDP sockets
$ ss -u
# Show process using port
$ ss -tlnp | grep :8080
$ lsof -i :8080 # Mac/Linux
tcpdump - Packet Capture
# Capture all traffic on interface
$ sudo tcpdump -i eth0
# Capture specific port
$ sudo tcpdump -i eth0 port 80
# Capture specific host
$ sudo tcpdump -i eth0 host 192.168.1.100
# Save to file (for Wireshark)
$ sudo tcpdump -i eth0 -w capture.pcap
# Read from file
$ tcpdump -r capture.pcap
# Show packet contents (ASCII)
$ sudo tcpdump -i eth0 -A port 80
# Show packet contents (hex + ASCII)
$ sudo tcpdump -i eth0 -X port 80
# Capture only TCP SYN packets
$ sudo tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0'
# Capture DNS queries
$ sudo tcpdump -i eth0 port 53
ping - Connectivity Test
# Basic ping
$ ping example.com
# Specify count
$ ping -c 4 example.com
# Set interval
$ ping -i 0.5 example.com
# Set packet size
$ ping -s 1000 example.com
# IPv6 ping
$ ping6 example.com
traceroute - Path Discovery
# Trace route to destination
$ traceroute example.com
# Use ICMP (like ping)
$ traceroute -I example.com # Linux
$ traceroute example.com # Mac (ICMP default)
# Use TCP
$ traceroute -T -p 80 example.com
# Use UDP (default on Linux)
$ traceroute -U example.com
mtr - Combined Ping + Traceroute
# Interactive mode
$ mtr example.com
# Report mode (run 10 times, output)
$ mtr -r -c 10 example.com
# Show IP addresses only
$ mtr -n example.com
openssl - TLS/SSL Testing
# Connect and show certificate
$ openssl s_client -connect example.com:443
# Show certificate details
$ openssl s_client -connect example.com:443 2>/dev/null | \
openssl x509 -text -noout
# Check certificate expiration
$ openssl s_client -connect example.com:443 2>/dev/null | \
openssl x509 -noout -dates
# Test specific TLS version
$ openssl s_client -connect example.com:443 -tls1_2
$ openssl s_client -connect example.com:443 -tls1_3
# Show supported ciphers
$ openssl s_client -connect example.com:443 -cipher 'ALL' 2>&1 | \
grep "Cipher is"
nc (netcat) - TCP/UDP Tool
# Connect to port
$ nc example.com 80
# Listen on port
$ nc -l 8080
# Send UDP packet
$ echo "test" | nc -u 192.168.1.1 53
# Port scanning
$ nc -zv example.com 20-25
# Transfer file
$ nc -l 8080 > received.txt # Receiver
$ nc host 8080 < file.txt # Sender
Wireshark
Wireshark is the standard GUI tool for packet analysis.
Capture Filters (BPF Syntax)
# Capture specific host
host 192.168.1.100
# Capture specific port
port 80
# Capture range of ports
portrange 8000-9000
# Capture TCP only
tcp
# Combine filters
host 192.168.1.100 and port 443
tcp and not port 22
Display Filters
# Filter by IP
ip.addr == 192.168.1.100
ip.src == 192.168.1.100
ip.dst == 10.0.0.1
# Filter by port
tcp.port == 80
tcp.dstport == 443
# Filter by protocol
http
dns
tls
tcp
udp
# HTTP specific
http.request.method == "GET"
http.response.code == 200
# TCP flags
tcp.flags.syn == 1
tcp.flags.fin == 1
tcp.flags.reset == 1
# TLS specific
tls.handshake.type == 1 # Client Hello
tls.handshake.type == 2 # Server Hello
# DNS specific
dns.qry.name == "example.com"
# Combine filters
ip.addr == 192.168.1.100 && tcp.port == 443
http.request || http.response
Useful Wireshark Features
Follow TCP Stream:
Right-click packet → Follow → TCP Stream
Shows complete conversation in readable format
Flow Graph:
Statistics → Flow Graph
Visualizes packet flow between hosts
Protocol Hierarchy:
Statistics → Protocol Hierarchy
Shows breakdown of protocols in capture
Expert Info:
Analyze → Expert Information
Highlights anomalies, retransmissions, errors
I/O Graph:
Statistics → I/O Graph
Visualizes traffic over time
HTTP-Specific Tools
httpie - Modern HTTP Client
# GET request
$ http example.com
# POST with JSON (automatic)
$ http POST api.example.com/users name=john age:=25
# Custom headers
$ http example.com Authorization:"Bearer token123"
# Form data
$ http -f POST example.com/login user=john pass=secret
wget - Download Tool
# Download file
$ wget https://example.com/file.zip
# Continue interrupted download
$ wget -c https://example.com/large-file.zip
# Mirror website
$ wget -m https://example.com
# Download with custom filename
$ wget -O output.zip https://example.com/file.zip
ab (Apache Bench) - Load Testing
# 1000 requests, 10 concurrent
$ ab -n 1000 -c 10 https://example.com/
# With keep-alive
$ ab -n 1000 -c 10 -k https://example.com/
wrk - Modern Load Testing
# 30 second test, 12 threads, 400 connections
$ wrk -t12 -c400 -d30s https://example.com/
# With Lua script for custom requests
$ wrk -t12 -c400 -d30s -s script.lua https://example.com/
Debugging Common Issues
Connection Refused
$ curl https://example.com:8080
curl: (7) Failed to connect: Connection refused
Causes:
- Service not running
- Wrong port
- Firewall blocking
Debug:
$ ss -tlnp | grep 8080 # Is anything listening?
$ sudo iptables -L -n # Check firewall
$ systemctl status service # Check service
Connection Timeout
$ curl --connect-timeout 5 https://example.com
curl: (28) Connection timed out
Causes:
- Host unreachable
- Firewall dropping packets (not rejecting)
- Network routing issue
Debug:
$ ping example.com # Basic connectivity
$ traceroute example.com # Where does it stop?
$ tcpdump -i eth0 host example.com # See outgoing packets
DNS Resolution Failure
$ curl https://example.com
curl: (6) Could not resolve host: example.com
Debug:
$ dig example.com # Query DNS directly
$ dig @8.8.8.8 example.com # Try different DNS
$ cat /etc/resolv.conf # Check DNS config
TLS/SSL Errors
$ curl https://example.com
curl: (60) SSL certificate problem
Debug:
$ openssl s_client -connect example.com:443
# Check for:
# - Certificate chain
# - Expiration date
# - Common name / SAN matching
$ curl -v https://example.com 2>&1 | grep -i ssl
Slow Connections
Debug with timing:
$ curl -w "DNS: %{time_namelookup}s
TCP: %{time_connect}s
TLS: %{time_appconnect}s
TTFB: %{time_starttransfer}s
Total: %{time_total}s\n" -o /dev/null -s https://example.com
High DNS time: DNS resolver issue
High TCP time: Network latency
High TLS time: TLS negotiation slow
High TTFB: Server processing slow
Quick Reference
┌────────────────────────────────────────────────────────────────┐
│ Tool Quick Reference │
├────────────────────────────────────────────────────────────────┤
│ │
│ What you need Tool to use │
│ ───────────────────────────────────────────────────────── │
│ HTTP debugging curl -v, httpie │
│ DNS lookup dig, nslookup │
│ Connectivity test ping, nc │
│ Path tracing traceroute, mtr │
│ Port checking ss, netstat, lsof │
│ Packet capture tcpdump, Wireshark │
│ TLS/Certificate check openssl s_client │
│ Load testing ab, wrk │
│ File download curl, wget │
│ │
└────────────────────────────────────────────────────────────────┘