Up to 100Gbps Streaming Infrastructure: Architecture & Bottleneck Prevention (Guide 2026)
TABLE OF CONTENTS
- Introduction
- What “100Gbps Streaming” Actually Means
- High-Level Architecture for 100Gbps Streaming
- Network Layer: Where Bottlenecks Appear First
- Server Hardware Considerations
- Disk I/O & Data Flow
- Application Layer Throughput
- Load Distribution & Scaling Strategy
- Common Bottlenecks That Kill High-Bandwidth Streaming
- Observability for High-Throughput Systems
- Conclusion
Introduction
Scaling streaming infrastructure from 10Gbps to 100Gbps is not a linear upgrade; it is a paradigm shift in systems engineering. In a 10Gbps environment, inefficiencies in the Linux kernel or hardware interrupts are often masked by the available CPU headroom. At 100Gbps, these inefficiencies become catastrophic walls.
The physics of 100Gbps networking are unforgiving. A 100GbE interface receiving standard 1500-byte frames operates with a packet arrival interval of roughly 120 nanoseconds. This leaves the CPU with a budget of fewer than 300 clock cycles to process, queue, and route each packet before the buffer overflows. If the system stalls for even a microsecond due to a cache miss or lock contention, thousands of packets are dropped, resulting in immediate throughput degradation.
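A quick back-of-the-envelope check of that budget (the 2.5 GHz core clock is purely an illustrative assumption):

```bash
# Per-packet time budget at 100 Gbps with a 1500-byte MTU.
# On-wire frame = 1500 + 18 (Ethernet header/FCS) + 20 (preamble + inter-frame gap) = 1538 bytes.
echo "1538 * 8 / 100" | bc -l    # ~123 ns between packet arrivals at line rate
echo "123 * 2.5"      | bc -l    # ~300 CPU cycles at an assumed 2.5 GHz clock, before any fixed overhead
```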
This guide outlines the engineering architecture required to sustain 100Gbps throughput per node. It moves beyond basic configuration to address the physical, kernel, and application-level bottlenecks that typically cap performance at 20–30Gbps in untuned environments.
What “100Gbps Streaming” Actually Means
Before procuring hardware, it is critical to distinguish between link speed and effective data delivery.
Theoretical vs. Effective Throughput
A 100Gbps link has a physical signaling limit. However, protocol overheads (Ethernet preamble, IP headers, TCP headers) consume approximately 3-5% of this bandwidth. The maximum theoretical TCP payload throughput—“Goodput”—is roughly 94-95Gbps. Expecting 99Gbps of video data is physically impossible without Jumbo Frames, which are rarely viable over the public internet.
Packet Rate (PPS) vs. Bandwidth
Streaming servers often fail not because of bandwidth limits, but because of Packet Per Second (PPS) limits. A 100Gbps stream of 4K video segments (large packets) is manageable for modern CPUs. However, if the traffic pattern shifts—for example, during a TCP handshake storm or a DDoS attack using small packets—the PPS rate can skyrocket from 8 million to over 100 million. Architecture must be designed for the PPS worst-case scenario, not just the bandwidth best-case.
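The same arithmetic applied to packet rate shows why small packets are the real threat (illustrative bc math):

```bash
# Packet rate at 100 Gbps for large vs. minimum-size frames.
echo "100 * 10^9 / (1538 * 8)" | bc   # ~8.1 Mpps with full 1500-byte frames
echo "100 * 10^9 / (84 * 8)"   | bc   # ~148 Mpps with 64-byte frames (84 bytes on the wire)
```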
High-Level Architecture for 100Gbps Streaming
Achieving line-rate performance requires treating the server as a specialized packet pipeline rather than a general-purpose compute node.
Ingress vs. Egress Separation
At this scale, mixing ingress (ingest/origin fetch) and egress (client delivery) on the same interface is risky. Best practice involves physically separating traffic. Dedicate one 100GbE interface for egress to the ISP/Peering partners and a separate interface (often 25G or 40G) for internal backend traffic and origin fetches. This prevents egress bursts from choking the control plane or ingest traffic.
Stateless vs. Stateful Components
The closer you get to the edge, the more stateless the architecture should be. Edge nodes should function as “dumb” memory-to-network bridges. Complex logic—such as DRM signing, manifest manipulation, or user authentication—should be offloaded to an upstream layer or handled via asynchronous sidecar processes. Every CPU cycle spent on logic at the edge is a cycle stolen from packet transmission.
Network Layer: Where Bottlenecks Appear First
The default Linux network stack is tuned for compatibility, not high-frequency packet processing.
NIC Capabilities and Offloading
The Network Interface Card (NIC) must be an active coprocessor. Modern cards like the NVIDIA ConnectX-6/7 or Intel E810 Series are mandatory.
- Hardware Flow Steering: The NIC must support “Flow Steering” or “aRFS” (Accelerated Receive Flow Steering) to intelligently distribute packets to the specific CPU cores handling the destination application threads, ensuring cache locality.
- Interrupt Coalescing: Default drivers often use “Adaptive RX,” which varies the interrupt rate based on load. For consistent streaming, disable adaptive-rx and tune the rx-usecs parameter to a static value (e.g., 50-100 microseconds). This forces the NIC to batch packets, significantly reducing CPU interrupt load and preventing jitter.
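A minimal tuning sketch with ethtool and the aRFS sysfs knobs (the interface name eth0 and the specific values are assumptions; aRFS also requires driver support):

```bash
# Pin interrupt coalescing to a static interval instead of adaptive moderation.
ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 64 tx-usecs 64

# Enable n-tuple filters (prerequisite for accelerated RFS on supporting NICs),
# then size the global and per-queue flow tables so flows follow their sockets.
ethtool -K eth0 ntuple on
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
for q in /sys/class/net/eth0/queues/rx-*/rps_flow_cnt; do
    echo 2048 > "$q"
done
```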
Kernel Network Stack Limits
- Ring Buffers: Standard ring buffers (often 512 or 1024) are insufficient. They should generally be increased to the hardware maximum (e.g., 4096 or 8192) to absorb micro-bursts. Note: On Linux Kernel 6.8+, blindly maximizing rings can increase memory pressure; benchmark incrementally.
- SoftIRQ Saturation: A common bottleneck is ksoftirqd hitting 100% on a single core. This indicates that Receive Side Scaling (RSS) is misconfigured or that the entropy of the traffic flows is insufficient to distribute load across queues.
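A quick way to inspect and raise the ring sizes, and to check how evenly flows are landing on queues (eth0 and the 8192 value are assumptions; use the maximum your NIC actually reports):

```bash
# Show current vs. hardware-maximum ring sizes, then raise toward the maximum.
ethtool -g eth0
ethtool -G eth0 rx 8192 tx 8192

# Inspect the RSS indirection table and watch per-queue interrupt distribution;
# one queue absorbing most interrupts points to an RSS / flow-entropy problem.
ethtool -x eth0
watch -n1 'grep eth0 /proc/interrupts'
```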
Server Hardware Considerations
The “more cores is better” philosophy is flawed for 100Gbps networking.
CPU: Frequency vs. Cores
Network processing is serial per flow. A 64-core CPU running at 2.0 GHz will often perform worse than a 16-core CPU running at 3.5 GHz. High clock speed allows the CPU to drain the RX ring buffer faster, which is the primary defense against packet drops. Look for processors with high per-core performance and large L3 caches (e.g., AMD EPYC with 3D V-Cache or Intel Scalable Performance SKUs).
PCIe Topology and Bandwidth
This is the most common hardware failure point.
- PCIe Gen4 x16: A 100GbE NIC requires a full PCIe Gen4 x16 slot. While a Gen4 x8 slot theoretically offers ~126Gbps, protocol overheads (TLP headers, encoding) leave almost zero headroom. Any bus contention results in drops.
- NUMA Alignment: The NIC, the NVMe drives, and the CPU cores handling the traffic must reside on the same NUMA node. Crossing the UPI/Infinity Fabric interconnect adds latency and slashes effective memory bandwidth by up to 40%. Use lstopo to verify the physical layout before deploying software.
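A short verification sketch, assuming the interface is named eth0 and hwloc is installed:

```bash
# Resolve the NIC's PCIe address and confirm it negotiated Gen4 x16
# (look for "Speed 16GT/s, Width x16" under LnkSta).
NIC_BUS=$(basename "$(readlink /sys/class/net/eth0/device)")
lspci -vv -s "$NIC_BUS" | grep -E 'LnkCap:|LnkSta:'

# Confirm which NUMA node owns the NIC, then review the full topology.
cat /sys/class/net/eth0/device/numa_node
lstopo-no-graphics
```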
Disk I/O & Data Flow
You cannot stream 100Gbps if the storage subsystem delivers only 20Gbps. While NVMe is fast, RAM is the only tier capable of sustaining line-rate 100Gbps random reads efficiently.
Tier 0: Streaming from RAM
The primary strategy for 100Gbps delivery is to serve the “Working Set” (the slice of the library that accounts for roughly 80-90% of requests) entirely from system memory.
- Bandwidth Physics: A single DDR5 memory channel delivers ~38 GB/s (300Gbps). An 8-channel server offers >300 GB/s. Compare this to a PCIe Gen4 NVMe drive (~7 GB/s). RAM is the only medium that can handle the random read IOPS of thousands of concurrent clients without latency spikes.
- Linux Page Cache: The most efficient method is leveraging the kernel’s Page Cache. When Nginx reads a file, Linux caches it in RAM. By equipping edge nodes with 256GB–512GB of RAM, you ensure that popular segments remain memory-resident.
- Explicit RAM Disks (tmpfs): For absolute determinism, mount a tmpfs volume (e.g., /mnt/ramdisk). This forces content into RAM, bypassing the eviction logic of the Page Cache.
```bash
mount -t tmpfs -o size=200G tmpfs /mnt/ramdisk
```
Trade-off: tmpfs is volatile. You must architect a sync mechanism to populate it from NVMe/Origin on boot.
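A minimal sketch of such a warm-up step, assuming hypothetical paths (/data/nvme/hot, /var/lib/streaming/hotlist.txt) and that a separate popularity tracker maintains the hot list:

```bash
#!/usr/bin/env bash
# Hypothetical boot-time warm-up: copy the current hot set from NVMe into tmpfs.
# All paths are illustrative assumptions.
set -euo pipefail

mountpoint -q /mnt/ramdisk || mount -t tmpfs -o size=200G tmpfs /mnt/ramdisk

# hotlist.txt lists one file per line, relative to /data/nvme/hot,
# as produced by whatever popularity tracker the platform already runs.
rsync -a --files-from=/var/lib/streaming/hotlist.txt /data/nvme/hot/ /mnt/ramdisk/
```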
Tier 1: NVMe Backing
For the “Long Tail” of content that exceeds RAM capacity:
- Software RAID: Avoid hardware RAID cards; they become CPU bottlenecks. Use Linux MDRAID (RAID 0 or 10) or specialized high-performance software RAID (such as xiRAID), which utilizes AVX instructions for parity calculations.
- Filesystem: Use XFS. Its allocation group design minimizes lock contention during parallel reads/writes compared to EXT4.
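An illustrative MDRAID-plus-XFS setup; the four NVMe device names, the RAID level, and the mount point are assumptions to adapt to your hardware:

```bash
# Assemble a 4-drive NVMe RAID 10 with MDRAID and format it as XFS.
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.xfs -f /dev/md0
mkdir -p /data/nvme
mount -o noatime /dev/md0 /data/nvme
```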
I/O Strategy: Zero-Copy
Whether serving from RAM or NVMe, the data path must use Zero-Copy (sendfile).
- Standard Path (Slow): Disk -> Kernel Buffer -> User Buffer (Nginx) -> Kernel Buffer -> NIC. This triggers context switches and memory copies, burning CPU.
- Zero-Copy Path (Fast): Disk/RAM -> Kernel Buffer -> NIC. The CPU never “touches” the data; it simply sets up the transfer and lets the DMA engine move it.
Application Layer Throughput
Nginx is the industry standard, but a default apt-get install nginx will cap at ~20Gbps.
Connection Handling
- Thread Pools: Standard Nginx workers are blocking. If a worker blocks on a disk read, it stops serving thousands of connections. Enabling aio threads offloads disk operations to a separate thread pool, keeping the main event loop non-blocking and responsive to network packets.
- Zero-Copy (Sendfile): The sendfile syscall is non-negotiable. It instructs the kernel to copy data directly from the disk buffer to the socket buffer, bypassing userspace entirely.
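A minimal sketch of the relevant directives, written to a hypothetical conf.d include (the file path is an assumption, and aio threads requires an Nginx build with thread support):

```bash
cat > /etc/nginx/conf.d/throughput.conf <<'EOF'
sendfile            on;        # zero-copy file -> socket path
sendfile_max_chunk  1m;        # stop one fast client from monopolizing a worker
tcp_nopush          on;        # send full TCP frames
aio                 threads;   # offload blocking disk reads to a thread pool
EOF
nginx -t && systemctl reload nginx
```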
Encryption (TLS) at Scale
TLS is the heaviest per-byte CPU cost in the delivery path. Standard userspace OpenSSL must copy data into userspace to encrypt it, breaking Zero-Copy.
- kTLS (Kernel TLS): Enabling kTLS allows the Linux kernel to handle encryption. Crucially, this restores the ability to use sendfile with encrypted streams.
- Hardware Offload: If using ConnectX-6 Dx or newer, kTLS can push the encryption tasks down to the NIC (Inline TLS), freeing up massive amounts of CPU and allowing the server to push 100Gbps of encrypted video as easily as HTTP.
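A hedged enablement sketch: kTLS needs a kernel with the tls module and an Nginx 1.21.4+ build linked against OpenSSL 3.x; the conf.d path is an assumption:

```bash
# Load the kernel TLS module and tell OpenSSL (via Nginx) to use kTLS.
modprobe tls
cat > /etc/nginx/conf.d/ktls.conf <<'EOF'
ssl_conf_command Options KTLS;
EOF
nginx -t && systemctl reload nginx

# On kernels that expose TLS statistics, rising TlsTxSw / TlsTxDevice
# counters confirm that software or inline (NIC) kTLS is actually in use.
cat /proc/net/tls_stat
```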
Load Distribution & Scaling Strategy
Why One Server is Not Enough
Even a perfectly tuned 100Gbps server is a single failure domain. One NIC driver crash or PCIe error takes that node's entire capacity offline at once.
Scaling Architecture
- Traffic Sharding: Do not use a load balancer in front of streaming servers; the load balancer becomes the bottleneck. Use DNS-based Round Robin or, preferably, Anycast routing to distribute traffic directly to the edge nodes.
- Consistent Hashing: Route requests for specific files to specific servers. This maximizes RAM cache hit rates (Hot/Cold separation). If every server tries to cache the entire library, RAM efficiency drops, and disk I/O rises, killing throughput.
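Where an internal routing or origin-shield tier does exist, the idea can be expressed with Nginx's consistent-hash upstream; the hostnames below are placeholders:

```bash
cat > /etc/nginx/conf.d/shard-routing.conf <<'EOF'
# Pin each URI to one cache node so that node's RAM cache stays hot.
upstream edge_shards {
    hash $request_uri consistent;   # ketama-style consistent hashing
    server edge-01.internal:443;
    server edge-02.internal:443;
    server edge-03.internal:443;
}
EOF
```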
Common Bottlenecks That Kill High-Bandwidth Streaming
If you are stuck at 40-50Gbps, check these usual suspects:
- PCIe Oversubscription: Putting a 100GbE NIC and 4x NVMe drives on the same PCIe root complex without enough lanes. The CPU physically cannot move the data fast enough.
- SoftIRQ Saturation: One CPU core hitting 100% si usage. The system is dropping packets because it can’t process the interrupts. Solution: Tune RSS (Receive Side Scaling) to spread interrupts across more cores.
- Conntrack: The Linux connection tracking table (nf_conntrack) uses spinlocks. At millions of connections, this locking destroys performance. For pure streaming edges, disable connection tracking for ports 80/443 using iptables -t raw -j NOTRACK rules (see the sketch after this list).
- Memory Bandwidth: If you populate only 4 out of 8 memory channels on the motherboard, you halve the system’s memory bandwidth. 100Gbps requires reading/writing >25GB/s to RAM; starve the memory channels, and the NIC starves too.
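A sketch of the conntrack exemption mentioned above (the port list and chains are assumptions; adjust to your traffic profile):

```bash
# Exempt delivery traffic from conntrack in the raw table, both directions.
iptables -t raw -A PREROUTING -p tcp -m multiport --dports 80,443 -j NOTRACK
iptables -t raw -A OUTPUT     -p tcp -m multiport --sports 80,443 -j NOTRACK

# Confirm the table is no longer growing toward its limit.
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
```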
Observability for High-Throughput Systems
Traditional tools like htop are insufficient and misleading.
- Ethtool: Use ethtool -S eth0 to look for “hardware” drops (rx_missed_errors, rx_out_of_buffer). These indicate the NIC itself is overwhelmed or the PCIe bus is stalled.
- Dropwatch: Use dropwatch to monitor the kernel. It will pinpoint exactly which function in the kernel stack is freeing packets (e.g., udp_queue_rcv_skb vs nf_hook_slow).
- Perf: Use perf top to see where CPU time is spent in kernel functions. If you see heavy time in _raw_spin_lock, you have a lock-contention (configuration) problem, not a hardware limit.
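A starting command set (eth0 is an assumption; exact counter names vary by driver):

```bash
# Hardware-level drops: non-zero counters here point at the NIC or PCIe path.
ethtool -S eth0 | grep -E 'rx_missed|rx_out_of_buffer|rx_discard|dropped'

# Kernel drop locations: load kernel symbols, then type "start" at the prompt
# and let it run through a peak-traffic window.
dropwatch -l kas

# CPU hot spots in kernel code, with call graphs.
perf top -g
```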
Conclusion
Building 100Gbps streaming infrastructure in 2025 is an exercise in removing friction. It is not enough to have a wide pipe; you must ensure the data can flow through that pipe with minimal CPU intervention.
The formula for success is strict:
- Hardware: PCIe Gen4 x16 slots and strict NUMA alignment.
- Kernel: kTLS, BBR congestion control, and static interrupt coalescing.
- Application: Nginx with aio threads and Zero-Copy enabled.
Throughput is the result of a thousand small optimizations working in concert. Bottleneck prevention starts with the architectural decision to treat the server not as a computer, but as a high-velocity data forwarding plane.
FAQs
Can a single server really sustain 100Gbps of streaming delivery?
Yes, but only with optimized payload sizes (video segments). If the traffic consists of tiny packets (e.g., extensive metadata or API calls), the CPU will hit a Packet-Per-Second limit long before filling the 100Gbps pipe.
Why do streams stutter even when bandwidth headroom is available?
Usually due to “Head-of-Line Blocking” in the hardware or software queues. If a single CPU core handling a specific RX queue gets saturated (100% SoftIRQ), it drops packets for all flows assigned to that queue, triggering TCP congestion controls that slow down the entire stream.
What matters more: the NIC or the CPU?
A fast NIC is useless if the CPU cannot handle the interrupt rate. Conversely, a fast CPU is useless if the NIC cannot offload TLS or steer flows effectively. The bottleneck is almost always the interaction (PCIe/Memory/IRQ) between the two.
When should you scale out to multiple nodes instead of pushing one server harder?
When you exceed the reliability domain of a single hardware unit. Even if you can push 200Gbps from one box, you shouldn’t. Scaling horizontally (e.g., 2x 100Gbps nodes) provides redundancy and better cache surface area than a single “super-node.”
Are Jumbo Frames worth enabling?
For internal storage traffic (Origin -> Edge), yes. They reduce CPU load significantly. For Edge -> Client (Internet) traffic, no. You cannot control the MTU of the public internet, and fragmentation will cause performance to be worse than standard 1500-byte frames.