Read Book: Systems Performance 2nd edition
Notes while reading Systems Performance: Enterprise and the Cloud, 2nd Edition. This is still a WIP as I’m still reading.
I read the 1st edition 9 years ago. The 2nd edition has many updates, including more coverage of Linux (rather than Solaris) and cloud computing. I recommend reading it even if you’ve read the 1st edition.
1 Introduction
- 1.1 Systems Performance
-
Systems performance studies the performance of an entire computer system, including all major software and hardware components. Anything in the data path, from storage devices to application software, is included, because it can affect performance. For distributed systems this means multiple servers and applications.
- Write a diagram of system showing the data path
- understand relationships between components and don’t overlook entire areas
-
- 1.2 Roles
-
For some performance issues, finding the root cause or contributing factors requires a cooperative effort from more than one team.
-
- 1.5.1 Subjectivity
- Performance is often subjective
- 1.5.2 Complexity
-
Solving complex performance issues often requires a holistic approach
-
- 1.5.3 Multiple Performance Issues
-
the real task isn’t finding an issue; it’s identifying which issue or issues matter the most
-
- 1.10 Methodologies
-
Without a methodology, a performance investigation can turn into a fishing expedition
-
2 Methodologies
- experienced performance engineers understand which metrics are important, when they point to an issue, and how to use them to narrow down an investigation
- 2.3.1 Latency
-
The single word “latency” can be ambiguous; it is best to include qualifying terms
-
- 2.3.2 Time Scales
-
Have an instinct about time scales and reasonable expectations for latency from different sources
-
- 2.3.3 Trade Offs
- good/fast/cheap “pick two” trade-off
- in most cases, good and cheap are picked
- That choice can become problematic when architecture and tech stack choices don’t allow good performance.
- 2.3.4 Tuning Efforts
-
Performance tuning is most effective when done closest to where the work is performed.
-
Operating system performance analysis can also identify application-level issues, not just OS-level issues
-
- 2.3.6 When to Stop Analysis
- when major problems are solved
- potential ROI is less than the cost of analysis
- 2.3.7 Point-in-Time Recommendations
- Performance recommendations are valid only at a specific point in time.
- Workload changes and software/hardware changes alter performance characteristics.
- 2.3.9 Scalability
- fast degradation profile examples: memory load (i.e., moving memory pages to disk), disk I/O at high queue depth
- slow degradation profile examples: CPU load
- 2.3.11 Utilization
- Time-based utilization
- saturation may not occur at 100% time-based utilization, depending on the capacity for parallel work
- Capacity-based utilization
- 2.3.15 Known-Unknowns
-
The more you learn about systems, the more unknown-unknowns you become aware of.
- which are then known-unknowns that you can check on
-
- 2.4 Perspectives
- Resource analysis (bottom-up) vs workload analysis (top-down)
- typical metrics for resource analysis
- IOPS
- Throughput (e.g. bytes per second)
- Utilization
- Saturation
- typical metrics for workload analysis
- Throughput (e.g. transactions per second)
- Latency
- 2.5 Methodology
- Resist the temptation of anti-methods; start from more logical approaches like the problem statement, scientific method, and USE method
- 2.5.3 Blame-Someone-Else Anti-Method
- Be aware that use of this method may waste time and effort of other teams.
- A lack of data on which to base a hypothesis often leads to this method.
- 2.5.9 The USE Method
- measure utilization, saturation and errors (USE) for every resource
- listing resources and finding metrics can be time-consuming the first time, but it should be much faster the next time
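- A minimal USE checklist sketch with standard Linux tools (column names vary by tool version):
  mpstat -P ALL 1   # CPU utilization per CPU (watch for hot CPUs)
  vmstat 1          # CPU saturation: "r" run-queue length vs CPU count
  sar -B 1          # memory saturation: page scanning (pgscank/s)
  iostat -xz 1      # disk utilization (%util) and saturation (aqu-sz)
  sar -n DEV 1      # network throughput vs interface line rate
  sar -n EDEV 1     # network errors (rxerr/s, txerr/s)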
- 2.5.10 RED Method
- For every service, check Request rate, errors and duration (RED)
- USE and RED methods are complementary: USE method for machine health, RED method for user health
- 2.5.11 Workload Characterization
-
The best performance wins are the result of eliminating unnecessary work.
-
- 2.6 Modeling
- It’s critical to know where knee points exist and which resource is the bottleneck behind them, as this impacts system architecture design decisions.
- 2.8.5 Multimodal Distributions
- Averages are useful only for unimodal distributions; ask what the distribution looks like before using an average
- Latency metrics are often bimodal
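- For example, a latency histogram shows the distribution that an average hides; a bpftrace sketch, using vfs_read() as an arbitrary target:
  bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; }
    kretprobe:vfs_read /@start[tid]/ {
      @usecs = hist((nsecs - @start[tid]) / 1000);  # microsecond histogram
      delete(@start[tid]); }'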
3 Operating Systems
- 3.2.5 Clock and Idle
- clock() routine: updates the system time, maintains CPU statistics, etc. Executed from a timer interrupt.
- each execution is called a tick
- 3.2.9 Schedulers
- prioritize I/O-bound workloads over CPU-bound workloads
- 3.3 Kernels
- Kernel differences: file system support, system calls, network stack architecture, real-time support, scheduling algorithms for CPUs, disk I/O, networking
- 3.4.1 Linux Kernel Developments
- The multi-queue block I/O scheduler became the default in Linux 5.0, and classic schedulers like CFQ and deadline have been removed
5 Applications
- 5.1 Objectives
- Better observability enables seeing and eliminating unnecessary work, and better understanding and tuning of active work.
- It should be a major factor when choosing applications, middleware, languages, and runtimes.
- 5.2.5 Concurrency and Parallelism
- A Linux mutex is acquired via 3 paths: fastpath (cmpxchg), midpath (optimistic spinning), and slowpath (blocking)
- A hash table of locks is a design option to limit the number of locks while keeping locking fine-grained
- This also avoids the CPU overheads of creating and destroying many locks
- false sharing: two CPUs updating different locks in the same cache line
- incurs cache coherency overhead
- 5.3.1 Compiled Languages
- gcc applies some optimizations even at -O0
- 5.4 Methodology
- CPU / off-CPU profiling and thread state analysis can reveal how compute resources are used
- Distributed tracing is suggested as a last resort
- It appears the methodologies are described from the point of view of engineers who don’t have much context about the application.
- When application developers work on the analysis, the order in which to try methodologies may change
- 5.4.2 Off-CPU Analysis
- off-CPU sampling comes with major overhead
- It must sample the pool of threads rather than the pool of CPUs
- 5.5 Observability Tools
6 CPUs
- 6.3.8 Utilization
-
The measure of CPU utilization spans all clock cycles for eligible activities, including memory stall cycles.
- High CPU utilization doesn’t necessarily mean a CPU-bound workload.
- A CPU may be utilized while stalled waiting for memory I/O.
- High CPU utilization & high IPC suggests a CPU-bound workload.
-
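- IPC can be checked system-wide with perf(1), assuming PMC access (often restricted on cloud instances):
  perf stat -a -- sleep 10   # look at the "insn per cycle" line; low IPC
                             # suggests memory stalls, high suggests instruction-bound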
- 6.4.1 Hardware
- Handling of TLB misses is processor-dependent
-
Newer processors can service TLB misses in hardware.
-
- 6.4.2 Software
- In the completely fair scheduler (CFS), tasks are managed on a red-black tree keyed by task CPU time.
- 6.5.3 Workload Characterization
- High system time (time spent in the kernel) may be further understood via the syscall and interrupt rates.
-
I/O bound workloads have high system time, syscalls and higher voluntary context switches as threads block waiting for I/O.
- 6.5.4 Profiling
-
99 Hertz is used to avoid lock-step sampling that may occur at 100 Hertz, which would produce a skewed profile.
-
- 6.6.1 uptime
- Since 1993 on Linux, load averages show system-wide demand: CPUs, disks and other resources, not only CPU demand.
- Pressure Stall Information (PSI) was added in Linux 4.20
- available at /proc/pressure/cpu
- shows saturation of CPU, memory, and I/O
- The average shows the percent of time something was stalled on a resource
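- For example (avg10/avg60/avg300 are stall percentages over 10s/60s/300s windows; the values shown are illustrative):
  cat /proc/pressure/cpu
  # some avg10=2.50 avg60=1.10 avg300=0.30 total=1234567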
- 6.6.6 top
- CPU usage of top(1) itself can be significant
- due to the system calls used to read /proc (open(2), read(2), close(2)) over many processes.
- 6.6.13 perf
- Since Linux 5.8, CPU flame graphs can be generated from perf.data:
  perf record -F 99 -a -g -- sleep 10
  perf script report flamegraph
- 6.6.21 Other Tools
-
GPU profiling is different from CPU profiling, as GPUs do not have a stack trace showing code path ancestry.
-
Profilers instead can instrument API and memory transfer calls and their timing.
-
- 6.7.2 Subsecond-Offset Heat Map
-
CPU activity is typically measured in microseconds or milliseconds; reporting this data as averages over an entire second can wipe out useful information.
-
- 6.9.7 Exclusive CPU Sets
- Linux cpusets allow making a set of CPUs exclusive to specified processes.
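- A sketch using the cpuset pseudo filesystem (mount point, CPU number, and PID are illustrative):
  mkdir /dev/cpuset
  mount -t cpuset cpuset /dev/cpuset
  mkdir /dev/cpuset/prodset              # a cpuset named "prodset"
  echo 7 > /dev/cpuset/prodset/cpus      # assign CPU 7
  echo 1 > /dev/cpuset/prodset/cpu_exclusive
  echo 1159 > /dev/cpuset/prodset/tasks  # move the target PID in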
7 Memory
- 7.1 Terminology
- Swapping (in Linux): anonymous paging to the swap device (the transfer of swap pages)
- 7.2.1 Virtual Memory
- Oversubscribe vs overcommit
- oversubscribe: allows bounded allocation beyond main memory
- e.g. the size of main memory + swap device
- overcommit (Linux term): allows unbounded memory allocation
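- The Linux mode is selectable via sysctl (0 = heuristic overcommit, the default; 1 = always overcommit; 2 = strict limit using overcommit_ratio):
  sysctl vm.overcommit_memory
  sysctl vm.overcommit_ratio   # only used in mode 2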
- 7.2.2 Paging
- File system paging is caused by read/write of pages in memory-mapped files.
- normal behavior for applications that use mmap(2) and file systems that use the page cache.
- Page-out: a page was moved out of memory.
- may or may not include a write to a storage device
- Anonymous page-outs: require moving the data to the physical swap devices.
- 7.2.3 Demand Paging
- Minor fault: physical memory mapping can be satisfied from another page in memory
- e.g. memory growth of the process, mapping to another existing page, such as reading a page from a mapped shared library.
- Major fault: require storage device access
- 7.2.5 Process Swapping
-
Linux systems do not swap processes at all and rely only on paging.
-
- 7.2.9 Shared Memory
- Proportional set size (PSS): private memory + shared memory divided by the number of users
- 7.3.1 Hardware
- Column address strobe (CAS): time between sending a memory module the desired address (column) and when the data is available to be read.
- depends on the type of memory, e.g., DDR4, DDR5
-
For memory I/O transfers, this latency may occur multiple times for a memory bus (e.g., 64 bits wide) to transfer a cache line (e.g., at 64 bytes wide).
- There are also other latencies involved with the CPU and MMU
- 7.3.2 Software
- swappiness: the degree to which the system should favor freeing memory from the page cache instead of swapping
- 0 means always prefer freeing the page cache
- controls the balance of how much warm file system cache should be preserved
- Without swap, there is no paging grace period.
- the application hits an OOM error, or the OOM killer terminates it.
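- Checking and tuning swappiness (60 is the typical default; the right value is workload-dependent):
  sysctl vm.swappiness
  sysctl -w vm.swappiness=10   # lower: prefer freeing page cache over swapping anonymous memory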
- Linux uses the buddy allocator for managing pages
- Multiple free lists for different sized memory allocations
- Page Scanning: on Linux, the page-out daemon is called “kswapd”
- Scans LRU page lists of inactive and active memory to free pages
- 7.3.3 Process Virtual Address Space
- For simple allocators, free(3) does not return memory to OS
- memory is kept to serve future allocations
- Process resident memory may only grow and never shrink; this can be normal
- Memory allocators
- glibc: behavior depends on allocation request size.
- small allocations are served from bins of memory, buddy-like algorithm
- large allocations can use a tree lookup to find space efficiently
- jemalloc: uses techniques such as multiple arenas, per-thread caching, and small object slabs
- improve scalability, reduce memory fragmentation
- 7.4.2 USE Method
-
Saturation: The degree of page scanning, paging, swapping, and Linux OOM killer sacrifices performed, as measures to relieve memory pressure.
-
Historically, memory allocation errors have been left for the applications to report
-
- 7.5.4 sar (system activity reporter)
-
To understand these in deeper detail, you may need to use tracers to instrument memory tracepoints and kernel functions, such as perf(1) and bpftrace
- %vmeff: measure of page reclaim efficiency
- High means pages are successfully stolen from the inactive list
- Low means the system is struggling
- The man page describes near 100% as high, less than 30% as low
-
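- For example (column names from recent sysstat versions):
  sar -B 1
  # pgscank/s, pgscand/s : pages scanned by kswapd / directly
  # pgsteal/s            : pages reclaimed
  # %vmeff               : pgsteal/pgscan, the reclaim efficiency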
- 7.6.2 Multiple Page Sizes
- Transparent huge pages (THP): use huge pages by automatically promoting and demoting normal pages to huge
- application doesn’t need to specify huge pages
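- The current THP mode is visible in sysfs (madvise limits THP to regions that request it via madvise(2)):
  cat /sys/kernel/mm/transparent_hugepage/enabled
  # e.g.: always [madvise] never   (brackets mark the active mode)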
8 File Systems
- 8.3.1 File System Latency
-
Operating systems have not historically made the file system latency readily observable, instead providing disk device-level statistics
- There are many cases where disk metrics are unrelated to application performance
- e.g. file systems perform background flushing of written data
- causes bursts of high-latency disk I/O
- however, no application is waiting on them
-
- 8.3.2 Caching
-
File systems use caching to improve read performance, and buffering (in the cache) to improve write performance.
-
- 8.3.4 Prefetch
- File systems allow tuning prefetch behavior
- 8.3.5 Read-Ahead
-
Historically, prefetch has also been known as read-ahead. Linux uses the read-ahead term for a system call, readahead(2), that allows applications to explicitly warm up the file system cache.
-
- 8.3.8 Raw and Direct I/O
- Raw I/O: issued directly to disk offsets, bypassing the file system.
- Direct I/O: bypasses the file system cache, while still using the file system.
- mapping of file offsets to disk offsets must still be performed
- I/O may be resized to match the size of on-disk layout
- depending on the file system, this may disable not only read caching and write buffering but also prefetch
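- For example, dd(1) can issue direct I/O (paths are illustrative):
  dd if=/mnt/fs/testfile of=/dev/null bs=4k iflag=direct           # reads bypassing the page cache
  dd if=/dev/zero of=/mnt/fs/testfile bs=1M count=100 oflag=direct # direct writes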
- 8.3.10 Memory-Mapped Files
- syscall execution and context switch overheads for read(2), write(2) can be avoided
- double copying of data can also be avoided, if the kernel supports directly mapping the file data buffer to the process address space
- It’s not effective when high disk I/O latency is dominant.
- 8.3.12 Logical vs. Physical I/O
- File systems can cause differences between logical and physical I/O, e.g.:
- cache read
- buffer write
- map files to address spaces
- create additional I/O to maintain the on-disk physical layout metadata
- journaling
- cause disk I/O that is unrelated, indirect, implicit, inflated, or deflated as compared to application I/O.
- 8.3.15 Access Timestamps
- access time updates amplify write I/Os
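- The usual mitigations are mount options (relatime is the Linux default; noatime disables access-time updates entirely; mount point illustrative):
  mount -o remount,noatime /mnt/data
  grep /mnt/data /proc/mounts   # verify the active options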
- 8.3.16 Capacity
- Low free file system capacity causes more CPU time and disk I/O to be spent finding free blocks
- Also, free blocks on disk are likely to be smaller and sparsely located; smaller or random I/O degrades performance.
- 8.4.2 VFS
-
The terminology used by the Linux VFS interface can be a little confusing, since it reuses the terms inodes and superblocks to refer to VFS objects—terms that originated from Unix file system on-disk data structures.
-
- 8.4.3 File System Caches
- Linux nowadays has multiple cache types
- Buffer Cache
- unified buffer cache: store buffer cache in page cache to avoid double caching and synchronization overhead.
- used since Linux 2.4
- Page Cache
- caches virtual memory pages including mapped file system pages
- more efficient for file access than the buffer cache, which required a translation from file offset to disk offset for each lookup
- Since Linux 2.6.32, per-device flusher threads (named flush) are used
- Dentry Cache (Dcache)
- remembers mapping from directory entry (struct dentry) to VFS inode
- improves path name lookups
- has negative caching
- failed lookups commonly occur when searching for shared libraries
- Inode Cache
- caches VFS inodes (struct inode), typically returned via the stat(2) syscall
- frequently accessed for checking permissions, updating timestamps during modification
- 8.4.5 File System Types
- Berkeley fast file system (FFS)
- origin of many file systems
- improved performance by splitting the partition into numerous cylinder groups
- block interleaving: placing sequential file blocks on disk with a spacing between them of one or more blocks.
- ext4
- Preallocation: via the fallocate(2) syscall, applications can preallocate space that is likely to be contiguous
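- The same is available from the shell via fallocate(1) (path and size illustrative):
  fallocate -l 1G /data/prealloc.img   # reserve 1 GiB, likely contiguous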
- zfs
- Pooled storage: allows all assigned devices to be used in parallel
- Snapshots: Due to COW architecture, snapshots can be created nearly instantaneously
- Data deduplication
9 Disks
- 9.2.1 Simple Disk
- On-disk controller may process I/O queue by non-FIFO algorithm to optimize performance
- Elevator seeking for rotational disks
- Separate queues for read and write I/O, especially for flash memory-based disks
- 9.2.2 Caching Disk
-
The on-disk cache may also be used to improve write performance, by using it as a write-back cache.
-
- 9.3.1 Measuring Time
- Block I/O wait time: time from I/O creation and insertion into a kernel I/O queue until it left the final kernel queue and was issued to the disk
- may span multiple kernel-level queues, e.g. block I/O layer queue, disk device queue
- 3 possible driver layers may implement their own queue, or may block on mutexes
- 9.3.2 Time Scales
-
These latencies may be interpreted differently based on the environment requirements. While working in the enterprise storage industry, I considered any disk I/O taking over 10 ms to be unusually slow and a potential source of performance issues.
-
- 9.3.4 Random vs. Sequential I/O
-
Sometimes random I/O isn’t identified by inspecting the offsets but may be inferred by measuring increased disk service time.
-
- 9.3.5 Read/Write Ratio
-
The reads and writes may themselves show different workload patterns: reads may be random I/O, while writes may be sequential (especially for copy-on-write file systems).
-
They may also exhibit different I/O sizes.
-
- 9.3.7 IOPS Are Not Equal
- IOPS must be qualified with details (a fio sketch follows this list):
- random or sequential
- I/O size
- read or write
- buffered or direct
- number of I/O in parallel
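- A fio(1) sketch that states these details explicitly (file name and sizes illustrative):
  fio --name=randread --filename=/data/fio.test --size=1G \
      --rw=randread --bs=4k --ioengine=libaio --iodepth=16 \
      --direct=1 --time_based --runtime=30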
- 9.3.9 Utilization
-
but know nothing about the performance of the underlying disks upon which it is built. This leads to scenarios where virtual disk utilization, as reported by the operating system, is significantly different from what is happening on the actual disks (and is counterintuitive):
-
- 9.3.10 Saturation
-
50% disk utilization during an interval may mean 100% utilized for half that time and idle for the rest. Any interval summary can suffer from similar issues.
-
- 9.3.11 I/O Wait
-
I/O wait can be a very confusing metric. If another CPU-hungry process comes along, the I/O wait value can drop: the CPUs now have something to do, instead of being idle.
- The time that application threads are blocked on disk I/O is a more reliable metric.
-
- 9.4.1.1 Magnetic Rotational
- Short-Stroking: use only the outer tracks of disk for the workload
-
This reduces seek time as head movement is bounded by a smaller range, and the disk may put the heads at rest at the outside edge, reducing the first seek after idle.
- The remainder is either unused or used for low-throughput workloads
-
- Elevator Seeking: reorder I/O based on their on-disk location
- minimize travel of the disk heads
- Sloth Disks: sometimes return very slow I/O, over one second, without any reported errors
- 9.4.1.2 Solid-State Drives
-
when writing I/O sizes that are smaller than the flash memory block size (which may be as large as 512 Kbytes). This can cause write amplification, where the remainder of the block is copied elsewhere before erasure, and also latency for at least the erase-write cycle.
-
- 9.4.2 Interfaces
- SAS (Serial Attached SCSI)
-
Other SAS features include dual porting of drives for use with redundant connectors and architectures, I/O multipathing, SAS domains, hot swapping, and compatibility support for SATA devices. These features have made SAS popular for enterprise use, especially with redundant architectures.
-
- FC
-
FC is commonly used in enterprise environments to create storage area networks (SANs) where multiple storage devices can be connected to multiple servers via a Fibre Channel Fabric. This offers greater scalability and accessibility than other interfaces, and is similar to connecting multiple hosts via a network.
-
- 9.4.3 Storage Types
- Software RAID reduces complexity and hardware cost, and improves observability from the OS
- Read-Modify-Write
-
When data is stored as a stripe including a parity, as with RAID-5, write I/O can incur additional read I/O and compute time. This is because writes that are smaller than the stripe size may require the entire stripe to be read, the bytes modified, the parity recalculated, and then the stripe rewritten.
-
- Advanced disk controller cards can provide advanced features that can affect performance
- 9.4.4 Operating System Disk I/O Stack
- I/O merging: statistics for front and back merges are available in iostat(1)
- rqm/s, rrqm/s, wrqm/s, drqm/s, etc
- I/O Schedulers
- The multi-queue driver (blk-mq, added in Linux 3.13) uses separate submission queues for each CPU, and multiple dispatch queues for the devices
- The classic scheduler used a single request queue, protected by a single lock
- performance bottleneck at high I/O rate
-
Kyber has shown improved storage I/O latencies in the Netflix cloud, where it is used by default.
- 9.5.2 USE Method
-
the observability tools (e.g., Linux iostat(1)) do not present per-controller metrics but provide them only per disk. There are workarounds for this: if the system has only one controller, you can determine the controller IOPS and throughput by summing those metrics for all disks.
-
- 9.6.1 iostat
-
For resource usage and capacity planning, %util is important, but bear in mind that it is only a measure of busyness (non-idle time) and may mean little for virtual devices backed by multiple disks.
-
Nonzero counts in the rqm/s column show that contiguous requests were merged before delivery to the device, to improve performance. This metric is also a sign of a sequential workload.
-
The discard and flush statistics are new additions to iostat(1). Discard operations free up blocks on the drive (the ATA TRIM command), and their statistics were added in the Linux 4.19 kernel. Flush statistics were added in Linux 5.5. These can help to narrow down the reason for disk latency.
-
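- A useful variant (column names vary with sysstat version):
  iostat -dxz 1   # -d disks, -x extended stats, -z skip idle devices
  # r/s, w/s       : IOPS;  rrqm/s, wrqm/s : merged requests
  # aqu-sz         : average queue length (saturation clue)
  # %util          : busyness; weak for multi-disk virtual devices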
- 9.6.4 pidstat
-
Some time later the page cache was flushed, as can be seen in the second interval output by the kworker/u4:1-flush-259:0 process, which experienced iodelay.
-
- 9.6.5 perf
-
Often I/O will be queued and then issued later by a kernel thread, and tracing the block:block_rq_issue tracepoint will not show the originating process or user-level stack trace. In those cases you can try tracing block:block_rq_insert instead, which is for queue insertion. Note that it misses I/O that did not queue.
-
Disk I/O latency (described earlier as disk request time) can also be determined by recording both the disk issue and completion events for later analysis.
-
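- A sketch of recording queue insertions with originating stacks:
  perf record -e block:block_rq_insert -a -g -- sleep 10   # system-wide, with stacks
  perf script   # note: misses I/O that never queued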
- 9.6.8 biotop
-
By the time disk I/O is issued to the device, the requesting process may no longer be on CPU, and identifying it can be difficult. biotop(8) uses a best-effort approach: the PID and COMM columns will usually match the correct process, but this is not guaranteed.
-
- 9.6.9 biostacks
-
I have seen cases where there was mysterious disk I/O without any application causing it. The reason turned out to be background file system tasks.
-
- 9.6.14 SCSI Logging
-
Linux has a built-in facility for SCSI event logging.
-
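- It is controlled by a sysctl whose value is a bitmask of per-facility log levels (the value 1 is just a minimal illustration):
  sysctl -w dev.scsi.logging_level=1
  dmesg                                # SCSI events appear in the kernel log
  sysctl -w dev.scsi.logging_level=0   # disable again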
- 9.9 Tuning
-
While changing tunables can be easy to do, the default settings are usually reasonable and rarely need much adjusting.
-
- 9.9.1 Operating System Tunables
-
For Linux, the container groups (cgroups) block I/O (blkio) subsystem provides storage device resource controls for processes or process groups. This can be a proportional weight (like a share) or a fixed limit. Limits can be set for read and write independently, and for either IOPS or throughput (bytes per second).
-
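- The cgroup v2 equivalent is io.max (cgroup path and device numbers illustrative; 8:0 would be sda):
  # limit the group to 10 MB/s writes and 1000 read IOPS on device 8:0
  echo "8:0 wbps=10485760 riops=1000" > /sys/fs/cgroup/mygroup/io.max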
10 Network
-
The network is often blamed for poor performance given the potential for congestion and its inherent complexity (blame the unknown).
- 10.1 Terminology
- Network latency can refer to the message round trip time, or the time to establish a connection (e.g. TCP handshake), excluding the data transfer time that follows.
- 10.3.4 Packet size
-
Packet size is usually limited by the network interface maximum transmission unit (MTU) size, which for many Ethernet networks is configured to be 1,500 bytes.
-
Ethernet now supports larger packets (frames) of up to approximately 9,000 bytes, termed jumbo frames. These can improve network throughput performance, as well as the latency of data transfers, by requiring fewer packets.
- Many systems stick to the 1500 MTU default
- some firewall administrators have blocked all ICMP to avoid ICMP-based attacks
- This prevents the ICMP “can’t fragment” message, causing packets to be silently dropped once the packet size increases beyond 1500.
-
If the ICMP message is received and fragmentation occurs, there is also the risk of fragmented packets getting dropped by devices that do not support them.
-
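- Checking and testing the MTU (interface and host are illustrative; 8972 = 9000 minus 28 bytes of IP+ICMP headers):
  ip link show dev eth0            # current MTU
  ip link set dev eth0 mtu 9000    # jumbo frames, if supported end to end
  ping -M do -s 8972 remotehost    # -M do forbids fragmentation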
- 10.3.5 Latency
- Ping latency may not exactly reflect the round-trip time of application requests
- ICMP may be handled with a different priority by routers.
- Connection latency exercises more kernel code to establish a connection, and includes time to retransmit any dropped packets.
- TCP SYN packet can be dropped by the server if its backlog is full.
- First-byte latency includes the think time of the target server
- increases if the server is overloaded or takes time to schedule the request
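- These latencies can be separated with ping(8) for RTT plus curl(1) timings for connection vs first byte (URL illustrative):
  curl -o /dev/null -s -w 'connect=%{time_connect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
      https://example.com/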
- 10.3.6 Buffering
-
Buffering can also be performed by external network components, such as switches and routers, in an effort to improve their own throughput. Unfortunately, the use of large buffers on these components can lead to bufferbloat, where packets are queued for long intervals. This causes TCP congestion avoidance on the hosts, which throttles performance.
-
- 10.3.10 Utilization
-
Given variable bandwidth and duplex due to autonegotiation, calculating this isn’t as straightforward as it sounds.
-
- 10.4.1 Protocols
- Important topics for TCP performance
- three-way handshake
- duplicate ACK detection
- congestion control algorithms
- Nagle
- delayed ACKs
- SACK and FACK
-
A session that has fully closed enters the TIME_WAIT state so that late packets are not mis-associated with a new connection on the same ports.
- This can lead to a performance issue of port exhaustion
- QUIC is built upon UDP, and provides several features on top of it.
- 10.4.2 Hardware
-
Most interfaces have separate channels for transmit and receive, and when operating in full-duplex mode, each channel’s utilization must be studied separately.
-
The use of extended BPF to implement firewalls on commodity hardware is growing, due to its performance, programmability, ease of use, and final cost.
-
- 10.4.3 Software
- TCP Connection Queues: Bursts of inbound connections are handled by using backlog queues
- queue for incomplete connections while the TCP handshake completes (also known as the SYN backlog)
- queue for established sessions waiting to be accepted by the application
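- Both queue maximums are tunable, and listener queue usage can be watched with ss(8):
  sysctl net.ipv4.tcp_max_syn_backlog   # SYN backlog
  sysctl net.core.somaxconn             # accept (listen) backlog limit
  ss -lnt   # listeners: Recv-Q = current accept queue, Send-Q = its limit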
- TCP Buffering
-
The Linux kernel will also dynamically increase the size of these buffers based on connection activity, and allows tuning of their minimum, default, and maximum sizes.
-
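- The tunables take min/default/max triplets, e.g.:
  sysctl net.ipv4.tcp_rmem   # e.g. "4096 131072 6291456" (min default max)
  sysctl net.ipv4.tcp_wmem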
- Queueing Discipline
- An optional layer for managing traffic classification (tc), scheduling, manipulation, filtering, and shaping of network packets.
- Network Device Drivers
- interrupt coalescing mode: instead of interrupting the kernel for every arriving packet, an interrupt is sent only when a timer fires (polling) or a certain number of packets have arrived.
-
The Linux kernel uses a new API (NAPI) framework that uses an interrupt mitigation technique: for low packet rates, interrupts are used (processing is scheduled via a softirq); for high packet rates, interrupts are disabled, and polling is used to allow coalescing
-
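- Coalescing is visible and tunable with ethtool(8) (interface illustrative; supported parameters depend on the driver):
  ethtool -c eth0                 # show coalescing settings
  ethtool -C eth0 rx-usecs 100    # interrupt at most every 100 us on receive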
RSS: Receive Side Scaling: For modern NICs that support multiple queues and can hash packets to different queues, which are in turn processed by different CPUs, interrupting them directly.
-
Without a CPU load-balancing strategy for network packets, a NIC may interrupt only one CPU, which can reach 100% utilization and become a bottleneck. This may show up as high softirq CPU time on a single CPU
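- A single hot CPU handling network softirqs can be spotted with standard tools:
  mpstat -P ALL 1        # look for one CPU with high %soft / %irq
  cat /proc/interrupts   # per-CPU interrupt counts, including NIC queues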
- Kernel Bypass: The expense of copying packet data can be avoided by directly accessing memory on the NIC.
- 10.5.1 Tools Method
-
an Internet-facing system with unreliable remote clients should have a higher retransmit rate than an internal system with clients in the same data center.
-
- 10.5.2 USE Method
-
Saturation of the network interface is difficult to measure. Some network buffering is normal, as applications can send data much more quickly than an interface can transmit it. It may be possible to measure as the time application threads spend blocked on network sends, which should increase as saturation increases.
-
- 10.5.4 Latency Analysis
-
SO_TIMESTAMPING can identify transmission delays, network round-trip time, and inter-stack latencies; this can be especially helpful when analyzing complex packet latency involving tunneling
-
- 10.5.6 Packet Sniffing
-
packet capture implementations commonly allow a filtering expression to be supplied by the user and perform this filtering in the kernel. This reduces overhead by not transferring unwanted packets to user level. The filter expression is typically optimized using Berkeley Packet Filter (BPF), which compiles the expression to BPF bytecode that can be JIT-compiled to machine code by the kernel.
-
- 10.6.1 ss
-
Similar per-socket information is available using the older netstat(8) tool. ss(8), however, can show much more information when using options.
- ss doesn’t show the age of connections, which is needed to calculate average throughput
-
ss(8) reads these extended details from the netlink(7) interface, which operates via sockets of family AF_NETLINK to fetch information from the kernel.
-
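- A useful option combination:
  ss -tiepm
  # -t TCP, -i internal TCP info (rtt, cwnd, retransmits),
  # -e extended details, -p owning process, -m socket memory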
- 10.6.5 netstat
-
Some of the statistic names include typos (e.g., packetes rejected). These can be problematic to simply fix, if other monitoring tools have been built upon the same output.
- /proc/net/snmp and /proc/net/netstat are better data sources for tools
-
- 10.6.7 nicstat
-
nicstat(1) prints network interface statistics, including throughput and utilization.
-
- 10.6.11 tcpretrans
-
Packet-capture can only see details that are on the wire, whereas tcpretrans(8) prints the TCP state directly from the kernel, and can be enhanced to print more kernel state if needed.
-
- 10.7.1 ping
-
Newer kernels and ping(8) versions use kernel timestamp support (SIOCGSTAMP or SO_TIMESTAMP) to improve the accuracy of the reported ping times.
-
- 10.7.2 traceroute
- A hop may not return ICMP at all, or ICMP may be blocked by a firewall.
- A workaround is switching to TCP using the -T option.
- 10.8.1 System-Wide
- The max socket buffer size may need to be set to 16MB or higher to support full-speed 10GbE connections.
net.core.rmem_max = 16777216 and net.core.wmem_max = 16777216
-
The Tuned Project provides automatic tuning for some of these tunables based on selectable profiles, and supports Linux distributions
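- Set at runtime with sysctl(8) (persist via /etc/sysctl.conf):
  sysctl -w net.core.rmem_max=16777216
  sysctl -w net.core.wmem_max=16777216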
11 Cloud Computing
- 11.1.3 Capacity Planning
- Cloud computing frees people from strict capacity planning for purchasing the right hardware.
- For growing startups, it’s particularly difficult to estimate because demand changes more aggressively and the pace of code changes is high.
- 11.1.6 Orchestration (Kubernetes)
-
Performance challenges in Kubernetes include scheduling, and network performance, as extra components are used to implement container networking and load balancing.
-
- 11.2.2 Overhead
-
Understanding when and when not to expect performance overhead from virtualization is important
- The guest applications execute directly on the processors, so CPU-bound applications may experience almost the same performance as a bare-metal system.
-
CPU overheads may be encountered when making privileged processor calls, accessing hardware, and mapping main memory
- The mapping from guest virtual memory to host physical memory is cached in the TLB.
- The storage architecture may also lead to double caching, i.e. caching on both host and guest.
-
- 11.2.3 Resource Controls
-
A guest’s CPU usage is typically opaque to the hypervisor, and guest kernel thread priorities cannot typically be seen or respected.
-
In the Amazon EC2 cloud, network I/O and disk I/O to network-attached devices are throttled to quotas using external systems.
-
- 11.2.4 Observability
-
From the guest, physical resource usage may not be observable at all.
- vmstat(8) includes CPU percent stolen (st): CPU time not available to the guest, which may be consumed by other tenants or other hypervisor functions.
- Disk and network resource contention may be identified by careful analysis of I/O patterns and latency outliers.
-