Over the last several years, Linux has gained acceptance as the operating
system of choice in many scientific and commercial environments. Its
performance has improved significantly relative to traditional UNIX
flavors, particularly on smaller SMP systems with up to 4 processors.
Recently, there has been an increased emphasis on Linux performance in mid
to high-end enterprise-class environments, consisting of SMP systems that
are configured with up to 64 CPUs. Scalability and performance of Linux 2.6
are therefore paramount for applications that have to scale to high CPU
counts on such large systems. This article highlights some of the
performance and scalability improvements of the Linux 2.6 kernel.
The Virtual Memory (VM) Subsystem
Most modern computer architectures support more than one memory page size.
The IA-32 architecture, for example, supports either 4KB or 4MB pages. The
2.4 Linux kernel utilized large pages only for mapping the kernel image.
Large page usage is primarily intended to improve performance for high
performance computing applications and for database applications that have
large working sets, but any memory access intensive application that
utilizes large amounts of virtual memory may benefit. Linux 2.6 can utilize
2MB or 4MB large pages, AIX uses 16MB large pages, and Solaris large pages
are 4MB in size. The large page performance improvements are attributable
to fewer translation lookaside buffer (TLB) misses, as each TLB entry maps
a larger region of memory. Large pages further improve memory prefetching
by eliminating the need to restart prefetch operations on 4KB boundaries.
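To make the TLB argument concrete, the following minimal sketch computes how
much memory a TLB can cover without a miss (its "reach") for different page
sizes. The 64-entry TLB and the 1GB working set are assumptions chosen for
illustration only; they are not figures from this article, and real TLBs
often keep separate entries for small and large pages.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Bytes of memory that can be addressed without taking a TLB miss."""
    return entries * page_size


def pages_needed(working_set, page_size):
    """Number of pages (rounded up) required to map a working set."""
    return -(-working_set // page_size)


if __name__ == "__main__":
    entries = 64               # hypothetical TLB size (assumption)
    working_set = 1024 * MB    # hypothetical 1GB working set (assumption)
    for size in (4 * KB, 2 * MB, 4 * MB):
        print(size // KB, "KB pages:",
              tlb_reach(entries, size) // KB, "KB reach,",
              pages_needed(working_set, size), "pages to map the working set")
```

Under these assumptions, moving from 4KB to 4MB pages raises the reach by a
factor of 1024, which is why applications with large working sets see fewer
TLB misses.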
CPU Scheduler
The Linux 2.6 scheduler is a multi-queue scheduler that assigns a run-queue
to each CPU, promoting a local scheduling approach. The previous
incarnation of the Linux scheduler utilized the concept of goodness to
determine which thread to execute next, and kept all runnable tasks on a
single run-queue that represented a linked list of threads. In Linux 2.6,
the single run-queue lock was replaced with a per-CPU lock, ensuring better
scalability on SMP systems. The new per-CPU run-queue scheme decomposes the
run-queue into a number of buckets (in priority order) and utilizes a
bitmap to identify the buckets that hold runnable tasks. Locating the next
task to execute requires only reading the bitmap to identify the first
bucket with runnable tasks and then choosing the first task in that
bucket's run-queue.
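The bucket-and-bitmap lookup described above can be sketched as follows.
This is a user-space illustration, not kernel code: the class and method
names are hypothetical, and the 140 priority levels (with 0 as the highest
priority) are an assumption that the text above does not state.

```python
from collections import deque

NUM_PRIORITIES = 140  # assumed number of priority levels; 0 is the highest


class RunQueue:
    """Per-CPU run-queue sketch: one FIFO bucket per priority plus a bitmap."""

    def __init__(self):
        self.buckets = [deque() for _ in range(NUM_PRIORITIES)]
        self.bitmap = 0  # bit p is set exactly when buckets[p] is non-empty

    def enqueue(self, task, prio):
        self.buckets[prio].append(task)
        self.bitmap |= 1 << prio

    def pick_next(self):
        """Return the first task of the highest-priority non-empty bucket."""
        if self.bitmap == 0:
            return None
        # Isolate the lowest set bit and convert it to a bucket index.
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        task = self.buckets[prio].popleft()
        if not self.buckets[prio]:
            self.bitmap &= ~(1 << prio)  # bucket drained, clear its bit
        return task
```

Because the lookup depends only on the fixed number of priority levels and
not on the number of runnable tasks, its cost stays constant as the load
grows.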
The Linux 2.6 environment also provides a Non Uniform Memory Access (NUMA)
aware extension to the new scheduler. The focus is on increasing the
likelihood that memory references are local rather than remote on NUMA
systems. The NUMA aware extension augments the existing CPU scheduler
implementation via a node-balancing framework.
Next to the preemptible kernel support in Linux 2.6, the Native POSIX
Threading Library (NPTL) represents the next generation POSIX threading
solution for Linux, and hence has received a lot of attention from the
performance community. The new threading implementation in Linux 2.6 has
several major advantages, such as in-kernel POSIX signal handling. A
well-designed multi-threaded application can further utilize fast user
space synchronization (futex), where uncontended lock operations complete
in user space and the kernel is only involved when threads have to block.
In contrast to Linux 2.4, the futex framework avoids a scheduling collapse
during heavy lock contention among different threads.
I/O Scheduling
The I/O scheduler in Linux is the interface between the generic block layer
and the low-level device drivers. The block layer provides functions that
are utilized by file systems and the virtual memory manager to submit I/O
requests to block devices. As prioritized resource management seeks to
regulate the use of a disk subsystem by an application, the I/O scheduler
is considered an important kernel component in the I/O path.
It is further possible to tune the disk usage in the kernel layers above
and below the I/O scheduler. Adjusting the I/O pattern generated by the
file system or the virtual memory manager (VMM) is one option. Another
option is to adjust the way specific device drivers or device controllers
handle the I/O requests. Further, a new read-ahead algorithm designed and
implemented by Dominique Heger and Steve Pratt for Linux 2.6 significantly
boosts read I/O throughput for all of the I/O schedulers discussed below.
The Deadline I/O scheduler available in Linux 2.6 incorporates a
per-request expiration based approach and operates on five I/O queues. The
basic idea behind the implementation is to aggressively reorder requests to
improve I/O performance while ensuring that no I/O request is starved. To
that end, the scheduler introduces the notion of a per-request deadline,
which is used to give read requests a higher preference than write
requests: reads are to be satisfied within a specified, short time period,
whereas writes are treated far more leniently (in the mainline 2.6
implementation they carry a much longer default expiration time than reads,
5 seconds versus 500 milliseconds). When the block device driver is ready
to launch another disk I/O request, the core algorithm of the deadline
scheduler is invoked. In simplified form, the first step is to check
whether I/O requests are waiting in the dispatch queue; if so, no
additional decision has to be made on what to execute next. Otherwise, a
new set of I/O requests has to be moved to the dispatch queue.
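The dispatch decision described above can be sketched as follows. This is a
deliberately reduced model, not the kernel implementation: it uses one
sector-sorted queue, one arrival-order (FIFO) queue and one dispatch queue
instead of five queues, the caller supplies each request's deadline, and the
class name, method names and batch size are hypothetical.

```python
import bisect
from collections import deque


class DeadlineSketch:
    """Reduced deadline scheduler: sorted queue, FIFO queue, dispatch queue."""

    def __init__(self, batch=2):
        self.sorted = []          # (sector, deadline) tuples in sector order
        self.fifo = deque()       # the same requests in arrival order
        self.dispatch = deque()   # requests already chosen for the driver
        self.batch = batch        # assumed number of requests moved per refill

    def add(self, sector, deadline):
        req = (sector, deadline)
        bisect.insort(self.sorted, req)
        self.fifo.append(req)

    def _refill(self, now):
        # If the oldest request has expired, start the batch at that request;
        # otherwise serve requests in sector order to keep seeks short.
        if self.fifo and self.fifo[0][1] <= now:
            start = self.sorted.index(self.fifo[0])
        else:
            start = 0
        for req in self.sorted[start:start + self.batch]:
            self.sorted.remove(req)
            self.fifo.remove(req)
            self.dispatch.append(req)

    def next_request(self, now):
        """Called when the driver is ready for another request."""
        if not self.dispatch:     # nothing pending, so move a new batch over
            self._refill(now)
        return self.dispatch.popleft() if self.dispatch else None
```

With no expired request the sketch serves the lowest sectors first; once the
request at the head of the FIFO queue has passed its deadline, it jumps the
queue regardless of its position on disk.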
The Anticipatory I/O scheduler's design attempts to reduce the per-thread
read response time. It introduces a controlled delay component into the
dispatching equation. The delay is invoked on any new request to the device
driver, thereby allowing a thread that just finished its I/O request to
submit a new request. Based on locality, this increases the chances that
the next request served is close to the previous one and therefore results
in smaller seek operations. The tradeoff between reduced seeks and
decreased disk utilization (due to the additional delay factor in
dispatching a request) is managed by utilizing a cost-benefit calculation
method.
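The tradeoff can be illustrated with a toy cost-benefit check. The formula
below is an assumption made for illustration and is not the calculation
used by the kernel: it compares the expected seek time saved against the
worst-case idle time, and the function and parameter names are
hypothetical.

```python
def should_anticipate(p_nearby, seek_saved_ms, wait_ms):
    """Return True if idling the disk is expected to pay off.

    p_nearby      -- estimated probability that the thread that just finished
                     issues a nearby request within the wait window
    seek_saved_ms -- seek time avoided if that nearby request arrives
    wait_ms       -- worst-case time the disk sits idle while waiting
    """
    expected_benefit = p_nearby * seek_saved_ms
    return expected_benefit > wait_ms
```

For a thread that almost always follows up with a nearby read, for example
should_anticipate(0.9, 8.0, 6.0), waiting is worthwhile; for a thread with
a random access pattern, such as should_anticipate(0.2, 8.0, 6.0), the
scheduler would be better off dispatching another request immediately.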
The Completely Fair Queuing (CFQ) I/O scheduler can be viewed as an
extension to the better known stochastic fair queuing (SFQ) scheduler
implementation. The focus of both implementations is on the concept of fair
allocation of I/O bandwidth among all the initiators of I/O requests. An
SFQ based scheduler design was initially proposed for some network
subsystems. The goal is to distribute the available I/O bandwidth as
equally as possible among the I/O requests.
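The fairness idea can be sketched as one queue per initiator that is served
round-robin. This is a minimal illustration under stated assumptions, not
the CFQ implementation: the class name, the use of a process id as the
initiator key, and the quantum parameter are all hypothetical.

```python
from collections import OrderedDict, deque


class FairQueueSketch:
    """One request queue per initiator, served in round-robin order."""

    def __init__(self):
        self.queues = OrderedDict()  # initiator id -> deque of requests

    def add(self, pid, request):
        self.queues.setdefault(pid, deque()).append(request)

    def dispatch_round(self, quantum=1):
        """Take up to `quantum` requests from each initiator in turn."""
        out = []
        for pid in list(self.queues):
            queue = self.queues[pid]
            for _ in range(min(quantum, len(queue))):
                out.append((pid, queue.popleft()))
            if not queue:
                del self.queues[pid]  # drop initiators with nothing queued
        return out
```

An initiator that floods the scheduler with requests only lengthens its own
queue; in every round it still receives the same share as an initiator with
a single pending request.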
The Linux 2.6 Noop I/O scheduler can be considered a minimal I/O scheduler
that performs basic merging and sorting functions. The noop scheduler is
mainly used with non-disk-based block devices such as memory devices, as
well as with specialized software or hardware environments that incorporate
their own I/O scheduling and caching functionality and therefore require
only minimal assistance from the kernel. Hence, for large-scale I/O
configurations that incorporate RAID controllers and many disk drives, the
noop scheduler has the potential to outperform the other three I/O
schedulers.
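The merging step can be illustrated with a FIFO queue that combines a new
request with the last queued request when the two are contiguous on the
device. This is a simplified sketch, not the kernel code: it only checks
the tail of the queue and leaves out sorting, and the function name and the
(start sector, length) request format are assumptions.

```python
def noop_add(queue, start, length):
    """Append a request to a FIFO queue, merging it with the tail if adjacent.

    Each request is a (start_sector, length_in_sectors) tuple.
    """
    if queue and queue[-1][0] + queue[-1][1] == start:
        last_start, last_len = queue[-1]
        queue[-1] = (last_start, last_len + length)  # back-merge
    else:
        queue.append((start, length))
    return queue
```

Two back-to-back requests for sectors 0-7 and 8-15 thus reach the device as
a single 16-sector request, while an unrelated request is simply appended
in arrival order.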
Conclusion
The Linux 2.6 kernel represents another evolutionary step forward and
builds upon its predecessors to boost application performance through
enhancements to the VM subsystem, the CPU scheduler, and the I/O
schedulers. In addition, this version of the kernel delivers important
functional enhancements in security, scalability, and networking. This
outline highlights only the major performance features in Linux 2.6. Please
visit the Fortuitous website at http://www.fortuitous.com for the full
article on Linux 2.6 performance enhancements. Fortuitous Technologies
provides high quality IT services, focusing on performance tuning and
capacity planning.