LinuxHPC: What is the architecture of the Altix systems, and were any additional components used in the NASA installation?
Jason: The SGI Altix 3000 server uses a near-uniform memory access architecture first introduced by SGI in 1997 in the Origin 2000 server, which utilized the 64-bit MIPS processor and SGI’s UNIX-based IRIX operating system. SGI updated this architecture again in 2000 when we introduced the Origin 3000 servers, which deploy a modular “brick”-based architecture that allows users to add memory, processors, and additional system I/O independently, giving them the flexibility to build systems that fit their applications’ needs exactly. In 2003, SGI introduced the SGI Altix 3000, which updated this well-proven architecture by offering the Intel Itanium 2 processor and scalable, production-ready Linux. Both the SGI Origin and SGI Altix systems are built around our memory channel interconnect called NUMAlink, a flexible switch-based fabric made up of routers and cables that acts as the system’s backplane. Each NUMAlink 3 connection has an aggregate bi-directional bandwidth of 3.2GB/second. This fabric of routers and cables allows SGI Altix 3000 hardware to be configured with up to 512 processors and 8 terabytes of cache-coherent memory today. This capability is part of the standard system design, so no special hardware was required for the NASA system.
Figure 1 below illustrates the architecture of each ‘node’ of an Altix system.
Figure 1 – Altix C-Brick Schematic
LinuxHPC: How much of the system infrastructure is shared between the Origin and Altix systems?
Jason: The entire system infrastructure is shared with the exception of the processors and memory controller ASIC. Unfortunately, Altix and Origin hardware cannot be part of the same NUMAlink-connected fabric because of processor endianness, hardware discovery, and operating system differences.
LinuxHPC: What is the maximum number of processors that are used in a single system image (SSI)?
Jason: Today, the supported SSI on Altix is 64 processors, which is a wonderful technical achievement for the open source community and our engineers, but as you can see from our proof-of-concept demonstration with NASA, we hope to work with the Linux community to achieve much more. In fact, over the last 5 months SGI has been working with a set of our users to beta test the reliability of a standard 128-processor SSI. This beta test has exceeded our expectations, and we will start supporting at least 128-processor scalability for all Altix users when we release SGI ProPack 2.4 in February.
LinuxHPC: What is the latency for a single hop across the NUMAflex fabric, between two SHUBs?
Jason: Within an Altix C-brick there are physically two separate “nodes”, each consisting of two processors, local memory, and a SHUB. Hardware latency to local memory is on the order of 145 nanoseconds. Within the same C-brick, an access to the other node crosses two SHUBs and the remote memory latency increases to 275 nanoseconds. As Altix is expanded to 512 processors, latency increases in a near-uniform fashion. By utilizing a dual-plane fat-tree topology we minimize the number of router hops and the latency induced by the interconnect fabric. Worst-case latency in a 512-processor system, for example from processor 1 to processor 512, is 800 nanoseconds under NUMAlink 3. When the NUMAlink 4 router is introduced, this worst-case number will drop by approximately 19%. Job placement software that enforces data locality, which is included in our SGI ProPack software, helps applications avoid these worst-case latencies.
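Editor's note: the latency figures quoted above lend themselves to a quick back-of-the-envelope calculation. The sketch below only restates the numbers from the answer; the two-level average model and the function names are illustrative assumptions, not an SGI formula.

```python
# Latency figures quoted in the interview, in nanoseconds.
LOCAL_NS = 145.0               # local memory
REMOTE_SAME_BRICK_NS = 275.0   # other node in the same C-brick
WORST_CASE_NL3_NS = 800.0      # 512-processor system under NUMAlink 3
NL4_REDUCTION = 0.19           # approximate drop quoted for NUMAlink 4


def projected_worst_case_nl4_ns() -> float:
    """Worst-case latency after the roughly 19% reduction quoted for NUMAlink 4."""
    return WORST_CASE_NL3_NS * (1.0 - NL4_REDUCTION)


def average_latency_ns(local_fraction: float, remote_ns: float) -> float:
    """Hypothetical two-level model (an assumption, not from the interview):
    a fraction of accesses hit local memory, the rest all pay remote_ns."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * LOCAL_NS + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    print(f"NUMAlink 4 worst case: {projected_worst_case_nl4_ns():.0f} ns")
    print(f"Same-brick remote vs local: {REMOTE_SAME_BRICK_NS / LOCAL_NS:.2f}x")
    for frac in (0.5, 0.9, 0.99):
        avg = average_latency_ns(frac, WORST_CASE_NL3_NS)
        print(f"{frac:.0%} local accesses -> {avg:.1f} ns average")
```

Under this simplified model, the higher the share of local accesses, the closer the average gets to the 145 ns local figure, which is the effect the placement software is meant to achieve.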
LinuxHPC: Are there plans to develop a NUMAlink 5 in the future? What is the time frame for this?
Jason: We have just started introducing NUMAlink 4 in our systems. The previous generation, NUMAlink 3, operated at an aggregate bandwidth of 3.2GB/sec. NUMAlink 4 doubles that aggregate bi-directional bandwidth to 6.4GB/sec. All Altix 3000 system bricks are NUMAlink 3 and NUMAlink 4 enabled, and in fact the Altix 3300 and the newly introduced Altix 350, which use a ring topology, are already utilizing NUMAlink 4.
We are preparing to release the NUMAlink 4 router bricks, which will finalize the introduction of NUMAlink 4 across the entire Altix product line. Currently Altix 3700 systems operate in NUMAlink 3 mode using a dual-plane fat-tree topology that delivers 6.4GB/sec of aggregate bandwidth per brick and an overall system bi-section bandwidth of 400MB/second/processor. When the NUMAlink 4 routers become available these numbers will double and will enable Altix to continue providing leadership performance with the next generations of the Itanium processor.
As far as NUMAlink 5 is concerned, we are just beginning development of that next generation of infrastructure, and it’s really too soon to say when it will be available in products.
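Editor's note: the bandwidth figures in this answer can be tabulated with a few lines of arithmetic. The sketch below takes the NUMAlink 3 numbers as quoted and applies the stated doubling for NUMAlink 4. The whole-system bisection total is an assumption: it simply multiplies the per-processor figure by the processor count and treats 1 GB as 1000 MB.

```python
# NUMAlink 3 figures as quoted in the interview.
LINK_NL3_GB_S = 3.2                  # aggregate bi-directional, per connection
BRICK_NL3_GB_S = 6.4                 # per brick, dual-plane fat tree
BISECTION_NL3_MB_S_PER_CPU = 400.0   # system bisection bandwidth per processor
NL4_FACTOR = 2.0                     # "these numbers will double"


def scale_for_numalink4(value: float) -> float:
    """Apply the doubling quoted for NUMAlink 4."""
    return value * NL4_FACTOR


def total_bisection_gb_s(per_cpu_mb_s: float, processors: int) -> float:
    """Assumed aggregate: per-processor figure times processor count,
    converted with 1 GB = 1000 MB."""
    return per_cpu_mb_s * processors / 1000.0


if __name__ == "__main__":
    print(f"Link:  {LINK_NL3_GB_S} -> {scale_for_numalink4(LINK_NL3_GB_S)} GB/s")
    print(f"Brick: {BRICK_NL3_GB_S} -> {scale_for_numalink4(BRICK_NL3_GB_S)} GB/s")
    nl3_total = total_bisection_gb_s(BISECTION_NL3_MB_S_PER_CPU, 512)
    nl4_total = total_bisection_gb_s(
        scale_for_numalink4(BISECTION_NL3_MB_S_PER_CPU), 512
    )
    print(f"512P bisection (assumed): {nl3_total:.1f} -> {nl4_total:.1f} GB/s")
```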
LinuxHPC: Directory based cache coherency seems to be the best approach to ccNUMA, are there any developments planned to improve or refine the cache coherence in the SHUBs in the future?
Jason: There are really no significant changes planned in the way that cache coherency is enforced in our architecture. However, we will be extending the size of the cache-coherency domain beyond the 512-processor limit we have today. In the next generation of the SHUB ASIC we will move to a cache-coherency domain of at least 1024 processors.
LinuxHPC: Are there plans to support 4GB DIMMs on Altix in the future?
Jason: Certainly we will support denser memory as it becomes available at economically viable prices. We will introduce 2GB DIMMs on Altix, which utilize registered ECC DDR memory, in April or May.
LinuxHPC: How is CXFS different from other file systems, and what makes it uniquely suited for the Altix?
Jason: CXFS, or clustered XFS, is SGI’s distributed shared file system. I want to first mention that there is nothing about CXFS per se that makes it unique to Altix; in fact CXFS supports heterogeneous file access from a wide variety of operating systems including IRIX, AIX, HP-UX, 32-bit and 64-bit Linux, Solaris, MacOS, and Windows. The unique capability of CXFS is the seamless file sharing it enables. Unlike traditional SAN file systems, CXFS allows administrators to share a single volume with all attached systems. This is a benefit because it means that you don’t have replicated data specific to each server, nor do you need to spend hours transferring data between volumes. But the real benefit for Altix and HPC users is that it allows for the rapid sharing and analysis of output files, which can be terabytes in size. By shortening the time it takes to complete computation, analysis, and visual analysis, CXFS combined with scalable systems can enable better science purely by allowing more cycles to be completed in a given period of time.
LinuxHPC: Do most Altix users compile with Intel’s compiler or do you find many using GCC?
Jason: I think most of our users are using both as appropriate. For their high-performance computing applications I believe they primarily use the Intel compilers, but for many open source utilities GCC is the right choice for recompilation. The Intel compilers have just proven better than GCC for parallel applications, although there is still room for improvement in both sets of tools.
LinuxHPC: Are there any features you would like to see added to the compilers you are using?
Jason: We work closely with Intel to give feedback on improving the compilers for large-scale systems, and Intel has been very receptive to our input. Our focus has been on using SGI's depth of experience to help Intel enhance compiler performance on parallel applications and OpenMP codes. SGI has established a strong dialogue between our applications engineers and core-engineering teams regarding areas where the compilers and tools can continue to improve for scalable parallel performance. Our experience with the MIPSpro compilers used on IRIX and our extensive experience on large Origin systems has really allowed us to offer some unique expertise to the development process.
LinuxHPC: Does NASA employ any optimization techniques for their software (PGO, etc.)?
Jason: NASA Ames Research Center has developed a programming technique called Multi-Level Parallelism (MLP). MLP seeks to optimize an application’s performance by synchronizing completion of all threads and processes of a particular job to reduce wait times. The result is increased efficiency for the overall computation. The benefits of this technique are particularly pronounced on very large systems: in a 512P system, a 10% efficiency loss is equivalent to losing 51 processors of computing power, which can get expensive very quickly.
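Editor's note: the arithmetic behind the "51 processors" remark, and the reason uneven completion times cost efficiency, can be sketched as follows. This is not NASA's MLP implementation; it is only an illustrative accounting model that assumes a job finishes when its slowest worker finishes, so every other worker idles for the difference. The function names are hypothetical.

```python
from typing import Iterable


def lost_processors(total_processors: int, efficiency_loss: float) -> float:
    """Processor-equivalents wasted at a given fractional efficiency loss."""
    return total_processors * efficiency_loss


def parallel_efficiency(worker_seconds: Iterable[float]) -> float:
    """Useful work divided by capacity consumed, assuming the job ends
    when the slowest worker ends and the others wait idle until then."""
    times = list(worker_seconds)
    if not times or max(times) <= 0:
        raise ValueError("need at least one positive runtime")
    return sum(times) / (len(times) * max(times))


if __name__ == "__main__":
    # The figure from the interview: 10% of a 512-processor system.
    print(f"{lost_processors(512, 0.10):.1f} processor-equivalents lost")

    # Hypothetical runtimes: unbalanced versus nearly synchronized completion.
    print(f"unbalanced:   {parallel_efficiency([8, 10, 10, 12]):.3f}")
    print(f"synchronized: {parallel_efficiency([9, 9, 9, 10]):.3f}")
```

In this model, bringing worker completion times closer together raises the efficiency toward 1.0, which is the effect the answer attributes to synchronizing completion.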
LinuxHPC: Obviously, there is still a need for large SMP systems, what types of problems do you see such systems addressing?
Jason: SGI customers with large Altix systems are researching a wide variety of fields. The COSMOS project at Cambridge University, led by Professor Stephen Hawking, is using an Altix to research the origins of the universe. Climate modeling, weather forecasting, oil field yield optimization and exploration, vehicle manufacturing, earthquake research, and homeland security are just a few areas I can think of, but there are many more. It really is one of the most fascinating areas of computer science today, and my hope is that the capabilities of Altix and the rapid adoption of Linux will create some amazing new breakthroughs that will change our lives. I think the history of computing has shown that the research being done on these types of systems has always generated scientific discoveries that are later adopted by a majority of users.
LinuxHPC: What industries are you seeing a lot of interest in the Altix systems from?
Jason: SGI focuses on five market areas for our products: energy, manufacturing, government and defense, media, and sciences. We are seeing interest from each of these areas and have already placed Altix systems in all of them.
LinuxHPC: What is so compelling about Linux for HPC and research computing; in the past, Tru64 or IRIX were the OSes of choice?
Jason: I think the most compelling aspect of Linux for HPC is the access to source and the ability to collaborate with other researchers using different hardware. This was not impossible in the past, but it was certainly much more cumbersome. More than likely, if you were sharing code between two large scientific computing facilities it meant you had to deal with a different UNIX variant, different development tools, and different processor technologies. Linux in combination with industry-standard processors like Itanium and Pentium eases collaboration by providing a de facto standard of hardware and operating systems available from a variety of vendors.
LinuxHPC: Are most of NASA’s key applications written in house?
Jason: There are very few commercial applications that scale to the level of today’s supercomputing centers so most of the codes utilized are “roll-your-own” codes. In addition to homegrown codes unique to their mission, NASA also utilizes public domain codes many of which they have contributed to and are used in the global HPC community; the ECCO ocean simulation model is one such example.
LinuxHPC: Which Linux distribution is being used with Altix systems?
Jason: SGI provides a choice of two Linux distributions on Altix servers: (1) SGI Advanced Linux Environment + SGI ProPack, and (2) SUSE LINUX Enterprise Server.
SGI Advanced Linux Environment is an SGI-created distribution that is 100% binary compatible with Red Hat Enterprise Linux AS 2.1, which is currently based on a 2.4.22 kernel. Over the next few months we expect to be updating the distribution to be compatible with AS 3.0.
SGI ProPack is a set of open source and proprietary enhancements developed to deliver optimized performance for HPC applications and provide resource management tools for NUMA systems. These include programming libraries, such as our SCSL math libraries and MPT our MPI library. Also included are our CPUsets and dplace tools that allow the user to manage memory and process placement within the system.
SUSE LINUX Enterprise Server is also supported and runs unaltered on Altix starting with SUSE LINUX Enterprise Server 8 (SLES 8) Service Pack 3. SUSE has real technical leadership as a company and distribution and has willingly worked with us to support scalable Linux. In fact SLES 8 supports 64-processor scalability on Altix out of the box, and SUSE has indicated they would like to continue working with us to support even greater scalability. The purpose of SUSE on Altix is to enable users who have standardized on SUSE to deploy Altix easily, and to provide additional certified ISV applications on Altix.
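Editor's note: the interview does not show how the CPUsets and dplace tools are invoked, so the sketch below does not attempt to reproduce them. It only illustrates the general idea of process placement using the standard Linux CPU affinity call exposed by Python's `os` module. It is Linux-only, it controls CPU placement but not memory placement, and the helper name `pin_to_cpus` is hypothetical.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to the given CPU ids and return the
    resulting affinity set. Stand-in for the idea of process placement;
    this is not the dplace or CPUsets interface."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity calls are not available on this platform")
    os.sched_setaffinity(0, cpus)   # pid 0 means the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        original = os.sched_getaffinity(0)
        first = min(original)
        print("allowed CPUs:", sorted(original))
        print("pinned to:   ", sorted(pin_to_cpus({first})))
        pin_to_cpus(original)       # restore the original placement
    else:
        print("CPU affinity is not supported on this platform")
```

On a NUMA machine, keeping a process on processors near the memory it uses is what keeps its accesses at the lower, local end of the latency range.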
LinuxHPC: It seems like a lot of the 2.5/2.6 kernel enhancements were backported to 2.4, when will SGI begin using the 2.5/2.6 kernels?
Jason: We plan to initially make a 2.6 kernel available as an unsupported option for our users starting in the May timeframe. I expect we will have a fair number of our users try it and provide us with feedback on its initial performance and areas where we can make some improvements. We currently intend to fully support both the 2.4 and 2.6 kernels by the end of the year, and expect to eventually phase out 2.4 kernel support over the course of 2005.
LinuxHPC: What were the challenges in getting Linux to scale so far (in terms of CPU count)? How much performance gain did all of the enhancements to Linux achieve (cumulatively)?
Jason: Linux is rapidly evolving, and the developments ongoing in the 2.5/2.6 kernel tree have many things heading in the right direction. Internally we were initially skeptical about how Linux would scale. However, we have been very satisfied with the work the community has been doing to improve the performance of the operating system overall and have contributed to that effort where we can. There were some basic things that needed to be fixed initially, changing variable types for example, which was straightforward. The implementation of the O(1) scheduler was a more recent improvement that has been advantageous. A group of our engineers worked extensively on identifying bottlenecks in the locking mechanisms of the Linux kernel through an iterative process as we fixed problems and gained access to more hardware to build larger systems. Discontiguous memory work was another essential area, as well as overall NUMA platform support. Overall we estimate the performance gain can be anywhere from 20-30% depending on the application. But most of this work is open source, so everyone is benefiting.
LinuxHPC: What do you see as the strengths and weaknesses of Linux?
Jason: The strength of Linux is undoubtedly the community. It has really powered Linux to the top of the operating system world and made it ready to scale and robust for production use. In addition it is open source, available from multiple vendors, and enables collaboration and hardware vendor choice.
The current weakness of Linux as I see it has nothing to do with the operating system itself. It has more to do with corporate users’ risk aversion. It’s only been in the last year or so that Linux has been considered a Tier 1 operating system by applications vendors, and that has limited the broader success of Linux. By all accounts the analysts are saying that 2004 will be another breakout year for Linux. We saw a big bump in Linux adoption in 1997 during the dot-com era. Certainly, with the growing support from hardware and application vendors, the 2.6 kernel arriving, and a growing base of happy production Linux users, they’ll be right about 2004.
I think Linux has a lot to offer. If the Linux community and vendors keep doing what they’ve been doing I think Linux will become commonplace.
The success of Altix systems in the high performance computing market is a very positive sign for both Linux and Itanium. Clearly, the popularity of large processor count Altix systems dispels any doubts about whether Linux is a scalable OS for scientific applications. Linux is quite popular for HPC and will remain so in the future, as it inherits the role that was filled by Tru64 and IRIX. One of the most impressive facts to come out of this interview is that kernel tweaks can yield a 20-30% performance boost. This is a very high figure and helps to show how much Linux has been improved over the past year and a half, thanks to the investment of major OEMs and IP donors.
However, scientific applications have very different operating characteristics from commercial applications. Typically, much of the work in scientific code is done inside loops, whereas commercial applications, such as database or ERP software, are far more branch intensive. This makes the memory hierarchy more important, particularly the latency to main memory. Whether Linux can scale well with a commercial workload is an open question. However, there is no doubt that with each passing month, the scalability in such environments will improve. Unfortunately, SGI has no plans to move into this market at this point in time. It would be very interesting to see how the low-latency Altix systems would perform with commercial workloads.
The performance demonstrated by Altix systems in computational areas is also a testament to the design of the system. More than anything, it goes to show the impact that a good system infrastructure can have on overall performance. Perhaps the greatest difference between the Altix and other Itanium 2 based systems is the available bandwidth within a single ‘cell’. Other systems have 4 processors per local cell, while the Altix has only 2, giving each processor twice the effective bandwidth. This architecture is showing its merits today, and once Montecito is released, these advantages will be accentuated by multithreading. Montecito’s multithreading will allow for greater parallelism and will require even more bandwidth than its predecessors. It will be quite exciting to examine the different Montecito based systems and see how the available bandwidth affects the overall performance.
SGI Altix webpage – http://www.sgi.com/altix
Whitepapers - http://www.sgi.com/servers/altix/whitepapers/
About NASA AMES and Altix - http://www.sgi.com/features/2003/nov/nasa/
This Q&A contains forward-looking statements regarding SGI technologies and third-party technologies that are subject to risks and uncertainties. These risks and uncertainties could cause actual results to differ materially from those described in such statements. The reader is cautioned not to rely unduly on these forward-looking statements, which are not a guarantee of future or current performance. Such risks and uncertainties include long-term program commitments, the performance of third parties, the sustained performance of current and future products, financing risks, the ability to integrate and support a complex technology solution involving multiple providers and users, and other risks detailed from time to time in the company's most recent SEC reports, including its reports on Form 10-K and Form 10-Q.