An Update on High-Performance Computing at FSL
By Mark Govett, Daniel Schaffer, Thomas Henderson, Leslie Hart, and Jacques Middlecoff
Numerical Weather Prediction requires some of the world's most powerful computers to solve problems in fluid dynamics, physics, and data assimilation on high-resolution model grids. Improvements in weather prediction are tied to both higher model resolution and the integration of new datasets that offer finer spatial and temporal resolution. As atmospheric models have grown more complex, modelers have become increasingly dependent on high-performance parallel computers to meet their needs.
Massively Parallel Processing (MPP) technology offers affordable opportunities to meet these computing requirements. MPP systems use relatively inexpensive commodity microprocessors and memory connected by a high-performance communication network; high performance is achieved by using large numbers of inexpensive processors. In contrast, traditional vector supercomputers are built from a few expensive custom CPUs and memory that rely on state-of-the-art semiconductor technology. These systems typically share memory via a high-speed bus that limits their scalability.
MPP Technology at FSL
As NOAA's technology transfer laboratory, FSL has long recognized the importance of MPP technologies. The laboratory used a 208-node Intel Paragon in 1992 for producing weather forecasts in real time using the 60-km version of the Rapid Update Cycle (RUC) model. This was the first demonstration of operational forecasts produced in real time using an MPP class system. Since then, FSL has parallelized mesoscale models such as the Global Forecast System (GFS) for the Taiwan Central Weather Bureau (CWB), a derivative of the RAMS model called the Scalable Forecast Model (SFM), the Quasi-Nonhydrostatic (QNH) research model, and the 40-km operational version of the RUC model currently running at the National Centers for Environmental Prediction (NCEP). FSL’s extensive experience with MPP class supercomputers has led to continued advancements in high resolution numerical weather prediction models using these systems.
FSL recently purchased a high-performance computing system from High Performance Technologies Incorporated (HPTi). The phase I system contains 256 Compaq Alpha EV6 processors connected via Myrinet running under the Linux operating system (Figure 1). This system, dubbed Jet, represents a new generation of supercomputers, a Commodity Parallel Processor (CPP) that integrates the best of multiple technologies from different vendors into a single high-performance system. FSL will use Jet to continue development, testing, and validation of new supercomputing technologies destined for eventual operational use. One example is testing of the North American Atmospheric Observing System (NAOS), which will determine the effect of different mixes of observing systems on forecast accuracy and lead to decisions on the most cost-effective configuration of upper-air observing systems over North America and adjacent waters.
Recognizing the superior price and performance of MPPs, other supercomputer centers are now selecting these systems to meet their computational needs. In 1996, the European Centre for Medium-Range Weather Forecasts purchased a 696-node Cray, and more recently, NCEP selected a 768-node IBM SP2 computer as its operational system.
To use these MPP systems efficiently and reap the most benefit from them, modelers must address software issues such as programmability, performance, and portability of their codes. MPPs are generally more difficult to program because the problem domain must be divided into smaller subdomains so that each subdomain can be solved in parallel. If memory is physically distributed, messages must be passed between processors whenever data sharing is required. These considerations require changes and additions to the existing serial code, and in some cases the model's code may need to be completely rewritten.
Performance and Portability – Performance and portability are often combined into a single metric that describes the ability of a code to run efficiently on multiple architectures. Performance portability has become increasingly important for two main reasons. First, computers quickly become obsolete; typically a new generation is introduced every two to four years. New systems utilize the latest advancements in computer architecture and hardware technology. MPP systems now comprise a wide range and class of systems, including fully distributed systems, fully shared memory systems called Symmetric Multiprocessors (SMPs) containing up to 256 or more CPUs, and a new class of hybrid systems that connect multiple SMPs using a high-speed network. Programming on these systems offers different performance benefits, but also presents many programming challenges.
Another significant aspect of performance and portability relates to the fact that models are frequently moved between different computing platforms for operational and developmental use. If these models rely on a specific computer architecture or use vendor-specific routines not available on another system, the codes will require modification. In addition, climate and weather modelers frequently share codes, and when multiple versions of these codes exist due to portability problems, the integration of disparate formulations and the maintenance of a consistent source become more difficult. Lack of portability can cost agencies significant resources and lost productivity when multiple source codes must be maintained. Ideally, a single source code capable of running on any serial or parallel system allows the developer to test new theories on a desktop, yet easily transfer the model to an MPP when operational runs are desired.
Central to FSL’s success with MPPs is the development of the Scalable Modeling System (SMS), designed to improve the programmability and portability of model codes and provide efficient parallel code that achieves good performance and scaling. For example, the RUC model, parallelized using SMS, runs efficiently without code change on most supercomputing systems (including the IBM SP2, T3E, SGI Origin 2000) and clusters of workstations. FSL has successfully used SMS to parallelize eight numerical weather prediction models, both spectral and finite difference.
Before further describing SMS – its development, usage, recent performance on Jet, and future enhancements – it may be helpful to review the various approaches to parallelization.
Approaches to Parallelization
In the past decade, several distinct approaches, or classes of solutions, have been used to parallelize model codes, as follows.
In simplifying the parallel code, these high-level libraries also reduce the impact on the original serial version. However, parallelization is still time-consuming and invasive, since code must be inserted by hand and multiple versions must be maintained. To further reduce this impact, source translation tools have been developed to help modify these codes automatically. One such tool, the Fortran Loop and Index Converter (FLIC), generates calls to the RSL library based on command-line arguments that identify decomposed loops needing parallelization. This tool is useful, but its capabilities are limited.
Another library-based tool is a directive-based source translation tool called the Parallel Preprocessor (PPP), which is a new addition to SMS. The programmer inserts the directives (as comments) directly into the Fortran serial code, then PPP translates the directives and serial code into a parallel version that runs on shared and distributed memory machines. Since the programmer adds only comments to the code, there is little impact on the serial version. SMS also hides sufficient detail of the parallelism to significantly reduce coding and testing time, as compared to an MPI-based solution. Though SMS has been applied only to atmospheric model codes, the approach is sufficiently general to also work for other structured grid models.
How the Scalable Modeling System Works
In any parallelization effort, the most important issue is determining how to divide the work among the processors. To enable the problem to scale to large numbers of processors, the most common approach for the structured grids found in atmospheric model codes is to use data domain decomposition. In this case, parallelism is usually achieved by having multiple processes execute the same program on different segments of the distributed data. This type of parallelism is called Single Program Multiple Data (SPMD).
A horizontal data domain decomposition in Figure 2 shows each processor controlling the data in a slab from the surface to the top of the atmosphere. This decomposition is often used in finite difference approximation (FDA) weather models, since complex dependencies for the physical parameterizations are rarely encountered in the horizontal.
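The index arithmetic behind such a decomposition can be sketched as follows. The helper below is illustrative only, not an SMS routine; it assigns each process a nearly equal range of grid columns:

```python
# Sketch: divide one horizontal dimension of a global grid among processes.
# Hypothetical helper, not part of SMS; the vertical dimension stays
# undivided, matching the surface-to-top slabs described above.

def slab_bounds(nx, nprocs, rank):
    """Return the global [start, end) index range owned by `rank`
    when nx grid columns are split as evenly as possible."""
    base, extra = divmod(nx, nprocs)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return start, start + size
```

Each process then allocates storage only for its own range (plus halo points), rather than for the full global grid.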
Once a data decomposition strategy is chosen, the code must then be analyzed to determine where data dependencies occur. For example, the computation of y(i,j) in

    y(i,j) = x(i+1,j) + x(i-1,j) + x(i,j+1) + x(i,j-1)

depends on the i+1, i-1, j+1, and j-1 points of the x array. If these data points are not local to the processor, they must be obtained from another processor. SMS handles this type of dependence, called an adjacent dependence, by creating a halo or ghost region (Figure 3). Each processor sends the edges of its data to its neighbors, where they are stored in the halo regions; loop calculations can then be executed over each processor's local domain.
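A halo exchange can be simulated serially to show the idea: each "process" below holds its slab of columns plus one halo column on each side, neighbors swap edge columns, and the four-point adjacent dependence is then computed purely from local data. The code is illustrative and is not the SMS interface:

```python
import numpy as np

# Sketch of a halo ("ghost region") exchange, simulated serially.
# Each "process" owns interior columns plus one halo column per side.

def exchange_halos(slabs):
    """Copy each slab's edge column into its neighbor's halo column."""
    for left, right in zip(slabs[:-1], slabs[1:]):
        right[:, 0] = left[:, -2]   # left neighbor's last interior column
        left[:, -1] = right[:, 1]   # right neighbor's first interior column

def local_stencil(slab):
    """Apply the 4-point adjacent dependence over one slab's interior."""
    y = np.zeros_like(slab)
    y[1:-1, 1:-1] = (slab[2:, 1:-1] + slab[:-2, 1:-1] +
                     slab[1:-1, 2:] + slab[1:-1, :-2])
    return y
```

After the exchange, each processor's loop runs over its local domain only, yet produces the same interior values as a single global computation would.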
Parallelism – SMS was designed to support SPMD parallelism on both shared and distributed memory systems. To ensure portability across these systems, SMS assumes that memory is distributed: no processor can address memory belonging to another processor. Each processor accesses its data through a local address space, and SMS provides mechanisms to translate global addresses into processor-local references. Communications between processes are implemented using the message-passing paradigm. Despite the assumption that memory is distributed, performance on shared memory architectures is good due to the efficient implementation of message passing on these systems.
High-Level Libraries – SMS consists of a set of high-level libraries that rely on MPI to implement the required low-level functionality. Once directives have been inserted into the serial code, PPP translates the directives and serial code into a parallel version. Figure 4 shows a serial code segment, with SMS directives added, that computes a global sum on three processors. Parallel code generation decomposes the x and y arrays and modifies the loops to ensure that computations are done only on each processor's local data points. Further information about SMS code translation and the use of directives is available from the SMS Website, http://www-ad.fsl.noaa.gov/ac/.
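The reduction that such a global-sum operation performs can be sketched as follows. In a real message-passing implementation the combine step would use a collective such as MPI_Allreduce; the function names here are illustrative:

```python
# Sketch of a global sum across decomposed data: each "process" sums its
# local points, then the partial sums are combined into one global total.

def global_sum(local_arrays):
    """Combine per-process partial sums into one global total."""
    partials = [sum(a) for a in local_arrays]   # local work on each process
    return sum(partials)                        # the reduction step
```

Note that because partial sums may be combined in a different order than a serial loop would use, floating-point results can differ in the last bits from a serial run.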
Performance Optimization – SMS provides several techniques for optimizing parallel performance, such as making platform-specific optimizations in the message passing layer. During a recent FSL procurement, the MPI implementation of key SMS routines was replaced with one vendor’s native communications package to improve performance. Since these changes were made at a lower layer of SMS, no modification to the model codes was necessary.
Another area of SMS support is performance optimization of interprocessor communications. Techniques now available include aggregating multiple model variables to reduce communications latency. For details about communication optimization, refer to the SMS user guide and overview at http://www-ad.fsl.noaa.gov/ac/sms.html.
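The aggregation idea amounts to packing several variables' exchange data into a single message so that per-message latency is paid once instead of once per variable. A minimal sketch, with hypothetical pack/unpack helpers that are not SMS routines:

```python
# Sketch of message aggregation: concatenate the halo data of several
# variables into one send buffer, and split it apart on the receiver.

def pack(edges):
    """Concatenate per-variable edge data into one buffer plus a layout."""
    buf, layout = [], []
    for name, data in edges.items():
        layout.append((name, len(data)))
        buf.extend(data)
    return buf, layout

def unpack(buf, layout):
    """Split a received buffer back into per-variable edge data."""
    out, pos = {}, 0
    for name, n in layout:
        out[name] = buf[pos:pos + n]
        pos += n
    return out
```

One message carrying all variables replaces one message per variable, which matters when latency rather than bandwidth dominates communication cost.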
Performance optimizations have also been built into SMS I/O operations. By default, all I/O is handled by a single processor. Input data are read by this node and then scattered to the other processors. Similarly, decomposed output data are gathered by a single process and then written asynchronously. Since atmospheric models typically output forecasts several times during a model run, these operations can significantly affect the overall performance. To improve performance, SMS allows the user to dedicate multiple output processors to gather and output these data asynchronously. This allows compute operations to continue at the same time data are written to disk. The use of multiple output processors has been shown to improve model performance by up to 25%.
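The overlap of computation and output can be sketched with a background writer. Here a Python thread stands in for SMS's dedicated output processors, and all names are illustrative:

```python
import queue
import threading

# Sketch of asynchronous output: compute code enqueues forecast fields
# and continues, while a background writer drains the queue.

def start_writer(sink):
    """Start a writer that appends queued items to `sink` (stands in
    for writing to disk). Returns the queue and the writer thread."""
    q = queue.Queue()

    def writer():
        while True:
            item = q.get()
            if item is None:      # sentinel: no more output
                break
            sink.append(item)

    t = threading.Thread(target=writer)
    t.start()
    return q, t
```

The compute loop only pays the cost of the enqueue; the gather-and-write work proceeds concurrently, which is the effect the dedicated output processors achieve across MPI processes.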
Parallel Performance Results using SMS
As mentioned before, the SMS directive-based approach was used to parallelize several models. The focus here is on earlier preliminary performance results of FSL's Rapid Update Cycle (RUC) and Quasi-nonhydrostatic (QNH) models run on Jet. The next section presents more recent results using the parallelized Eta model.
The RUC model is an operational atmospheric prediction system that provides high-frequency, hourly analyses of conventional and new data sources over the contiguous United States and short-range numerical forecasts in support of aviation and other mesoscale forecast users. Preliminary performance results, shown in Table 1, are from a 40-km RUC model that was decomposed in both horizontal dimensions, with the I/O times subtracted out. The performance numbers are for the 40-km RUC model configured with a 151x113 horizontal grid and 40 vertical levels. The model scales reasonably well to 16 processors, beyond which the problem size, relative to processor speed, limits further scalability. Some performance tuning of this model should improve these preliminary numbers.
The QNH model is a high resolution, grid-based, explicit finite-difference, limited-area mesoscale model designed to run on high-performance parallel supercomputers. QNH is decomposed in both horizontal dimensions to avoid complex dependencies in the vertical. Except for diagnostic print statements, all horizontal dependencies are adjacent, so only exchange communications are needed. Redundant computations are implemented to reduce the amount of communications required.
Table 2 shows the performance (without I/O operations) of the 20-km version of the QNH model configured with a 258 x 194 horizontal grid and 32 vertical levels. Performance efficiency values of better than 1 indicate superscalar performance of the model, which occurs when per-processor performance increases as more processors are added. This effect can often be attributed to model data fitting increasingly well in cache as the number of processors increases. These preliminary results illustrate very good performance and scaling to more than 100 processors. Analysis of these results continues, and further results will be published later.
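The efficiency metric in such a table can be computed as the speedup relative to a baseline run divided by the ratio of processor counts. The sketch below uses made-up timings, not values from Table 2:

```python
# Sketch of the parallel efficiency calculation: efficiency above 1.0
# indicates superlinear (superscalar) scaling, as described above.

def efficiency(t_base, p_base, t, p):
    """Efficiency of a run on p processors taking time t, relative to a
    baseline run on p_base processors taking time t_base."""
    speedup = t_base / t
    ideal = p / p_base
    return speedup / ideal
```

For example, if doubling the processor count more than halves the run time, the efficiency exceeds 1, consistent with data fitting better in cache.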
Eta Model Parallelization and Recent Results
As a high-level software tool, SMS requires extra computations to maintain the data structures that encapsulate low-level MPI functionality, leading to potential performance degradation. To measure this impact, a performance comparison was made between the hand-coded MPI-based version of the Eta model running operationally at NCEP and the same Eta model parallelized using SMS. The MPI Eta model was considered a good candidate for a fair comparison since it is an operational model and has been optimized for high performance on the IBM SP2. Performance optimizations of the Eta model include the use of IBM's parallel I/O capability, which offers fast asynchronous output of intermediate results during the course of a model run.
To accomplish parallelization, the MPI Eta code was reverse engineered to return the code to its original serial form. This code was then parallelized using SMS. Code changes included restoring the original global loop bounds found in the serial code, removing MPI-based communications routines, and restoring array declarations. Fewer than 200 directives were added to the 19,000-line Eta model during SMS parallelization. To verify the parallelization, output files generated by the serial and parallel runs were compared and shown to be bit-wise identical.
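This kind of bit-wise check reduces to a byte-for-byte comparison of the output files. A minimal sketch, with illustrative file names:

```python
import filecmp

# Sketch of verifying a parallelization: the serial and parallel runs'
# output files must be bit-for-bit identical.

def outputs_match(serial_path, parallel_path):
    """True iff the two output files are byte-for-byte identical."""
    return filecmp.cmp(serial_path, parallel_path, shallow=False)
```

Bit-wise agreement is a strict test: it catches not only logic errors but also unintended changes in floating-point evaluation order.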
After parallelization was complete, performance studies were done to compare SMS Eta to the hand-coded MPI Eta. In these tests, shown in Table 3, the SMS-Eta performance results matched those of the MPI-Eta model running on the IBM SP2 at NCEP. These results demonstrate that SMS can be used to speed and simplify parallelization, improve code readability, and allow the user to maintain a single source, without incurring significant performance overhead.
Summary and Future SMS Upgrades
High-performance computing will continue to fill an important role in FSL’s mission to research and test new technologies and transfer these results to NOAA and other government agencies. FSL's expertise with Massively Parallel Processing (MPP) and Commodity Parallel Processor (CPP) systems, along with the powerful resources within its new High-Performance Computer Center, will benefit science research at NOAA, academia, and eventually the public through improved weather prediction. Jet will undergo two upgrades within the next three years that will result in a 1280 processor system available for testing and running next-generation high-resolution meteorological models.
Central to FSL’s success in high-performance computing is the development of the Scalable Modeling System (SMS). Use of this directive-based approach to parallelization for both shared and distributed memory platforms provides general and high-level, comment-based directives that allow retention of the serial code nearly intact. SMS can be used to develop portable parallel code on multiple hardware platforms and achieve good performance.
Future enhancements to SMS include increasing support for Fortran 90, including features such as allocatable arrays, array syntax, and modules. Also, the Parallel Preprocessor (PPP) will be enhanced to automatically generate OpenMP directives.
This enhancement would target state-of-the-art machines consisting of clusters of Symmetric Multiprocessors (SMPs) by allowing microtasking "within the box" using OpenMP and message passing "between the boxes" using MPI. The PPP translator could be designed to generate both message passing and microtasking parallel code. Finally, support for dynamic load balancing will be added.
Another important upgrade involves eliminating the need for modelers to perform dependence analysis by hand. This enhancement would combine SMS with a semiautomatic dependency analysis tool, which would produce SMS directives instead of parallel code. These directives could then be translated by SMS whenever parallel code is desired.
(Mark Govett is a computer scientist in the Advanced Computing Branch of FSL's Aviation Division. He can be reached by e-mail at govett@fsl.noaa.gov or (303)497-6278. More information on this research is available at Website http://www-ad.fsl.noaa.gov/ac/.)