
Scalable Modeling System: An Update on High-Performance Computing at FSL

By Mark Govett, Daniel Schaffer, Thomas Henderson, Leslie Hart, and Jacques Middlecoff

Introduction

Numerical Weather Prediction requires some of the world’s most powerful computers to solve problems in fluid dynamics, physics, and data assimilation on high-resolution model grids. Improvements in weather prediction are tied both to higher model resolution and to the integration of new datasets that offer finer spatial and temporal resolution. As atmospheric models have grown more complex, modelers have become increasingly dependent on high-performance parallel computers to meet their computing needs.

Massively Parallel Processing (MPP) technology offers affordable opportunities to meet these computing requirements. The MPP systems use relatively inexpensive commodity microprocessors and memory that are connected using a high-performance communication network. High performance can be achieved by using large numbers of inexpensive processors. In contrast, traditional vector supercomputers are built using a few expensive custom CPUs and memory that rely on "state-of-the-art" semiconductor technology. These systems typically share memory via a high-speed bus that limits their scalability.

MPP Technology at FSL

As NOAA's technology transfer laboratory, FSL has long recognized the importance of MPP technologies. In 1992, the laboratory used a 208-node Intel Paragon to produce real-time weather forecasts with the 60-km version of the Rapid Update Cycle (RUC) model, the first demonstration of operational forecasts produced in real time on an MPP-class system. Since then, FSL has parallelized models such as the Global Forecast System (GFS) for the Taiwan Central Weather Bureau (CWB), the Scalable Forecast Model (SFM), a derivative of the RAMS model, the Quasi-Nonhydrostatic (QNH) research model, and the 40-km operational version of the RUC model currently running at the National Centers for Environmental Prediction (NCEP). FSL’s extensive experience with MPP-class supercomputers has led to continued advances in high-resolution numerical weather prediction models on these systems.

FSL recently purchased a high-performance computing system from High Performance Technologies Incorporated (HPTi). The Phase I system contains 256 Compaq Alpha EV6 processors connected via Myrinet and running under the Linux operating system (Figure 1). This system, dubbed Jet, represents a new generation of supercomputers: a Commodity Parallel Processor (CPP) that integrates the best of multiple technologies from different vendors into a single high-performance system. FSL will use Jet to continue development, testing, and validation of new supercomputing technologies destined for eventual operational use. One example is testing of the North American Atmospheric Observing System (NAOS), which will determine the effect of different mixes of observing systems on forecast accuracy and lead to decisions on the most cost-effective configuration of upper-air observing systems over North America and adjacent waters.

Recognizing the superior price/performance of MPPs, other supercomputer centers are now selecting these systems to meet their computational needs. In 1996, the European Centre for Medium-Range Weather Forecasts purchased a 696-node Cray, and more recently, NCEP selected a 768-node IBM SP2 computer as its operational system.

Figure 1

Figure 1. FSL’s new high-performance computer system. The nearest "wall" contains 100 processors (10 systems in each of 10 columns). Each of the 256 Compaq Alpha systems is rack-mounted in a standard desktop system box.

To use these MPP systems efficiently and reap the most benefit from them, modelers must address software issues such as the programmability, performance, and portability of their codes. MPPs are generally more difficult to program because the problem domain must be divided into smaller subdomains so that each subdomain can be solved in parallel. If memory is physically distributed, messages must be passed between systems when data sharing is required. These considerations require changes and additions to the existing serial code, and in some cases the model's code may need to be completely rewritten.

Performance and Portability – Performance and portability are often combined into a single metric that describes the ability of a code to run efficiently on multiple architectures. Performance portability has become increasingly important for two main reasons. First, computers quickly become obsolete; a new generation is typically introduced every two to four years, and new systems utilize the latest advances in computer architecture and hardware technology. MPP systems now span a wide range of architectures, including fully distributed memory systems, fully shared memory systems called Symmetric Multiprocessors (SMPs) containing 256 or more CPUs, and a new class of hybrid systems that connect multiple SMPs using a high-speed network. Each of these systems offers different performance benefits but also poses its own programming challenges.

Second, models are frequently moved between different computing platforms for operational and developmental use. If these models rely on a specific computer architecture or use vendor-specific routines not available on another system, the codes will require modification. In addition, climate and weather modelers frequently share codes, and when multiple versions of these codes exist because of portability problems, integrating disparate formulations and maintaining a consistent source become more difficult. Lack of portability can cost agencies significant resources and productivity when multiple source codes must be maintained. Ideally, a single source code capable of running on any serial or parallel system allows the developer to test new theories on a desktop, yet easily transfer the model to an MPP when operational runs are desired.

Central to FSL’s success with MPPs is the development of the Scalable Modeling System (SMS), designed to improve the programmability and portability of model codes and to produce efficient parallel code with good performance and scaling. For example, the RUC model, parallelized using SMS, runs efficiently without code change on most supercomputing systems (including the IBM SP2, Cray T3E, and SGI Origin 2000) and on clusters of workstations. FSL has successfully used SMS to parallelize eight numerical weather prediction models, both spectral and finite difference.

Before further describing SMS (its development, usage, recent performance on Jet, and planned enhancements), it is helpful to review the main approaches to parallelization.

Approaches to Parallelization

In the past decade, several distinct approaches, or classes of solutions, have been used to parallelize model codes, as follows.

  • Directive-Based Microtasking (Loop Level) – This approach is offered by companies such as Cray; more recently, OpenMP has become accepted in the computing community as the standard for directive-based microtasking. OpenMP can be used to quickly produce parallel code with minimal impact on the serial version (a brief sketch appears after this list). However, OpenMP does not work well for distributed memory architectures.
  • Message Passing Libraries – A message passing library, such as the Message Passing Interface (MPI), represents a class of solutions suitable for shared or distributed memory architectures. The scalability of parallel codes using these libraries can be quite good. Since the MPI libraries are relatively low level, the modeler must expend a lot more effort to parallelize the code, and the resulting code may differ substantially from the original serial version.
  • Parallelizing Compilers – This class offers the ability to automatically produce portable parallel code for shared and distributed memory machines. The compiler does the dependence analysis and offers the user directives and/or language extensions that reduce the development time and the impact on the serial code. The most notable example of a parallelizing compiler is High Performance Fortran (HPF). In some cases, the resulting parallel code is quite efficient, but there are also deficiencies in this approach. For example, compilers are often forced to make conservative assumptions about data dependence relationships, which degrades performance. In addition, poor compiler technology provided by some vendors can result in widely varying performance across systems.
  • Interactive Parallelization Tools – An example of an interactive parallelization tool is the Computer-Aided Parallelization Tools (CAPTools), which attempts a comprehensive dependence analysis. This tool is highly interactive, querying the user for both high-level information (decomposition strategy) and low-level details, such as loop dependencies and ranges that variables can take. CAPTools offers the possibility of a quality parallel solution in a fraction of the time required to analyze dependencies and generate code by hand. However, its downfall lies in its limited ability to efficiently parallelize numerical weather prediction codes that contain more advanced features.
  • Library-Based Tools – Library-based tools such as the Runtime System Library (RSL) and FSL’s Nearest Neighbor Tool [see the May 1995 issue of the FSL Forum] are built on top of the lower level libraries and serve to relieve the programmer of handling many of the details of message passing programming. Performance optimizations can be added to these libraries that target specific machine architectures. However, unlike computer-aided parallelization tools, such as CAPTools, the user is still required to do all dependence analysis by hand.

    In simplifying the parallel code, these high-level libraries also reduce the impact on the original serial version. However, parallelization is still time-consuming and invasive, since code must be inserted by hand and multiple versions must be maintained. To further reduce this impact, source translation tools have been developed to modify these codes automatically. One such tool, the Fortran Loop and Index Converter (FLIC), generates calls to the RSL library based on command-line arguments that identify decomposed loops needing parallelization. This tool is useful but has limited capabilities.

    Another library-based tool is the Parallel Preprocessor (PPP), a directive-based source translator that is a new addition to SMS. The programmer inserts the directives (as comments) directly into the Fortran serial code, then PPP translates the directives and serial code into a parallel version that runs on shared and distributed memory machines. Since the programmer adds only comments to the code, there is little impact on the serial version. SMS also hides enough of the details of the parallelism to significantly reduce coding and testing time compared with an MPI-based solution. Though SMS has been applied only to atmospheric model codes, the approach is general enough to work for other structured grid models as well.
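
    As a brief sketch of the loop-level, directive-based style mentioned in the first item above (OpenMP in this case; SMS and PPP directives look different but are likewise comments), the routine below applies an OpenMP directive to a simple column computation. The routine name and the placeholder physics are purely illustrative and are not taken from any FSL model.

      ! Illustrative only: loop-level ("microtasking") parallelism in the
      ! OpenMP style; the directive is a comment, so the serial code is
      ! unchanged when OpenMP is not enabled.
      subroutine column_physics(t, q, im, jm, km)
        implicit none
        integer, intent(in)    :: im, jm, km
        real,    intent(inout) :: t(im, jm, km)
        real,    intent(in)    :: q(im, jm, km)
        integer :: i, j, k

      !$OMP PARALLEL DO PRIVATE(i, j, k)
        do j = 1, jm                ! horizontal columns are independent,
          do i = 1, im              ! so the j loop is split among threads
            do k = 1, km
              t(i, j, k) = t(i, j, k) + 0.1 * q(i, j, k)   ! placeholder physics
            end do
          end do
        end do
      !$OMP END PARALLEL DO
      end subroutine column_physics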

    How the Scalable Modeling System Works

    In any parallelization effort, the most important issue is determining how to divide the work among the processors. To enable the problem to scale to large numbers of processors, the most common approach for the structured grids found in atmospheric model codes is to use data domain decomposition. In this case, parallelism is usually achieved by having multiple processes execute the same program on different segments of the distributed data. This type of parallelism is called Single Program Multiple Data (SPMD).

    Figure 2 shows a horizontal data domain decomposition in which each processor controls the data in a slab extending from the surface to the top of the atmosphere. This decomposition is often used in finite difference approximation (FDA) weather models, since complex dependencies for the physical parameterizations are rarely encountered in the horizontal.

    Figure 2

    Figure 2. Representation of a horizontal data domain decomposition. Thin solid lines show model grid boxes; thick solid lines indicate processor boundaries. Here, the model data are divided among four processors.
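
    As an illustration of how such a decomposition can be set up, the sketch below uses plain MPI to divide an illustrative 151 x 113 horizontal grid among processors arranged in a two-dimensional process grid. SMS hides these details behind its directives, so none of the names here are SMS routines.

      ! Sketch only (not SMS internals): dividing a 2-D horizontal grid
      ! among processors, as in Figure 2, using an MPI Cartesian topology.
      program horiz_decomp
        use mpi
        implicit none
        integer, parameter :: im = 151, jm = 113   ! illustrative grid size
        integer :: ierr, nprocs, rank, comm2d
        integer :: dims(2), coords(2), is, ie, js, je
        logical :: periods(2)

        call MPI_Init(ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

        dims = 0                                   ! let MPI choose the layout
        call MPI_Dims_create(nprocs, 2, dims, ierr)
        periods = .false.
        call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, .true., comm2d, ierr)
        call MPI_Comm_rank(comm2d, rank, ierr)
        call MPI_Cart_coords(comm2d, rank, 2, coords, ierr)

        ! Map this processor's position in the process grid to a block
        ! of (i,j) points, as marked by the thick lines in Figure 2.
        call block_range(im, dims(1), coords(1), is, ie)
        call block_range(jm, dims(2), coords(2), js, je)
        print *, 'rank', rank, 'owns i =', is, ':', ie, ' and j =', js, ':', je

        call MPI_Finalize(ierr)

      contains
        subroutine block_range(n, nblocks, iblock, lo, hi)
          integer, intent(in)  :: n, nblocks, iblock
          integer, intent(out) :: lo, hi
          integer :: chunk
          chunk = (n + nblocks - 1) / nblocks      ! points per block, rounded up
          lo = iblock * chunk + 1
          hi = min(n, lo + chunk - 1)
        end subroutine block_range
      end program horiz_decomp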

    Once a data decomposition strategy is chosen, the code must then be analyzed to determine when data dependencies occur. For example, the computation of y(i,j) in

      y(i,j) = x(i+1,j) + x(i-1,j) + x(i,j+1) + x(i,j-1)

    depends on the i+1, i-1, j+1, and j-1 points of the x array. If these data points are not local to the processor, they must be obtained from another processor. SMS handles this type of dependence, called an adjacent dependence, by creating a halo or ghost region (Figure 3). Each processor sends the edges of its data to its neighbors, where they are stored in the halo regions, and loop calculations can then be executed over each processor’s local domain.

    Figure 3

    Figure 3. Schematic of how communications are implemented to handle adjacent dependencies using halo regions and "exchanges." The first column of P2 is sent to P1 where it is stored in the halo region just to the right of P1’s data. The last column of P1 is stored in the left halo region of P2. The other communication works analogously.
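
    A plain-MPI sketch of the exchange Figure 3 describes is given below; SMS generates equivalent communication from its exchange directive, so the routine and argument names here are illustrative rather than actual SMS output. The left and right arguments are neighbor ranks (MPI_PROC_NULL at a domain edge).

      ! Sketch of a halo ("ghost region") exchange for an adjacent dependence.
      ! Each processor owns columns 1:nloc of x plus one halo column on
      ! each side (columns 0 and nloc+1).
      subroutine exchange_halo(x, nx, nloc, left, right, comm)
        use mpi
        implicit none
        integer, intent(in)    :: nx, nloc, left, right, comm
        real,    intent(inout) :: x(nx, 0:nloc+1)
        integer :: ierr, status(MPI_STATUS_SIZE)

        ! Shift right: send my last owned column to the right neighbor and
        ! receive the left neighbor's last column into my left halo.
        call MPI_Sendrecv(x(:, nloc), nx, MPI_REAL, right, 1, &
                          x(:, 0),    nx, MPI_REAL, left,  1, comm, status, ierr)

        ! Shift left: send my first owned column to the left neighbor and
        ! receive the right neighbor's first column into my right halo.
        call MPI_Sendrecv(x(:, 1),      nx, MPI_REAL, left,  2, &
                          x(:, nloc+1), nx, MPI_REAL, right, 2, comm, status, ierr)
      end subroutine exchange_halo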

    Parallelism – SMS was designed to support SPMD parallelism on both shared and distributed memory systems. To ensure portability across these systems, SMS assumes that memory is distributed; no processor can address memory belonging to another processor. Each processor accesses data through a local address space, and SMS provides mechanisms to translate global addresses into processor-local references. Communications between processes are implemented using the message passing paradigm. Despite the assumption that memory is distributed, performance on shared memory architectures is good because message passing is implemented efficiently on these systems.
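
    As a toy illustration of what such an address translation involves (this helper is hypothetical and is not part of the SMS interface), a global grid index can be mapped to a processor-local array index given the start of the local subdomain and the halo width:

      ! Hypothetical helper, not the SMS API: translate a global index into
      ! a local array index on the processor that owns it.
      integer function local_index(i_global, i_start_global, halo)
        implicit none
        integer, intent(in) :: i_global        ! index in the full model grid
        integer, intent(in) :: i_start_global  ! first interior point owned locally
        integer, intent(in) :: halo            ! halo width on the low side
        local_index = i_global - i_start_global + 1 + halo
      end function local_index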

    High-Level Libraries – SMS consists of a set of high-level libraries that rely on MPI to implement the lowest-level functionality required. Once directives have been inserted into the serial code, PPP translates the directives and serial code into a parallel version. Figure 4 shows a serial code segment, with SMS directives added, that computes a global sum on three processors. Parallel code generation will decompose the x and y arrays and modify the loops to ensure that computations are done only on each processor’s local data points. Further information about SMS code translation and the use of directives is available from the SMS Website, http://www-ad.fsl.noaa.gov/ac/.

    Performance Optimization – SMS provides several techniques for optimizing parallel performance, such as making platform-specific optimizations in the message passing layer. During a recent FSL procurement, the MPI implementation of key SMS routines was replaced with one vendor’s native communications package to improve performance. Since these changes were made at a lower layer of SMS, no modification to the model codes was necessary.

    Another area of SMS support is performance optimization of interprocessor communications. Techniques now available include aggregating multiple model variables to reduce communications latency. For details about communication optimization, refer to the SMS user guide and overview at http://www-ad.fsl.noaa.gov/ac/sms.html.

    Figure 4

    Figure 4. Serial code segment with SMS directives (in bold) to compute and print a global sum.
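
    Because only the figure shows the actual directives, the sketch below merely approximates the effect of the translation: in the SMS version the programmer adds a reduction directive as a comment to the serial loop, and the generated code sums each processor's local points and then combines the partial sums across processors (shown here with MPI_Allreduce; the code PPP actually generates may differ).

      ! Approximation of what the generated parallel global sum does;
      ! names and structure are illustrative, not actual PPP output.
      subroutine global_sum(x, im_loc, jm_loc, comm, total)
        use mpi
        implicit none
        integer, intent(in)  :: im_loc, jm_loc, comm
        real,    intent(in)  :: x(im_loc, jm_loc)   ! this processor's points only
        real,    intent(out) :: total
        real    :: partial
        integer :: i, j, ierr

        partial = 0.0
        do j = 1, jm_loc            ! loop bounds rewritten to local extents
          do i = 1, im_loc
            partial = partial + x(i, j)
          end do
        end do
        call MPI_Allreduce(partial, total, 1, MPI_REAL, MPI_SUM, comm, ierr)
      end subroutine global_sum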

    Performance optimizations have also been built into SMS I/O operations. By default, all I/O is handled by a single processor. Input data are read by this node and then scattered to the other processors. Similarly, decomposed output data are gathered by a single process and then written asynchronously. Since atmospheric models typically output forecasts several times during a model run, these operations can significantly affect the overall performance. To improve performance, SMS allows the user to dedicate multiple output processors to gather and output these data asynchronously. This allows compute operations to continue at the same time data are written to disk. The use of multiple output processors has been shown to improve model performance by up to 25%.
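
    The default output path described above can be pictured with the following plain-MPI sketch, which assumes equal-sized local slabs and uses illustrative names (SMS's own I/O routines are not shown): decomposed data are gathered onto one processor, which alone writes to disk.

      ! Sketch of single-writer output: gather each processor's slab onto
      ! rank 0, then write the assembled field from that rank only.
      subroutine write_field(local, nloc, comm, filename)
        use mpi
        implicit none
        integer,          intent(in) :: nloc, comm
        real,             intent(in) :: local(nloc)      ! this processor's slab
        character(len=*), intent(in) :: filename
        real, allocatable :: global(:)
        integer :: rank, nprocs, ierr

        call MPI_Comm_rank(comm, rank, ierr)
        call MPI_Comm_size(comm, nprocs, ierr)
        if (rank == 0) then
          allocate(global(nloc * nprocs))
        else
          allocate(global(1))                            ! unused off the root
        end if

        call MPI_Gather(local, nloc, MPI_REAL, global, nloc, MPI_REAL, 0, comm, ierr)

        if (rank == 0) then                              ! only rank 0 touches disk
          open(unit=10, file=filename, form='unformatted')
          write(10) global
          close(10)
        end if
        deallocate(global)
      end subroutine write_field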

    Parallel Performance Results using SMS

    As mentioned before, the SMS directive-based approach was used to parallelize several models. The focus here is on earlier preliminary performance results of FSL's Rapid Update Cycle (RUC) and Quasi-nonhydrostatic (QNH) models run on Jet. The next section presents more recent results using the parallelized Eta model.

    The RUC model is an operational atmospheric prediction system that provides high-frequency (hourly) analyses of conventional and new data sources over the contiguous United States, as well as short-range numerical forecasts in support of aviation and other mesoscale forecast users. Preliminary performance results, shown in Table 1, are for the 40-km RUC model, configured with a 151 x 113 horizontal grid and 40 vertical levels, decomposed in both horizontal dimensions, and with I/O times subtracted out. The model scales reasonably well up to 16 processors; beyond that, the small problem size relative to processor speed limits further scalability. Some performance tuning of this model should improve these preliminary numbers.

    Table 1

    Table 1 - Performance of the 40-km RUC model on Jet

    The QNH model is a high resolution, grid-based, explicit finite-difference, limited-area mesoscale model designed to run on high-performance parallel supercomputers. QNH is decomposed in both horizontal dimensions to avoid complex dependencies in the vertical. Except for diagnostic print statements, all horizontal dependencies are adjacent, so only exchange communications are needed. Redundant computations are implemented to reduce the amount of communications required.

    Table 2 shows the performance (without I/O operations) of the 20-km version of the QNH model configured with a 258 x 194 horizontal grid and 32 vertical levels. Performance efficiency values greater than 1 indicate superscalar performance of the model, which occurs when per-processor performance increases as more processors are added. This effect can often be attributed to the model data fitting increasingly well in cache as the number of processors increases. These preliminary results illustrate very good performance and scaling to more than 100 processors. Analysis of these results continues, and the findings will be published later.
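
    For reference, performance efficiency is commonly defined relative to a baseline run, for example

      E(N) = T(1) / (N x T(N)),

    where T(N) is the wall-clock time on N processors; E(N) = 1 corresponds to perfect scaling, and E(N) > 1 to the superscalar behavior described above.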

    Table 2

    Table 2 - Performance of the 20-km QNH model on Jet

    Eta Model Parallelization and Recent Results

    As a high-level software tool, SMS requires extra computations to maintain the data structures that encapsulate low-level MPI functionality, which can degrade performance. To measure this impact, a performance comparison was made between the hand-coded MPI-based version of the Eta model running operationally at NCEP and the same Eta model parallelized using SMS. The MPI Eta model was considered a good candidate for a fair comparison because it is an operational model that has been optimized for high performance on the IBM SP2. These optimizations include the use of IBM’s parallel I/O capability, which offers fast asynchronous output of intermediate results during the course of a model run.

    To accomplish the parallelization, the MPI Eta code was first reverse engineered to recover its original serial form; this serial code was then parallelized using SMS. Code changes included restoring the original global loop bounds, removing the MPI-based communications routines, and restoring the array declarations. Fewer than 200 directives were added to the 19,000-line Eta model during SMS parallelization. To verify the parallelization, output files generated by the serial and parallel runs were compared and found to be bit-wise identical.

    After parallelization was complete, performance studies were done to compare SMS-Eta to the hand-coded MPI-Eta. In these tests, shown in Table 3, the SMS-Eta performance results matched those of the MPI-Eta model running on the IBM SP2 at NCEP. These results demonstrate that SMS can be used to speed and simplify parallelization, improve code readability, and allow the user to maintain a single source, without incurring significant performance overhead.

    Summary and Future SMS Upgrades

    High-performance computing will continue to fill an important role in FSL’s mission to research and test new technologies and transfer these results to NOAA and other government agencies. FSL's expertise with Massively Parallel Processing (MPP) and Commodity Parallel Processor (CPP) systems, along with the powerful resources within its new High-Performance Computer Center, will benefit science research at NOAA, academia, and eventually the public through improved weather prediction. Jet will undergo two upgrades within the next three years, resulting in a 1280-processor system available for testing and running next-generation high-resolution meteorological models.

    Central to FSL’s success in high-performance computing is the development of the Scalable Modeling System (SMS). This directive-based approach to parallelization, for both shared and distributed memory platforms, provides general, high-level, comment-based directives that leave the serial code nearly intact. SMS can be used to develop portable parallel code for multiple hardware platforms while achieving good performance.

    Future enhancements to SMS include increased support for Fortran 90, including features such as allocatable arrays, array syntax, and modules. Also, the Parallel Preprocessor (PPP) will be enhanced to automatically generate OpenMP directives.

    Table 3

    Table 3 - Eta model performance for an MPI-Eta and SMS-Eta run on NCEP’s IBM SP-2

    This enhancement would target state-of-the-art machines consisting of clusters of Symmetric Multiprocessors (SMPs) by allowing microtasking "within the box" using OpenMP and message passing "between the boxes" using MPI. The PPP translator could be designed to generate both message passing and microtasking parallel code. Finally, support for dynamic load balancing will be added.
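
    A minimal sketch of how such a hybrid program is typically structured (assumed names; this is not PPP-generated code) starts one MPI process per SMP box and lets OpenMP supply the threads inside it:

      ! Sketch of hybrid MPI + OpenMP start-up: message passing "between
      ! the boxes," microtasking "within the box."
      program hybrid_sketch
        use mpi
        use omp_lib
        implicit none
        integer :: ierr, provided, rank, nthreads

        ! Request an MPI library that tolerates threaded callers.
        call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        nthreads = omp_get_max_threads()

        print *, 'MPI process', rank, 'will run', nthreads, 'OpenMP threads'

        ! A model time step would exchange halos with MPI across boxes and
        ! then execute OpenMP-threaded loops over each box's local domain.
        call MPI_Finalize(ierr)
      end program hybrid_sketch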

    Another important upgrade involves eliminating the need for modelers to perform dependence analysis by hand. This enhancement would combine SMS with a semiautomatic dependency analysis tool, which would produce SMS directives instead of parallel code. These directives could then be translated by SMS whenever parallel code is desired.

    (Mark Govett is a computer scientist in the Advanced Computing Branch of FSL's Aviation Division. He can be reached by e-mail at govett@fsl.noaa.gov or by phone at (303) 497-6278. More information on this research is available on the Website http://www-ad.fsl.noaa.gov/ac/.)

