Fine-Grain Parallel Computing: the Next Frontier in High Performance Computing

The Boulder HPC Facility: Exploring New Computing Technologies for NOAA



NVIDIA Fermi GPU C2090

  512 cores
  1331 GFlops Single Precision
    665 GFlops Double Precision
    225 Watts power

Graphics Processing Units (GPUs) and the Intel MIC are considered by many to be the next frontier in High Performance Computing (HPC).  With CPU performance stalling, vendors have been forced to increase the number of cores per chip at the expense of increased power and cooling requirements.  To run climate and weather models at the finer scales needed to advance prediction capabilities (2-4 km for global weather models and 10 km for climate models), CPU systems with over 200,000 cores will be required.  However, systems of this size are impractical for operational weather forecasting for many reasons, including power and cooling requirements and their cost, infrastructure costs, and system reliability.  GPUs can deliver 5x the performance per watt of comparable CPUs today, with continued efficiency gains expected from NVIDIA, AMD, and Intel MIC.
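As a rough illustration of the performance-per-watt argument, the peak figures listed above for the Fermi C2090 can be turned into GFlops per watt. This is a minimal sketch: the function name is hypothetical, and the numbers are theoretical peaks from the spec list, not sustained model performance. No CPU figures are given on this page, so the 5x comparison itself is not reproduced here.

```python
# Peak figures as listed above for the NVIDIA Fermi C2090.
SINGLE_PRECISION_GFLOPS = 1331.0
DOUBLE_PRECISION_GFLOPS = 665.0
POWER_WATTS = 225.0


def gflops_per_watt(peak_gflops, watts):
    """Return peak GFlops per watt of rated power (hypothetical helper)."""
    return peak_gflops / watts


sp_efficiency = gflops_per_watt(SINGLE_PRECISION_GFLOPS, POWER_WATTS)
dp_efficiency = gflops_per_watt(DOUBLE_PRECISION_GFLOPS, POWER_WATTS)

print(f"single precision: {sp_efficiency:.2f} GFlops/W")  # 5.92 GFlops/W
print(f"double precision: {dp_efficiency:.2f} GFlops/W")  # 2.96 GFlops/W
```

Real applications reach only a fraction of peak, so these values are best read as upper bounds.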



Intel MIC (Many Integrated Core)

  - 22-nanometer technology
  - 32 - 64 cores
  - 512-bit SIMD vector registers
  - x86 programming environment
  - expected release in 2013

Major Activities in 2011 / 2012

  • Fortran to CUDA (or C) Compiler:  The F2C-ACC compiler was released to the public in June 2009.  While it has limitations, development of this compiler has proven useful for parallelizing the NIM and other weather models.
  • F2C-ACC Version 4.6 was released in October 2012 with updated documentation.  This version provides limited support for modules.  The distribution also contains a number of working examples and tests, which can be compiled and run as Fortran only, Fortran + C, and Fortran + CUDA (which runs on the GPU).

  • Commercial Fortran GPU Compilers: We continue to evaluate the Fortran GPU compilers from CAPS (HMPP) and PGI (PGI Accel) using the NIM model.  We are also evaluating a beta version of the Cray GPU compiler.  These compiler vendors all have plans to support the Intel MIC, and we plan to evaluate them in 2012.  We hope to use these compilers to run other models, including the HRRR (a WRF-ARW variant), HYCOM, and FIM.
  • GPU Parallelization of NIM model dynamics using F2C-ACC:  Parallelization efforts focused on (1) maintaining a single source code for CPU and GPU execution, and (2) running efficiently on the CPU while optimizing performance on the GPU.  The performance optimizations we made for the GPU also improved CPU performance.  Dynamics currently runs at 30 percent of the peak performance of the Intel Westmere CPU.  Performance comparisons between Fermi (GPU) and Intel Westmere (CPU) show that NIM runs 5 times faster on the GPU (socket-to-socket).
  • GPU Parallelization of NIM Physics using F2C-ACC: This effort has begun with exploratory work on select routines from WRF Physics.  The emphasis in parallelization is on retaining the original community code (written in Fortran).  We are using F2C-ACC directives to convert the code to CUDA.  Initial performance results show the YSU PBL running 2x faster on the GPU (socket-to-socket).  This speedup does not include the time to transfer data between the CPU and GPU.
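
The note that the YSU PBL speedup excludes CPU-GPU data transfer matters because transfer time is added to the GPU side of the comparison. The sketch below shows the arithmetic under stated assumptions: all three timings are hypothetical values chosen only so that the kernel-only ratio matches the reported 2x, and the function name is illustrative, not part of F2C-ACC.

```python
def speedup(cpu_seconds, gpu_seconds, transfer_seconds=0.0):
    """Return CPU time divided by total GPU-side time (compute + transfer)."""
    gpu_total = gpu_seconds + transfer_seconds
    if gpu_total <= 0:
        raise ValueError("GPU-side time must be positive")
    return cpu_seconds / gpu_total


# Hypothetical timings, not measurements from NIM or WRF.
cpu_seconds = 2.0        # routine on one CPU socket
gpu_seconds = 1.0        # same routine on one GPU, compute only
transfer_seconds = 0.5   # assumed CPU <-> GPU copy time per call

kernel_only = speedup(cpu_seconds, gpu_seconds)
with_transfer = speedup(cpu_seconds, gpu_seconds, transfer_seconds)

print(f"kernel only:   {kernel_only:.2f}x")    # 2.00x
print(f"with transfer: {with_transfer:.2f}x")  # 1.33x
```

With these assumed numbers, the effective speedup drops from 2x to about 1.33x, which is why keeping data resident on the GPU across routines is usually worth the effort.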

Recent Presentations

Prepared by Mark Govett, Mark.W.Govett@noaa.gov
Date of last update: July 18, 2012