5.1.6.6. Performance Profiling and Tuning

A major focus of the OpenFAST team is performance-profiling and optimization of the OpenFAST software with the goal of improving time-to-solution performance for the most computationally expensive use cases. The process generally involves initial profiling and hotspot analysis, then identifying specific subroutines to target for optimization in the physics modules and glue-codes.

A portion of this work was supported by Intel® through its designation of NREL as an Intel® Parallel Computing Center (IPCC).

The procedures, findings, and recommended programming practices are presented here. This is a working document and will be updated as additional studies are completed.

5.1.6.6.1. High-performance programming techniques

Being a compiled language designed for scientific and engineering applications, Fortran is well suited to producing very efficient code. However, tuning code requires developers to understand the language as well as the available tools (compilers, performance libraries, etc.) in order to obtain the best performance. This section identifies programming patterns for using Fortran and the Intel® Fortran compiler effectively. Developers should also reference the Intel® Fortran compiler documentation on optimization in general, and especially the sections on automatic vectorization and coarrays.

5.1.6.6.1.1. Optimization report

When evaluating compiler optimization performance in complex software, timing tests alone are not a good indication of the compiler’s ability to optimize particular lines of code. For low-level information on the compiler’s optimization attempts, developers should generate optimization reports to get line-by-line reporting on metrics such as vectorization, parallelization, memory and cache usage, and threading. Developers should refer to the Intel® Fortran compiler documentation on optimization reports.

For Linux and macOS, the OpenFAST CMake configuration provides compiler flags for generating optimization reports; they are available but commented out in the set_fast_intel_fortran_posix macro in openfast/cmake/OpenfastFortranOptions.cmake. Primarily, the qopt-report-phase and qopt-report flags should be used. See the optimization report options documentation for more information on additional flags and configurations.
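As an illustrative example (the flag values below are an assumption for demonstration, not the exact settings committed in OpenfastFortranOptions.cmake), the reports can also be requested directly on an ifort compile line:

ifort -c -O2 -qopt-report=5 -qopt-report-phase=vec,loop VersionInfo.f90

Here, -qopt-report=5 requests the most detailed report level and -qopt-report-phase=vec,loop restricts the report to the vectorization and loop-optimization phases.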

With the compiler flags correctly configured, the compiler will output files with the extension .optrpt alongside the intermediate compile artifacts such as .o files. The compile process will state that additional files are being generated:

ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location

The additional files are located in the corresponding CMakeFiles directory for each compile target. For example, the optimization report for the VersionInfo module in OpenFAST is located at:

>> ls -l openfast/build/modules/version/CMakeFiles/versioninfolib.dir/src/
-rw-r--r--  2740 May 12 23:10 VersionInfo.f90.o
-rw-r--r--     0 May 12 23:10 VersionInfo.f90.o.provides.build
-rw-r--r--   668 May 12 23:10 VersionInfo.f90.optrpt

5.1.6.6.1.2. Operator Strength Reduction

Each mathematical operation has an effective “strength”, and some operations can be equivalently represented as a combination of multiple reduced-strength operations that perform better than the original. As part of the code optimization step, compilers may be able to identify areas where a mathematical operation’s strength can be reduced. However, compilers are not able to optimize all expensive operations; for example, cases where a portion of an expensive mathematical operation is a variable contained in a derived data type are frequently skipped. Therefore, it is recommended that expensive subroutines be profiled and searched for possible strength-reduction opportunities.

A concrete example of operator strength reduction is in dividing many elements of an array by a constant.

module_type%scale_factor = 10.0

do i = 1, size(array)
  if (array(i) < 30.0) then
    array(i) = array(i) / module_type%scale_factor
  end if
end do

In this case, a multiplication of real numbers is less expensive than a division of real numbers. The code can be factored so that the inverse of the scale factor is computed outside of the loop and the mathematical operation in the loop is converted to a multiplication.

module_type%scale_factor = 10.0
inverse_scale_factor = 1.0 / module_type%scale_factor

do i = 1, size(array)
  if (array(i) < 30.0) then
    array(i) = array(i) * inverse_scale_factor
  end if
end do

5.1.6.6.1.3. Coarrays

Coarrays are a feature introduced to the Fortran language in the 2008 standard to provide language-level parallelization for array operations in a Single Program, Multiple Data (SPMD) programming paradigm. The Fortran standard leaves the method of parallelization up to the compiler, and the Intel® Fortran compiler uses MPI.

Coarrays are used to split an array operation across multiple copies of the program. The copies are called “images”. Each image has its own local variables, plus a portion of any coarrays as shared variables. A coarray can be thought of as having extra dimensions, referred to as “codimensions”. A single image (typically image 1) controls the I/O and problem setup, and images can read from memory in other images.

For operations on large arrays, such as constructing a super-array from many sub-arrays (as in the construction of a Jacobian matrix), the coarray feature of Fortran 2008 can parallelize the procedure and improve overall computational efficiency.
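A minimal coarray sketch is shown below; the program and variable names are illustrative and are not taken from OpenFAST. Each image fills its own slice of a shared array, and image 1 performs the I/O by reading the other images’ memory through coindexed access. With the Intel® Fortran compiler, coarray support is enabled with the -coarray option.

program coarray_demo

    implicit none

    integer, parameter :: n_per_image = 4
    ! The trailing [*] declares a coarray: every image holds its own copy of this slice.
    real :: partial(n_per_image)[*]
    integer :: i, img

    ! Each image fills its slice of the notional global array independently.
    do i = 1, n_per_image
        partial(i) = real((this_image() - 1) * n_per_image + i)
    end do

    ! Wait for all images to finish writing before any image reads remote data.
    sync all

    ! A single image (image 1) handles the I/O and reads the coarray slices
    ! owned by the other images through coindexed access: partial(:)[img].
    if (this_image() == 1) then
        do img = 1, num_images()
            print *, "Image", img, ":", partial(:)[img]
        end do
    end if

end program coarray_demo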

5.1.6.6.1.4. Data modeling and access rules

Fortran stores arrays in column-major order. This means that a multidimensional array is laid out in memory with the elements of each column adjacent to one another. Given the memory location of an array element, the element immediately before it in memory is the one above it in the same column.

In order to make use of the single instruction, multiple data (SIMD) features of modern processors, array construction and access should happen in column-major order. That is, the innermost loop should iterate over the left-most index. Slicing should also occur with the : on the left-most index when possible.

With this in mind, data should be represented as structures of arrays rather than arrays of structures. Concretely, this means that data types within OpenFAST should contain the underlying arrays and arrays should generally contain only numeric types.
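The sketch below illustrates both points with a hypothetical structure-of-arrays derived type (the type and field names are illustrative and not taken from OpenFAST): the type holds plain numeric arrays, the inner loop runs over the left-most index, and slicing with : addresses one contiguous column.

program soa_access

    implicit none

    ! Structure of arrays: the derived type contains only numeric arrays.
    type :: BladeData
        real, allocatable :: position(:,:)   ! (3, n_nodes): x, y, z stored down each column
        real, allocatable :: load(:,:)       ! (3, n_nodes)
    end type BladeData

    type(BladeData) :: blade
    integer, parameter :: n_nodes = 1000
    integer :: i, j

    allocate(blade%position(3, n_nodes), blade%load(3, n_nodes))
    blade%position = 1.0

    ! Outer loop over columns (nodes), inner loop over the left-most index so
    ! that consecutive iterations touch adjacent memory locations.
    do j = 1, n_nodes
        do i = 1, 3
            blade%load(i, j) = 2.0 * blade%position(i, j)
        end do
    end do

    ! Slicing with ":" on the left-most index addresses one contiguous column.
    print *, blade%load(:, 1)

end program soa_access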

The short program below displays the distance in memory in units of bytes between elements of an array and neighboring elements.

program memloc

implicit none

integer(kind=8), dimension(3, 3) :: r, distance_up, distance_left

! Take the element values as their "ID"
! r(row, column)
r(1,:) = (/ 1, 2, 3 /)
r(2,:) = (/ 4, 5, 6 /)
r(3,:) = (/ 7, 8, 9 /)
print *, "Reference array:"
call pretty_print_array(r)

! Compute the distance between matrix elements. Inputs to the `calculate_distance` function
! are (row, column) index pairs for the two elements in loc(this_element) - loc(other_element)
distance_up(1,:) = (/ calculate_distance( 1,1 , 1,1), calculate_distance( 1,2 , 1,2), calculate_distance( 1,3 , 1,3) /)
distance_up(2,:) = (/ calculate_distance( 2,1 , 1,1), calculate_distance( 2,2 , 1,2), calculate_distance( 2,3 , 1,3) /)
distance_up(3,:) = (/ calculate_distance( 3,1 , 2,1), calculate_distance( 3,2 , 2,2), calculate_distance( 3,3 , 2,3) /)
print *, "Distance in memory (bytes) for between an element and the one above it (top row zeroed):"
call pretty_print_array(distance_up)

distance_left(1,:) = (/ calculate_distance( 1,1 , 1,1), calculate_distance( 1,2 , 1,1), calculate_distance( 1,3 , 1,2) /)
distance_left(2,:) = (/ calculate_distance( 2,1 , 2,1), calculate_distance( 2,2 , 2,1), calculate_distance( 2,3 , 2,2) /)
distance_left(3,:) = (/ calculate_distance( 3,1 , 3,1), calculate_distance( 3,2 , 3,1), calculate_distance( 3,3 , 3,2) /)
print *, "Distance in memory (bytes) for between an element and the one to the its left (first column zeroed):"
call pretty_print_array(distance_left)

contains

! Note: loc() is a common compiler extension (supported by the Intel® and GNU
! compilers) that returns the address of its argument.
integer(8) function calculate_distance(row1, col1, row2, col2)

    integer, intent(in) :: row1, col1, row2, col2
    calculate_distance = loc(r(row1, col1)) - loc(r(row2, col2))

end function

subroutine pretty_print_array(array)

    integer(8), dimension(3,3), intent(in) :: array
    print *, "["
    print '(I4, I4, I4)', array(1,1), array(1,2), array(1,3)
    print '(I4, I4, I4)', array(2,1), array(2,2), array(2,3)
    print '(I4, I4, I4)', array(3,1), array(3,2), array(3,3)
    print *, "]"

end subroutine

end program
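Assuming integer(kind=8) occupies 8 bytes, the program reports a distance of 8 bytes between an element and the one above it (distance_up is 8 everywhere except the zeroed top row) and 24 bytes between an element and the one to its left (distance_left is 24 everywhere except the zeroed first column). This confirms that elements within a column are adjacent in memory, while elements within a row are separated by a full column (3 × 8 bytes).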

5.1.6.6.2. Optimization Studies

This section describes specific efforts to profile sections of OpenFAST and improve performance with the Intel® compiler suite.

5.1.6.6.2.1. BeamDyn Performance Profiling and Optimization (IPCC Year 1 and 2)

The general mechanisms identified for performance improvements in OpenFAST were:

  • Intel® compiler suite and Intel® Math Kernel Library (Intel® MKL)

  • Algorithmic improvements

  • Memory-access optimization enabling more efficient cache usage

  • Data type alignment allowing for SIMD vectorization

  • Multithreading with OpenMP

To establish a path forward with these options, OpenFAST was first profiled with Intel® VTune™ Amplifier to get a clear breakdown of time spent in the simulation. Then, the optimization report generated from the Intel® Fortran compiler was analyzed to determine areas that were not autovectorized. Finally, Intel® Advisor was used to highlight areas of the code that the compiler identified as potentially improved with multithreading.

Two OpenFAST test cases have been chosen to provide meaningful and realistic timing benchmarks. In addition to real-world turbine and atmospheric models, these cases are computationally expensive and expose the areas where performance improvements would make a difference.

5MW_Land_BD_DLL_WTurb

Download case files here.

The physics modules used in this case are:

  • BeamDyn

  • InflowWind

  • AeroDyn 15

  • ServoDyn

This is a land based NREL 5-MW turbine simulation using BeamDyn as the structural module. It simulates 20 seconds with a time step size of 0.001 seconds and executes in 3m 55s on NREL’s Peregrine supercomputer.

5MW_OC4Jckt_DLL_WTurb_WavesIrr_MGrowth

Download case files here.

This is an offshore, fixed-bottom NREL 5-MW turbine simulation with the majority of the computational expense occurring in the HydroDyn wave-dynamics calculation.

The physics modules used in this case are:

  • ElastoDyn

  • InflowWind

  • AeroDyn 15

  • ServoDyn

  • HydroDyn

  • SubDyn

It simulates 60 seconds with a time step size of 0.01 seconds and executes in 20m 27s on NREL’s Peregrine supercomputer.

5.1.6.6.2.1.1. Profiling

The OpenFAST test cases were profiled with Intel® VTune™ Amplifier to identify performance hotspots. Because the two test cases exercise different portions of the OpenFAST software, different hotspots were identified. In all cases and environment settings, the majority of the CPU time was spent in the fast_solution loop, which is a high-level subroutine that coordinates the solution calculation from each physics module.

5.1.6.6.2.1.1.1. LAPACK

In the offshore case, LAPACK usage was identified as a significant contributor to run time. Within the fast_solution loop, the calls to the LAPACK function dgetrs consume 3.3% of the total CPU time.

[Figure: offshore_lapack.png]
5.1.6.6.2.1.1.2. BeamDyn

While BeamDyn provides a high-fidelity blade-response calculation, it is a computationally expensive module. Initial profiling highlighted the bd_elementmatrixga2 subroutine as a hotspot. However, initial attempts to improve performance in BeamDyn revealed the need for algorithmic improvements and refinements to the module’s data structures.

5.1.6.6.2.1.2. Results

Though work is ongoing, OpenFAST time-to-solution performance has improved and the performance potential is better understood.

Some key outcomes from the first year of the IPCC project are as follows:

  • Use of Intel® compiler and MKL library provides dramatic speedup over GCC and LAPACK

    • Additional significant gains are possible through MKL threading for offshore simulations

  • Offshore-wind-turbine simulations are poorly load balanced across modules

    • Land-based-turbine configuration better balanced

    • OpenMP Tasks are employed to achieve better load-balancing

  • OpenMP module-level parallelism provides significant, but limited speed up due to imbalance across different module tasks

  • Core algorithms need significant modification to enable OpenMP and SIMD benefits

Tuning the Intel® tools to perform best on NREL’s hardware and adding high level multithreading yielded a maximum 3.8x time-to-solution improvement for one of the benchmark cases.

5.1.6.6.2.1.2.1. Speedup - Intel® Compiler and MKL

By employing the standard Intel® developer tools tech stack, a performance improvement over GNU tools was demonstrated:

Compiler   | Math Library      | 5MW_Land_BD_DLL_WTurb | 5MW_OC4Jckt_DLL_WTurb_WavesIrr_MGrowth
-----------|-------------------|-----------------------|---------------------------------------
GNU        | LAPACK            | 2265 s (1.0x)         | 673 s (1.0x)
Intel® 17  | LAPACK            | 1650 s (1.4x)         | 251 s (2.7x)
Intel® 17  | MKL               | 1235 s (1.8x)         |
Intel® 17  | MKL Multithreaded | 722 s (3.1x)          |

5.1.6.6.2.1.2.2. Speedup - OpenMP at FAST_Solver

A performance improvement was demonstrated by adding OpenMP directives to the FAST_Solver module. Although the solution scheme is not well balanced, parallelizing mesh mapping and calculation routines resulted in the following speedup:

Compiler   | Math Library     | 5MW_Land_BD_DLL_WTurb | 5MW_OC4Jckt_DLL_WTurb_WavesIrr_MGrowth
-----------|------------------|-----------------------|---------------------------------------
Intel® 17  | MKL - 1 thread   | 1073 s (2.1x)         | 100 s (6.7x)
Intel® 17  | MKL - 8 threads  | 597 s (3.8x)          |

5.1.6.6.2.1.3. Ongoing Work

The next phase of the OpenFAST performance improvements is focused on two key areas:

  1. Implementing the outcomes from previous work throughout OpenFAST modules and glue codes

  2. Preparing OpenFAST for efficient execution on Intel®’s next generation platforms

5.1.6.6.2.2. Linearization routine profiling

As part of the ARPA-E WEIS project, the linearization capability within OpenFAST has been profiled in an effort to characterize the performance and identify current bottlenecks. This work specifically targeted the linearization routines within the FAST Library, primarily in FAST_Lin.f90, as well as the routines constructing the Jacobian matrices within individual physics modules. Because these routines require constructing large matrices, this is a computationally intensive process with a high rate of memory access.

A high-level flow of data in the linearization algorithm in the FAST_Linearize_OP subroutine is given below.

Construct Module Jacobian
    → Calculate Module OP
        → Construct Glue Code Jacobians
        → Construct Glue Code State Matrices

Each enabled physics module constructs module-level matrices in its respective <Module>_Jacobian and <Module>_GetOP routines, and these are then assembled into global matrices in Glue_Jacobians and Glue_StateMatrices. In a top-down comparison of total CPU time in FAST_Linearize_OP, the construction of the glue-code state matrices is the most expensive step. The HydroDyn Jacobian computation also stands out relative to the other module Jacobian computations.

[Figure: TopDown_FAST_LinearizeOP.jpg]

The Jacobian and state matrices are sized based on the total number of inputs, outputs, and continuous states. Though the size varies, these matrices generally contain thousands of elements in each dimension and are mostly zeroes. That is to say, the Jacobian and state matrices are large and sparse. To reduce the overhead of memory allocation and access, a sparse matrix representation is recommended.
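A minimal compressed sparse row (CSR) sketch is given below as one possible representation; the derived type and its fields are illustrative and are not part of OpenFAST. Only the nonzero entries are stored, so memory use scales with the number of nonzeros rather than with the full matrix dimensions, and a matrix-vector product touches only the stored entries.

program csr_demo

    implicit none

    ! Compressed sparse row storage: row i occupies entries row_start(i) through row_start(i+1)-1.
    type :: CSRMatrix
        integer :: n_rows = 0, n_cols = 0
        integer, allocatable :: row_start(:)   ! size n_rows + 1
        integer, allocatable :: col_index(:)   ! column index of each stored nonzero
        real(8), allocatable :: value(:)       ! value of each stored nonzero
    end type CSRMatrix

    type(CSRMatrix) :: jac
    real(8) :: x(3), y(3)
    integer :: i, k

    ! Store the 3x3 matrix [2 0 0; 0 0 1; 0 3 0], which has three nonzeros.
    jac%n_rows = 3
    jac%n_cols = 3
    jac%row_start = [1, 2, 3, 4]
    jac%col_index = [1, 3, 2]
    jac%value     = [2.0d0, 1.0d0, 3.0d0]

    ! Sparse matrix-vector product y = jac * x visits only the stored nonzeros.
    x = [1.0d0, 2.0d0, 3.0d0]
    y = 0.0d0
    do i = 1, jac%n_rows
        do k = jac%row_start(i), jac%row_start(i + 1) - 1
            y(i) = y(i) + jac%value(k) * x(jac%col_index(k))
        end do
    end do

    print *, y   ! expected: 2.0  3.0  6.0

end program csr_demo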