FAQ: Performance analysis tools

This FAQ is for Open MPI v4.x and earlier.
If you are looking for documentation for Open MPI v5.x and later, please visit docs.open-mpi.org.

Table of contents:

What is special about MPI performance analysis?
What are "profiling" and "tracing"?
How do I sort out busy wait time from idle wait, user time from system time, and so on?
What is PMPI?
Should I use those switches --enable-mpi-profile and --enable-trace when I configure OMPI?
What support does OMPI have for performance analysis?
How do I view VampirTrace output?
Are there MPI performance analysis tools for OMPI that I can download for free?
Any other kinds of tools I should know about?

1. What is special about MPI performance analysis?

The synchronization among the MPI processes can be a key performance concern. For example, if a serial program spends a lot of time in function foo(), you should optimize foo(). In contrast, if an MPI process spends a lot of time in MPI_Recv(), not only is the optimization target probably not MPI_Recv(), but you should in fact probably be looking at some other process altogether. You should ask, "What is happening on other processes when this process has the long wait?"

Another issue is that a parallel program (in the case of MPI, a multi-process program) can generate much more performance data than a serial program due to the greater number of execution threads. Managing that data volume can be a challenge.

2. What are "profiling" and "tracing"?

These terms are sometimes used to refer to two different kinds of performance analysis.

In profiling, one aggregates statistics at run time — e.g., total amount of time spent in MPI, total number of messages or bytes sent, etc. Data volumes are small.

In tracing, an event history is collected. It is common to display such event history on a timeline display. Tracing data can provide much interesting detail, but data volumes are large.
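As a rough illustration of the difference, the sketch below records the same timed events in both styles. It is plain C with no MPI calls, and every name in it (profile_t, trace_t, TRACE_CAPACITY, and the two record functions) is hypothetical, chosen only for this example. The profile stays the same size no matter how many events arrive, while the trace grows by one record per event.

```c
#include <stddef.h>

#define TRACE_CAPACITY 1024  /* arbitrary limit chosen for this sketch */

/* Profiling: one small summary, updated in place for every event. */
typedef struct {
    long   calls;          /* number of events seen */
    double total_seconds;  /* accumulated duration  */
} profile_t;

/* Tracing: one record per event, kept in order of arrival. */
typedef struct {
    double start;
    double end;
} trace_event_t;

typedef struct {
    trace_event_t events[TRACE_CAPACITY];
    size_t        count;
} trace_t;

/* Fold one event into the running totals; storage does not grow. */
void profile_record(profile_t *p, double start, double end)
{
    p->calls += 1;
    p->total_seconds += end - start;
}

/* Append one event to the history; returns -1 once the buffer is full. */
int trace_record(trace_t *t, double start, double end)
{
    if (t->count == TRACE_CAPACITY) {
        return -1;
    }
    t->events[t->count].start = start;
    t->events[t->count].end = end;
    t->count += 1;
    return 0;
}
```

A real tracing tool would typically flush records to a file rather than stop at a fixed capacity; the fixed buffer here only keeps the sketch short.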

3. How do I sort out busy wait time from idle wait, user time from system time, and so on?

Don't.

MPI synchronization delays, which are key performance inhibitors you will probably want to study, can show up as user or system time, depending on the MPI implementation, the type of wait, the run-time settings you have chosen, etc. In many cases, it makes the most sense simply to distinguish between time spent inside MPI and time spent outside MPI.

Elapsed wall clock time will probably be your key metric. Exactly how the MPI implementation spends time waiting is less important.

4. What is PMPI?

PMPI refers to the MPI standard profiling interface.

Each standard MPI function can be called with an MPI_ or PMPI_ prefix. For example, you can call either MPI_Send() or PMPI_Send(). This feature of the MPI standard allows one to write functions with the MPI_ prefix that call the equivalent PMPI_ function. Specifically, a function so written has the behavior of the standard function plus any other behavior one would like to add. This is important for MPI performance analysis in at least two ways.

First, many performance analysis tools take advantage of PMPI. They capture the MPI calls made by your program. They perform the associated message-passing calls by calling PMPI functions, but also capture important performance data.

Second, you can use such wrapper functions to customize MPI behavior. E.g., you can add barrier operations to collective calls, write out diagnostic information for certain MPI calls, etc.

OMPI generally layers the various function interfaces as follows:

Fortran MPI_ interfaces are weak symbols for...
Fortran PMPI_ interfaces, which call...
C MPI_ interfaces, which are weak symbols for...
C PMPI_ interfaces, which provide the specified functionality.

Since OMPI generally implements MPI functionality for all languages in C, you only need to provide profiling wrappers in C, even if your program is in another programming language. Alternatively, you may write the wrappers in your program's language, but if you provide wrappers in both languages then both sets will be invoked.

There are a handful of exceptions. For example, MPI_ERRHANDLER_CREATE() in Fortran does not call MPI_Errhandler_create(). Instead, it calls some other low-level function. Thus, to intercept this particular Fortran call, you need a Fortran wrapper.

If you package your wrappers as a library, be sure to make it a dynamic (shared) library. A static library can experience the linker problems described in the Complications section of the Profiling Interface chapter of the MPI standard.

See the section on Profiling Interface in the MPI standard for more details.

5. Should I use those switches --enable-mpi-profile and --enable-trace when I configure OMPI?

Probably not.

The --enable-mpi-profile switch enables building of the PMPI interfaces. While this is important for performance analysis, this setting is already turned on by default.

The --enable-trace switch enables internal tracing of OMPI/ORTE/OPAL calls. It is used only for developer debugging, not MPI application performance tracing.

6. What support does OMPI have for performance analysis?

The OMPI source base has some instrumentation to capture performance data, but that data must be analyzed by other non-OMPI tools.

PERUSE was a proposed MPI standard for exposing information about the low-level behavior of MPI internals. Check the PERUSE web site for any information about analysis tools. To use PERUSE, be sure to configure OMPI with --enable-peruse. Information is available describing its integration with OMPI.

Unfortunately, PERUSE didn't win standardization, so it didn't really go anywhere. Open MPI may drop PERUSE support at some point in the future.

MPI-3 standardized the MPI_T tools interface (see Chapter 14 in the MPI-3.0 specification). Open MPI fully supports MPI_T starting with v1.7.3.

VampirTrace traces the entry to and exit from the MPI layer, along with important performance data, writing data using the open OTF format. VT is available freely and can be used with any MPI. Information is available describing its integration with OMPI.

7. How do I view VampirTrace output?

While OMPI includes VampirTrace instrumentation, it does not provide a tool for viewing OTF trace data. It provides only a primitive otfdump utility, installed in the same directory as the other OMPI commands (mpicc, mpirun, etc.).

Another simple utility, otfprofile, comes with OTF software and allows you to produce a short profile in LaTeX format from an OTF trace.

The main way to view OTF data is with the Vampir tool. Evaluation licenses are available.

8. Are there MPI performance analysis tools for OMPI that I can download for free?

The OMPI distribution includes no such tools, but some general MPI tools can be used with OMPI.

...we used to maintain a list of links here. But the list changes over time; projects come, and projects go. Your best bet these days is simply to use Google to find MPI tracing and performance analysis tools.

9. Any other kinds of tools I should know about?

Well, there are other tools you should consider. Part of performance analysis is not just analyzing performance per se, but generally understanding the behavior of your program.

As such, debugging tools can help you step through or pry into the execution of your MPI program. Popular tools include TotalView, which can be downloaded for free trial use, and Arm DDT, which also provides evaluation copies.

The command-line job inspection tool padb has been ported to ORTE and OMPI.
