MPI stands for the Message Passing Interface. Written by the
MPI Forum (a large committee comprising a cross-section of
industry and research representatives), MPI is a standardized API
typically used for parallel and/or distributed computing. The MPI
standard has been published multiple times:
MPI-1.0 (published in 1994).
MPI-2.0 (published in 1997). MPI-2.0 is, for the most part,
additions and extensions to the original MPI-1.0 specification.
MPI-2.1 and MPI-2.2 were subsequently published, and contain
minor fixes, changes, and additions compared to MPI-2.0.
MPI-3.0 (published in 2012).
MPI-3.1 was subsequently published, and contains minor fixes, changes, and
additions compared to MPI-3.0.
All MPI specifications documents can be downloaded from the official
MPI Forum web site: http://www.mpi-forum.org/.
Open MPI is an open source, freely available implementation of the MPI
specifications. The Open MPI software achieves high performance; the
Open MPI project is quite receptive to community input.
2. Where can I learn about MPI? Are there tutorials available?
There are many resources available on the internet for
learning MPI.
The definitive reference for MPI is the MPI Forum Web site. It has
copies of the MPI standards documents and all of the errata. This is
not recommended for beginners, but is an invaluable reference.
Several books on MPI are available (search your favorite book
sellers for availability):
MPI: The Complete Reference, Marc Snir et al. (an annotated
version of the MPI-1 and MPI-2 standards; a 2 volume set,
also known as "The orange book" and "The yellow
book")
Using MPI, William Gropp et al. (2nd edition, also known as
"The purple book")
3. What are the goals of the Open MPI Project?
Create a free, open source, peer-reviewed, production-quality
complete MPI implementation.
Provide extremely high, competitive performance (latency,
bandwidth, ...pick your favorite metric).
Directly involve the HPC community with external development
and feedback (vendors, 3rd party researchers, users, etc.).
Provide a stable platform for 3rd party research and commercial
development.
Help prevent the "forking problem" common to other MPI
projects.
Support a wide variety of HPC platforms and environments.
In short, we want to work with and for the HPC community to make a
world-class MPI implementation that can be used on a huge number and
kind of systems.
4. Will you allow external involvement?
ABSOLUTELY.
Bringing together smart researchers and developers to work on a common
product is not only a good idea, it's the open source model. Merging
the multiple MPI implementation teams has worked extremely well for us
over the past year — extending this concept to the HPC open source
community is the next logical step.
The component architecture that Open MPI is founded upon (see the
"Publications" link for papers about this) is designed to foster 3rd
party collaboration by enabling independent developers to use Open MPI
as a production quality research platform. Although Open MPI is a
relatively large code base, it is rarely necessary to learn much more
than the interfaces for the component type which you are
implementing. Specifically, the component architecture was designed
to allow small, discrete implementations of major portions of MPI
functionality (e.g., point-to-point messaging, collective
communications, run-time environment support, etc.).
We envision at least the following forms of collaboration:
Peer review of the Open MPI code base
Discussion with Open MPI developers on public mailing lists
Direct involvement from HPC software and hardware vendors
3rd parties writing and providing their own Open MPI
components
That being said, although we are an open source project, we recognize
that not everyone provides free, open source software. Our
collaboration models allow (and encourage!) 3rd parties to write and
distribute their own components — perhaps with a different license,
and perhaps even as closed source. This is all perfectly acceptable
(and desirable!).
6. I want to redistribute Open MPI. Can I?
Absolutely.
NOTE: We are not lawyers and this is not legal advice.
Please read the Open MPI
license (the BSD license). It contains extremely liberal
provisions for redistribution.
7. Preventing forking is a goal; how will you enforce that?
By definition, we can't. If someone really wants to fork the Open MPI code base, they can.
By virtue of our extremely liberal license, it is possible for
anyone to fork at any time.
However, we hope that no one does.
We intend to distinguish ourselves from other projects by:
Working with the HPC community to accept best-in-breed
improvements and functionality enhancements.
Providing a flexible framework and set of APIs that allow a
wide variety of different goals within the same code base through the
combinatorial effect of mixing-and-matching different components.
Hence, we hope that no one ever has a reason to fork the main code
base. We intend to work with the community to accept the best
improvements back into the main code base. And if some developers
want to do things to the main code base that are different from the
goals of the main Open MPI Project, it is our hope that they can do
what they need in components that can be distributed without forking
the main Open MPI code base.
Only time will tell if this ambitious plan is feasible, but we're going
to work hard to make it a reality!
8. How are 3rd party contributions handled?
Before accepting any code from 3rd parties, we require an original
signed contribution agreement from the donator.
These agreements assert that the contributor has the right to donate
the code and allow the Open MPI Project to perpetually distribute it
under the project's
licensing terms.
This prevents a situation where intellectual property gets into the
Open MPI code base and then someone later claims that we owe them
money for it. Open MPI is a free, open source code base. And we
intend it to remain that way.
9. Is this just YAMPI (yet another MPI implementation)?
No!
Open MPI initially represented the merger between three well-known MPI
implementations (none of which are being developed any more):
FT-MPI from the University of Tennessee
LA-MPI from Los Alamos National Laboratory
LAM/MPI from Indiana University
with contributions from the PACX-MPI team at the University of
Stuttgart.
Each of these MPI implementations excelled in one or more areas. The
driving motivation behind Open MPI is to bring the best ideas and
technologies from the individual projects and create one world-class
open source MPI implementation that excels in all areas.
Open MPI started by taking the best ideas from these four MPI
implementations and porting them to an entirely new code base. This
also enabled us to jettison the old, crufty code from each project
that was maintained only for historical reasons. We started with a
clean slate and decided to "do it Right this time." As such, Open MPI
also contains many new designs
and methodologies based on (literally) years of MPI implementation
experience.
After version 1.0 was released, the Open MPI Project grew to include
many other
members who have each brought their knowledge, expertise, and
resources to Open MPI. Open MPI is now far more than just
the best ideas of the four founding MPI implementation projects.
10. But I love [FT-MPI | LA-MPI | LAM/MPI | PACX-MPI]!
Why should I use Open MPI?
Here are a few reasons:
Open MPI represents the next generation of each of these
implementations.
Open MPI effectively contains the union of features from each of
the previous MPI projects. If you find a feature in one of the prior
projects that is not in Open MPI, chances are that it will be
soon.
The vast majority of our future research and development work will
be in Open MPI.
All the same developers from your favorite project are working on
Open MPI.
Not to worry — each of the respective teams has a vested interest in
bringing over the "best" parts of their prior implementation to Open
MPI. Indeed, we would love to migrate each of our current user bases
to Open MPI as their time, resources, and constraints allow.
In short: we believe that Open MPI — its code, methodology, and open
source philosophy — is the future.
11. What will happen to the prior projects?
Only time will tell (we cannot predict the future), but it is
likely that each project will eventually either end when funding stops
or be used exclusively as a research vehicle. Indeed, some of the
projects must continue to exist at least until their existing
funding expires.
12. What operating systems does Open MPI support?
We primarily develop Open MPI on Linux and OS X.
Other operating systems are supported, however. The exact list of operating
systems supported has changed over time (e.g., native Microsoft
Windows support was added in v1.3.3, and although it was removed prior
to v1.8, Windows is still supported through Cygwin). See the README file in
your copy of Open MPI for a listing of the OSes that that version
supports.
Open MPI is fairly POSIX-neutral, so it will run without too many
modifications on most POSIX-like systems. Hence, if we haven't listed
your favorite operating system here, it should not be difficult to get
Open MPI to compile and run properly. The biggest obstacle is
typically the assembly language, but that's fairly modular and we're
happy to provide information about how to port it to new platforms.
It should be noted that we are quite open to accepting patches for
operating systems that we do not currently support. If we do not have
systems to test these on, we probably will only claim to
"unofficially" support those systems.
13. What hardware platforms does Open MPI support?
Essentially all the common platforms that the operating
systems listed in the previous question support.
For example, Linux runs on a wide variety of platforms, and we
certainly can't claim to support all of them. Open MPI includes
Linux-compiler-based assembly for support of Intel, AMD, and PowerPC
chips, for example.
14. What network interconnects does Open MPI support?
Open MPI is based upon a component architecture; support for its MPI
point-to-point functionality only utilizes a small number of components
at run-time. Open MPI was specifically designed to make adding native
support for a new network interconnect easy.
The list of supported interconnects has changed over time. You should
consult your copy of Open MPI to see exactly which interconnects it
supports. The table below shows various interconnects and the
versions in which they were supported in Open MPI (in alphabetical
order):
15. What run-time environments does Open MPI support?
Open MPI is layered on top of the Open Run-Time Environment (ORTE),
which originally started as a small portion of the Open MPI code base.
However, ORTE has effectively spun off into its own sub-project.
ORTE is a modular system that was specifically architected to abstract
away the back-end run-time environment (RTE) system, providing a
neutral API to the upper-level Open MPI layer. Components can be
written for ORTE that allow it to natively utilize a wide variety of
back-end RTEs.
ORTE currently natively supports the following run-time environments:
Recent versions of BProc (e.g., Clustermatic, pre-1.3 only)
Prior to Open MPI v1.3, Platform (which is now IBM) released a script-based integration
in the LSF 6.1 and 6.2 maintenance packs around November of 2006. If
you want this integration, please contact your normal IBM support
channels.
17. How much MPI does Open MPI support?
Open MPI 1.2 supports all of MPI-2.0.
Open MPI 1.3 supports all of MPI-2.1.
Open MPI 1.8 supports all of MPI-3.
Starting with v2.0, Open MPI supports all of MPI-3.1.
18. Is Open MPI thread safe?
Support for MPI_THREAD_MULTIPLE (i.e., multiple threads
executing within the MPI library) and asynchronous message passing
progress (i.e., continuing message passing operations even while no
user threads are in the MPI library) has been designed into Open MPI
from its first planning meetings.
Support for MPI_THREAD_MULTIPLE was included in the first version of
Open MPI, but it only became robust around v3.0.0. Subsequent
releases continually improve reliability and performance of
multi-threaded MPI applications.
19. Does Open MPI support 32 bit environments?
As far as we know, yes. 64 bit architectures have effectively
taken over the world, though, so 32-bit is not tested nearly as much
as 64-bit.
Specifically, most of the Open MPI developers only have 64-bit
machines, and therefore only test 32-bit in emulation mode.
20. Does Open MPI support 64 bit environments?
Yes, Open MPI is 64 bit clean. You should be able to use Open
MPI on 64 bit architectures and operating systems with no
difficulty.
21. Does Open MPI support execution in heterogeneous environments?
As of v1.1, Open MPI requires that the size of C, C++, and
Fortran datatypes be the same on all platforms within a single
parallel application, with the exception of types represented by
MPI_BOOL and MPI_LOGICAL — size differences in these types
between processes are properly handled. Endian differences between
processes in a single MPI job are properly and automatically handled.
Prior to v1.1, Open MPI did not include any support for data size or
endian heterogeneity.
22. Does Open MPI support parallel debuggers?
Yes. Open MPI supports the TotalView API for parallel process
attaching, which several parallel debuggers support (e.g., DDT, fx2).
As part of v1.2.4 (released in September 2007), Open MPI also supports the
TotalView API for viewing message queues in running MPI processes.
See this FAQ entry for
details on how to run Open MPI jobs under TotalView, and this FAQ entry for
details on how to run Open MPI jobs under DDT.
NOTE: The integration of Open
MPI message queue support is problematic with 64 bit versions of
TotalView prior to v8.3:
The message queues views will be truncated.
Both the communicators and requests list will be incomplete.
Both the communicators and requests list may be filled with wrong
values (such as an MPI_Send to the destination ANY_SOURCE).
There are two workarounds:
Use a 32 bit version of TotalView
Upgrade to TotalView v8.3
23. Can I contribute to Open MPI?
YES!
One of the main goals of the Open MPI project is to involve the
greater HPC community.
There are many ways to contribute to Open MPI. Here are a few:
Write your own components and distribute them yourself (i.e.,
outside of the main Open MPI distribution)
Write your own components and contribute them back to the main
code base
Contribute bug fixes and feature enhancements to the main code
base
24. I found a bug! How do I report it?
First check that this is not already a known issue by checking
the FAQ and the
mailing list archives. If you
can't find your problem mentioned anywhere, it is most helpful if you
can create a "recipe" to replicate the bug.
Please see the Getting
Help page for more details on submitting bug reports.
We need to have an established intellectual property pedigree of the
code in Open MPI. This means being able to ensure that all code
included in Open MPI is free, open source, and able to be distributed
under the BSD license.
This prevents a situation where intellectual property gets into the
Open MPI code base and then someone later claims that we owe them
money for it. Open MPI is a free, open source code base. And we
intend it to remain that way.
We enforce this policy by requiring all git commits to include a
"Signed-off-by" token in the commit message, indicating your
agreement to the Open
MPI Contributor's Declaration.
27. I can't submit an Open MPI Third Party Contribution Agreement;
how can I contribute to Open MPI?
This question is obsolete (as of November 2016). The Open MPI
project used to require a signed Open MPI Third Party Contribution
Agreement before we could accept code contributions.
If you are unable to agree to the Contributor's Declaration, fear not —
there are
other ways to contribute to Open MPI. Here are some examples:
Become an active participant in the mailing lists
Write and distribute your own components (remember: Open MPI
components can be distributed completely separately from the main Open
MPI distribution — they can be added to existing Open MPI
installations, and don't even need to be open source)
Report bugs
Do a good deed daily
28. What if I don't want my contribution to be free / open source?
No problem.
While we are creating free / open-source software, and we would prefer
if everyone's contributions to Open MPI were also free / open-source,
we certainly recognize that other organizations have different goals
from us. Such is the reality of software development in today's
global economy.
As such, it is perfectly acceptable to make non-free / non-open-source
contributions to Open MPI.
We obviously cannot accept such contributions into the main code base,
but you are free to distribute plugins, enhancements, etc. as you see
fit. Indeed, the BSD
license is extremely liberal in its redistribution provisions.
Although Open MPI's
license allows third parties to fork the code base, we would
strongly prefer if you did not. Forking is not necessarily a Bad
Thing, but history has shown that creating too many forks in MPI
implementations leads to massive user and system administrator
confusion. We have personally seen parallel environments loaded with
tens of MPI implementations, each only slightly different from the
others. The users then become responsible for figuring out which MPI
they want / need to use, which can be a daunting and confusing task.
We do periodically have "short" forks. Specifically, sometimes an
organization needs to release a version of Open MPI with a specific
feature.
If you're thinking of forking the Open MPI code base, please let us
know — let's see if we can work something out so that it is not
necessary.
30. Rats! My contribution was not accepted into the main Open MPI
code base. What now?
If your contribution was not accepted into the main Open MPI
code base, there are likely to be good reasons for it (perhaps
technical, perhaps due to licensing restrictions, etc.).
If you wrote a standalone component, you can still distribute this
component independent of the main Open MPI distribution. Open MPI
components can be installed into existing Open MPI installations. As
such, you can distribute your component — even if it is closed source
(e.g., distributed as binary-only) — via any mechanism you choose,
such as on a web site, FTP site, etc.
31. Open MPI terminology
Open MPI is a large project containing many different
sub-systems and a relatively large code base. Let's first cover some
fundamental terminology in order to make the rest of the discussion
easier.
Open MPI has three sections of code:
OMPI: The MPI API and supporting logic
ORTE: The Open Run-Time Environment (support for different
back-end run-time systems)
OPAL: The Open Portable Access Layer (utility and "glue" code
used by OMPI and ORTE)
There are strict abstraction barriers in the code between these
sections. That is, they are compiled into three separate libraries:
libmpi, libopen-rte, and libopen-pal, with a strict dependency order:
OMPI depends on ORTE and OPAL, and ORTE depends on OPAL. More
specifically, OMPI executables are linked with:
shell$ mpicc myapp.c -o myapp
# This actually turns into:
shell$ cc myapp.c -o myapp -lmpi -lopen-rte -lopen-pal ...
More system-level libraries may be listed after -lopen-pal, but you get the
idea.
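If you want to see exactly what your own installation's wrapper compiler
does, the wrapper's --showme options will print the underlying command
instead of running it. The paths and library list shown are illustrative;
the exact output depends on your version and configuration.
shell$ mpicc --showme myapp.c -o myapp
# Prints (but does not run) the underlying command, roughly:
#   cc myapp.c -o myapp -I<prefix>/include -L<prefix>/lib -lmpi -lopen-rte -lopen-pal ...
shell$ mpicc --showme:link
# Prints only the linker flags that the wrapper would add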
Strictly speaking, these are not "layers" in the classic software
engineering sense (even though it is convenient to refer to them as
such). They are listed above in dependency order, but that does not
mean that, for example, the OMPI code must go through the ORTE and
OPAL code in order to reach the operating system or a network
interface.
As such, this code organization more reflects abstractions and
software engineering, not a strict hierarchy of functions that must be
traversed in order to reach a lower layer. For example, OMPI can call
OPAL functions directly — it does not have to go through ORTE.
Indeed, OPAL has a different set of purposes than ORTE, so it wouldn't
even make sense to channel all OPAL access through ORTE. OMPI can
also directly call the operating system as necessary. For example,
many top-level MPI API functions are quite performance sensitive; it
would not make sense to force them to traverse an arbitrarily deep
call stack just to move some bytes across a network.
Here's a list of terms that are frequently used in discussions about
the Open MPI code base:
MCA: The Modular Component Architecture (MCA) is the foundation
upon which the entire Open MPI project is built. It provides all the
component architecture services that the rest of the system uses.
Although it is the fundamental heart of the system, its
implementation is actually quite small and lightweight — it is
nothing like CORBA, COM, JINI, or many other well-known component
architectures. It was designed for HPC — meaning that it is small,
fast, and reasonably efficient — and therefore offers few services
other than finding, loading, and unloading components.
Framework: An MCA framework is a construct that is created
for a single, targeted purpose. It provides a public interface that
is used by external code, but it also has its own internal services. A
list of Open MPI frameworks is available here. An MCA
framework uses the MCA's services to find and load components at run-time
— implementations of the framework's interface. An easy example
framework to discuss is the MPI framework named "btl", or the Byte
Transfer Layer. It is used to send and receive data on different
kinds of networks. Hence, Open MPI has btl components for shared
memory, TCP, Infiniband, Myrinet, etc.
Component: An MCA component is an implementation of a
framework's interface. Another common word for component is
"plugin". It is a standalone collection of code that can be bundled
into a plugin that can be inserted into the Open MPI code base, either
at run-time and/or compile-time.
Module: An MCA module is an instance of a component (in the
C++ sense of the word "instance"; an MCA component is analogous to a
C++ class). For example, if a node running an Open MPI application has
multiple ethernet NICs, the Open MPI application will contain one TCP
btl component, but two TCP btl modules. This difference between
components and modules is important because modules have private state;
components do not.
Frameworks, components, and modules can be dynamic or static. That is,
they can be available as plugins or they may be compiled statically
into libraries (e.g., libmpi).
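A convenient way to see the frameworks and components in your own
installation is the ompi_info command. The component names below are only
examples; the actual list depends on your version and how it was configured.
shell$ ompi_info | grep "MCA btl"
# One line per btl component that this installation can load
# (e.g., self, sm, tcp, ...)
shell$ ompi_info --param btl tcp --level 9
# Lists the run-time parameters exposed by the tcp btl component
# (--level 9 applies to v1.8 and later; see the --level discussion later in this FAQ)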
32. How do I get a copy of the most recent source code?
34. What is the main tree layout of the Open MPI source tree? Are
there directory name conventions?
There are a few notable top-level directories in the source
tree:
config/: M4 scripts supporting the top-level configure script
etc/: Some miscellaneous text files
include/: Top-level include files that will be installed
ompi/: The Open MPI code base
orte/: The Open RTE code base
opal/: The OPAL code base
Each of the three main source directories (ompi/, orte/, and
opal/) generates a top-level library (libmpi, libopen-rte, and
libopen-pal, respectively). They can be built as either static or shared
libraries. Executables are also produced in subdirectories of some of
the trees.
Each of the sub-project source directories has a similar (but not
identical) directory structure under it:
class/: C++-like "classes" (using the OPAL class system)
specific to this project
include/: Top-level include files specific to this project
mca/: MCA frameworks and components specific to this project
runtime/: Startup and shutdown of this project at runtime
tools/: Executables specific to this project (currently none in
OPAL)
util/: Random utility code
There are other top-level directories in each of the three
sub-projects, each having to do with specific logic and code for that
project. For example, the MPI API implementations can be found under
ompi/mpi/LANGUAGE, where
LANGUAGE is c, cxx, f77, and f90.
The layout of the mca/ trees is strictly defined. They are of the
form:
<project>/mca/<framework name>/<component name>/
To be explicit: it is forbidden to have a directory under the mca
trees that does not meet this template (with the exception of base
directories, explained below). Hence, only framework and component
code can be in the mca/ trees.
That is, framework and component names must be valid directory names
(and C variables; more on that later). For example, the TCP BTL
component is located in the following directory:
# In v1.6.x and earlier:
ompi/mca/btl/tcp/
# In v1.7.x and later:
opal/mca/btl/tcp/
The name base is reserved; there cannot be a framework or component
named "base." Directories named base are reserved for the
implementation of the MCA and frameworks. Here are a few examples (as
of the v1.8 source tree):
# Main implementation of the MCA
opal/mca/base
# Implementation of the btl framework
opal/mca/btl/base
# Implementation of the rml framework
orte/mca/rml/base
# Implementation of the pml framework
ompi/mca/pml/base
Under these mandated directories, frameworks and/or components may have
arbitrary directory structures, however.
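For example, listing one of the mca/ trees in a source checkout makes the
convention easy to see. The exact set of component directories varies from
version to version.
shell$ ls opal/mca/btl
base  self  sm  tcp  ...
# "base" holds the framework's own glue code; every other directory
# is exactly one component of the btl framework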
35. Is there more information available?
Yes. In early 2006, Cisco hosted an Open MPI workshop where
the Open MPI Team provided several days of intensive
dive-into-the-code tutorials. The slides from these tutorials are available here.
Additionally, Greenplum videoed several Open MPI developers
discussing Open MPI internals in 2012. The videos are available here.
36. I'm a sysadmin; what do I care about Open MPI?
Several members of the Open MPI team have strong system
administrator backgrounds; we recognize the value of having software
that is friendly to system administrators. Here are some of the reasons
that Open MPI is attractive for system administrators:
Simple, standards-based installation
Reduction of the number of MPI installations
Ability to set system-level and user-level parameters
Scriptable information sources about the Open MPI installation
See the rest of the questions in this FAQ section for more details.
37. What hardware / software / run-time environments / networks
does Open MPI support?
Open MPI can handle a variety of different run-time environments
(e.g., rsh/ssh, Slurm, PBS, etc.) and a variety of different
interconnection networks (e.g., ethernet, Myrinet, Infiniband, etc.)
in a single installation. Specifically: because Open MPI is
fundamentally powered by a component architecture, plug-ins for all
these different run-time systems and interconnect networks can be
installed in a single installation tree. The relevant plug-ins will
only be used in the environments where they make sense.
Hence, there is no need to have one MPI installation for Myrinet, one
MPI installation for ethernet, one MPI installation for PBS, one MPI
installation for rsh, etc. Open MPI can handle all of these in a
single installation.
However, there are some issues that Open MPI cannot solve. Binary
compatibility between different compilers is such an issue. Let's
examine this on a per-language basis (be sure to see the big caveat at
the end):
C: Most C compilers are fairly compatible, such that if you compile
Open MPI with one C compiler and link it to an application that was
compiled with a different C compiler, everything "should just work."
As such, a single installation of Open MPI should work for most C MPI
applications.
C++: The same is not necessarily true for C++. Most of Open
MPI's C++ code is simply the MPI C++ bindings, and in the default
build, they are inlined C++ code, meaning that they should compile on
any C++ compiler. Hence, you should be able to have one Open MPI
installation for multiple different C++ compilers (we'd like to hear
feedback either way). That being said, some of the top-level Open MPI
executables are written in C++ (e.g., mpicc, ompi_info, etc.). As
such, these applications may require having the C++ run-time support
libraries of whatever compiler they were created with in order to run
properly. Specifically, if you compile Open MPI with the XYZ C/C++
compiler, you may need to have the XYZ C++ run-time libraries
installed everywhere you want to run mpicc or ompi_info.
Fortran 77: Fortran 77 compilers do something called "symbol
mangling," meaning that they change the names of global variables,
subroutines, and functions. There are 4 common name mangling schemes
in use by Fortran 77 compilers. On many systems (e.g., Linux), Open
MPI will automatically support all 4 schemes. As such, a single Open
MPI installation should just work with multiple different Fortran
compilers. However, on some systems, this is not possible (e.g., OS
X), and Open MPI will only support the name mangling scheme of the
Fortran 77 compiler that was identified during configure.
Also, there are two notable exceptions that do not work across
Fortran compilers that are "different enough":
The C constants MPI_F_STATUS_IGNORE and MPI_F_STATUSES_IGNORE
will only compare properly to Fortran applications that were
created with Fortran compilers that use the same
name-mangling scheme as the Fortran compiler that Open MPI was
configured with.
Fortran compilers may have different values for the logical
.TRUE. constant. As such, any MPI function that uses the
Fortran LOGICAL type may only get .TRUE. values back that
correspond to the .TRUE. value of the Fortran compiler that
Open MPI was configured with.
Fortran 90: Similar to C++, linking object files from different
Fortran 90 compilers is not likely to work. The F90 MPI module that
Open MPI creates will likely only work with the Fortran 90 compiler
that was identified during configure.
The big caveat to all of this is that Open MPI will only work with
different compilers if all the datatype sizes are the same. For
example, even though Open MPI supports all 4 name mangling schemes,
the size of the Fortran LOGICAL type may be 1 byte in some compilers
and 4 bytes in others. This will likely cause Open MPI to perform
unpredictably.
The bottom line is that Open MPI can support all manner of run-time
systems and interconnects in a single installation, but supporting
multiple compilers "sort of" works (i.e., is subject to trial and
error) in some cases, and definitely does not work in other cases.
There's unfortunately little that we can do about this — it's a
compiler compatibility issue, and one that compiler authors have
little incentive to resolve.
39. What are MCA Parameters? Why would I set them?
MCA parameters are a way to tweak Open MPI's behavior at
run-time. For example, MCA parameters can specify:
Which interconnect networks to use
Which interconnect networks not to use
The message size cutoff between eager sends and rendezvous protocol
sends
How many registered buffers to pre-pin (e.g., for GM or mVAPI)
The size of the pre-pinned registered buffers
...etc.
It can be quite valuable for a system administrator to play with such
values a bit and find an "optimal" setting for a particular
operating environment. These values can then be set in a global text
file that all users will, by default, inherit when they run Open MPI
jobs.
For example, say that you have a cluster with 2 ethernet networks —
one for NFS and other system-level operations, and one for MPI jobs.
The system administrator can tell Open MPI to not use the NFS TCP
network at a system level, such that when users invoke mpirun or
mpiexec to launch their jobs, they will automatically only be using
the network meant for MPI jobs.
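The same parameters can also be given to a single job on the mpirun
command line, which is handy for testing a value before committing it to
the system-wide file. The interface name and application name below are
hypothetical.
shell$ mpirun --mca btl_tcp_if_exclude lo,eth0 -np 16 ./my_mpi_app
# This one job's TCP support ignores the loopback device and eth0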
40. Do my users need to have their own installation of Open MPI?
Usually not. It is typically sufficient for a single Open MPI
installation (or perhaps a small number of Open MPI installations,
depending on compiler interoperability) to serve an entire parallel
operating environment.
Indeed, a system-wide Open MPI installation can be customized on a
per-user basis in two important ways:
Per-user MCA parameters: Each user can set their own set of MCA
parameters, potentially overriding system-wide defaults.
Per-user plug-ins: Users can install their own Open MPI
plug-ins under $HOME/.openmpi/components. Hence, developers can
experiment with new components without destabilizing the rest of the
users on the system. Or power users can download 3rd party components
(perhaps even research-quality components) without affecting other users.
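As a sketch of the first item above, a user can keep their own defaults in
$HOME/.openmpi/mca-params.conf, which Open MPI reads automatically; the
parameter chosen here is just an example.
shell$ mkdir -p $HOME/.openmpi
shell$ cat >> $HOME/.openmpi/mca-params.conf <<EOF
# Personal default: keep MPI traffic off a (hypothetical) management interface
btl_tcp_if_exclude = lo,eth0
EOF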
41. I have power users who will want to override my global MCA
parameters; is this possible?
Absolutely.
See the run-time tuning FAQ
category for information on how to set MCA parameters, both at the
system level and on a per-user (or per-MPI-job) basis.
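In general, more specific settings win: values given on the mpirun command
line override environment variables, which in turn override the per-user
and system-wide parameter files. A minimal sketch (the interface and
application names are hypothetical):
# System-wide default in $prefix/etc/openmpi-mca-params.conf:
#   btl_tcp_if_exclude = lo,eth0
shell$ export OMPI_MCA_btl_tcp_if_exclude=lo,eth1    # user override via the environment
shell$ mpirun --mca btl_tcp_if_exclude lo,eth2 -np 4 ./app
# The command-line value wins for this job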
42. What MCA parameters should I, the system administrator, set?
This is a difficult question and depends on both your specific
parallel setup and the applications that typically run there.
The best thing to do is to use the ompi_info command to see what
parameters are available and relevant to you. Specifically,
ompi_info can be used to show all the parameters that are available
for each plug-in. Two common places that system administrators like
to tweak are:
Only allow specific networks: Say you have a cluster with a
high-speed interconnect (such as Myrinet or Infiniband) and an
ethernet network. The high-speed network is intended for MPI jobs;
the ethernet network is intended for NFS and other
administrative-level jobs. In this case, you can simply turn off Open
MPI's TCP support. The "btl" framework contains Open MPI's network
support; in this case, you want to disable the tcp plug-in. You can
do this by adding the following line in the file
$prefix/etc/openmpi-mca-params.conf:
btl = ^tcp
This tells Open MPI to load all BTL components except tcp.
Consider another example: your cluster has two TCP networks, one for
NFS and administration-level jobs, and another for MPI jobs. You can
tell Open MPI to ignore the TCP network used by NFS by adding the
following line in the file $prefix/etc/openmpi-mca-params.conf:
btl_tcp_if_exclude = lo,eth0
The value of this parameter is the device names to exclude. In this
case, we're excluding lo (localhost, because Open MPI has its own
internal loopback device) and eth0.
Tune the parameters for specific networks: Each network plug-in
has a variety of different tunable parameters. Use the ompi_info
command to see what is available. You show all available parameters
with:
shell$ ompi_info --param all all
NOTE: Starting with Open MPI v1.8, ompi_info categorizes
its parameters in so-called levels, as defined by
the MPI_T interface. You will need to specify --level 9
(or --all) to show all MCA parameters. See
this blog entry
for further information.
shell$ ompi_info --param all all --level 9
or
shell$ ompi_info --all
Beware: there are many variables available. You can limit the
output by showing all the parameters in a specific framework or in a
specific plug-in with the command line parameters:
shell$ ompi_info --param btl all --level 9
Shows all the parameters of all BTL components, and:
shell$ ompi_info --param btl tcp --level 9
Shows all the parameters of just the tcp BTL component.
43. I just added a new plugin to my Open MPI installation; do I need to recompile all my MPI apps?
If your installation of Open MPI uses shared libraries and
components are standalone plug-in files, then no. If you add a new
component (such as support for a new network), Open MPI will simply
open the new plugin at run-time — your applications do not need to be
recompiled or re-linked.
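For example, with a default (dynamic) build, adding a component is just a
matter of dropping its DSO into the installation's plugin directory and
checking that ompi_info can see it. The component filename and installation
prefix below are hypothetical; by default, plugins live under
$prefix/lib/openmpi.
shell$ cp mca_btl_example.so /opt/openmpi-5.0.6/lib/openmpi/
shell$ ompi_info | grep example
# If the component is listed, existing MPI applications can use it
# without being recompiled or re-linked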
44. I just upgraded my Myrinet|Infiniband network; do I need to
recompile all my MPI apps?
If your installation of Open MPI uses shared libraries and
components are standalone plug-in files, then no. You simply need to
recompile the Open MPI components that support that network and
re-install them.
More specifically, Open MPI shifts the dependency on the underlying
network away from the MPI applications and to the Open MPI plug-ins.
This is a major advantage over many other MPI implementations.
MPI applications will simply open the new plugin when they run.
45. We just upgraded our version of Open MPI; do I need to
recompile all my MPI apps?
It is unlikely. Most MPI applications solely interact with
Open MPI through the standardized MPI API and the constant values it
publishes in mpi.h. The MPI-2 API will not change until the MPI
Forum changes it.
We will try hard to make Open MPI's mpi.h stable such that the
values will not change from release-to-release. While we cannot
guarantee that they will stay the same forever, we'll try hard to make
it so.
46. I have an MPI application compiled for another MPI; will it
work with Open MPI?
It is highly unlikely. Open MPI does not attempt to
interface to other MPI implementations, nor executables that were
compiled for them. Sorry!
MPI applications need to be compiled and linked with Open MPI in order
to run under Open MPI.
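Rebuilding an application against Open MPI is normally just a recompile and
re-link with the wrapper compilers, for example (the application name is
hypothetical):
shell$ mpicc my_app.c -o my_app     # compile and link against Open MPI
shell$ mpirun -np 4 ./my_app        # run under Open MPI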
47. What is "fault tolerance"?
The phrase "fault tolerance" means many things to many
people. Typical definitions range from user processes dumping vital
state to disk periodically to checkpoint/restart of running processes
to elaborate recreate-process-state-from-incremental-pieces schemes to
... (you get the idea).
In the scope of Open MPI, we typically define "fault tolerance" to
mean the ability to recover from one or more component failures in a
well defined manner with either a transparent or application-directed
mechanism. Component failures may exhibit themselves as a corrupted
transmission over a faulty network interface or the failure of one or
more serial or parallel processes due to a processor or node failure.
Open MPI strives to provide the application with a consistent system
view while still providing a production quality, high performance
implementation.
Yes, that's pretty much as all-inclusive as possible — intentionally
so! Remember that in addition to being a production-quality MPI
implementation, Open MPI is also a vehicle for research. So while
some forms of "fault tolerance" are more widely accepted and used,
others are certainly of valid academic interest.
48. What fault tolerance techniques has/does/will Open MPI support?
Open MPI has been a vehicle for research in fault tolerance and over
the years has provided support for a wide range of resilience
techniques (some of which have since had their support deprecated):
Coordinated and uncoordinated process checkpoint and
restart. Similar to those implemented in LAM/MPI and MPICH-V,
respectively.
Message logging techniques. Similar to those implemented in
MPICH-V
Data Reliability and network fault tolerance. Similar to those
implemented in LA-MPI
User Level Fault Mitigation techniques similar to
those implemented in FT-MPI.
The Open MPI team does not limit its fault tolerance work to the
techniques mentioned above, and intends to extend beyond them in the
future.
49. Does Open MPI support checkpoint and restart of parallel jobs (similar
to LAM/MPI)?
Old versions of Open MPI (starting with the v1.3 series) had support for
the transparent, coordinated checkpointing and restarting of MPI
processes (similar to LAM/MPI).
Open MPI supported both the BLCR
checkpoint/restart system and a "self" checkpointer that allows
applications to perform their own checkpoint/restart functionality while taking
advantage of the Open MPI checkpoint/restart infrastructure.
For both of these, Open MPI provides a coordinated checkpoint/restart protocol
and integration with a variety of network interconnects including shared memory,
Ethernet, InfiniBand, and Myrinet.
The implementation introduces a series of new frameworks and
components designed to support a variety of checkpoint and restart
techniques. This allows us to support the methods described above
(application-directed, BLCR, etc.) as well as other kinds of
checkpoint/restart systems (e.g., Condor, libckpt) and protocols
(e.g., uncoordinated, message induced).
Note: The
checkpoint/restart support was last released as part of the v1.6
series. The v1.7 series and the Open MPI main branch do not support this
functionality (most of the code is present in the repository, but it
is known to be non-functional in most cases). This feature is looking
for a maintainer. Interested parties should inquire on the developers
mailing list.
50. Where can I find the fault tolerance development work?
The only active work in resilience in Open MPI
targets the User Level Fault Mitigation (ULFM) approach, a
technique discussed in the context of the MPI standardization
body.
For information on the Fault Tolerant MPI prototype in Open MPI see the
links below:
Support for other types of resilience (data reliability,
checkpoint) has been deprecated over the years
due to lack of adoption and lack of maintenance. If you are interested
in doing some archeological work, traces are still available on the main
repository.
51. Does Open MPI support end-to-end data reliability in MPI
message passing?
Current OMPI releases have no support for end-to-end data
reliability, at least not more than currently provided by the
underlying network.
The data reliability ("dr") PML component, available in some past
releases (it has since been deprecated), assumed that the underlying
network is unreliable. It could drop / restart connections, retransmit
corrupted or lost data, etc. The end effect was that data sent through
MPI API functions was guaranteed to be reliable.
For example, if you're using TCP as a message transport, chances of
data corruption are fairly low. However, other interconnects do not
guarantee that data will be uncorrupted when traveling across the
network. Additionally, there are nonzero possibilities that data can
be corrupted while traversing PCI buses, etc. (some corruption errors
at this level can be caught/fixed, others cannot). Such errors are
not uncommon at high altitudes (!).
Note that such added reliability does incur a performance cost —
latency and bandwidth suffer when Open MPI performs the consistency
checks that are necessary to provide such guarantees.
Most clusters/networks do not need data reliability. But some do
(e.g., those operating at high altitudes). The dr PML was intended for
these rare environments where reliability was an issue; and users were
willing to tolerate slightly slower applications in order to guarantee
that their job does not crash (or worse, produce wrong answers).
52. How do I build Open MPI?
If you have obtained a developer's checkout from Git, skip this
FAQ question and consult these
directions.
For everyone else, in general, all you need to do is expand the
tarball, run the provided configure script, and then run "make all
install". For example:
shell$ gunzip -c openmpi-5.0.6.tar.gz | tar xf -
shell$ cd openmpi-5.0.6
shell$ ./configure --prefix=/usr/local
<...lots of output...>
shell$ make all install
Note that the configure script supports a lot of different command
line options. For example, the --prefix option in the above example
tells Open MPI to install under the directory /usr/local/.
Other notable configure options are required to support specific
network interconnects and back-end run-time environments. More
generally, Open MPI supports a wide variety of hardware and
environments, but it sometimes needs to be told where support
libraries and header files are located.
Consult the README file in the Open MPI tarball and the output of
"configure --help" for specific instructions regarding Open MPI's
configure command line options.
53. Wow — I see a lot of errors during configure.
Is that normal?
If configure finishes successfully — meaning that it
generates a bunch of Makefiles at the end — then yes, it is
completely normal.
The Open MPI configure script tests for a lot of things, not all of
which are expected to succeed. For example, if you do not have
Myrinet's GM library installed, you'll see failures about trying to
find the GM library. You'll also see errors and warnings about
various operating-system-specific tests that are not aimed at the
operating system you are running.
These are all normal, expected, and nothing to be concerned about. It
just means, for example, that Open MPI will not build Myrinet GM
support.
54. What are the default build options for Open MPI?
By default, Open MPI tries to find support for all hardware and
environments by looking for support libraries and header files in
standard locations, and skips building support for anything it does not find.
Open MPI's configure script has a large number of options, several of
which are of the form --with-<FOO>(=DIR), usually with a
corresponding --with-<FOO>-libdir=DIR option. The (=DIR)
part means that specifying the directory is optional. Here are some
examples (explained in more detail below):
--with-openib(=DIR) and --with-openib-libdir=DIR
--with-mx(=DIR) and --with-mx-libdir=DIR
--with-psm(=DIR) and --with-psm-libdir=DIR
...etc.
As mentioned above, by default, Open MPI will try to build support for
every feature that it can find on your system. If support for a given
feature is not found, Open MPI will simply skip building support for
it (this usually means not building a specific plugin).
"Support" for a given feature usually means finding both the
relevant header and library files for that feature. As such, the
command-line switches listed above are used to override default
behavior and allow specifying whether you want support for a given
feature or not, and if you do want support, where the header files
and/or library files are located (which is useful if they are not
located in compiler/linker default search paths). Specifically:
If --without-<FOO> is specified, Open MPI will not even
look for support for feature FOO. It will be treated as if support
for that feature was not found (i.e., it will be skipped).
If --with-<FOO> is specified with no optional directory,
Open MPI's configure script will abort if it cannot find support for
the FOO feature. More specifically, only compiler/linker default
search paths will be searched while looking for the relevant header
and library files. This option essentially tells Open MPI, "Yes, I
want support for FOO -- it is an error if you don't find support for
it."
If --with-<FOO>=/some/path is specified, it is
essentially the same as specifying --with-<FOO> but also
tells Open MPI to add -I/some/path/include to compiler search paths,
and try (in order) adding -L/some/path/lib and -L/some/path/lib64
to linker search paths when searching for FOO support. If found,
the relevant compiler/linker paths are added to Open MPI's general
build flags. This option is helpful when support for feature FOO is
not found in default search paths.
If --with-<FOO>-libdir=/some/path/lib is specified, it
only specifies that if Open MPI searches for FOO support, it
should use /some/path/lib for the linker search path.
In general, it is usually sufficient to run Open MPI's configure
script with no --with-<FOO> options if all the features you
need supported are in default compiler/linker search paths. If the
features you need are not in default compiler/linker search paths,
you'll likely need to specify --with-<FOO> kinds of flags.
However, note that it is safest to add --with-<FOO> types of
flags if you want to guarantee that Open MPI builds support for
feature FOO, regardless of whether support for FOO can be found in
default compiler/linker paths or not — configure will abort if you
can't find the appropriate support for FOO. *This may be preferable
to unexpectedly discovering at run-time that Open MPI is missing
support for a critical feature.*
Be sure to note the difference in the directory specification between
--with-<FOO> and --with-<FOO>-libdir. The former
takes a top-level directory (such that "/include", "/lib", and
"/lib64" are appended to it) while the latter takes a single
directory where the library is assumed to exist (i.e., nothing is
suffixed to it).
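Putting this together, a configure invocation using one of the example
features above might look like the following sketch. The paths are
hypothetical; openib is simply one of the --with-<FOO> options listed
earlier.
shell$ ./configure --prefix=/opt/openmpi-5.0.6 \
    --with-openib=/opt/ofed \
    --with-openib-libdir=/opt/ofed/lib64
# configure will abort if usable openib headers and libraries are not
# found via these paths (or the default search paths)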
Finally, note that starting with Open MPI v1.3, configure will
sanity check to ensure that any directory given to
--with-<FOO> or --with-<FOO>-libdir actually exists
and will error if it does not. This prevents typos and mistakes in
directory names, and prevents Open MPI from accidentally using a
compiler/linker-default path to satisfy FOO's header and library
files.
55. Open MPI was pre-installed on my machine; should I overwrite it with a new version?
Probably not.
Many systems come with some version of Open MPI pre-installed (e.g.,
many Linuxes, BSD variants, and OS X). If you download a newer version
of Open MPI from this web site (or one of the Open MPI mirrors), you
probably do not want to overwrite the system-installed Open MPI.
This is because the system-installed Open MPI is typically under the
control of some software package management system (rpm, yum, etc.).
Instead, you probably want to install your new version of Open MPI to
another path, such as /opt/openmpi-<version> (or whatever is
appropriate for your system).
This FAQ
entry also has much more information about strategies for where to
install Open MPI.
56. Where should I install Open MPI?
A common environment to run Open MPI is in a "Beowulf"-class
or similar cluster (e.g., a bunch of 1U servers in a bunch of racks).
Simply stated, Open MPI can run on a group of servers or workstations
connected by a network. As mentioned above, there are several
prerequisites, however (for example, you typically must have an
account on all the machines, and you can rsh or ssh between the nodes
without using a password, etc.).
This raises the question for Open MPI system administrators: where to
install the Open MPI binaries, header files, etc.? This discussion
mainly addresses this question for homogeneous clusters (i.e., where
all nodes and operating systems are the same), although elements of
this discussion apply to heterogeneous clusters as well.
Heterogeneous admins are encouraged to read this discussion and then
see the heterogeneous section of this FAQ.
There are two common approaches:
Have a common filesystem, such as NFS, between all the machines
to be used. Install Open MPI such that the installation directory is
the same value on each node. This will greatly simplify users'
shell startup scripts (e.g., .bashrc, .cshrc, .profile etc.)
— the PATH can be set without checking which machine the user is
on. It also simplifies the system administrator's job; when the time
comes to patch or otherwise upgrade OMPI, only one copy needs to be
modified.
For example, consider a cluster of four machines: inky, blinky,
pinky, and clyde.
Install Open MPI on inky's local hard drive in the directory
/opt/openmpi-5.0.6. The system administrator then mounts
inky:/opt/openmpi-5.0.6 on the remaining three machines, such
that /opt/openmpi-5.0.6 on all machines is effectively "the
same". That is, the following directories all contain the Open MPI
installation:
Install Open MPI on inky's local hard drive in the directory
/usr/local/openmpi-5.0.6. The system administrator then
mounts inky:/usr/local/openmpi-5.0.6 on all four machines
in some other common location, such as /opt/openmpi-5.0.6 (a
symbolic link can be installed on inky instead of a mount point for
efficiency). This strategy is typically used for environments where
one tree is NFS exported, but another tree is typically used for the
location of actual installation. For example, the following
directories all contain the Open MPI installation:
Notice that there are the same four directories as the previous
example, but on inky, the directory is actually located in
/usr/local/openmpi-5.0.6.
There is a bit of a disadvantage in this approach; each of the remote
nodes have to incur NFS (or whatever filesystem is used) delays to
access the Open MPI directory tree. However, both the administration
ease and low cost (relatively speaking) of using a networked file
system usually greatly outweigh this cost. Indeed, once an MPI
application is past MPI_INIT, it doesn't use the Open MPI binaries
very much.
NOTE: Open MPI, by default, uses a plugin
system for loading functionality at run-time. Most of Open MPI's
plugins are opened during the call to MPI_INIT. This can cause a lot
of filesystem traffic, which, if Open MPI is installed on a networked
filesystem, may be noticeable. Two common options to avoid this extra
filesystem traffic are to build Open MPI to not use plugins (see this FAQ entry for details) or to install
Open MPI locally (see below).
If you are concerned with networked filesystem costs of accessing
the Open MPI binaries, you can install Open MPI on the local hard
drive of each node in your system. Again, it is highly advisable to
install Open MPI in the same directory on each node so that each
user's PATH can be set to the same value, regardless of the node
that a user has logged on to.
This approach will save some network latency of accessing the Open MPI
binaries, but is typically only used where users are very concerned
about squeezing every spare cycle out of their machines, or are
running at extreme scale where a networked filesystem may get
overwhelmed by filesystem requests for Open MPI binaries when running
very large parallel jobs.
57. Should I install a new version of Open MPI over an old version?
We do not recommend this.
Before discussing specifics, here are some definitions that are
necessary to understand:
Source tree: The tree where the Open MPI source
code is located. It is typically the result of expanding an Open MPI
distribution source code bundle, such as a tarball.
Build tree: The tree where Open MPI was built.
It is always related to a specific source tree, but may actually be a
different tree (since Open MPI supports VPATH builds). Specifically,
this is the tree where you invoked configure, make, etc. to build
and install Open MPI.
Installation tree: The tree where Open MPI was
installed. It is typically the "prefix" argument given to Open MPI's
configure script; it is the directory from which you run installed Open
MPI executables.
In its default configuration, an Open MPI installation consists of
several shared libraries, header files, executables, and plugins
(dynamic shared objects — DSOs). These installation files act
together as a single entity. The specific filenames and
contents of these files are subject to change between different
versions of Open MPI.
KEY POINT: Installing one
version of Open MPI does not uninstall another version.
If you install a new version of Open MPI over an older version, this
may not remove or overwrite all the files from the older version.
Hence, you may end up with an incompatible muddle of files from two
different installations — which can cause problems.
The Open MPI team recommends one of the following methods for
upgrading your Open MPI installation:
Install newer versions of Open MPI into a different
directory. For example, install into /opt/openmpi-a.b.c and
/opt/openmpi-x.y.z for versions a.b.c and x.y.z, respectively.
Completely uninstall the old version of Open MPI before
installing the new version. The make uninstall process from Open
MPI a.b.c build tree should completely uninstall that version from
the installation tree, making it safe to install a new version (e.g.,
version x.y.z) into the same installation tree.
Remove the old installation directory entirely and then install
the new version. For example "rm -rf /opt/openmpi" *(assuming
that there is nothing else of value in this tree!)* The installation
of Open MPI x.y.z will safely re-create the /opt/openmpi tree. This
method is preferable if you no longer have the source and build trees
to Open MPI a.b.c available from which to "make uninstall".
Go into the Open MPI a.b.c installation directory and manually
remove all old Open MPI files. Then install Open MPI x.y.z into the
same installation directory. This can be a somewhat painful,
annoying, and error-prone process. We do not recommend it. Indeed,
if you no longer have access to the original Open MPI a.b.c source and
build trees, it may be far simpler to download Open MPI version a.b.c
again from the Open MPI web site, configure it with the same
installation prefix, and then run "make uninstall". Or use one of
the other methods, above.
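As a sketch of the first two recommended methods above (version numbers,
prefixes, and build-tree paths are placeholders):
# Method 1: install the new version into its own directory
shell$ ./configure --prefix=/opt/openmpi-x.y.z && make all install

# Method 2: uninstall the old version from its original build tree first
shell$ cd /path/to/openmpi-a.b.c-build
shell$ make uninstall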
58. Can I disable Open MPI's use of plugins?
Yes.
Open MPI uses plugins for much of its functionality. Specifically,
Open MPI looks for and loads plugins as dynamically shared objects
(DSOs) during the call to MPI_INIT. However, these plugins can be
compiled and installed in several different ways:
As DSOs: In this mode (the default), each of Open MPI's plugins
are compiled as a separate DSO that is dynamically loaded at run
time.
Advantage: this approach is highly flexible — it gives system
developers and administrators a fine-grained way to add new
plugins to an existing Open MPI installation, and also allows the
removal of old plugins (i.e., forcibly disallowing the use of specific
plugins) simply by removing the corresponding DSO(s).
Disadvantage: this approach causes additional filesystem
traffic (mostly during MPI_INIT). If Open MPI is installed on a
networked filesystem, this can cause noticeable network traffic when a
large parallel job starts, for example.
As part of a larger library: In this mode, Open MPI "slurps
up" the plugins and includes them in libmpi (and other libraries).
Hence, all plugins are included in the main Open MPI libraries
that are loaded by the system linker before an MPI process even
starts.
Advantage: Significantly less filesystem traffic than the DSO
approach. This model can be much more performant on network
installations of Open MPI.
Disadvantage: Much less flexible than the DSO approach; system
administrators and developers have significantly less ability to
add/remove plugins from the Open MPI installation at run-time. Note
that you still have some ability to add/remove plugins (see below),
but there are limitations to what can be done.
To be clear: Open MPI's plugins can be built either as standalone DSOs
or included in Open MPI's main libraries (e.g., libmpi).
Additionally, Open MPI's main libraries can be built either as static
or shared libraries.
You can therefore choose to build Open MPI in one of several different
ways:
--disable-mca-dso: Using the --disable-mca-dso switch to Open
MPI's configure script will cause all plugins to be built as part of
Open MPI's main libraries — they will not be built as standalone
DSOs. However, Open MPI will still look for DSOs in the filesystem at
run-time. Specifically: this option significantly decreases (but
does not eliminate) filesystem traffic during MPI_INIT, but does allow
the flexibility of adding new plugins to an existing Open MPI
installation.
Note that the --disable-mca-dso option does not affect whether Open
MPI's main libraries are built as static or shared.
--enable-static: Using this option to Open MPI's configure
script will cause the building of static libraries (e.g., libmpi.a).
This option automatically implies --disable-mca-dso.
Note that --enable-shared is also the default; so if you use
--enable-static, Open MPI will build both static and shared
libraries that contain all of Open MPI's plugins (i.e., libmpi.so and
libmpi.a). If you want only static libraries (that contain all of
Open MPI's plugins), be sure to also use --disable-shared.
--disable-dlopen: Using this option to Open MPI's configure
script will do two things:
Imply --disable-mca-dso, meaning that all plugins will be
slurped into Open MPI's libraries.
Cause Open MPI to not look for / open any DSOs at run time.
Specifically: this option makes Open MPI not incur any additional
filesystem traffic during MPI_INIT. Note that the --disable-dlopen
option does not affect whether Open MPI's main libraries are built as
static or shared.
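For example, each of the options described above corresponds to a configure invocation along these lines (illustrative only; combine with whatever other configure options your site needs):
shell$ ./configure --disable-mca-dso ...
shell$ ./configure --enable-static --disable-shared ...
shell$ ./configure --disable-dlopen ...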
59. How do I build an optimized version of Open MPI?
Building Open MPI from a tarball defaults to building an optimized
version. There is no need to do anything special.
60. Are VPATH and/or parallel builds supported?
Yes, both VPATH and parallel builds are supported. This
allows Open MPI to be built in a different directory than where its
source code resides (helpful for multi-architecture builds). Open MPI
uses Automake for its build system, so these standard Automake/make features work as expected.
For example:
shell$ gtar zxf openmpi-1.2.3.tar.gz
shell$ cd openmpi-1.2.3
shell$ mkdir build
shell$ cd build
shell$ ../configure ...
<... lots of output ...>
shell$ make -j 4
Running configure from a different directory than where it actually
resides triggers the VPATH build (i.e., Open MPI will configure and build
itself in the directory where configure was run, not in the
directory where configure resides).
Some versions of make support parallel builds. The example above
shows GNU make's "-j" option, which specifies how many compile
processes may be executing at any given time. We, the Open MPI Team,
have found that a value of two to four times the number of processors in the
machine can significantly speed up an Open MPI compile (since
compiles tend to be much more I/O bound than CPU bound).
61. Do I need any special tools to build Open MPI?
If you are building Open MPI from a tarball, you need a C
compiler, a C++ compiler, and make. If you are building the Fortran
77 and/or Fortran 90 MPI bindings, you will need compilers for these
languages as well. You do not need any special version of the GNU
"Auto" tools (Autoconf, Automake, Libtool).
If you are building Open MPI from a Git checkout, you need some
additional tools. See the
source code access pages for more information.
62. How do I build Open MPI as a static library?
As noted above, Open MPI defaults to building shared libraries
and building components as dynamic shared objects (DSOs, i.e.,
run-time plugins). Changing this build behavior is controlled via
command line options to Open MPI's configure script.
Building static libraries: You can disable building shared libraries
and enable building static libraries with the following options:
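A representative invocation (other configure options can be added as needed):
shell$ ./configure --enable-static --disable-shared ...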
Similarly, you can build both static and shared libraries by simply
specifying --enable-static (and not specifying
--disable-shared), if desired.
Including components in libraries: Instead of building components as
DSOs, they can also be "rolled up" and included in their respective
libraries (e.g., libmpi). This is controlled with the
--enable-mca-static option. Some examples:
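The framework/component names below are only illustrative; substitute the ones relevant to your build:
shell$ ./configure --enable-mca-static=pml ...
shell$ ./configure --enable-mca-static=pml,btl-tcp ...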
Specifically, entire frameworks and/or individual components can be
specified to be rolled up into the library in a comma-separated list
as an argument to --enable-mca-static.
63. When I run 'make', it looks very much like the build system is going into a loop.
Open MPI uses the GNU Automake software to build itself.
Automake uses a tightly-woven set of file timestamp-based
dependencies to compile and link software. This behavior, frequently
paired with messages similar to:
Warning: File `Makefile.am' has modification time 3.6e+04 s in the future
typically means that you are building on a networked filesystem where
the local time of the client machine that you are building on does not
match the time on the network filesystem server. This will result in
files with incorrect timestamps, and Automake degenerates into undefined
behavior.
Two solutions are possible:
Ensure that the time between your network filesystem server and
client(s) is the same. This can be accomplished in a variety of ways
and is dependent upon your local setup; one method is to use an NTP
daemon to synchronize all machines to a common time server.
Build on a local disk filesystem where network timestamps are
guaranteed to be synchronized with the local build machine's
time.
After using one of the two options, it is likely safest to remove the
Open MPI source tree and re-expand the Open MPI tarball. Then you can
run configure, make, and make install. Open MPI should then
build and install successfully.
64. Configure issues warnings about sed and unterminated
commands
Some users have reported seeing warnings like this in the
final output from configure:
*** Final output
configure: creating ./config.status
config.status: creating ompi/include/ompi/version.h
sed: file ./confstatA1BhUF/subs-3.sed line 33: unterminated `s' command
sed: file ./confstatA1BhUF/subs-4.sed line 4: unterminated `s' command
config.status: creating orte/include/orte/version.h
These messages usually indicate a problem in the user's local shell
configuration. Ensure that when you run a new shell, no output is
sent to stdout. For example, if the output of this simple shell
script is more than just the hostname of your computer, you need to go
check your shell startup files to see where the extraneous output is
coming from (and eliminate it):
#!/bin/sh
hostname
exit 0
65. Open MPI configured ok, but I get "Makefile:602: *** missing separator" kinds of errors when building
This is usually an indication that configure succeeded but
really shouldn't have. See this FAQ
entry for one possible cause.
66. Open MPI seems to default to building with the GNU compiler set. Can I use other compilers?
Yes.
Open MPI uses a standard Autoconf "configure" script to probe the
current system and figure out how to build itself. One of the choices
it makes is which compiler set to use. Since Autoconf is a GNU
product, it defaults to the GNU compiler set. However, this is easily
overridden on the configure command line. For example, to build
Open MPI with the Intel compiler suite:
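The following uses the usual Intel compiler executable names; adjust if your installation differs:
shell$ ./configure CC=icc CXX=icpc F77=ifort FC=ifort ...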
Note that you can include additional parameters to configure,
implied by the "..." clause in the example above.
In particular, 4 switches on the configure command line are used to
specify the compiler suite:
CC: Specifies the C compiler
CXX: Specifies the C++ compiler
F77: Specifies the Fortran 77 compiler
FC: Specifies the Fortran 90 compiler
NOTE: The Open MPI team recommends using a
single compiler suite whenever possible. Unexpected or undefined
behavior can occur when you mix compiler suites in unsupported ways
(e.g., mixing Fortran 77 and Fortran 90 compilers between different
compiler suites is almost guaranteed not to work).
In all cases, the compilers must be found in your PATH and be able to
successfully compile and link non-MPI applications before Open MPI
will be able to be built properly.
67. Can I pass specific flags to the compilers / linker used to build Open MPI?
Yes.
Open MPI uses a standard Autoconf configure script to set itself up
for building. As such, there are a number of command line options
that can be passed to configure to customize flags that are passed
to the underlying compiler to build Open MPI:
CFLAGS: Flags passed to the C compiler.
CXXFLAGS: Flags passed to the C++ compiler.
FFLAGS: Flags passed to the Fortran 77 compiler.
FCFLAGS: Flags passed to the Fortran 90 compiler.
LDFLAGS: Flags passed to the linker (not language-specific).
This flag is rarely required; Open MPI will usually pick up all
LDFLAGS that it needs by itself.
LIBS: Extra libraries to link to Open MPI (not
language-specific). This flag is rarely required; Open MPI will
usually pick up all LIBS that it needs by itself.
LD_LIBRARY_PATH: Note that we do not recommend setting
LD_LIBRARY_PATH via configure, but it is worth noting that you
should ensure that your LD_LIBRARY_PATH value is appropriate for
your build. Some users have been tripped up, for example, by
specifying a non-default Fortran compiler to FC and F77, but then
having Open MPI's configure script fail because the LD_LIBRARY_PATH
wasn't set properly to point to that Fortran compiler's support
libraries.
Note that the flags you specify must be compatible across all the
compilers. In particular, flags specified to one language compiler
must generate code that can be compiled and linked against code that
is generated by the other language compilers. For example, on a 64
bit system where the compiler default is to build 32 bit executables:
# Assuming the GNU compiler suite
shell$ ./configure CFLAGS=-m64 ...
will produce 64 bit C objects, but 32 bit objects for C++, Fortran 77,
and Fortran 90. These objects will be incompatible with each other, and
Open MPI will fail to build successfully. Instead, you must specify building
64 bit objects for all languages:
# Assuming the GNU compiler suite
shell$ ./configure CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 FCFLAGS=-m64 ...
The above command line will pass "-m64" to all four compilers, and
therefore will produce 64 bit objects for all languages.
68. I'm trying to build with the Intel compilers, but Open MPI
eventually fails to compile with really long error messages. What do
I do?
A common mistake when building Open MPI with the Intel
compiler suite is to accidentally specify the Intel C compiler as the
C++ compiler. Specifically, recent versions of the Intel compiler
renamed the C++ compiler "icpc" (it used to be "icc", the same
as the C compiler). Users accustomed to the old name tend to specify
"icc" as the C++ compiler, which will then cause a failure late in
the Open MPI build process because a C++ code will be compiled with
the C compiler. Bad Things then happen.
The solution is to be sure to specify that the C++ compiler is
"icpc", not "icc". For example:
For Googling purposes, here's some of the error messages that may be
issued during the Open MPI compilation of C++ codes with the Intel C compiler
(icc), in no particular order:
IPO Error: unresolved : _ZNSsD1Ev
IPO Error: unresolved : _ZdlPv
IPO Error: unresolved : _ZNKSs4sizeEv
components.o(.text+0x17): In function `ompi_info::open_components()':
: undefined reference to `std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string()'
components.o(.text+0x64): In function `ompi_info::open_components()':
: undefined reference to `std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string()'
components.o(.text+0x70): In function `ompi_info::open_components()':
: undefined reference to `std::string::size() const'
components.o(.text+0x7d): In function `ompi_info::open_components()':
: undefined reference to `std::string::reserve(unsigned int)'
components.o(.text+0x8d): In function `ompi_info::open_components()':
: undefined reference to `std::string::append(char const*, unsigned int)'
components.o(.text+0x9a): In function `ompi_info::open_components()':
: undefined reference to `std::string::append(std::string const&)'
components.o(.text+0xaa): In function `ompi_info::open_components()':
: undefined reference to `std::string::operator=(std::string const&)'
components.o(.text+0xb3): In function `ompi_info::open_components()':
: undefined reference to `std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()'
There are many more error messages, but the above should be sufficient
for someone trying to find this FAQ entry via a web crawler.
69. When I build with the Intel compiler suite, linking user MPI
applications with the wrapper compilers results in warning messages.
What do I do?
When Open MPI was built with some versions of the Intel
compilers on some platforms, you may see warnings similar to the
following when compiling MPI applications with Open MPI's wrapper
compilers:
shell$ mpicc hello.c -o hello
libimf.so: warning: warning: feupdateenv is not implemented and will always fail
shell$
This warning is generally harmless, but it can be alarming to some
users. To remove this warning, pass either the -shared-intel or
-i-dynamic options when linking your MPI application (the specific
option depends on your version of the Intel compilers; consult your
local documentation):
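For example, with a compiler version that accepts -shared-intel (substitute -i-dynamic if your version requires it):
shell$ mpicc hello.c -o hello -shared-intel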
70. I'm trying to build with the IBM compilers, but Open MPI
eventually fails to compile. What do I do?
Unfortunately there are some problems between Libtool (which
Open MPI uses for library support) and the IBM compilers when creating
shared libraries. Currently the only workaround is to disable shared
libraries and build Open MPI statically. For example:
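A representative invocation (add other configure options as needed):
shell$ ./configure --disable-shared --enable-static ...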
For Googling purposes, here's an error message that may be
issued when the build fails:
xlc: 1501-216 command option --whole-archive is not recognized - passed to ld
xlc: 1501-216 command option --no-whole-archive is not recognized - passed to ld
xlc: 1501-218 file libopen-pal.so.0 contains an incorrect file suffix
xlc: 1501-228 input file libopen-pal.so.0 not found
71. I'm trying to build with the Oracle Solaris Studio (Sun) compilers on Linux, but Open MPI
eventually fails to compile. What do I do?
Below are some known issues that impact Oracle Solaris Studio
12 Open MPI builds. The easiest way to work around them is simply to
use the latest version of the Oracle Solaris Studio 12 compilers.
72. What configure options should I use when building with the Oracle Solaris Studio (Sun) compilers?
The below configure options are suggested for use with the Oracle Solaris Studio (Sun) compilers:
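A minimal sketch is shown below, selecting the Studio compiler executables (cc, CC, f77, f90); consult Oracle's documentation for the full set of recommended optimization and ABI flags for your Studio version:
shell$ ./configure CC=cc CXX=CC F77=f77 FC=f90 ...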
75. I'm trying to build with the PathScale 3.0 and 3.1 compilers on Linux, but all Open MPI commands seg fault. What do I do?
The PathScale compiler authors have identified a bug in the
v3.0 and v3.1 versions of their compiler; you must disable certain
"builtin" functions when building Open MPI:
With PathScale 3.0 and 3.1 compilers use the workaround options
-O2 and -fno-builtin in CFLAGS across the Open MPI build. For
example:
shell$ ./configure CFLAGS="-O2 -fno-builtin" ...
With PathScale 3.2 beta and later, no workaround options are
required.
76. All MPI C++ API functions return errors (or otherwise fail)
when Open MPI is compiled with the PathScale compilers. What do I do?
This is an old issue that seems to be a problem when
PathScale uses a back-end GCC 3.x compiler. Here's a proposed
solution from the PathScale support team (from July 2010):
The proposed work-around is to install gcc-4.x on the system and use
the pathCC -gnu4 option. Newer versions of the compiler (4.x and
beyond) should have this fixed, but we'll have to test to confirm it's
actually fixed and working correctly.
We don't anticipate that this will be much of a problem for Open MPI
users these days (our informal testing shows that not many users are
still using GCC 3.x), but this information is provided so that it is
Google-able for those still using older compilers.
77. How do I build Open MPI with support for [my favorite network type]?
To build support for high-speed interconnect networks, you
generally only have to specify the directory where its support header
files and libraries were installed to Open MPI's configure script.
You can specify where multiple packages were installed if you have
support for more than one kind of interconnect — Open MPI will build
support for as many as it can.
You tell configure where support libraries are with the appropriate
--with command line switch. Here is the list of available switches:
--with-libfabric=<dir>: Build support for OpenFabrics
Interfaces (OFI), commonly known as "libfabric" (starting with the
v1.10 series). NOTE: v4.1.6 or older will only build successfully
with libfabric v1.x.
--with-ucx=<dir>: Build support for the UCX library.
--with-mxm=<dir>: Build support for the Mellanox
Messaging (MXM) library (starting with the v1.5 series).
--with-verbs=<dir>: Build support for OpenFabrics verbs
(previously known as "Open IB", for Infiniband and iWARP
networks). NOTE: Up through the v1.6.x series, this option was
previously named --with-openib. In the v1.8.x series, it was
renamed to be --with-verbs.
--with-portals4=<dir>: Build support for the Portals v4
library (starting with the v1.7 series).
--with-psm=<dir>: Build support for the PSM
library.
--with-psm2=<dir>: Build support for the PSM 2
library (starting with the v1.10 series).
--with-usnic: Build support for usNIC networks (starting with
the v1.8 series). In the v1.10 series, usNIC support is included in
the libfabric library, but this option can still be used to ensure
that usNIC support is specifically available.
These switches enable Open MPI's configure script to automatically
find all the right header files and libraries to support the various
networks that you specified.
You can verify that configure found everything properly by examining
its output — it will test for each network's header files and
libraries and report whether it will build support (or not) for each
of them. Examining configure's output is the first place you
should look if you have a problem with Open MPI not correctly
supporting a specific network type.
If configure indicates that support for your networks will be
included, after you build and install Open MPI, you can run the
"ompi_info" command and look for components for your networks.
For example:
shell$ ompi_info | egrep ': ofi|ucx'
MCA rml: ofi (MCA v2.1.0, API v3.0.0, Component v4.0.0)
MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.0.0)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.0)
MCA osc: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.0)
Here are some network types that are no longer supported in current
versions of Open MPI:
--with-scif=<dir>: Build support for the SCIF library Last supported in the v3.1.x series.
--with-elan=<dir>: Build support for Elan. Last supported in the v1.6.x series.
--with-gm=<dir>: Build support for GM (Myrinet). Last supported in the v1.4.x series.
--with-mvapi=<dir>: Build support for mVAPI (Infiniband). Last supported in the v1.3 series.
--with-mx=<dir>: Build support for MX (Myrinet). Last supported in the v1.8.x series.
--with-portals=<dir>: Build support for the Portals
library. Last supported in the v1.6.x series.
78. How do I build Open MPI with support for Slurm / XGrid?
Slurm support is built automatically; there is nothing that
you need to do.
XGrid support is built automatically if the XGrid tools are installed.
79. How do I build Open MPI with support for SGE?
Support for SGE first appeared in the Open MPI v1.2 series.
The method for configuring it is slightly different between Open MPI
v1.2 and v1.3.
For Open MPI v1.2, no extra configure arguments are needed as SGE
support is built in automatically. After Open MPI is installed, you
should see two components named gridengine.
For Open MPI v1.3, you need to explicitly request the SGE support with
the "--with-sge" command line switch to the Open MPI configure
script. For example:
shell$ ./configure --with-sge
After Open MPI is installed, you should see one component named
gridengine.
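For example, you can verify with ompi_info (component and version numbers will vary with your Open MPI version):
shell$ ompi_info | grep gridengine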
Open MPI v1.3 only has the one specific gridengine component as the
other functionality was rolled into other components.
Component versions may vary depending on the version of Open MPI 1.2 or
1.3 you are using.
80. How do I build Open MPI with support for PBS Pro / Open PBS / Torque?
Support for PBS Pro, Open PBS, and Torque must be explicitly requested
with the "--with-tm" command line switch to Open MPI's configure
script. In general, the procedure is the same as building support for high-speed interconnect
networks, except that you use --with-tm. For example:
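The installation path below is only an example; point --with-tm at wherever your PBS / Torque installation lives:
shell$ ./configure --with-tm=/usr/local/pbs ...
<... lots of output ...>
shell$ ompi_info | grep tm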
Specific frameworks and version numbers may vary, depending on your
version of Open MPI.
NOTE: Update to the note below
(May 2006): Torque 2.1.0p0 now includes support for shared libraries
and the workarounds listed below are no longer necessary. However,
this version of Torque changed other things that require upgrading
Open MPI to 1.0.3 or higher (as of this writing, v1.0.3 has not yet
been released — nightly snapshot tarballs of what will become 1.0.3
are available at https://www.open-mpi.org/nightly/v1.0/).
NOTE: As of this writing
(October 2006), Open PBS and PBS Pro do not ship shared libraries
(i.e., they only include static libraries). Because of this, you may run into linking errors
when Open MPI tries to create dynamic plugin components for TM support
on some platforms. Notably, on at least some 64 bit Linux platforms
(e.g., AMD64), trying to create a dynamic plugin that links against a
static library will result in error messages such as:
relocation R_X86_64_32S against `a local symbol' can not be used when making a shared object; recompile with -fPIC
Note that recent versions of Torque (as of October 2006) have started
shipping shared libraries and this issue does not occur.
There are two possible solutions in Open MPI 1.0.x:
Recompile your PBS implementation with "-fPIC" (or whatever
the relevant flag is for your compiler to generate
position-independent code) and re-install. This will allow Open MPI
to generate dynamic plugins with the PBS/Torque libraries properly.
PRO: Open MPI enjoys the benefits of shared libraries and dynamic
plugins.
CON: Dynamic plugins can use more memory at run-time (e.g.,
operating systems tend to align each plugin on a page, rather than
densely packing them all into a single library).
CON: This is not possible for binary-only vendor distributions
(such as PBS Pro).
Configure Open MPI to build a static library that includes all of
its components. Specifically, all of Open MPI's components will be
included in its libraries — none will be discovered and opened at
run-time. This does not affect user MPI code at all (i.e., the
location of Open MPI's plugins is transparent to MPI applications).
Use the following options to Open MPI's configure script:
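A representative invocation, using the options described earlier in this FAQ:
shell$ ./configure --enable-static --disable-shared ...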
Note that this option only changes the location of Open MPI's _default
set_ of plugins (i.e., they are included in libmpi and friends
rather than being standalone dynamic shared objects that are
found/opened at run-time). This option does not change the fact
that Open MPI will still try to open other dynamic plugins at
run-time.
PRO: This works with binary-only vendor distributions (e.g., PBS
Pro).
CON: User applications are statically linked to Open MPI; if Open
MPI — or any of its default set of components — is updated, users
will need to re-link their MPI applications.
Both methods work equally well, but there are tradeoffs; each site
will likely need to make its own determination of which to use.
81. How do I build Open MPI with support for LoadLeveler?
Support for LoadLeveler will be automatically built if the LoadLeveler
libraries and headers are in the default path. If not, support
must be explicitly requested with the "--with-loadleveler" command
line switch to Open MPI's configure script. In general, the procedure
is the same as building support for high-speed
interconnect networks, except that you use --with-loadleveler.
For example:
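The path below is hypothetical; use the location of your LoadLeveler installation:
shell$ ./configure --with-loadleveler=/opt/loadleveler ...
<... lots of output ...>
shell$ ompi_info | grep loadleveler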
Specific frameworks and version numbers may vary, depending on your
version of Open MPI.
82. How do I build Open MPI with support for Platform LSF?
Note that only Platform LSF 7.0.2 and later is supported.
Support for LSF will be automatically built if the LSF libraries and
headers are in the default path. If not, support must be explicitly
requested with the "--with-lsf" command line switch to Open MPI's
configure script. In general, the procedure is the same as building support for high-speed interconnect
networks, except that you use --with-lsf. For example:
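The path below is hypothetical; point --with-lsf at your LSF installation:
shell$ ./configure --with-lsf=/opt/lsf ...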
Note: There are some dependencies needed to build with LSF.
1. Network Information Service Version 2, formerly referred to as YP.
This is typically found in libnsl, but could vary based on your OS.
- On RHEL: libnsl, libnsl2 AND libnsl2-devel are required.
2. Posix shmem. Can be found in librt on most distros.
After Open MPI is installed, you should see a component named
"lsf":
shell$ ompi_info | grep lsf
MCA ess: lsf (MCA v2.0, API v1.3, Component v1.3)
MCA ras: lsf (MCA v2.0, API v1.3, Component v1.3)
MCA plm: lsf (MCA v2.0, API v1.3, Component v1.3)
Specific frameworks and version numbers may vary, depending on your
version of Open MPI.
83. How do I build Open MPI with processor affinity support?
Open MPI supports processor affinity for many platforms. In
general, processor affinity will automatically be built if it is
supported — no additional command line flags to configure should be
necessary.
However, Open MPI will fail to build processor affinity if the appropriate
support libraries and header files are not available on the system on
which Open MPI is being built. Ensure that you have all appropriate
"development" packages installed. For example, Red Hat Enterprise
Linux (RHEL) systems typically require the numactl-devel packages to
be installed before Open MPI will be able to build full support for
processor affinity. Other OS's / Linux distros may have different
packages that are required.
84. How do I build Open MPI with memory affinity / NUMA support (e.g., libnuma)?
Open MPI supports memory affinity for many platforms. In
general, memory affinity will automatically be built if it is
supported — no additional command line flags to configure should be
necessary.
However, Open MPI will fail to build memory affinity if the appropriate
support libraries and header files are not available on the system on
which Open MPI is being built. Ensure that you have all appropriate
"development" packages installed. For example, Red Hat Enterprise
Linux (RHEL) systems typically require the numactl-devel packages to
be installed before Open MPI will be able to build full support for
memory affinity. Other OS's / Linux distros may have different
packages that are required.
85. How do I build Open MPI with CUDA-aware support?
CUDA-aware support means that the MPI library can send and receive GPU buffers
directly. This feature exists in the Open MPI 1.7 series and later. The support
is being continuously updated so different levels of support exist in different
versions. We recommend you use the latest 1.8 version for best support.
Configuring the Open MPI 1.8 series and Open MPI 1.7.3, 1.7.4, 1.7.5
With Open MPI 1.7.3 and later the libcuda.so library is loaded dynamically
so there is no need to specify a path to it at configure time. Therefore,
all you need is the path to the cuda.h header file.
1. Searches in default locations. Looks for cuda.h in
/usr/local/cuda/include.
shell$ ./configure --with-cuda
2. Searches for cuda.h in /usr/local/cuda-v6.0/cuda/include.
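That is (using the path described above):
shell$ ./configure --with-cuda=/usr/local/cuda-v6.0/cuda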
If the cuda.h or libcuda.so files cannot be found, then the configure
will abort.
Note: There is a bug in Open MPI 1.7.2 such that you will get an
error if you configure the library with --enable-static. To get
around this error, add the following to your configure line and
reconfigure. This disables the build of the PML BFO which is largely
unused anyways. This bug is fixed in Open MPI 1.7.3.
--enable-mca-no-build=pml-bfo
See this FAQ entry
for details on how to use the CUDA support.
86. How do I not build a specific plugin / component for Open MPI?
The --enable-mca-no-build option to Open MPI's configure
script enables you to specify a list of components that you want to
skip building. This allows you to not include support for specific
features in Open MPI if you do not want to.
It takes a single argument: a comma-delimited list of
framework/component pairs indicating which specific components you do
not want to build. For example:
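The component names below are only examples taken from elsewhere in this FAQ; list whichever components you want to skip:
shell$ ./configure --enable-mca-no-build=pml-bfo,btl-gm ...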
Note that this option is really only useful for components that would
otherwise be built. For example, if you are on a machine without
Myrinet support, it is not necessary to specify:
shell$ ./configure --enable-mca-no-build=btl-gm
because the configure script will naturally see that you do not have
support for GM and will automatically skip the gm BTL component.
87. What other options to configure exist?
There are many options to Open MPI's configure script.
Please run the following to get a full list (including a short
description of each option):
shell$ ./configure --help
88. Why does compiling the Fortran 90 bindings take soooo long?
NOTE: Starting with Open
MPI v1.7, if you are not using gfortran, building the Fortran 90 and
08 bindings do not suffer the same performance penalty that previous
versions incurred. The Open MPI developers encourage all users to
upgrade to the new Fortran bindings implementation — including the
new MPI-3 Fortran'08 bindings — when possible.
This is actually a design problem with the MPI F90 bindings
themselves. The issue is that since F90 is a strongly typed language,
we have to overload each function that takes a choice buffer with a
typed buffer. For example, MPI_SEND has many different overloaded
versions — one for each type of the user buffer. Specifically, there
is an MPI_SEND that has the following types for the first argument:
On the surface, this is 17 bindings for MPI_SEND. Multiply this by
every MPI function that takes a choice buffer (50) and you get 850
overloaded functions. However, the problem gets worse — for each
type, we also have to overload for each array dimension that needs to
be supported. Fortran allows up to 7 dimensional arrays, so this
becomes (17x7) = 119 versions of every MPI function that has a choice
buffer argument. This makes (17x7x50) = 5,950 MPI interface
functions.
To make matters even worse, consider the ~25 MPI functions that take
2 choice buffers. Functions have to be provided for all possible
combinations of types. This then becomes exponential — the total
number of interface functions balloons up to 6.8M.
Additionally, F90 modules must all have their functions in a single
source file. Hence, all 6.8M functions must be in one .f90 file and
compiled as a single unit (currently, no F90 compiler that we are
aware of can handle 6.8M interface functions in a single module).
To limit this problem, Open MPI, by default, does not generate
interface functions for any of the 2-buffer MPI functions.
Additionally, we limit the maximum number of supported dimensions to 4
(instead of 7). This means that we're generating (17x4x50) = 3,400
interface functions in a single F90 module. So it's far smaller than
6.8M functions, but it's still quite a lot.
This is what makes compiling the F90 module take so long.
Note, however, you can limit the maximum number of dimensions that
Open MPI will generate for the F90 bindings with the configure switch
--with-f90-max-array-dim=DIM, where DIM is an integer <= 7. The
default value is 4. Decreasing this value makes the compilation go
faster, but obviously supports fewer dimensions.
Other than this limit on dimension size, there is little else that we
can do — the MPI-2 F90 bindings were unfortunately not well thought
out in this regard.
Note, however, that the Open MPI team has proposed Fortran '03
bindings for MPI in a paper that was presented at the Euro
PVM/MPI'05 conference. These bindings avoid all the scalability
problems that are described above and have some other nice properties.
This is something that is being worked on in Open MPI, but there is
currently no estimated timeframe on when it will be available.
89. Does Open MPI support MPI_REAL16 and MPI_COMPLEX32?
It depends. Note that these datatypes are optional in the MPI
standard.
Prior to v1.3, Open MPI supported MPI_REAL16 and MPI_COMPLEX32 if
a portable C integer type could be found that was the same size
(measured in bytes) as Fortran's REAL*16 type. It was later
discovered that even though the sizes may be the same, the bit
representations between C and Fortran may be different. Since Open
MPI's reduction routines are implemented in C, calling MPI_REDUCE (and
related functions) with MPI_REAL16 or MPI_COMPLEX32 would generate
undefined results (although message passing with these types in
homogeneous environments generally worked fine).
As such, Open MPI v1.3 made the test for supporting MPI_REAL16 and
MPI_COMPLEX32 more stringent: Open MPI will support these types only
if:
An integer C type can be found that has the same size (measured
in bytes) as the Fortran REAL*16 type.
The bit representation is the same between the C type and the
Fortran type.
Version 1.3.0 only checks for portable C types (e.g., long double).
A future version of Open MPI may include support for compiler-specific
/ non-portable C types. For example, the Intel compiler has specific
options for creating a C type that is the same as REAL*16, but we did
not have time to include this support in Open MPI v1.3.0.
90. Can I re-locate my Open MPI installation without re-configuring/re-compiling/re-installing from source?
Starting with Open MPI v1.2.1, yes.
Background: Open MPI hard-codes some directory paths in its
executables based on installation paths specified by the configure
script. For example, if you configure with an installation prefix of
/opt/openmpi/, Open MPI encodes in its executables that it should be
able to find its help files in /opt/openmpi/share/openmpi.
The "installdirs" functionality in Open MPI lets you change any of
these hard-coded directory paths at run time
(assuming that you have already adjusted your PATH
and/or LD_LIBRARY_PATH environment variables to the new location
where Open MPI now resides). There are three methods (a short example follows this list):
Move an existing Open MPI installation to a new prefix: Set the
OPAL_PREFIX environment variable before launching Open MPI. For
example, if Open MPI had initially been installed to /opt/openmpi
and the entire openmpi tree was later moved to /home/openmpi,
setting OPAL_PREFIX to /home/openmpi will enable Open MPI to
function properly.
"Stage" an Open MPI installation in a temporary location: When
creating self-contained installation packages, systems such as RPM
install Open MPI into temporary locations. The package system then
bundles up everything under the temporary location into a package that
can be installed into its real location later. For example, when
creating an RPM that will be installed to /opt/openmpi, the RPM
system will transparently prepend a "destination directory" (or
"destdir") to the installation directory. As such, Open MPI will
think that it is installed in /opt/openmpi, but it is actually
temporarily installed in (for example)
/var/rpm/build.1234/opt/openmpi. If it is necessary to use Open
MPI while it is installed in this staging area, the OPAL_DESTDIR
environment variable can be used; setting OPAL_DESTDIR to
/var/rpm/build.1234 will automatically prefix every directory such
that Open MPI can function properly.
Overriding individual directories: Open MPI uses the
GNU-specified directories (per Autoconf/Automake), and can be
overridden by setting environment variables directly related to their
common names. The list of environment variables that can be used is:
OPAL_PREFIX
OPAL_EXEC_PREFIX
OPAL_BINDIR
OPAL_SBINDIR
OPAL_LIBEXECDIR
OPAL_DATAROOTDIR
OPAL_DATADIR
OPAL_SYSCONFDIR
OPAL_SHAREDSTATEDIR
OPAL_LOCALSTATEDIR
OPAL_LIBDIR
OPAL_INCLUDEDIR
OPAL_INFODIR
OPAL_MANDIR
OPAL_PKGDATADIR
OPAL_PKGLIBDIR
OPAL_PKGINCLUDEDIR
Note that not all of the directories listed above are used by Open
MPI; they are listed here in entirety for completeness.
Also note that several directories listed above are defined in terms
of other directories. For example, the $bindir is defined by
default as $prefix/bin. Hence, overriding the $prefix (via
OPAL_PREFIX) will automatically change the first part of the
$bindir (which is how method 1 described above works).
Alternatively, OPAL_BINDIR can be set to an absolute value that
ignores $prefix altogether.
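For example, a brief sketch of the first method combined with an individual-directory override (the paths here are hypothetical):
shell$ export OPAL_PREFIX=/home/openmpi
# Optionally override a single directory as well:
shell$ export OPAL_BINDIR=/home/openmpi/bin
shell$ mpirun ...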
91. How do I statically link to the libraries of Intel compiler suite?
The Intel compiler suite, by default, dynamically links its runtime libraries
against the Open MPI binaries and libraries. This can cause problems if the Intel
compiler libraries are installed in non-standard locations. For example, you might
get errors like:
error while loading shared libraries: libimf.so: cannot open shared object file:
No such file or directory
To avoid such problems, you can pass flags to Open MPI's configure
script that instruct the Intel compiler suite to statically link its
runtime libraries with Open MPI:
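For example (the exact flag name varies across Intel compiler versions; -static-intel is used here for illustration):
shell$ ./configure CC=icc CXX=icpc F77=ifort FC=ifort LDFLAGS=-static-intel ...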
92. Why do I get errors about hwloc or libevent not found?
Sometimes you may see errors similar to the following when attempting to build Open MPI:
...
PPFC profile/pwin_unlock_f08.lo
PPFC profile/pwin_unlock_all_f08.lo
PPFC profile/pwin_wait_f08.lo
FCLD libmpi_usempif08.la
ld: library not found for -lhwloc
collect2: error: ld returned 1 exit status
make[2]: *** [libmpi_usempif08.la] Error 1
This error can happen when a number of factors occur together:
If Open MPI's configure script chooses to use an "external"
installation of hwloc and/or Libevent (i.e., outside of Open
MPI's source tree).
If Open MPI's configure script chooses C and Fortran
compilers from different suites/installations.
Put simply: if the default search library search paths differ between
the C and Fortran compiler suites, the C linker may find a
system-installed libhwloc and/or libevent, but the Fortran linker
may not.
This may tend to happen more frequently starting with Open MPI v4.0.0
on Mac OS because:
In v4.0.0, Open MPI's configure script was changed to
"prefer" system-installed versions of hwloc and Libevent
(vs. preferring the hwloc and Libevent that are bundled in the Open
MPI distribution tarballs).
Homebrew installs the GNU C and Fortran compiler suites v9.1.0 under
/usr/local. However, the C compiler executable is named gcc-9
(not gcc!), whereas the Fortran compiler executable is
named gfortran.
These factors, taken together, result in Open MPI's configure script deciding the following:
The C compiler is gcc (which is the MacOS-installed C compiler).
The Fortran compiler is gfortran (which is the
Homebrew-installed Fortran compiler).
There is a suitable system-installed hwloc in /usr/local, which
can be found -- by the C compiler/linker -- without specifying any
additional linker search paths.
The careful reader will realize that the C and Fortran compilers are
from two entirely different installations. Indeed, their default
library search paths are different:
The MacOS-installed gcc will search /usr/local/lib by default.
The Homebrew-installed gfortran will not search /usr/local/lib by default.
Hence, since the majority of Open MPI's source code base is in C, it
compiles/links against hwloc successfully. But when Open MPI's
Fortran code for the mpi_f08 module is compiled and linked, the
Homebrew-installed gfortran -- which does not search
/usr/local/lib by default -- cannot find libhwloc, and the link
fails.
There are a few different possible solutions to this issue:
The best solution is to always ensure that Open MPI uses a C and
Fortran compiler from the same suite/installation. This will ensure
that both compilers/linkers will use the same default library search
paths, and all behavior should be consistent. For example, the
following instructs Open MPI's configure script to use gcc-9 for
the C compiler, which (as of July 2019) is the Homebrew executable
name for its installed C compiler:
shell$ ./configure CC=gcc-9 ...

# You can be precise and specify an absolute path for the C
# compiler, and/or also specify the Fortran compiler:
shell$ ./configure CC=/usr/local/bin/gcc-9 FC=/usr/local/bin/gfortran ...
Note that this will likely cause configure to not find the
Homebrew-installed hwloc, and instead fall back to using the bundled
hwloc in the Open MPI source tree (see this FAQ
question for more information about the bundled hwloc and/or
Libevent vs. system-installed versions).
Alternatively, you can simply force configure to select the
bundled versions of hwloc and libevent, which avoids the issue
altogether:
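For example (these configure values are described further in the FAQ entry below on bundled vs. system-installed hwloc and Libevent):
shell$ ./configure --with-hwloc=internal --with-libevent=internal ...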
Finally, you can tell configure exactly where to find the
external hwloc library. This can have some unintended consequences,
however, because it will prefix both the C and Fortran linker's
default search paths with /usr/local/lib:
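For example, assuming the Homebrew-installed hwloc under /usr/local, as in the scenario above:
shell$ ./configure --with-hwloc=/usr/local ...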
Be sure to also see this FAQ
question for more information about using the bundled hwloc and/or
Libevent vs. system-installed versions.
93. Should I use the bundled hwloc and Libevent, or system-installed versions?
From a performance perspective, there is no significant reason
to choose the bundled vs. system-installed hwloc and Libevent
installations. Specifically: both will likely give the same
performance.
There are other reasons to choose one or the other, however.
First, some background: Open MPI has internally used hwloc and/or Libevent for almost its entire
life. Years ago, it was not common for hwloc and/or Libevent to be
available on many systems, so the Open MPI community decided to bundle
entire copies of the hwloc and Libevent source code in Open MPI
distribution tarballs.
This system worked well: Open MPI used the bundled copies of hwloc and
Libevent which a) guaranteed that those packages would be available
(vs. telling users that they had to separately download/install those
packages before installing Open MPI), and b) guaranteed that the
versions of hwloc and Libevent were suitable for Open MPI's
requirements.
In the last few years, two things have changed:
hwloc and Libevent are now installed on many more systems by
default.
The hwloc and Libevent APIs have stabilized such that a wide
variety of hwloc/Libevent release versions are suitable for Open MPI's
requirements.
While not all systems have hwloc and Libevent available by default
(cough cough MacOS cough cough), it is now common enough that -- with
the suggestion from Open MPI's downstream packagers -- starting with
v4.0.0, Open MPI "prefers" system-installed hwloc and Libevent
installations over its own bundled copies.
Meaning: if configure finds a suitable system-installed hwloc and/or
Libevent, configure will choose to use those installations instead of
the bundled copies in the Open MPI source tree.
That being said, there definitely are obscure technical corner cases
and philosophical reasons to force the choice of one or the other. As
such, Open MPI provides configure command line options that can be
used to specify exact behavior in searching for hwloc and/or Libevent:
--with-hwloc=VALUE: VALUE can be one of the following:
internal: use the bundled copy of hwloc from Open MPI's source tree.
external: use an external copy of hwloc (e.g., a
system-installed copy), but only use default compiler/linker
search paths to find it.
A directory: use an external copy of hwloc that can be found
at dir/include and dir/lib or dir/lib64.
Note that Open MPI requires hwloc -- it is invalid to specify
--without-hwloc or --with-hwloc=no. Similarly, it is
meaningless to specify --with-hwloc (with no value) or
--with-hwloc=yes.
--with-hwloc-libdir=DIR: When used with
--with-hwloc=external, default compiler search paths will be used to
find hwloc's header files, but DIR will be used to specify the
location of the hwloc libraries. This can be necessary, for example,
if both 32 and 64 bit versions of the hwloc libraries are available,
and default linker search paths would find the "wrong" one.
--with-libevent and --with-libevent-libdir behave the same as the
hwloc versions described above, but influence configure's behavior
with respect to Libevent, not hwloc.
From Open MPI's perspective, it is always safe to use the bundled
copies. If there is ever a problem or conflict, you can specify
--with-hwloc=internal and/or --with-libevent=internal, and this
will likely solve your problem.
Additionally, note that Open MPI's configure will check some version
and functionality aspects from system-installed hwloc / Libevent, and
may still choose the bundled copies over system-installed copies
(e.g., the system-installed version is too low, the system-installed
version is not thread safe, ... etc.).
94. I'm still having problems / my problem is not listed here. What do I do?
Please see this FAQ
category for troubleshooting tips and the Getting Help page — it details
how to send a request to the Open MPI mailing lists.
95. Why does my MPI application fail to compile, complaining that
various MPI APIs/symbols are undefined?
Starting with v4.0.0, Open MPI — by default — removes the
prototypes from mpi.h for MPI symbols that were deprecated in 1996
in the MPI-2.0 standard, and finally removed from the MPI-3.0 standard
(2012).
Specifically, the following symbols (listed here by their MPI
language-neutral names, and covered individually by the FAQ entries
below) are no longer prototyped in mpi.h by default: MPI_ADDRESS,
MPI_ERRHANDLER_CREATE, MPI_ERRHANDLER_GET, MPI_ERRHANDLER_SET,
MPI_TYPE_HINDEXED, MPI_TYPE_HVECTOR, MPI_TYPE_STRUCT, MPI_TYPE_EXTENT,
MPI_TYPE_LB, MPI_TYPE_UB, MPI_LB, MPI_UB, the MPI_COMBINER_*_INTEGER
constants, and the MPI_Handler_function type.
Although these symbols are no longer prototyped in mpi.h, _they are
still present in the MPI library in Open MPI v4.0.x._ This enables
legacy MPI applications to link and run successfully with Open MPI
v4.0.x, even though they will fail to compile.
*The Open MPI team strongly encourages all
MPI application developers to stop using these constructs that were
first deprecated over 20 years ago, and finally removed from the MPI
specification in MPI-3.0 (in 2012).* The FAQ items in this category
show how to update your application to stop using these removed
symbols.
All that being said, if you are unable to immediately update your
application to stop using these removed MPI-1 symbols, you can
re-enable them in mpi.h by configuring Open MPI with the
--enable-mpi1-compatibility flag.
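For example:
shell$ ./configure --enable-mpi1-compatibility ...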
NOTE: Future releases of Open MPI beyond the v4.0.x series may
remove these symbols altogether.
96. Why on earth are you breaking the compilation of MPI
applications?
The Open MPI developer community decided to take a first step
of removing the prototypes for these symbols from mpi.h starting
with the Open MPI v4.0.x series for the following reasons:
These symbols have been deprecated since 1996. That's
28 years ago! It's time to start raising awareness
for developers who are inadvertently still using these removed
symbols.
The MPI Forum removed these symbols from the MPI-3.0
specification in 2012. This is a sign that the Forum itself
recognizes that these removed symbols are no longer needed.
Note that Open MPI did not fully remove these removed symbols:
we just made it slightly more painful to get to them. This is an
attempt to raise awareness so that MPI application developers can
update their applications (it's easy!).
In short: the only way to finally be able to remove these removed
symbols from Open MPI someday is to have a "grace period" where the
MPI application developers are a) made aware that they are using
removed symbols, and b) educated how to update their applications.
We, the Open MPI developers, recognize that your MPI application
failing to compile with Open MPI may be a nasty surprise. We
apologize for that.
Our intent is simply to use this minor shock to raise awareness and
use it as an educational opportunity to show you how to update your
application (or direct your friendly neighborhood MPI application
developer to this FAQ) to stop using these removed MPI symbols.
Thank you!
97. Why am I getting deprecation warnings when compiling my MPI application?
You are getting deprecation warnings because you are using
symbols / functions that are deprecated in MPI. For example:
shell$ mpicc deprecated-example.c -c
deprecated-example.c: In function 'foo':
deprecated-example.c:6:5: warning: 'MPI_Attr_delete' is deprecated: MPI_Attr_delete was deprecated in MPI-2.0; use MPI_Comm_delete_attr instead [-Wdeprecated-declarations]
MPI_Attr_delete(MPI_COMM_WORLD, 2);
^~~~~~~~~~~~~~~
In file included from deprecated-example.c:2:
/usr/local/openmpi/include/mpi.h:2601:20: note: declared here
OMPI_DECLSPEC int MPI_Attr_delete(MPI_Comm comm, int keyval)
^~~~~~~~~~~~~~~
Note that the deprecation compiler warnings tells you how to upgrade
your code to avoid the deprecation warnings. In this example, it
advises you to use MPI_Comm_delete_attr() instead of
MPI_Attr_delete().
Also, note that when using --enable-mpi1-compatibility to re-enable
removed MPI-1 symbols you will still get compiler warnings when you use
the removed symbols. For example:
shell$ mpicc deleted-example.c -c
deleted-example.c: In function 'foo':
deleted-example.c:8:5: warning: 'MPI_Address' is deprecated: MPI_Address was removed in MPI-3.0; use MPI_Get_address instead. [-Wdeprecated-declarations]
MPI_Address(buffer, &address);
^~~~~~~~~~~
In file included from deleted-example.c:2:
/usr/local/openmpi/include/mpi.h:2689:20: note: declared here
OMPI_DECLSPEC int MPI_Address(void *location, MPI_Aint *address)
^~~~~~~~~~~
98. How do I update my MPI application to stop using MPI_ADDRESS?
In C, the only thing that changed was the function name:
MPI_Address() → MPI_Get_address(). Nothing else needs
to change:
char buffer[30];
MPI_Aint address;

// Old way
MPI_Address(buffer, &address);

// New way
MPI_Get_address(buffer, &address);
In Fortran, the type of the parameter changed from INTEGER
→ INTEGER(KIND=MPI_ADDRESS_KIND) so that it can hold
larger values (e.g., 64 bit pointers):
USE mpi
REAL buffer
INTEGER ierror
INTEGER old_address
INTEGER(KIND= MPI_ADDRESS_KIND) new_address
! Old way
CALL MPI_ADDRESS(buffer, old_address, ierror)
! New way
CALL MPI_GET_ADDRESS(buffer, new_address, ierror)
99. How do I update my MPI application to stop using MPI_ERRHANDLER_CREATE?
In C, effectively the only thing that changed was the name
of the function: MPI_Errhandler_create() →
MPI_Comm_create_errhandler().
Technically, the type of the first parameter also changed
( MPI_Handler_function → MPI_Comm_errhandler_function),
but most applications do not use this type directly and may not even
notice the change.
void my_errhandler_function(MPI_Comm *comm, int *code, ...)
{
    // Do something useful to handle the error
}

void some_function(void)
{
    MPI_Errhandler my_handler;

    // Old way
    MPI_Errhandler_create(my_errhandler_function, &my_handler);

    // New way
    MPI_Comm_create_errhandler(my_errhandler_function, &my_handler);
}
In Fortran, only the subroutine name changed: MPI_ERRHANDLER_CREATE
→ MPI_COMM_CREATE_ERRHANDLER.
USE mpi
EXTERNAL my_errhandler_function
INTEGER ierror
INTEGER my_handler
! Old way
CALL MPI_ERRHANDLER_CREATE(my_errhandler_function, my_handler, ierror)
! New way
CALL MPI_COMM_CREATE_ERRHANDLER(my_errhandler_function, my_handler, ierror)
100. How do I update my MPI application to stop using MPI_ERRHANDLER_GET?
In both C and Fortran, the only thing that changed with
regards to MPI_ERRHANDLER_GET is the name: MPI_ERRHANDLER_GET
→ MPI_COMM_GET_ERRHANDLER.
All parameter types stayed the same.
101. How do I update my MPI application to stop using MPI_ERRHANDLER_SET?
In both C and Fortran, the only thing that changed with
regards to MPI_ERRHANDLER_SET is the name: MPI_ERRHANDLER_SET
→ MPI_COMM_SET_ERRHANDLER.
All parameter types stayed the same.
102. How do I update my MPI application to stop using MPI_TYPE_HINDEXED?
In both C and Fortran, effectively the only change is the
name of the function: MPI_TYPE_HINDEXED →
MPI_TYPE_CREATE_HINDEXED.
In C, the new function also has a const attribute on the two array
parameters, but most applications won't notice the difference.
All other parameter types stayed the same.
int count = 2;
int block_lengths[] = { 1, 2 };
MPI_Aint displacements[] = { 0, sizeof(int) };
MPI_Datatype newtype;

// Old way
MPI_Type_hindexed(count, block_lengths, displacements, MPI_INT, &newtype);

// New way
MPI_Type_create_hindexed(count, block_lengths, displacements, MPI_INT, &newtype);
103. How do I update my MPI application to stop using MPI_TYPE_HVECTOR?
In both C and Fortran, the only change is the
name of the function: MPI_TYPE_HVECTOR →
MPI_TYPE_CREATE_HVECTOR.
All parameter types stayed the same.
104. How do I update my MPI application to stop using MPI_TYPE_STRUCT?
In both C and Fortran, effectively the only change is the
name of the function: MPI_TYPE_STRUCT →
MPI_TYPE_CREATE_STRUCT.
In C, the new function also has a const attribute on the three array
parameters, but most applications won't notice the difference.
All other parameter types stayed the same.
int count = 2;
int block_lengths[] = { 1, 2 };
MPI_Aint displacements[] = { 0, sizeof(int) };
MPI_Datatype datatypes[] = { MPI_INT, MPI_DOUBLE };
MPI_Datatype newtype;

// Old way
MPI_Type_struct(count, block_lengths, displacements, datatypes, &newtype);

// New way
MPI_Type_create_struct(count, block_lengths, displacements, datatypes, &newtype);
105. How do I update my MPI application to stop using MPI_TYPE_EXTENT?
In both C and Fortran, the MPI_TYPE_EXTENT function is
superseded by the slightly-different MPI_TYPE_GET_EXTENT function:
the new function also returns the lower bound.
MPI_Aint lb;
MPI_Aint extent;

// Old way
MPI_Type_extent(MPI_INT, &extent);

// New way
MPI_Type_get_extent(MPI_INT, &lb, &extent);
106. How do I update my MPI application to stop using MPI_TYPE_LB?
In both C and Fortran, the MPI_TYPE_LB function is
superseded by the slightly-different MPI_TYPE_GET_EXTENT function:
the new function also returns the extent.
MPI_Aint lb;
MPI_Aint extent;

// Old way
MPI_Type_lb(MPI_INT, &lb);

// New way
MPI_Type_get_extent(MPI_INT, &lb, &extent);
107. How do I update my MPI application to stop using MPI_TYPE_UB?
In both C and Fortran, the MPI_TYPE_UB function is
superseded by the slightly-different MPI_TYPE_GET_EXTENT function:
the new function returns the lower bound and the extent, which can be
used to compute the upper bound.
MPI_Aint lb, ub;
MPI_Aint extent;

// Old way
MPI_Type_ub(MPI_INT, &ub);

// New way
MPI_Type_get_extent(MPI_INT, &lb, &extent);
ub = lb + extent;
Note the ub calculation after calling MPI_Type_get_extent().
108. How do I update my MPI application to stop using MPI_LB / MPI_UB?
The MPI_LB and MPI_UB positional markers were fully
replaced with MPI_TYPE_CREATE_RESIZED in MPI-2.0.
Prior to MPI-2.0, MPI_UB and MPI_LB were intended to be used as
input to MPI_TYPE_STRUCT (which, itself, has been deprecated and
renamed to MPI_TYPE_CREATE_STRUCT). The same end effect can now be
achieved with MPI_TYPE_CREATE_RESIZED.
For example, using the old method:
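The following is an illustrative sketch rather than the original listing; the displacement values are arbitrary:
// Old way: use MPI_LB and MPI_UB markers with MPI_Type_struct to set
// a datatype's bounds explicitly.
int block_lengths[] = { 1, 1, 1 };
MPI_Aint displacements[] = { -2, 0, 10 };
MPI_Datatype types[] = { MPI_LB, MPI_INT, MPI_UB };
MPI_Aint lb, ub, extent;
MPI_Datatype newtype;

MPI_Type_struct(3, block_lengths, displacements, types, &newtype);
MPI_Type_commit(&newtype);

MPI_Type_lb(newtype, &lb);         /* lb     = -2 */
MPI_Type_ub(newtype, &ub);         /* ub     = 10 */
MPI_Type_extent(newtype, &extent); /* extent = 12 */
printf("lb=%ld ub=%ld extent=%ld\n", (long) lb, (long) ub, (long) extent);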
The MPI_TYPE_CREATE_RESIZED function allows us to take any arbitrary
datatype and set the lower bound and extent directly (which indirectly
sets the upper bound), without needing to set up the arrays and
compute the displacements necessary to invoke
MPI_TYPE_CREATE_STRUCT.
Aside from the printf statement, the following example is exactly
equivalent to the prior example (see this FAQ entry for a mapping of
MPI_TYPE_UB to MPI_TYPE_GET_EXTENT):
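Again, this is an illustrative sketch rather than the original listing, using the same bounds as the sketch above:
// New way: set the lower bound and extent directly with
// MPI_Type_create_resized; the upper bound follows implicitly.
MPI_Aint lb, extent;
MPI_Datatype newtype;

MPI_Type_create_resized(MPI_INT, -2, 12, &newtype); /* lb = -2, extent = 12 */
MPI_Type_commit(&newtype);

MPI_Type_get_extent(newtype, &lb, &extent);
printf("lb=%ld ub=%ld extent=%ld\n", (long) lb, (long) (lb + extent), (long) extent);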
109. How do I update my MPI application to stop using MPI_COMBINER_HINDEXED_INTEGER, MPI_COMBINER_HVECTOR_INTEGER, and MPI_COMBINER_STRUCT_INTEGER?
The MPI_COMBINER_HINDEXED_INTEGER,
MPI_COMBINER_HVECTOR_INTEGER, and MPI_COMBINER_STRUCT_INTEGER
constants could previously be returned from MPI_TYPE_GET_ENVELOPE.
Starting with MPI-3.0, these values will never be returned. Instead,
MPI_TYPE_GET_ENVELOPE returns the corresponding names without the
_INTEGER suffix. Specifically:
MPI_COMBINER_HINDEXED_INTEGER → MPI_COMBINER_HINDEXED
MPI_COMBINER_HVECTOR_INTEGER → MPI_COMBINER_HVECTOR
MPI_COMBINER_STRUCT_INTEGER → MPI_COMBINER_STRUCT
If your Fortran code is using any of the _INTEGER-suffixed names,
you can just delete the _INTEGER suffix.
110. How do I update my MPI application to stop using MPI_Handler_function?
The MPI_Handler_function C type is only used in the
deprecated/removed function MPI_Errhandler_create(), as described in this FAQ entry.
Most MPI applications likely won't use this type at all. But if they
do, they can simply use the new, exactly-equivalent type name (i.e.,
the return type, number, and type of parameters didn't change):
MPI_Comm_errhandler_function.
void my_errhandler_function(MPI_Comm *comm, int *code, ...)
{
    // Do something useful to handle the error
}

void some_function(void)
{
    // Old way
    MPI_Handler_function *old_ptr = my_errhandler_function;

    // New way
    MPI_Comm_errhandler_function *new_ptr = my_errhandler_function;
}
The MPI_Handler_function type isn't used at all in the Fortran
bindings.
111. In general, how do I build MPI applications with Open MPI?
The Open MPI team strongly recommends that you simply use
Open MPI's "wrapper" compilers to compile your MPI applications.
That is, instead of using (for example) gcc to compile your program,
use mpicc. Open MPI provides wrapper compilers for C, C++, and Fortran:
C: mpicc
C++: mpiCC, mpicxx, or mpic++ (note that mpiCC will not exist on case-insensitive filesystems)
Fortran: mpifort (for v1.7 and above); mpif77 and mpif90 (for older versions)
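For example (the source file names are illustrative):
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ mpifort mpi_hello.f90 -o mpi_hello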
Note that Open MPI's wrapper compilers do not do any actual compiling
or linking; all they do is manipulate the command line and add in all
the relevant compiler / linker flags and then invoke the underlying
compiler / linker (hence, the name "wrapper" compiler). More
specifically, if you run into a compiler or linker error, check your
source code and/or back-end compiler — it is usually not the fault of
the Open MPI wrapper compiler.
112. Wait — what is mpifort? Shouldn't I use
mpif77 and mpif90?
mpifort is a new name for the Fortran wrapper compiler that
debuted in Open MPI v1.7.
It supports compiling all versions of Fortran, and utilizing all
MPI Fortran interfaces (mpif.h, use mpi, and use
mpi_f08). There is no need to distinguish between "Fortran 77"
(which hasn't existed for 30+ years) or "Fortran 90" — just use
mpifort to compile all your Fortran MPI applications and don't worry
about what dialect it is, nor which MPI Fortran interface it uses.
Other MPI implementations will also soon support a wrapper compiler
named mpifort, so hopefully we can move the whole world to this
simpler wrapper compiler name, and eliminate the use of mpif77 and
mpif90.
Specifically: mpif77 and mpif90 are
deprecated as of Open MPI v1.7. Although mpif77 and
mpif90 still exist in Open MPI v1.7 for legacy reasons, they will
likely be removed in some (undetermined) future release. It is in
your interest to convert to mpifort now.
Also note that these names are literally just symbolic links to mpifort
under the covers. So you're using mpifort whether you realize it or
not. :-)
Basically, the 1980's called; they want their mpif77 wrapper
compiler back. Let's let them have it.
113. I can't / don't want to use Open MPI's wrapper compilers.
What do I do?
We repeat the above statement: the Open MPI Team strongly
recommends that you use the wrapper compilers to compile and link MPI
applications.
If you find yourself saying, "But I don't want to use wrapper
compilers!", please humor us and try them. See if they work for you.
Be sure to let us know if they do not work for you.
Many people base their "wrapper compilers suck!" mentality on bad
behavior from poorly-implemented wrapper compilers in the mid-1990's.
Things are much better these days; wrapper compilers can handle
almost any situation, and are far more reliable than you attempting to
hard-code the Open MPI-specific compiler and linker flags manually.
That being said, there are some — very, very few — situations
where using wrapper compilers can be problematic — such as nesting
multiple wrapper compilers of multiple projects. Hence, Open MPI
provides a workaround to find out what command line flags you need to
compile MPI applications. There are generally two sets of flags that
you need: compile flags and link flags.
# Show the flags necessary to compile MPI C applications
shell$ mpicc --showme:compile

# Show the flags necessary to link MPI C applications
shell$ mpicc --showme:link
The --showme:* flags work with all Open MPI wrapper compilers
(specifically: mpicc, mpiCC / mpicxx / mpic++, mpifort, and
if you really must use them, mpif77, mpif90).
Hence, if you need to use some compiler other than Open MPI's
wrapper compilers, we advise you to run the appropriate Open MPI
wrapper compiler with the --showme flags to see what Open MPI needs
to compile / link, and then use those with your compiler.
NOTE: It is absolutely not sufficient
to simply add "-lmpi" to your link line and assume that you will
obtain a valid Open MPI executable.
NOTE: It is almost never a good idea to hard-code these results in a
Makefile (or other build system). It is almost always best to run
(for example) "mpicc --showme:compile" in a dynamic fashion to
find out what you need. For example, GNU Make allows running commands
and assigning their results to variables:
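A sketch of such a Makefile fragment (the variable names are illustrative):
# Query the wrapper at build time rather than hard-coding its flags.
MPI_COMPILE_FLAGS = $(shell mpicc --showme:compile)
MPI_LINK_FLAGS    = $(shell mpicc --showme:link)
# ...then use $(MPI_COMPILE_FLAGS) and $(MPI_LINK_FLAGS) in your
# compile and link rules.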
114. How do I override the flags specified by Open MPI's wrapper compilers? (v1.0 series)
NOTE: This answer applies to the v1.0 series of Open MPI only; for v1.1 and later, see the next entry.
In the v1.0 series, each wrapper compiler builds a command line for its
underlying back-end compiler from several groups of flags, where
<compiler> is replaced by the default back-end compiler for each
language, and "x" (as in xFLAGS, xCPPFLAGS, xLDFLAGS, and xLIBS) is
customized for each language (i.e., C, C++, F77, and F90).
By setting appropriate environment variables, a user can
override the default values used by the wrapper compilers. Each
wrapper compiler has its own set of variables, plus a Generic set; a
Generic variable applies to any wrapper compiler if the corresponding
wrapper-specific variable is not set. For example, the value of
$OMPI_LDFLAGS will be used with mpicc only if
$OMPI_MPICC_LDFLAGS is not set.
NOTE: If you set one of these variables, Open MPI will entirely
replace the default value that was originally there. Hence, it is
advisable to only replace these values when absolutely necessary.
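For example, a hypothetical Bourne-shell session that overrides only mpicc's linker flags:
shell$ export OMPI_MPICC_LDFLAGS="-L/some/other/libdir"
shell$ mpicc my_app.c -o my_app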
115. How do I override the flags specified by Open MPI's wrapper
compilers? (v1.1 series and beyond)
NOTE: This answer
applies to the v1.1 and later series of Open MPI only. If you are
using the v1.0 series, please see this
FAQ entry.
The Open MPI wrapper compilers are driven by text files that
contain, among other things, the flags that are passed to the
underlying compiler. These text files are generated automatically for
Open MPI and are customized for the compiler set that was selected
when Open MPI was configured; it is not recommended that users edit
these files.
Note that changing the underlying compiler may not work at
all. For example, C++ and Fortran compilers are notoriously
binary incompatible with each other (sometimes even within multiple
releases of the same compiler). If you compile/install Open MPI with
C++ compiler XYZ and then use the OMPI_CXX environment
variable to change the mpicxx wrapper compiler to use the
ABC C++ compiler, your application code may not compile and/or link.
The traditional method of using multiple different compilers with Open
MPI is to install Open MPI multiple times; each installation should be
built/installed with a different compiler. This is annoying, but it
is beyond the scope of Open MPI to be able to fix.
However, there are cases where it may be necessary or desirable to
edit these files and add to or subtract from the flags that Open MPI
selected. These files are installed in $pkgdatadir (which defaults
to $prefix/share/openmpi) and are named <wrapper_name>-wrapper-data.txt. A
few environment variables are available for run-time replacement of
the wrapper's default values (from the text files):
For each wrapper compiler, the following lists the environment variable that selects the underlying compiler, then the variables for preprocessor flags, compiler flags, linker flags, and linker library flags, and finally the wrapper's data file:

Open MPI wrapper compilers:
mpicc: compiler OMPI_CC; preprocessor flags OMPI_CPPFLAGS; compiler flags OMPI_CFLAGS; linker flags OMPI_LDFLAGS; linker library flags OMPI_LIBS; data file mpicc-wrapper-data.txt
mpic++: compiler OMPI_CXX; preprocessor flags OMPI_CPPFLAGS; compiler flags OMPI_CXXFLAGS; linker flags OMPI_LDFLAGS; linker library flags OMPI_LIBS; data file mpic++-wrapper-data.txt
mpiCC: compiler OMPI_CXX; preprocessor flags OMPI_CPPFLAGS; compiler flags OMPI_CXXFLAGS; linker flags OMPI_LDFLAGS; linker library flags OMPI_LIBS; data file mpiCC-wrapper-data.txt
mpifort: compiler OMPI_FC; preprocessor flags OMPI_CPPFLAGS; compiler flags OMPI_FCFLAGS; linker flags OMPI_LDFLAGS; linker library flags OMPI_LIBS; data file mpifort-wrapper-data.txt
mpif77 (deprecated as of v1.7): compiler OMPI_F77; preprocessor flags OMPI_CPPFLAGS; compiler flags OMPI_FFLAGS; linker flags OMPI_LDFLAGS; linker library flags OMPI_LIBS; data file mpif77-wrapper-data.txt
mpif90 (deprecated as of v1.7): compiler OMPI_FC; preprocessor flags OMPI_CPPFLAGS; compiler flags OMPI_FCFLAGS; linker flags OMPI_LDFLAGS; linker library flags OMPI_LIBS; data file mpif90-wrapper-data.txt

OpenRTE wrapper compilers:
ortecc: compiler ORTE_CC; preprocessor flags ORTE_CPPFLAGS; compiler flags ORTE_CFLAGS; linker flags ORTE_LDFLAGS; linker library flags ORTE_LIBS; data file ortecc-wrapper-data.txt
ortec++: compiler ORTE_CXX; preprocessor flags ORTE_CPPFLAGS; compiler flags ORTE_CXXFLAGS; linker flags ORTE_LDFLAGS; linker library flags ORTE_LIBS; data file ortec++-wrapper-data.txt

OPAL wrapper compilers:
opalcc: compiler OPAL_CC; preprocessor flags OPAL_CPPFLAGS; compiler flags OPAL_CFLAGS; linker flags OPAL_LDFLAGS; linker library flags OPAL_LIBS; data file opalcc-wrapper-data.txt
opalc++: compiler OPAL_CXX; preprocessor flags OPAL_CPPFLAGS; compiler flags OPAL_CXXFLAGS; linker flags OPAL_LDFLAGS; linker library flags OPAL_LIBS; data file opalc++-wrapper-data.txt
Note that the values of these fields can be directly influenced by
passing flags to Open MPI's configure script. The following options
are available to configure:
--with-wrapper-cflags: Extra flags to add to CFLAGS when using
mpicc.
--with-wrapper-cxxflags: Extra flags to add to CXXFLAGS when
using mpiCC.
--with-wrapper-fflags: Extra flags to add to FFLAGS when using
mpif77(this option has disappeared in
Open MPI 1.7 and will not return; see this FAQ entry for more
details).
--with-wrapper-fcflags: Extra flags to add to FCFLAGS when
using mpif90 and mpifort.
--with-wrapper-ldflags: Extra flags to add to LDFLAGS when
using any of the wrapper compilers.
--with-wrapper-libs: Extra flags to add to LIBS when using any
of the wrapper compilers.
The files cited above use a fairly simplistic "key=value" data
format. The following fields are likely to be of interest to
end-users (a hypothetical excerpt appears after this list):
project_short: Prefix for all environment variables. See
below.
compiler_env: Specifies the base name of the environment
variable that can be used to override the wrapper's underlying
compiler at run-time. The full name of the environment variable is of
the form <project_short>_<compiler_env>; see table
above.
compiler_flags_env: Specifies the base name of the environment
variable that can be used to override the wrapper's compiler flags at
run-time. The full name of the environment variable is of the form
<project_short>_<compiler_flags_env>; see table
above.
compiler: The executable name of the underlying compiler.
extra_includes: Relative to $installdir, a list of directories
to also list in the preprocessor flags to find header files.
preprocessor_flags: A list of flags passed to the
preprocessor.
compiler_flags: A list of flags passed to the compiler.
linker_flags: A list of flags passed to the linker.
libs: A list of libraries passed to the linker.
required_file: If non-empty, check for the presence of this
file before continuing. If the file is not there, the wrapper will
abort saying that the language is not supported.
includedir: Directory containing Open MPI's header files. The
proper compiler "include" flag is prepended to this directory and
added into the preprocessor flags.
libdir: Directory containing Open MPI's library files. The
proper compiler library-path flag (e.g., -L) is prepended to this
directory and added into the linker flags.
module_option: This field only appears in mpif90. It is the
flag that the Fortran 90 compiler requires to declare where module
files are located.
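Tying these keys together, a hypothetical excerpt from an mpicc-wrapper-data.txt file might look like the following (the exact keys present and their values vary by installation and Open MPI version):
# Hypothetical excerpt; values shown here are illustrative only
project_short=OMPI
compiler_env=CC
compiler_flags_env=CFLAGS
compiler=gcc
preprocessor_flags=
compiler_flags=-pthread
linker_flags=
libs=-lmpi
required_file=
includedir=${includedir}
libdir=${libdir}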
116. How can I tell what the wrapper compiler default flags are?
If the corresponding environment variables are not set, the
wrappers will add -I$includedir and -I$includedir/openmpi (which
usually map to $prefix/include and $prefix/include/openmpi,
respectively) to the xFLAGS area, and add -L$libdir (which usually
maps to $prefix/lib) to the xLDFLAGS area.
To obtain the values of the other flags, there are two main methods:
Use the --showme option to any wrapper compiler. For example
(lines broken here for readability):
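A hypothetical invocation (the back-end compiler, paths, and flags shown are illustrative; your installation will differ):
shell$ mpicc hello.c -o hello --showme
gcc hello.c -o hello -I/opt/openmpi/include -pthread \
    -L/opt/openmpi/lib -lmpi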
This shows a coarse-grained method for getting the entire command
line, but does not tell you what each set of flags are (xFLAGS,
xCPPFLAGS, xLDFLAGS, and xLIBS).
Use the ompi_info command. For example:
shell$ ompi_info --all | grep wrapper
   Wrapper extra CFLAGS:
   Wrapper extra CXXFLAGS:
   Wrapper extra FFLAGS:
   Wrapper extra FCFLAGS:
   Wrapper extra LDFLAGS:
   Wrapper extra LIBS: -lutil -lnsl -ldl -Wl,--export-dynamic -lm
This installation is only adding options in the xLIBS areas of the
wrapper compilers; all other values are blank (remember: the -I's
and -L's are implicit).
Note that the --parsable option can be used to obtain
machine-parsable versions of this output. For example:
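A minimal sketch; the output (not shown) consists of machine-parsable key/value lines:
shell$ ompi_info --all --parsable | grep wrapper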
117. Why does "mpicc --showme <some flags>" not show any
MPI-relevant flags?
The output of commands similar to the following may be
somewhat surprising:
shell$ mpicc -g --showme
gcc -g
shell$
Where are all the MPI-related flags, such as the necessary -I, -L, and
-l flags?
The short answer is that these flags are not included in the wrapper
compiler's underlying command line unless the wrapper compiler sees a
filename argument. Specifically (output artificially wrapped below for
readability):
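A sketch of the difference (the underlying gcc command and install paths are illustrative):
shell$ mpicc --showme
gcc
shell$ mpicc foo.c --showme
gcc -I/opt/openmpi/include -pthread foo.c \
    -L/opt/openmpi/lib -lmpi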
The second command had the filename "foo.c" in it, so the wrapper
added all the relevant flags. This behavior is specifically to allow
behavior such as the following:
shell$ mpicc --version --showme
gcc --version
shell$ mpicc --version
i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5363)
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
shell$
That is, the wrapper compiler does not behave differently when
constructing the underlying command line if --showme is used or
not. The only difference is whether the resulting command line is
displayed or executed.
Hence, this behavior allows users to pass arguments to the underlying
compiler without intending to actually compile or link (such as
passing --version to query the underlying compiler's version). If the
wrapper compilers added more flags in these cases, some underlying
compilers would emit warnings.
118. Are there ways to just add flags to the wrapper compilers?
Yes!
Open MPI's configure script allows you to add command line flags to
the wrappers on a permanent basis. The following configure options
are available:
--with-wrapper-cflags=<flags>: These flags are added into
the CFLAGS area in the mpicc wrapper compiler.
--with-wrapper-cxxflags=<flags>: These flags are added into
the CXXFLAGS area in the mpicxx wrapper compiler.
--with-wrapper-fflags=<flags>: These flags are added into
the FFLAGS area in the mpif77 wrapper compiler (this option has disappeared in Open MPI 1.7 and
will not return; see this
FAQ entry for more details).
--with-wrapper-fcflags=<flags>: These flags are added into
the FCFLAGS area in the mpif90 wrapper compiler.
--with-wrapper-ldflags=<flags>: These flags are added into
the LDFLAGS area in all the wrapper compilers.
--with-wrapper-libs=<flags>: These flags are added into
the LIBS area in all the wrapper compilers.
These configure options can be handy if you have some optional
compiler/linker flags that you need both Open MPI and all MPI
applications to be compiled with. Rather than trying to get all your
users to remember to pass the extra flags to the compiler when
compiling their applications, you can specify them with the configure
options shown above, thereby silently including them in the Open MPI
wrapper compilers — your users will therefore be using the correct
flags without ever knowing it.
119. Why don't the wrapper compilers add "-rpath" (or similar)
flags by default? (version v1.7.3 and earlier)
The default installation of Open MPI tries very hard to not
include any non-essential flags in the wrapper compilers. This is the
most conservative setting and allows the greatest flexibility for
end-users. If the wrapper compilers started adding flags to support
specific features (such as run-time locations for finding the Open MPI
libraries), such flags — no matter how useful to some portion of
users — would almost certainly break assumptions and functionality
for other users.
As a workaround, Open MPI provides several mechanisms for users to
manually override the flags in the wrapper compilers:
First and simplest, you can add your own flags to the wrapper
compiler command line by simply listing them on the command line. For
example:
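For example, to add a run-time library search path (the directory is a placeholder):
shell$ mpicc my_app.c -o my_app -Wl,-rpath,/opt/openmpi/lib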
Use the --showme options to the wrapper compilers to
dynamically see what flags the wrappers are adding, and modify them as
appropriate. See this FAQ entry for
more details.
Use environment variables to override the arguments that the
wrappers insert. If you are using Open MPI 1.0.x, see this FAQ entry, otherwise see this FAQ entry.
If you are using Open MPI 1.1 or later, you can modify text files
that provide the system-wide default flags for the wrapper compilers.
See this FAQ entry for more
details.
If you are using Open MPI 1.1 or later, you can pass additional
flags in to the system-wide wrapper compiler default flags through
Open MPI's configure script. See this FAQ entry for more
details.
You can use one or more of these methods to insert your own flags
(such as -rpath or similar).
120. Why do the wrapper compilers add "-rpath" (or similar)
flags by default? (version v1.7.4 and beyond)
Prior to v1.7.4, the Open MPI wrapper compilers did not
automatically add -rpath (or similar) flags when linking MPI
application executables (for all the reasons in this FAQ entry).
Due to popular user request, Open MPI changed its policy starting with
v1.7.4: by default on supported systems, Open MPI's wrapper compilers
do insert -rpath (or similar) flags when linking MPI applications.
You can see the exact flags added by the --showme functionality
described in this FAQ
entry.
This behavior can be disabled by configuring Open MPI with the
--disable-wrapper-rpath CLI option.
121. Can I build 100% static MPI applications?
Fully static linking is not for the weak, and it is not
recommended. But it is possible, with some caveats.
You must have static libraries available for everything that
your program links to. This includes Open MPI; you must have used the
--enable-static option to Open MPI's configure or otherwise have
available the static versions of the Open MPI libraries (note that
Open MPI static builds default to including all of its plugins in
its libraries — as opposed to having each plugin in its own dynamic
shared object file. So all of Open MPI's code will be contained in
the static libraries — even what are normally contained in Open MPI's
plugins). Note that some popular Linux libraries do not have static
versions by default (e.g., libnuma), or require additional RPMs to be
installed to get the equivalent libraries.
Open MPI must have been built without a memory manager. This
means that Open MPI must have been configured with the
--without-memory-manager flag. This is irrelevant on some platforms
for which Open MPI does not have a memory manager, but on some
platforms it is necessary (Linux). It is harmless to use this flag on
platforms where Open MPI does not have a memory manager. Not having a
memory manager means that Open MPI's mpi_leave_pinned behavior for
OS-bypass networks such as InfiniBand will not work.
On some systems (Linux), you may see linker warnings about some
files requiring dynamic libraries for functions such as gethostname
and dlopen. These are ok, but do mean that you need to have the
shared libraries installed. You can disable all of Open MPI's
dlopen behavior (i.e., prevent it from trying to open any plugins)
by specifying the --disable-dlopen flag to Open MPI's configure
script). This will eliminate the linker warnings about dlopen.
For example, this is how to configure Open MPI to build static
libraries on Linux:
shell$ ./configure --without-memory-manager --without-libnuma \
    --enable-static [...your other configure arguments...]
Some systems may have additional constraints about their support
libraries that require additional steps to produce working 100% static
MPI applications. For example, the libibverbs support library from
OpenIB / OFED has its own plugin system (which, by default, won't work
with an otherwise-static application); MPI applications need
additional compiler/linker flags to be specified to create a working
100% static MPI application. See this
FAQ entry for the details.
122. Can I build 100% static OpenFabrics / OpenIB / OFED MPI
applications on Linux?
Fully static linking is not for the weak, and it is not
recommended. But it is possible. First, you must read this FAQ entry.
For an OpenFabrics / OpenIB / OFED application to be built statically,
you must have libibverbs v1.0.4 or later (v1.0.4 was released after
OFED 1.1, so if you have OFED 1.1, you will manually need to upgrade
your libibverbs). Both libibverbs and your verbs hardware plugin must
be available in static form.
Once all of that has been set up, run the following (artificially
wrapped sample output shown below — your output may be slightly
different):
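A sketch of this step (your_app.c is a placeholder file name):
shell$ mpicc your_app.c -o your_app --showme
[...the full underlying compiler command line is printed here...]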
(Or use whatever wrapper compiler is relevant — the --showme flag
is the important part here.)
This example shows the steps for the GNU compiler suite, but other
compilers will be similar. This example also assumes that the
OpenFabrics / OpenIB / OFED install was rooted at /usr/local/ofed;
some distributions install under /usr/ofed (or elsewhere). Finally,
some installations use the library directory "lib64" while others
use "lib". Adjust your directory names as appropriate.
Take the output of the above command and run it manually to
compile and link your application, adding the following arguments:
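A hypothetical resulting command line, assuming an OFED installation rooted at /usr/local/ofed and the Mellanox mthca plugin discussed below; the bracketed portion stands for everything printed by --showme:
shell$ gcc [...everything shown by the --showme output above...] \
    -static -Wl,--whole-archive \
    /usr/local/ofed/lib64/infiniband/mthca.a \
    -Wl,--no-whole-archive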
Note that the mthca.a file is the verbs plugin for Mellanox HCAs.
If you have an HCA from a different vendor (such as IBM or QLogic),
use the appropriate filename (look in $ofed_libdir/infiniband for
verbs plugin files for your hardware).
Specifically, these added arguments do the following:
-static: Tell the linker to generate a static executable.
-Wl,--whole-archive: Tell the linker to include the entire
ibverbs library in the executable.
$ofed_root/lib64/infiniband/mthca.a: Include the Mellanox verbs
plugin in the executable.
-Wl,--no-whole-archive: Tell the linker to return to the
default of not including entire libraries in the executable.
You can either add these arguments in manually, or you can see this FAQ entry to
modify the default behavior of the wrapper compilers to hide this
complexity from end users (but be aware that if you modify the wrapper
compilers' default behavior, all users will be creating static
applications!).
123. Why does it take soooo long to compile F90 MPI applications?
NOTE: Starting with Open
MPI v1.7, if you are not using gfortran, building the Fortran 90 and
08 bindings does not suffer the same performance penalty that previous
versions incurred. The Open MPI developers encourage all users to
upgrade to the new Fortran bindings implementation — including the
new MPI-3 Fortran'08 bindings — when possible.
This is unfortunately due to a design flaw in the MPI F90
bindings themselves.
The answer to this question is exactly the same as it is for why it
takes so long to compile the MPI F90 bindings in the Open MPI
implementation; please see
this FAQ entry for the details.
124. How do I build BLACS with Open MPI?
The blacs_install.ps file (available from the BLACS web site)
describes how to build BLACS, so we won't repeat much of it here
(especially since it might change in future versions). These
instructions only pertain to making Open MPI work correctly with
BLACS.
After selecting the appropriate starting Bmake.inc, make the
following changes to Sections 1, 2, and 3. The example below is from
the Bmake.MPI-SUN4SOL2; your Bmake.inc file may be different.
# Section 1:
# Ensure to use MPI for the communication layer
   COMMLIB = MPI

# The MPIINCdir macro is used to link in mpif.h and
# must contain the location of Open MPI's mpif.h.
# The MPILIBdir and MPILIB macros are irrelevant
# and should be left empty.
   MPIdir = /path/to/openmpi-5.0.6
   MPILIBdir =
   MPIINCdir = $(MPIdir)/include
   MPILIB =

# Section 2:
# Set these values:
   SYSINC =
   INTFACE = -Df77IsF2C
   SENDIS =
   BUFF =
   TRANSCOMM = -DUseMpi2
   WHATMPI =
   SYSERRORS =

# Section 3:
# You may need to specify the full path to
# mpif77 / mpicc if they aren't already in
# your path.
   F77 = mpif77
   F77LOADFLAGS =
   CC = mpicc
   CCLOADFLAGS =
The remainder of the values are fairly obvious and irrelevant to Open
MPI; you can set whatever optimization level you want, etc.
If you follow the rest of the instructions for building, BLACS will
build correctly and use Open MPI as its MPI communication layer.
125. How do I build ScaLAPACK with Open MPI?
The scalapack_install.ps file (available from the ScaLAPACK web site)
describes how to build ScaLAPACK, so we won't repeat much of it here
(especially since it might change in future versions). These
instructions only pertain to making Open MPI work correctly with
ScaLAPACK. These instructions assume that you have built and
installed BLACS with Open MPI.
# Make sure you follow the instructions to build BLACS with Open MPI,
# and put its location in the following.
   BLACSdir = ...path where you installed BLACS...

# The MPI section is commented out.  Uncomment it.  The wrapper
# compiler will handle SMPLIB, so make it blank.  The rest are correct
# as is.
   USEMPI = -DUsingMpiBlacs
   SMPLIB =
   BLACSFINIT = $(BLACSdir)/blacsF77init_MPI-$(PLAT)-$(BLACSDBGLVL).a
   BLACSCINIT = $(BLACSdir)/blacsCinit_MPI-$(PLAT)-$(BLACSDBGLVL).a
   BLACSLIB = $(BLACSdir)/blacs_MPI-$(PLAT)-$(BLACSDBGLVL).a
   TESTINGdir = $(HOME)/TESTING

# The PVMBLACS setup needs to be commented out.
   #USEMPI =
   #SMPLIB = $(PVM_ROOT)/lib/$(PLAT)/libpvm3.a -lnsl -lsocket
   #BLACSFINIT =
   #BLACSCINIT =
   #BLACSLIB = $(BLACSdir)/blacs_PVM-$(PLAT)-$(BLACSDBGLVL).a
   #TESTINGdir = $(HOME)/pvm3/bin/$(PLAT)

# Make sure that the BLASLIB points to the right place.  We built this
# example on Solaris, hence the name below.  The Linux version of the
# library (as of this writing) is blas_LINUX.a.
   BLASLIB = $(LAPACKdir)/blas_solaris.a

# You may need to specify the full path to mpif77 / mpicc if they
# aren't already in your path.
   F77 = mpif77
   F77LOADFLAGS =
   CC = mpicc
   CCLOADFLAGS =
The remainder of the values are fairly obvious and irrelevant to Open
MPI; you can set whatever optimization level you want, etc.
If you follow the rest of the instructions for building, ScaLAPACK
will build correctly and use Open MPI as its MPI communication
layer.
126. How do I build PETSc with Open MPI?
The only special configuration that you need to build PETSc is
to ensure that Open MPI's wrapper compilers (i.e., mpicc and
mpif77) are in your $PATH before running the PETSc configure.py
script.
PETSc should then automatically find Open MPI's wrapper compilers and
correctly build itself using Open MPI.
127. How do I build VASP with Open MPI?
The following was reported by an Open MPI user who was able to
successfully build and run VASP with Open MPI:
I just compiled the latest VASP v4.6 using Open MPI v1.2.1, ifort
v9.1, ACML v3.6.0, BLACS with patch-03 and Scalapack v1.7.5 built with
ACML.
I configured Open MPI with --enable-static flag.
I used the VASP supplied makefile.linux_ifc_opt and only corrected
the paths to the ACML, scalapack, and BLACS dirs (I didn't lower the
optimization to -O0 for mpi.f like I suggested before). The -D's
are standard, except that I get a little better performance with
-DscaLAPACK (I tested it without this option too).
Also, Blacs and Scalapack used the -D's suggested in the Open MPI FAQ.
128. Are other language / application bindings available for Open MPI?
Other MPI language bindings and application-level programming
interfaces have been written by third parties. Here are links
to some of the available packages:
...we used to maintain a list of links here. But the list changes
over time; projects come, and projects go. Your best bet these days
is simply to use Google to find MPI bindings and application-level
programming interfaces.
129. Why does my legacy MPI application fail to compile with Open MPI v4.0.0 (and beyond)?
Starting with v4.0.0, Open MPI — by default — removes the
prototypes for MPI symbols that were deprecated in 1996 and finally
removed from the MPI standard in MPI-3.0 (2012).
130. What prerequisites are necessary for running an Open MPI job?
In general, Open MPI requires that its executables are in your
PATH on every node that you will run on and if Open MPI was compiled
as dynamic libraries (which is the default), the directory where its
libraries are located must be in your LD_LIBRARY_PATH on every node.
Specifically, if Open MPI was installed with a prefix of /opt/openmpi,
then /opt/openmpi/bin should be in your PATH and /opt/openmpi/lib
should be in your LD_LIBRARY_PATH.
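For example, for Bourne-style shells:
shell$ export PATH=/opt/openmpi/bin:$PATH
shell$ export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH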
See this FAQ entry for more
details on how to add Open MPI to your PATH and LD_LIBRARY_PATH.
Additionally, Open MPI requires that jobs can be started on remote
nodes without any input from the keyboard. For example, if using
rsh or ssh as the remote agent, you must have your environment
setup to allow execution on remote nodes without entering a password
or passphrase.
131. What ABI guarantees does Open MPI provide?
Open MPI's versioning and ABI scheme is described
here, but is summarized here in this FAQ entry for convenience.
Open MPI provided forward application binary interface (ABI)
compatibility for MPI applications starting with v1.3.2. Prior to
that version, no ABI guarantees were provided.
NOTE: Prior to v1.3.2, subtle
and strange failures are almost guaranteed to occur if applications
were compiled and linked against shared libraries from one version of
Open MPI and then run with another. The Open MPI team strongly
discourages making any ABI assumptions before v1.3.2.
NOTE: ABI for the "use mpi"
Fortran interface was inadvertently broken in the v1.6.3 release, and
was restored in the v1.6.4 release. Any Fortran applications that
utilize the "use mpi" MPI interface that were compiled and linked
against the v1.6.3 release will not be link-time compatible with other
releases in the 1.5.x / 1.6.x series. Such applications remain source
compatible, however, and can be recompiled/re-linked with other Open
MPI releases.
Starting with v1.3.2, Open MPI provides forward ABI compatibility —
with respect to the MPI API only — in all versions of a given feature
release series and its corresponding super stable
series. For example, on a single platform, an MPI application
linked against Open MPI v1.3.2 shared libraries can be updated to
point to the shared libraries in any successive v1.3.x or v1.4 release
and still work properly (e.g., via the LD_LIBRARY_PATH environment
variable or other operating system mechanism).
For the v1.5 series, this means that all releases of v1.5.x and v1.6.x
will be ABI compatible, per the above definition.
Open MPI reserves the right to break ABI compatibility at new feature
release series. For example, the same MPI application from above
(linked against Open MPI v1.3.2 shared libraries) will not work with
Open MPI v1.5 shared libraries. Similarly, MPI applications
compiled/linked against Open MPI 1.6.x will not be ABI compatible with
Open MPI 1.7.x.
132. Do I need a common filesystem on all my nodes?
No, but it certainly makes life easier if you do.
A common environment to run Open MPI is in a "Beowulf"-class or
similar cluster (e.g., a bunch of 1U servers in a bunch of racks).
Simply stated, Open MPI can run on a group of servers or workstations
connected by a network. As mentioned above, there are several
prerequisites, however (for example, you typically must have an
account on all the machines, you can rsh or ssh between the nodes
without using a password, etc.).
Regardless of whether Open MPI is installed on a shared / networked
filesystem or independently on each node, it is usually easiest if
Open MPI is available in the same filesystem location on every node.
For example, if you install Open MPI to /opt/openmpi-5.0.6 on
one node, ensure that it is available in /opt/openmpi-5.0.6
on all nodes.
This FAQ entry
has a bunch more information about installation locations for Open
MPI.
133. How do I add Open MPI to my PATH and LD_LIBRARY_PATH?
Open MPI must be able to find its executables in your PATH
on every node (if Open MPI was compiled as dynamic libraries, then its
library path must appear in LD_LIBRARY_PATH as well). As such, your
configuration/initialization files need to add Open MPI to your PATH
/ LD_LIBRARY_PATH properly.
How to do this may be highly dependent upon your local configuration,
so you may need to consult with your local system administrator. Some
system administrators take care of these details for you, some don't.
YMMV. Some common examples are included below, however.
You must have at least a minimum understanding of how your shell works
to get Open MPI in your PATH / LD_LIBRARY_PATH properly. Note
that Open MPI must be added to your PATH and LD_LIBRARY_PATH in
two situations: (1) when you login to an interactive shell,
and (2) when you login to non-interactive shells on remote nodes.
If (1) is not configured properly, executables like mpicc will
not be found, and it is typically obvious what is wrong. The Open MPI
executable directory can manually be added to the PATH, or the
user's startup files can be modified such that the Open MPI
executables are added to the PATH every login. This latter approach
is preferred.
All shells have some kind of script file that is executed at login
time to set things like PATH and LD_LIBRARY_PATH and perform other
environmental setup tasks. This startup file is the one that needs to
be edited to add Open MPI to the PATH and LD_LIBRARY_PATH. Consult
the manual page for your shell for specific details (some shells are
picky about the permissions of the startup file, for example). The
table below lists some common shells and the startup files that they
read/execute upon login:
sh (Bourne shell, or bash named "sh"): .profile
csh: .cshrc followed by .login
tcsh: .tcshrc if it exists, .cshrc if it does not, followed by .login
bash: .bash_profile if it exists, or .bash_login if it exists, or .profile if it exists (in that order). Note that some Linux distributions automatically come with .bash_profile scripts for users that automatically execute .bashrc as well. Consult the bash man page for more information.
If (2) is not configured properly, executables like mpirun will
not function properly, and it can be somewhat confusing to figure out
(particularly for bash users).
The startup files in question here are the ones that are
automatically executed for a non-interactive login on a remote node
(e.g., "rsh othernode ps"). Note that not all shells support
this, and that some shells use different files for this than listed in
(1). Some shells will supersede (2) with (1). That is, fulfilling
(2) may automatically fulfill (1). The following table lists some
common shells and the startup file that is automatically executed,
either by Open MPI or by the shell itself:
sh (Bourne or bash named "sh"): This shell does not execute any file automatically, so Open MPI will execute the .profile script before invoking Open MPI executables on remote nodes
csh: .cshrc
tcsh: .tcshrc if it exists, or .cshrc if it does not
bash: .bashrc if it exists
134. What if I can't modify my PATH and/or LD_LIBRARY_PATH?
There are some situations where you cannot modify the PATH or
LD_LIBRARY_PATH — e.g., some ISV applications prefer to hide all
parallelism from the user, and therefore do not want to make the user
modify their shell startup files. Another case is where you want a
single user to be able to launch multiple MPI jobs simultaneously,
each with a different MPI implementation. Hence, setting shell
startup files to point to one MPI implementation would be problematic.
In such cases, you have two options:
Use mpirun's --prefix command line option (described
below).
Modify the wrapper compilers to include directives that add
run-time search locations for the Open MPI libraries (see this FAQ entry)
mpirun's --prefix command line option takes as an argument the
top-level directory where Open MPI was installed. While relative
directory names are possible, they can become ambiguous depending on
the job launcher used; using absolute directory names is strongly
recommended.
For example, say that Open MPI was installed into
/opt/openmpi-5.0.6. You would use the --prefix option like
this:
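For example (using the same process count and executable as the equivalent command shown at the end of this answer):
shell$ mpirun --prefix /opt/openmpi-5.0.6 -np 4 a.out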
This will prefix the PATH and LD_LIBRARY_PATH on both the local
and remote hosts with /opt/openmpi-5.0.6/bin and
/opt/openmpi-5.0.6/lib, respectively. This is usually
unnecessary when using resource managers to launch jobs (e.g., Slurm,
Torque, etc.) because they tend to copy the entire local environment
— to include the PATH and LD_LIBRARY_PATH — to remote nodes
before execution. As such, if PATH and LD_LIBRARY_PATH are set
properly on the local node, the resource manager will automatically
propagate those values out to remote nodes. The --prefix option is
therefore usually most useful in rsh or ssh-based environments (or
similar).
Beginning with the 1.2 series, it is possible to make this the default
behavior by passing to configure the flag
--enable-mpirun-prefix-by-default. This will make mpirun behave
exactly the same as "mpirun --prefix $prefix ...", where $prefix is
the value given to --prefix in configure.
Finally, note that specifying the absolute pathname to mpirun is
equivalent to using the --prefix argument. For example, the
following is equivalent to the above command line that uses --prefix:
shell$ /opt/openmpi-5.0.6/bin/mpirun -np 4 a.out
135. How do I launch Open MPI parallel jobs?
Similar to many MPI implementations, Open MPI provides the
commands mpirun and mpiexec to launch MPI jobs. Several of the
questions in this FAQ category deal with using these commands.
Note, however, that these commands are exactly identical.
Specifically, they are symbolic links to a common back-end launcher
command named orterun (Open MPI's run-time environment interaction
layer is named the Open Run-Time Environment, or ORTE — hence
orterun).
As such, the rest of this FAQ usually refers only to mpirun, even
though the same discussions also apply to mpiexec and orterun
(because they are all, in fact, the same command).
136. How do I run a simple SPMD MPI job?
Open MPI provides both mpirun and mpiexec commands. A simple way
to start a single program, multiple data (SPMD) application in
parallel is:
shell$ mpirun -np 4 my_parallel_application
This starts a four-process parallel application, running four copies
of the executable named my_parallel_application.
The rsh starter component accepts the --hostfile (also known as
--machinefile) option to indicate which hosts to start the processes
on:
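For example, a sketch in which my_hostfile lists the two hosts, one per line:
shell$ cat my_hostfile
host01.example.com
host02.example.com
shell$ mpirun --hostfile my_hostfile -np 2 my_parallel_application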
This command will launch one copy of my_parallel_application on each
of host01.example.com and host02.example.com.
More information about the --hostfile option, and hostfiles in
general, is available in this FAQ
entry.
Note, however, that not all environments require a hostfile. For
example, Open MPI will automatically detect when it is running in
batch / scheduled environments (such as SGE, PBS/Torque, Slurm, and
LoadLeveler), and will use host information provided by those systems.
Also note that if using a launcher that requires a hostfile and no
hostfile is specified, all processes are launched on the local host.
137. How do I run an MPMD MPI job?
Both the mpirun and mpiexec commands support multiple
program, multiple data (MPMD) style launches, either from the command
line or from a file. For example:
shell$ mpirun -np 2 a.out : -np 2 b.out
This will launch a single parallel application, but the first two
processes will be instances of the a.out executable, and the second
two processes will be instances of the b.out executable. In MPI
terms, this will be a single MPI_COMM_WORLD, but the a.out
processes will be ranks 0 and 1 in MPI_COMM_WORLD, while the b.out
processes will be ranks 2 and 3 in MPI_COMM_WORLD.
mpirun (and mpiexec) can also accept a parallel application
specified in a file instead of on the command line. For example:
shell$ mpirun --app my_appfile
where the file my_appfile contains the following:
# Comments are supported; comments begin with #
#
# Application context files specify each sub-application in the
# parallel job, one per line.  The first sub-application is the 2
# a.out processes:
-np 2 a.out

# The second sub-application is the 2 b.out processes:
-np 2 b.out
This will result in the same behavior as running a.out and b.out
from the command line.
Note that mpirun and mpiexec are identical in command-line options
and behavior; using the above command lines with mpiexec instead of
mpirun will result in the same behavior.
138. How do I specify the hosts on which my MPI job runs?
There are three general mechanisms:
The --hostfile option to mpirun. Use this option to specify
a list of hosts on which to run. Note that for compatibility with
other MPI implementations, --machinefile is a synonym for
--hostfile. See this FAQ entry for more information about the --hostfile option.
The --host option to mpirun can be used to specify a list of
hosts on which to run on the command line. See this FAQ entry for more information
about the --host option.
If you are running in a scheduled environment (e.g., in a Slurm,
Torque, or LSF job), Open MPI will automatically get the lists of
hosts from the scheduler.
NOTE: The specification
of hosts using any of the above methods has nothing to do with the
network interfaces that are used for MPI traffic. The list of hosts
is only used for specifying which hosts on which to launch
MPI processes.
139. I can run ompi_info and launch MPI jobs on a single host, but not across multiple hosts. Why?
If you can run ompi_info and possibly even launch MPI
processes locally, but fail to launch MPI processes on remote hosts,
it is likely that you do not have your PATH and/or LD_LIBRARY_PATH
setup properly on the remote nodes.
Specifically, the Open MPI commands usually run properly even if
LD_LIBRARY_PATH is not set properly because they encode the
Open MPI library location in their executables and search there by
default. Hence, running ompi_info (and friends) usually works, even
in some improperly setup environments.
However, Open MPI's wrapper compilers do not encode the Open MPI
library locations in MPI executables by default (the wrappers only
specify a bare minimum of flags necessary to create MPI executables;
we consider any flags beyond this bare minimum set a local policy
decision). Hence, attempting to launch MPI executables in
environments where LD_LIBRARY_PATH is either not set or was set
improperly may result in messages about libmpi.so not being found.
Depending on how Open MPI was configured
and/or invoked, it may even be possible to run MPI applications in
environments where PATH and/or LD_LIBRARY_PATH is not set, or is
set improperly. This can be desirable for environments where multiple
MPI implementations are installed, such as multiple versions of Open
MPI.
140. How can I diagnose problems when running across multiple hosts?
In addition to what is mentioned in this
FAQ entry, when you are able to run MPI jobs on a single host, but
fail to run them across multiple hosts, try the following:
Ensure that your launcher is able to launch across multiple
hosts. For example, if you are using ssh, try to ssh to each
remote host and ensure that you are not prompted for a password.
For example:
shell$ ssh remotehost hostname
remotehost
If you are unable to launch across multiple hosts, check that your SSH
keys are setup properly. Or, if you are running in a managed
environment, such as in a Slurm, Torque, or other job launcher, check
that you have reserved enough hosts, are running in an allocated job,
etc.
Ensure that your PATH and LD_LIBRARY_PATH are set correctly on
each remote host on which you are trying to run. For example, with
ssh:
shell$ ssh remotehost env | grep -i path
PATH=...path on the remote host...
LD_LIBRARY_PATH=...LD library path on the remote host...
If your PATH or LD_LIBRARY_PATH are not set properly, see this FAQ entry for the correct
values. Keep in mind that it is fine to have multiple Open MPI
installations installed on a machine; the first Open MPI
installation found by PATH and LD_LIBRARY_PATH is the one that
matters.
Run a simple, non-MPI job across multiple hosts. This verifies
that the Open MPI run-time system is functioning properly across
multiple hosts. For example, try running the hostname command:
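For example (remotehost and otherhost are placeholders; output order may vary):
shell$ mpirun --host remotehost,otherhost hostname
remotehost
otherhost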
If you are unable to run non-MPI jobs across multiple hosts, check
for common problems such as:
Check your non-interactive shell setup on each remote host
to ensure that it is setting up the PATH and LD_LIBRARY_PATH properly.
Check that Open MPI is finding and launching the correct version of Open MPI on the remote hosts.
Ensure that you have firewalling disabled between hosts (Open MPI
opens random TCP and sometimes random UDP ports between hosts in a
single MPI job).
Try running with the plm_base_verbose MCA parameter at level
10, which will enable extra debugging output to see how Open MPI
launches on remote hosts. For example: mpirun --mca plm_base_verbose
10 --host remotehost hostname
Now run a simple MPI job across multiple hosts that does not
involve MPI communications. The "hello_c" program in the examples
directory in the Open MPI distribution is a good choice. This
verifies that the MPI subsystem is able to initialize and terminate
properly. For example:
shell$ mpirun --host remotehost,otherhost hello_c
Hello, world, I am 0 of 2, (Open MPI v5.0.6, package: Open MPI jsquyres@builder.cisco.com Distribution, ident: 5.0.6, DATE)
Hello, world, I am 1 of 2, (Open MPI v5.0.6, package: Open MPI jsquyres@builder.cisco.com Distribution, ident: 5.0.6, DATE)
If you are unable to run simple, non-communication MPI jobs, this can
indicate that your Open MPI installation is unable to initialize
properly on remote hosts. Double check your non-interactive login
setup on remote hosts.
Now run a simple MPI job across multiple hosts that does
some simple MPI communications. The "ring_c" program in the
examples directory in the Open MPI distribution is a good choice.
This verifies that the MPI subsystem is able to pass MPI traffic
across your network. For example:
shell$ mpirun --host remotehost,otherhost ring_c
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
If you are unable to run simple MPI jobs across multiple hosts, this
may indicate a problem with the network(s) that Open MPI is trying to
use for MPI communications. Try limiting the networks that it uses,
and/or exploring levels 1 through 3 MCA parameters for the
communications module that you are using. For example, if you're
using the TCP BTL, see the output of ompi_info --level 3 --param btl
tcp.
141. When I build Open MPI with the Intel compilers, I get warnings
about "orted" or my MPI application not finding libimf.so. What do I do?
The problem is usually because the Intel libraries cannot be
found on the node where Open MPI is attempting to launch an MPI
executable. For example:
shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]
Open MPI first attempts to launch a "helper" daemon
(orted) on node1.example.com, but it failed because one
of orted's dependent libraries was not able to be found. This
particular library, libimf.so, is an Intel compiler library. As
such, it is likely that the user did not setup the Intel compiler
library in their environment properly on this node.
Double check that you have setup the Intel compiler environment on the
target node, for both interactive and non-interactive logins. It is a
common error to ensure that the Intel compiler environment is setup
properly for interactive logins, but not for
non-interactive logins. For example:
head_node$ cd $HOME
head_node$ mpicc mpi_hello.c -o mpi_hello
head_node$ ./mpi_hello
Hello world, I am 0 of 1.
head_node$ ssh node2.example.com
Welcome to node2.
node2$ ./mpi_hello
Hello world, I am 0 of 1.
node2$ exit
head_node$ ssh node2.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
The above example shows that running a trivial C program compiled by
the Intel compilers works fine on both the head node and node2 when
logging in interactively, but fails when run on node2
non-interactively. Check your shell script startup files and verify
that the Intel compiler environment is setup properly for
non-interactive logins.
142. When I build Open MPI with the PGI compilers, I get warnings
about "orted" or my MPI application not finding libpgc.so. What do I do?
The problem is usually because the PGI libraries cannot be
found on the node where Open MPI is attempting to launch an MPI
executable. For example:
shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libpgc.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]
Open MPI first attempts to launch a "helper" daemon
(orted) on node1.example.com, but it failed because one
of orted's dependent libraries was not able to be found. This
particular library, libpgc.so, is a PGI compiler library. As
such, it is likely that the user did not setup the PGI compiler
library in their environment properly on this node.
Double check that you have setup the PGI compiler environment on the
target node, for both interactive and non-interactive logins. It is a
common error to ensure that the PGI compiler environment is setup
properly for interactive logins, but not for
non-interactive logins. For example:
head_node$ cd $HOME
head_node$ mpicc mpi_hello.c -o mpi_hello
head_node$ ./mpi_hello
Hello world, I am 0 of 1.
head_node$ ssh node2.example.com
Welcome to node2.
node2$ ./mpi_hello
Hello world, I am 0 of 1.
node2$ exit
head_node$ ssh node2.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libpgc.so: cannot open shared object file: No such file or directory
The above example shows that running a trivial C program compiled by
the PGI compilers works fine on both the head node and node2 when
logging in interactively, but fails when run on node2
non-interactively. Check your shell script startup files and verify
that the PGI compiler environment is setup properly for
non-interactive logins.
143. When I build Open MPI with the PathScale compilers, I get warnings
about "orted" or my MPI application not finding libmv.so. What do I do?
The problem is usually because the PathScale libraries cannot be
found on the node where Open MPI is attempting to launch an MPI
executable. For example:
shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]
Open MPI first attempts to launch a "helper" daemon
(orted) on node1.example.com, but it failed because one
of orted's dependent libraries was not able to be found. This
particular library, libmv.so, is a PathScale compiler library. As
such, it is likely that the user did not setup the PathScale compiler
library in their environment properly on this node.
Double check that you have setup the PathScale compiler environment on the
target node, for both interactive and non-interactive logins. It is a
common error to ensure that the PathScale compiler environment is setup
properly for interactive logins, but not for
non-interactive logins. For example:
head_node$ cd $HOME
head_node$ mpicc mpi_hello.c -o mpi_hello
head_node$ ./mpi_hello
Hello world, I am 0 of 1.
head_node$ ssh node2.example.com
Welcome to node2.
node2$ ./mpi_hello
Hello world, I am 0 of 1.
node2$ exit
head_node$ ssh node2.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory
The above example shows that running a trivial C program compiled by
the PathScale compilers works fine on both the head node and node2 when
logging in interactively, but fails when run on node2
non-interactively. Check your shell script startup files and verify
that the PathScale compiler environment is setup properly for
non-interactive logins.
144. Can I run non-MPI programs with mpirun / mpiexec?
Yes.
Indeed, Open MPI's mpirun and mpiexec are actually synonyms for
our underlying launcher named orterun (i.e., the Open Run-Time
Environment layer in Open MPI, or ORTE). So you can use mpirun and
mpiexec to launch any application. For example:
shell$ mpirun -np 2 --host a,b uptime
This will launch a copy of the Unix command uptime on the hosts a
and b.
Other questions in the FAQ section deal with the specifics of the
mpirun command line interface; suffice it to say that it works
equally well for MPI and non-MPI applications.
145. Can I run GUI applications with Open MPI?
Yes, but it will depend on your local setup and may require
additional setup.
In short: you will need to have X forwarding enabled from the remote
processes to the display where you want output to appear. In a secure
environment, you can simply allow all X requests to be shown on the
target display and set the DISPLAY environment variable in all MPI
processes' environments to the target display, perhaps something like
this:
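A sketch of that approach (the desktop host name matches the example below; note that xhost + disables X access control entirely):
shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +
access control disabled, clients can connect from any host
shell$ mpirun -np 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out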
However, this technique is not generally suitable for unsecure
environments (because it allows anyone to read and write to your
display). A slightly more secure way is to only allow X connections
from the nodes where your application will be running:
shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +compute1 +compute2 +compute3 +compute4
compute1 being added to access control list
compute2 being added to access control list
compute3 being added to access control list
compute4 being added to access control list
shell$ mpirun -np 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out
(assuming that the four nodes you are running on are compute1
through compute4).
Other methods are available, but they involve sophisticated X
forwarding through mpirun and are generally more complicated than
desirable.
146. Can I run ncurses-based / curses-based / applications with
funky input schemes with Open MPI?
Maybe. But probably not.
Open MPI provides fairly sophisticated stdin / stdout / stderr
forwarding. However, it does not work well with curses, ncurses,
readline, or other sophisticated I/O packages that generally require
direct control of the terminal.
Every application and I/O library is different — you should try to
see if yours is supported. But chances are that it won't work.
Sorry. :-(
147. What other options are available to mpirun?
mpirun supports the "--help" option which provides a usage
message and a summary of the options that it supports. It should be
considered the definitive list of what options are provided.
Several notable options are:
--hostfile: Specify a hostfile for launchers (such as the rsh
launcher) that need to be told on which hosts to start parallel
applications. Note that for compatibility with other MPI
implementations, --machinefile is a synonym for --hostfile.
--wdir <directory>: Set the working directory of the
started applications. If not supplied, the current working directory
is assumed (or $HOME, if the current working directory does not
exist on all nodes).
-x <env-variable-name>: The name of an environment
variable to export to the parallel application. The -x option can
be specified multiple times to export multiple environment
variables to the parallel application.
148. How do I use the --hostfile option to mpirun?
The --hostfile option to mpirun takes a filename that
lists hosts on which to launch MPI processes.
NOTE: The hosts listed in
a hostfile have nothing to do with which network interfaces are used
for MPI communication. They are only used to specify on which hosts
to launch MPI processes.
Hostfiles are simple text files with hosts specified,
one per line. Each host can also specify a default and maximum number
of slots to be used on that host (i.e., the number of available
processors on that host). Comments are also supported, and blank
lines are ignored. For example:
# This is an example hostfile.  Comments begin with #
#
# The following node is a single processor machine:
foo.example.com

# The following node is a dual-processor machine:
bar.example.com slots=2

# The following node is a quad-processor machine, and we absolutely
# want to disallow over-subscribing it:
yow.example.com slots=4 max-slots=4
Exclusionary: If a list of hosts to run on has been provided by
another source (e.g., by a hostfile or a batch scheduler such as
Slurm, PBS/Torque, SGE, etc.), the hosts provided by the hostfile must
be in the already-provided host list. If the hostfile-specified nodes
are not in the already-provided host list, mpirun will abort
without launching anything.
In this case, hostfiles act like an exclusionary filter — they limit
the scope of where processes will be scheduled from the original list
of hosts to produce a final list of hosts.
For example, say that a scheduler job contains hosts node01 through
node04. If you run:
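For instance, a sketch (the hostfile name my_hosts and its contents are illustrative):
shell$ cat my_hosts
node17
shell$ mpirun -np 1 --hostfile my_hosts hostname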
This is an error (because node17 is not part of the host list provided by
the scheduler); mpirun will abort.
Finally, note that in exclusionary mode, processes will only be
executed on the hostfile-specified hosts, even if it causes
oversubscription. For example:
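A sketch (again, the hostfile contents are illustrative):
shell$ cat my_hosts
node03
shell$ mpirun -np 4 --hostfile my_hosts hostname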
This will launch 4 copies of hostname on host node03.
Inclusionary: If a list of hosts has not been provided by
another source, then the hosts provided by the --hostfile option
will be used as the original and final host list.
In this case, --hostfile acts as an inclusionary agent; all
--hostfile-supplied hosts become available for scheduling processes.
For example (assume that you are not in a scheduling environment
where a list of nodes is being transparently supplied):
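A sketch (the hostfile name is illustrative):
shell$ cat my_hostfile
node01.example.com
node02.example.com
node03.example.com
shell$ mpirun -np 3 --hostfile my_hostfile hostname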
This will launch a single copy of hostname on the hosts
node01.example.com, node02.example.com, and node03.example.com.
Note, too, that --hostfile is essentially a per-application switch.
Hence, if you specify multiple applications (as in an MPMD job),
--hostfile can be specified multiple times:
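A sketch (the hostfile names hostfile_a and hostfile_b are illustrative; assume they list node01.example.com and host02.example.com, respectively):
shell$ mpirun -np 1 --hostfile hostfile_a hostname : -np 1 --hostfile hostfile_b uptime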
Notice that hostname was launched on node01.example.com and
uptime was launched on host02.example.com.
149. How do I use the --host option to mpirun?
The --host option to mpirun takes a comma-delimited list
of hosts on which to run. For example:
shell$ mpirun -np 3 --host a,b,c hostname
Will launch one copy of hostname on hosts a, b, and c.
NOTE: The hosts specified
by the --host option have nothing to do with which network
interfaces are used for MPI communication. They are only used to
specify on which hosts to launch MPI processes.
--host works in two different ways:
Exclusionary: If a list of hosts to run on has been provided by
another source (e.g., by a hostfile or a batch scheduler such as
Slurm, PBS/Torque, SGE, etc.), the hosts provided by the --host option
must be in the already-provided host list. If the --host-specified
nodes are not in the already-provided host list, mpirun will abort
without launching anything.
In this case, the --host option acts like an exclusionary filter —
it limits the scope of where processes will be scheduled from the
original list of hosts to produce a final list of hosts.
For example, say that the hostfile my_hosts contains the hosts
node1 through node4. If you run:
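A sketch of such an invocation:
shell$ mpirun -np 1 --hostfile my_hosts --host node17 hostname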
This is an error (because node17 is not listed in my_hosts);
mpirun will abort.
Finally, note that in exclusionary mode, processes will only be
executed on the --host-specified hosts, even if it causes
oversubscription. For example:
shell$ mpirun -np 4 --host a uptime
This will launch 4 copies of uptime on host a.
Inclusionary: If a list of hosts has not been provided by
another source, then the hosts provided by the --host option will be
used as the original and final host list.
In this case, --host acts as an inclusionary agent; all
--host-supplied hosts become available for scheduling processes.
For example (assume that you are not in a scheduling environment
where a list of nodes is being transparently supplied):
shell$ mpirun -np 3 --host a,b,c hostname
This will launch a single copy of hostname on the hosts a, b,
and c.
Note, too, that --host is essentially a per-application switch.
Hence, if you specify multiple applications (as in an MPMD job),
--host can be specified multiple times:
shell$ mpirun -np 1 --host a hostname : -np 1 --host b uptime
This will launch hostname on host a and uptime on host b.
150. How do I control how my processes are scheduled across nodes?
The short version is that if you are not oversubscribing your
nodes (i.e., trying to run more processes than you have told Open MPI
are available on that node), scheduling is pretty simple and occurs
either on a by-slot or by-node round robin schedule. If you're
oversubscribing, the issue gets much more complicated — keep reading.
The more complete answer is: Open MPI schedules processes to nodes by
asking two questions from each application on the mpirun command
line:
How many processes should be launched?
Where should those processes be launched?
The "how many" question is directly answered with the -np switch
to mpirun. The "where" question is a little more complicated, and
depends on three factors:
The final node list (e.g., after --host exclusionary or
inclusionary processing)
The scheduling policy (which applies to all applications in a
single job)
The default and maximum number of slots on each host
As briefly mentioned in this FAQ
entry, slots are Open MPI's representation of how many
processors are available on a given host.
The default number of slots on any machine, if not explicitly
specified, is 1 (e.g., if a host is listed in a hostfile but has no
corresponding "slots" keyword). Schedulers (such as Slurm,
PBS/Torque, SGE, etc.) automatically provide an accurate default slot
count.
Max slot counts, however, are rarely specified by schedulers. The max
slot count for each node will default to "infinite" if it is not
provided (meaning that Open MPI will oversubscribe the node if you ask
it to — see more on oversubscribing in this FAQ entry).
Open MPI currently supports two scheduling policies: by slot and by
node:
By slot: This is the default scheduling policy, but can also be
explicitly requested by using either the --byslot option to mpirun
or by setting the MCA parameter rmaps_base_schedule_policy to the
string "slot".
In this mode, Open MPI will schedule processes on a node until all of
its default slots are exhausted before proceeding to the next node.
In MPI terms, this means that Open MPI tries to maximize the number of
adjacent ranks in MPI_COMM_WORLD on the same host without
oversubscribing that host.
For example:
shell$ cat my-hosts
node0 slots=2 max_slots=20
node1 slots=2 max_slots=20
shell$ mpirun --hostfile my-hosts -np 8 --byslot hello | sort
Hello World I am rank 0 of 8 running on node0
Hello World I am rank 1 of 8 running on node0
Hello World I am rank 2 of 8 running on node1
Hello World I am rank 3 of 8 running on node1
Hello World I am rank 4 of 8 running on node0
Hello World I am rank 5 of 8 running on node0
Hello World I am rank 6 of 8 running on node1
Hello World I am rank 7 of 8 running on node1
By node: This policy can be requested either by using the
--bynode option to mpirun or by setting the MCA parameter
rmaps_base_schedule_policy to the string "node".
In this mode, Open MPI will schedule a single process on each node in
a round-robin fashion (looping back to the beginning of the node list
as necessary) until all processes have been scheduled. Nodes are
skipped once their default slot counts are exhausted.
For example:
shell$ cat my-hosts
node0 slots=2 max_slots=20
node1 slots=2 max_slots=20
shell$ mpirun --hostfile my-hosts -np 8 --bynode hello | sort
Hello World I am rank 0 of 8 running on node0
Hello World I am rank 1 of 8 running on node1
Hello World I am rank 2 of 8 running on node0
Hello World I am rank 3 of 8 running on node1
Hello World I am rank 4 of 8 running on node0
Hello World I am rank 5 of 8 running on node1
Hello World I am rank 6 of 8 running on node0
Hello World I am rank 7 of 8 running on node1
In both policies, if the default slot count is exhausted on all nodes
while there are still processes to be scheduled, Open MPI will loop
through the list of nodes again and try to schedule one more process
to each node until all processes are scheduled. Nodes are skipped in
this process if their maximum slot count is exhausted. If the maximum
slot count is exhausted on all nodes while there are still processes
to be scheduled, Open MPI will abort without launching any processes.
NOTE: This is the scheduling policy in Open MPI because of a long
historical precedent in LAM/MPI. However, the scheduling of processes
to processors is a component in the RMAPS framework in Open MPI; it
can be changed. If you don't like how this scheduling occurs, please
let us know.
151. I'm not using a hostfile. How are slots calculated?
If you are using a supported resource manager, Open MPI will
get the slot information directly from that entity. If you are using
the --host parameter to mpirun, be aware that each instance of a
hostname bumps up the internal slot count by one. For example:
shell$ mpirun --host node0,node0,node0,node0 ....
This tells Open MPI that host "node0" has a slot count of 4. This is
very different than, for example:
shell$ mpirun -np 4 --host node0 a.out
This tells Open MPI that host "node0" has a slot count of 1 but you
are running 4 processes on it. Specifically, Open MPI assumes that
you are oversubscribing the node.
152. Can I run multiple parallel processes on a uniprocessor machine?
Yes.
But be very careful to ensure that Open MPI
knows that you are oversubscribing your node! If Open
MPI is unaware that you are oversubscribing a node, severe performance degradation can result.
See this FAQ entry for more details
on oversubscription.
153. Can I oversubscribe nodes (run more processes than processors)?
Yes.
However, it is critical that Open MPI knows that you are
oversubscribing the node, or severe performance degradation can result.
The short explanation is as follows: never
specify a number of slots that is more than the available number of
processors. For example, if you want to run 4
processes on a uniprocessor, then indicate that you only have 1 slot
but want to run 4 processes. For example:
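A sketch of what this could look like (the hostfile name and the use of localhost are illustrative):
shell$ cat my_hostfile
localhost slots=1
shell$ mpirun -np 4 --hostfile my_hostfile a.out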
Specifically: do NOT have a
hostfile that contains "slots = 4" (because there is only one
available processor).
Here's the full explanation:
Open MPI basically runs its message passing progression engine in two
modes: aggressive and degraded.
Degraded: When Open MPI thinks that it is in an oversubscribed
mode (i.e., more processes are running than there are processors
available), MPI processes will automatically run in degraded mode
and frequently yield the processor to its peers, thereby allowing all
processes to make progress (be sure to see this
FAQ entry that describes how degraded mode affects processor and
memory affinity).
Aggressive: When Open MPI thinks that it is in an exactly- or
under-subscribed mode (i.e., the number of running processes is equal
to or less than the number of available processors), MPI processes
will automatically run in aggressive mode, meaning that they will
never voluntarily give up the processor to other processes. With some
network transports, this means that Open MPI will spin in tight loops
attempting to make message passing progress, effectively causing other
processes to not get any CPU cycles (and therefore never make any
progress).
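For contrast, here is a sketch of the kind of setup to avoid on a uniprocessor (hostfile name and contents are illustrative):
shell$ cat my_hostfile
localhost slots=4
shell$ mpirun -np 4 --hostfile my_hostfile a.out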
This would cause all 4 MPI processes to run in aggressive mode
because Open MPI thinks that there are 4 available processors
to use. This is actually a lie (there is only 1 processor — not 4),
and can cause extremely bad performance.
154. Can I force Aggressive or Degraded performance modes?
Yes.
The MCA parameter mpi_yield_when_idle controls whether an MPI
process runs in Aggressive or Degraded performance mode. Setting it
to zero forces Aggressive mode; any other value forces Degraded mode
(see this FAQ
entry to see how to set MCA parameters).
Note that this value only affects the behavior of MPI processes when
they are blocking in MPI library calls. It does not affect behavior
of non-MPI processes, nor does it affect the behavior of a process
that is not inside an MPI library call.
Open MPI normally sets this parameter automatically (see this FAQ entry for details). Users are
cautioned against setting this parameter unless you are really,
absolutely, positively sure of what you are doing.
155. How do I run with the TotalView parallel debugger?
Generally, you can run Open MPI processes with TotalView as
follows:
shell$ mpirun --debug ...mpirun arguments...
Assuming that TotalView is the first supported parallel debugger in
your path, Open MPI will automatically invoke the correct underlying
command to run your MPI process in the TotalView debugger. Be sure to
see this
FAQ entry for details about what versions of Open MPI and
TotalView are compatible.
For reference, this underlying command form is the following:
shell$ totalview mpirun -a ...mpirun arguments...
So if you wanted to run a 4-process MPI job of your a.out
executable, it would look like this:
shell$ totalview mpirun -a -np 4 a.out
Alternatively, Open MPI's mpirun offers the "-tv" convenience
option which does the same thing as TotalView's "-a" syntax. For
example:
shell$ mpirun -tv -np 4 a.out
Note that by default, TotalView will stop deep in the machine code of
mpirun itself, which is not what most users want. It is possible
to get TotalView to recognize that mpirun is simply a "starter"
program and should be (effectively) ignored. Specifically, TotalView
can be configured to skip mpirun (and mpiexec and orterun) and
jump right into your MPI application. This can be accomplished by
placing some startup instructions in a TotalView-specific file named
$HOME/.tvdrc.
Open MPI includes a sample TotalView startup file that performs this
function (see etc/openmpi-totalview.tcl in Open MPI distribution
tarballs; it is also installed, by default, to
$prefix/etc/openmpi-totalview.tcl in the Open MPI installation).
This file can be either copied to $HOME/.tvdrc or sourced from the
$HOME/.tvdrc file. For example, placing the following line in your
$HOME/.tvdrc (replacing /path/to/openmpi/installation with the
proper directory name, of course) will use the Open MPI-provided
startup file:
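The line in question is presumably a Tcl "source" command along these lines:
source /path/to/openmpi/installation/etc/openmpi-totalview.tcl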
156. How do I run with the DDT parallel debugger?
As of August 2015, DDT has built-in startup for MPI
applications within its Allinea Forge GUI product. You can simply use
the built-in support to launch, monitor, and kill MPI jobs.
If you are using an older version of DDT that does not have this
built-in support, keep reading.
If you've used DDT at least once before (to use the
configuration wizard to setup support for Open MPI), you can start it
on the command line with:
shell$ mpirun --debug ...mpirun arguments...
Assuming that you are using Open MPI v1.2.4 or later, and assuming
that DDT is the first supported parallel debugger in your path, Open
MPI will automatically invoke the correct underlying command to run
your MPI process in the DDT debugger. For reference (or if you are
using an earlier version of Open MPI), this underlying command form is
the following:
shell$ ddt -n {nprocs} -start {exe-name}
Note that passing arbitrary arguments to Open MPI's mpirun is not
supported with the DDT debugger.
You can also attach to already-running processes with either of the
following two syntaxes:
DDT can even be configured to operate with cluster/resource schedulers
such that it can run on a local workstation, submit your MPI job via
the scheduler, and then attach to the MPI job when it starts.
See the official DDT documentation for more details.
157. What launchers are available?
The documentation contained in the Open MPI tarball will have
the most up-to-date information, but as of v1.0, Open MPI supports:
BProc versions 3 and 4 (discontinued starting with OMPI v1.3)
Sun Grid Engine (SGE), and the open source Grid Engine (support first introduced in Open MPI v1.2)
PBS Pro, Torque, and Open PBS
LoadLeveler scheduler (full support since 1.1.1)
rsh / ssh
Slurm
LSF
XGrid (discontinued starting with OMPI 1.4)
Yod (Cray XT-3 and XT-4)
158. How do I specify to the rsh launcher to use rsh or ssh?
159. How do I run with the Slurm and PBS/Torque launchers?
If support for these systems is included in your Open MPI
installation (which you can check with the ompi_info command — look
for components named "slurm" and/or "tm"), Open MPI will
automatically detect when it is running inside such jobs and will just
"do the Right Thing."
See this FAQ entry for
a description of how to run jobs in Slurm; see this FAQ entry for a description
of how to run jobs in PBS/Torque.
If support for LoadLeveler is included in your Open MPI
installation (which you can check with the ompi_info command — look
for components named "loadleveler"), Open MPI will
automatically detect when it is running inside such jobs and will just
"do the Right Thing."
Specifically, if you execute an mpirun command in a LoadLeveler job,
it will automatically determine what nodes and how many slots on each
node have been allocated to the current job. There is no need to
specify what nodes to run on. Open MPI will then attempt to launch the
job using whatever resource is available (on Linux rsh/ssh is used).
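For example, a sketch (assuming LoadLeveler has allocated 3 nodes with 4 slots each; a.out stands in for your MPI application):
shell$ mpirun -np 12 a.out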
This will run 4 MPI processes per node on the 3 nodes which were allocated by
LoadLeveler for this job.
For users of the Open MPI 1.1
series: version 1.1.0 has a problem that prevents Open MPI
from determining what nodes are available to it if the job has more
than 128 tasks. In the 1.1.x series, starting with version 1.1.1,
this can be worked around by
passing "-mca ras_loadleveler_priority 110" to mpirun. Version 1.2
and above work without any additional flags.
162. How do I load libmpi at runtime?
If you want to load the shared library libmpi explicitly
at runtime, either by using dlopen() from C/C++ or something like
the ctypes package from Python, some extra care is required. The
default configuration of Open MPI uses dlopen() internally to load
its support components. These components rely on symbols available in
libmpi. In order to make the symbols in libmpi available to the
components loaded by Open MPI at runtime, libmpi must be loaded with
the RTLD_GLOBAL option.
In C/C++, this option is specified as the second parameter to the
POSIX dlopen(3) function.
When using ctypes with Python, this can be done with the second
(optional) parameter to CDLL(). For example (shown below in Mac OS
X, where Open MPI's shared library name ends in ".dylib"; other
operating systems use other suffixes, such as ".so"):
from ctypes import *
mpi = CDLL('libmpi.0.dylib', RTLD_GLOBAL)
f = pythonapi.Py_GetArgcArgv
argc = c_int()
argv = POINTER(c_char_p)()
f(byref(argc), byref(argv))
mpi.MPI_Init(byref(argc), byref(argv))
# Your MPI program here
mpi.MPI_Finalize()
Other scripting languages should have similar options when dynamically
loading shared libraries.
163. What MPI environmental variables exist?
Beginning with the v1.3 release, Open MPI provides the following
environmental variables that will be defined on every
MPI process:
OMPI_COMM_WORLD_SIZE - the number of processes in this process's
MPI_COMM_WORLD
OMPI_COMM_WORLD_RANK - the MPI rank of this process in
MPI_COMM_WORLD
OMPI_COMM_WORLD_LOCAL_RANK - the relative rank of this process
on this node within its job. For example, if four processes in a job
share a node, they will each be given a local rank ranging from 0 to
3.
OMPI_UNIVERSE_SIZE - the number of process slots allocated to
this job. Note that this may be different than the number of processes
in the job.
OMPI_COMM_WORLD_LOCAL_SIZE - the number of ranks from this job
that are running on this node.
OMPI_COMM_WORLD_NODE_RANK - the relative rank of this process on
this node looking across ALL jobs.
Open MPI guarantees that these variables will remain stable throughout
future releases.
164. How do I get my MPI job to wireup its MPI connections right away?
By default, Open MPI opens MPI connections between processes
in a "lazy" fashion - i.e., the connections are only opened when the
MPI process actually attempts to send a message to another process for
the first time. This is done since (a) Open MPI has no idea what
connections an application process will really use, and (b) creating
the connections takes time. Once the connection is established, it
remains "connected" until one of the two connected processes
terminates, so the creation time cost is paid only once.
Applications that require a fully connected topology, however, can see
improved startup time if they automatically "pre-connect" all their
processes during MPI_Init. Accordingly, Open MPI provides the MCA
parameter "mpi_preconnect_mpi" which directs Open MPI to establish a
"mostly" connected topology during MPI_Init (note that this MCA
parameter used to be named "mpi_preconnect_all" prior to Open MPI
v1.5; in v1.5, it was deprecated and replaced with
"mpi_preconnect_mpi"). This is accomplished in a somewhat scalable
fashion to help minimize startup time.
Users can set this parameter in two ways:
in the environment as OMPI_MCA_mpi_preconnect_mpi=1
on the command line as mpirun -mca mpi_preconnect_mpi 1
See this FAQ entry
for more details on how to set MCA parameters.
165. What kind of CUDA support exists in Open MPI?
166. What are the Libfabric (OFI) components in Open MPI?
Open MPI has two main components for Libfabric (a.k.a., OFI) communications:
ofi MTL: Available since Open MPI v1.10, this component is used with the
cm PML and is used for two-sided MPI communication (e.g., MPI_SEND and MPI_RECV).
The ofi MTL requires that the Libfabric provider support reliable datagrams with
ordered tagged messaging (specifically: FI_EP_RDM endpoints, FI_TAGGED
capabilities, and FI_ORDER_SAS ordering).
ofi BTL: Available since Open MPI v4.0.0, this component is used for
one-sided MPI communications (e.g., MPI_PUT). The ofi BTL requires that
the Libfabric provider support reliable datagrams, RMA and atomic operations,
and remote atomic completion notifications (specifically: FI_EP_RDM endpoints,
FI_RMA and FI_ATOMIC capabilities, and FI_DELIVERY_COMPLETE op flags).
See each Libfabric provider man page (e.g., fi_sockets(7)) to understand which
provider will work for each of the above-listed Open MPI components. Some
providers may need to be used with one of the Libfabric utility providers;
for example, the verbs provider needs to be paired with the utility provider
ofi_rxm to provide reliable datagram endpoint support (verbs;ofi_rxm).
Both components have MCA parameters to specify the Libfabric provider(s) that
will be included/excluded in the selection process. For example:
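A sketch of what this could look like (the parameter name mtl_ofi_provider_include and the psm2 provider are assumptions for illustration; check ompi_info --all for the exact parameter names in your installation):
shell$ mpirun --mca mtl ofi --mca mtl_ofi_provider_include psm2 -np 2 ./a.out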
167. How can Open MPI communicate with Intel Omni-Path Architecture (OPA)
based devices?
Currently, Open MPI supports the PSM2 MTL and OFI MTL (using the PSM2 OFI
provider) components, which can be used to communicate with the
Intel Omni-Path (OPA) software stack.
For guidelines on tuning run-time characteristics when using OPA devices, please
refer to this FAQ entry.
168. Open MPI tells me that it fails to load components with a "file not found" error — but the file is there! Why does it say this?
Open MPI loads a lot of plugins at run time. It opens its
plugins via the excellent GNU Libtool libltdl
portability library. If a plugin fails to load, Open MPI queries
libltdl to get a printable string indicating why the plugin failed
to load.
Unfortunately, there is a well-known bug in libltdl that may cause a
"file not found" error message to be displayed, even when the file
is found. The "file not found" error usually masks the real,
underlying cause of the problem. For example:
mca: base: component_find: unable to open /opt/openmpi/mca_ras_dash_host: file not found (ignored)
Note that Open MPI put in a libltdl workaround starting with version
1.5. This workaround should print the real reason the plugin failed
to load instead of the erroneous "file not found" message.
There are two common underlying causes why a plugin fails to load:
The plugin is for a different version of Open MPI. This FAQ entry has more information
about this case.
The plugin cannot find shared libraries that it requires. For
example, if the openib plugin fails to load, ensure that
libibverbs.so can be found by the linker at run time (e.g., check
the value of your LD_LIBRARY_PATH environment variable). The same is
true for any other plugins that have shared library dependencies (e.g.,
the mx BTL and MTL plugins need to be able to find
the libmyriexpress.so shared library at run time).
169. I see strange messages about missing symbols in my application; what do these mean?
Open MPI loads a lot of plugins at run time. It opens its
plugins via the excellent GNU Libtool libltdl
portability library. Sometimes a plugin can fail to load because it
can't resolve all the symbols that it needs. There are a few reasons
why this can happen.
The plugin is for a different version of Open MPI. See this FAQ entry
for an explanation of how Open MPI might try to open the "wrong"
plugins.
An application is trying to manually dynamically open libmpi in
a private symbol space. For example, if an application is not linked
against libmpi, but rather calls something like this:
/* This is a Linux example — the issue is similar/the same on other operating systems */
handle = dlopen("libmpi.so", RTLD_NOW | RTLD_LOCAL);
The dynamic library libmpi is opened in a "local" symbol
space.
MPI_INIT is invoked, which tries to open Open MPI's plugins.
Open MPI's plugins rely on symbols in libmpi (and other Open
MPI support libraries); these symbols must be resolved when the plugin
is loaded.
However, since libmpi was opened in a "local" symbol space,
its symbols are not available to the plugins that it opens.
Hence, the plugin fails to load because it can't resolve all of
its symbols, and displays a warning message to that effect.
The ultimate fix for this issue is a bit bigger than Open MPI,
unfortunately — it's a POSIX issue (as briefly described in the devel
posting, above).
However, there are several common workarounds:
Dynamically open libmpi in a public / global symbol scope —
not a private / local scope. This will enable libmpi's symbols to
be available for resolution when Open MPI dynamically opens its
plugins.
If libmpi is opened as part of some underlying framework where
it is not possible to change the private / local scope to a public /
global scope, then dynamically open libmpi in a public / global
scope before invoking the underlying framework. This sounds a little
gross (and it is), but at least the run-time linker is smart enough to
not load libmpi twice — but it does keep libmpi in a public
scope.
Use the --disable-dlopen or
--disable-mca-dso options to Open MPI's configure script (see this FAQ entry for more
details on these options). These options slurp all of Open MPI's
plugins up in to libmpi — meaning that the plugins physically
reside in libmpi and will not be dynamically opened at run
time.
Build Open MPI as a static library by configuring Open MPI with
--disable-shared and --enable-static. This has the same effect as
--disable-dlopen, but it also makes libmpi.a (as opposed to a
shared library).
170. What is mca_pml_teg.so? Why am I getting warnings about not finding the mca_ptl_base_modules_initialized symbol from it?
You may wonder why you see this warning message (put here
verbatim so that it becomes web-searchable):
This happens when you upgrade to Open MPI v1.1 (or later) over an old
installation of Open MPI v1.0.x without previously uninstalling
v1.0.x. There are fairly uninteresting reasons why this problem
occurs; the simplest, safest solution is to uninstall version 1.0.x
and then re-install your newer version. For example:
shell# cd /path/to/openmpi-1.0
shell# make uninstall
[... lots of output ...]
shell# cd /path/to/openmpi-1.1
shell# make install
The above example shows changing into the Open MPI 1.1 directory to
re-install, but the same concept applies to any version after Open MPI
version 1.0.x.
Note that this problem is fairly specific to installing / upgrading
Open MPI from the source tarball. Pre-packaged installers (e.g., RPM)
typically do not incur this problem.
171. Can I build shared libraries on AIX with the IBM XL compilers?
Short answer: in older versions of Open MPI, maybe.
Add "LDFLAGS=-Wl,-brtl" to your configure command line:
shell$ ./configure LDFLAGS=-Wl,-brtl ...
This enables "runtimelinking", which will make GNU Libtool name the
libraries properly (i.e., *.so). More importantly, runtimelinking
will cause the runtime linker to behave more or less like an ELF
linker would (with respect to symbol resolution).
Future versions of OMPI may not require this flag (and "runtimelinking"
on AIX).
NOTE: As of OMPI v1.2, AIX is
no longer supported.
172. Why am I getting a seg fault in libopen-pal (or libopal)?
It is likely that you did not get a segv in libopen-pal (or
"libopal", in older versions of Open MPI); it is likely that you are
seeing a message like this:
This is actually the function that is printing out the stack trace
message; it is not the function that caused the segv itself. The
function that caused the problem will be a few below this. Future
versions of OMPI will simply not display this libopen-pal function in the
segv reporting to avoid confusion.
Let's provide a concrete example. Take the following trivial MPI
program that is guaranteed to cause a seg fault in MPI_COMM_WORLD rank
1:
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {
        char *d = 0;
        /* This will cause a seg fault */
        *d = 3;
    }
    MPI_Finalize();
    return 0;
}
Running this code, you'll see something similar to the following:
shell$ mpicc segv.c -o segv -g
shell$ mpirun -np 2 --mca btl tcp,self segv
Signal: 11 info.si_errno: 0 (Success) si_code: 1 (SEGV_MAPERR)
Failing at addr: (nil)
[0] func:/opt/ompi/lib/libopen-pal.so.0(opal_backtrace_print+0x2b) [0x2a958de8a7]
[1] func:/opt/ompi/lib/libopen-pal.so.0 [0x2a958dd2b7]
[2] func:/lib64/tls/libpthread.so.0 [0x3be410c320]
[3] func:segv(main+0x3c) [0x400894]
[4] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3be361c4bb]
[5] func:segv [0x4007ca]
*** End of error message ***
The real error was back up in main, which is #3 on the stack trace.
But Open MPI's stack-tracing function (opal_backtrace_print, in this
case) is what is displayed as #0, so it's an easy mistake to assume
that libopen-pal is the culprit.
173. Why am I getting seg faults / MPI parameter errors when compiling C++ applications with the Intel 9.1 C++ compiler?
Early versions of the Intel 9.1 C++ compiler series had
problems with the Open MPI C++ bindings. Even trivial MPI
applications that used the C++ MPI bindings could incur process
failures (such as segmentation violations) or generate MPI-level
errors complaining about invalid parameters.
Intel released a new version of their 9.1 series C++ compiler on
October 5, 2006 (build 44) that seems to solve all of these issues.
The Open MPI team recommends that all users needing the C++ MPI API
upgrade to this version (or later) if possible. Since the problems
are with the compiler, there is little that Open MPI can do to work
around the issue; upgrading the compiler seems to be the only
solution.
174. All my MPI applications segv! Why? (Intel Linux 12.1 compiler)
Users have reported on the Open MPI users mailing list
multiple times that when they compile Open MPI with the Intel 12.1
compiler suite, Open MPI tools (such as the wrapper compilers,
including mpicc) and MPI applications will seg fault immediately.
As far as we know, this affects both Open MPI v1.4.4 (and later) and
v1.5.4 (and later).
The cause of the problem has turned out to be a bug in early versions
of the Intel Linux 12.1 compiler series itself. *If you upgrade your
Intel compiler to the latest version of the Intel 12.1 compiler suite
and rebuild Open MPI, the problem will go away.*
175. Why can't I attach my parallel debugger (TotalView, DDT, fx2,
etc.) to parallel jobs?
As noted in this FAQ
entry, Open MPI supports parallel debuggers that utilize the
TotalView API for parallel process attaching. However, it can
sometimes fail if Open MPI is not installed correctly. Symptoms of
this failure typically involve having the debugger hang (or crash)
when attempting to attach to a parallel MPI application.
Parallel debuggers may rely on having Open MPI's mpirun program
being compiled without optimization. Open MPI's configure and build
process therefore attempts to identify optimization flags and remove
them when compiling mpirun, but it does not have knowledge of all
optimization flags for all compilers. Hence, if you specify some
esoteric optimization flags to Open MPI's configure script, some
optimization flags may slip through the process and create an mpirun
that cannot be read by TotalView and other parallel debuggers.
If you run into this problem, you can manually build mpirun without
optimization flags. Go into the tree where you built Open MPI:
shell$ cd /path/to/openmpi/build/tree
shell$ cd orte/tools/orterun
shell$ make clean
[...output not shown...]
shell$ make all CFLAGS=-g
[...output not shown...]
shell$
This will build mpirun (also known as orterun) with just the "-g"
flag. Once this completes, run make install, also from within the
orte/tools/orterun directory, and possibly as root depending on
where you installed Open MPI. Using this new orterun (mpirun),
your parallel debugger should be able to attach to MPI jobs.
Additionally, a user reported to us that setting some TotalView flags
may be helpful with attaching. The user specifically cited the Open
MPI v1.3 series compiled with the Intel 11 compilers and TotalView
8.6, but it may also be helpful for other versions, too:
176. When launching large MPI jobs, I see messages like: mca_oob_tcp_peer_complete_connect: connection failed: Connection timed out (110) - retrying
This is a known issue in the Open MPI v1.2 series. Try the
following:
If you are using Linux-based systems, increase some of the limits
on the node where mpirun is invoked (you must have
administrator/root privileges to increase these limits):
# The default is 128; increase it to 10,000
shell# echo 10000 > /proc/sys/net/core/somaxconn

# The default is 1,000; increase it to 100,000
shell# echo 100000 > /proc/sys/net/core/netdev_max_backlog
Set the oob_tcp_listen_mode MCA parameter to the string value
listen_thread. This enables Open MPI's mpirun to respond much
more quickly to incoming TCP connections during job launch, for
example:
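A sketch (the process count and application name are illustrative):
shell$ mpirun --mca oob_tcp_listen_mode listen_thread -np 1024 a.out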
See this FAQ entry
for more details on how to set MCA parameters.
177. How do I find out what MCA parameters are being seen/used by my job?
As described elsewhere, MCA parameters are the "life's blood" of
Open MPI. MCA parameters are used to control both detailed and large-scale
behavior of Open MPI and are present throughout the code base.
This raises an important question: since MCA parameters can be set from a
file, the environment, the command line, and even internally within Open MPI,
how do I actually know what MCA params my job is seeing, and their value?
One way, of course, is to use the ompi_info command, which is documented
elsewhere (you can use "man ompi_info", or "ompi_info --help" to get more info
on this command). However, this still doesn't fully answer the question since
ompi_info isn't an MPI process.
To help relieve this problem, Open MPI (starting with the 1.3 release)
provides the MCA parameter mpi_show_mca_params that directs the rank=0 MPI process to report the
name of MCA parameters, their current value as seen by that process, and
the source that set that value. The parameter can take several values that define
which MCA parameters to report:
all: report all MCA params. Note that this typically generates a rather long
list of parameters since it includes all of the default parameters defined inside
Open MPI
default: MCA params that are at their default settings - i.e., all
MCA params that are at the values set as default within Open MPI
file: MCA params that had their value set by a file
api: MCA params set using Open MPI's internal APIs, perhaps to override an incompatible
set of conditions specified by the user
enviro: MCA params that obtained their value either from the local environment
or the command line. Open MPI treats environmental and command line parameters as
equivalent, so there currently is no way to separate these two sources
These options can be combined in any order by separating them with commas.
Here is an example of the output generated by this parameter:
$ mpirun -mca grpcomm basic -mca mpi_show_mca_params enviro ./hello
ess=env (environment or cmdline)
orte_ess_jobid=1016725505 (environment or cmdline)
orte_ess_vpid=0 (environment or cmdline)
grpcomm=basic (environment or cmdline)
mpi_yield_when_idle=0 (environment or cmdline)
mpi_show_mca_params=enviro (environment or cmdline)
Hello, World, I am 0 of 1
Note that several MCA parameters set by Open MPI itself for internal uses are displayed in addition to the
ones actually set by the user.
Since the output from this option can be long, and since it can be helpful to have a more
permanent record of the MCA parameters used for a job, a companion MCA parameter
mpi_show_mca_params_file is provided. If mpi_show_mca_params is also set, the output listing of MCA parameters
will be directed into the specified file instead of being printed to stdout.
178. How do I debug Open MPI processes in parallel?
This is a difficult question. Debugging in serial can be
tricky: errors, uninitialized variables, stack smashing, etc.
Debugging in parallel adds multiple different dimensions to this
problem: a greater propensity for race conditions, asynchronous
events, and the general difficulty of trying to understand N processes
simultaneously executing — the problem becomes quite formidable.
This FAQ section does not provide any definite solutions to
debugging in parallel. At best, it shows some general techniques and
a few specific examples that may be helpful to your situation.
But there are various controls within Open MPI that can help with
debugging. These are probably the most valuable entries in this FAQ
section.
179. What tools are available for debugging in parallel?
There are two main categories of tools that can aid in
parallel debugging:
Debuggers: Both serial and parallel debuggers are useful.
Serial debuggers are what most programmers are used to (e.g., gdb),
while parallel debuggers can attach to all the individual processes in
an MPI job simultaneously, treating the MPI application as a single
entity. This can be an extremely powerful abstraction, allowing the
user to control every aspect of the MPI job, manually replicate race
conditions, etc.
Profilers: Tools that analyze your usage of MPI and display
statistics and meta information about your application's run. Some
tools present the information "live" (as it occurs), while others
collect the information and display it in a post mortem analysis.
Both freeware and commercial solutions are available for each kind of
tool.
181. What controls does Open MPI have that aid in debugging?
Open MPI has a series of MCA parameters for the MPI layer
itself that are designed to help with debugging. These parameters can
be set in the
usual ways. MPI-level MCA parameters can be displayed by invoking
the following command:
# Starting with Open MPI v1.7, you must use "--level 9" to see
# all the MCA parameters (the default is "--level 1"):
shell$ ompi_info --param mpi all --level 9

# Before Open MPI v1.7:
shell$ ompi_info --param mpi all
Here is a summary of the debugging parameters for the MPI layer:
mpi_param_check: If set to true (any positive value), and when
Open MPI is compiled with parameter checking enabled (the default),
the parameters to each MPI function can be passed through a series of
correctness checks. Problems such as passing illegal values (e.g.,
NULL or MPI_DATATYPE_NULL or other "bad" values) will be discovered
at run time and an MPI exception will be invoked (the default of which
is to print a short message and abort the entire MPI job). If set to
0, these checks are disabled, slightly increasing performance.
mpi_show_handle_leaks: If set to true (any positive value),
OMPI will display lists of any MPI handles that were not freed before
MPI_FINALIZE (e.g., communicators, datatypes, requests, etc.).
mpi_no_free_handles: If set to true (any positive value), do
not actually free MPI objects when their corresponding MPI "free"
function is invoked (e.g., do not free communicators when MPI_COMM_FREE is
invoked). This can be helpful in tracking down applications that
accidentally continue to use MPI handles after they have been
freed.
mpi_show_mca_params: If set to true (any positive value), show
a list of all MCA parameters and their values during MPI_INIT. This
can be quite helpful for reproducibility of MPI applications.
mpi_show_mca_params_file: If set to a non-empty value, and if
the value of mpi_show_mca_params is true, then output the list of
MCA parameters to the filename value. If this parameter is an empty
value, the list is sent to stderr.
mpi_keep_peer_hostnames: If set to a true value (any positive
value), send the list of all hostnames involved in the MPI job to
every process in the job. This can help the specificity of error
messages that Open MPI emits if a problem occurs (i.e., Open MPI can
display the name of the peer host that it was trying to communicate
with), but it can somewhat slow down the startup of large-scale
MPI jobs.
mpi_abort_delay: If nonzero, print out an identifying message
when MPI_ABORT is invoked showing the hostname and PID of the process
that invoked MPI_ABORT, and then delay that many seconds before
exiting. A negative value means to delay indefinitely. This allows a
user to manually come in and attach a debugger when an error occurs.
Remember that the default MPI error handler — MPI_ERRORS_ARE_FATAL —
effectively invokes MPI_ABORT, so this parameter can be useful to discover
problems identified by mpi_param_check.
mpi_abort_print_stack: If nonzero, print out a stack trace (on
supported systems) when MPI_ABORT is invoked.
mpi_ddt_<foo>_debug, where <foo> can be one of
pack, unpack, position, or copy: These are internal debugging
features that are not intended for end users (but ompi_info will
report that they exist).
182. Do I need to build Open MPI with compiler/linker debugging
flags (such as -g) to be able to debug MPI applications?
No.
If you build Open MPI without compiler/linker debugging flags (such as
-g), you will not be able to step inside MPI functions
when you debug your MPI applications. However, this is likely what
you want — the internals of Open MPI are quite complex and you
probably don't want to start poking around in there.
You'll need to compile your own applications with -g (or whatever
your compiler's equivalent is), but unless you have a need/desire to
be able to step into MPI functions to see the internals of Open MPI,
you do not need to build Open MPI with -g.
183. Can I use serial debuggers (such as gdb) to debug MPI
applications?
Yes; the Open MPI developers do this all the time.
There are two common ways to use serial debuggers:
Attach to individual MPI processes after they are running.
For example, launch your MPI application as normal with mpirun.
Then login to the node(s) where your application is running and use
the --pid option to gdb to attach to your application.
An inelegant-but-functional technique commonly used with this method
is to insert the following code in your application where you want to
attach:
{
    volatile int i = 0;
    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    printf("PID %d on %s ready for attach\n", getpid(), hostname);
    fflush(stdout);
    while (0 == i)
        sleep(5);
}
This code will output a line to stdout outputting the name of the host
where the process is running and the PID to attach to. It will then
spin on the sleep() function forever waiting for you to attach with
a debugger. Using sleep() as the inside of the loop means that the
processor won't be pegged at 100% while waiting for you to attach.
Once you attach with a debugger, go up the function stack until you
are in this block of code (you'll likely attach during the sleep())
then set the variable i to a nonzero value. With GDB, the syntax
is:
(gdb) set var i = 7
Then set a breakpoint after your block of code and continue execution
until the breakpoint is hit. Now you have control of your live MPI
application and use of the full functionality of the debugger.
You can even add conditionals to only allow this "pause" in the
application for specific MPI processes (e.g., MPI_COMM_WORLD rank 0,
or whatever process is misbehaving).
Use mpirun to launch separate instances
of serial debuggers.
This technique launches a separate window for each MPI process in
MPI_COMM_WORLD, each one running a serial debugger (such as gdb)
that will launch and run your MPI application. Having a separate
window for each MPI process can be quite handy for low process-count
MPI jobs, but requires a bit of setup and configuration that is
outside of Open MPI to work properly. A naive approach would be to
assume that the following would immediately work:
shell$ mpirun -np 4 xterm -e gdb my_mpi_application
If running on a personal computer, this will probably work.
You can also use tmpi to launch the debuggers in separate tmux
panes instead of separate xterm windows, which has the advantage of
synchronizing keyboard input between all debugger instances.
Unfortunately, the tmpi or xterm approaches likely won't work
on a computing cluster. Several factors must be considered:
What launcher is Open MPI using? In an rsh/ssh environment, Open
MPI will default to using ssh when it is available, falling back to
rsh when ssh cannot be found in the $PATH. But note that Open
MPI closes the ssh (or rsh) sessions when the MPI job starts for
scalability reasons. This means that the built-in SSH X forwarding
tunnels will be shut down before the xterms can be launched.
Although it is possible to force Open MPI to keep its SSH connections
active (to keep the X tunneling available), we recommend using
non-SSH-tunneled X connections, if possible (see below).
In non-rsh/ssh environments (such as when using resource
managers), the environment of the process invoking mpirun may be
copied to all nodes. In this case, the DISPLAY environment variable
may not be suitable.
Some operating systems default to disabling the X11 server from
listening for remote/network traffic. For example, see this
post on the user's mailing list, describing how to enable network
access to the X11 server on Fedora Linux.
There may be intermediate firewalls or other network blocks that
prevent X traffic from flowing between the hosts where the MPI
processes (and xterms) are running and the host connected
to the output display.
The easiest way to get remote X applications (such as
xterm) to display on your local screen is to forego the
security of SSH-tunneled X forwarding. In a closed environment such
as an HPC cluster, this may be an acceptable practice (indeed, you may
not even have the option of using SSH X forwarding if SSH logins
to cluster nodes are disabled), but check with your security
administrator to be sure.
If using non-encrypted X11 forwarding is permissible, we recommend the
following:
For each non-local host where you will be running an MPI process,
add it to your X server's permission list with the xhost command.
For example:
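A sketch, assuming the MPI processes will run on hosts node1 through node4 (adjust the hostnames to your own cluster):
shell$ xhost +node1 +node2 +node3 +node4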
Use the -x option to mpirun to export an appropriate DISPLAY
variable so that the launched X applications know where to send their
output. An appropriate value is usually (but not always) the
hostname containing the display where you want the output and the :0
(or :0.0) suffix. For example:
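A sketch, assuming your display is on my_workstation.example.com (reusing the xterm/gdb command shown earlier):
shell$ mpirun -np 4 -x DISPLAY=my_workstation.example.com:0 xterm -e gdb my_mpi_application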
Note that X traffic is fairly "heavy" — if you are operating over a
slow network connection, it may take some time before the xterm
windows appear on your screen.
If your xterm supports it, the -hold option may be useful.
-hold tells xterm to stay open even when the application has
completed. This means that if something goes wrong (e.g., gdb fails
to execute, or unexpectedly dies, or ...), the xterm window will
stay open, allowing you to see what happened, instead of closing
immediately and losing whatever error message may have been
output.
When you have finished, you may wish to disable X11 network
permissions from the hosts that you were using. Use xhost again to
disable these permissions:
shell$ for host in `cat my_hostfile` ; do xhost -$host ; done
Note that mpirun will not complete until all the xterms
complete.
184. My process dies without any output. Why?
There may be many reasons for this; the Open MPI Team
strongly encourages the use of tools (such as debuggers) whenever
possible.
One of the reasons, however, may come from inside Open MPI itself. If
your application fails due to memory corruption, Open MPI may
subsequently fail to output an error message before dying.
Specifically, starting with v1.3, Open MPI attempts to aggregate error
messages from multiple processes in an attempt to show unique error
messages only once (vs. one for each MPI process — which can be
unwieldy, especially when running large MPI jobs).
However, this aggregation process requires allocating memory in the
MPI process when it displays the error message. If the process's
memory is already corrupted, Open MPI's attempt to allocate memory may
fail and the process will simply die, possibly silently. When Open
MPI does not attempt to aggregate error messages, most of its setup
work is done during MPI_INIT and no memory is allocated during the
"print the error" routine. It therefore almost always successfully
outputs error messages in real time — but at the expense that you'll
potentially see the same error message for each MPI process that
encountered the error.
Hence, the error message aggregation is usually a good thing, but
sometimes it can mask a real error. You can disable Open MPI's error
message aggregation with the orte_base_help_aggregate MCA
parameter. For example:
shell$ mpirun --mca orte_base_help_aggregate 0 ...
185. What is Memchecker?
The Memchecker MCA framework allows MPI-semantic
checking of your application (as well as of the internals of Open MPI), with
the help of memory checking tools such as Memcheck from the
Valgrind suite (http://www.valgrind.org/).
The Memchecker component is included in Open MPI v1.3 and later.
186. What kind of errors can Memchecker find?
Memchecker is implemented on the basis of the Memcheck tool from
Valgrind, so it inherits all of Memcheck's advantages: it checks
all reads and writes of memory, and intercepts calls to
malloc/new/free/delete. Most importantly, Memchecker is able to detect
user buffer errors in both non-blocking and one-sided
communications, e.g., reading or writing to buffers of active
non-blocking receive operations and writing to buffers of active
non-blocking send operations.
Here are some examples of errors that Memchecker can detect:
Accessing buffer under control of non-blocking communication:
int buf;
MPI_Irecv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
// The following line will produce a memchecker warning
buf = 4711;
MPI_Wait(&req, &status);
Wrong input parameters, e.g. wrongly sized send buffers:
char *send_buffer;
send_buffer = malloc(5);
memset(send_buffer, 0, 5);
// The following line will produce a memchecker warning
MPI_Send(send_buffer, 10, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
Accessing window under control of one-sided communication:
char *buffer;
buffer = malloc(10);
// The following line will produce a memchecker warning
MPI_Send(buffer, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
Usage of the uninitialized MPI_ERROR field of an MPI_Status structure
(the MPI-1 standard defines the MPI_ERROR field to be undefined for
single-completion calls such as MPI_Wait or MPI_Test; see MPI-1 p. 22):
MPI_Wait(&request, &status);
// The following line will produce a memchecker warning
if (status.MPI_ERROR != MPI_SUCCESS)
    return ERROR;
187. How can I use Memchecker?
To use Memchecker, you need Open MPI 1.3 or later, and
Valgrind 3.2.0 or later.
As this functionality is off by default, one needs to turn it on
with the configure flag --enable-memchecker. Then, configure will
check for a recent Valgrind-distribution and include the compilation
of ompi/opal/mca/memchecker. You may ensure that the library is
being built by using the ompi_info application. Please note that
all of this will only make sense together with --enable-debug, which
is required by Valgrind for outputting messages pointing directly to
the relevant source code lines. Otherwise, without debugging info,
the messages from Valgrind are nearly useless.
Here is a configuration example to enable Memchecker:
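A sketch of such a configure invocation (the installation paths are illustrative; --with-valgrind points at your Valgrind installation if it is not in a default location):
shell$ ./configure --prefix=/path/to/openmpi --enable-debug --enable-memchecker --with-valgrind=/path/to/valgrind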
To check if Memchecker is successfully enabled after installation,
simply run this command:
shell$ ompi_info | grep memchecker
You will get an output message like this:
MCA memchecker: valgrind (MCA v1.0, API v1.0, Component v1.3)
Otherwise, you probably didn't configure and install Open MPI correctly.
188. How do I run my MPI application with Memchecker?
First of all, you have to make sure that Valgrind 3.2.0 or
later is installed, and Open MPI is compiled with Memchecker
enabled. Then simply run your application with Valgrind, e.g.:
shell$ mpirun -np 2 valgrind ./my_app
Or if you enabled Memchecker, but you don't want to check the
application at this time, then just run your application as
usual. E.g.:
shell$ mpirun -np 2 ./my_app
189. Does Memchecker cause performance degradation to my application?
The configure option --enable-memchecker (together with --enable-debug) does
cause performance degradation, even if not running under Valgrind.
The following explains the mechanism and may help in making the decision
whether to provide a cluster-wide installation with --enable-memchecker.
There are two cases:
If run without Valgrind, the Valgrind ClientRequests (assembler
instructions added to the normal execution path for checking) do
not affect overall MPI performance. Valgrind ClientRequests are
explained in detail in
Valgrind's documentation.
In the case of x86-64, ClientRequests boil down to the following
four rotate-left (ROL) and one exchange (XCHG) assembler instructions
(from valgrind.h):
for every single ClientRequest. When not running under
Valgrind, these ClientRequest instructions do not change the
arithmetic outcome (rotating a 64-bit register left by 128 bits,
exchanging a register with itself), except for the carry flag.
The first request is checking whether we're running under Valgrind.
In case we're not running under Valgrind, subsequent checks (aka ClientRequests)
are not done.
190. Is Open MPI 'Valgrind-clean' or how can I identify real errors?
This issue has been raised many times on the mailing list, e.g.,
here or
here.
There are many situations where Open MPI purposefully does not initialize
memory that it subsequently communicates, e.g., by calling writev.
Furthermore, several cases are known where memory is not properly freed upon
MPI_Finalize.
This certainly does not help in distinguishing real errors from false positives.
Valgrind provides functionality to suppress errors and warnings from certain
function contexts.
In an attempt to ease debugging using Valgrind, starting with v1.5, Open MPI
provides a so-called Valgrind-suppression file, that can be passed on the
command line:
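A sketch of such an invocation (the suppression file is typically installed as openmpi-valgrind.supp under $prefix/share/openmpi; verify the path in your installation):
shell$ mpirun -np 2 valgrind --suppressions=/path/to/openmpi/share/openmpi/openmpi-valgrind.supp ./my_app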
More information on suppression-files and how to generate
them can be found in
Valgrind's Documentation.
191. Can I make Open MPI use rsh instead of ssh?
Yes. The method to do this has changed over the different
versions of Open MPI.
v1.7 and later series: The plm_rsh_agent MCA parameter
accepts a colon-delimited list of programs to search for in your path
to use as the remote startup agent. The default value
is ssh : rsh, meaning that it will look for ssh first, and if it
doesn't find it, use rsh. You can change the value of this
parameter as relevant to your environment, such as simply changing it
to rsh or rsh : ssh if you have a mixture. The deprecated forms
pls_rsh_agent and orte_rsh_agent will also work.
v1.3 to v1.6 series: The orte_rsh_agent MCA parameter
accepts a colon-delimited list of programs to search for in your path
to use as the remote startup agent (the MCA parameter name
plm_rsh_agent also works, but it is deprecated). The default value
is ssh : rsh, meaning that it will look for ssh first, and if it
doesn't find it, use rsh. You can change the value of this
parameter as relevant to your environment, such as simply changing it
to rsh or rsh : ssh if you have a mixture.
v1.1 and v1.2 series: The v1.1 and v1.2 method is exactly the
same as the v1.3 method, but the MCA parameter name is slightly
different: pls_rsh_agent ("pls" vs. "plm"). Using the old
"pls" name will continue to work in the v1.3 series, but it is now
officially deprecated — you'll receive a warning if you use it.
v1.0 series: In the 1.0.x series, Open MPI defaults to using
ssh for remote startup of processes in unscheduled environments.
You can change this to rsh by setting the MCA
parameterpls_rsh_agent to rsh.
See this FAQ entry
for details on how to set MCA parameters — particularly with
multi-word values.
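For example, a minimal sketch that forces rsh from the mpirun command line (using the v1.7-and-later parameter name described above):

shell$ mpirun --mca plm_rsh_agent rsh -np 4 my_mpi_application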
192. What prerequisites are necessary for running an Open MPI job
under rsh/ssh?
193. How can I make ssh not ask me for a password?
If you are using ssh to launch processes on remote nodes,
there are multiple ways.
Note that there are multiple versions of ssh available. References
to ssh in this text refer to OpenSSH.
This documentation provides an overview for using user keys and the
OpenSSH 2.x key management agent (if your OpenSSH only supports 1.x
key management, you should upgrade). See the OpenSSH documentation
for more details and a more thorough description. The process is
essentially the same for other versions of SSH, but the command names
and filenames may be slightly different. Consult your SSH
documentation for more details.
Normally, when you use ssh to connect to a remote host, it will
prompt you for your password. However, for the easiest way for mpirun
(and mpiexec, which, in Open MPI, is identical to mpirun) to work
properly, you need to be able to execute jobs on remote nodes without
typing in a password. In order to do this, you will need to set up
an SSH key pair. We recommend using RSA keys, as they are generally
considered "better" (i.e., more secure) than DSA keys. As such, this
text describes the process for RSA setup.
NOTE: This text briefly shows the steps involved, but the ssh
documentation is authoritative on these matters and should be
consulted for more information.
The first thing that you need to do is generate an RSA key pair to use
with ssh-keygen:
shell$ ssh-keygen -t rsa
Accept the default value for the file in which to store the key
($HOME/.ssh/id_rsa) and enter a passphrase for your key pair. You
may choose not to enter a passphrase, thereby obviating the need for
the ssh-agent. However, this greatly weakens the authentication,
because your secret key is stored unencrypted and is therefore
vulnerable to compromise.
It has been compared to the moral equivalent of leaving a plain text
copy of your password in your $HOME directory. See the ssh
documentation for more details.
Next, copy the $HOME/.ssh/id_rsa.pub file generated by ssh-keygen
to $HOME/.ssh/authorized_keys (or add it to the end of
authorized_keys if that file already exists):
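For example (a minimal sketch using the standard OpenSSH file locations):

shell$ cd $HOME/.ssh
shell$ cat id_rsa.pub >> authorized_keys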
In order for RSA authentication to work, you need to have the
$HOME/.ssh directory in your home directory on all the machines you
are running Open MPI on. If your home directory is on a common
filesystem, this may be already taken care of. If not, you will need to
copy the $HOME/.ssh directory to your home directory on all Open
MPI nodes. (Be sure to do this in a secure manner — perhaps using the
scp command — particularly if your secret key is not encrypted.)
ssh is very particular about file permissions. Ensure that your home
directory on all your machines is set to at least mode 755, your
$HOME/.ssh directory is also set to at least mode 755, and that the
following files inside $HOME/.ssh have at least the following
permissions:
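A representative listing consistent with the rules below (your local policy may be stricter):

-rw------- (600)  id_rsa
-rw-r--r-- (644)  id_rsa.pub
-rw-r--r-- (644)  authorized_keys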
The phrase "at least" in the above paragraph means the following:
The files need to be readable by you.
The files should only be writable by you.
The files should not be executable.
Aside from id_rsa, the files can be readable by others, but
do not need to be.
Your $HOME and $HOME/.ssh directories can be readable by
others, but do not need to be.
You are now set up to use RSA authentication. However, when you ssh
to a remote host, you will still be asked for your RSA passphrase
(as opposed to your normal password). This is where the ssh-agent
program comes in. It allows you to type in your RSA passphrase once,
and then have all successive invocations of ssh automatically
authenticate you against the remote host. See the ssh-agent(1)
documentation for more details than what are provided here.
Additionally, check the documentation and setup of your local
environment; ssh-agent may already be setup for you (e.g., see if
the shell environment variable $SSH_AUTH_SOCK exists; if so,
ssh-agent is likely already running). If ssh-agent is not already
running, you can start it manually with the following:
shell$ eval `ssh-agent`
Note the specific invocation method: ssh-agent prints a few shell
commands on its standard output (e.g., setting the SSH_AUTH_SOCK
environment variable); the eval causes your current shell to execute them.
You will probably want to start the ssh-agent before you start your
graphics / windowing system so that all your windows will inherit the
environment variables set by this command. Note that some sites
invoke ssh-agent for each user upon login automatically; be sure to
check and see if there is an ssh-agent running for you already.
Once the ssh-agent is running, you can tell it your passphrase by
running the ssh-add command:
shell$ ssh-add $HOME/.ssh/id_rsa
At this point, if you ssh to a remote host that has the same
$HOME/.ssh directory as your local one, you should not be prompted
for a password or passphrase. If you are, a common problem is that
the permissions in your $HOME/.ssh directory are not as they should
be.
Note that this text has covered the ssh commands in very little
detail. Please consult the ssh documentation for more information.
194. What is a .rhosts file? Do I need it?
If you are using rsh to launch processes on remote nodes,
you will probably need to have a $HOME/.rhosts file.
This file allows you to execute commands on remote nodes without being
prompted for a password. The permissions on this file usually must be
0644 (rw-r--r--). It must exist in your home directory on every
node that you plan to use Open MPI with.
Each line in the .rhosts file indicates a machine and user that
programs may be launched from. For example, if the user
steve wishes to launch programs from the machine stevemachine to
the machines alpha, beta, and gamma, there must be a .rhosts
file on each of the three remote machines (alpha, beta, and
gamma) with at least the following line in it:
stevemachine steve
The first field indicates the name of the machine where jobs may
originate from; the second field indicates the user ID who may
originate jobs from that machine. It is better to supply a
fully-qualified domain name for the machine name (for security reasons
— there may be many machines named stevemachine on the internet).
So the above example should be:
stevemachine.example.com steve
The Open MPI Team strongly discourages the use of "+" in the
.rhosts file. This is always a huge security hole.
If rsh does not find a matching line in the $HOME/.rhosts file, it
will prompt you for a password. Open MPI requires the password-less
execution of commands; if rsh prompts for a password, mpirun will
fail.
NOTE: Some implementations of
rsh are very picky about the format of text in the .rhosts file.
In particular, some do not allow leading white space on each line in
the .rhosts file, and will give a misleading "permission denied"
error if you have white space before the machine name.
NOTE: rsh is not considered "secure" or "safe" — .rhosts
authentication is fairly weak. The Open MPI Team
recommends that you use ssh ("Secure Shell") to launch remote
programs as it uses a much stronger authentication system.
195. Should I use + in my .rhosts file?
No!
While there are a very small number of cases where using "+" in
your .rhosts file may be acceptable, the Open MPI Team highly
recommends that you do not.
Using a "+" in your .rhosts file indicates that you will allow
any machine and/or any user to connect as you. This is extremely
dangerous, especially on machines that are connected to the internet.
Consider the fact that anyone on the internet can connect to your
machine (as you) — it should strike fear into your heart.
The + should not be used for either field of the .rhosts file.
Instead, you should use the full and proper hostname and username of
accounts that are authorized to remotely login as you to that machine
(or machines). This is usually just a list of your own username on a
list of machines that you wish to run Open MPI with. See this FAQ entry for further details, as well
as your local rsh documentation.
Additionally, the Open MPI Team strongly recommends that rsh not be
used in unscheduled environments (especially those connected to the
internet) — it is considered weak remote authentication. Instead, we
recommend the use of ssh — the secure remote shell. See this FAQ entry for more details.
196. What versions of BProc does Open MPI work with?
BProc support was dropped from Open MPI in the Open MPI v1.3 series.
The last version of Open MPI to include BProc support was Open MPI 1.2.9, which was
released in February of 2009.
As of December 2005, Open MPI supports recent versions of
BProc, such as those found in Clustermatic. We have
not tested with older forks of the BProc project, such as those from
Scyld (now defunct). Since Open MPI's BProc support uses some
advanced support from recent BProc versions, it is somewhat doubtful
(but totally untested) as to whether it would work on Scyld systems.
197. What prerequisites are necessary for running an Open MPI job under BProc?
With BProc, it is worth noting that BProc may not bring all
necessary dynamic libraries along with a process when the process is migrated to a
back-end compute node. Plus, Open MPI opens components on the fly
(i.e., after the process has started), so if these components are
unavailable on the back-end compute nodes, Open MPI applications may
fail.
In general the Open MPI team recommends one of the following two
solutions when running on BProc clusters (in order):
Compile Open MPI statically, meaning that Open MPI's libraries
produce static ".a" libraries and all components are included in
the library (as opposed to dynamic ".so" libraries, and separate
".so" files for each component that is found and loaded at
run-time) so that applications do not need to find any shared
libraries or components when they are migrated to back-end compute
nodes. This can be accomplished by specifying [--enable-static
--disable-shared] to configure when building Open MPI.
If you do not wish to use static compilation, ensure that Open MPI
is fully installed on all nodes (i.e., the head node and all compute
nodes) in the same directory location. For example, if Open MPI is
installed in /opt/openmpi-5.0.6 on the head node, ensure that
it is also installed in that same directory on all the compute
nodes.
198. How do I run jobs under Torque / PBS Pro?
The short answer is just to use mpirun as normal.
When properly configured, Open MPI obtains both the list of hosts and how many
processes to start on each host from Torque / PBS Pro directly.
Hence, it is unnecessary to specify the --hostfile, --host, or
-np options to mpirun. Open MPI will use PBS/Torque-native
mechanisms to launch and kill processes (rsh and/or ssh are not
required).
For example:
# Allocate a PBS job with 4 nodes
shell$ qsub -I -l nodes=4
# Now run an Open MPI job on all the nodes allocated by PBS/Torque
# (starting with Open MPI v1.2; you need to specify -np for the 1.0
# and 1.1 series).
shell$ mpirun my_mpi_application
This will run the 4 MPI processes on the nodes that were allocated by
PBS/Torque. Or, if submitting a script:
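A minimal sketch of the batch-script approach (the script name and its contents are placeholders):

shell$ cat my_script.sh
#!/bin/sh
mpirun my_mpi_application
shell$ qsub -l nodes=4 my_script.sh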
199. Does Open MPI support Open PBS?
As of this writing, Open PBS is so ancient that we are not
aware of any sites running it. As such, we have never tested Open MPI
with Open PBS and therefore do not know whether it would work or not.
200. How does Open MPI get the list of hosts from Torque / PBS Pro?
Open MPI has changed how it obtains hosts from Torque / PBS
Pro over time:
v1.0 and v1.1 series: The list of hosts allocated to a Torque /
PBS Pro job is obtained directly from the scheduler using the internal
TM API.
v1.2 series: Due to scalability limitations in how the TM API
was used in the v1.0 and v1.1 series, Open MPI was modified to read
the $PBS_NODEFILE to obtain hostnames. Specifically, reading the
$PBS_NODEFILE is much faster at scale than how the v1.0 and v1.1
series used the TM API.
It is possible that future versions of Open MPI may switch back to
using the TM API in a more scalable fashion, but there isn't currently
a huge demand for it (reading the $PBS_NODEFILE works just fine).
Note that the TM API is used to launch processes in all versions of
Open MPI; the only thing that has changed over time is how Open MPI
obtains hostnames.
201. What happens if $PBS_NODEFILE is modified?
Bad Things will happen.
We've had reports from some sites that system administrators modify
the $PBS_NODEFILE in each job according to local policies. This will
currently cause Open MPI to behave in an unpredictable fashion. As
long as no new hosts are added to the hostfile, it usually means
that Open MPI will incorrectly map processes to hosts, but in some
cases it can cause Open MPI to fail to launch processes altogether.
The best course of action is to not modify the $PBS_NODEFILE.
202. Can I specify a hostfile or use the --host option to mpirun
when running in a Torque / PBS environment?
Prior to v1.3, no.
Open MPI <v1.3 will fail to launch processes properly when a hostfile is
specified on the mpirun command line, or if the mpirun --host
option is used.
As of v1.3, Open MPI can use the --hostfile and --host options in
conjunction with TM jobs.
203. How do I determine if Open MPI is configured for Torque/PBS Pro?
If you are configuring and installing Open MPI yourself, and you want
to ensure that you are building the components of Open MPI required for
Torque/PBS Pro support, include the --with-tm option on the configure
command line. Run ./configure --help for further information about this
configure option.
The ompi_info command can be used to determine whether or not an
installed Open MPI includes Torque/PBS Pro support:
shell$ ompi_info | grep ras
If the Open MPI installation includes support for Torque/PBS Pro, you
should see a line similar to that below. Note the MCA version information
varies depending on which version of Open MPI is installed.
MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v3.0.0)
204. How do I run with the SGE launcher?
Support for SGE is included in Open MPI version 1.2 and
later.
NOTE: To build SGE support in
v1.3, you will need to explicitly request the SGE support with the
"--with-sge" command line switch to Open MPI's configure script.
See this FAQ entry
for a description of how to correctly build Open MPI with SGE support.
To verify if support for SGE is configured into your Open MPI
installation, run ompi_info as shown below and look for gridengine.
The components you will see are slightly different between v1.2 and
v1.3.
Open MPI will automatically detect when it is running inside SGE and
will just "do the Right Thing."
Specifically, if you execute an mpirun command in a SGE job, it
will automatically use the SGE mechanisms to launch and kill
processes. There is no need to specify what nodes to run on — Open
MPI will obtain this information directly from SGE and default to a number
of processes equal to the slot count specified. For example, this
will run 4 MPI processes on the nodes that were allocated by SGE:
# Get the environment variables for SGE
# (Assuming SGE is installed at /opt/sge and $SGE_CELL is 'default' in your environment)
# C shell settings
shell% source /opt/sge/default/common/settings.csh
# Bourne shell settings
shell$ . /opt/sge/default/common/settings.sh
# Allocate an SGE interactive job with 4 slots from a parallel
# environment (PE) named 'orte' and run a 4-process Open MPI job
shell$ qrsh -pe orte 4 -b y mpirun -np 4 a.out
There are also other ways to submit jobs under SGE:
# Submit a batch job with the 'mpirun' command embedded in a script
shell$ qsub -pe orte 4 my_mpirun_job.csh
# Submit an SGE and OMPI job and mpirun in one line
shell$ qrsh -V -pe orte 4 mpirun hostname
# Use qstat(1) to show the status of SGE jobs and queues
shell$ qstat -f
In reference to the setup, be sure you have a Parallel Environment
(PE) defined for submitting parallel jobs. You don't have to name your
PE "orte"; the following example simply shows what a PE named "orte"
might look like:
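A representative definition, as shown by qconf -sp orte (treat this as a sketch; the exact fields and defaults vary between SGE versions):

shell$ qconf -sp orte
pe_name            orte
slots              99999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE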
"qsort_args" is necessary with the Son of Grid Engine distribution,
version 8.1.1 and later, and probably only applicable to it. For
very old versions of SGE, omit "accounting_summary" too.
You may want to alter other parameters, but the important one is
"control_slaves", specifying that the environment has "tight
integration". Note also the lack of a start or stop procedure.
The tight integration means that mpirun automatically picks up the
slot count to use as a default in place of the "-np" argument,
picks up a host file, spawns remote processes via "qrsh" so that
SGE can control and monitor them, and creates and destroys a
per-job temporary directory ($TMPDIR), in which Open MPI's
directory will be created (by default).
Be sure the queue will make use of the PE that you specified:
shell$ qconf -sq all.q
...
pe_list make cre orte
...
To determine whether the SGE parallel job is successfully launched to
the remote nodes, you can pass in the MCA parameter "[--mca
plm_base_verbose 1]" to mpirun.
This will add in a -verbose flag to the qrsh -inherit command that is used
to send parallel tasks to the remote SGE execution hosts. It will show
whether the connections to the remote hosts are established
successfully or not.
205. Does the SGE tight integration support the -notify flag to qsub?
If you are running SGE 6.2 Update 3 or later, then the -notify flag
is supported. If you are running earlier versions, then the -notify flag
will not work and using it will cause the job to be killed.
To use -notify, one has to be careful. First, let us review what
-notify does. Here is an excerpt from the qsub man page for the
-notify flag.
-notify
This flag, when set causes Sun Grid Engine to send
warning signals to a running job prior to sending the
signals themselves. If a SIGSTOP is pending, the job
will receive a SIGUSR1 several seconds before the SIGSTOP.
If a SIGKILL is pending, the job will receive a SIGUSR2
several seconds before the SIGKILL. The amount of time
delay is controlled by the notify parameter in each
queue configuration.
Let us assume the reason you want to use
the -notify flag is to get the SIGUSR1 signal prior to getting the
SIGTSTP signal. As mentioned in this FAQ entry, one could run the job
from a batch script that invokes mpirun.
However, one has to make one of two changes to that script for things
to work properly. By default, a SIGUSR1 signal will kill a shell
script, so we have to make sure that does not happen. Here is one way
to handle it.
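One possible approach is sketched below (the PE name, slot count, and application name are placeholders): trap SIGUSR1 in the script so the warning signal does not kill the script itself.

#!/bin/sh
#$ -S /bin/sh
#$ -notify
#$ -pe orte 4
# Catch the SIGUSR1 warning signal so it does not kill this batch script
trap 'echo "Received SIGUSR1 warning signal"' USR1
mpirun -np 4 a.out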
A new feature was added into Open MPI v1.3.1 that supports
suspend/resume of an MPI job. To suspend the job, you send a SIGTSTP
(not SIGSTOP) signal to mpirun. mpirun will catch this signal and
forward it to the a.outs as a SIGSTOP signal. To resume the job,
you send a SIGCONT signal to mpirun which will be caught and
forwarded to the a.outs.
By default, this feature is not enabled. This means that both the
SIGTSTP and SIGCONT signals will simply be consumed by the mpirun
process. To have them forwarded, you have to run the job with [--mca
orte_forward_job_control 1]. Here is an example on Solaris.
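A minimal sketch of such an invocation (the process count and program name are placeholders):

shell$ mpirun -np 2 --mca orte_forward_job_control 1 a.out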
In another window, we suspend and continue the job.
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:00:21 5.9% a.out/1
 15303 rolfv     158M   22M cpu2     0    0   0:00:21 5.9% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1
shell$ kill -TSTP 15301
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15303 rolfv     158M   22M stop    30    0   0:01:44  21% a.out/1
 15305 rolfv     158M   22M stop    20    0   0:01:44  21% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1
shell$ kill -CONT 15301
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:02:06  17% a.out/1
 15303 rolfv     158M   22M cpu3     0    0   0:02:06  17% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1
Note that all this does is stop the a.out processes. It does not,
for example, free any pinned memory when the job is in the suspended
state.
To get this to work under the SGE environment, you have to change the
suspend_method entry in the queue. It has to be set to SIGTSTP.
Here is an example of what a queue should look like.
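A representative excerpt from qconf -sq (only the relevant entry is shown; the rest of the queue configuration is omitted):

shell$ qconf -sq all.q
...
suspend_method        SIGTSTP
...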
Note that if you need to suspend other types of jobs with SIGSTOP
(instead of SIGTSTP) in this queue then you need to provide a script
that can implement the correct signals for each job type.
207. How do I run jobs under Slurm?
The short answer is yes, provided you configured OMPI
--with-slurm. You can use mpirun as normal, or directly launch
your application using srun if OMPI is configured per
this FAQ entry.
The longer answer is that Open MPI supports launching parallel jobs in
all three methods that Slurm supports (you can find more info about
Slurm specific recommendations on the SchedMD web page:
Launching via "salloc ..."
Launching via "sbatch ..."
Launching via "srun -n X my_mpi_application"
Specifically, you can launch Open MPI's mpirun in an interactive
Slurm allocation (via the salloc command) or you can submit a
script to Slurm (via the sbatch command), or you can "directly"
launch MPI executables via srun.
Open MPI automatically obtains both the list of hosts and how many
processes to start on each host from Slurm directly. Hence, it is
unnecessary to specify the --hostfile, --host, or -np options to
mpirun. Open MPI will also use Slurm-native mechanisms to launch
and kill processes (rsh and/or ssh are not required).
For example:
# Allocate a Slurm job with 4 nodes
shell$ salloc -N 4 sh
# Now run an Open MPI job on all the nodes allocated by Slurm
# (Note that you need to specify -np for the 1.0 and 1.1 series;
# the -np value is inferred directly from Slurm starting with the
# v1.2 series)
shell$ mpirun my_mpi_application
This will run the 4 MPI processes on the nodes that were allocated by
Slurm. Equivalently, you can do this:
# Allocate a Slurm job with 4 nodes and run your MPI application in it
shell$ salloc -N 4 mpirun my_mpi_application
208. Does Open MPI support "srun -n X my_mpi_application"?
Yes, if you have configured OMPI --with-pmi=foo, where foo is
the path to the directory where pmi.h/pmi2.h is located. Slurm (> 2.6,
> 14.03) installs PMI-2 support by default.
Older versions of Slurm install PMI-1 by default. If you desire PMI-2,
Slurm requires that you manually install that support. When the
--with-pmi option is given, OMPI will automatically determine if PMI-2
support was built and use it in place of PMI-1.
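A minimal sketch of such a configure invocation (the PMI include path is entirely installation-specific):

shell$ ./configure --with-slurm --with-pmi=/path/to/slurm/pmi ...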
209. I use Slurm on a cluster with the OpenFabrics network stack. Do I need to do anything special?
210. My job fails / performs poorly when using mpirun under Slurm 20.11
There were some changes in Slurm behavior that were introduced
in Slurm 20.11.0 and subsequently reverted out in Slurm 20.11.3.
SchedMD (the makers of Slurm) strongly suggest that all Open MPI users
avoid using Slurm versions 20.11.0 through 20.11.2.
Indeed, you will likely run into problems using just about any version
of Open MPI with these problematic Slurm releases. Please either downgrade
to an older version or upgrade to a newer version of Slurm.
211. How do I reduce startup time for jobs on large clusters?
There are several ways to reduce the startup time on large
clusters. Some of them are described on this page. We continue to work
on making startup even faster, especially on the large clusters coming
in future years.
Open MPI v5.0.6 is significantly faster and more robust than its
predecessors. We recommend that anyone running large jobs and/or on
large clusters make the upgrade to the v5.0 series.
Several major launch time enhancements have been made starting with the
v3.0 release. Most of these take place in the background — i.e., there
is nothing you (as a user) need do to take advantage of them. However,
there are a few that are left as options until we can assess any potential
negative impacts on different applications. Some options are only available
when launching via mpirun - these include:
adding --fwd-mpirun-port to the cmd line (or the corresponding
fwd_mpirun_port MCA parameter) will allow the daemons launched on compute
nodes to wireup to each other using an overlay network (e.g., a tree-based
pattern). This reduces the number of socket connections mpirun must handle
and can significantly reduce startup time.
Other options are available when launching via mpirun or when launching using
the native resource manager launcher (e.g., srun in a Slurm environment).
These are activated by setting the corresponding MCA parameter, and
include the following (a combined example appears after this list):
Setting the pmix_base_async_modex MCA parameter will eliminate a global
out-of-band collective operation during MPI_Init. This operation is performed
in order to share endpoint information prior to communication. At scale, this
operation can take some time and scales at best logarithmically. Setting the
parameter bypasses the operation and causes the system to lookup the endpoint
information for a peer only at first message. Thus, instead of collecting
endpoint information for all processes, only the endpoint information for those
processes this peer communicates with will be retrieved. The parameter is
especially effective for applications with sparse communication patterns — i.e.,
where a process only communicates with a few other peers. Applications that
use dense communication patterns (i.e., where a peer communicates directly to
all other peers in the job) will probably see a negative impact of this option.
NOTE: This option is only available in PMIx-supporting environments, or when
launching via mpirun
The async_mpi_init parameter is automatically set to true when the
pmix_base_async_modex parameter has been set, but can also be independently
controlled. When set to true, this parameter causes MPI_Init to skip an
out-of-band barrier operation at the end of the procedure that is not required
whenever direct retrieval of endpoint information is being used.
Similarly, the async_mpi_finalize parameter skips an out-of-band barrier operation
usually performed at the beginning of MPI_Finalize. Some transports (e.g., the
usnic BTL) require this barrier to ensure that all MPI messages are completed
prior to finalizing, while other transports handle this internally and thus do
not require the additional barrier. Check with your transport provider to be sure,
or you can experiment to determine the proper setting.
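A combined sketch of setting these options on the mpirun command line (the parameter names are taken from the text above; the process count and application name are placeholders, and you should verify the parameters with ompi_info on your installation):

shell$ mpirun --mca pmix_base_async_modex 1 \
              --mca async_mpi_init 1 \
              --mca async_mpi_finalize 1 \
              -np 1024 my_mpi_application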
212. Where should I put my libraries: Network vs. local filesystems?
Open MPI itself doesn't really care where its libraries are
stored. However, where they are stored does have an impact on startup
times, particularly for large clusters, which can be mitigated
somewhat through use of Open MPI's configuration options.
Startup times will always be minimized by storing the libraries local
to each node, either on local disk or in RAM-disk. The latter is
sometimes problematic since the libraries do consume some space, thus
potentially reducing memory that would have been available for MPI
processes.
There are two main considerations for large clusters that need to
place the Open MPI libraries on networked file systems:
While DSO's are more flexible, you definitely do not want to use
them when the Open MPI libraries will be mounted on a network file
system! Doing so will lead to significant network traffic and delayed
start times, especially on clusters with a large number of
nodes. Instead, be sure to configure your build with
--disable-dlopen. This will include the DSO's in the main libraries,
resulting in much faster startup times.
Many networked file systems use automount for user level
directories, as well as for some locally administered system
directories. There are many reasons why system administrators may
choose to automount such directories. MPI jobs, however, tend to
launch very quickly, thereby creating a situation wherein a large
number of nodes will nearly simultaneously demand automount of a
specific directory. This can overload NFS servers, resulting in
delayed response or even failed automount requests.
Note that this applies to both automount of directories containing
Open MPI libraries as well as directories containing user
applications. Since these are unlikely to be the same location,
multiple automount requests from each node are possible, thus
increasing the level of traffic.
213. Static vs shared libraries?
It is perfectly fine to use either shared or static
libraries. Shared libraries will save memory when operating multiple
processes per node, especially on clusters with high numbers of cores
on a node, but can also take longer to launch on networked file
systems. (See the network vs. local
filesystem FAQ entry for suggestions on how to mitigate such
problems.)
214. How do I reduce the time to wireup OMPI's out-of-band communication system?
Open MPI's run-time uses an out-of-band (OOB) communication
subsystem to pass messages during the launch, initialization, and
termination stages for the job. These messages allow mpirun to tell
its daemons what processes to launch, and allow the daemons in turn to
forward stdio to mpirun, update mpirun on process status, etc.
The OOB uses TCP sockets for its communication, with each daemon
opening a socket back to mpirun upon startup. In a large cluster, this
can mean thousands of connections being formed on the node where
mpirun resides, and requires that mpirun actually process all these
connection requests. mpirun defaults to processing connection requests
sequentially — so on large clusters, a backlog can be created that can
cause remote daemons to timeout waiting for a response.
Fortunately, Open MPI provides an alternative mechanism for processing
connection requests that helps alleviate this problem. Setting the MCA
parameter oob_tcp_listen_mode to listen_thread causes mpirun to
startup a separate thread dedicated to responding to connection
requests. Thus, remote daemons receive a quick response to their
connection request, allowing mpirun to deal with the message as soon
as possible.
This parameter can be included in the default MCA parameter file,
placed in the user's environment, or added to the mpirun command
line. See this FAQ
entry for more details on how to set MCA parameters.
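For example, a minimal sketch of setting it on the mpirun command line:

shell$ mpirun --mca oob_tcp_listen_mode listen_thread -np 1024 my_mpi_application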
215. Why is my job failing because of file descriptor limits?
This is a known issue in Open MPI releases prior to the v1.3
series. The problem lies in the connection topology for Open MPI's
out-of-band (OOB) communication subsystem. Prior to the 1.3 series, a
fully-connected topology was used that required every process to open
a connection to every other process in the job. This can rapidly
overwhelm the usual system limits.
There are two methods you can use to circumvent the problem. First,
upgrade to the v1.3 series if you can — this would be our recommended
approach as there are considerable improvements in that series vs. the
v1.2 one.
If you cannot upgrade and must stay with the v1.2 series, then you
need to increase the number of file descriptors in your system
limits. This commonly requires that your system administrator increase
the number of file descriptors allowed by the system itself. The
number required depends both on the number of nodes in your cluster
and the max number of processes you plan to run on each node. Assuming
you want to allow jobs that fully occupy the cluster, then the minimum
number of file descriptors you will need is roughly
(#procs_on_a_node + 1) * #procs_in_the_job.
It is always wise to have a few extra just in case. :-)
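For example (a purely hypothetical configuration): a job that runs 8 processes on each of 100 nodes has 800 processes in total, so each node would need roughly (8 + 1) * 800 = 7200 file descriptors for the out-of-band subsystem alone.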
Note that this only covers the file descriptors needed for the
out-of-band communication subsystem. It specifically does not address
file descriptors needed to support the MPI TCP transport, if that is
being used on your system. If it is, then additional file descriptors
will be required for those TCP sockets. Unfortunately, a simple
formula cannot be provided for that value as it depends completely on
the number of point-to-point TCP connections being made. If you
believe that users may want to fully connect an MPI job via TCP, then
it would be safest to simply double the number of file descriptors
calculated above.
This can, of course, get to be a really big number...which is why
you might want to consider upgrading to the v1.3 series, where OMPI
only opens #nodes OOB connections on each node. We are currently
working on even more sparsely connected topologies for very large
clusters, with the goal of constraining the number of connections
opened on a node to an arbitrary number as specified by an MCA
parameter.
216. I know my cluster's configuration - how can I take advantage of that knowledge?
Clusters rarely change from day-to-day, and large clusters
rarely change at all. If you know your cluster's configuration, there
are several steps you can take to both reduce Open MPI's memory
footprint and reduce the launch time of large-scale applications.
These steps use a combination of build-time configuration options to
eliminate components — thus eliminating their libraries and avoiding
unnecessary component open/close operations — as well as run-time MCA
parameters to specify what modules to use by default for most users.
One way to save memory is to avoid building components that will
actually never be selected by the system. Unless MCA parameters
specify which components to open, built components are always opened
and tested as to whether or not they should be selected for use. If
you know that a component can build on your system, but due to your
cluster's configuration will never actually be selected, then it is
best to simply configure OMPI to not build that component by using the
--enable-mca-no-build configure option.
For example, if you know that your system will only utilize the
ob1 component of the PML framework, then you can no_build all
the others. This not only reduces memory in the libraries, but also
reduces memory footprint that is consumed by Open MPI opening all the
built components to see which of them can be selected to run.
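For instance, a sketch of excluding a single PML component at configure time (the component named here, pml-cm, is only an illustration; list whichever components you know your cluster will never select):

shell$ ./configure --enable-mca-no-build=pml-cm ...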
In some cases, however, a user may optionally choose to use a
component other than the default. For example, you may want to build
all of the routed framework components, even though the vast
majority of users will simply use the default binomial
component. This means you have to allow the system to build the other
components, even though they may rarely be used.
You can still save launch time and memory, though, by setting the
routed=binomial MCA parameter in the default MCA parameter
file. This causes OMPI to not open the other components during
startup, but allows users to override this on their command line or in
their environment so no functionality is lost — you just save some
memory and time.
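A sketch of the corresponding line in a default MCA parameter file (for example, $prefix/etc/openmpi-mca-params.conf, described later in this FAQ):

# Only open the binomial routed component by default
routed = binomial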
Rather than have to figure this all out by hand, we are working on a
new OMPI tool called ompi-profiler. When run on a cluster, it will
tell you the selection results of all frameworks — i.e., for each
framework on each node, which component was selected to run — and a
variety of other information that will help you tailor Open MPI for
your cluster.
Stay tuned for more info as we continue to work on ways
to improve your performance...
217. What is the Modular Component Architecture (MCA)?
The Modular Component Architecture (MCA) is the backbone for
much of Open MPI's functionality. It is a series of frameworks,
components, and modules that are assembled at run-time to create
an MPI implementation.
Frameworks: An MCA framework manages zero or more components at run-time
and is targeted at a specific task (e.g., providing MPI collective
operation functionality). Each MCA framework supports a single
component type, but may support multiple versions of that type. The
framework uses the services from the MCA base functionality to find
and/or load components.
Components: An MCA component is an implementation of a framework's
interface. It is a standalone collection of code that can be bundled
into a plugin that can be inserted into the Open MPI code base,
either at run-time and/or compile-time.
Modules: An MCA module is an instance of a component (in the C++
sense of the word "instance"; an MCA component is analogous to a C++
class). For example, if a node running an Open MPI application has
multiple ethernet NICs, the Open MPI application will contain one TCP
MPI point-to-point component, but two TCP point-to-point modules.
Frameworks, components, and modules can be dynamic or static. That
is, they can be available as plugins or they may be compiled statically
into libraries (e.g., libmpi).
218. What are MCA parameters?
MCA parameters are the basic unit of run-time tuning for Open
MPI. They are simple "key = value" pairs that are used extensively
throughout the code base. The general rules of thumb that the
developers use are:
Instead of using a constant for an important value, make it an MCA
parameter.
If a task can be implemented in multiple, user-discernible ways,
implement as many as possible and make choosing between them be an MCA
parameter.
For example, an easy MCA parameter to describe is the boundary between
short and long messages in TCP wire-line transmissions. "Short"
messages are sent eagerly whereas "long" messages use a rendezvous
protocol. The decision point between these two protocols is the
overall size of the message (in bytes). By making this value an MCA
parameter, it can be changed at run-time by the user or system
administrator to use a sensible value for a particular environment or
set of hardware (e.g., a value suitable for 100 Mbps Ethernet is
probably not suitable for Gigabit Ethernet, and may require a
different value for 10 Gigabit Ethernet).
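As a concrete illustration, the TCP BTL exposes this boundary as an MCA parameter (btl_tcp_eager_limit); a sketch of inspecting and overriding it (the value shown is arbitrary):

shell$ ompi_info --param btl tcp --level 9 | grep eager
shell$ mpirun --mca btl_tcp_eager_limit 65536 -np 4 a.out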
Note that MCA parameters may be set in several different ways
(described in another FAQ entry). This allows, for example, system
administrators to fine-tune the Open MPI installation for their
hardware / environment such that normal users can simply use the
default values.
More specifically, HPC environments — and the applications that run
on them — tend to be unique. Providing extensive run-time tuning
capabilities through MCA parameters allows the customization of Open
MPI to each system's / user's / application's particular needs.
219. What frameworks are in Open MPI?
There are three types of frameworks in Open MPI: those in the
MPI layer (OMPI), those in the run-time layer (ORTE), and those in the
operating system / platform layer (OPAL).
The specific list of frameworks varies between each major release
series of Open MPI. See the links below to FAQ entries for specific
versions of Open MPI:
220. What frameworks are in Open MPI v1.2?
OMPI frameworks
ptl: (outdated / deprecated) MPI point-to-point transport layer
rcache: Memory registration management
topo: MPI topology information
ORTE frameworks
errmgr: Error manager
gpr: General purpose registry
iof: I/O forwarding
ns: Name server
oob: Out-of-band communication
pls: Process launch subsystem
ras: Resource allocation subsystem
rds: Resource discovery subsystem
rmaps: Resource mapping subsystem
rmgr: Resource manager (upper meta layer for all other Resource
frameworks)
rml: Remote messaging layer (routing of OOB messages)
schema: Name schemas
sds: Startup discovery services
soh: State of health
OPAL frameworks
maffinity: Memory affinity
memory: Memory hooks
paffinity: Processor affinity
timer: High-resolution timers
221. What frameworks are in Open MPI v1.3?
The comprehensive list of frameworks in Open MPI is
continually being augmented. As of November 2008, here is the current
list in the Open MPI v1.3 series:
OMPI frameworks
allocator: Memory allocator
bml: BTL management layer
btl: MPI point-to-point Byte Transfer Layer, used for MPI
point-to-point messages on some types of networks
coll: MPI collective algorithms
crcp: Checkpoint/restart coordination protocol
dpm: MPI-2 dynamic process management
io: MPI-2 I/O
mpool: Memory pooling
mtl: Matching transport layer, used for MPI point-to-point messages
osc: MPI-2 one-sided communications
222. What frameworks are in Open MPI v1.4 (and later)?
The comprehensive list of frameworks in Open MPI tends to
change over time. The README file in each Open MPI version maintains
a list of the frameworks that are contained in that version.
It is best to consult that README file; it is kept up to date.
223. How do I know what components are in my Open MPI installation?
The ompi_info command, in addition to providing a wealth of
configuration information about your Open MPI installation, will list
all components (and the frameworks that they belong to) that are
available. These include system-provided components as well as
user-provided components.
Please note that starting with Open MPI v1.8, ompi_info categorizes its
parameters into so-called levels, as defined by the MPI_T
interface. You will need to specify --level 9 (or
--all) to show all MCA parameters. See
Jeff Squyres' Blog for further information.
224. How do I install my own components into an Open MPI installation?
By default, Open MPI looks in two places for components at
run-time (in order):
$prefix/lib/openmpi/: This is the system-provided components
directory, part of the installation tree of Open MPI itself.
$HOME/.openmpi/components/: This is where users can drop their
own components that will automatically be "seen" by Open MPI at
run-time. This is ideal for developmental, private, or otherwise
unstable components.
Note that the directories and search ordering used for finding
components in Open MPI is, itself, an MCA parameter. Setting the
mca_component_path changes this value (a colon-delimited list of
directories).
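For example, a sketch that points the search at a hypothetical private component directory in addition to the standard installation directory (note that the value you supply replaces the default list):

shell$ mpirun --mca mca_component_path \
    "/opt/openmpi/lib/openmpi:$HOME/my-components" ...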
Note also that components are only used on nodes where they are
"visible". Hence, if your $prefix/lib/openmpi/ is a directory on a
local disk that is not shared via a network filesystem to other nodes
where you run MPI jobs, then components that are installed to that
directory will only be used by MPI jobs running on the local node.
More specifically: components have the same visibility as normal
files. If you need a component to be available to all nodes where you
run MPI jobs, then you need to ensure that it is visible on all nodes
(typically either by installing it on all nodes for non-networked
filesystem installs, or by installing them in a directory that is
visible to all nodes via a networked filesystem). Open MPI does not
automatically send components to remote nodes when MPI jobs are run.
225. How do I know what MCA parameters are available?
The ompi_info command can list the parameters for a given
component, all the parameters for a specific framework, or all
parameters. Most parameters contain a description of the parameter;
all will show the parameter's current value.
For example:
# Starting with Open MPI v1.7, you must use "--level 9" to see
# all the MCA parameters (the default is "--level 1"):
shell$ ompi_info --param all all --level 9
# Before Open MPI v1.7, the "--level" command line option
# did not exist; do not use it.
shell$ ompi_info --param all all
Shows all the MCA parameters for all components that ompi_info
finds, whereas:
# All remaining examples assume Open MPI v1.7 or later (i.e.,
# they assume the use of the "--level" command line option)
shell$ ompi_info --param btl all --level 9
Shows all the MCA parameters for all BTL components that ompi_info
finds. Finally:
shell$ ompi_info --param btl tcp --level 9
Shows all the MCA parameters for the TCP BTL component.
226. How do I set the value of MCA parameters?
There are three main ways to set MCA parameters, each of which
are searched in order.
Command line: The highest-precedence method is setting MCA
parameters on the command line. For example:
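A sketch consistent with the description below:

shell$ mpirun --mca mpi_show_handle_leaks 1 -np 4 a.out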
This sets the MCA parameter mpi_show_handle_leaks to the value of 1
before running a.out with four processes. In general, the format
used on the command line is "--mca <param_name>
<value>".
Note that when setting multi-word values, you need to use quotes to ensure that the shell and Open MPI understand that they are a single value. For example:
shell$ mpirun --mca param "value with multiple words" ...
Environment variable: Next, environment variables are searched.
Any environment variable named OMPI_MCA_<param_name> will be
used. For example, the following has the same effect as the previous
example (for sh-flavored shells):
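A sketch for sh-flavored shells, mirroring the command-line example above:

shell$ OMPI_MCA_mpi_show_handle_leaks=1
shell$ export OMPI_MCA_mpi_show_handle_leaks
shell$ mpirun -np 4 a.out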
Note that setting environment variables to values with multiple words
requires quoting, such as:
# sh-flavored shells
shell$ OMPI_MCA_param="value with multiple words"
# csh-flavored shells
shell% setenv OMPI_MCA_param "value with multiple words"
Aggregate MCA parameter files: Simple text files can be used to
set MCA parameter values for a specific application. See this FAQ entry (Open MPI version 1.3
and higher).
Files: Finally, simple text files can be used to set MCA
parameter values. Parameters are set one per line (comments are
permitted). For example:
# This is a comment
# Set the same MCA parameter as in previous examples
mpi_show_handle_leaks = 1
Note that quotes are not necessary for setting multi-word values in
MCA parameter files. Indeed, if you use quotes in the MCA parameter
file, they will be used as part of the value itself. For example:
# The following two values are different:
param1 = value with multiple words
param2 = "value with multiple words"
By default, two files are searched (in order):
$HOME/.openmpi/mca-params.conf: The user-supplied set of
values takes the highest precedence.
$prefix/etc/openmpi-mca-params.conf: The system-supplied set
of values has a lower precedence.
More specifically, the MCA parameter mca_param_files specifies a
colon-delimited path of files to search for MCA parameters. Files to
the left have lower precedence; files to the right are higher
precedence.
Keep in mind that, just like components, these parameter files are
only relevant where they are "visible" (see this FAQ entry). Specifically,
Open MPI does not read all the values from these files during startup
and then send them to all nodes in the job — the files are read on
each node during each process' startup. This is intended behavior: it
allows for per-node customization, which is especially relevant in
heterogeneous environments.
227. What are Aggregate MCA (AMCA) parameter files?
Starting with version 1.3, aggregate MCA (AMCA) parameter
files contain MCA parameter key/value pairs similar to the
$HOME/.openmpi/mca-params.conf file described in this FAQ entry.
The motivation behind AMCA parameter sets came from the realization
that for certain applications a large number of MCA parameters are
required for the application to run well and/or as the user
expects. Since these MCA parameters are application specific (or even
application run specific) they should not be set in a global manner,
but only pulled in as determined by the user.
MCA parameters set in AMCA parameter files will override any MCA
parameters supplied in global parameter files (e.g.,
$HOME/.openmpi/mca-params.conf), but not command line or environment
parameters.
AMCA parameter files are typically supplied on the command line via
the --am option.
For example, consider an AMCA parameter file called foo.conf
placed in the same directory as the application a.out. A user
will typically run the application as:
shell$ mpirun -np 2 a.out
To use the foo.conf AMCA parameter file this command line
changes to:
shell$ mpirun -np 2 --am foo.conf a.out
If the user wants to override a parameter set in foo.conf they
can add it to the command line as seen below.
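A sketch, using the mpi_leave_pinned parameter mentioned later in this entry as the value being overridden:

shell$ mpirun -np 2 --am foo.conf --mca mpi_leave_pinned 0 a.out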
AMCA parameter files can be coupled if more than one file is to be
used. If we have another AMCA parameter file called bar.conf
that we want to use, we add it to the command line as follows:
shell$ mpirun -np 2 --am foo.conf:bar.conf a.out
AMCA parameter files are loaded in priority order. This means that
the foo.conf AMCA file has priority over the bar.conf file. So
if the bar.conf file sets the MCA parameter
mpi_leave_pinned=0 and the foo.conf file sets it to
mpi_leave_pinned=1, then the foo.conf value (1) will be used.
The location of AMCA parameter files are resolved in a similar way as
the shell. If no path operator is provided (i.e., foo.conf) then
Open MPI will search the $SYSCONFDIR/amca-param-sets directory, then
the current working directory. If a relative path is specified, then
only that path will be searched (e.g., ./foo.conf,
baz/foo.conf). If an absolute path is specified, then only that
path will be searched (e.g., /bip/boop/foo.conf).
Though the typical use case for AMCA parameter files is to be
specified on the command line, they can also be set as MCA parameters
in the environment. The MCA parameter mca_base_param_file_prefix
contains a ':' separated list of AMCA parameter files exactly as they
would be passed to the --am command line option. The MCA
parameter mca_base_param_file_path specifies the path to search for
AMCA files with relative paths. By default this is
$SYSCONFDIR/amca-param-sets/:$CWD.
228. How do I set application specific environment variables in global
parameter files?
Starting with OMPI version 1.9, the --am option to supply
AMCA parameter files (see this FAQ
entry) is deprecated. Users should instead use the --tune
option. This option allows one to specify both MCA parameters and
environment variables from within a file using the same command line
syntax.
The usage of the --tune option is the same as that for the --am
option except that --tune requires a single file or a comma
delimited list of files, while a colon delimiter is used with the
--am option.
A valid line in the file may contain zero or many -x, -mca, or
--mca arguments. If any argument is duplicated in the file, the
last value read will be used.
For example, a file may contain the following line:
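A sketch of one such line (the environment variable and MCA parameter shown are only illustrations):

-x LD_LIBRARY_PATH -mca mpi_show_handle_leaks 1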
To use the foo.conf parameter file to run a.out, the command line
looks like the following:
shell$ mpirun -np 2 --tune foo.conf a.out
Similar to the --am option, MCA parameters and environment variables
specified on the command line have higher precedence than those
specified in the file.
The --tune option can also be replaced by the MCA parameter
mca_base_envar_file_prefix, which bears the same relationship to
--tune as mca_base_param_file_prefix does to the --am option.
229. How do I select which components are used?
Each MCA framework has a top-level MCA parameter that helps
guide which components are selected to be used at run-time.
Specifically, there is an MCA parameter of the same name as each MCA
framework that can be used to include or exclude components from a
given run.
For example, the btl MCA parameter is used to control which BTL
components are used (e.g., MPI point-to-point communications; see this FAQ entry for a full list of MCA
frameworks). It can take as a value a comma-separated list of
components with the optional prefix "^". For example:
# Tell Open MPI to exclude the tcp and openib BTL components
# and implicitly include all the rest
shell$ mpirun --mca btl ^tcp,openib ...

# Tell Open MPI to include *only* the components listed here and
# implicitly ignore all the rest (i.e., the loopback, shared memory,
# and OpenFabrics (a.k.a., "OpenIB") MPI point-to-point components):
shell$ mpirun --mca btl self,sm,openib ...
Note that ^ can only be the prefix of the entire value because the
inclusive and exclusive behavior are mutually exclusive.
Specifically, since the exclusive behavior means "use all components
except these", it does not make sense to mix it with the inclusive
behavior of not specifying it (i.e., "use all of these components").
Hence, something like this:
shell$ mpirun --mca btl self,sm,openib,^tcp ...
does not make sense because it says both "use only the self, sm,
and openib components" and "use all components except tcp" and
will result in an error.
230. What is processor affinity? Does Open MPI support it?
Open MPI supports processor affinity on a variety of systems
through process binding, in which each MPI process, along with its
threads, is "bound" to a specific subset of processing resources
(cores, sockets, etc.). That is, the operating system will constrain
that process to run on only that subset. (Other processes might be
allowed on the same resources.)
Affinity can improve performance by inhibiting excessive process
movement — for example, away from "hot" caches or NUMA memory.
Judicious bindings can improve performance by reducing resource contention
(by spreading processes apart from one another) or improving interprocess
communications (by placing processes close to one another). Binding can
also improve performance reproducibility by eliminating variable process
placement. Unfortunately, binding can also degrade performance by
inhibiting the OS capability to balance loads.
You can run the ompi_info command and look for hwloc
components to see if your system is supported (older versions of Open
MPI used paffinity components). For example:
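A sketch of such a check (the component names and versions printed will differ between installations):

shell$ ompi_info | grep hwloc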
Older versions of Open MPI used paffinity components for process
affinity control; if your version of Open MPI does not have an
hwloc component, see if it has a paffinity component.
Note that processor affinity probably should not be used when a node
is over-subscribed (i.e., more processes are launched than there are
processors). This can lead to a serious degradation in performance
(even more than simply oversubscribing the node). Open MPI will
usually detect this situation and automatically disable the use of
processor affinity (and display run-time warnings to this effect).
231. What is memory affinity? Does Open MPI support it?
Memory affinity is increasingly relevant on modern servers
because most architectures exhibit Non-Uniform Memory Access (NUMA)
architectures. In a NUMA architecture, memory is physically
distributed throughout the machine even though it is virtually treated
as a single address space. That is, memory may be physically local to
one or more processors — and therefore remote to other processors.
Simply put: some memory will be faster to access (for a given process)
than others.
Open MPI supports general and specific memory affinity, meaning that
it generally tries to allocate all memory local to the processor that
asked for it. When shared memory is used for communication, Open MPI
uses memory affinity to make certain pages local to specific
processes in order to minimize memory network/bus traffic.
Open MPI supports memory affinity on a variety of systems.
In recent versions of Open MPI, memory affinity is controlled through
the hwloc component. In earlier versions of Open MPI, memory
affinity was controlled through maffinity components.
Older versions of Open MPI used maffinity components for memory
affinity control; if your version of Open MPI does not have an
hwloc component, see if it has a maffinity component.
Note that memory affinity support is enabled
only when processor affinity is enabled. Specifically: using memory
affinity does not make sense if processor affinity is not enabled
because processes may allocate local memory and then move to a
different processor, potentially remote from the memory that it just
allocated.
232. How do I tell Open MPI to use processor and/or memory affinity?
Assuming that your system supports processor and memory
affinity (check ompi_info for an hwloc component (or, in
earlier Open MPI versions, paffinity and maffinity
components)), you can explicitly tell Open MPI to use them when running
MPI jobs.
Note that memory affinity support is enabled
only when processor affinity is enabled. Specifically: using memory
affinity does not make sense if processor affinity is not enabled
because a process may allocate local memory and then move to a
different processor, potentially far from the memory that it just
allocated.
Also note that processor and memory affinity is meaningless (but
harmless) on uniprocessor machines.
The use of processor and memory affinity has greatly evolved over the
life of the Open MPI project. As such, how to enable / use processor
and memory affinity in Open MPI strongly depends on
which version you are using:
233. How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.2.x? (What is mpi_paffinity_alone?)
Open MPI 1.2 offers only crude affinity control: setting the MCA
parameter mpi_paffinity_alone to 1 enables processor (and, where
supported, memory) affinity.
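For example, a minimal sketch (the process count and executable name
are illustrative):
shell$ mpirun --mca mpi_paffinity_alone 1 -np 4 a.out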
On each node where your job is running, your job's MPI processes will
be bound, one-to-one, in the order of their global MPI ranks, to the
lowest-numbered processing units (for example, cores or hardware threads)
on the node as identified by the OS. Further, memory affinity will also
be enabled if it is supported on the node,
as described in a different FAQ entry.
If multiple jobs are launched on the same node in this manner, they will
compete for the same processing units and severe performance degradation
will likely result. Therefore, this MCA parameter is best used when you
know your job will be "alone" on the nodes where it will run.
Since each process is bound to a single processing unit, performance will
likely suffer catastrophically if processes are multi-threaded.
Depending on how processing units on your node are numbered, the binding
pattern may be good, bad, or even disastrous. For example, performance
might be best if processes are spread out over all processor sockets on
the node. The processor ID numbering, however, might lead to
mpi_paffinity_alone filling one socket before moving to another.
Indeed, on nodes with multiple hardware threads per core (e.g.,
"HyperThreads", "SMT", etc.), the numbering could lead to multiple
processes being bound to a core before the next core is considered.
In such cases, you should probably upgrade to a newer version of Open MPI
or use a different, external mechanism for processor binding.
Note that Open MPI will automatically disable processor affinity on
any node that is oversubscribed (i.e., where more Open MPI processes
are launched in a single job on a node than it has processors) and
will print out warnings to that effect.
Also note, however, that processor affinity is not exclusionary with
Degraded performance mode. Degraded mode is usually only used when
oversubscribing nodes (i.e., running more processes on a node than it
has processors — see this FAQ entry for
more details about oversubscribing, as well as a definition of
Degraded performance mode). It is possible manually to select
Degraded performance mode and use processor affinity as long as you
are not oversubscribing.
234. How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.3.x? (What are rank files?)
Open MPI 1.3 supports the mpi_paffinity_alone MCA parameter
that is described in this FAQ
entry.
Open MPI 1.3 (and higher) also allows a different binding to be specified
for each process via a rankfile. Consider the following example:
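A sketch of a rankfile and how it is passed to mpirun (the hostnames
host1 through host3 and the slot lists are illustrative):
shell$ cat my_rankfile
rank 0=host1 slot=1:0-2
rank 1=host2 slot=0:0,1
rank 2=host3 slot=1-2
shell$ mpirun -np 3 -rf my_rankfile ./a.out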
The rank file specifies a host node and slot list binding for each
MPI process in your job. Note:
Typically, the slot list is a comma-delimited list of ranges. The
numbering is OS/BIOS-dependent and refers to the finest grained processing
units identified by the OS — for example, cores or hardware threads.
Alternatively, a colon can be used in the slot list for socket:core
designations. For example, 1:2-3 means cores 2-3 of socket 1.
It is strongly recommended that you provide a full rankfile when
using such affinity settings, otherwise there would be a very high
probability of processor oversubscription and performance degradation.
The hosts specified in the rankfile must be known to mpirun,
for example, via a list of hosts in a hostfile or as obtained from a
resource manager.
The number of processes np must be provided on the mpirun command
line.
If some processing units are not available — e.g., due to
unpopulated sockets, idled cores, or BIOS settings — the syntax assumes
a logical numbering in which numbers are contiguous despite the physical
gaps. You may refer to actual physical numbers with a "p" prefix.
For example, rank 4=host3 slot=p3:2
will bind rank4 to the physical socket3 : physical core2 pair.
Rank files are also discussed on the mpirun man page.
If you want to use the same slot list binding for each process,
presumably in cases where there is only one process per node, you can
specify this slot list on the command line rather than having to use a
rank file:
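For instance, a sketch that binds every process to socket 0, core 1
(the option spelling in your release may be --slot-list or -slot-list):
shell$ mpirun --slot-list 0:1 -np 4 ./a.out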
Remember, every process will use the same slot list. If multiple processes
run on the same host, they will bind to the same resources — in this case,
socket0:core1, presumably oversubscribing that core and ruining performance.
Slot lists can be used to bind to multiple slots, which would be helpful for
multi-threaded processes. For example:
Two threads per process: rank 0=host1 slot=0,1
Four threads per process: rank 0=host1 slot=0,1,2,3
Note that no thread will be bound to a specific slot within the list. OMPI
only supports process level affinity; each thread will be bound to all
of the slots within the list.
235. How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.4.x? (How do I use the --by* and --bind-to-* options?)
Open MPI 1.4 supports all the same processor affinity controls
as Open MPI v1.3, but also
supports additional command-line binding switches to mpirun:
--bind-to-none: Do not bind processes.
(Default)
--bind-to-core: Bind each MPI process to a core.
--bind-to-socket: Bind each MPI process to a processor socket.
--report-bindings: Report how the launched processes were bound
by Open MPI.
In the case of cores with multiple hardware threads (e.g., "HyperThreads" or
"SMT"), only the first hardware thread on each core is used with the
--bind-to-* options. This will hopefully be fixed in the Open MPI v1.5 series.
The above options are typically most useful when used with the
following switches that indicate how processes are to be laid out in
the MPI job. To be clear: *if the following options are used without
a --bind-to-* option, they only have the effect of deciding which
node a process will run on.* Only the --bind-to-* options actually
bind a process to a specific (set of) hardware resource(s).
--byslot: Alias for --bycore.
--bycore: When laying out processes, put sequential MPI
processes on adjacent processor cores. *(Default)*
--bysocket: When laying out processes, put sequential MPI
processes on adjacent processor sockets.
--bynode: When laying out processes, put sequential MPI
processes on adjacent nodes.
Note that --bycore and --bysocket lay processes out in terms of the
actual hardware rather than by some node-dependent numbering, which
is what mpi_paffinity_alone does as described
in this FAQ entry.
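For example, a sketch that spreads sequential processes across sockets,
binds each to its socket, and reports the resulting bindings (the
process count is illustrative):
shell$ mpirun -np 8 --bysocket --bind-to-socket --report-bindings ./a.out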
Finally, there is a poorly-named "combination" option that affects both process
layout counting and binding: --cpus-per-proc (and an even more poorly-named
alias --cpus-per-rank).
Editor's note: I feel that these options are poorly named for two
reasons: 1) "cpu" is not consistently defined (i.e., it may be a
core, or may be a hardware thread, or it may be something else), and
2) even though many users use the terms "rank" and "MPI process"
interchangeably, they are NOT the same thing.
This option does the following:
Takes an integer argument (ncpus) that indicates how
many operating system processor IDs (which may be cores or may be
hardware threads) should be bound to each MPI process.
Allocates and binds ncpus OS processor IDs to each MPI process.
For example, on a machine with 4 processor sockets, each with 4
processor cores, each with one hardware thread:
shell$ mpirun -np 8 --cpus-per-proc 2 my_mpi_process
This command will bind each MPI process to ncpus=2
cores. All cores on the machine will be used.
Note that ncpus cannot be more than the number of OS processor
IDs in a single processor socket. Put loosely: --cpus-per-proc only
allows binding to multiple cores/threads within a single socket.
The --cpus-per-proc can also be used with the --bind-to-* options
in some cases, but this code is not well tested and may result in
unexpected binding behavior. Test carefully to see where processes
actually get bound before relying on the behavior for production runs.
The --cpus-per-proc and other affinity-related command line options
are likely to be revamped some time during the Open MPI v1.5 series.
236. How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.5.x?
Open MPI 1.5 currently has the same processor affinity
controls as Open MPI v1.4. This
FAQ entry is a placemarker for future enhancements to the 1.5 series'
processor and memory affinity features.
Stay tuned!
237. How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.6 (and beyond)?
The use of processor and memory affinity evolved rapidly,
starting with Open MPI version 1.6.
The mpirun(1) man page for each version of Open MPI contains a lot of
information about the use of processor and memory affinity. You
should consult the mpirun(1) page for your version of Open MPI for
detailed information about processor/memory affinity.
Interactions with other middleware in the MPI process
In some cases, Open MPI will determine that it is not safe to
fork(). In these cases, Open MPI will register a pthread_atfork()
callback to print a warning when the process forks.
This warning is helpful for legacy MPI applications where the current
maintainers are unaware that system() or popen() is being invoked from
an obscure subroutine nestled deep in millions of lines of Fortran code
(we've seen this kind of scenario many times).
However, this atfork handler can be dangerous because there is no way
to unregister an atfork handler. Hence, packages that
dynamically open Open MPI's libraries (e.g., Python bindings for Open
MPI) may fail if they finalize and unload libmpi, but later call
fork. The atfork system will try to invoke Open MPI's atfork handler;
nothing good can come of that.
For such scenarios, or if you simply want to disable printing the
warning, Open MPI can be set to never register the atfork handler with
the mpi_warn_on_fork MCA parameter. For example:
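A minimal sketch that turns the handler and its warning off:
shell$ mpirun --mca mpi_warn_on_fork 0 ...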
239. I want to run some performance benchmarks with Open MPI. How do I do that?
Running benchmarks is an extremely difficult task to
do correctly. There are many, many factors to take into account; it
is not as simple as just compiling and running a stock benchmark
application. This FAQ entry is by no means a definitive guide, but it
does try to offer some suggestions for generating accurate, meaningful
benchmarks.
Decide exactly what you are benchmarking and setup your system
accordingly. For example, if you are trying to benchmark maximum
performance, then many of the suggestions listed below are extremely
relevant (be the only user on the systems and network in question, be
the only software running, use processor affinity, etc.). If you're
trying to benchmark average performance, some of the suggestions below
may be less relevant. Regardless, it is critical to know exactly
what you're trying to benchmark, and know (not guess) both your
system and the benchmark application itself well enough to understand
what the results mean.
To be specific, many benchmark applications are not well understood
for exactly what they are testing. There have been many cases where
users run a given benchmark application and wrongfully conclude that
their system's performance is bad — solely on the basis of a single
benchmark that they did not understand. Read the documentation of the
benchmark carefully, and possibly even look into the code itself to
see exactly what it is testing.
Case in point: not all ping-pong benchmarks are created equal. Most
users assume that a ping-pong benchmark is a ping-pong benchmark is a
ping-pong benchmark. But this is not true; the common ping-pong
benchmarks tend to test subtly different things (e.g., NetPIPE, TCP
bench, IMB, OSU, etc.). *Make sure you understand what your
benchmark is actually testing.*
Make sure that you are the only user on the systems where you
are running the benchmark to eliminate contention from other
processes.
Make sure that you are the only user on the entire network /
interconnect to eliminate network traffic contention from other
processes. This is usually somewhat difficult to do, especially in
larger, shared systems. But your most accurate, repeatable results
will be achieved when you are the only user on the entire
network.
Disable all services and daemons that are not being used. Even
"harmless" daemons consume system resources (such as RAM) and cause
"jitter" by occasionally waking up, consuming CPU cycles, reading
or writing to disk, etc. The optimum benchmark system has an absolute
minimum number of system services running.
Use processor affinity on multi-processor/core machines to
disallow the operating system from swapping MPI processes between
processors (and causing unnecessary cache thrashing, for
example).
On NUMA architectures, having the processes getting bumped from one
socket to another is more expensive in terms of cache locality (with
all of the cache coherency overhead that comes with the lack of it)
than in terms of hypertransport routing (see below).
Non-NUMA architectures such as Intel Woodcrest have a flat access
time to the South Bridge, but cache locality is still important so CPU
affinity is always a good thing to do.
Be sure to understand your system's architecture, particularly
with respect to the memory, disk, and network characteristics, and
test accordingly. For example, on NUMA architectures, most common
being Opteron, the South Bridge is connected through a hypertransport
link to one CPU on one socket. Which socket depends on the
motherboard, but it should be described in the motherboard
documentation (it's not always socket 0!). If a process on the other
socket needs to write something to a NIC on a PCIE bus behind the
South Bridge, it needs to first hop through the first socket. On
modern machines (circa late 2006), this hop usually cost something
like 100 ns (i.e., 0.1 us). If the socket is further away, as in a 4-
or 8-socket configuration, there could potentially be more hops,
leading to more latency.
Compile your benchmark with the appropriate compiler optimization
flags. With some MPI implementations, the compiler wrappers (like
mpicc, mpif90, etc.) add optimization flags automatically.
Open MPI does not. Add -O or other flags explicitly.
Make sure your benchmark runs for a sufficient amount of time.
Short-running benchmarks are generally less accurate because they take
fewer samples; longer-running jobs tend to take more samples.
If your benchmark is trying to benchmark extremely short events
(such as the time required for a single ping-pong of messages):
Perform some "warmup" events first. Many MPI implementations
(including Open MPI) — and other subsystems that MPI relies upon —
may use "lazy" semantics to set up and maintain streams of
communications. Hence, the first event (or first few events)
may well take significantly longer than subsequent events.
Use a high-resolution timer if possible — gettimeofday() only
returns microsecond granularity, and its actual precision is sometimes
only on the order of several microseconds.
Run the event many, many times (hundreds or thousands, depending
on the event and the time it takes). Not only does this provide
more samples, it may also be necessary, especially when the precision
of the timer you're using may be several orders of magnitude less
precise than the event you're trying to benchmark.
Decide whether you are reporting minimum, average, or maximum
numbers, and have good reasons why.
Accurately label and report all results. Reproducibility is a
major goal of benchmarking; benchmark results are effectively useless
if they are not precisely labeled as to exactly what they are
reporting. Keep a log and detailed notes about the exact system
configuration that you are benchmarking. Note, for example, all
hardware and software characteristics (to include hardware, firmware,
and software versions as appropriate).
240. I am getting a MPI_Win_free error from IMB-EXT — what do I do?
When you run IMB-EXT with Open MPI, you'll see a
message like this:
[node01.example.com:2228] *** An error occurred in MPI_Win_free
[node01.example.com:2228] *** on win
[node01.example.com:2228] *** MPI_ERR_RMA_SYNC: error while executing rma sync
[node01.example.com:2228] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
This is due to a bug in the Intel MPI Benchmarks, known to be in at
least versions v3.1 and v3.2. Intel was notified of this bug in May
of 2009. If you have a version after then, it should include this bug
fix. If not, here is the fix that you can apply to the IMB-EXT source
code yourself.
Here is a small patch that fixes the bug in IMB v3.2:
241. What is the vader BTL?
The vader BTL is a low-latency, high-bandwidth
mechanism for transferring data between two processes via shared memory.
This BTL can only be used between processes executing on the same node.
Beginning with the v1.8 series, the vader BTL replaces the sm BTL unless
the local system lacks the required support or the user specifically requests
the latter be used. At this time, vader requires CMA support which is typically
found in more current kernels. Thus, systems based on older kernels may default
to the slower sm BTL.
242. What is the sm BTL?
The sm BTL (shared-memory Byte Transfer Layer) is a low-latency, high-bandwidth
mechanism for transferring data between two processes via shared memory.
This BTL can only be used between processes executing on the same node.
The sm BTL has high exclusivity. That is, if one process can reach another
process via sm, then no other BTL will be considered for that connection.
Note that with Open MPI v1.3.2, the sm so-called "FIFOs" were reimplemented and
the sizing of the shared-memory area was changed. So, much of this FAQ will
distinguish between releases up to Open MPI v1.3.1 and releases starting with Open MPI v1.3.2.
243. How do I specify use of sm for MPI messages?
Typically, it is unnecessary to do so; OMPI will use the best BTL available
for each communication.
Nevertheless, you may use the MCA parameter btl. You should also specify the
self BTL for communications between a process and itself. Furthermore, if not all
processes in your job will run on the same, single node, then you also need
to specify a BTL for internode communications. For example:
shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out
244. How does the sm BTL work?
A point-to-point user message is broken up by the PML into fragments.
The sm BTL only has to transfer individual fragments. The steps are:
The sender pulls a shared-memory fragment out of one of its free lists.
Each process has one free list for smaller (e.g., 4Kbyte) eager
fragments and another free list for larger (e.g., 32Kbyte) max fragments.
The sender packs the user-message fragment into this shared-memory
fragment, including any header information.
The sender posts a pointer to this shared fragment into the
appropriate FIFO (first-in-first-out) queue of the receiver.
The receiver polls its FIFO(s). When it finds a new fragment
pointer, it unpacks data out of the shared-memory fragment and notifies
the sender that the shared fragment is ready for reuse (to be
returned to the sender's free list).
On each node where an MPI job has two or more processes running, the job creates
a file that each process mmaps into its address space. Shared-memory
resources that the job needs — such as FIFOs and fragment free lists — are
allocated from this shared-memory area.
245. Why does my MPI job no longer start when there are too many processes on
one node?
If you are using Open MPI v1.3.1 or earlier, it is possible that the shared-memory
area set aside for your job was not created large enough. Make sure you're running
in 64-bit mode (compiled with -m64) and set the MCA parameter mpool_sm_max_size
to be very large — even several Gbytes. Exactly how large is discussed further
below.
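As a sketch (the size shown, 2 GB, assumes the parameter is expressed
in bytes and is illustrative, as is the process count):
shell$ mpirun --mca mpool_sm_max_size 2147483648 -np 128 ./a.out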
Regardless of which OMPI release you're using, make sure that there is sufficient
space for a large file to back the shared memory — typically in /tmp.
246. How do I know what MCA parameters are available for tuning MPI performance?
The ompi_info command can display all the parameters available for the
sm BTL and sm mpool:
shell$ ompi_info --param btl sm
shell$ ompi_info --param mpool sm
247. How can I tune these parameters to improve performance?
Mostly, the default values of the MCA parameters have already
been chosen to give good performance. To improve performance further
is a little bit of an art. Sometimes, it's a matter of trading off
performance for memory.
btl_sm_eager_limit:
If message data plus header information fits within this limit, the
message is sent "eagerly" — that is, a sender attempts to write
its entire message to shared buffers without waiting for a receiver to
be ready. Above this size, a sender will only write the first part of
a message, then wait for the receiver to acknowledge its readiness before
continuing. Eager sends can improve performance by decoupling
senders from receivers.
btl_sm_max_send_size:
Large messages are sent in fragments of this size. Larger segments
can lead to greater efficiencies, though they could perhaps also
inhibit pipelining between sender and receiver.
btl_sm_num_fifos:
Starting in Open MPI v1.3.2, this is the number of FIFOs per receiving
process. By default, there is only one FIFO per process.
Conceivably, if many senders are all sending to the same process and
contending for a single FIFO, there will be congestion. If there are
many FIFOs, however, the receiver must poll more FIFOs to find
incoming messages. Therefore, you might try increasing this
parameter slightly if you have many (at least dozens) of processes
all sending to the same process. For example, if 100 senders are all
contending for a single FIFO for a particular receiver, it may suffice
to increase btl_sm_num_fifos from 1 to 2.
btl_sm_fifo_size:
Starting in Open MPI v1.3.2, this parameter sets the FIFO size, and
FIFOs no longer grow dynamically. If you believe a FIFO is getting
congested because a process falls far behind in reading incoming
message fragments, increase this size manually.
btl_sm_free_list_num:
This is the initial number of fragments on each (eager and max) free
list. The free lists can grow in response to resource congestion, but
you can increase this parameter to pre-reserve space for more
fragments.
mpool_sm_min_size:
You can reserve headroom for the shared-memory area to grow by
increasing this parameter.
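Any of the parameters above can be overridden on the mpirun command
line. For example, a sketch that doubles the number of FIFOs per
receiver for a job with many on-node senders (values are illustrative):
shell$ mpirun --mca btl_sm_num_fifos 2 -np 128 ./a.out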
248. Where is the file that sm will mmap in?
The file will be in the OMPI session directory, which is typically
something like /tmp/openmpi-sessions-myusername@mynodename/* .
The file itself will have the name
shared_mem_pool.mynodename. For example, the full path could be
/tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.
To place the session directory in a non-default location, use the MCA parameter
orte_tmpdir_base.
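For example, a sketch that moves the session directory (the path
/local/scratch is illustrative; use a local filesystem on your nodes):
shell$ mpirun --mca orte_tmpdir_base /local/scratch -np 16 ./a.out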
249. Why am I seeing incredibly poor performance with the sm BTL?
The most common problem with the shared memory BTL is when the
Open MPI session directory is placed on a network filesystem (e.g., if
/tmp is not on a local disk). This is because the shared memory BTL
places a memory-mapped file in the Open MPI session directory (see this entry for more details). If the
session directory is located on a network filesystem, the shared
memory BTL latency will be extremely high.
Try not mounting /tmp as a network filesystem, and/or moving the Open
MPI session directory to a local filesystem.
Some users have reported success and possible performance
optimizations with having /tmp mounted as a "tmpfs" filesystem
(i.e., a RAM-based filesystem). However, before configuring
your system this way, be aware of a few items:
Open MPI writes a few small meta data files into /tmp and may
therefore consume some extra memory that could have otherwise been
used for application instruction or data state.
If you use the "filem" system in Open MPI for moving
executables between nodes, these files are stored under /tmp.
Open MPI's checkpoint / restart files can also be saved under
/tmp.
If the Open MPI job is terminated abnormally, there are some
circumstances where files (including memory-mapped shared memory
files) can be left in /tmp. This can happen, for example, when a
resource manager forcibly kills an Open MPI job and does not give it
the chance to clean up /tmp files and directories.
Some users have reported success with configuring their resource
manager to run a script between jobs to forcibly empty the /tmp
directory.
250. Can I use SysV instead of mmap?
In the v1.3 and v1.4 Open MPI series, shared memory is established
via mmap. In future releases, there may be an option for using SysV
shared memory.
251. How much shared memory will my job use?
Your job will create a shared-memory area on each node where
it has two or more processes. This area will be fixed during the
lifetime of your job. Shared-memory allocations (for FIFOs and
fragment free lists) will be made in this area. Here, we look at the
size of that shared-memory area.
If you want just one hard number, then go with approximately 128
Mbytes per node per job, shared by all the job's processes on that
node. That is, an OMPI job will need more than a few Mbytes per node,
but typically less than a few Gbytes.
Better yet, read on.
Up through Open MPI v1.3.1, the shared-memory file would basically be
sized thusly (roughly):
max( min( n × mpool_sm_per_peer_size , mpool_sm_max_size ) , mpool_sm_min_size )
where n is the number of processes in the job running on that
particular node and the mpool_sm_* are MCA parameters. For small
n, this size is typically excessive. For large n (e.g., 128 MPI
processes on the same node), this size may not be sufficient for the
job to start.
Starting in OMPI v1.3.2, a more sophisticated formula was introduced to
model more closely how much memory was actually needed. That formula
is somewhat complicated and subject to change. It guarantees that
there will be at least enough shared memory for the program to start
up and run. See this FAQ item to see
how much is needed. Alternatively, the motivated user can examine the
OMPI source code to see the formula used — for example, here is the
formula in OMPI commit 463f11f.
OMPI v1.3.2 also uses the MCA parameter mpool_sm_min_size to set a
minimum size — e.g., so that there is not only enough shared memory
for the job to start, but additionally headroom for further
shared-memory allocations (e.g., of more eager or max fragments).
Once the shared-memory area is established, it will not grow further
during the course of the MPI job's run.
252. How much shared memory do I need?
In most cases, OMPI will start your job with sufficient shared
memory.
Nevertheless, if OMPI doesn't get you enough shared memory (e.g.,
you're using OMPI v1.3.1 or earlier with roughly 128 processes or more
on a single node) or you want to trim shared-memory consumption, you
may want to know how much shared memory is really needed.
As we saw earlier, the shared
memory area contains:
FIFOs
eager fragments
max fragments
In general, you need only enough shared memory for the FIFOs and
fragments that are allocated during MPI_Init().
Beyond that, you might want additional shared memory for performance
reasons, so that FIFOs and fragment lists can grow if your program's
message traffic encounters resource congestion. Even if there is no
room to grow, however, your correctly written MPI program should still
run to completion in the face of congestion; performance simply degrades
somewhat. Note that while shared-memory resources can grow after
MPI_Init(), they cannot shrink.
So, how much shared memory is needed during MPI_Init() ?
You need approximately the total of:
FIFOs:
(≤ Open MPI v1.3.1): 3 × n × n × pagesize
(≥ Open MPI v1.3.2): n × btl_sm_num_fifos × btl_sm_fifo_size × sizeof(void *)
eager fragments: n × ( 2 × n + btl_sm_free_list_inc ) × btl_sm_eager_limit
max fragments: n × btl_sm_free_list_num × btl_sm_max_send_size
where:
n is the number of MPI processes in your job on the node
pagesize is the OS page size (4KB for Linux and 8KB for Solaris)
btl_sm_* are MCA parameters
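To plug real numbers into these formulas, you can query the defaults in
your installation (the grep pattern is merely a convenience):
shell$ ompi_info --param btl sm | grep -E 'eager_limit|max_send_size|free_list|num_fifos|fifo_size'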
253. How can I decrease my shared-memory usage?
There are two parts to this question.
First, how does one reduce how big the mmap file is? The answer is:
Up to Open MPI v1.3.1: Reduce mpool_sm_per_peer_size, mpool_sm_min_size,
and mpool_sm_max_size
Starting with Open MPI v1.3.2: Reduce mpool_sm_min_size
Second, how does one reduce how much shared memory is needed? (Just
making the mmap file smaller doesn't help if then your job won't
start up.) The answers are:
For small values of n — that is, for few processes per node —
shared-memory usage during MPI_Init() is predominantly for max free lists.
So, you can reduce the MCA parameter btl_sm_max_send_size. Alternatively,
you could reduce btl_sm_free_list_num, but it is already pretty small by
default.
For large values of n — that is, for many processes per node — there
are two cases:
Up to Open MPI v1.3.1: Shared-memory usage is dominated by the
FIFOs, which consume a certain number of pages. Usage is
high and cannot be reduced much via MCA parameter tuning.
Starting with Open MPI v1.3.2: Shared-memory usage is dominated
by the eager free lists. So, you can reduce the MCA parameter
btl_sm_eager_limit.
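For example, a sketch that reduces the eager limit (the value is
illustrative; check your installation's default first):
shell$ mpirun --mca btl_sm_eager_limit 2048 -np 128 ./a.out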
254. How do I specify to use the IP network for MPI messages?
In general, you specify that the tcp BTL component should be
used. This will direct Open MPI to use TCP-based communications over
IP interfaces / networks.
However, note that you should also specify that the self
BTL component should be used. self is for loopback communication
(i.e., when an MPI process sends to itself), and is technically a
different communication channel than TCP. For example:
shell$ mpirun --mca btl tcp,self ...
Failure to specify the self BTL may result in Open MPI being unable
to complete send-to-self scenarios (meaning that your program will run
fine until a process tries to send to itself).
Note that if the tcp BTL is available at run time (which it should
be on most POSIX-like systems), Open MPI should automatically use it
by default (ditto for self). Hence, it's usually unnecessary to
specify these options on the mpirun command line. They are
typically only used when you want to be absolutely positively
definitely sure to use the specific BTL.
255. But wait — I'm using a high-speed network. Do I have to
disable the TCP BTL?
No. Following the so-called "Law of Least Astonishment",
Open MPI assumes that if you have both an IP network and at least one
high-speed network (such as InfiniBand), you will likely
only want to use the high-speed network(s) for MPI message passing.
Hence, the tcp BTL component will sense this and automatically
deactivate itself.
That being said, Open MPI may still use TCP for setup and teardown
information — so you'll see traffic across your IP network during
startup and shutdown of your MPI job. This is normal and does not
affect the MPI message passing channels.
256. How do I know what MCA parameters are available for tuning MPI performance?
The ompi_info command can display all the parameters
available for the tcp BTL component (i.e., the component that uses
TCP for MPI communications):
shell$ ompi_info --param btl tcp --level 9
NOTE: Prior to the Open
MPI 1.7 series, ompi_info would show all MCA parameters by default.
Starting with Open MPI v1.7, you need to specify --level 9 (or
--all) to show all MCA parameters.
257. Does Open MPI use the IP loopback interface?
Usually not.
In general message passing usage, there are two scenarios in which
the IP loopback interface could be used:
Sending a message from one process to itself
Sending a message from one process to another process on the same
machine
The TCP BTL does not handle "send-to-self" scenarios in Open MPI;
indeed, it is not even capable of doing so. Instead, the self BTL
component is used for all send-to-self MPI communications. Not only
does this allow all Open MPI BTL components to avoid special case code
for send-to-self scenarios, it also allows avoiding using inefficient
loopback network stacks (such as the IP loopback device).
Specifically: the self component uses its own mechanisms for
send-to-self scenarios; it does not use network interfaces.
When sending to other processes on the same machine, Open MPI will
default to using a shared memory BTL (sm or vader).
If the user has deactivated these BTLs, depending on what other BTL
components are available, it is possible that the TCP BTL will be
chosen for message passing to processes on the same node, in which
case the IP loopback device will likely be used. But this is not the
default; either shared memory has to fail to startup properly or the
user must specifically request not to use the shared memory BTL.
258. I have multiple IP networks on some/all of my cluster nodes. Which ones will Open MPI use?
In general, Open MPI will greedily use all IP networks that
it finds per its reachability
computations.
To change this behavior, you can either specifically include certain
networks or specifically exclude certain networks. See this FAQ entry for more details.
259. I'm getting TCP-related errors. What do they mean?
TCP-related errors are usually reported by Open MPI in a
message similar to these:
[btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
mca_btl_tcp_frag_send: writev failed with errno=104
If an error number is displayed with no explanation string, you can
see what that specific error number means on your operating system
with the following command (the following example was run on Linux;
results may be different on other operating systems):
shell$ perl -e 'die$!=113'
No route to host at -e line 1.
shell$ perl -e 'die$!=104'
Connection reset by peer at -e line 1.
Two types of errors are commonly reported to the Open MPI user's
mailing list:
No route to host: These types of errors usually mean that
there are multiple IP interfaces available and they do not obey Open
MPI's assumptions about routability. See these two FAQ items for more
information:
Connection reset by peer: These types of errors usually occur
after MPI_INIT has completed, and typically indicate that an MPI
process has died unexpectedly (e.g., due to a seg fault). The
specific error message indicates that a peer MPI process tried to
write to the now-dead MPI process and failed.
260. How do I tell Open MPI which IP interfaces / networks to use?
In some parallel environments, it is not uncommon to have
multiple IP interfaces on each node — for example, one IP network
may be "slow" and used for control information such as a batch
scheduler, a networked filesystem, and/or interactive logins. Another
IP network (or networks) may be "fast" and be intended for parallel
applications to use during their runs. As another example, some
operating systems may also have virtual interfaces for communicating
with virtual machines.
Unless otherwise specified, Open MPI will greedily use all "up" IP
networks that it can find and try to connect to all peers _upon
demand_ (i.e., Open MPI does not open sockets to all of its MPI peers
during MPI_INIT — see this FAQ entry
for more details). Hence, if you want MPI jobs to not use specific
IP networks — or not use any IP networks at all — then you need to
tell Open MPI.
NOTE: Aggressively using all "up" interfaces can cause problems in
some cases. For example, if you have a machine with a local-only
interface (e.g., the loopback device, or a virtual-machine bridge
device that can only be used on that machine, and cannot be used to
communicate with MPI processes on other machines), you will likely
need to tell Open MPI to ignore these networks. Open MPI usually
ignores loopback devices by default, but *other local-only devices
must be manually ignored.* Users have reported cases where RHEL6
automatically installed a "virbr0" device for Xen virtualization.
This interface was automatically given an IP address in the
192.168.1.0/24 subnet and marked as "up". Since Open MPI saw this
192.168.1.0/24 "up" interface in all MPI processes on all nodes, it
assumed that that network was usable for MPI communications. This is
obviously incorrect, and it led to MPI applications hanging when they
tried to send or receive MPI messages.
To disable Open MPI from using TCP for MPI communications, the
tcp MCA parameter should be set accordingly. You can either
exclude the TCP component or include all other components.
Specifically:
# This says to exclude the TCP BTL component
# (implicitly including all others)
shell$ mpirun --mca btl ^tcp ...

# This says to include only the listed BTL components
# (tcp is not listed, and therefore will not be used)
shell$ mpirun --mca btl self,vader,openib ...
If you want to use TCP for MPI communications, but want to
restrict it from certain networks, use the btl_tcp_if_include or
btl_tcp_if_exclude MCA parameters (only one of the two should be
set). The values of these parameters can be a comma-delimited list of
network interfaces. For example:
# This says to not use the eth0 and lo interfaces.
# (and implicitly use all the rest). Per the description
# above, IP loopback and all local-only devices *must*
# be included if the exclude list is specified.
shell$ mpirun --mca btl_tcp_if_exclude lo,eth0 ...

# This says to only use the eth1 and eth2 interfaces
# (and implicitly ignore the rest)
shell$ mpirun --mca btl_tcp_if_include eth1,eth2 ...
Starting in the Open MPI v1.5 series, you can specify subnets in the
include or exclude lists in CIDR notation. For example:
# Only use the 192.168.1.0/24 and 10.10.0.0/16 subnets for MPI
# communications:
shell$ mpirun --mca btl_tcp_if_include 192.168.1.0/24,10.10.0.0/16 ...
NOTE: You must specify the
CIDR notation for a given network precisely. For example, if you have
two IP networks 10.10.0.0/24 and 10.10.1.0/24, Open MPI will not
recognize either of them if you specify "10.10.0.0/16".
NOTE: If you use the
btl_tcp_if_include and btl_tcp_if_exclude MCA parameters to shape
the behavior of the TCP BTL for MPI communications, you may also
need/want to investigate the corresponding MCA parameters
oob_tcp_if_include and oob_tcp_if_exclude, which are used to shape
non-MPI TCP-based communication (e.g., communications setup and
coordination during MPI_INIT and MPI_FINALIZE).
Note that Open MPI will still use TCP for control messages, such as
data between mpirun and the MPI processes, rendezvous information
during MPI_INIT, etc. To disable TCP altogether, you also need to
disable the tcp component from the OOB framework.
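A sketch of disabling TCP in both places (whether another OOB transport
is available depends on how your Open MPI was built, so this may simply
prevent the job from starting):
shell$ mpirun --mca btl ^tcp --mca oob ^tcp ...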
261. Does Open MPI open a bunch of sockets during MPI_INIT?
Although Open MPI is likely to open multiple TCP sockets
during MPI_INIT, the tcp BTL component *does not open one socket per
MPI peer process during MPI_INIT.* Open MPI opens
sockets as they are required — so the first time a process sends a
message to a peer and there is not yet a TCP connection between the two,
Open MPI will automatically open a new socket.
Hence, you should not have scalability issues with running large
numbers of processes (e.g., running out of per-process file
descriptors) if your parallel application is sparse in its
communication with peers.
262. Are there any Linux kernel TCP parameters that I should set?
Everyone has different opinions on this, and it also depends
on your exact hardware and environment. Below are general guidelines
that some users have found helpful.
net.ipv4.tcp_syn_retries: Some Linux systems
have very large initial connection timeouts — they retry sending SYN
packets many times before determining that a connection cannot be
made. If MPI is going to fail to make socket connections, it would be
better for them to fail somewhat quickly (minutes vs. hours). You
might want to reduce this value to a smaller value; YMMV.
net.ipv4.tcp_keepalive_time: Some MPI
applications send an initial burst of MPI messages (over TCP) and then
send nothing for long periods of time (e.g., embarrassingly parallel
applications). Linux may decide that these dormant TCP sockets are
dead because it has seen no traffic on them for long periods of time.
You might therefore need to lengthen the TCP inactivity timeout. Many
Linux systems default to 7,200 seconds; increase it if necessary.
Increase TCP buffering for 10G or 40G Ethernet. Many Linux distributions
come with good buffering presets for 1G Ethernet. In a datacenter/HPC
cluster with 10G or 40G Ethernet NICs, this amount of kernel buffering
is typically insufficient. Here's a set of parameters that some have
used for good 10G/40G TCP bandwidth:
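One illustrative set, expressed as sysctl commands (run as root; the
buffer sizes are common suggestions for 10G links, not Open MPI
recommendations, so adjust for your hardware and kernel):
shell$ sysctl -w net.core.rmem_max=16777216
shell$ sysctl -w net.core.wmem_max=16777216
shell$ sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
shell$ sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"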
Your Linux distro may also support putting individual files
in /etc/sysctl.d (even if that directory does not yet exist), which
is actually better practice than putting them in /etc/sysctl.conf.
For example:
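A sketch of such a file (the filename and values are illustrative):
shell$ cat /etc/sysctl.d/20-ethernet-tuning.conf
# Increase TCP buffering for 10G/40G Ethernet
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216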
263. How does Open MPI know which IP addresses are routable to each other in Open MPI 1.2?
This is a fairly complicated question — there can be
ambiguity when hosts have multiple NICs and/or there are multiple
IP networks that are not routable to each other in a single MPI job.
It is important to note that Open MPI's atomic unit of routing is a
process — not an IP address. Hence, Open MPI makes connections
between processes, not nodes (these processes are almost always on
remote nodes, but it's still better to think in terms of processes,
not nodes).
Specifically, since Open MPI jobs can span multiple IP networks, each
MPI process may be able to use multiple IP addresses to communicate
with each other MPI process (and vice versa). So for each process,
Open MPI needs to determine which IP address — if any — to use to
connect to a peer MPI process.
For example, say that you have a cluster with 16 nodes on a private
ethernet network. One of these nodes doubles as the head node for the
cluster and therefore has 2 ethernet NICs — one to the external
network and one to the internal cluster network. But since 16 is a
nice number, you also want to use it for computation as well. So when
you mpirun spanning all 16 nodes, OMPI has to figure out to not use
the external NIC on the head node and only use the internal NIC.
To explain what happens, we need to explain some of what happens in
MPI_INIT. Even though Open MPI only makes TCP connections between
peer MPI processes upon demand (see this FAQ
entry), each process publishes its TCP contact information which
is then made available to all processes. Hence, every process knows
the IP address(es) and corresponding port number(s) to contact every
other process.
But keep in mind that these addresses may span multiple IP networks
and/or not be routable to each other. So when a connection is
requested, the TCP BTL component in Open MPI creates pairwise
combinations of all the IP addresses of the localhost to all the IP
addresses of the peer process, looking for a match.
A "match" is defined by the following rules:
If the two IP addresses match after the subnet mask is applied,
assume that they are mutually routable and allow the connection.
If the two IP addresses are public, assume that they are mutually
routable and allow the connection.
Otherwise, the connection is disallowed (this is not an error —
we just disallow this connection on the hope that some other
device can be used to make a connection).
These rules tend to cover the following scenarios:
A cluster on a private network with a head node that has a NIC on
the private network and the public network
Clusters that have all public addresses
These rules do not cover the following cases:
Running an MPI job that spans public and private networks
Running an MPI job that spans a bunch of private networks with
narrowly-scoped netmasks, such as nodes that have IP addresses
192.168.1.10 and 192.168.2.10 with netmasks of 255.255.255.0 (i.e.,
the network fabric makes these two nodes be routable to each other,
even though the netmask implies that they are on different
subnets).
264. How does Open MPI know which IP addresses are routable to each other in Open MPI 1.3 (and beyond)?
Starting with the Open MPI v1.3 series, assumptions about
routability are much different than prior series.
With v1.3 and later, Open MPI assumes that all interfaces are routable
as long as they have the same address family, IPv4 or IPv6. We use
graph theory and give each possible connection a weight depending on
the quality of the connection. This allows the library to select the
best connections between nodes. This method also supports striping
but prevents more than one connection to any interface.
The quality of the connection is defined as follows, with a higher
number meaning better connection. Note that when giving a weight to a
connection consisting of a private address and a public address, it will
give it the weight of PRIVATE_DIFFERENT_NETWORK.
At this point, an example will best illustrate how two processes on two
different nodes would connect up. Here we have two nodes with a variety
of interfaces.
From these two nodes, the software builds up a bipartite graph that
shows all the possible connections with all the possible weights. The
lo0 interfaces are excluded as the btl_tcp_if_exclude MCA parameter
is set to lo by default. Here is what all the possible connections
with their weights look like.
The library then examines all the connections and picks the optimal
ones. This leaves us with two connections being established between
the two nodes.
If you are curious about the actual connect() calls being made by
the processes, then you can run with --mca btl_base_verbose 30.
This can be useful if you notice your job hanging and believe it may
be the library trying to make connections to unreachable hosts.
# Here is an example with some of the output deleted for clarity.
# One can see the connections that are attempted.
shell$ mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 -np 2 -host NodeA,NodeB a.out
[...snip...]
[NodeA:18003] btl: tcp: attempting to connect() to address 10.8.47.2 on port 59822
[NodeA:18003] btl: tcp: attempting to connect() to address 192.168.1.2 on port 59822
[NodeB:16842] btl: tcp: attempting to connect() to address 192.168.1.1 on port 44500
[...snip...]
In case you want more details about the theory behind the connection
code, you can find the background story in a brief
IEEE paper.
265. Does Open MPI ever close TCP sockets?
In general, no.
Although TCP sockets are opened "lazily" (meaning that MPI
connections / TCP sockets are only opened upon demand — as opposed to
opening all possible sockets between MPI peer processes during
MPI_INIT), they are never closed.
266. Does Open MPI support IP interfaces that have more than one IP address?
In general, no.
For example, if the output from your ifconfig has a single IP device
with multiple IP addresses like this:
0: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:18:ae:f4:d2:29 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.3/24 brd 192.168.0.255 scope global eth0:1
inet 10.10.0.3/24 brd 10.10.0.255 scope global eth0
inet6 fe80::218:aef2:29b4:2c4/64 scope link
valid_lft forever preferred_lft forever
(note the two "inet" lines in there)
Then Open MPI will be unable to use this device.
267. Does Open MPI support virtual IP interfaces?
No.
For example, if the output of your ifconfig has both "eth0" and
"eth0:0", Open MPI will get confused if you use the TCP BTL, and
may hang or otherwise act unpredictably.
Note that using btl_tcp_if_include or btl_tcp_if_exclude to avoid
using the virtual interface will not solve the issue.
This may get fixed in a future release. See GitHub issue
#160 to follow the progress on this issue.
268. Why do I only see 5 Gbps bandwidth benchmark results on 10 GbE or faster networks?
Before the 3.0 release series, Open MPI set two TCP tuning
parameters which, while a little large for 1 Gbps networks in 2005,
were woefully undersized for modern 10 Gbps networks. Further, the
Linux kernel TCP stack has progressed to a dynamic buffer scheme,
allowing even larger buffers (and therefore window sizes). The Open
MPI parameters meant that for most any multi-switch 10 GbE
configuration, the TCP window could not cover the bandwidth-delay
product of the network and, therefore, a single TCP flow could not
saturate the network link.
Open MPI 3.0 and later removed the problematic tuning parameters and
let the kernel do its (much more intelligent) thing. If you still see
unexpected bandwidth numbers in your network, this may be a bug.
Please file a GitHub Issue.
The tuning parameter patch was backported to the 2.0 series in 2.0.3
and the 2.1 series in 2.1.2, so those versions and later should also
not require workarounds. For earlier versions, the parameters can be
modified with an MCA parameter:
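A sketch for those earlier versions (setting both values to 0 defers
buffer sizing to the kernel; the parameter names assume the TCP BTL in
those releases):
shell$ mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 ...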
269. Can I use multiple TCP connections to improve network performance?
Open MPI 4.0.0 and later can use multiple TCP connections
between any pair of MPI processes, striping large messages across the
connections. The btl_tcp_links parameter can be used to set how
many TCP connections should be established between MPI ranks. Note
that this may not improve application performance for common use cases
of nearest-neighbor exchanges when there are many MPI ranks on each host.
In these cases, there are already many TCP connections between any two
hosts (because of the many ranks all communicating), so the extra TCP
connections are likely just consuming extra resources and adding work
to the MPI implementation. However, for highly multi-threaded
applications, where there are only one or two MPI ranks per host, the
btl_tcp_links option may improve TCP throughput considerably.
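For example, a sketch for a multi-threaded application with one rank
per host (the value 4 is illustrative):
shell$ mpirun --mca btl_tcp_links 4 -np 2 ./a.out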
270. What Myrinet-based components does Open MPI have?
Some versions of Open MPI support both GM and MX for MPI
communications.
Open MPI series      GM supported   MX supported
v1.0 series          Yes            Yes
v1.1 series          Yes            Yes
v1.2 series          Yes            Yes (BTL and MTL)
v1.3 / v1.4 series   Yes            Yes (BTL and MTL)
v1.5 / v1.6 series   No             Yes (BTL and MTL)
v1.7 / v1.8 series   No             Yes (MTL only)
v1.10 and beyond     No             No
271. How do I specify to use the Myrinet GM network for MPI messages?
In general, you specify that the gm BTL component should be used.
However, note that you should also specify that the self BTL component
should be used. self is for loopback communication (i.e., when an MPI
process sends to itself). This is technically a different
communication channel than Myrinet. For example:
shell$ mpirun --mca btl gm,self ...
Failure to specify the self BTL may result in Open MPI being unable
to complete send-to-self scenarios (meaning that your program will run
fine until a process tries to send to itself).
To use Open MPI's shared memory support for on-host communication
instead of GM's shared memory support, simply include the sm BTL.
For example:
shell$ mpirun --mca btl gm,sm,self ...
Finally, note that if the gm component is
available at run time, Open MPI should automatically use it by
default (ditto for self and sm). Hence, it's usually unnecessary to
specify these options on the mpirun command line. They are
typically only used when you want to be absolutely positively
definitely sure to use the specific BTL.
272. How do I specify to use the Myrinet MX network for MPI messages?
As of version 1.2, Open MPI has two different components
to support Myrinet MX, the mx BTL and the mx MTL, only one of which can be
used at a time. Prior versions only have the mx BTL.
If available, the mx BTL is used by default. However, to be sure it is
selected you can specify it. Note that you should also specify the
self BTL component (for loopback communication) and the sm BTL
component (for on-host communication). For example:
shell$ mpirun --mca btl mx,sm,self ...
To use the mx MTL component, it must be specified. Also, you must use
the cm PML component. For example:
shell$ mpirun --mca mtl mx --mca pml cm ...
Note that one cannot use both the mx MTL and the mx BTL components
at once. Deciding which to use largely depends on the application being
run.
273. But wait — I also have a TCP network. Do I need to explicitly
disable the TCP BTL?
No. As described in the earlier FAQ entry about high-speed networks and
the TCP BTL, the tcp component will detect the presence of the Myrinet
network and automatically deactivate itself for MPI message passing.
274. How do I know what MCA parameters are available for tuning MPI performance?
The ompi_info command can display all the parameters
available for the gm and mx BTL components and the mx MTL component:
# Show the gm BTL parameters
shell$ ompi_info --param btl gm

# Show the mx BTL parameters
shell$ ompi_info --param btl mx

# Show the mx MTL parameters
shell$ ompi_info --param mtl mx
275. I'm experiencing a problem with Open MPI on my Myrinet-based network; how do I troubleshoot and get help?
In order for us to help you, it is most helpful if you can
run a few steps before sending an e-mail to both perform some basic
troubleshooting and provide us with enough information about your
environment to help you. Please include answers to the following
questions in your e-mail:
Which Myricom software stack are you running: GM or MX? Which
version?
Are you using "fma", the "gm_mapper", or the "mx_mapper"?
If running GM, include the output from running gm_board_info
on a known "good" node and a known "bad" node.
If running MX, include the output from running mx_info from a known
"good" node and a known "bad" node.
Is the "Map version" value from this output the same across
all nodes?
NOTE: If the map version
is not the same, ensure that you are not running a mixture of FMA on
some nodes and the mapper on others. Also check the connectivity of
nodes that seem to have an inconsistent map version.
What are the contents of the file
/var/run/fms/fma.log?
Gather up this information and see
this page about how to submit a help request to the user's mailing
list.
276. How do I adjust the MX first fragment size? Are there constraints?
The MX library limits the maximum message fragment size for
both on-node and off-node messages. As of MX v1.0.3, the inter-node
maximum fragment size is 32k, and the intra-node maximum fragment size
is 16k — fragments sent larger than these sizes will fail.
Open MPI automatically fragments large messages; it currently limits
its first fragment size on MX networks to the lower of these two
values — 16k. As such, increasing the value of the MCA parameter
btl_mx_first_frag_size larger than 16k may cause failures in
some cases (e.g., when using MX to send large messages to processes on
the same node); it will cause failures in all cases if it is set above
32k.
Note that this only affects the first fragment of messages; later
fragments do not have this size restriction. The MCA parameter
btl_mx_max_send_size can be used to vary the maximum size of
subsequent fragments.
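For example, a sketch that raises the size of subsequent fragments
(the value is illustrative; the first-fragment limits described above
still apply):
shell$ mpirun --mca btl mx,sm,self --mca btl_mx_max_send_size 65536 ...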
277. What Open MPI components support InfiniBand / RoCE / iWARP?
In order to meet the needs of an ever-changing networking
hardware and software ecosystem, Open MPI's support of InfiniBand,
RoCE, and iWARP has evolved over time.
Here is a summary of components in Open MPI that support InfiniBand,
RoCE, and/or iWARP, ordered by Open MPI release series:
Open MPI series      OpenFabrics support
v1.0 series          openib and mvapi BTLs
v1.1 series          openib and mvapi BTLs
v1.2 series          openib and mvapi BTLs
v1.3 / v1.4 series   openib BTL
v1.5 / v1.6 series   openib BTL, mxm MTL, fca coll
v1.7 / v1.8 series   openib BTL, mxm MTL, fca and ml and hcoll coll
v2.x series          openib BTL, yalla (MXM) PML, ml and hcoll coll
v3.x series          openib BTL, ucx and yalla (MXM) PML, hcoll coll
v4.x series          openib BTL, ucx PML, hcoll coll, ofi MTL
History / notes:
The openib BTL uses the OpenFabrics Alliance's (OFA) verbs
API stack to support InfiniBand, RoCE, and iWARP devices. The OFA's
original name was "OpenIB", which is why the BTL is named
openib.
Before the verbs API was effectively standardized in the OFA's
verbs stack, Open MPI supported Mellanox VAPI in the mvapi module.
The MVAPI API stack has long-since been discarded, and is no longer
supported after the Open MPI v1.2 series.
The next-generation, higher-abstraction API for supporting
InfiniBand and RoCE devices is named UCX. As of Open MPI v4.0, the
ucx PML is the preferred mechanism for utilizing InfiniBand and RoCE
devices. As of UCX v1.8, iWARP is not supported. See this FAQ entry for more information about
iWARP.
278. What component will my OpenFabrics-based network use by default?
Per this FAQ item,
OpenFabrics-based networks have generally used the openib BTL for
native verbs-based communication for MPI point-to-point
communications. Because of this history, many of the questions below
refer to the openib BTL, and are specifically marked as such.
The following are exceptions to this general rule:
In the v2.x and v3.x series, Mellanox InfiniBand devices
defaulted to MXM-based components (e.g., mxm and/or yalla).
In the v4.0.x series, Mellanox InfiniBand devices default to the
ucx PML. The use of InfiniBand over the openib BTL is
officially deprecated in the v4.0.x series, and is scheduled to
be removed in Open MPI v5.0.0.
That being said, it is generally possible for any OpenFabrics device
to use the openib BTL or the ucx PML:
To make the openib BTL use InfiniBand in v4.0.x, set the
btl_openib_allow_ib parameter to 1.
See this FAQ item for information about
using the ucx PML with arbitrary OpenFabrics devices.
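For example, a sketch of the first case above (forcing the openib BTL
to carry InfiniBand traffic in the v4.0.x series), using the same BTL
list shown elsewhere in this FAQ:

shell$ mpirun --mca btl openib,self,vader --mca btl_openib_allow_ib 1 ...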
279. Does Open MPI support iWARP?
iWARP is fully supported via the openib BTL as of the Open
MPI v1.3 release.
Note that the openib BTL is scheduled to be removed from Open MPI
starting with v5.0.0. After the openib BTL is removed, support for
iWARP is murky, at best. As of June 2020 (in the v4.x series), there
are two alternate mechanisms for iWARP support which will likely
continue into the v5.x series:
The cm PML with the ofi MTL. This mechanism is actually
designed for networks that natively support "MPI-style matching",
which iWARP does not support. Hence, Libfabric adds in a layer of
software emulation to provide this functionality. This slightly
decreases Open MPI's performance on iWARP networks. That being said,
it seems to work correctly.
The ofi BTL. A new/prototype BTL named ofi is being
developed (and can be used in place of the openib BTL); it uses
Libfabric to directly access the native iWARP device functionality --
without the software emulation performance penalty from using the
"MPI-style matching" of the cm PML + ofi MTL combination.
However, the ofi BTL is neither widely tested nor fully developed.
As of June 2020, it did not work with iWARP, but may be updated in the
future.
This state of affairs reflects that the iWARP vendor community is not
involved with Open MPI; we therefore have no one who is actively
developing, testing, or supporting iWARP users in Open MPI. If anyone
is interested in helping with this situation, please let the Open MPI
developer community know.
NOTE: A prior version of this FAQ entry stated that iWARP support
was available through the ucx PML. That was incorrect. As of UCX
v1.8, iWARP is not supported.
280. Does Open MPI support RoCE (RDMA over Converged Ethernet)?
RoCE is fully supported as of the Open MPI v1.4.4 release.
As of Open MPI v4.0.0, the UCX PML is the preferred mechanism for
running over RoCE-based networks. See this FAQ entry for details.
The openib BTL is also available for use with RoCE-based networks
through the v4.x series; see this FAQ
entry for information how to use it. Note, however, that the
openib BTL is scheduled to be removed from Open MPI in v5.0.0.
281. I have an OFED-based cluster; will Open MPI work with that?
Yes.
OFED (OpenFabrics Enterprise Distribution) is basically the release
mechanism for the OpenFabrics software packages. OFED releases are
officially tested and released versions of the OpenFabrics stacks.
282. Where do I get the OFED software from?
The "Download" section of the OpenFabrics web site has
links for the various OFED releases.
Additionally, Mellanox distributes Mellanox OFED and Mellanox-X binary
distributions. Consult with your IB vendor for more details.
283. Isn't Open MPI included in the OFED software package? Can I install another copy of Open MPI besides the one that is included in OFED?
Yes, Open MPI used to be included in the OFED software. And
yes, you can easily install a later version of Open MPI on
OFED-based clusters, even if you're also using the Open MPI that was
included in OFED.
You can simply download the Open MPI version that you want and install
it to an alternate directory from where the OFED-based Open MPI was
installed. You therefore have multiple copies of Open MPI that do not
conflict with each other. Make sure you set the PATH and
LD_LIBRARY_PATH variables to point to exactly one of your Open MPI
installations at a time, and never try to run an MPI executable
compiled with one version of Open MPI with a different version of Open
MPI.
The following versions of Open MPI shipped in OFED (note that
OFED stopped including MPI implementations as of OFED 1.5):
OFED 1.4.1: Open MPI v1.3.2.
OFED 1.4: Open MPI v1.2.8.
OFED 1.3.1: Open MPI v1.2.6.
OFED 1.3: Open MPI v1.2.5.
OFED 1.2: Open MPI v1.2.1.
NOTE: A prior version of this
FAQ entry specified that "v1.2ofed" would be included in OFED v1.2,
representing a temporary branch from the v1.2 series that included
some OFED-specific functionality. All of this functionality was
included in the v1.2.1 release, so OFED v1.2 simply included that.
Some public betas of "v1.2ofed" releases were made available, but
this version was never officially released.
OFED 1.1: Open MPI v1.1.1.
OFED 1.0: Open MPI v1.1b1.
285. Why are you using the name "openib" for the BTL name?
Before the iWARP vendors joined the OpenFabrics Alliance, the
project was known as OpenIB. Open MPI's support for this software
stack was originally written during this timeframe — the name of the
group was "OpenIB", so we named the BTL openib.
Since then, iWARP vendors joined the project and it changed names to
"OpenFabrics". Open MPI did not rename its BTL mainly for
historical reasons — we didn't want to break compatibility for users
who were already using the openib BTL name in scripts, etc.
286. Is the mVAPI-based BTL still supported?
Yes, but only through the Open MPI v1.2 series; mVAPI support
was removed starting with v1.3.
The mVAPI support is an InfiniBand-specific BTL (i.e., it will not
work in iWARP networks), and reflects a prior generation of
InfiniBand software stacks.
The Open MPI team is doing no new work with mVAPI-based networks.
Generally, much of the information contained in this FAQ category
applies to both the OpenFabrics openib BTL and the mVAPI mvapi BTL
— simply replace openib with mvapi to get similar results.
However, new features and options are continually being added to the
openib BTL (and are being listed in this FAQ) that will not be
back-ported to the mvapi BTL. So not all openib-specific items in
this FAQ category will apply to the mvapi BTL.
All that being said, as of Open MPI v4.0.0, the use of InfiniBand over
the openib BTL is deprecated — the UCX PML
is the preferred way to run over InfiniBand.
287. How do I specify to use the OpenFabrics network for MPI messages? (openib BTL)
In general, you specify that the openib BTL
components should be used. However, note that you should also
specify that the self BTL component should be used. self is for
loopback communication (i.e., when an MPI process sends to itself),
and is technically a different communication channel than the
OpenFabrics networks. For example:
shell$ mpirun --mca btl openib,self ...
Failure to specify the self BTL may result in Open MPI being unable
to complete send-to-self scenarios (meaning that your program will run
fine until a process tries to send to itself).
Note that openib,self is the minimum list of BTLs that you might
want to use. It is highly likely that you also want to include the
vader (shared memory) BTL in the list as well, like this:
shell$ mpirun --mca btl openib,self,vader ...
NOTE: Prior versions of Open MPI used an sm BTL for
shared memory. sm was effectively replaced with vader starting in
Open MPI v3.0.0.
See this FAQ
entry for more details on selecting which MCA plugins are used at
run-time.
Finally, note that if the openib component is available at run time,
Open MPI should automatically use it by default (ditto for self).
Hence, it's usually unnecessary to specify these options on the
mpirun command line. They are typically only used when you want to
be absolutely positively definitely sure to use the specific BTL.
288. But wait — I also have a TCP network. Do I need to explicitly
disable the TCP BTL?
289. How do I know what MCA parameters are available for tuning MPI performance?
The ompi_info command can display all the parameters
available for any Open MPI component. For example:
# Note that Open MPI v1.8 and later will only show an abbreviated list
# of parameters by default.  Use "--level 9" to show all available
# parameters.

# Show the UCX PML parameters
shell$ ompi_info --param pml ucx --level 9

# Show the openib BTL parameters
shell$ ompi_info --param btl openib --level 9
290. I'm experiencing a problem with Open MPI on my OpenFabrics-based network; how do I troubleshoot and get help?
In order for us to help you, it is most helpful if you can
run a few steps before sending an e-mail to both perform some basic
troubleshooting and provide us with enough information about your
environment to help you. Please include answers to the following
questions in your e-mail:
Which Open MPI component are you using? Possibilities include:
the ucx PML, the yalla PML, the mxm MTL, the openib BTL.
Which OpenFabrics version are you running? Please specify where
you got the software from (e.g., from the OpenFabrics community web
site, from a vendor, or it was already included in your Linux
distribution).
What distro and version of Linux are you running? What is your
kernel version?
Which subnet manager are you running? (e.g., OpenSM, a
vendor-specific subnet manager, etc.)
What is the output of the ibv_devinfo command on a known
"good" node and a known "bad" node? (NOTE: there
must be at least one port listed as "PORT_ACTIVE" for Open MPI to
work. If there is not at least one PORT_ACTIVE port, something is
wrong with your OpenFabrics environment and Open MPI will not be able
to run).
What is the output of the ifconfig command on a known "good"
node and a known "bad" node? (Mainly relevant for IPoIB
installations.) Note that some Linux distributions do not put
ifconfig in the default path for normal users; look for it at
/sbin/ifconfig or /usr/sbin/ifconfig.
If running under Bourne shells, what is the output of the ulimit -l
command? If running under C shells, what is the output of
the limit | grep memorylocked command?
(NOTE: If the value is not "unlimited", see this FAQ entry and this FAQ entry).
Gather up this information and see
this page about how to submit a help request to the user's mailing
list.
291. What is "registered" (or "pinned") memory?
"Registered" memory means two things:
The memory has been "pinned" by the operating system such that
the virtual memory subsystem will not relocate the buffer (until it
has been unpinned).
The network adapter has been notified of the virtual-to-physical
address mapping.
These two factors allow network adapters to move data between the
network fabric and physical RAM without involvement of the main CPU or
operating system.
Note that many people say "pinned" memory when they actually mean
"registered" memory.
However, a host can only support so much registered memory, so it is
treated as a precious resource. Additionally, the cost of registering
(and unregistering) memory is fairly high. Open MPI takes aggressive
steps to use as little registered memory as possible (balanced against
performance implications, of course) and mitigate the cost of
registering and unregistering memory.
292. I'm getting errors about "error registering openib memory";
what do I do? (openib BTL)
With OpenFabrics (and therefore the openib BTL component),
you need to set the available locked memory to a large number (or
better yet, unlimited) — the defaults with most Linux installations
are usually too low for most HPC applications that utilize
OpenFabrics. Failure to do so will result in an error message similar
to one of the following (the messages have changed throughout the
release versions of Open MPI):
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.
This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.
See this Open MPI FAQ item for more information on these Linux kernel
module parameters:
http://www.linux-pam.org/Linux-PAM-html/sag-pam_limits.html
Local host: node02
Registerable memory: 32768 MiB
Total memory: 65476 MiB
Your MPI job will continue, but may be behave poorly and/or hang.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
The OpenIB BTL failed to initialize while trying to create an internal
queue. This typically indicates a failed OpenFabrics installation or
faulty hardware. The failure occurred here:
Host: compute_node.example.com
OMPI source: btl_openib.c:828
Function: ibv_create_cq()
Error: Invalid argument (errno=22)
Device: mthca0
You may need to consult with your system administrator to get this
problem fixed.
The OpenIB BTL failed to initialize while trying to allocate some
locked memory. This typically can indicate that the memlock limits
are set too low. For most HPC installations, the memlock limits
should be set to "unlimited". The failure occurred here:
Host: compute_node.example.com
OMPI source: btl_opebib.c:114
Function: ibv_create_cq()
Device: Out of memory
Memlock limit: 32767
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.linux-pam.org/Linux-PAM-html/sag-pam_limits.html
293. How can a system administrator (or user) change locked memory limits?
There are two ways to control the amount of memory that a user
process can lock:
Assuming that the PAM limits module is being used (see
full docs for the Linux PAM limits module, or
this mirror), the system-level
default values are controlled by putting a file in
/etc/security/limits.d/ (or directly editing the
/etc/security/limits.conf file on older systems). Two limits are
configurable:
Soft: The "soft" value is how much memory is allowed to be
locked by user processes by default. Set it by creating a file in
/etc/security/limits.d/ (e.g., 95-openfabrics.conf) with the line
below (or, if your system doesn't have a /etc/security/limits.d/
directory, add a line directly to /etc/security/limits.conf):
* soft memlock <number>
where <number> is the number of bytes that you want user
processes to be allowed to lock by default (presumably rounded down to
an integral number of pages). <number> can also be
unlimited.
Hard: The "hard" value is the maximum amount of memory that a
user process can lock. Similar to the soft lock, add it to the file
you added to /etc/security/limits.d/ (or editing
/etc/security/limits.conf directly on older systems):
* hard memlock <number>
where <number> is the maximum number of bytes that you want
user processes to be allowed to lock (presumably rounded down to an
integral number of pages). <number> can also be
unlimited.
Per-user default values are controlled via the ulimit command (or
limit in csh). The default amount of memory allowed to be
locked will correspond to the "soft" limit set in
/etc/security/limits.d/ (or limits.conf — see above); users
cannot use ulimit (or limit in csh) to set their amount to be more
than the hard limit in /etc/security/limits.d (or limits.conf).
Users can increase the default limit by adding the following to their
shell startup files for Bourne style shells (sh, bash):
shell$ ulimit -l unlimited
Or for C style shells (csh, tcsh):
shell% limit memorylocked unlimited
This effectively sets their limit to the hard limit in
/etc/security/limits.d (or limits.conf). Alternatively, users can
set a specific number instead of "unlimited", but this has limited
usefulness unless a user is aware of exactly how much locked memory they
will require (which is difficult to know since Open MPI manages locked
memory behind the scenes).
It is important to realize that this must be set in all shells where
Open MPI processes using OpenFabrics will be run. For example, if you are
using rsh or ssh to start parallel jobs, it will be necessary to
set the ulimit in your shell startup files so that it is effective
on the processes that are started on each node.
More specifically: it may not be sufficient to simply execute the
following, because the ulimit may not be in effect on all nodes
where Open MPI processes will be run:
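For example (a sketch; the hostfile and application names below are
hypothetical), raising the limit in the local shell does not
necessarily raise it on the remote nodes where the MPI processes are
launched via ssh or rsh:

shell$ ulimit -l unlimited
shell$ mpirun -np 4 --hostfile my_hosts my_mpi_application   # hypothetical names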
294. I'm still getting errors about "error registering openib memory"; what do I do? (openib BTL)
Ensure that the limits you've set (see this FAQ entry) are actually being
used. There are two general cases where this can happen:
Your memory locked limits are not actually being applied for
interactive and/or non-interactive logins.
You are starting MPI jobs under a resource manager / job
scheduler that is either explicitly resetting the memory limits or
has daemons that were (usually accidentally) started with very small
memory locked limits.
That is, in some cases, it is possible to login to a node and
not have the "limits" set properly. For example, consider the
following post on the Open MPI User's list:
In this case, the user noted that the default configuration on his
Linux system did not automatically load the pam_limits.so
upon rsh-based logins, meaning that the hard and soft
limits were not set.
There are also some default configurations where, even though the
maximum limits are initially set system-wide in limits.d (or
limits.conf on older systems), something
during the boot procedure sets the default limit back down to a low
number (e.g., 32k). In this case, you may need to override this limit
on a per-user basis (described in this FAQ
entry), or effectively system-wide by putting ulimit -l unlimited
(for Bourne-like shells) in a strategic location, such as:
/etc/init.d/sshd (or wherever the script is that starts up your
SSH daemon) and restarting the SSH daemon
In a script in /etc/profile.d, or wherever system-wide shell
startup scripts are located (e.g., /etc/profile and
/etc/csh.cshrc)
Also, note that resource managers such as Slurm, Torque/PBS, LSF,
etc. may affect OpenFabrics jobs in two ways:
Make sure that the resource manager daemons are started with
unlimited memlock limits (which may involve editing the resource
manager daemon startup script, or some other system-wide location that
allows the resource manager daemon to get an unlimited limit of locked
memory).
Otherwise, jobs that are started under that resource manager
will get the default locked memory limits, which are far too small for
Open MPI.
*The files in limits.d (or the limits.conf file) do not usually
apply to resource daemons!* The limits.d files usually only apply
to rsh or ssh-based logins. Hence, daemons usually inherit the
system default of maximum 32k of locked memory (which then gets passed
down to the MPI processes that they start). To increase this limit,
you typically need to modify daemons' startup scripts to increase the
limit before they drop root privileges.
Some resource managers can limit the amount of locked
memory that is made available to jobs. For example, Slurm has some
fine-grained controls that allow locked memory for only Slurm jobs
(i.e., the system's default is low memory lock limits, but Slurm jobs
can get high memory lock limits). See these FAQ items on the Slurm
web site for more details: propagating limits and using PAM.
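As a sketch of the first point above: the usual pattern is to raise the
limit in the daemon's startup script before the daemon is launched. The
exact file and path depend on the resource manager and distribution;
the example below is hypothetical:

# Hypothetical: near the top of the resource manager daemon's init
# script, before the daemon process is started.
ulimit -l unlimited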
Finally, note that some versions of SSH have problems with getting
correct values from /etc/security/limits.d/ (or limits.conf) when
using privilege separation. You may notice this by ssh'ing into a
node and seeing that your memlock limits are far lower than what you
have listed in /etc/security/limits.d/ (or limits.conf) (e.g., 32k
instead of unlimited). Several web sites suggest disabling privilege
separation in ssh to make PAM limits work properly, but others imply
that this may be fixed in recent versions of OpenSSH.
If you do disable privilege separation in ssh, be sure to check with
your local system administrator and/or security officers to understand
the full implications of this change. See this Google search link for more information.
295. Open MPI is warning me about limited registered memory; what does this mean?
OpenFabrics network vendors provide Linux kernel module
parameters controlling the size of the memory translation
table (MTT) used to map virtual addresses to physical addresses. The
size of this table controls the amount of physical memory that can be
registered for use with OpenFabrics devices.
With Mellanox hardware, two parameters are provided to control the
size of this table:
log_num_mtt (on some older Mellanox hardware, the parameter may be
num_mtt, not log_num_mtt): the (base-2 logarithm of the) number of
memory translation table segments
log_mtts_per_seg: the (base-2 logarithm of the) number of MTT entries
per segment
The amount of memory that can be registered is calculated using this
formula:
In newer hardware:
max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE
In older hardware:
max_reg_mem = num_mtt * (2^log_mtts_per_seg) * PAGE_SIZE
*At least some versions of OFED (community OFED,
Mellanox OFED, and upstream OFED in Linux distributions) set the
default values of these variables FAR too low!* For example, in
some cases, the default values may only allow registering 2 GB — even
if the node has much more than 2 GB of physical memory.
It is recommended that you adjust log_num_mtt (or num_mtt) such
that your max_reg_mem value is at least twice the amount of physical
memory on your machine (setting it to a value higher than the amount
of physical memory present allows the internal Mellanox driver tables
to handle fragmentation and other overhead). For example, if a node
has 64 GB of memory and a 4 KB page size, log_num_mtt should be set
to 24 (assuming log_mtts_per_seg is set to 1). This will allow
processes on the node to register:
max_reg_mem = (2^24) * (2^1) * (4 kB) = 128 GB
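How these module parameters are set depends on your hardware and
distribution. A hedged sketch for ConnectX-family (mlx4) hardware, in
which the module name, file path, and values are assumptions to adapt
to your system:

# Hypothetical /etc/modprobe.d/mlx4_core.conf entry; adjust the module
# name and values for your hardware, then reload the driver (or reboot).
options mlx4_core log_num_mtt=24 log_mtts_per_seg=1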
NOTE: Starting with OFED 2.0, OFED's default kernel parameter values
should allow registering twice the physical memory size.
296. I'm using Mellanox ConnectX HCA hardware and seeing terrible
latency for short messages; how can I fix this?
Open MPI prior to v1.2.4 did not include specific
configuration information to enable RDMA for short messages on
ConnectX hardware. As such, Open MPI will default to the safe setting
of using send/receive semantics for short messages, which is slower
than RDMA.
To enable RDMA for short messages, you can add this snippet to the
bottom of the $prefix/share/openmpi/mca-btl-openib-hca-params.ini
file:
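The exact snippet depends on your hardware and Open MPI version; the
sketch below is illustrative only, with a hypothetical section name and
placeholder device IDs that you would need to replace with the values
for your HCA (the key setting is use_eager_rdma = 1):

# Hypothetical stanza; replace the section name, vendor_id, and
# vendor_part_id values with those of your ConnectX device.
[Mellanox ConnectX]
vendor_id = 0x2c9
vendor_part_id = <your ConnectX part IDs>
use_eager_rdma = 1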
Enabling short message RDMA will significantly reduce short message
latency, especially on ConnectX (and newer) Mellanox hardware.
297. How much registered memory is used by Open MPI? Is there a way to limit it? (openib BTL)
Open MPI uses registered memory in several places, and
therefore the total amount used is calculated by a somewhat-complex
formula that is directly influenced by MCA parameter values.
It can be desirable to enforce a hard limit on how much registered
memory is consumed by MPI applications. For example, some platforms
have limited amounts of registered memory available; setting limits on
a per-process level can ensure fairness between MPI processes on the
same host. Another reason is that registered memory is not swappable;
as more memory is registered, less memory is available for
(non-registered) process code and data. When little unregistered
memory is available, swap thrashing of unregistered memory can occur.
Each instance of the openib BTL module in an MPI process (i.e.,
one per HCA port and LID) will use up to a maximum of the sum of the
following quantities:
User memory, limited by the mpool_rdma_rcache_size_limit MCA
parameter:
By default Open
MPI will register as much user memory as necessary (upon demand).
However, if mpool_rdma_rcache_size_limit is greater than zero, it
is the upper limit (in bytes) of user memory that will be
registered. User memory is registered for ongoing MPI
communications (e.g., long message sends and receives) and via the
MPI_ALLOC_MEM function.
Note that this MCA parameter was introduced in v1.2.1.
A "free list" of buffers used in the openib BTL for "eager"
fragments (e.g., the first fragment of a long message). Two free
lists are created; one for sends and one for receives.
By default, btl_openib_free_list_max is -1, and the list size is
unbounded, meaning that Open MPI will try to allocate as many
registered buffers as it needs. If btl_openib_free_list_max is
greater than 0, the list will be limited to this size. Each entry
in the list is approximately btl_openib_eager_limit bytes —
some additional overhead space is required for alignment and
internal accounting. btl_openib_eager_limit is the
maximum size of an eager fragment.
A "free list" of buffers used for send/receive communication in
the openib BTL. Two free lists are created; one for sends and
one for receives.
By default, btl_openib_free_list_max is -1, and the list size is
unbounded, meaning that Open MPI will allocate as many registered
buffers as it needs. If btl_openib_free_list_max is greater
than 0, the list will be limited to this size. Each entry in the
list is approximately btl_openib_max_send_size bytes — some
additional overhead space is required for alignment and internal
accounting. btl_openib_max_send_size is the maximum
size of a send/receive fragment.
If btl_openib_use_eager_rdma is true, RDMA buffers are used
for eager fragments (because RDMA semantics can be faster than
send/receive semantics in some cases), and an additional set of
registered buffers is created (as needed).
Each MPI process will use RDMA buffers for eager fragments up to
btl_openib_eager_rdma_num MPI peers. Upon receiving the
btl_openib_eager_rdma_threshold'th message from an MPI peer
process, if both sides have not yet setup
btl_openib_eager_rdma_num sets of eager RDMA buffers, a new set
will be created. The set will contain btl_openib_max_eager_rdma
buffers; each buffer will be btl_openib_eager_limit bytes (i.e.,
the maximum size of an eager fragment).
In general, when any of the individual limits are reached, Open MPI
will try to free up registered memory (in the case of registered user
memory) and/or wait until message passing progresses and more
registered memory becomes available.
Use the ompi_info command to view the values of the MCA parameters
described above in your Open MPI installation:
# Note that Open MPI v1.8 and later require the "--level 9"
# CLI option to display all available MCA parameters.
shell$ ompi_info --param btl openib --level 9
See this FAQ entry
for information on how to set MCA parameters at run-time.
298. How do I get Open MPI working on Chelsio iWARP devices? (openib BTL)
Please see this FAQ entry for
an important note about iWARP support (particularly for Open MPI
versions starting with v5.0.0).
For the Chelsio T3 adapter, you must have at least OFED v1.3.1 and
Chelsio firmware v6.0. Download the firmware from service.chelsio.com and put the uncompressed t3fw-6.0.0.bin
file in /lib/firmware. Then reload the iw_cxgb3 module and bring
up the ethernet interface to flash this new firmware. For example:
# Note that the URL for the firmware may change over time
shell# cd /lib/firmware
shell# wget http://service.chelsio.com/drivers/firmware/t3/t3fw-6.0.0.bin.gz
[...wget output...]
shell# gunzip t3fw-6.0.0.bin.gz
shell# rmmod iw_cxgb3 cxgb3
shell# modprobe iw_cxgb3

# This last step *may* happen automatically, depending on your
# Linux distro (assuming that the ethernet interface has previously
# been properly configured and is ready to bring up).  Substitute the
# proper ethernet interface name for your T3 (vs. ethX).
shell# ifup ethX
If all goes well, you should see a message similar to the following in
your syslog 15-30 seconds later:
kernel: cxgb3 0000:0c:00.0: successful upgrade to firmware 6.0.0
Open MPI will work without any specific configuration to the openib
BTL. Users wishing to performance tune the configurable options may
wish to inspect the receive queue values. Those can be found in the
"Chelsio T3" section of mca-btl-openib-hca-params.ini.
299. I'm getting "ibv_create_qp: returned 0 byte(s) for max inline
data" errors; what is this, and how do I fix it? (openib BTL)
Prior to Open MPI v1.0.2, the OpenFabrics (then known as
"OpenIB") verbs BTL component did not check for where the OpenIB API
could return an erroneous value (0) and it would hang during startup.
Starting with v1.0.2, error messages of the following form are
reported:
[0,1,0][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp]
ibv_create_qp: returned 0 byte(s) for max inline data
This is caused by an error in older versions of the OpenIB user
library. Upgrading your OpenIB stack to recent versions of the
OpenFabrics software should resolve the problem. See this post on the
Open MPI user's list for more details.
300. My bandwidth seems [far] smaller than it should be; why? Can this be fixed? (openib BTL)
Open MPI, by default, uses a pipelined RDMA protocol.
Additionally, in the v1.0 series of Open MPI, small messages use
send/receive semantics (instead of RDMA — small message RDMA was added in the v1.1 series).
For some applications, this may result in lower-than-expected
bandwidth. However, Open MPI also supports caching of registrations
in a most recently used (MRU) list — this bypasses the pipelined RDMA
and allows messages to be sent faster (in some cases).
For the v1.1 series, see this FAQ entry for more
information about small message RDMA, its effect on latency, and how
to tune it.
To enable the "leave pinned" behavior, set the MCA parameter
mpi_leave_pinned to 1. For example:
shell$ mpirun --mca mpi_leave_pinned 1 ...
NOTE: The mpi_leave_pinned parameter was
broken in Open MPI v1.3 and v1.3.1 (see
this announcement). mpi_leave_pinned functionality was fixed in v1.3.2.
This will enable the MRU cache and will typically increase bandwidth
performance for applications which reuse the same send/receive
buffers.
NOTE: The v1.3 series enabled "leave
pinned" behavior by default when applicable; it is usually
unnecessary to specify this flag anymore.
301. How do I tune small messages in Open MPI v1.1 and later versions? (openib BTL)
Starting with Open MPI version 1.1, "short" MPI messages are
sent, by default, via RDMA to a limited set of peers (for versions
prior to v1.2, only when the shared receive queue is not used). This
provides the lowest possible latency between MPI processes.
However, this behavior is not enabled between all process peer pairs
because it can quickly consume large amounts of resources on nodes
(specifically: memory must be individually pre-allocated for each
process peer to perform small message RDMA; for large MPI jobs, this
can quickly cause individual nodes to run out of memory). Outside the
limited set of peers, send/receive semantics are used (meaning that
they will generally incur a greater latency, but not consume as many
system resources).
This behavior is tunable via several MCA parameters:
btl_openib_use_eager_rdma (default value: 1): This parameter
defaults to 1, meaning that the small message behavior described above
(RDMA to a limited set of peers, send/receive to everyone else) is
enabled. Setting this parameter to 0 disables all small message
RDMA in the openib BTL component.
btl_openib_eager_rdma_threshold (default value: 16): This is
the number of short messages that must be received from a peer before
Open MPI will setup an RDMA connection to that peer. This mechanism
tries to setup RDMA connections only to those peers who will
frequently send around a lot of short messages (e.g., avoid consuming
valuable RDMA resources for peers who only exchange a few "startup"
control messages).
btl_openib_max_eager_rdma (default value: 16): This parameter
controls the maximum number of peers that can receive an RDMA
connection for short messages. It is not advisable to change this
value to a very large number because the polling time increases with
the number of connections; as a direct result, short message
latency will increase.
btl_openib_eager_rdma_num (default value: 16): This parameter
controls the maximum number of pre-allocated buffers allocated to each
peer for small messages.
btl_openib_eager_limit (default value: 12k): The maximum size
of small messages (in bytes).
Note that long messages use a different protocol than short messages;
messages over a certain size always use RDMA. Long messages are not
affected by the btl_openib_use_eager_rdma MCA parameter.
Also note that, as stated above, prior to v1.2, small message RDMA is
not used when the shared receive queue is used.
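For example, a sketch of adjusting two of these parameters on the
mpirun command line (the values shown are illustrative, not
recommendations):

shell$ mpirun --mca btl_openib_eager_rdma_threshold 32 --mca btl_openib_max_eager_rdma 8 ...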
302. How do I tune large message behavior in Open MPI the v1.2 series? (openib BTL)
Note that this answer generally pertains to the Open MPI v1.2
series. Later versions slightly changed how large messages are
handled.
Open MPI uses a few different protocols for large messages. Much
detail is provided in this
paper.
The btl_openib_flags MCA parameter is a set of bit flags that
influences which protocol is used; they generally indicate what kind
of transfers are allowed to send the bulk of long messages.
Specifically, these flags do not regulate the behavior of "match"
headers or other intermediate fragments.
The following flags are available:
Use send/receive semantics (1): Allow the use of send/receive
semantics.
Use PUT semantics (2): Allow the sender to use RDMA writes.
Use GET semantics (4): Allow the receiver to use RDMA reads.
Open MPI defaults to setting both the PUT and GET flags (value 6).
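For example, a sketch of restricting long-message transfers to
send/receive semantics only (flag value 1 from the list above):

shell$ mpirun --mca btl_openib_flags 1 ...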
Open MPI uses the following long message protocols:
RDMA Direct: If RDMA writes or reads are allowed by
btl_openib_flags and the sender's message is already registered
(either by use of the
mpi_leave_pinned MCA parameter or if the buffer was allocated
via MPI_ALLOC_MEM), a slightly simpler protocol is used:
Send the "match" fragment: the sender sends the MPI message
information (communicator, tag, etc.) to the receiver using copy
in/copy out semantics. No data from the user message is included in
the match header.
Use RDMA to transfer the message:
If RDMA reads are enabled and only one network connection is
available between the pair of MPI processes, once the receiver has
posted a matching MPI receive, it issues an RDMA read to get the
message, and sends an ACK back to the sender when the transfer has
completed.
If the above condition is not met, then RDMA writes must be
enabled (or we would not have chosen this protocol). The receiver
sends an ACK back when a matching MPI receive is posted and the sender
issues an RDMA write across each available network link (i.e., BTL
module) to transfer the message. The RDMA write sizes are weighted
across the available network links. For example, if two MPI processes
are connected by both SDR and DDR IB networks, this protocol will
issue an RDMA write for 1/3 of the entire message across the SDR
network and will issue a second RDMA write for the remaining 2/3 of
the message across the DDR network.
The sender then sends an ACK to the receiver when the transfer has
completed.
NOTE: Per above, if striping across multiple
network interfaces is available, only RDMA writes are used. The
reason that RDMA reads are not used is solely because of an
implementation artifact in Open MPI; we didn't implement it because
using RDMA reads only saves the cost of a short message round trip,
the extra code complexity didn't seem worth it for long messages
(i.e., the performance difference will be negligible).
Note that the user buffer is not unregistered when the RDMA
transfer(s) is (are) completed.
RDMA Pipeline: If RDMA Direct was not used and RDMA writes
are allowed by btl_openib_flags and the sender's message is not
already registered, a 3-phase pipelined protocol is used:
Send the "match" fragment: the sender sends the MPI message
information (communicator, tag, etc.) and the first fragment of the
user's message using copy in/copy out semantics.
Send "intermediate" fragments: once the receiver has posted a
matching MPI receive, it sends an ACK back to the sender. The sender
and receiver then start registering memory for RDMA. To cover the
cost of registering the memory, several more fragments are sent to the
receiver using copy in/copy out semantics.
Transfer the remaining fragments: once memory registrations start
completing on both the sender and the receiver (see the paper for
details), the sender uses RDMA writes to transfer the remaining
fragments in the large message.
Note that phases 2 and 3 occur in parallel. Each phase 3 fragment is
unregistered when its transfer completes (see the
paper for more details).
Also note that one of the benefits of the pipelined protocol is that
large messages will naturally be striped across all available network
interfaces.
The sizes of the fragments in each of the three phases are tunable by
the MCA parameters shown in the figure below (all sizes are in units
of bytes):
Send/Receive: If RDMA Direct and RDMA Pipeline were not
used, copy in/copy out semantics are used for the whole message (note
that this will happen even if the SEND flag is not set in
btl_openib_flags):
Send the "match" fragment: the sender sends the MPI message
information (communicator, tag, etc.) and the first fragment of the
user's message using copy in/copy out semantics.
Send remaining fragments: once the receiver has posted a
matching MPI receive, it sends an ACK back to the sender. The sender
then uses copy in/copy out semantics to send the remaining fragments
to the receiver.
This protocol behaves the same as the RDMA Pipeline protocol when
the btl_openib_min_rdma_size value is infinite.
303. How do I tune large message behavior in the Open MPI v1.3 (and later) series? (openib BTL)
The Open MPI v1.3 (and later) series generally use the same
protocols for sending long messages as described for the v1.2
series, but the MCA parameters for the RDMA Pipeline protocol
were both moved and renamed (all sizes are in units of bytes):
The change to move the "intermediate" fragments to the end of the
message was made to better support applications that call fork().
Specifically, there is a problem in Linux when a process with
registered memory calls fork(): the registered memory will
physically not be available to the child process (touching memory in
the child that is registered in the parent will cause a segfault or
other error). Because memory is registered in units of pages, the end
of a long message is likely to share the same page as other heap
memory in use by the application. If this last page of the large
message is registered, then all the memory in that page — to include
other buffers that are not part of the long message — will not be
available to the child. By moving the "intermediate" fragments to
the end of the message, the end of the message will be sent with copy
in/copy out semantics and, more importantly, will not have its page
registered. This increases the chance that child processes will be
able to access other memory in the same page as the end of the large
message without problems.
Some notes about these parameters:
btl_openib_rndv_eager_limit defaults to the same value as
btl_openib_eager_limit (the size for "small" messages). It is a
separate parameter in case you want/need different values.
The btl_openib_min_rdma_size parameter was an absolute offset
into the message; it was replaced by
btl_openib_rdma_pipeline_send_length, which is a length.
Note that messages must be larger than
btl_openib_min_rdma_pipeline_size (a new MCA parameter to the v1.3
series) to use the RDMA Direct or RDMA Pipeline protocols.
Messages shorter than this length will use the Send/Receive protocol
(even if the SEND flag is not set on btl_openib_flags).
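For example, a sketch of raising the threshold below which the
Send/Receive protocol is used (the value shown is illustrative only):

shell$ mpirun --mca btl_openib_min_rdma_pipeline_size 262144 ...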
304. How does the mpi_leave_pinned parameter affect
large message transfers? (openib BTL)
NOTE: The mpi_leave_pinned parameter was
broken in Open MPI v1.3 and v1.3.1 (see
this announcement). mpi_leave_pinned functionality was fixed in v1.3.2.
When mpi_leave_pinned is set to 1, Open MPI aggressively
tries to pre-register user message buffers so that the RDMA Direct
protocol can be used. Additionally, user buffers are left
registered so that the de-registration and re-registration costs are
not incurred if the same buffer is used in a future message passing
operation.
NOTE: Starting with Open MPI v1.3,
mpi_leave_pinned is automatically set to 1 by default when
applicable. It is therefore usually unnecessary to set this value
manually.
NOTE: The mpi_leave_pinned MCA parameter
has some restrictions on how it can be set starting with Open MPI
v1.3.2. See this FAQ
entry for details.
Leaving user memory registered when sends complete can be extremely
beneficial for applications that repeatedly re-use the same send
buffers (such as ping-pong benchmarks). Additionally, the fact that a
single RDMA transfer is used and the entire process runs in hardware
with very little software intervention results in utilizing the
maximum possible bandwidth.
Leaving user memory registered has disadvantages, however. Bad Things
happen if registered memory is free()ed, for example —
it can silently invalidate Open MPI's cache of knowing which memory is
registered and which is not. The MPI layer usually has no visibility
on when the MPI application calls free() (or otherwise frees memory,
such as through munmap() or sbrk()). Open MPI has implemented
complicated schemes that intercept calls to return memory to the OS.
Upon intercept, Open MPI examines whether the memory is registered,
and if so, unregisters it before returning the memory to the OS.
These schemes are best described as "icky" and can actually cause
real problems in applications that provide their own internal memory
allocators. Additionally, only some applications (most notably,
ping-pong benchmark applications) benefit from "leave pinned"
behavior — those who consistently re-use the same buffers for sending
and receiving long messages.
*It is for these reasons that "leave pinned" behavior is not enabled
by default.* Note that other MPI implementations enable "leave
pinned" behavior by default.
Also note that another pipeline-related MCA parameter also exists:
mpi_leave_pinned_pipeline. Setting this parameter to 1 enables the
use of the RDMA Pipeline protocol, but simply leaves the user's
memory registered when RDMA transfers complete (eliminating the cost
of registering / unregistering memory during the pipelined sends /
receives). This can be beneficial to a small class of user MPI
applications.
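For example, a sketch of enabling this pipeline variant:

shell$ mpirun --mca mpi_leave_pinned_pipeline 1 ...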
305. How does the mpi_leave_pinned parameter affect
memory management? (openib BTL)
NOTE: The mpi_leave_pinned parameter was
broken in Open MPI v1.3 and v1.3.1 (see
this announcement). mpi_leave_pinned functionality was fixed in v1.3.2.
When mpi_leave_pinned is set to 1, Open MPI aggressively
leaves user memory registered with the OpenFabrics network stack after
the first time it is used with a send or receive MPI function. This
allows Open MPI to avoid expensive registration / deregistration
function invocations for each send or receive MPI function.
NOTE: The mpi_leave_pinned MCA parameter
has some restrictions on how it can be set starting with Open MPI
v1.3.2. See this FAQ
entry for details.
However, registered memory has two drawbacks:
There is only so much registered memory available.
User applications may free the memory, thereby invalidating Open
MPI's internal table of what memory is already registered.
The second problem can lead to silent data corruption or process
failure. As such, this behavior must be disallowed. Note that the
real issue is not simply freeing memory, but rather returning
registered memory to the OS (where it can potentially be used by a
different process). Open MPI has two methods of solving the issue:
Using an internal memory manager; effectively overriding calls to
malloc(), free(), mmap(), munmap(), etc.
Telling the OS to never return memory from the process to the
OS
Open MPI 1.2 and earlier on Linux used the ptmalloc2 memory allocator
linked into the Open MPI libraries to handle memory deregistration.
On Mac OS X, it uses an interface provided by Apple for hooking into
the virtual memory system, and on other platforms no safe memory
registration was available. The ptmalloc2 code could be disabled at
Open MPI configure time with the option --without-memory-manager,
however it could not be avoided once Open MPI was built.
ptmalloc2 can cause large memory utilization numbers for a small
number of applications and has a variety of link-time issues.
Therefore, by default Open MPI did not use the registration cache,
resulting in lower peak bandwidth. The inability to disable ptmalloc2
after Open MPI was built also resulted in headaches for users.
307. How does the mpi_leave_pinned parameter affect
memory management in Open MPI v1.3? (openib BTL)
NOTE: The mpi_leave_pinned parameter was
broken in Open MPI v1.3 and v1.3.1 (see
this announcement). mpi_leave_pinned functionality was fixed in v1.3.2.
NOTE: The mpi_leave_pinned MCA parameter
has some restrictions on how it can be set starting with Open MPI
v1.3.2. See this FAQ
entry for details.
With Open MPI 1.3, Mac OS X uses the same hooks as the 1.2 series,
and most operating systems do not provide pinning support. However,
the pinning support on Linux has changed. ptmalloc2 is now by default
built as a standalone library (with dependencies on the internal Open
MPI libopen-pal library), so that users by default do not have the
problematic code linked in with their application. Further, if
OpenFabrics networks are being used, Open MPI will use the mallopt()
call to disable returning memory to the OS if no other hooks
are provided, resulting in higher peak bandwidth by default.
To utilize the independent ptmalloc2 library, users need to add
-lopenmpi-malloc to the link command for their application:
shell$ mpicc foo.o -o foo -lopenmpi-malloc
Linking in libopenmpi-malloc will result in the OpenFabrics BTL not
enabling mallopt() but using the hooks provided with the ptmalloc2
library instead.
To revert to the v1.2 (and prior) behavior, with ptmalloc2 folded into
libopen-pal, Open MPI can be built with the
--enable-ptmalloc2-internal configure flag.
When not using ptmalloc2, mallopt() behavior can be disabled by
disabling mpi_leave_pinned:
shell$ mpirun --mca mpi_leave_pinned 0 ...
Because mpi_leave_pinned behavior is usually only useful for
synthetic MPI benchmarks, the never-return-memory-to-the-OS behavior
was resisted by the Open MPI developers for a long time. Ultimately,
it was adopted because a) it is less harmful than imposing the
ptmalloc2 memory manager on all applications, and b) it was deemed
important to enable mpi_leave_pinned behavior by default since Open
MPI performance kept getting negatively compared to other MPI
implementations that enable similar behavior by default.
308. How can I set the mpi_leave_pinned MCA parameter? (openib BTL)
NOTE: The mpi_leave_pinned parameter was
broken in Open MPI v1.3 and v1.3.1 (see
this announcement). mpi_leave_pinned functionality was fixed in v1.3.2.
As with all MCA parameters, the mpi_leave_pinned parameter (and
mpi_leave_pinned_pipeline parameter) can be set from the mpirun
command line:
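For example, setting the parameter to 1 (the same style of invocation
shown in earlier entries):

shell$ mpirun --mca mpi_leave_pinned 1 ...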
However, starting with v1.3.2, not all of the usual methods to set
MCA parameters apply to mpi_leave_pinned. Due to various
operating system memory subsystem constraints, Open MPI must react to
the setting of the mpi_leave_pinned parameter in each MPI process
before MPI_INIT is invoked. Specifically, some of Open MPI's MCA
parameter propagation mechanisms are not activated until during
MPI_INIT — which is too late for mpi_leave_pinned.
As such, only the following MCA parameter-setting mechanisms can be
used for mpi_leave_pinned and mpi_leave_pinned_pipeline:
Command line: See the example above.
Environment variable: Setting OMPI_MCA_mpi_leave_pinned to 1
before invoking mpirun.
To be clear: you cannot set the mpi_leave_pinned MCA parameter via
Aggregate MCA parameter files or normal MCA parameter files. This is
expected to be an acceptable restriction, however, since the default
value of the mpi_leave_pinned parameter is "-1", meaning
"determine at run-time if it is worthwhile to use leave-pinned
behavior." Specifically, if mpi_leave_pinned is set to -1, if any
of the following are true when each MPI processes starts, then Open
MPI will use leave-pinned behavior:
Either the environment variable OMPI_MCA_mpi_leave_pinned or
OMPI_MCA_mpi_leave_pinned_pipeline is set to a positive value (note
that the "mpirun --mca mpi_leave_pinned 1 ..." command-line syntax
simply results in setting these environment variables in each MPI
process)
Any of the following files / directories can be found in the
filesystem where the MPI process is running:
/sys/class/infiniband
/dev/open-mx
/dev/myri[0-9]
Note that if either the environment variable
OMPI_MCA_mpi_leave_pinned or OMPI_MCA_mpi_leave_pinned_pipeline is
set to "-1", then the above indicators are ignored and Open MPI
will not use leave-pinned behavior.
309. I got an error message from Open MPI about not using the
default GID prefix. What does that mean, and how do I fix it? (openib BTL)
Users may see the following error message from Open MPI v1.2:
WARNING: There are more than one active ports on host '%s', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical OFA
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate OFA subnet that is
used between connected MPI processes must have different subnet ID
values.
This is a complicated issue.
What it usually means is that you have a host connected to multiple,
physically separate OFA-based networks, at least 2 of which are using
the factory-default subnet ID value (FE:80:00:00:00:00:00:00). Open
MPI can therefore not tell these networks apart during its
reachability computations, and therefore will likely fail. You need
to reconfigure your OFA networks to have different subnet ID values,
and then Open MPI will function properly.
Please note that the same issue can occur when any two physically
separate subnets share the same subnet ID value — not just the
factory-default subnet ID value. However, Open MPI only warns about
the factory default subnet ID value because most users do not bother
to change it unless they know that they have to.
All this being said, note that there are valid network configurations
where multiple ports on the same host can share the same subnet ID
value. For example, two ports from a single host can be connected to
the same network as a bandwidth multiplier or a high-availability
configuration. For this reason, Open MPI only warns about finding
duplicate subnet ID values, and that warning can be disabled. Setting
the btl_openib_warn_default_gid_prefix MCA parameter to 0 will
disable this warning.
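For example, a sketch of disabling the warning:

shell$ mpirun --mca btl_openib_warn_default_gid_prefix 0 ...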
Since Open MPI can utilize multiple network links to send MPI traffic,
it needs to be able to compute the "reachability" of all network
endpoints that it can use. Specifically, for each network endpoint,
Open MPI calculates which other network endpoints are reachable.
In OpenFabrics networks, Open MPI uses the subnet ID to differentiate
between subnets — assuming that if two ports share the same subnet
ID, they are reachable from each other. If multiple, physically
separate OFA networks use the same subnet ID (such as the default
subnet ID), it is not possible for Open MPI to tell them apart and
therefore reachability cannot be computed properly.
310. What subnet ID / prefix value should I use for my OpenFabrics networks?
You can use any subnet ID / prefix value that you want.
However, Open MPI v1.1 and v1.2 both require that every physically
separate OFA subnet that is used between connected MPI processes must
have different subnet ID values.
For example, suppose that you have two hosts (A and B), and each of these
hosts has two ports (A1, A2, B1, and B2). If A1 and B1 are connected
to Switch1, and A2 and B2 are connected to Switch2, and Switch1 and
Switch2 are not reachable from each other, then these two switches
must be on subnets with different ID values.
How you change the subnet prefix depends on what Subnet Manager (SM) you
are using. Note that changing the subnet ID will likely kill
any jobs currently running on the fabric!
OpenSM: The SM contained in the OpenFabrics Enterprise
Distribution (OFED) is called OpenSM. The instructions below pertain
to OFED v1.2 and beyond; they may or may not work with earlier
versions.
Stop any OpenSM instances on your cluster:
shell# /etc/init.d/opensm stop
Run a single OpenSM iteration:
shell# opensm -c -o
The -o option causes OpenSM to run for one loop and exit.
The -c option tells OpenSM to create an "options" text file.
The OpenSM options file will be generated under
/var/cache/opensm/opensm.opts. Open the file and find the line with
subnet_prefix. Replace the default value prefix with the new one.
Restart OpenSM:
shell# /etc/init.d/opensm start
OpenSM will automatically load the options file from the cache
repository and will use the new prefix.
Cisco High Performance Subnet Manager (HSM): The Cisco HSM has a
console application that can dynamically change various
characteristics of the IB fabrics without restarting. The Cisco HSM
works on both the OFED InfiniBand stack and an older,
Cisco-proprietary "Topspin" InfiniBand stack. Please consult the
Cisco HSM (or switch) documentation for specific instructions on how
to change the subnet prefix.
Other SM: Consult that SM's instructions for how to change the
subnet prefix.
312. In a configuration with multiple host ports on the same fabric, what connection pattern does Open MPI use? (openib BTL)
When multiple active ports exist on the same physical fabric
between multiple hosts in an MPI job, Open MPI will attempt to use
them all by default. Open MPI makes several assumptions regarding
active ports when establishing connections between two hosts. Active
ports that have the same subnet ID are assumed to be connected to the
same physical fabric — that is to say that communication is possible
between these ports. Active ports with different subnet IDs
are assumed to be connected to different physical fabrics — no
communication is possible between them. It is therefore very important
that if active ports on the same host are on physically separate
fabrics, they must have different subnet IDs. Otherwise Open MPI may
attempt to establish communication between active ports on different
physical fabrics. The subnet manager allows subnet prefixes to be
assigned by the administrator, which should be done when multiple
fabrics are in use.
The following is a brief description of how connections are
established between multiple ports. During initialization, each
process discovers all active ports (and their corresponding subnet IDs)
on the local host and shares this information with every other process
in the job. Each process then examines all active ports (and the
corresponding subnet IDs) of every other process in the job and makes a
one-to-one assignment of active ports within the same subnet. If the
number of active ports within a subnet differ on the local process and
the remote process, then the smaller number of active ports are
assigned, leaving the rest of the active ports out of the assignment
between these two processes. Connections are not established during
MPI_INIT, but the active port assignment is cached and upon the first
attempted use of an active port to send data to the remote process
(e.g., via MPI_SEND), a queue pair (i.e., a connection) is established
between these ports. Active ports are used for communication in a
round robin fashion so that connections are established and used in a
fair manner.
NOTE: This FAQ entry generally applies to v1.2 and beyond. Prior to
v1.2, Open MPI would follow the same scheme outlined above, but would
not correctly handle the case where processes within the same MPI job
had differing numbers of active ports on the same physical fabric.
313. I'm getting lower performance than I expected. Why?
Measuring performance accurately is an extremely difficult
task, especially with fast machines and networks. Be sure to read this FAQ entry for
many suggestions on benchmarking performance.
_Pay particular attention to the discussion of processor affinity and
NUMA systems_ — running benchmarks without processor affinity and/or
on CPU sockets that are not directly connected to the bus where the
HCA is located can lead to confusing or misleading performance
results.
314. I get bizarre linker warnings / errors / run-time faults when
I try to compile my OpenFabrics MPI application statically. How do I
fix this?
Fully static linking is not for the weak, and is not
recommended. But it is possible.
315. Can I use system(), popen(), or fork() in an MPI application that uses the OpenFabrics support? (openib BTL)
The answer is, unfortunately, complicated. Be sure to also
see this FAQ entry.
If you have a Linux kernel before version 2.6.16: no. Some
distros may provide patches for older versions (e.g., RHEL4 may someday
receive a hotfix).
If you have a version of OFED before v1.2: sort of. Specifically,
newer kernels with OFED 1.0 and OFED 1.1 may generally allow the use
of system() and/or the use of fork() as long as the parent does
nothing until the child exits.
If you have a Linux kernel >= v2.6.16 and OFED >= v1.2 and Open MPI >=
v1.2.1: yes. Open MPI
v1.2.1 added two MCA values relevant to arbitrary fork() support
in Open MPI:
btl_openib_have_fork_support: This is a "read-only" MCA
value, meaning that users cannot change it in the normal ways that MCA
parameter values are set. It can be queried via the ompi_info
command; it will have a value of 1 if this installation of Open MPI
supports fork(); 0 otherwise.
btl_openib_want_fork_support: This MCA parameter can be used to
request conditional, absolute, or no fork() support. The following
values are supported:
Negative values: try to enable fork support, but continue even if
it is not available.
Zero: Do not try to enable fork support.
Positive values: Try to enable fork support and fail if it is not
available.
Hence, you can reliably query Open MPI to see if it has support for
fork() and force Open MPI to abort if you request fork support and
it doesn't have it.
This feature is helpful to users who switch around between multiple
clusters and/or versions of Open MPI; they can script to know whether
the Open MPI that they're using (and therefore the underlying IB stack)
has fork support. For example:
#!/bin/sh
have_fork_support=`ompi_info --param btl openib --level 9 --parsable | grep have_fork_support:value | cut -d: -f7`
if test "$have_fork_support" = "1"; then
    : # Happiness / world peace / birds are singing
else
    : # Despair / time for Häagen-Dazs
fi
Alternatively, you can skip querying and simply try to run your job:
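For example (a sketch: a positive value of the btl_openib_want_fork_support parameter described above requests fork support and fails if it is unavailable; "my_mpi_app" is a hypothetical application name):
shell$ mpirun --mca btl_openib_want_fork_support 1 ... my_mpi_app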
Which will abort if Open MPI's openib BTL does not have fork support.
All this being said, even if Open MPI is able to enable the
OpenFabrics fork() support, it does not mean
that your fork()-calling application is safe.
In general, if your application calls system() or popen(), it
will likely be safe.
However, note that arbitrary fork() support is not supported
in the OpenFabrics software stack. If you use fork() in your
application, you must not touch any registered memory before calling
some form of exec() to launch another process. Doing so will cause
an immediate seg fault / program crash.
It is important to note that memory is registered on a per-page basis;
it is therefore possible that your application may have memory
co-located on the same page as a buffer that was passed to an MPI
communications routine (e.g., MPI_Send() or MPI_Recv()) or some
other internally-registered memory inside Open MPI. You may therefore
accidentally "touch" a page that is registered without even
realizing it, thereby crashing your application.
There is unfortunately no way around this issue; it was intentionally
designed into the OpenFabrics software stack. Please complain to the
OpenFabrics Alliance that they should really fix this problem!
316. My MPI application sometimes hangs when using the
openib BTL; how can I fix this? (openib BTL)
Starting with v1.2.6, the MCA pml_ob1_use_early_completion
parameter allows the user (or administrator) to turn off the "early
completion" optimization. Early completion may cause "hang"
problems with some MPI applications running on OpenFabrics networks,
particularly loosely-synchronized applications that do not call MPI
functions often. The default is 1, meaning that early completion
optimization semantics are enabled (because it can reduce
point-to-point latency).
NOTE: This FAQ entry only applies to the v1.2 series. This
functionality is not required for v1.3 and beyond because of changes
in how message passing progress occurs. Specifically, this MCA
parameter will only exist in the v1.2 series.
317. Does InfiniBand support QoS (Quality of Service)?
Yes.
InfiniBand QoS functionality is configured and enforced by the Subnet
Manager/Administrator (e.g., OpenSM).
Open MPI (or any other ULP/application) sends traffic on a specific IB
Service Level (SL). This SL is mapped to an IB Virtual Lane, and all
the traffic arbitration and prioritization is done by the InfiniBand
HCAs and switches in accordance with the priority of each Virtual
Lane.
For details on how to tell Open MPI which IB Service Level to use,
please see this FAQ entry.
318. Does Open MPI support InfiniBand clusters with torus/mesh topologies? (openib BTL)
Yes.
InfiniBand 2D/3D Torus/Mesh topologies are different from the more
common fat-tree topologies in the way that routing works: different IB
Service Levels are used for different routing paths to prevent the
so-called "credit loops" (cyclic dependencies among routing path
input buffers) that can lead to deadlock in the network.
Open MPI complies with these routing rules by querying the OpenSM
for the Service Level that should be used when sending traffic to
each endpoint.
Note that this Service Level will vary for different endpoint pairs.
For details on how to tell Open MPI to dynamically query OpenSM for
IB Service Level, please refer to this FAQ entry.
NOTE: 3D-Torus and other torus/mesh IB
topologies are supported as of version 1.5.4.
319. How do I tell Open MPI which IB Service Level to use? (openib BTL)
There are two ways to tell Open MPI which SL to use:
By providing the SL value as a command line parameter to the
openib BTL
By telling openib BTL to dynamically query OpenSM for
SL that should be used for each endpoint
1. Providing the SL value as a command line parameter for the openib BTL
Use the btl_openib_ib_service_level MCA parameter to tell
openib BTL which IB SL to use:
shell$ mpirun --mca btl openib,self,vader --mca btl_openib_ib_service_level N ...
The value of IB SL N should be between 0 and 15, where 0 is the
default value.
NOTE: Open MPI will use the same SL value
for all the endpoints, which means that this option is not valid for
3D torus and other torus/mesh IB topologies.
2. Querying OpenSM for SL that should be used for each endpoint
Use the btl_openib_ib_path_record_service_level MCA
parameter to tell the openib BTL to query OpenSM for the IB SL
that should be used for each endpoint. Open MPI will send a
PathRecord query to OpenSM in the process of establishing connection
between two endpoints, and will use the IB Service Level from the
PathRecord response:
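For example (a sketch mirroring the earlier openib example; setting the parameter to 1 is assumed to enable the OpenSM query):
shell$ mpirun --mca btl openib,self,vader --mca btl_openib_ib_path_record_service_level 1 ...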
NOTE: The
btl_openib_ib_path_record_service_level MCA parameter is supported
as of version 1.5.4.
320. How do I tell Open MPI which IB Service Level to use? (UCX PML)
In order to tell UCX which SL to use, the
IB SL must be specified using the UCX_IB_SL environment variable.
For example:
shell$ mpirun --mca pml ucx -x UCX_IB_SL=N ...
The value of IB SL N should be between 0 and 15, where 0 is the
default value.
321. What is RDMA over Converged Ethernet (RoCE)?
RoCE (which stands for RDMA over Converged Ethernet)
provides InfiniBand native RDMA transport (OFA Verbs) on top of
lossless Ethernet data link.
Since we're talking about Ethernet, there's no Subnet Manager, no
Subnet Administrator, no InfiniBand SL, nor any other InfiniBand Subnet
Administration parameters.
Connection management in RoCE is based on the OFED RDMACM (RDMA
Connection Manager) service:
The OS IP stack is used to resolve remote (IP,hostname) tuples to
a DMAC.
The outgoing Ethernet interface and VLAN are determined according
to this resolution.
The appropriate RoCE device is selected accordingly.
Network parameters (such as MTU, SL, timeout) are set locally by
the RDMACM in accordance with kernel policy.
322. How do I run Open MPI over RoCE? (openib BTL)
Open MPI can use the OFED Verbs-based openib BTL for traffic
and its internal rdmacm CPC (Connection Pseudo-Component) for
establishing connections for MPI traffic.
So if you just want the data to run over RoCE and you're
not interested in VLANs, PCP, or other VLAN tagging parameters, you
can just run Open MPI with the openib BTL and rdmacm CPC:
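For example (a sketch; btl_openib_cpc_include is assumed here to be the MCA parameter that selects the rdmacm CPC):
shell$ mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...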
323. How do I tell Open MPI to use a specific RoCE VLAN? (openib BTL)
When a system administrator configures VLAN in RoCE, every VLAN is
assigned with its own GID. The QP that is created by the
rdmacm CPC uses this GID as a Source GID. When Open MPI
(or any other application for that matter) posts a send to this QP,
the driver checks the source GID to determine which VLAN the traffic
is supposed to use, and marks the packet accordingly.
Note that InfiniBand SL (Service Level) is not involved in this
process — marking is done in accordance with local kernel policy.
To control which VLAN will be selected, use the
btl_openib_ipaddr_include/exclude MCA parameters and
provide it with the required IP/netmask values. For
example, if you want to use a VLAN with IP 13.x.x.x:
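For example (a sketch; the 13.0.0.0/8 value is an illustrative netmask for the 13.x.x.x VLAN mentioned above, and selecting rdmacm via btl_openib_cpc_include is an assumption):
shell$ mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm --mca btl_openib_ipaddr_include "13.0.0.0/8" ...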
324. Does Open MPI support XRC? (openib BTL)
XRC (eXtended Reliable Connection) decreases the memory consumption
of Open MPI and improves its scalability by significantly decreasing
number of QPs per machine.
XRC is available on Mellanox ConnectX family HCAs with OFED 1.4 and
later.
XRC was removed in the middle of multiple release streams (which
were effectively concurrent in time) because there were known problems
with it and no one was going to fix it. Here are the versions where
XRC support was disabled:
In the 2.0.x series, XRC was disabled in v2.0.4.
In the 2.1.x series, XRC was disabled in v2.1.2.
In then 3.0.x series, XRC was disabled prior to the v3.0.0
release.
Specifically: v2.1.1 was the latest release that contained XRC
support. Note that it is not known whether it actually works,
however.
See this FAQ entry for instructions
how to tell Open MPI to use XRC receive queues.
325. How do I specify the type of receive queues that I want Open MPI to use? (openib BTL)
You can use the btl_openib_receive_queues MCA parameter to
specify the exact type of the receive queues for Open MPI to use.
This can be advantageous, for example, when you know the exact sizes
of messages that your MPI application will use — Open MPI can
internally pre-post receive buffers of exactly the right size. See this paper for more
details.
The btl_openib_receive_queues parameter
takes a colon-delimited string listing one or more receive queues of
specific sizes and characteristics. For now, all processes in the job
must use the same string. You can specify three kinds of receive
queues:
P : Per-Peer Receive Queues
S : Shared Receive Queues (SRQ)
X : eXtended Reliable Connection queues (see this FAQ item to see when XRC support was removed
from Open MPI)
The default value of the btl_openib_receive_queues MCA parameter
is sometimes equivalent to the following command line:
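As an illustration of the format only (the queue specifications below reuse the P and S examples later in this entry; they are not the actual device-specific defaults):
shell$ mpirun --mca btl_openib_receive_queues P,128,256,128,16:S,1024,256,128,32 ...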
In particular, note that XRC is (currently) not used by default (and
is no longer supported — see this FAQ item
for more information).
NOTE: Open MPI chooses a default value of btl_openib_receive_queues
based on the type of OpenFabrics network device that is found. The
text file $openmpi_packagedata_dir/mca-btl-openib-device-params.ini
(which is typically
$openmpi_installation_prefix_dir/share/openmpi/mca-btl-openib-device-params.ini)
contains a list of default values for different OpenFabrics devices.
See that file for further explanation of how default values are
chosen.
Per-Peer Receive Queues
Per-peer receive queues require between 1 and 5 parameters:
Buffer size in bytes: mandatory
Number of buffers: optional; defaults to 8
Low buffer count watermark: optional; defaults to (num_buffers / 2)
Credit window size: optional; defaults to (low_watermark / 2)
Number of buffers reserved for credit messages: optional; defaults to
((num_buffers × 2 - 1) / credit_window)
Example: P,128,256,128,16
128 byte buffers
256 buffers to receive incoming MPI messages
When the number of available buffers reaches 128, re-post 128 more
buffers to reach a total of 256
If the number of available credits reaches 16, send an explicit
credit message to the sender
Defaulting to ((256 × 2) - 1) / 16 = 31; this many buffers are
reserved for explicit credit messages
Shared Receive Queues
Shared Receive Queues can take between 1 and 4 parameters:
Buffer size in bytes: mandatory
Number of buffers: optional; defaults to 16
Low buffer count watermark: optional; defaults to (num_buffers / 2)
Maximum number of outstanding sends a sender can have: optional;
defaults to (low_watermark / 4)
Example: S,1024,256,128,32
1024 byte buffers
256 buffers to receive incoming MPI messages
When the number of available buffers reaches 128, re-post 128 more
buffers to reach a total of 256
A sender will not send to a peer unless it has fewer than 32 outstanding
sends to that peer
XRC Queues
Note that XRC is no longer supported in Open MPI. See this FAQ item for more details.
XRC queues take the same parameters as SRQs. Note that if you use
any XRC queues, then all of your queues must be XRC. Therefore,
to use XRC, specify the following:
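For example (a sketch only; the sizes reuse the SRQ example above, and recall that XRC support has since been removed):
shell$ mpirun --mca btl_openib_receive_queues X,1024,256,128,32 ...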
NOTE: the rdmacm CPC is not supported with
XRC. Also, XRC cannot be used when btls_per_lid > 1.
NOTE: the rdmacm CPC cannot be used unless the first QP is per-peer.
326. Does Open MPI support FCA?
Yes.
FCA (which stands for _Fabric Collective
Accelerator_) is a Mellanox MPI-integrated software package
that utilizes CORE-Direct
technology for implementing MPI collective communications.
A list of FCA parameters will be displayed if Open MPI has FCA support.
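For example, you can list them with ompi_info (a sketch following the ompi_info usage shown earlier in this FAQ):
shell$ ompi_info --param coll fca --level 9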
How do I tell Open MPI to use FCA?
shell$ mpirun --mca coll_fca_enable 1 ...
By default, FCA will be enabled only with 64 or more MPI processes.
To turn on FCA for an arbitrary number of ranks ( N ), please use
the following MCA parameters:
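A sketch, assuming coll_fca_np is the parameter that sets this rank threshold (that name does not appear elsewhere in this FAQ):
shell$ mpirun --mca coll_fca_enable 1 --mca coll_fca_np N ...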
327. Does Open MPI support MXM?
MXM support is currently deprecated and replaced by UCX.
328. Does Open MPI support UCX?
Yes.
UCX is an open-source
optimized communication library which supports multiple networks,
including RoCE, InfiniBand, uGNI, TCP, shared memory, and others. UCX
mixes-and-matches transports and protocols which are available on the
system to provide optimal performance. It also has built-in support
for GPU transports (with CUDA and ROCm providers) which lets
RDMA-capable transports access the GPU memory directly.
UCX is enabled and selected by default; typically, no additional
parameters are required. In this case, the network port with the
highest bandwidth on the system will be used for inter-node
communication, and shared memory will be used for intra-node
communication. To select a specific network device to use (for
example, mlx5_0 device port 1):
shell$ mpirun -x UCX_NET_DEVICES=mlx5_0:1 ...
It's also possible to force using UCX for MPI point-to-point and
one-sided operations:
shell$ mpirun --mca pml ucx --mca osc ucx ...
For OpenSHMEM, in addition to the above, it's possible to force using
UCX for remote memory access and atomic memory operations:
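A sketch, assuming oshrun is the OpenSHMEM launcher and spml is the framework that selects the remote memory access and atomics layer (neither name appears elsewhere in this FAQ):
shell$ oshrun --mca spml ucx ...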
329. I'm getting errors about "initializing an OpenFabrics device" when running v4.0.0 with UCX support enabled. What should I do?
The short answer is that you should probably just disable
verbs support in Open MPI.
The messages below were observed by at least one site where Open MPI
v4.0.0 was built with support for InfiniBand verbs (--with-verbs),
OFA UCX (--with-ucx), and CUDA (--with-cuda) with applications
running on GPU-enabled hosts:
WARNING: There was an error initializing an OpenFabrics device.
Local host: c36a-s39
Local device: mlx4_0
and
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: c36a-s39
Local adapter: mlx4_0
Local port: 1
These messages are coming from the openib BTL. As noted in the
messages above, the openib BTL (enabled when Open
MPI is configured --with-verbs) is deprecated in favor of the UCX
PML, which includes support for OpenFabrics devices. The openib BTL
is therefore not needed.
You can disable the openib BTL (and therefore avoid these messages)
in a few different ways:
Configure Open MPI --without-verbs. This will prevent building
the openib BTL in the first place.
Disable the openib BTL via the btl MCA param (see this FAQ item for
information on how to set MCA params). For example,
shell$ mpirun --mca btl '^openib' ...
Note that simply selecting a different PML (e.g., the UCX PML) is
not sufficient to avoid these messages. For example:
shell$ mpirun --mca pml ucx ...
You will still see these messages because the openib BTL is not only
used by the PML, it is also used in other contexts internally in Open
MPI. Hence, it is not sufficient to simply choose a non-OB1 PML; you
need to actually disable the openib BTL to make the messages go
away.
330. How can I find out what devices and transports are supported by UCX on my system?
Check out the UCX documentation
for more information, but you can use the ucx_info command. For
example:
shell$ ucx_info -d
331. What is cpu-set?
The --cpu-set parameter allows you to specify the logical CPUs to use in an MPI job.
From mpirun --help:
Comma-separated list of ranges specifying logical cpus allocated to this job.
shell$ mpirun --cpu-set 0,1,2,3 ...
The hwloc package can be used to get information about the topology on your host.
More information about hwloc is available here.
Here is a usage example with hwloc-ls.
Consider the following command line:
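A sketch of such a command line (the PU list 0,1,14,15 matches the explanation below; "..." stands for your application and its arguments):
shell$ mpirun --cpu-set 0,1,14,15 ...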
The explanation is as follows.
When hwloc-ls is run, its output shows the mapping of physical cores to logical PUs.
In the example command line, the logical PUs 0,1,14,15 correspond to physical cores 0 and 7 (as shown in the hwloc-ls output).
It is also possible to use hwloc-calc. The following command line will show all the available logical CPUs on the host:
shell$ hwloc-calc all -I pu
The following will show two specific hwthreads specified by physical ids 0 and 1:
shell$ hwloc-calc -I pu --physical-input pu:0 pu:1
332. Does Open MPI support connecting hosts from different subnets? (openib BTL)
Yes.
When using InfiniBand, Open MPI supports host communication between
separate subnets using the Mellanox IB-Router.
The support for IB-Router is available starting with Open MPI v1.10.3.
To enable routing over IB, follow these steps:
Configure Open MPI with --enable-openib-rdmacm-ibaddr.
Make sure to use an OpenSM with support for IB-Router (available in
MLNX_OFED starting version 3.3).
Select to use rdmacm with the openib BTL from the mpirun
command line.
Set the btl_openib_allow_different_subnets MCA parameter to 1
(it is 0 by default).
Set the btl_openib_gid_index MCA parameter to 1.
For example, to run the IMB benchmark on host1 and host2 which are on
separate subnets (i.e., they have different subnet_prefix
values), use the following command line:
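A sketch of such a command line, combining the MCA parameters from the steps above (the host names and the IMB-MPI1 binary are illustrative, and btl_openib_cpc_include is assumed to be how rdmacm is selected):
shell$ mpirun -np 2 -H host1,host2 --mca btl openib,self --mca btl_openib_cpc_include rdmacm --mca btl_openib_allow_different_subnets 1 --mca btl_openib_gid_index 1 ./IMB-MPI1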
NOTE: The rdmacm CPC cannot be used unless the first QP is per-peer.
If the default value of btl_openib_receive_queues is to use only SRQ
QPs, please set the first QP in the list to a per-peer QP.
Please see this FAQ entry for more
information on this MCA parameter.
333. What versions of Open MPI contain support for uDAPL?
The following versions of Open MPI contain support for uDAPL:
Open MPI series       uDAPL supported
v1.0 series           No
v1.1 series           No
v1.2 series           Yes
v1.3 / v1.4 series    Yes
v1.5 / v1.6 series    Yes
v1.7 and beyond       No
334. What is different between Sun Microsystems ClusterTools 7 and Open
MPI in regards to the uDAPL BTL?
Sun's ClusterTools is based on Open MPI with one significant
difference: Sun's ClusterTools includes uDAPL RDMA capabilities in the
uDAPL BTL. The Open MPI v1.2 uDAPL BTL does not include these RDMA
capabilities. These improvements do exist today in the Open MPI main
development branch and will be included in future Open MPI releases.
335. What values are expected to be used by the btl_udapl_if_include and btl_udapl_if_exclude MCA parameters?
The uDAPL BTL looks for a match in the uDAPL static registry, which is contained in the dat.conf file. Each non-commented, non-blank line is considered an interface. The first field of each interface entry is the value which must be supplied to the MCA parameter in question.
337. How come the value reported by ifconfig is not accepted by the btl_udapl_if_include/btl_udapl_if_exclude MCA parameter?
uDAPL queries a static registry defined in the dat.conf file to find available interfaces which can be used. As such, the uDAPL BTL needs to match the names found in the registry and these may differ from what is reported by ifconfig.
338. I get a warning message about not being able to register memory and possibly out of privileged memory while running on Solaris; what can I do?
The error message probably looks something like this:
WARNING: The uDAPL BTL is not able to register memory. Possibly out of
allowed privileged memory (i.e. memory that can be pinned). Increasing
the allowed privileged memory may alleviate this issue.
One thing to do is increase the amount of available privileged
memory. On Solaris your system administrator can increase the amount of
available privileged memory by editing the /etc/project file on the
nodes. For more information see the Solaris project man page.
shell% man project
As an example of increasing the privileged memory, first determine the
amount available (example of typical value is 978 MB):
shell% prctl -n project.max-device-locked-memory -i project default
NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT
project.max-device-locked-memory
privileged 978MB - deny -
system 16.0EB max deny -
To increase the amount of privileged memory, edit the /etc/project file:
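A sketch of such an entry, assuming the standard project(4) attribute syntax (the project name, project ID, and the 4 GB value are illustrative; adjust them for your site):
default:3::::project.max-device-locked-memory=(priv,4294967296,deny)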
339. What is special about MPI performance analysis?
The synchronization among the MPI processes can be a key
performance concern. For example, if a serial program spends a lot of
time in function foo(), you should optimize foo(). In contrast,
if an MPI process spends a lot of time in MPI_Recv(), not only is
the optimization target probably not MPI_Recv(), but you should in
fact probably be looking at some other process altogether. You should
ask, "What is happening on other processes when this process has the
long wait?"
Another issue is that a parallel program (in the case of MPI, a
multi-process program) can generate much more performance data than a
serial program due to the greater number of execution threads.
Managing that data volume can be a challenge.
340. What are "profiling" and "tracing"?
These terms are sometimes used to refer to two different kinds
of performance analysis.
In profiling, one aggregates statistics at run time — e.g., total
amount of time spent in MPI, total number of messages or bytes sent,
etc. Data volumes are small.
In tracing, an event history is collected. It is common to display
such event history on a timeline display. Tracing data can provide
much interesting detail, but data volumes are large.
341. How do I sort out busy wait time from idle wait, user time
from system time, and so on?
Don't.
MPI synchronization delays, which are key performance inhibitors you
will probably want to study, can show up as user or system time, all
depending on the MPI implementation, the type of wait, what run-time
settings you have chosen, etc. In many cases, it makes most sense for
you just to distinguish between time spent inside MPI from time spent
outside MPI.
Elapsed wall clock time will probably be your key metric. Exactly how
the MPI implementation spends time waiting is less important.
342. What is PMPI?
PMPI refers to the MPI standard profiling interface.
Each standard MPI function can be called with an MPI_ or PMPI_
prefix. For example, you can call either MPI_Send() or
PMPI_Send(). This feature of the MPI standard allows one to write
functions with the MPI_ prefix that call the equivalent PMPI_
function. Specifically, a function so written has the behavior of the
standard function plus any other behavior one would like to add. This
is important for MPI performance analysis in at least two ways.
First, many performance analysis tools take advantage of PMPI. They
capture the MPI calls made by your program. They perform the
associated message-passing calls by calling PMPI functions, but also
capture important performance data.
Second, you can use such wrapper functions to customize MPI behavior.
E.g., you can add barrier operations to collective calls, write out
diagnostic information for certain MPI calls, etc.
OMPI generally layers the various function interfaces as follows:
Fortran MPI_ interfaces are weak symbols for...
Fortran PMPI_ interfaces, which call...
C MPI_ interfaces, which are weak symbols for...
C PMPI_ interfaces, which provide the specified functionality.
Since OMPI generally implements MPI functionality for all languages in
C, you only need to provide profiling wrappers in C, even if your
program is in another programming language. Alternatively, you may
write the wrappers in your program's language, but if you provide
wrappers in both languages then both sets will be invoked.
There are a handful of exceptions. For example,
MPI_ERRHANDLER_CREATE() in Fortran does not call
MPI_Errhandler_create(). Instead, it calls some other low-level
function. Thus, to intercept this particular Fortran call, you need a
Fortran wrapper.
Be sure you make the library dynamic. A static library can experience
the linker problems described in the Complications section of the
Profiling Interface chapter of the MPI standard.
See the section on Profiling Interface in the MPI standard for more details.
343. Should I use those switches --enable-mpi-profile and
--enable-trace when I configure OMPI?
Probably not.
The --enable-mpi-profile switch enables building of the PMPI
interfaces. While this is important for performance analysis, this
setting is already turned on by default.
The --enable-trace switch enables internal tracing of OMPI/ORTE/OPAL calls.
It is used only for developer debugging, not MPI application
performance tracing.
344. What support does OMPI have for performance analysis?
The OMPI source base has some instrumentation to capture
performance data, but that data must be analyzed by other non-OMPI
tools.
PERUSE was a proposed MPI standard that gives information about
low-level behavior of MPI internals. Check the PERUSE web site for
any information about analysis tools. When you configure OMPI, be
sure to use --enable-peruse. Information is available describing
its integration with OMPI.
Unfortunately, PERUSE didn't win standardization, so it didn't really
go anywhere. Open MPI may drop PERUSE support at some point in the
future.
MPI-3 standardized the MPI_T tools interface API (see Chapter 14 in
the MPI-3.0 specification). MPI_T is fully supported starting with
v1.7.3.
VampirTrace traces the entry to and exit from the MPI layer,
along with important performance data, writing data using the open OTF
format. VT is available freely and can be used with any MPI.
Information is available
describing its integration with OMPI.
345. How do I view VampirTrace output?
While OMPI includes VampirTrace instrumentation, it does not
provide a tool for viewing OTF trace data. There is simply a
primitive otfdump utility in the same directory where other OMPI
commands (mpicc, mpirun, etc.) are located.
Another simple utility, otfprofile, comes with OTF software and
allows you to produce a short profile in LaTeX format from an OTF
trace.
The main way to view OTF data is with the Vampir tool. Evaluation licenses are available.
346. Are there MPI performance analysis tools for OMPI that I can download for free?
The OMPI distribution includes no such tools, but some general
MPI tools can be used with OMPI.
...we used to maintain a list of links here. But the list changes
over time; projects come, and projects go. Your best bet these days
is simply to use Google to find MPI tracing and performance analysis
tools.
347. Any other kinds of tools I should know about?
Well, there are other tools you should consider. Part of
performance analysis is not just analyzing performance per se, but
generally understanding the behavior of your program.
As such, debugging tools can help you step through or pry into the
execution of your MPI program. Popular tools include TotalView, which can be
downloaded for free trial use, and Arm DDT, which
also provides evaluation copies.
The command-line job inspection tool padb has been ported to
ORTE and OMPI.
348. How does Open MPI handle HFS+ / UFS filesystems?
Generally, Open MPI does not care whether it is running from
an HFS+ or UFS filesystem. However, the C++ wrapper compiler historically
has been called mpiCC, which of course is the same file
as mpicc when running on case-insensitive HFS+.
During the configure
process, Open MPI will attempt to determine if the build filesystem is
case sensitive or not, and assume the install file system is the same
way. Generally, this is all that is needed to deal with HFS+.
However, if you are building on UFS and installing to HFS+, you should
specify --without-cs-fs to configure to make sure Open
MPI does not build the mpiCC wrapper. Likewise, if you
build on HFS+ and install to UFS, you may want to specify
--with-cs-fs to ensure that mpiCC is installed.
349. How do I use the Open MPI wrapper compilers in XCode?
XCode has a non-public interface for adding compilers to XCode. A
friendly Open MPI user sent in a configuration file for XCode 2.3
(MPICC.pbcompspec), which will add
support for the Open MPI wrapper compilers. The file should be
placed in /Library/Application Support/Apple/Developer Tools/Specifications/.
Upon starting XCode, this file is loaded and added to the list of
known compilers.
To use the mpicc compiler: open the project, get info on the
target, click the rules tab, and add a new entry. Change the process rule
for "C source files" and select "using MPICC".
Before moving the file, the ExecPath parameter should be set
to the location of the Open MPI install. The BasedOn parameter
should be updated to refer to the compiler version that mpicc
will invoke — generally gcc-4.0 on OS X 10.4 machines.
Thanks to Karl Dockendorf for this information.
350. What versions of Open MPI support XGrid?
XGrid is a batch-scheduling technology that was included in
some older versions of OS X. Support for XGrid appeared in the
following versions of Open MPI:
Open MPI series     XGrid supported
v1.0 series         Yes
v1.1 series         Yes
v1.2 series         Yes
v1.3 series         Yes
v1.4 and beyond     No
351. How do I run jobs under XGrid?
XGrid support will be built if the XGrid tools are installed.
We unfortunately have little documentation on how to run with XGrid at
this point other than a fairly lengthy e-mail that Brian Barrett wrote
on the Open MPI user's mailing list:
Since Open MPI 1.1.2, we also support authentication using Kerberos.
The process is essentially the same, but there is no need to specify
the XGRID_PASSWORD field. Open MPI applications will then run as
the authenticated user, rather than nobody.
352. Where do I get more information about running under XGrid?
Please write to us on the user's mailing list. Hopefully any
replies that we send will contain enough information to create proper
FAQs about how to use Open MPI with XGrid.
353. Is Open MPI included in OS X?
Open MPI v1.2.3 was included in some older versions of OS X,
starting with version 10.5 (Leopard). It was removed in more recent
versions of OS X (we're not sure in which version it disappeared —
*but your best bet is to simply download
a modern version of Open MPI for your modern version of OS X*).
Note, however, that OS X Leopard does not include a Fortran compiler,
so the OS X-shipped version of Open MPI does not include Fortran
support.
If you need/want Fortran support, you will need to build your own copy
of Open MPI (assumedly when you have a Fortran compiler installed).
The Open MPI team strongly recommends not overwriting the OS
X-installed version of Open MPI, but rather installing it somewhere
else (e.g., /opt/openmpi).
354. How do I not use the OS X-bundled Open MPI?
There are a few reasons you might not want to use the OS
X-bundled Open MPI, such as wanting Fortran support, upgrading to a
new version, etc.
If you wish to use a community version of Open MPI, you can download
and build Open MPI on OS X just like any other supported platform. We
strongly recommend not replacing the OS X-installed Open MPI, but
rather installing to an alternate location (such as /opt/openmpi).
shell$ wget https://www.open-mpi.org/.../open-mpi....
shell$ tar xf openmpi-<version>.tar.bz2
shell$ cd openmpi-<version>
shell$ ./configure --prefix=/opt/openmpi 2>&1 | tee config.out
[...lots of output...]
shell$ make -j 4 2>&1 | tee make.out
[...lots of output...]
shell$ sudo make install 2>&1 | tee install.out
[...lots of output...]
shell$ export PATH=/opt/openmpi/bin:$PATH
shell$ ompi_info
[...see output from newly-installed Open MPI...]
Note that there is no need to add Open MPI's libdir to
LD_LIBRARY_PATH; Open MPI's shared library build process
automatically uses the "rpath" mechanism to automatically find the
correct shared libraries (i.e., the ones associated with this build,
vs., for example, the OS X-shipped OMPI shared libraries). Also note
that we specifically do not recommend adding Open MPI's libdir to
DYLD_LIBRARY_PATH.
If you build static libraries for Open MPI, there is an ordering
problem such that /usr/lib/libmpi.dylib will be found before
$libdir/libmpi.a, and therefore user-linked MPI applications that
use mpicc (and friends) will use the "wrong" libmpi. This can be
fixed by editing
OMPI's wrapper compilers to force the use of the Right libraries,
such as with the following flag when configuring Open MPI:
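One possibility (an assumption, not taken from this FAQ) is to pass the macOS linker flag -Wl,-search_paths_first through the wrapper link flags so that each library directory is searched for both static and shared libraries before /usr/lib is consulted:
shell$ ./configure --prefix=/opt/openmpi --with-wrapper-ldflags="-Wl,-search_paths_first" ...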
355. I am using Open MPI 2.0.x / v2.1.x and getting an error at application startup. How do I work around this?
On some versions of Mac OS X / macOS Sierra, the default
temporary directory location is sufficiently long that it is easy for
an application to create file names for temporary files which exceed
the maximum allowed file name length. With Open MPI v2.0.x, this can lead to
errors like the following at application startup:
shell$ mpirun ... my_mpi_app
[[53415,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../orte/orted/pmix/pmix_server.c at line 264
[[53415,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line
Or you may see something like this (v2.1.x):
shell$ mpirun ... my_mpi_app
PMIx has detected a temporary directory name that results
in a path that is too long for the Unix domain socket:
Temp dir: /var/folders/mg/q0_5yv791yz65cdnbglcqjvc0000gp/T/openmpi-sessions-502@anlextwls026-173_0/53422
Try setting your TMPDIR environmental variable to point to
something shorter in length.
The workaround for the Open MPI 2.0.x and v2.1.x release series is to set the
TMPDIR environment variable to /tmp or another short directory
name.
356. Is AIX a supported operating system for Open MPI?
No. AIX used to be supported, but none of the current Open
MPI developers has any platforms that require AIX support for Open
MPI.
Since Open MPI is an open source project, its features and
requirements are driven by the union of its developers. Hence, AIX
support has fallen away because none of us currently use AIX. All
this means is that we do not develop or test on AIX; there is no
fundamental technology reason why Open MPI couldn't be supported on
AIX.
AIX support could certainly be re-instated if someone who wanted AIX
support joins the core group of developers and contributes the
development and testing to support AIX.
357. Does Open MPI work on AIX?
There have been reports from random users that a small number
of changes are required to the Open MPI code base to make it work
under AIX. For example, see the following post on the Open MPI user's
list, reported by Ricardo Fonseca:
358. What is VampirTrace?
NOTE: VampirTrace was only included in Open
MPI from v1.3.x through v1.10.x. It was removed in the v2.0.0 release
of Open MPI. This FAQ question pertains to the versions of Open MPI
that contained VampirTrace.
VampirTrace is a program tracing package that can collect a
very fine grained event trace of your sequential or parallel
program. The traces can be visualized by the Vampir tool and a number
of other tools that read the Open Trace Format (OTF).
Tracing is interesting for performance analysis and optimization of
parallel and HPC (High Performance Computing) applications in general
and MPI programs in particular. In fact, that's where the letters
'mpi' in "Vampir" come from. Therefore, it is integrated into Open MPI
for convenience.
VampirTrace is included in Open MPI v1.3 and later.
VampirTrace consists of two main components: First, the
instrumentation part which slightly modifies the target program in
order to be notified about run-time events of interest. Simply replace
the compiler wrappers to activate it: mpicc to mpicc-vt, mpicxx
to mpicxx-vt and so on (note that the *-vt variants of the wrapper
compilers are unavailable before Open MPI v1.3). Second, the
run-time measurement part is responsible for data collection. This
can only be effective when the first part was performed — otherwise
there will be no effect on your program at all.
VampirTrace has been developed at ZIH, TU Dresden in collaboration with
the KOJAK project from JSC/FZ Juelich and is available as open source
software under the BSD license; see ompi/contrib/vt/vt/COPYING.
359. Where can I find the complete documentation of VampirTrace?
NOTE: VampirTrace was only included in Open
MPI from v1.3.x through v1.10.x. It was removed in the v2.0.0 release
of Open MPI. This FAQ question pertains to the versions of Open MPI
that contained VampirTrace.
A complete documentation of VampirTrace comes with the Open
MPI software package as PDF and HTML. You
can find it in the Open MPI source tree at ompi/contrib/vt/vt/doc/ or
after installing Open MPI in
$(install-prefix)/share/vampirtrace/doc/.
360. How do I instrument my MPI application with VampirTrace?
NOTE: VampirTrace was only included in Open
MPI from v1.3.x through v1.10.x. It was removed in the v2.0.0 release
of Open MPI. This FAQ question pertains to the versions of Open MPI
that contained VampirTrace.
All the necessary instrumentation of user functions as well as
MPI and OpenMP events is handled by special compiler wrappers (
mpicc-vt, mpicxx-vt, mpif77-vt, mpif90-vt ). Unlike the normal
wrappers ( mpicc and friends) these wrappers call VampirTrace's
compiler wrappers ( vtcc, vtcxx, vtf77, vtf90 ) instead of the
native compilers. The vt* wrappers use underlying platform compilers
to perform the necessary instrumentation of the program and link the
suitable VampirTrace library.
Original:
shell$ mpicc -c hello.c -o hello
With instrumentation:
shell$ mpicc-vt -c hello.c -o hello
For your application, simply change the compiler definitions in your
Makefile(s):
# original definitions in Makefile
#
# CC=mpicc
#
# CXX=mpicxx
#
# F90=mpif90
# replace with
CC=mpicc-vt
CXX=mpicxx-vt
F90=mpif90-vt
361. Does VampirTrace cause overhead to my application?
NOTE: VampirTrace was only included in Open
MPI from v1.3.x through v1.10.x. It was removed in the v2.0.0 release
of Open MPI. This FAQ question pertains to the versions of Open MPI
that contained VampirTrace.
By using the default MPI compiler wrappers ( mpicc, etc.) your
application will be run without any changes at all. The VampirTrace
compiler wrappers ( mpicc-vt etc.) link the VampirTrace library which intercepts
MPI calls and some user level function/subroutine calls. This causes a certain
amount of run-time overhead to applications. Usually, the overhead is reasonably
small (0.x% - 5%) and VampirTrace by default enables precautions to avoid
excessive overhead. However, it can be configured to produce very substantial
overhead using non-default settings.
362. How can I change the underlying compiler of the mpi*-vt wrappers?
NOTE: VampirTrace was only included in Open
MPI from v1.3.x through v1.10.x. It was removed in the v2.0.0 release
of Open MPI. This FAQ question pertains to the versions of Open MPI
that contained VampirTrace.
Unlike the standard MPI compiler wrappers ( mpicc etc.) the
environment variables OMPI_CC, OMPI_CXX, OMPI_F77, OMPI_F90 do not
affect the VampirTrace compiler wrappers. Please, use the environment
variables VT_CC, VT_CXX, VT_F77, VT_F90 instead. In addition, you
can set the compiler with the wrapper's option -vt:[cc|cxx|f77|f90].
The following two are equivalent, setting the underlying compiler to
gcc:
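For example, reusing the hello.c example above (the exact placement of the compiler name after the -vt:cc option is assumed):
shell$ VT_CC=gcc mpicc-vt -c hello.c -o hello
shell$ mpicc-vt -vt:cc gcc -c hello.c -o hello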
Furthermore, you can modify the default settings in
/share/openmpi/mpi*-wrapper-data.txt.
363. How can I pass VampirTrace related configure options through the
Open MPI configure?
NOTE: VampirTrace was only included in Open
MPI from v1.3.x through v1.10.x. It was removed in the v2.0.0 release
of Open MPI. This FAQ question pertains to the versions of Open MPI
that contained VampirTrace.
To give options to the VampirTrace configure script you can add this
to the configure option: --with-contrib-vt-flags.
The following example passes the options --with-papi-lib-dir and --with-papi-lib
to the VampirTrace configure script to specify the location and name of the PAPI
library:
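A sketch, assuming a hypothetical PAPI installation under /opt/papi:
shell$ ./configure --with-contrib-vt-flags="--with-papi-lib-dir=/opt/papi/lib --with-papi-lib=-lpapi" ...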
364. How do I completely disable the integrated VampirTrace?
NOTE: VampirTrace was only included in Open
MPI from v1.3.x through v1.10.x. It was removed in the v2.0.0 release
of Open MPI. This FAQ question pertains to the versions of Open MPI
that contained VampirTrace.
By default, the VampirTrace part of Open MPI will be built and
installed. If you would like to disable building and installing of
VampirTrace, add the value vt to the configure option
--enable-contrib-no-build.
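For example (a minimal sketch of such a configure invocation):
shell$ ./configure --enable-contrib-no-build=vt ...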