wiki:DevelopingInMPI

Developing the MPI version of Cloudy

This page will give a brief description of some of the issues that are relevant when you are developing the MPI version of Cloudy.

Running the code

First you need to make sure that you have MPI installed on your computer. You will need MPI version 2 or newer to run Cloudy. On Linux machines you will typically have packages for MPICH2, LAM/MPI, and/or Open MPI (the latter is a further development of LAM/MPI, which is now in maintenance-only mode). All of these support MPI 2. On HPC machines and clusters you may need to issue a module load command to make MPI visible. Several versions of MPI may be available. The command module avail should give a full list of all the available modules. The command module list will give a list of all the modules that are already loaded.

Typically an MPI installation will provide an mpiCC script or binary as a wrapper around the compiler. The command may also be called mpicxx or mpic++ instead of mpiCC. Some implementations (e.g. the SGI Message Passing Toolkit) do not provide a wrapper, and you need to call g++ or icc directly. In that case you must explicitly add the MPI libraries during the link phase (mpiCC conveniently adds them for you). Contact your local helpdesk for further information. To find out which compiler mpiCC is based on, type mpiCC --version. Typically it will be either g++ or icc (though on big clusters it may also be another commercial compiler). Go to the sys_mpi_gcc or sys_mpi_icc directory, as appropriate, and type make -j <n>. This will create an MPI-capable version of Cloudy. If you only have mpicxx, but not mpiCC, type make -j <n> CXX=mpicxx. If you have no compiler wrapper at all, something like this may work: make -j <n> CXX=g++ LDLIBS='-lmpi++ -lmpi'

To run the code, issue the command:

mpirun -np <n> /path/to/cloudy.exe -r <input_script>

where <n> is the number of cores you want to use. The command may also be called mpiexec, orterun, etc. Contact your local helpdesk for further information. The "-r" flag redirects input and output. Alternatively, you can use the "-p" flag, which does the same thing but additionally sets the punch file prefix. Note that <input_script> should NOT have the customary ".in" extension; that is added implicitly. See Hazy 3 for further details. Also note that using the command line option "-r" or "-p" is essential. At least in LAM/MPI, mpirun passes the open input stream only to the master rank, while the other ranks get /dev/null. Since Cloudy assumes that all ranks have full knowledge of what is in the input script, using the normal I/O redirection leads to obscure problems. Other MPI distros may behave differently, though. Also note that there may be subtle differences in the way mpirun works between MPI distros, so check your man pages.

Mpirun will start <n> copies of Cloudy, possibly running on different computers. Each of these copies has a unique rank number ranging from 0 to <n-1>. Below we will call rank 0 the master rank. The method cpu.i().lgMaster() will return a bool saying if a rank is the master rank.

Mpirun will need to know what the names are of the computers that you want to use. On large clusters this information will usually be passed on automatically, especially if they are running a batch system. On small (home-made) clusters you will need to pass that information yourself. In LAM/MPI you need the lamboot and lamwipe commands for that, but other distros will have other methods, so check the man pages.

Windows support

Running MPI versions of Cloudy under Windows is currently not tested. Open MPI does support Windows, but this is still very fresh. It is extremely doubtful that any large cluster is running Windows at the time of this writing, so for now the use of Windows MPI would be limited to "home use" by adventurous individuals.

Here is a blog outlining how to build and run an MPI code using Visual Studio 2008; the example given did work on a 64-bit Vista install.

Mac OS X support

MPI on OS X has been tested. The OS X MPI environment is described on this Apple page and on this sourceforge page.

Supported commands

Currently the following modes are supported in MPI:

  • Optimize phymir runs
  • Grid runs

Some commonly used MPI commands

We use the C bindings and not the C++ bindings because the latter have been deprecated in MPI-2.2 and removed in MPI-3.

Every MPI program contains the following 4 calls:

  • MPI_Init( &argc, &argv );
  • MPI_Comm_size( MPI_COMM_WORLD, &nCPU );
  • MPI_Comm_rank( MPI_COMM_WORLD, &nRANK );
  • MPI_Finalize();

The calls to MPI_Init() and MPI_Finalize() should be pretty much the first and last thing you do when running an MPI program. This is why they are located in main(). MPI_Comm_size() gives you the total number of ranks that have been started by mpirun, while MPI_Comm_rank() returns the number of this specific rank. These routines are also called in main() and the results are stored in the cpu class. You should normally use the cpu.i().nCPU() and cpu.i().nRANK() methods to retrieve this information, which will save you a lot of typing and is easier to remember.

The other MPI routines that are currently used in Cloudy are:

  • MPI_Barrier( MPI_COMM_WORLD );
  • MPI_Bcast( v, 20, MPI_DOUBLE, n, MPI_COMM_WORLD );
  • MPI_Reduce( s, r, 20, MPI_DOUBLE, MPI_SUM, n, MPI_COMM_WORLD );

The MPI_Barrier() command will synchronize all the ranks (i.e. cause the ranks to wait until they all reached this point in the code). The MPI_Bcast() command will transfer the contents of the specified data item(s) from rank n to all the other ranks. For receiving ranks, the call will block until the data has been received. In the example above, v happens to be an array of 20 doubles. The MPI_Reduce() command will collect data from all ranks and perform an operation on them. In the example above, s is an array of 20 doubles that each rank has. The contents of these arrays will be summed up in the array r on rank n.

Coding an MPI_Bcast(), MPI_Reduce(), etc., as shown above can be difficult or dangerous. What if v were a realnum array? The size of that type depends on the setting of FLT_IS_DBL. You could of course write:

#ifdef FLT_IS_DBL
MPI_Bcast( v, 20, MPI_DOUBLE, n, MPI_COMM_WORLD );
#else
MPI_Bcast( v, 20, MPI_FLOAT, n, MPI_COMM_WORLD );
#endif

but that is cumbersome, and only solves part of the problem. What if you later decide that you are better off making v an array of doubles and forget to adjust the call to MPI_Bcast()? Then you will be stuck with a mysterious bug...

A better solution is to write the following:

MPI_Bcast( v, 20, MPI_type(v), n, MPI_COMM_WORLD );

which will automatically pass the correct argument to MPI_Bcast(). Note that despite appearances, this is not an MPI command, but an extension that we defined in mpi_utilities.h. MPI_type() can take both arrays and single variables of any POD type currently in use in Cloudy (except bool). If need arises we could also add support for container classes like vector, valarray, multi_arr, etc.

PS - The fact that type starts with a lower-case 't' guarantees that it can never clash with any current or future MPI symbols.

PS2 - For the call to MPI_Bcast() to be really safe, the value 20 should not be hard-coded either of course, but that is not within the scope of this page.

Note that we should never use MPI_Abort()! This is because we want a grid to run to completion, even when problems occur in certain corners of the grid. On the other hand, inside Cloudy it is OK to use the normal abort mechanisms like cdEXIT(), TotalInsanity(), ASSERT(), etc. These will be caught by cdMain() and the grid can continue despite the error condition.

Doing I/O

One of the biggest headaches when running an MPI code is getting the output right. When MPI rank x has written to a file, you have to be careful that rank y doesn't start writing to the same file, since that would wipe out what rank x has already written. This can happen quite easily since all the ranks run the same code! ROMIO should be able to solve all that, but for now we use a simpler approach that seems sufficient for our current needs.

There are two rules that govern I/O:

  • Each rank writes to a different output file.
  • Only the master rank produces the main output in MPI runs; each rank produces its own output file when processing the individual grid scripts.

To achieve this, some changes have been made to the way called.lgTalk is set. All of the control for this is now in the cpu class. The following methods are relevant to this:

  • cpu.i().lgMPI() : are we running in MPI mode?
  • cpu.i().lgMaster() : is this rank the master rank (will return true in non-MPI runs)?
  • cpu.i().nCPU() : return the total number of ranks in COMM_WORLD (the number of cores in non-MPI runs).
  • cpu.i().nRANK() : return the rank number in COMM_WORLD (will return 0 in non-MPI runs).
  • cpu.i().lgMPI_talk() : returns the default setting for called.lgTalk, even in non-MPI runs.
  • cpu.i().lgMPISingleRankMode() : returns true if each rank runs its own model (used in grid runs).

Methods for setting these quantities are also supplied. This is done in main(), and it is best not to fiddle with these settings elsewhere in the code. The most important rules to remember are:

  • When you need to reset called.lgTalk to a default value, use called.lgTalk = cpu.i().lgMPI_talk().
  • When you want to suppress output from a slave rank, if possible use if( called.lgTalk ) ....

When you do an MPI run of a script "some_script", the output of the master rank will appear in "some_script.out", while the output of the slave ranks will appear in "some_script.err01", "some_script.err02", etc. The latter will usually be empty, or contain error messages. If the slave output is empty, the file will be automatically removed at the end of the MPI run.

In grid runs, the output for each of the grid points will initially be in "grid000000000_some_script.out", "grid000000001_some_script.out", etc. At the end of the grid run, Cloudy will concatenate all the output in the main output file and delete the output files from the individual grid points.

A similar procedure is followed for the punch output. If your script produces a punch file "some_script.pch", then each grid point produces a file "grid000000000_some_script.pch", etc., and the files will be concatenated at the end of the run. If you set the variable punch.lgPunchToSeparateFiles[punch.npunch] to true in parse_punch.cpp, then the concatenation will not be done and the files from each grid point will be kept separate (this works in non-MPI runs as well).

Note that the concerns about clobbering output apply to any file that Cloudy produces, not just the main output. If you need to write additional files (such as the state file in a Phymir run), make sure that the code for opening, writing, and closing the file is protected with an "if( cpu.i().lgMaster() )" conditional. Another rank may be used if that is more convenient, as long as you make sure that only one rank does the I/O.

Writing new MPI code

In order to prevent cluttering up the code with "#ifdef MPI_ENABLED" statements (which will quickly render it very hard to read and maintain), I have decided to create preprocessor stubs for MPI routines that remove the calls from the code in non-MPI mode and replace them with calls to TotalInsanityAsStub(). The stubs are located in "mpi_utilities.h", so all code that includes MPI statements should include this file. Typically a piece of MPI code will look like this:

#include "mpi_utilities.h" // note there MUST not be an #ifdef MPI_ENABLED surrounding this!

realnum v[20];
if( cpu.i().lgMPI() )
{
    MPI_Bcast( v, 20, MPI_type(v), 0, MPI_COMM_WORLD );
}

This transfers the contents of the array v[20] to the other ranks. In a non-MPI compilation the compiler will see this (after inlining):

realnum v[20];
if( cpu.i().lgMPI() )
{
    TotalInsanityAsStub<int>();
}

The call to TotalInsanityAsStub() makes the code safe in case you forget the cpu.i().lgMPI() conditional. The stub macro discards the arguments of the MPI_Bcast() call, which reduces the need for further stubs. The template TotalInsanityAsStub<T> pretends to return a value of type T (though in practice it never returns), which is useful when you want to #define away a routine that assigns to an lvalue. This setup also prevents complaints by the compiler about unreachable code after the call that is #define'd away. Currently stubs are supplied for all MPI routines listed above; others should be added when the need arises.


Return to DeveloperPages

Return to main wiki page


Last modified on 2013-06-04T16:55:42Z