MPI Jobs

MPI allows users to run a parallel program across multiple nodes. The latest update lets users run MPI jobs simply by adding the following four parameters to a job:

  • --mpi: a flag letting Lancium know that the user wishes to run an MPI job.
  • --tasks <INTEGER>: the total number of MPI tasks in the program.
  • --tasks_per_node <INTEGER>: the number of MPI tasks to run per node.
  • --mpi_version <TEXT>: the flavor and version of MPI to run the program on.

Before running an MPI job, the program must be compiled against the correct MPI version. For example, if using the MPI flavor “mpich2”, then the program must be compiled against MPICH2 version 3.3a2. All MPI jobs run on full nodes, so an MPI job occupies a minimum of one node. The number of nodes is calculated by taking the ceiling of tasks divided by tasks_per_node. The only MPI flavor currently available is MPICH2.
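
For example, a job requesting 4 tasks with 2 tasks per node is allocated ceil(4 / 2) = 2 nodes. Below is a minimal C sketch of that calculation; the nodes_required helper is purely illustrative and not part of any Lancium tooling.

#include <stdio.h>

/* Integer ceiling of tasks / tasks_per_node, as described above (illustrative only). */
static int nodes_required(int tasks, int tasks_per_node)
{
    return (tasks + tasks_per_node - 1) / tasks_per_node;
}

int main(void)
{
    /* The example job later in this section: 4 tasks, 2 per node -> 2 nodes. */
    printf("nodes: %d\n", nodes_required(4, 2));
    return 0;
}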

The Lancium Compute grid is only suitable for low-degree parallel, loosely coupled MPI applications. The networking between compute nodes is 25 Gb/s Ethernet, so programs that require very low latency will not run well. MPI jobs are currently limited to 90 nodes with 24 MPI tasks per node. Regardless of the number of MPI tasks per node requested, user jobs will be allocated, and charged for, exclusive use of each node.
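
Concretely, a request must keep tasks_per_node at or below 24 and the resulting node count at or below 90, which works out to at most 90 × 24 = 2,160 MPI tasks per job. The C sketch below checks a request against those limits; the within_mpi_limits helper is hypothetical and not part of any Lancium tooling.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative check against the limits stated above: 90 nodes, 24 MPI tasks per node. */
static bool within_mpi_limits(int tasks, int tasks_per_node)
{
    int nodes = (tasks + tasks_per_node - 1) / tasks_per_node; /* ceiling division */
    return tasks_per_node <= 24 && nodes <= 90;
}

int main(void)
{
    printf("%s\n", within_mpi_limits(2160, 24) ? "ok" : "too large"); /* prints "ok" */
    printf("%s\n", within_mpi_limits(2200, 24) ? "ok" : "too large"); /* prints "too large" */
    return 0;
}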

Example: Finding Available MPI Versions

$ lcli resources show -t mpi
[
    {
        "version_id": "mpich2",
        "name": "MPICH2 v3.3a2",
        "description": "MPich2 v3.3a2 as distributed by Ubuntu 18.04"
    }
]

After choosing an MPI flavor, please compile your program against the correct MPI version. For example, to compile the ring_c_mpich program:

$ mpicc -o ring_c_mpich ring_c_mpich.c

The ring_c_mpich program simply passes a message around the ring of MPI processes ten times; each process then exits and the program finishes.

Example: Running the ring_c_mpich Program

This is the ring_c_mpich program written in C:


/*
 * Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2006      Cisco Systems, Inc.  All rights reserved.
 *
 * Simple ring test program in C.
 */
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
    int rank, size, next, prev, message, tag = 201;
    /* Start up MPI */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Calculate the rank of the next process in the ring.  Use the
       modulus operator so that the last process "wraps around" to
       rank zero. */
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;
    /* If we are the "master" process (i.e., MPI_COMM_WORLD rank 0),
       put the number of times to go around the ring in the
       message. */
    if (0 == rank) {
        message = 10;
        printf("Process 0 sending %d to %d, tag %d (%d processes in ring)\n", message, next, tag,
               size);
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        printf("Process 0 sent to %d\n", next);
    }
    /* Pass the message around the ring.  The exit mechanism works as
       follows: the message (a positive integer) is passed around the
       ring.  Each time it passes rank 0, it is decremented.  When
       each processes receives a message containing a 0 value, it
       passes the message on to the next process and then quits.  By
       passing the 0 message first, every process gets the 0 message
       and can quit normally. */
    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (0 == rank) {
            --message;
            printf("Process 0 decremented value: %d\n", message);
        }
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        if (0 == message) {
            printf("Process %d exiting\n", rank);
            break;
        }
    }
    /* The last process does one extra send to process 0, which needs
       to be received before the program can exit */
    if (0 == rank) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    /* All done */
    MPI_Finalize();
    return 0;
}

After choosing an MPI version and compiling our program against that version, we are ready to send out an MPI job.

$ lcli job run -n ring --image ubuntu --command './ring_c_mpich' --input-file ~/bin/ring_c_mpich --mpi --tasks 4 --tasks_per_node 2 --mpi_version mpich2


{
    "id": 71616,
    "name": "ring",
    "status": "created",
    "qos": "high",
    "command_line": "./ring_c_mpich",
    "image": "ubuntu",
    "resources": {
        "gpu_count": null,
        "gpu": null,
        "scratch": null,
        "mpi": true,
        "mpi_version": "mpich2",
        "tasks": 4,
        "tasks_per_node": 2
    },
    "max_run_time": 259200,
    "input_files": [
        {
            "id": 34486,
            "name": "ring_c_mpich",
            "source_type": "file",
            "source": "/home/rap/bin/ring_c_mpich",
            "cache": false,
            "upload_complete": true,
            "chunks_received": [
                [
                    1,
                    12664
                ]
            ]
        }
    ],
    "created_at": "2022-07-28T18:05:47.995Z",
    "updated_at": "2022-07-28T18:05:47.995Z"
}

Let's look at the output:
$ lcli job output get --view --file stdout.txt 71616
'Process 0 sending 10 to 1, tag 201 (4 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting'

Note that there was no need to call mpirun; the job command simply executes the binary.