First 5 minutes of hell
A blocking operation is the kind of operation, which returns only when a certain completion condition is satisfied.
MPI_Sendis a blocking operation (it is completed when the input buffer can be modified (= the sending of all the data is complete)) - realizes “standard mode”
- there are two options (decided by the MPI library, the code must handle both)
- the data are either sent to the destination process (non-local operation - the completion of the MPI_Send depeds on the data reception from the destination process)
- or the data are copied to the temporary system buffer (for later sending) - this is a local operation (the completion is after loading the buffer, it does not depend on the reception of the destination process)
- the uncertainty is good for portability of the MPI application across multiple systems (each system handles buffering in a different way and the run of the program should not depend on it)
- before sending, there is a small handshake in the background
- the sender sends a tiny rendesvous message “hey, I have N bytes with tag T for you, are you ready?”
- receiver firstly has to get to its
MPI_Recvand then replies, yes, I am ready, send it to the address 0x… in my memory- sender then transfers the data
MPI_Bsend- realized “buffered mode”, only local operation (the completion does not depend on the destination process receiving the data)
- the programmer has to prepare a buffer to load the data into (calling
MPI_Buffer_attach), in case of calling multiple MPI_Bsend functions, the buffer has to be able to store the sum of all data- buffering may improve the performance of a correct program, but a sufficient large buffer must always be allocated
MPI_Ssend- realizes the “synchronous mode”, it is blocking until the destination process initializes the data reception, so it is only non-local operation
MPI_Rsend- realizes “ready mode”, the receiving process must already be waiting for the message (e.g. withMPI_Recvand buffer already allocated)
- the handshake is omitted, because the receiver is ready (the library trusts it, so it sends the message right away, if the receiver is not waiting, the system is undefined)
- it is non-local operation (the receiver has to be involved)
There are two communication modes:
- point-to-point: communication between two MPI processes
- collective: communication among all MPI processes associated with the given communicator
- communicator is a group of processes within which is allowed communication (each communication function has a parameter for the communicator to specify to which processes could the message be sent)
- the implicit default is:
MPI_COMM_WORLD= all defined processes within the MPI runMPI_Comm_rank,MPI_Comm_size= functions to get the rank within the comm. group and the number of MPI processes in the comm. groupState object -
MPI_Statusis a data structure informing about the status of the source and of the communication for the receiving process:
- it is populated by the
MPI_Recvfunction and it contains information:
- “who sent this message?”, useful, when
MPI_ANY_SOURCEis used- “what tag is there?”, useful, when
MPI_ANY_TAGis used- how many elements will arrive?, the
MPI_Recvhas specified only the maximum of elements to arrive (there could be less)
- this is not directly in the struct, but needs to be retrieved using
MPI_Get_count(&status, MPI_INT, &received);function
Definition: blocking operation
An MPI operation is blocking if the corresponding MPI function returns only after a defined completion condition has been satisfied. Concretely:
MPI_Send,MPI_Bsend,MPI_Ssend,MPI_Rsendare blocking in the sense that after they return, the input buffer (buf) can be safely modified - the data is “gone” (either delivered or safely captured by the MPI subsystem).MPI_Recvis blocking in the sense that after it returns, the received data is safely in its buffer.
Example illustrating the blocking-send guarantee:
int c = 10;
MPI_Send(&c, 1, MPI_INT, ...);
c = 20; // cannot influence the sent data; the destination receives 10Function signatures
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm,
MPI_Status *status);Parameters:
buf- pointer to send buffer / receive buffer (contiguous memory).count- on send, the number of elements; on receive, the maximum number of elements that fit.datatype- MPI datatype (MPI_INT,MPI_DOUBLE,MPI_UINT64_T, …, or a user-builtMPI_Type_create_*type).dest/source- rank of the partner process within the communicator.sourcemay beMPI_ANY_SOURCE.tag- integer label distinguishing semantically different messages.tagon receive may beMPI_ANY_TAG.comm- communicator (e.g.MPI_COMM_WORLD); sender and receiver must use the same one.status- output state object (orMPI_STATUS_IGNORE).
The four send communication modes
1. Standard mode - MPI_Send
Returns when the data is either:
- sent to the destination process, or
- copied into a temporary system buffer for later sending.
The choice is up to the MPI library - the programmer cannot assume which option will be taken. Consequences:
- In case (1), completion depends on the receiver becoming ready ⇒
MPI_Sendis a non-local operation. - In case (2),
MPI_Sendcan return even before the receiver has initiated reception ⇒ behaves as a local operation.
Portable code must remain correct under both choices.
2. Buffered mode - MPI_Bsend
- Completion does not depend on the receiver:
MPI_Bsendis a local operation. - The MPI library copies the data into a user-supplied buffer that must be pre-attached:
MPI_Buffer_attach(buf, bs); // bs must include MPI_BSEND_OVERHEAD per pending Bsend
... MPI_Bsend(...) ...
MPI_Buffer_detach(&buf, &bs);- If a process issues several
MPI_Bsends, the buffer must accommodate the sum of their sizes plus per-messageMPI_BSEND_OVERHEAD. OtherwiseMPI_Bsendmay complete with an error. - Trades extra user memory for guaranteed local completion.
3. Synchronous mode - MPI_Ssend
- The function does not return until the destination process initiates the data reception - a rendez-vous synchronization.
- Non-local operation. Whoever (sender or receiver) arrives first blocks until the other catches up.
- Useful when the program must be sure the receiver actually engaged.
4. Ready mode - MPI_Rsend
- If the receiver has not already initiated reception, the function terminates with an error and behaviour is undefined.
- Otherwise it returns once the receiver initiates reception. Non-local.
- Use only when correctness of the program guarantees the receiver is already posted - then the implementation can skip a handshake and run faster.
Why standard mode is deliberately ambiguous (motivation)
- In standard mode MPI gives the user no control over whether the message is buffered or not.
- The reason is portability: any system runs out of buffering memory if message sizes grow, and different implementations buffer to different degrees for efficiency.
- MPI takes the position that a correct (and thus portable) MPI program must not depend on system buffering.
- Buffering may improve performance of a correct program but must not affect its result.
- If buffering is required for correctness, the programmer must use buffered mode and explicitly allocate a sufficiently large user buffer.
Programmers cannot assume which option standard mode will take. Portable code must work under either - and must never rely on system buffering for correctness.
The state object: MPI_Status
Some MPI functions, notably MPI_Recv, return a pointer to an MPI_Status structure carrying metadata about the just-received message:
MPI_Status status;
MPI_Recv(&buf, count, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
MPI_COMM_WORLD, &status);Public fields:
status.MPI_SOURCE- rank of the source process.status.MPI_TAG- tag of the received message.
The size (number of received elements) is not a struct field; it must be queried separately:
int received;
MPI_Get_count(&status, MPI_INT, &received);If the caller does not need the metadata, the special placeholder MPI_STATUS_IGNORE can be passed:
MPI_Recv(..., MPI_STATUS_IGNORE);Canonical use of MPI_Status: receiving a message of unknown size
Pre-allocate a buffer large enough for the maximum possible message, receive into it, query the actual count, then resize:
static const int count = 1000;
std::vector<int> v(count);
MPI_Status status;
MPI_Recv(&v[0], count, MPI_INT,
MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
std::cout << "Source: " << status.MPI_SOURCE
<< " Tag: " << status.MPI_TAG << std::endl;
int received;
MPI_Get_count(&status, MPI_INT, &received);
v.resize(received); // free unused tailThis is also the foundation for combining MPI_Probe with MPI_Recv (covered separately under probing).
Wildcards on receive
source = MPI_ANY_SOURCE- accept message from any sender (typical in master-slave: master does not know which slave will finish first, then learns the rank fromstatus.MPI_SOURCE).tag = MPI_ANY_TAG- accept any tag (recover the actual tag fromstatus.MPI_TAG).
Matching rules
A send and a receive match when:
- they share the same communicator,
- the sender’s
destequals the receiver’s process rank, and the receiver’ssourceequals the sender’s rank (orMPI_ANY_SOURCE), - the tag matches exactly (or the receiver uses
MPI_ANY_TAG), - the datatype is the same,
- the receive buffer is at least as large as the sent message.
Local vs non-local summary
MPI_Bsend- local (depends only on user buffer).MPI_Send- non-local in case (1), local in case (2); programmer cannot assume.MPI_Ssend- non-local (rendez-vous).MPI_Rsend- non-local (requires receiver pre-posted).MPI_Recv- non-local (depends on a matching send).
Cyclic shift: a paradigmatic deadlock illustration
The cyclic shift (proc_num sends to (proc_num+1) % num_procs) implemented naively as MPI_Send followed by MPI_Recv may deadlock under standard mode (if the implementation chooses option 1 for all processes simultaneously) and always deadlocks under MPI_Ssend. This is the classical dining-philosophers situation and is the prime motivation for either alternating send/receive order on even/odd ranks, using MPI_Bsend, switching to nonblocking primitives, or using MPI_Sendrecv.
The probability of deadlock with
MPI_Sendis not 100%, but it is also not 0% - the worst possible situation for portable code. WithMPI_Ssendit is always 100%.
Potential exam questions
- Define what it means for an MPI operation to be blocking. State the precise post-condition guaranteed by
MPI_Sendand byMPI_Recv. - List the four blocking send modes in MPI and give the completion condition of each.
- What is the difference between a local and a non-local operation? Classify
MPI_Send,MPI_Bsend,MPI_Ssend,MPI_Rsendaccordingly. - Explain standard mode (
MPI_Send). Why is it deliberately under-specified, and what does this imply for portable programs? - Describe the rendez-vous semantics of
MPI_Ssend. When would you choose it deliberately? - Describe
MPI_Bsend. What is the user’s responsibility regarding buffer management, and what happens if the buffer is too small? - When is
MPI_Rsendsafe to use, and what is the penalty if its precondition is violated? - Give the full signatures of
MPI_SendandMPI_Recv, and explain every parameter. - What is the
MPI_Statusobject? What information does it carry, how is the message size obtained, and when do you passMPI_STATUS_IGNORE? - Show a code pattern that uses
MPI_StatusandMPI_Get_countto receive a message of a priori unknown size into astd::vectorand trim it to the actual size. - Explain the wildcards
MPI_ANY_SOURCEandMPI_ANY_TAG. Give a typical use case for each. - Why does the naive
MPI_Send-then-MPI_Recvimplementation of a cyclic shift sometimes deadlock underMPI_Sendand always deadlock underMPI_Ssend?