Skip to content

Known issues#

This page provides details on a couple of known problems, and the workarounds that are available for them.

If you have any questions related to these issues, please contact the HPC-UGent team.


Operation not permitted error for MPI applications#

When running an MPI application that was installed with a foss toolchain, you may run into crash with an error message like:

Failed to modify UD QP to INIT on mlx5_0: Operation not permitted

This error means that an internal problem has occurred in OpenMPI.

Cause of the problem#

This problem was introduced with the OS updates that were installed on the HPC-UGent and VSC Tier-1 Hortense clusters mid February 2024, most likely due to updating the Mellanox OFED kernel module.

It seems that having OpenMPI consider both UCX and libfabric as "backends" to use the high-speed interconnect (InfiniBand) is causing this problem: the error message is reported by UCX, but the problem only occurs when OpenMPI is configured to also consider libfabric.

Affected software#

We have been notified that this error may occur with various applications, including (but not limited to) CP2K, LAMMPS, netcdf4-python, SKIRT, ...

Workarounds#

Use latest vsc-mympirun#

A workaround as been implemented in mympirun (version 5.4.0).

Make sure you use the latest version of vsc-mympirun by using the following (version-less) module load statement in your job scripts:

module load vsc-mympirun

and launch your MPI application using the mympirun command.

For more information, see the mympirun documentation.

Configure OpenMPI to not use libfabric via environment variables#

If using mympirun is not an option, you can configure OpenMPI to not consider libfabric (and only use UCX) by setting the following environment variables (in your job script or session environment):

export OMPI_MCA_btl='^uct,ofi'
export OMPI_MCA_pml='ucx'
export OMPI_MCA_mtl='^ofi'

Resolution#

We will re-install the affected OpenMPI installations during the scheduled maintenance of 13-17 May 2024 (see also VSC status page).