
Frequently Asked Questions (FAQ)#

New users should consult the Introduction to HPC to get started, which is a great resource for learning the basics, troubleshooting, and looking up specifics.

If you want to use software that's not yet installed on the HPC, send us a software installation request.

Overview of HPC-UGent Tier-2 infrastructure

Composing a job#

How many cores/nodes should I request?#

An important factor in this question is how well your task parallelizes: does it actually run faster with more resources? You can test this yourself: start with 4 cores, then 8, then 16... Each time, the execution time should drop to roughly half of what it was before. You can also try this with full nodes: 1 node, 2 nodes. A rule of thumb is that you're around the limit when you double the resources but the execution time is still ~60-70% of what it was before. That's a signal to stop increasing the core count.
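A minimal sketch of such a scaling test for a multithreaded program, assuming ./my_program is a placeholder for your own executable that honours OMP_NUM_THREADS, and that GNU time is available as /usr/bin/time:

#!/bin/bash
# Time the same workload with an increasing number of threads.
for threads in 1 2 4 8 16; do
    export OMP_NUM_THREADS=$threads
    echo "=== $threads threads ==="
    /usr/bin/time -f "%e seconds" ./my_program input.dat
done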

See also: Running batch jobs.

Which packages are available?#

When connected to the HPC, use the commands module avail [search_text] and module spider [module] to find installed modules and get information on them.
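For example, to search for Python modules and inspect a specific one (the version shown here is taken from the examples further down; use whatever module avail reports on your cluster):

module avail Python                          # list all modules with "Python" in their name
module spider Python/3.9.5-GCCcore-10.3.0    # show details on one specific version
module load Python/3.9.5-GCCcore-10.3.0      # load it into your environment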

Among others, many packages for both Python and R are readily available on the HPC. These aren't always easy to find, though, as we've bundled them together.

Specifically, the module SciPy-bundle includes numpy, pandas, scipy and a few others. For R, the normal R module has many libraries included. The bundle R-bundle-Bioconductor contains more libraries. Use the command module spider [module] to find the specifics on these bundles.
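To find out which packages such a bundle actually contains, look at the list of extensions that module spider prints for a specific version. A sketch, assuming a SciPy-bundle version built with the foss-2021a toolchain is installed:

module spider SciPy-bundle                      # list the available versions of the bundle
module spider SciPy-bundle/2021.05-foss-2021a   # show the included packages (numpy, pandas, scipy, ...)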

If the package or library you want is not available, send us a software installation request.

How do I choose the job modules?#

Modules each come with a suffix that describes the toolchain used to install them.

Examples:

  • AlphaFold/2.2.2-foss-2021a

  • tqdm/4.61.2-GCCcore-10.3.0

  • Python/3.9.5-GCCcore-10.3.0

  • matplotlib/3.4.2-foss-2021a

Modules from the same toolchain always work together, and modules from a *different version of the same toolchain* never work together.

The above set of modules works together: foss-2021a is built on top of GCCcore-10.3.0, so modules with either suffix are compatible. An overview of compatible toolchains can be found here: https://docs.easybuild.io/en/latest/Common-toolchains.html#overview-of-common-toolchains.

You can use module avail [search_text] to see which versions on which toolchains are available to use.

If you need something that's not available yet, you can request it through a software installation request.

It is possible to load modules without specifying a version or toolchain, but this will likely cause incompatible modules to be loaded. Don't do it if you use multiple modules together: even if it works now, your job can suddenly break as more modules get installed on the HPC.
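As an illustration, a job script could load the example modules listed above together, since they all belong to the same toolchain generation:

# foss-2021a is built on top of GCCcore-10.3.0, so these can safely be loaded together
module load Python/3.9.5-GCCcore-10.3.0
module load tqdm/4.61.2-GCCcore-10.3.0
module load matplotlib/3.4.2-foss-2021a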

Troubleshooting jobs#

My modules don't work together#

When incompatible modules are loaded, you might encounter an error like this:

Lmod has detected the following error: A different version of the 'GCC' module
is already loaded (see output of 'ml').

You should load another foss module that is compatible with the currently loaded version of GCC. Use ml spider foss to get an overview of the available versions.

Modules from the same toolchain always work together, and modules from a different version of the same toolchain never work together.

An overview of compatible toolchains can be found here: https://docs.easybuild.io/en/latest/Common-toolchains.html#overview-of-common-toolchains.
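One way to resolve the conflict is to unload everything and reload a consistent set from a single toolchain generation. A sketch (replace the module names and versions with the ones ml spider reports for your cluster):

ml                                    # check which conflicting modules are currently loaded
ml purge                              # unload them
ml matplotlib/3.4.2-foss-2021a        # reload modules from one toolchain generation only
ml Python/3.9.5-GCCcore-10.3.0        # GCCcore-10.3.0 matches foss-2021a, so no conflict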

See also: How do I choose the job modules?

My job takes longer than 72 hours#

The 72 hour walltime limit will not be extended. However, you can often work around this barrier, for example by splitting the work into multiple shorter jobs, or by using checkpointing so that a follow-up job can resume where the previous one stopped.
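A minimal sketch of the checkpoint-and-resubmit idea, assuming this script is saved as job_script.sh; my_program, its options and the finished.flag marker file are all hypothetical, and your application needs its own way to save and resume its state:

#!/bin/bash
#PBS -l walltime=71:00:00
#PBS -l nodes=1:ppn=8
cd $PBS_O_WORKDIR
# my_program is assumed to write checkpoint.dat periodically, resume from it when
# restarted, and create finished.flag once the computation is complete
./my_program --resume-from checkpoint.dat
if [ ! -f finished.flag ]; then
    qsub job_script.sh   # not finished yet: submit a follow-up job that resumes from the checkpoint
fi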

Job failed: SEGV Segmentation fault#

Any error mentioning SEGV or Segmentation fault/violation points to a memory error. Unless you were working with memory-unsafe applications or doing low-level programming yourself, your job most likely hit its memory limit.

When there's no memory amount specified in a job script, your job will get access to a proportional share of the total memory on the node: If you request a full node, all memory will be available. If you request 8 cores on a cluster where nodes have 2x18 cores, you will get 8/36 = 2/9 of the total memory on the node.

Try requesting a bit more memory than your proportional share, and see if that solves the issue.
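A sketch of how such a memory request could look in a job script (the 16gb value is only an example; see Specifying memory requirements for the exact options and sensible values for your cluster):

#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l walltime=12:00:00
#PBS -l mem=16gb          # ask for 16 GB instead of the default proportional share
cd $PBS_O_WORKDIR
./my_program              # placeholder for your own executable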

See also: Specifying memory requirements.

My compilation/command fails on login node#

When logging in, you are using a connection to the login nodes. There are somewhat strict limitations on what you can do in those sessions: check the output of ulimit -a. In particular, the amount of memory and the number of processes you can use may present an issue. This is common with MATLAB compilation and Nextflow. An error caused by the login session limitations can look like this: Aborted (core dumped).

It's easy to get around these limitations: start an interactive session on one of the clusters. You are then working on a compute node of that cluster instead of on a login node. Notably, the debug/interactive cluster will grant such a session immediately, while other clusters might make you wait a bit. Example command: ml swap cluster/donphan && qsub -I -l nodes=1:ppn=8

See also: Running interactive jobs.

My job isn't using any GPUs#

Only two clusters have GPUs. Check out the infrastructure overview to see which one suits your needs. Make sure that you manually switch to the GPU cluster before you submit the job. Inside the job script, you need to explicitly request the GPUs: #PBS -l nodes=1:ppn=24:gpus=2
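Putting those two steps together: first switch your environment to a GPU cluster on the login node (for example module swap cluster/joltik), then submit a job script along these lines (the resource numbers and my_gpu_program are only examples):

#!/bin/bash
#PBS -l nodes=1:ppn=24:gpus=2
#PBS -l walltime=4:00:00
cd $PBS_O_WORKDIR
nvidia-smi                # quick sanity check that the job actually sees the GPUs
./my_gpu_program          # placeholder for your own GPU-enabled executable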

Some software modules don't have GPU support, even when running on the GPU cluster. For example, when running module avail alphafold on the joltik cluster, you will find versions built with both the foss toolchain and the fosscuda toolchain. Of these, only the fosscuda versions will use GPU power. When in doubt: CUDA means GPU support.

See also: HPC-UGent GPU clusters.

My job runs slower than I expected#

There are a few possible reasons why a job can perform worse than expected.

Is your job using all the cores you've requested? You can test this by increasing and decreasing the core count: if the execution time stays the same, the job was not using all cores. Some workloads just don't scale well with more cores. If you expect the job to be very parallelizable and you encounter this problem, you may have missed some settings that enable multicore execution. See also: How many cores/nodes should I request?

Does your job have access to the GPUs you requested? See also: My job isn't using any GPUs

Not all file locations perform the same. In particular, the $VSC_HOME and $VSC_DATA directories are, relatively speaking, very slow to access. Your jobs should instead use the $VSC_SCRATCH directory, or other fast locations (depending on your needs), described in Where to store your data on the HPC. As an example of how to do this: the job can copy the input to the scratch directory, then execute the computations, and lastly copy the output back to the data directory (see the sketch below). Using the home and data directories is especially problematic when UGent isn't your home institution: your files may be stored, for example, in Leuven while you're running a job in Ghent.
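A minimal sketch of this staging pattern (input.dat, output.txt and my_program are placeholders for your own files and executable):

#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l walltime=12:00:00
# stage the input (and the program) to fast scratch storage
workdir=$VSC_SCRATCH/$PBS_JOBID
mkdir -p $workdir
cp $VSC_DATA/my_program $VSC_DATA/input.dat $workdir/
cd $workdir
# run the computation on scratch
./my_program input.dat > output.txt
# copy the results back to the slower, but permanent, data directory
cp output.txt $VSC_DATA/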

My MPI job fails#

Use mympirun in your job script instead of mpirun. It is a tool that makes sure everything gets set up correctly for the HPC infrastructure. You need to load it as a module in your job script: module load vsc-mympirun.

To submit the job, use the qsub command rather than sbatch. Although both will submit a job, qsub will correctly interpret the #PBS parameters inside the job script. sbatch might not set the job environment up correctly for mympirun/OpenMPI.
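A minimal sketch of an MPI job script that uses mympirun (my_mpi_program and the requested resources are placeholders):

#!/bin/bash
#PBS -l nodes=2:ppn=16
#PBS -l walltime=6:00:00
module load vsc-mympirun   # provides the mympirun command
cd $PBS_O_WORKDIR
# mympirun picks up the requested nodes and cores itself, so no -np flag is needed
mympirun ./my_mpi_program

Save this as, for example, mpi_job.sh and submit it with qsub mpi_job.sh.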

See also: Multi core jobs/Parallel Computing and Mympirun.

mympirun seems to ignore its arguments#

For example, we have a simple script (./hello.sh):

#!/bin/bash 
echo "hello world"

And we run it like mympirun ./hello.sh --output output.txt.

To our surprise, this doesn't output to the file output.txt, but to standard out! This is because mympirun expects the program name and the arguments of the program to be its last arguments. Here, the --output output.txt arguments are passed to ./hello.sh instead of to mympirun. The correct way to run it is:

mympirun --output output.txt ./hello.sh

When will my job start?#

See the explanation about how jobs get prioritized in When will my job start.

Other#

Can I share my account with someone else?#

NO. You are not allowed to share your VSC account with anyone else; it is strictly personal.

See https://helpdesk.ugent.be/account/en/regels.php.

If you want to share data, there are alternatives (like shared directories in VO space, see Virtual organisations).

Can I share my data with other HPC users?#

Yes, you can use the chmod or setfacl commands to change file permissions so other users can access the data. For example, the following command will enable a user named "otheruser" to read the file named dataset.txt:

$ setfacl -m u:otheruser:r dataset.txt
$ ls -l dataset.txt
-rwxr-x---+ 2 vsc40000 mygroup      40 Apr 12 15:00 dataset.txt
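To share a whole directory in a similar way, the access can be granted recursively, and a default ACL makes newly created files inherit it (here "otheruser" and shared_results are again just examples):

$ setfacl -R -m u:otheruser:rX shared_results/     # existing files: read, and traverse for directories
$ setfacl -R -d -m u:otheruser:rX shared_results/  # default ACL: files created later inherit the same access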

For more information about chmod or setfacl, see Linux tutorial.

Can I use multiple different SSH key pairs to connect to my VSC account?#

Yes, and this is recommended when working from different computers. Please see Adding multiple SSH public keys on how to do this.

I want to use software that is not available on the clusters yet#

Please fill out the details about the software and why you need it in this form: https://www.ugent.be/hpc/en/support/software-installation-request. When you submit the form, a mail containing all the provided information will be sent to hpc@ugent.be. The HPC team will look into your request as soon as possible and contact you when the installation is done or if further information is required.

Is my connection compromised? Remote host identification has changed#

On Monday 25 April 2022, the login nodes received an update to RHEL8. This means that the host keys of those servers also changed. As a result, you could encounter the following warnings.

MacOS & Linux (on Windows, only the second part is shown):

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
xx:xx:xx.
Please contact your system administrator.
Add correct host key in /home/hostname/.ssh/known_hosts to get rid of this message.
Offending RSA key in /var/lib/sss/pubconf/known_hosts:1
RSA host key for user has changed and you have requested strict checking.
Host key verification failed.

Please follow the instructions at migration to RHEL8 to make sure it really is not a hacking attempt: there you will find the correct host keys to compare against, as well as instructions on how to get rid of the warning.

VO: how does it work?#

A Virtual Organisation consists of a number of members and moderators. A moderator can:

  • Manage the VO members (but can't access/remove their data on the system).

  • See how much storage each member has used, and set limits per member.

  • Request additional storage for the VO.

One person can only be part of one VO, be it as a member or as a moderator. It's possible to leave a VO and join another one, but repeatedly switching between VOs (to supervise different groups, for example) is not recommended.

See also: Virtual Organisations.

My UGent shared drives don't show up#

After mounting the UGent shared drives with kinit your_email@ugent.be, you might not see an entry with your username when listing ls /UGent. This is normal: try ls /UGent/your_username or cd /UGent/your_username, and you should be able to access the drives. Be sure to use your UGent username and not your VSC username here.

See also: Your UGent home drive and shares.

I have another question/problem#

Who can I contact? For questions about the HPC-UGent infrastructure, contact the HPC team at hpc@ugent.be.