Getting Started#

Welcome to the "Getting Started" guide. This chapter will lead you through the initial steps of logging into the HPC-UGent infrastructure and submitting your very first job. We'll also walk you through the process step by step using a practical example.

In addition to this chapter, you might find the recording of the Introduction to HPC-UGent training session to be a useful resource.

Before proceeding, read the introduction to HPC to gain an understanding of the HPC-UGent infrastructure and related terminology.

Getting Access#

To get access to the HPC-UGent infrastructure, visit Getting an HPC Account.

If you have not used Linux before, now would be a good time to follow our Linux Tutorial.

A typical workflow looks like this:#

Connect to the login nodes
Transfer your files to the HPC-UGent infrastructure
Optional: compile your code and test it
Create a job script and submit your job
Wait for job to be executed
Study the results generated by your jobs, either on the cluster or after downloading them locally.

We will walk through an illustrative workload to get you started. In this example, our objective is to train a deep learning model for recognizing hand-written digits (MNIST dataset) using TensorFlow; see the example scripts.

Getting Connected#

There are two options to connect:

Using a terminal to connect via SSH (for power users); see First Time connection to the HPC-UGent infrastructure
Using the web portal

Since your operating system is macOS, using the ssh command in a terminal should be easy, but the web portal will work too.

The web portal offers a convenient way to upload files and gain shell access to the HPC-UGent infrastructure from a standard web browser (no software installation or configuration required).

See shell access when using the web portal, or connection to the HPC-UGent infrastructure when using a terminal.

Make sure you can get shell access to the HPC-UGent infrastructure before proceeding with the next steps.

Info

If you have problems, see the connection issues section on the troubleshooting page.

Transfer your files#

Now that you can log in, it is time to transfer files from your local computer to your home directory on the HPC-UGent infrastructure.

Download the following example scripts to your computer:

You can also find the example scripts in our git repo: https://github.com/hpcugent/vsc_user_docs/.

On your local machine you can run:

curl -OL https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/tensorflow_mnist.py
curl -OL https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/run.sh

Using the scp command, the files can be copied from your local host to your home directory (~) on the remote host (HPC).

scp tensorflow_mnist.py run.sh vsc40000@login.hpc.ugent.be:~

ssh vsc40000@login.hpc.ugent.be

Use your own VSC account id

Replace vsc40000 with your VSC account id (see https://account.vscentrum.be)
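To avoid retyping the account id in every command, it can be kept in a variable. The following is a minimal sketch, not part of the official example: the variable names VSC_ID and LOGIN_HOST and the helper remote_target are hypothetical, and the commands are only printed so you can check them before running them yourself.

```shell
# Hypothetical helper variables; replace vsc40000 with your own VSC account id.
VSC_ID="vsc40000"
LOGIN_HOST="login.hpc.ugent.be"

# Build an scp destination of the form "user@host:path" from its parts.
remote_target() {
    printf '%s@%s:%s\n' "$1" "$2" "$3"
}

# Print the commands instead of running them, so they can be reviewed first.
echo "scp tensorflow_mnist.py run.sh $(remote_target "$VSC_ID" "$LOGIN_HOST" "~")"
echo "ssh ${VSC_ID}@${LOGIN_HOST}"
```

Printing the commands first makes a missing @ or a wrong account id easy to spot before anything is copied.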

Info

For more information about transferring files or scp, see transfer files from/to hpc.

When running ls in your session on the HPC-UGent infrastructure, you should see the two files listed in your home directory (~):

$ ls ~
run.sh tensorflow_mnist.py

If you do not see these files, make sure you uploaded them to your home directory.
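The same check can be scripted. This is a small sketch under the assumption that both files should be in your home directory; the function name missing_files is hypothetical.

```shell
# Report which of the expected files are missing from a directory.
# Prints one missing file name per line; prints nothing if all are present.
missing_files() {
    dir=$1
    shift
    for name in "$@"; do
        [ -f "$dir/$name" ] || echo "$name"
    done
}

# On the login node, check the home directory for the two example files.
missing=$(missing_files "$HOME" run.sh tensorflow_mnist.py)
if [ -z "$missing" ]; then
    echo "Both example files are in $HOME"
else
    echo "Missing from $HOME: $missing"
fi
```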

Submitting a job#

Jobs are submitted and executed using job scripts. In our case run.sh can be used as a (very minimal) job script.

A job script is a shell script, a text file that specifies the resources, the software that is used (via module load statements), and the steps that should be executed to run the calculation.

Our job script looks like this:

run.sh

#!/bin/bash

module load TensorFlow/2.15.1-foss-2023a

python tensorflow_mnist.py

As you can see, this job script will run the Python script named tensorflow_mnist.py.
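The minimal run.sh does not request any resources explicitly. As an illustration of what a job script with resource requests might look like, the sketch below writes a variant to a hypothetical file named run_with_resources.sh. It assumes the scheduler accepts Torque/PBS-style #PBS directives; the job name, core count, and walltime are illustrative values, not recommendations.

```shell
# Write a hypothetical variant of run.sh that also requests resources.
# The #PBS values below are illustrative assumptions, not recommendations.
cat > run_with_resources.sh <<'EOF'
#!/bin/bash
#PBS -N tensorflow_mnist
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00

module load TensorFlow/2.15.1-foss-2023a

python tensorflow_mnist.py
EOF

# Check the generated script for syntax errors without running it.
bash -n run_with_resources.sh && echo "run_with_resources.sh has no syntax errors"
```

The lines starting with #PBS are comments to the shell, so the script still runs unchanged when started by hand; only the scheduler interprets them.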

By default, the jobs you submit are executed on cluster/doduo. You can swap to another cluster by issuing the following command:

module swap cluster/donphan

Tip

When submitting jobs that need only a limited amount of resources, it is recommended to use the debug/interactive cluster: donphan.

To get a list of all clusters and their hardware, see https://www.ugent.be/hpc/en/infrastructure.

This job script can now be submitted to the cluster's job system for execution, using the qsub (queue submit) command:

$ qsub run.sh
123456

This command returns a job identifier (123456) on the HPC cluster. This is a unique identifier for the job which can be used to monitor and manage your job.
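Since the job identifier is needed again later, it can be convenient to store it in a variable when submitting. The sketch below is one way to do that; the function name job_id_from_qsub_output is hypothetical, and stripping a suffix after the first dot is an assumption for schedulers that print more than the bare number.

```shell
# Keep only the leading part of the identifier printed by qsub.
# Assumption: some schedulers append a suffix such as ".servername";
# plain output like "123456" is returned unchanged.
job_id_from_qsub_output() {
    printf '%s\n' "${1%%.*}"
}

# On the cluster: JOB_ID=$(job_id_from_qsub_output "$(qsub run.sh)")
# Here the sample identifier from the text is used instead.
JOB_ID=$(job_id_from_qsub_output "123456")
echo "Submitted job $JOB_ID"
```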

Make sure you understand what the module command does

Note that the module commands only modify environment variables. For instance, running module swap cluster/donphan will update your shell environment so that qsub submits jobs to the donphan cluster, but your active shell session is still running on the login node.

It is important to understand that while module commands affect your session environment, they do not change where the commands you are running are executed: they will still be run on the login node you are on.

When you submit a job script, however, the commands in the job script will be run on a worker node of the cluster the job was submitted to (like donphan).

For detailed information about module commands, read the running batch jobs chapter.

Wait for job to be executed#

Your job is put into a queue before being executed, so it may take a while before it actually starts (see when will my job start? for the scheduling policy).

You can get an overview of the active jobs using the qstat command:

$ qstat
Job ID     Name             User            Time Use S Queue
---------- ---------------- --------------- -------- - -------
123456     run.sh           vsc40000        0:00:00  Q donphan

Eventually, after entering qstat again you should see that your job has started running:

$ qstat
Job ID     Name             User            Time Use S Queue
---------- ---------------- --------------- -------- - -------
123456     run.sh           vsc40000        0:00:01  R donphan

If you don't see your job in the output of the qstat command anymore, your job has likely completed.

Read this section on how to interpret the output.
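The state of a single job can also be picked out of the qstat output with a short filter. This is a sketch only: the function name job_state is hypothetical, and it assumes the six-column layout shown in the sample output above, where the state (S) is the fifth field.

```shell
# Print the state column (S) for one job from qstat-style text on stdin.
# Assumes the six-column layout shown above, where the state is field 5.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Sample output copied from the text; on the cluster use: qstat | job_state 123456
sample='Job ID     Name             User            Time Use S Queue
---------- ---------------- --------------- -------- - -------
123456     run.sh           vsc40000        0:00:00  Q donphan'

state=$(printf '%s\n' "$sample" | job_state 123456)
case "$state" in
    Q) echo "Job 123456 is queued" ;;
    R) echo "Job 123456 is running" ;;
    "") echo "Job 123456 is not listed; it has likely completed" ;;
    *) echo "Job 123456 is in state $state" ;;
esac
```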

Inspect your results#

When your job finishes it generates 2 output files:

One for normal output messages (stdout output channel).
One for warning and error messages (stderr output channel).

By default, both files are located in the directory where you issued qsub.

Info

For more information about the stdout and stderr output channels, see this section.

In our example when running ls in the current directory you should see 2 new files:

run.sh.o123456, containing normal output messages produced by job 123456;
run.sh.e123456, containing errors and warnings produced by job 123456.

Info

run.sh.e123456 should be empty (no errors or warnings).

Use your own job ID

Replace 123456 with the job ID returned by the qsub command (it is also shown by qstat, see above), or simply look for added files in your current directory by running ls.
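Checking whether the error file is empty can be scripted as well. The following sketch assumes the file naming pattern described above (script name, then .o or .e, then the job ID); the function names are hypothetical.

```shell
# Derive the two output file names from a job script name and a job ID,
# following the <script>.o<jobid> / <script>.e<jobid> pattern described above.
stdout_file() { printf '%s.o%s\n' "$1" "$2"; }
stderr_file() { printf '%s.e%s\n' "$1" "$2"; }

# Report whether the stderr file of a job exists and whether it has content.
check_stderr() {
    err=$(stderr_file "$1" "$2")
    if [ ! -e "$err" ]; then
        echo "$err not found"
    elif [ -s "$err" ]; then
        echo "$err contains warnings or errors"
    else
        echo "$err is empty"
    fi
}

# Run this in the directory where qsub was issued, with your own job ID.
check_stderr run.sh 123456
```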

When examining the contents of run.sh.o123456 you will see something like this:

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
Epoch 1/5
1875/1875 [==============================] - 2s 823us/step - loss: 0.2960 - accuracy: 0.9133
Epoch 2/5
1875/1875 [==============================] - 1s 771us/step - loss: 0.1427 - accuracy: 0.9571
Epoch 3/5
1875/1875 [==============================] - 1s 767us/step - loss: 0.1070 - accuracy: 0.9675
Epoch 4/5
1875/1875 [==============================] - 1s 764us/step - loss: 0.0881 - accuracy: 0.9727
Epoch 5/5
1875/1875 [==============================] - 1s 764us/step - loss: 0.0741 - accuracy: 0.9768
313/313 - 0s - loss: 0.0782 - accuracy: 0.9764

Hurray 🎉, we trained a deep learning model and achieved 97.64 percent accuracy.
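The percentage quoted here comes from the last accuracy value in the output (0.9764). As a rough sketch, it can be extracted from the output file with awk; the function name final_accuracy_percent is hypothetical, and the parsing assumes lines ending in "accuracy: <value>" as in the output above.

```shell
# Print the last reported accuracy in a training log as a percentage.
# Assumes lines of the form "... - accuracy: 0.9764" as in the output above.
final_accuracy_percent() {
    LC_ALL=C awk '
        { for (i = 1; i < NF; i++) if ($i == "accuracy:") acc = $(i + 1) }
        END { if (acc != "") printf "%.2f\n", acc * 100 }
    ' "$@"
}

# On the cluster: final_accuracy_percent run.sh.o123456
# Here the last line of the sample output is piped in instead; prints 97.64
printf '%s\n' '313/313 - 0s - loss: 0.0782 - accuracy: 0.9764' | final_accuracy_percent
```

Because every line is scanned and the last match wins, the value printed is the evaluation accuracy from the final line rather than the training accuracy of an earlier epoch.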

Warning

When using TensorFlow specifically, you should actually submit jobs to a GPU cluster for better performance; see GPU clusters.

For the purpose of this example, we are running a very small TensorFlow workload on a CPU-only cluster.

Next steps#

For more examples, see Program examples and Job script examples.