Manipulating files and directories#

Being able to manage your data is an important part of using the HPC infrastructure. The bread and butter commands for doing this are mentioned here. It might seem annoyingly terse at first, but with practice you will realise that it's very practical to have such common commands short to type.

File contents: "cat", "head", "tail", "less", "more"#

To print the contents of an entire file, you can use cat; to only see the first or last N lines, you can use head or tail:

$ cat one.txt
1
2
3
4
5

$ head -2 one.txt
1
2

$ tail -2 one.txt
4
5

To check the contents of long text files, you can use the less or more commands which support scrolling with "<up>", "<down>", "<space>", etc.

Copying files: "cp"#

$ cp source target

This is the cp command, which copies a file from source to target. To copy a directory, we use the -r option:

$ cp -r sourceDirectory target

A last more complicated example:

$ cp -a sourceDirectory target

Here we used the same cp command, but instead we gave it the -a option which tells cp to copy all the files and keep timestamps and permissions.

Creating directories: "mkdir"#

$ mkdir directory

which will create a directory with the given name inside the current directory.

Renaming/moving files: "mv"#

$ mv source target

mv will move the source path to the destination path. Works for both directories as files.

Removing files: "rm"#

Note: there are NO backups, there is no 'trash bin'. If you remove files/directories, they are gone.

$ rm filename

rm will remove a file or directory. (rm -rf directory will remove every file inside a given directory). WARNING: files removed will be lost forever, there are no backups, so beware when using this command!

Removing a directory: "rmdir"#

You can remove directories using rm -r directory, however, this is error-prone and can ruin your day if you make a mistake in typing. To prevent this type of error, you can remove the contents of a directory using rm and then finally removing the directory with:

$ rmdir directory

Changing permissions: "chmod"#

Every file, directory, and link has a set of permissions. These permissions consist of permission groups and permission types. The permission groups are:

User - a particular user (account)
Group - a particular group of users (may be user-specific group with only one member)
Other - other users in the system

The permission types are:

Read - For files, this gives permission to read the contents of a file
Write - For files, this gives permission to write data to the file. For directories, it allows users to add or remove files to a directory.
Execute - For files this gives permission to execute a file as through it were a script. For directories, it allows users to open the directory and look at the contents.

Any time you run ls -l you'll see a familiar line of -rwx------ or similar combination of the letters r, w, x and - (dashes). These are the permissions for the file or directory. (See also the previous section on permissions)

$ ls -l
total 1
-rw-r--r--. 1 vsc40000 mygroup 4283648 Apr 12 15:13 articleTable.csv
drwxr-x---. 2 vsc40000 mygroup 40 Apr 12 15:00 Project_GoldenDragon

Here, we see that articleTable.csv is a file (beginning the line with -) has read and write permission for the user vsc40000 (rw-), and read permission for the group mygroup as well as all other users (r-- and r--).

The next entry is Project_GoldenDragon. We see it is a directory because the line begins with a d. It also has read, write, and execute permission for the vsc40000 user (rwx). So that user can look into the directory and add or remove files. Users in the mygroup can also look into the directory and read the files. But they can't add or remove files (r-x). Finally, other users can read files in the directory, but other users have no permissions to look in the directory at all (---).

Maybe we have a colleague who wants to be able to add files to the directory. We use chmod to change the modifiers to the directory to let people in the group write to the directory:

$ chmod g+w Project_GoldenDragon
$ ls -l
total 1
-rw-r--r--. 1 vsc40000 mygroup 4283648 Apr 12 15:13 articleTable.csv
drwxrwx---. 2 vsc40000 mygroup 40 Apr 12 15:00 Project_GoldenDragon

The syntax used here is g+x which means group was given write permission. To revoke it again, we use g-w. The other roles are u for user and o for other.

You can put multiple changes on the same line: chmod o-rwx,g-rxw,u+rx,u-w somefile will take everyone's permission away except the user's ability to read or execute the file.

You can also use the -R flag to affect all the files within a directory, but this is dangerous. It's best to refine your search using find and then pass the resulting list to chmod since it's not usual for all files in a directory structure to have the same permissions.

Access control lists (ACLs)#

However, this means that all users in mygroup can add or remove files. This could be problematic if you only wanted one person to be allowed to help you administer the files in the project. We need a new group. To do this in the HPC environment, we need to use access control lists (ACLs):

$ setfacl -m u:otheruser:w Project_GoldenDragon
$ ls -l Project_GoldenDragon
drwxr-x---+ 2 vsc40000 mygroup 40 Apr 12 15:00 Project_GoldenDragon

This will give the user otheruser permissions to write to Project_GoldenDragon

Now there is a + at the end of the line. This means there is an ACL attached to the directory. getfacl Project_GoldenDragon will print the ACLs for the directory.

Note: most people don't use ACLs, but it's sometimes the right thing and you should be aware it exists.

See https://linux.die.net/man/1/setfacl for more information.

Zipping: "gzip"/"gunzip", "zip"/"unzip"#

Files should usually be stored in a compressed file if they're not being used frequently. This means they will use less space and thus you get more out of your quota. Some types of files (e.g., CSV files with a lot of numbers) compress as much as 9:1. The most commonly used compression format on Linux is gzip. To compress a file using gzip, we use:

$ ls -lh myfile
-rw-r--r--. 1 vsc40000 vsc40000 4.1M Dec 2 11:14 myfile
$ gzip myfile
$ ls -lh myfile.gz
-rw-r--r--. 1 vsc40000 vsc40000 1.1M Dec 2 11:14 myfile.gz

Note: if you zip a file, the original file will be removed. If you unzip a file, the compressed file will be removed. To keep both, we send the data to stdout and redirect it to the target file:

$ gzip -c myfile > myfile.gz
$ gunzip -c myfile.gz > myfile

"zip" and "unzip"#

Windows and macOS seem to favour the zip file format, so it's also important to know how to unpack those. We do this using unzip:

$ unzip myfile.zip

If we would like to make our own zip archive, we use zip:

$ zip myfiles.zip myfile1 myfile2 myfile3

Working with tarballs: "tar"#

Tar stands for "tape archive" and is a way to bundle files together in a bigger file.

You will normally want to unpack these files more often than you make them. To unpack a .tar file you use:

$ tar -xf tarfile.tar

Often, you will find gzip compressed .tar files on the web. These are called tarballs. You can recognize them by the filename ending in .tar.gz. You can uncompress these using gunzip and then unpacking them using tar. But tar knows how to open them using the -z option:

$ tar -zxf tarfile.tar.gz
$ tar -zxf tarfile.tgz

Order of arguments#

Note: Archive programs like zip, tar, and jar use arguments in the "opposite direction" of copy commands.

# cp, ln: &lt;source(s)&gt; &lt;target&gt;
$ cp source1 source2 source3 target
$ ln -s source target

# zip, tar: &lt;target&gt; &lt;source(s)&gt;
$ zip zipfile.zip source1 source2 source3
$ tar -cf tarfile.tar source1 source2 source3

If you use tar with the source files first then the first file will be overwritten. You can control the order of arguments of tar if it helps you remember:

$ tar -c source1 source2 source3 -f tarfile.tar

Exercises#

Create a subdirectory in your home directory named test containing a single, empty file named one.txt.
Copy /etc/hostname into the test directory and then check what's in it. Rename the file to hostname.txt.
Make a new directory named another and copy the entire test directory to it. another/test/one.txt should then be an empty file.
Remove the another/test directory with a single command.
Rename test to test2. Move test2/hostname.txt to your home directory.
Change the permission of test2 so only you can access it.
Create an empty job script named job.sh, and make it executable.
gzip hostname.txt, see how much smaller it becomes, then unzip it again.

The next chapter is on uploading files, especially important when using HPC-infrastructure.