Intro - Taltech Ai-Lab

AI-Lab resources (including GPUs)

AI-Lab is an environment to learn how to use modern computational resources to solve various problems. Many such problems fall under the category of AI. The environment is primarily aimed at learning how to use the resources. Larger, long running jobs should move to the HPC Center resources.

The AI-Lab with the head node accessible by SSH using your Uni-ID user name and password at ai-lab.taltech.ee currently hosts the following nodes all running Ubuntu 20.04 Linux:

ai-lab-01: AMD Threadripper 3960X 24-Core/48-thread Processor, 128 GB of memory, NVidia 3090 GPU with 24 GB of graphics memory
ai-lab-02: Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz 10-Core/20-thread Processor, 128 GB of memory, NVidia 2080Ti GPU with 11 GB of graphics memory
ai-lab-04: AMD Threadripper 3970X 32-Core/64-thread Processor, 128 GB of memory, NVidia 3090 GPU with 24 GB of graphics memory
ai-lab-05: Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz 4-Core/8-thread Processor, 64 GB of memory, NVidia 1080Ti GPU with 11 GB of graphics memory
ai-lab-06: AMD Threadripper 3960X 24-Core/48-thread Processor, 128 GB of memory, NVidia 2080TI GPU with 11 GB of graphics memory
ai-lab-07: AMD Threadripper 3960X 24-Core/48-thread Processor, 128 GB of memory, NVidia 2080TI GPU with 11 GB of graphics memory

NB! No jobs should be run on the head node ai-lab.taltech.ee. This node is for building your code and preparing your tasks that you submit to the cluster.

Software support

The system runs Ubuntu 20.04. Installed packages:

NVidia CUDA 11.0
NVidia CUDA 11.1
NVidia CUDA 11.2 (default)
LibCuDNN 8
Python 3.8 (and virtual environments)
.Net 5.0 SDK
…

There is an LMod based environment modules system available. To check what modules are available, run

module avail

To load a particular module run

module load Miniconda3

The above command loads the default version of Miniconda3, if you need a particular (older) version just add the full version Miniconda3/py38_4.9.2 to the module name.

If there is a package missing that is in the official Ubuntu repositories, post the request to the AI-Lab teams chat under the software requests channel.

If the software does not come as an official Ubuntu package, please build it in your home directory and run from there.

Preparing the software to run

The intended use of AI-Lab involves setting the software up in your home directory in the head node, i.e. the computer you log in from remote: ai-lab.taltech.ee. Note that your home directory is shared between the nodes and the software you set up will be available for each and every node. If you discover that there is some dependency missing, please drop a line to the AI-Lab Teams team!.

While it makes sense to run build tasks of large software repositories as batch jobs, it is best to sort out dependencies in the head node. When running e.g. pip3, the packages will be installed into ~/.local subdirectory of your home directory. So the environment will be available in all nodes. It is also true for Python Virtual Environments.

Batch processing

The system runs SLURM queue manager to pass workloads to nodes without breaking them.

The best practice is to sort out your development environment in your home directory in the head node and then write scripts that will invoke the GPU involving code.

Feel free to check out the ai-benchmark project!

Once the software is prepared you can queue the script with the following command (assuming you need to use the GPU):

sbatch --gres mps:40 --cpus-per-task=8 --mem=20G ./yourscript.sh

This will provide your job access to about 40% of a GPU (there are no strict restrictions enforced, but 40% means about 4 GB of GPU memory). This way one node will be able to host multiple users simultaneously. This is very useful for e.g. Jupyter notebook sessions. The maximum that is available for a single job and a single GPU is 200%. The prerequisite is that your code is compiled to use CUDA-11.2, other SDKs may not work. The approach utilizes NVidia MPS.

If you know that your job requires the whole resources of a GPU or a particular kind of GPU (currently available list includes 3090, 2080TI, 1080TI), you can request the whole GPU with the following command:

sbatch --gres gpu:2080TI:1 --cpus-per-task=8 --mem=20G ./yourscript.sh

NB! It is important to estimate how many CPU cores and how much memory your job will require. Otherwise tasks from other users may interfere with your job and none of them might complete successfully. To get a better understanding of resource usage, check out the output of /usr/bin/time -v (which is different from just running time in bash).

If your computational workload does not need GPU resources, just submit it by specifying the required memory and cpu resources:

sbatch --cpus-per-task=10 --mem=24G  ./yourscript.sh

It is worth while to include /usr/bin/time -v in the script in front of the command that runs the computation to get some feedback on the resources required. It is useful to start with a shorter task to see how much resources your task requires and adjust the resource parameters of the sbatch script accordingly.

There is also the possibility to use the RTX 3090 GPUs 24 GB of GPU memory. For using that use the following tag:

sbatch --gres gpu:3090:1 --cpus-per-task=8 --mem=20G ./yourscript.sh

If you have lots of long jobs and you do not have any alternatives available at the moment, please resort to using a single node for your jobs. E.g. submit the jobs by specifying the appropriate node name. This way the other users will be able to still use the system while your jobs are computing.

sbatch --gres gpu:1080TI:1  -w ai-lab-05 --cpus-per-task=8 --mem=20G ./yourscript.sh

Duration of jobs

The default maximum duration is 4 days. Please invest some time into storing intermediate results. It is possible to request longer jobs, but you should do it sparingly because you will block the resources for a very long time. The command:

sbatch --gres gpu:2080TI:1  -w ai-lab-05 --cpus-per-task=2 --mem=10G -t 5-0  ./yourscript.sh

will submit a job request to the queue that can run up to 5 days and 0 hours.

How to observe the progress of the job?

Slurm creates a slurm-<jobid>.out file into the directory where you submit the job from for each job that is submitted to the queue. The standard output and standard error of the job are logged into that file. It is possible to observe the progress of a particular job with, for example, less. To make less follow the current updates, press SHIFT+F. To stop monitoring, just press CTRL+C. To exit less, press q (or Q). Note that less is a better more.

Upon completion of a job a file named slurm-<jobid>-dcgm-gpu-stats-<nodename>.out will be created next to the slurm-<jobid>.out file where a summary of GPU utilization is stored. If you notice very low GPU utilization, please check your code. If your code has parts where GPU-s are not used, break the job into parts. Note that the nodes are shared between different users, i.e. it is possible for multiple users to run jobs on CPU as long as the total number of CPU cores requested fits into a particular node.

Monitoring the state of the queue

To see what is the status of the queue, just run

squeue

If it appears that you have submitted a job to the queue that has a mistake in it, then you can cancel the jobs by using

scancel <jobid>

where is replaced with the appropriade ID of the job to be cancelled.

To see which nodes are available, run

sinfo

If the nodes are down, mention it in the AI-Lab teams team.

To get an overview of your job history, run

sacct --starttime 2021-01-01 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist

It is possible to get the summary for a particular job by adding the job id with the -j <jobid> switch.

To see full set of SLURM parameters of a job, cancelled

scontrol show job <jobid>

Interactive sessions

Sometimes it is necessary to develop software in an interactive session with access to GPU resources. For that it is possible to create a session to the head node of the AI-LAB edu cluster using dynamic port forwarding:

ssh -A uniid@193.40.244.114 -D 1337

Configure your browser to use the SOCKS proxy in your browser, i.e. your SOCKS proxy should be localhost and the port should be 1337 (if you use -D 1337 when opening the SSH session).

For interactive session (please use sparingly):

 srun --gres gpu:2080TI:1 --cpus-per-task=8 --mem=20G --pty bash

To run jupyter-notebook first find out the local ip of the node:

host ai-lab-02

Then run jupyter-notebook

jupyter-notebook  --ip the-ip-found-with-host-command   --no-browser

You should be able to paste the link into your browser which has the socks proxy configured.

Switching CUDA versions

The default CUDA is Cuda 11.2

Tensorflow (version 1) typically wants CUDA 10.1, but recent nightly Tensorflow builds also support CUDA 11. Note that the 3090 (aka Ampera) GPUs do not have CUDA 10 support available from NVidia.

Support

For support, please refer to the AI-Lab Teams team

Please note that you will need your Uni-ID to log in to the system. For people who have joined the university in recent years, the Uni-ID is a 6 character string. Instructions how to achieve access is available in the above mentioned Teams team.

Credit

The system is set up in collaboration between the IT College HPC Center and Department of Software Science. The list of contributors (in alphabetical order):

Lauri Anton
Juhan Ernits
Rainer Liis
Bahdan Yanovich