Running Jobs
How do I run my jobs?
All jobs must be submitted to run on the compute nodes using Slurm. There are many different ways to interact with Slurm:
- Use sbatch to submit a job script (e.g. sbatch jobscript.sh)
- Start an interactive shell session on a compute node with the sinteractive command
- Obtain a job allocation using salloc and then submit jobs to these resources using srun
- Construct a Slurm job array
The most important step in all of these cases is your resource request for compute, memory, and other hardware resources (e.g. GPUs).
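As a minimal sketch of a resource request, assume a single-node job that needs 4 CPU cores, 8 GB of memory, and one hour of run time. The job name and all of the values are placeholders rather than recommendations for this cluster; the #SBATCH options themselves are standard Slurm options.

```shell
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=example
# One task on a single node.
#SBATCH --nodes=1
#SBATCH --ntasks=1
# CPU cores for that task (placeholder value).
#SBATCH --cpus-per-task=4
# Memory per node (placeholder value).
#SBATCH --mem=8G
# Wall-clock limit, HH:MM:SS (placeholder value).
#SBATCH --time=01:00:00

# Slurm sets SLURM_CPUS_PER_TASK inside the job when --cpus-per-task is given;
# fall back to 1 so the script can also be tried outside of Slurm.
cores="${SLURM_CPUS_PER_TASK:-1}"

echo "Running on $(hostname) with ${cores} CPU core(s)"
```

Saved as jobscript.sh, a script like this would be submitted with sbatch jobscript.sh.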
Why isn’t my job running?
There are many reasons why your job does not start running immediately. The most common reasons are:
- The cluster is busy, and the resources you requested are not available
- You requested a very specific or limited resource (e.g. a V100 GPU)
- Your job priority is lower than that of other users with pending jobs
You can use the scontrol show job JOBID command, where JOBID is your pending job ID number, to get additional information, including:
- What node your job is scheduled to run on
- When your job is expected to start
- What resources you requested, in the event that you made a mistake in your job submission
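The output of scontrol show job is long, so it can help to filter it down to the fields listed above. The helper below is a sketch, not a cluster-provided tool: job_fields is a hypothetical name, and the field names it looks for are assumptions based on typical Slurm output that may differ between Slurm versions.

```shell
#!/bin/bash
# Print only selected KEY=VALUE fields from `scontrol show job` output on stdin.
# The field names below are assumptions and may vary between Slurm versions.
job_fields() {
    # Put each KEY=VALUE pair on its own line, then keep the fields of interest.
    tr -s ' ' '\n' |
        grep -E '^(JobState|Reason|StartTime|SchedNodeList|NodeList|ReqTRES|TRES)='
}

# Usage on the cluster (JOBID is your pending job's ID number):
#   scontrol show job JOBID | job_fields
```

In this sketch, Reason indicates why the job is still pending, StartTime is the scheduler's current estimate, and the TRES fields show the resources that were requested.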
The amount of resources varies across the different compute nodes, including the type of CPU, number of CPU cores, amount of memory, and availability of GPUs.
For example, if your job submission requests:
- 40 cores or more, only 18 compute nodes out of 56 can accommodate your job
- 512 GB of RAM in a single node, only 2 compute nodes out of 56 can accommodate your job
- V100 GPUs, only 1 compute node out of 56 can accommodate your job
If you want to reduce the wait time for your job, make sure you are requesting only the minimum resources (CPU/RAM/GPU) it requires.
Hardware specifications for all of the compute nodes are printed when you log in, and detailed specifications can be printed using the scluster command.
You can view the calculated job priority numbers, resource requests, and job start times using the show_queue command.
User X has 20 jobs running, while I only submitted one job and it is pending. Why does user X get to run 20 jobs while my job waits?
After reviewing the answers to the previous question, the following are some additional reasons why your job is pending while other jobs are running:
- User X is running 20 single CPU jobs, which are easy to schedule in between the “gaps” of other users running jobs
- Your job requested a large number of CPUs or a large amount of RAM, making it difficult to schedule immediately if the cluster is busy (it usually is)
- Your job priority is lower than that of other users, as calculated by the multifactor priority algorithm
You can view the calculated job priority numbers, resource requests, and job start times using the show_queue command.
There are many reasons why your job may not start immediately, and these factors change constantly. The best strategy for moving your jobs through the queue as quickly as possible is to use the short partition, request the minimum resources your job requires, checkpoint frequently, and use multiple job submissions.
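Several of these points can be combined in one job array of small, short tasks. The script below is a sketch under assumptions: the partition name short comes from the text above, while the array range, memory, time limit, and input file naming scheme are hypothetical placeholders. Checkpointing depends on your application and is not shown.

```shell
#!/bin/bash
# "short" is the partition named above; its limits are site-specific.
#SBATCH --partition=short
# Twenty independent tasks (placeholder range).
#SBATCH --array=1-20
# Request only what each task needs (placeholder values).
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=00:30:00

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 1 so the
# script can also be tried outside of Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical naming scheme: one input file per array task.
input="input_${task_id}.dat"

echo "Array task ${task_id} would process ${input}"
```

Each array task is scheduled independently, so small tasks like these can start as soon as a single core becomes free.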
Why does my GPU code say there are no CUDA devices available?
The most likely cause is that you did not request a GPU in your resource request. Make sure to include --gres=gpu:X in your Slurm script, where X is the number of GPUs you need per node. For example, add the line:
#SBATCH --gres=gpu:2
to indicate that your job will use 2 GPUs per node.
Additionally, to make use of CUDA drivers, your Slurm script must have the line
module load cuda
written after your #SBATCH lines in order to load the proper drivers.
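Put together, a GPU job script might look like the sketch below. The node count and time limit are placeholder values, and the check of CUDA_VISIBLE_DEVICES assumes the cluster's Slurm configuration exports that variable for GPU jobs, which is common but not guaranteed.

```shell
#!/bin/bash
#SBATCH --nodes=1
# Request 2 GPUs per node.
#SBATCH --gres=gpu:2
# Wall-clock limit (placeholder value).
#SBATCH --time=01:00:00

# Load CUDA after the #SBATCH lines. The guard only exists so that this sketch
# does not fail on machines without environment modules.
if type module >/dev/null 2>&1; then
    module load cuda
fi

# Slurm typically exports CUDA_VISIBLE_DEVICES for jobs that were granted GPUs.
visible_gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${visible_gpus}"
```

If this prints none inside a running job, the GPU request was probably missing or not granted.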
My job runs more slowly on the cluster than it does on my laptop/workstation/server. Why is the cluster so slow?
From a purely hardware perspective, this is highly unlikely. If your job is running more slowly than you expect, the most common reasons are:
- Incorrect/inadequate resource request (you only requested 1 CPU core, or the default amount of RAM)
- Your code is doing something that you don’t know about
A combination of these two factors is almost always the reason for jobs running more slowly on the cluster than on a different resource.
One example is Matlab code running on the cluster versus a modern laptop. Matlab implicitly performs some operations in parallel (e.g. matrix operations), spawning multiple threads to accomplish this parallelization. If your resource request on the cluster only includes 1 or 2 CPU cores, no matter how many threads Matlab spawns, they will be pinned to the cores you were assigned.
This can lead to a situation where a user’s Matlab code is running using all 4 CPU cores on their laptop, but only 1 CPU core on the cluster. This can give the impression that the cluster is slower than a laptop. By requesting 4 or 8 CPU cores in your resource request, Matlab can now run these operations in parallel.
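One way to check how many cores a job can actually use is to compare the request with what the process sees. The sketch below assumes the cluster pins jobs to their assigned cores, as described above; the Matlab command line and the script name my_script are hypothetical.

```shell
#!/bin/bash
#SBATCH --ntasks=1
# Match the 4 cores the laptop in the example above would use.
#SBATCH --cpus-per-task=4

# Cores requested from Slurm; defaults to 1 when run outside of Slurm.
requested="${SLURM_CPUS_PER_TASK:-1}"

# nproc reports the cores this process may run on, which reflects CPU pinning.
usable="$(nproc)"

echo "Requested ${requested} core(s); ${usable} usable by this process"

# Hypothetical Matlab invocation (my_script is a placeholder name):
# matlab -batch "my_script"
```

If the usable count is lower than the number of threads your application spawns, those threads share the same cores and the job runs more slowly than expected.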
It is important to understand what the code or application you are using is actually doing when you run your job.
Last Updated on July 27, 2023