Frequently Asked Questions

This page contains the most common questions regarding usage of the HPC resources at WPI.

Running Jobs

How do I run my jobs?

All jobs must be submitted to run on the compute nodes using Slurm. There are many different ways to interact with Slurm:

  • Use sbatch to submit a job script (e.g. sbatch jobscript.sh)
  • Start an interactive shell session on a compute node with the sinteractive command
  • Obtain a job allocation using salloc and then submit jobs to these resources using srun
  • Construct a Slurm job array

The most important step in all of these cases is your resource request for compute, memory, and other hardware resources (e.g. GPUs).
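As a minimal sketch of the first option, a job script might look like the following. The job name, core count, memory, and time limit are placeholder values chosen for illustration, not site recommendations; adjust them to what your job actually needs.

```shell
#!/bin/bash
#SBATCH --job-name=example      # hypothetical job name
#SBATCH --nodes=1               # run on a single node
#SBATCH --ntasks=1              # one task (process)
#SBATCH --cpus-per-task=4       # CPU cores for that task (placeholder value)
#SBATCH --mem=8G                # memory per node (placeholder value)
#SBATCH --time=01:00:00         # wall-clock limit, HH:MM:SS (placeholder value)

# Slurm sets SLURM_CPUS_PER_TASK inside the job when --cpus-per-task is given;
# default to 1 so the script also runs outside of Slurm.
cores="${SLURM_CPUS_PER_TASK:-1}"
echo "Running on ${HOSTNAME} with ${cores} CPU core(s)"
```

Saved as jobscript.sh, this would be submitted with sbatch jobscript.sh.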

Why isn't my job running?

There are many reasons why your job may not start running immediately. The most common reasons are:

  • The cluster is busy, and the resources you requested are not available
  • You requested a very specific or limited resource (e.g. a V100 GPU)
  • Your job priority is lower than that of other users' pending jobs

You can use the scontrol show job JOBID command, where JOBID is your pending job ID number, to get additional information, including:

  • What node your job is scheduled to run on
  • When your job is expected to start
  • What resources you requested, in the event that you made a mistake in your job submission
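The full scontrol output is long, so it can help to filter it down to the fields listed above. The sketch below assumes the field names JobState, Reason, StartTime, SchedNodeList, NumCPUs, and TRES, which may differ between Slurm versions; job_summary is a hypothetical helper name, not a Slurm command.

```shell
#!/bin/bash
# Print only selected fields from `scontrol show job` output read on stdin.
# job_summary is a hypothetical helper name, not a Slurm command.
job_summary() {
    # Put each KEY=VALUE pair on its own line, then keep the fields of interest.
    # The field names are assumptions and may vary between Slurm versions.
    tr -s ' ' '\n' | grep -E '^(JobState|Reason|StartTime|SchedNodeList|NumCPUs|TRES)='
}

# Intended use on the cluster (JOBID is your pending job ID number):
#   scontrol show job JOBID | job_summary
```

The Reason field is usually the quickest indication of why a job is still pending.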

The amount of resources varies across the different compute nodes, including the type of CPU, number of CPU cores, amount of memory, and availability of GPUs.

For example, if your job submission requests:

  • 40 cores or more, only 18 compute nodes out of 56 can accommodate your job
  • 512 GB of RAM in a single node, only 2 compute nodes out of 56 can accommodate your job
  • V100 GPUs, only 1 compute node out of 56 can accommodate your job

If you want to decrease the amount of wait time for your job submission, make sure you are requesting the minimum amount of resources (CPU/RAM/GPU) required.

Hardware specifications for all of the compute nodes are printed when you log in, or detailed specs can be printed using the scluster command.

You can view the calculated job priority numbers, resource requests, and job start times using the show_queue command.

User X has 20 jobs running, while I only submitted one job and it is pending. Why does user X get to run 20 jobs while my job waits?

After reviewing the answers to the previous question, the following are some additional reasons why your job is pending while other jobs are running:

  • User X is running 20 single-CPU jobs, which are easy to schedule in between the "gaps" of other users' running jobs
  • Your job requested a large number of CPUs or a large amount of RAM, making it difficult to schedule immediately if the cluster is busy (it usually is)
  • Your job priority is lower than that of other users, as calculated by the multifactor priority algorithm

You can view the calculated job priority numbers, resource requests, and job start times using the show_queue command.

There are many reasons why your job may not start immediately, and these factors are constantly changing. The best strategy for moving your jobs through the queue as quickly as possible is to use the short partition, request the minimum resources your job requires, checkpoint frequently, and split your work across multiple job submissions.
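One way to combine these suggestions is a job array of small tasks on the short partition. The following is a sketch under stated assumptions: the array range, memory, and time limit are placeholder values, and the input_N.dat file naming is hypothetical.

```shell
#!/bin/bash
#SBATCH --partition=short       # the short partition mentioned above
#SBATCH --array=1-10            # ten independent array tasks (placeholder range)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1       # request only what each task needs
#SBATCH --mem=2G                # placeholder value
#SBATCH --time=00:30:00         # placeholder value

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 1 outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# input_N.dat is a hypothetical naming scheme for per-task input files.
input_file="input_${task_id}.dat"
echo "Array task ${task_id} would process ${input_file}"
```

Each array task is scheduled independently, so small tasks like these can fit into the gaps between larger jobs.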

Why does my GPU code say there are no CUDA devices available?

The most likely cause is that your resource request did not include a GPU. Make sure to include --gres=gpu:X, where X is the number of GPUs you need per node.

To request a specific type of GPU, include the constraint flag in your job request, followed by the GPU type you would like, for example -C K80.
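A job script can check for this problem before starting the real work. The sketch below assumes that Slurm sets the CUDA_VISIBLE_DEVICES environment variable when a GPU is allocated through --gres, which is the usual behavior but depends on how the cluster is configured; the time limit is a placeholder value.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1            # one GPU per node
#SBATCH -C K80                  # constrain to K80 nodes, as in the example above
#SBATCH --time=00:30:00         # placeholder value

# Assumption: Slurm sets CUDA_VISIBLE_DEVICES when a GPU is allocated via --gres.
if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
    gpu_status="GPU(s) visible to this job: ${CUDA_VISIBLE_DEVICES}"
else
    gpu_status="No GPU visible: check that the job requested --gres=gpu:X"
fi
echo "${gpu_status}"
```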

My job runs more slowly on the cluster than it does on my laptop/workstation/server. Why is the cluster so slow?

From a purely hardware perspective, this is highly unlikely. If your job is running more slowly than you expect, the most common reasons are:

  • Incorrect/inadequate resource request (you only requested 1 CPU core, or the default amount of RAM)
  • Your code is doing something that you don't know about

A combination of these two factors is almost always the reason for jobs running more slowly on the cluster than on a different resource.

One example is Matlab code running on the cluster versus a modern laptop. Matlab implicitly performs some operations in parallel (e.g. matrix operations), spawning multiple threads to accomplish this parallelization. If your resource request on the cluster only includes 1 or 2 CPU cores, no matter how many threads Matlab spawns, they will be pinned to the cores you were assigned.

This can lead to a situation where a user's Matlab code is running using all 4 CPU cores on their laptop, but only 1 CPU core on the cluster. This can give the impression that the cluster is slower than a laptop. If you request 4 or 8 CPU cores in your resource request, Matlab can run these operations in parallel.

It is important to understand what the code or application you are using is actually doing when you run your job.
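One way to confirm how many cores a job can actually use is to print the count from inside the job. This sketch relies on the nproc command from GNU coreutils and assumes, as described above, that jobs are pinned to their assigned cores; the core count and time limit are placeholder values.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4       # cores available to a multithreaded application
#SBATCH --time=00:30:00         # placeholder value

# nproc reports the CPUs this process is allowed to use. Inside a job that is
# pinned to its assigned cores, this reflects the allocation rather than the
# total number of cores in the node.
allowed_cores="$(nproc)"
echo "Cores usable by this job: ${allowed_cores}"
```

If the number printed is smaller than the number of threads your application starts, those threads are sharing cores.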

Storage

Where should I keep my files on the cluster?

The safest place to keep files on the cluster is in your /home directory. This storage array has both hourly snapshots (last 5 days) and daily backups (last 30 days).

For the best performance, temporary output for running jobs should be written directly to the /tmp directory on the compute node (the $TMPDIR environment variable).
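A common pattern is to write to node-local storage while the job runs and copy only the files worth keeping back at the end. The following is a minimal sketch, not a site-provided script: RESULTS_DIR and the file names are hypothetical, and the echo stands in for a real application.

```shell
#!/bin/bash
# Create a private scratch directory under $TMPDIR (falling back to /tmp).
scratch_dir="$(mktemp -d "${TMPDIR:-/tmp}/job_scratch.XXXXXX")"

# Write temporary output to node-local storage while the job runs.
# The echo stands in for a real application writing its output.
echo "intermediate results" > "${scratch_dir}/output.txt"

# Copy the files worth keeping to a results directory, then clean up.
# RESULTS_DIR is a hypothetical variable; the default here is ./results.
results_dir="${RESULTS_DIR:-./results}"
mkdir -p "${results_dir}"
cp "${scratch_dir}/output.txt" "${results_dir}/"
rm -rf "${scratch_dir}"
```

Cleaning up at the end of the job matters because /tmp on a compute node is shared with other jobs on that node.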

The /work directory is legacy storage space that is not backed up. In the future, the /work directory will be renamed /scratch to more accurately represent what it should be used for.

I deleted a file and need to recover it! How do I get my files back?

As noted above, file recovery can only be accomplished in your /home directory.

Each directory has a hidden .snapshot directory that can be accessed by directly moving into it (this directory does not show up with the ls command). Use the command cd .snapshot to change to the directory, and then find the most recent snapshot using the ls -alh command. You can copy any files you need from the snapshot of your choice back to the relevant directory.
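The copy step can be sketched as a small shell function. restore_from_snapshot is a hypothetical helper, not a command installed on the cluster, and snapshot names vary, so list the .snapshot directory first to find the one you want.

```shell
#!/bin/bash
# restore_from_snapshot is a hypothetical helper, not a system command.
# Usage: restore_from_snapshot SNAPSHOT_NAME FILE
# Copies FILE from .snapshot/SNAPSHOT_NAME in the current directory back into
# the current directory, refusing to overwrite a file that already exists.
restore_from_snapshot() {
    local snapshot="$1"
    local file="$2"
    if [ -e "${file}" ]; then
        echo "${file} already exists; not overwriting" >&2
        return 1
    fi
    # -p preserves the original timestamps and permissions.
    cp -p ".snapshot/${snapshot}/${file}" "${file}"
}

# Intended use, from the directory the file was deleted from:
#   ls -alh .snapshot
#   restore_from_snapshot SNAPSHOT_NAME FILE
```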

How do I move files to and from the cluster storage?

Data can be transferred to and from your local computers and the cluster using Globus.

If you do not already have a Globus account, you can create one for free. To move data between your local machines and WPI storage, you can create a local endpoint using Globus Connect Personal.

Software

I need package X for my simulations. Can you install X for me?

First, check to make sure that the software you need isn't already available by running the module avail command.

Most system-level packages (e.g. CUDA, cuDNN, MPI, math libraries) are already installed as environment modules that can be loaded using module load PACKAGE_NAME, where PACKAGE_NAME is obtained using the module avail command.

If you do not see what you are looking for, there are a couple of options depending on the software.

Software installation depends greatly on the type of software (open source, commercial) and complexity.

Commercial software for which WPI has a license agreement allowing use on the cluster is likely already installed. If a Linux version of the commercial software you need exists and it is not already installed, please contact archelp@wpi.edu.

Open source software can either be installed by ARC staff, or installed directly in your /home directory. Most packages found on GitHub can be cloned directly into your /home directory, and include build instructions for user level installations.

Python packages are a different situation, and are covered elsewhere in this FAQ.

Python

I need package X for Python. Can you install it for me?

Python manages packages by name, not by version number. This creates problems when trying to support multiple users from a single system Python installation.

There are different versions of Python available as modules that can be used as a base for additional environment modifications.

In general, users are responsible for managing their Python environments. There are a number of options for installing Python packages (in order of ease of user control):

  1. Install Anaconda Python in your /home directory
  2. Use conda (after installing Anaconda) to organize your Python packages into environments (for multiple projects)
  3. Use virtualenv to create an environment to install packages
  4. Use the system pip/pip3 with the --user flag to install packages in your /home directory

Each of these options has benefits and problems. For options 3 and 4, to make sure there is a consistent Python environment on both the head node and compute nodes, each user should load a module-based Python installation as their default Python (e.g. module load python/gcc-4.8.5/3.6.0).

Why is my Tensorflow code failing with an error about libcudnn.so.7/libcuda.so.9 not being found?

While you can manage your Python environment from your home directory, libraries like cuDNN and CUDA are installed as environment modules. Tensorflow in particular is very picky about specific versions of cuDNN and CUDA. Code that fails because libcudnn.so.7 is not found requires a cuDNN 7 release to be loaded.

Double check to be sure you have the correct versions loaded for the version of Tensorflow you are running.
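One way to check is to look for the missing library in the directories your loaded modules have added to the library search path. This sketch assumes the cuDNN and CUDA modules work by extending the LD_LIBRARY_PATH environment variable, which is common but not guaranteed; find_in_library_path is a hypothetical helper, not part of CUDA or Tensorflow.

```shell
#!/bin/bash
# find_in_library_path is a hypothetical helper, not part of CUDA or Tensorflow.
# It prints every match for the given library file name found in the
# directories listed in LD_LIBRARY_PATH, and returns 1 if there is none.
find_in_library_path() {
    local lib="$1" dir found=1
    local IFS=':'               # split LD_LIBRARY_PATH on colons
    for dir in ${LD_LIBRARY_PATH:-}; do
        if [ -n "${dir}" ] && [ -e "${dir}/${lib}" ]; then
            echo "${dir}/${lib}"
            found=0
        fi
    done
    return "${found}"
}

# Intended use, after loading your CUDA and cuDNN modules:
#   find_in_library_path libcudnn.so.7
```

If nothing is printed, the loaded modules do not provide that library version, and a different module version is needed.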