[three_fourth_last padding=”0 0 0 0″]
Example of How to Run TensorFlow Batch Jobs
To use TensorFlow, first you need to load the module:
module load tensorflow/0.11
This automatically loads the correct software for TensorFlow (CUDA, cuDNN, etc.).
The CUDA module exports the relevant environment variables (CUDA_ROOT and LD_LIBRARY_PATH) for you when loaded.
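To confirm that the module set up the environment, a short check like the one below can be run after loading it. This is a minimal sketch, not part of the cluster tooling: the helper name missing_cuda_vars is hypothetical, and it only tests whether the two variables named above are set, not whether their values are correct.

```python
import os


def missing_cuda_vars(env=None, required=("CUDA_ROOT", "LD_LIBRARY_PATH")):
    """Return the names in `required` that are unset or empty in `env`."""
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_cuda_vars()
    if missing:
        print("Not set: " + ", ".join(missing) + " (was the module loaded?)")
    else:
        print("CUDA environment variables are set.")
```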
The second way to submit a job is with a batch script, which is submitted to the queue and runs on a compute node.
Batch jobs are intended for production-ready code and simulations.
You can submit a job to the queue with a shell script (e.g. tensorflow.sh) that calls the relevant program:
sbatch tensorflow.sh
The flags for the requested resources are placed at the top of the script like this:
#!/bin/bash
#SBATCH -N 1
#SBATCH -p exclusive
#SBATCH --gres=gpu:2
The example shell script below calls a program named tf_bash.py:
#!/bin/bash
#SBATCH -N 1
#SBATCH -p shared
#SBATCH -t 1:00:00
#SBATCH --gres=gpu:1
python tf_bash.py >> tf_test.log
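When a job requests GPUs with --gres=gpu:N, Slurm typically exposes the allocated devices to the job through the CUDA_VISIBLE_DEVICES environment variable. Whether this happens depends on the cluster's GRES configuration, so treat it as an assumption here. A program such as tf_bash.py could log what it was given at startup; the helper name allocated_gpus below is hypothetical.

```python
import os


def allocated_gpus(env=None):
    """Return the GPU IDs listed in CUDA_VISIBLE_DEVICES as a list of strings."""
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    # The variable is a comma-separated list such as "0,1"; ignore empty entries.
    return [dev.strip() for dev in value.split(",") if dev.strip()]


if __name__ == "__main__":
    gpus = allocated_gpus()
    print("GPUs visible to this job: %d (%s)" % (len(gpus), ", ".join(gpus) or "none"))
```

An empty list usually means the job did not request a GPU or is running outside a Slurm allocation.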
You can also submit a Python program (e.g. tf_direct.py) to the batch manager directly, without the extra shell script. Below is a simple TensorFlow program that can be submitted using the command sbatch tf_direct.py.
#!/usr/bin/env python
#SBATCH -N 1
#SBATCH -p exclusive
#SBATCH -o tf_test.out
#SBATCH -t 1:00:00
#SBATCH --gres=gpu:2
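import tensorflow as tf
# Build a constant string op and evaluate it in a session.
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
# Add two integer constants; this prints 42.
a = tf.constant(10)
b = tf.constant(32)
print(sess.run(a + b))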
At the top of the program, you will see the resource requests as #SBATCH comments. These can be put either at the top of the job script, or passed to the sbatch command when submitting, e.g.:
sbatch -N 1 -p exclusive -o tf_test.out -t 1:00:00 --gres=gpu:2 tf_direct.py
If you pass them to sbatch on the command line, there is no need to include them at the top of the program.
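If jobs are launched from another program, the same command line can be assembled from parameters. The sketch below is only an illustration: the function name sbatch_command is hypothetical, and it covers only the sbatch options used on this page.

```python
def sbatch_command(script, nodes=None, partition=None, output=None,
                   time=None, gres=None):
    """Build the sbatch argument list for `script` from resource requests."""
    cmd = ["sbatch"]
    if nodes is not None:
        cmd += ["-N", str(nodes)]      # number of nodes
    if partition:
        cmd += ["-p", partition]       # partition (queue)
    if output:
        cmd += ["-o", output]          # file for the job's standard output
    if time:
        cmd += ["-t", time]            # wall time limit
    if gres:
        cmd += ["--gres=" + gres]      # generic resources, e.g. gpu:2
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    cmd = sbatch_command("tf_direct.py", nodes=1, partition="exclusive",
                         output="tf_test.out", time="1:00:00", gres="gpu:2")
    print(" ".join(cmd))
    # To submit from Python on the cluster: subprocess.check_call(cmd)
```

With these arguments the printed command is the same as the sbatch line above.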
Example of How to Run Abaqus Batch Jobs
To use Abaqus, first you need to load the module:
module load abaqus/2016
Abaqus can run on either a single compute node, or across multiple nodes using MPI. For some large problems, the only way to run a simulation is by decomposing the system into domains for execution on separate compute nodes.
ARC has created an Abaqus job submission program for the cluster that automates resource requests, variable settings, and Abaqus flags.
After loading the Abaqus module, the sabaqus command is available. Run sabaqus --help to see the options for the cluster and Abaqus jobs:
gompei@ace:~/work$ sabaqus --help
Usage: sabaqus [options]
Options:
  -h, --help                           show this help message and exit
  -i INPF, --input=INPF                Abaqus input file (INPUT.inp)
  -l LOGF, --logfile=LOGF              Log file for Abaqus output
  -s SCR, --scratch=SCR                Abaqus scratch directory
  -p PART, --partition=PART            Partition (queue) for resource request
  -E ERR, --error=ERR                  Error file for Abaqus
  -t TIME, --walltime=TIME             Job run time (HOURS:MINUTES)
  -m MEM, --memory=MEM                 Abaqus memory (in GB)
  -N NODES, --nodes=NODES              Number of nodes
  -C CONST, --constraint=CONST         Hardware constraint (Slurm Feature)
  -n NTASKS, --ntasks-per-node=NTASKS  Number of tasks per node
  -g NGPUS, --gpus-per-node=NGPUS      Number of GPUs per node
The submission program simplifies job submission, so users can quickly submit new Abaqus simulations.
Suggestions or modifications to the sabaqus program are welcome.
If you choose to write your own submission script, below are some important details to keep in mind.
Abaqus requires a couple of special environment variable settings and flags in order to run properly on the cluster.
The first is a Slurm-specific environment variable that must be unset (a known issue with Abaqus on HPC clusters):
unset SLURM_GTIDS
The second setting is to pass the interactive keyword to the Abaqus command:
abq2016 cpus=40 gpus=2 mp_mode=mpi input=INPUT_FILENAME job=INPUT_FILENAME scratch=/PATH/TO/ABAQUS/SCRATCH memory="80 gb" interactive
Without the interactive keyword, the job will exit the queue immediately after starting.
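If the submission script is written in Python rather than shell, the two settings above can be applied as in the following sketch. The function names abaqus_environment and abaqus_command are hypothetical, and the argument values in the usage comment are the placeholders from the command above, not recommendations.

```python
import os


def abaqus_environment(env=None):
    """Return a copy of the environment with SLURM_GTIDS removed."""
    env = dict(os.environ if env is None else env)
    env.pop("SLURM_GTIDS", None)  # same effect as `unset SLURM_GTIDS`
    return env


def abaqus_command(job, cpus, gpus, scratch, memory_gb):
    """Build the abq2016 argument list, always ending with `interactive`."""
    return [
        "abq2016",
        "cpus=%d" % cpus,
        "gpus=%d" % gpus,
        "mp_mode=mpi",
        "input=%s" % job,
        "job=%s" % job,
        "scratch=%s" % scratch,
        "memory=%d gb" % memory_gb,  # one argument, like memory="80 gb" in the shell
        "interactive",               # keeps the job from exiting the queue immediately
    ]


# Usage inside a job script (runs Abaqus, so only meaningful on the cluster):
#   import subprocess
#   subprocess.check_call(
#       abaqus_command("INPUT_FILENAME", 40, 2, "/PATH/TO/ABAQUS/SCRATCH", 80),
#       env=abaqus_environment())
```

Passing the command as a list avoids shell quoting problems with the space in the memory argument.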
[/three_fourth_last]
Last Updated on May 3, 2017