Using the Supercomputer When We Need a GPU¶
When running AI-related tasks for images and video, the supercomputer is often needed. This doc walks through the workflows and tasks for using GRACE or FASTER.
Finding and Loading Modules¶
Because of the restrictions on the login node and its file system, you will normally want to find and load existing modules rather than installing packages yourself, so that they don't count against your quota.
To search for a module, you can do something like:
$ module -r spider '.*torch.*'
This will return your matches, along with the modules that must be loaded before this one. Once you know that chain, load everything with ml (short for module load):
$ ml GCCcore/13.2.0 Python/3.11.5
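Before going further, it is worth confirming the load actually worked. A quick sanity check (ml with no arguments is Lmod's alias for module list):
$ ml                  # list currently loaded modules
$ python --version    # should come from the Python/3.11.5 module, not the system python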
Building and Activating a Virtual Environment¶
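There is nothing GRACE-specific about this part; the sketch below assumes a standard python -m venv workflow, with the environment kept in $SCRATCH so it does not eat into your home quota. The name mark_test_venv matches the one used in the SLURM template below; activate_venv (also used there) appears to be a site helper that wraps this same activation step, so if your cluster does not provide it, plain source .../bin/activate works too.
# load the same toolchain the job will use
$ ml GCCcore/13.2.0 Python/3.11.5
# create the environment somewhere roomy
$ cd $SCRATCH
$ python -m venv mark_test_venv
# activate it and install whatever the job needs
$ source mark_test_venv/bin/activate
$ pip install --upgrade pip
$ pip install torch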
The Costs of Using GPUs¶
GRACE has three different GPU types, each with its own costs and details:
GPU      | Effective SU charge per one hour (wall_time)
A100     | 72
RTX 6000 | 48
T4       | 24
Writing a SLURM Job¶
This is my template for writing a SLURM job:
#!/bin/bash
##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=marks_template_job # no spaces allowed in job names
#SBATCH --time=00:15:00 # modify to needs
#SBATCH --ntasks=1 # modify to needs
#SBATCH --ntasks-per-node=2 # modify to needs
#SBATCH --mem=4G # modify to needs
#SBATCH --output=is_torch_details_log.%j
#SBATCH --partition=gpu # Add this to Get GPUs
#SBATCH --gres=gpu:a100:1 # Specify the GPUs you want to use (see above)
#SBATCH --mail-type=ALL # add if you want an email
#SBATCH --mail-user=mark.baggett@tamu.edu # add if you want an email
# load required module(s)
module purge # Purge everything just in case then add modules in order
module load GCCcore/13.2.0
module load Python/3.11.5
source activate_venv mark_test_venv # Load Virtual Environment to Bring in Other Things You've added
python execute.py # Run your job
# Job Environment variables
echo $SLURM_JOBID
echo $SLURM_SUBMIT_DIR
echo $TMPDIR
echo $SCRATCH
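Once the job is submitted (see Running a Job below), a few commands I reach for while it runs; <jobid> is the number sbatch prints back:
$ squeue -u $USER                      # is the job pending or running?
$ scancel <jobid>                      # kill it if something is obviously wrong
$ cat is_torch_details_log.<jobid>     # read the output file named above once it finishes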
Calculating the Cost of a SLURM Job¶
Before you run a job, there is peace of mind in knowing what that job might cost. To do this, you can use maxconfig to get an estimate of what your job will cost and why:
$ maxconfig -f is_torch_gpu.slurm
Showing SU calculation for file is_torch_gpu.slurm
(CPU-billing + (GPU-billing * GPU-count)) * hours * nodes = SUs
( 2 + ( 72 * 1)) * 0.25 * 1 = 18.5
#!/bin/bash
#SBATCH --job-name=torch_details
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=2
#SBATCH --nodes=1
#SBATCH --mem=4G
#SBATCH --output=is_torch_details_log.%j
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=gpu
#SBATCH --mail-type=ALL
#SBATCH --mail-user=mark.baggett@tamu.edu
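For comparison (hand-calculated with the same formula, not maxconfig output), the identical 15-minute, one-node request on GRACE's other GPU types would be:
( 2 + ( 48 * 1)) * 0.25 * 1 = 12.5   # RTX 6000
( 2 + ( 24 * 1)) * 0.25 * 1 = 6.5    # T4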
Running a Job¶
Calculate the cost, then execute like:
$ sbatch demo.slurm
Submitted batch job 12685166
(from job_submit) your job is charged as below
Project Account: 132667767747
Account Balance: 19999.858889
Requested SUs: 0.5
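(Only 0.5 SUs here, presumably because demo.slurm requests no GPU: with just the CPU billing of 2, the formula gives 2 * 0.25 * 1 = 0.5.)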
Pro tips¶
When Nothing Works¶
When things go wrong, always:
$ deactivate && ml purge
If you can’t get something working, it’s probably a module inheritance problem: something loaded earlier in the session is still hanging around.
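A minimal clean-slate sequence, assuming the same modules and venv used in the template above:
$ deactivate                            # drop the venv, if one is active
$ ml purge                              # clear every loaded module
$ ml GCCcore/13.2.0 Python/3.11.5       # reload modules in the known-good order
$ source activate_venv mark_test_venv   # reactivate the venv on top of them
$ ml                                    # confirm only what you expect is loaded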
Check Venv First on Node¶
Before running a job, enter a REPL and make sure the things you expect are there. If you don't, you may waste time and service units (SUs).
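For example, assuming a PyTorch setup like the one above, a one-liner from inside the activated venv does the trick (torch.cuda.is_available() will only report True on a GPU node, so expect False on the login node):
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"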