==========================================
Using the Supercomputer When We Need a GPU
==========================================

When running AI-related tasks on images and video, the supercomputer is often needed.
This doc covers workflows and tasks for using GRACE or FASTER.

---------------------------
Finding and Loading Modules
---------------------------

Because of the restrictions on the connection node and its file system, you will normally
want to find and load modules rather than install software yourself, so that it doesn't
count against your quota.

To search for a module, you can do something like:

.. code-block:: console

    $ module -r spider '.*torch.*'

This will return your matches along with the modules that must be loaded before the module
you want. Once you know the prerequisites, load them in order:

.. code-block:: console

    $ ml GCCcore/13.2.0 Python/3.11.5

---------------------------------------------
Building and Activating a Virtual Environment
---------------------------------------------

-----------------------
The Costs of Using GPUs
-----------------------

GRACE has 3 different GPUs, each with their own costs and details:

=========  ============================================
GPU        Effective SU charge per one hour (wall_time)
=========  ============================================
A100       72
RTX 6000   48
T4         24
=========  ============================================

-------------------
Writing a SLURM Job
-------------------

This is my template for writing a SLURM job:

.. code-block:: slurm

    #!/bin/bash

    ##NECESSARY JOB SPECIFICATIONS
    #SBATCH --job-name=marks_template_job
    #SBATCH --time=00:15:00                      # modify to needs
    #SBATCH --ntasks=1                           # modify to needs
    #SBATCH --ntasks-per-node=2                  # modify to needs
    #SBATCH --mem=4G                             # modify to needs
    #SBATCH --output=is_torch_details_log.%j
    #SBATCH --partition=gpu                      # Add this to get GPUs
    #SBATCH --gres=gpu:a100:1                    # Specify the GPUs you want to use (see above)
    #SBATCH --mail-type=ALL                      # add if you want an email
    #SBATCH --mail-user=mark.baggett@tamu.edu    # add if you want an email

    # load required module(s)
    module purge                                 # Purge everything just in case, then add modules in order
    module load GCCcore/13.2.0
    module load Python/3.11.5

    source activate_venv mark_test_venv          # Load virtual environment to bring in other things you've added

    python execute.py                            # Run your job

    # Job environment variables
    echo $SLURM_JOBID
    echo $SLURM_SUBMIT_DIR
    echo $TMPDIR
    echo $SCRATCH

--------------------------------
Calculating Costs of a SLURM Job
--------------------------------

Before you run a job, there is peace of mind in understanding what that job might cost. To do
this, you can use :code:`maxconfig` to get an estimate of what your job will cost and why:

.. code-block:: console

    $ maxconfig -f my-job.slurm
    Showing SU calculation for file is_torch_gpu.slurm
    (CPU-billing + (GPU-billing * GPU-count)) * hours * nodes = SUs
    (          2 + (         72 *         1)) *  0.25 *     1 = 18.5
    #!/bin/bash
    #SBATCH --job-name=torch_details
    #SBATCH --time=00:15:00
    #SBATCH --ntasks=1
    #SBATCH --ntasks-per-node=2
    #SBATCH --nodes=1
    #SBATCH --mem=4G
    #SBATCH --output=is_torch_details_log.%j
    #SBATCH --gres=gpu:a100:1
    #SBATCH --partition=gpu
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=mark.baggett@tamu.edu

-------------
Running a Job
-------------

Calculate costs, then execute like:

.. code-block:: console

    $ sbatch demo.slurm
    Submitted batch job 12685166
    (from job_submit) your job is charged as below
    Project Account: 132667767747
    Account Balance: 19999.858889
    Requested SUs: 0.5

--------
Pro tips
--------

When Nothing Works
==================

When things go wrong, always:

.. code-block:: console

    $ deactivate && ml purge

If you can't get something working, it's probably a module inheritance problem.
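If purging and reloading doesn't fix it, it can help to confirm which interpreter and
environment are actually in use. The sketch below is a minimal, illustrative check (the
filename and the expectation that a virtual environment is active are assumptions, not part
of the template above):

.. code-block:: python

    # check_env.py (illustrative name): confirm which Python environment is active.
    import sys

    # Should point inside your virtual environment, not the bare module-provided Python.
    print("Interpreter:", sys.executable)

    # The root of whatever environment is currently active.
    print("Environment root:", sys.prefix)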
Check Venv First on Node
========================

Before running a job, enter a REPL and make sure the things you expect are there. If you
don't, you may waste time and service units (:code:`SU`).
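A quick check in the REPL might look like the sketch below; it assumes the job depends on
:code:`torch` (swap in whatever packages your job actually imports), and the GPU check will
only report :code:`True` on a GPU node:

.. code-block:: python

    # Run these lines inside `python` after activating your venv and loading modules.
    import torch  # assumed dependency for this example; use your own packages

    print(torch.__version__)              # confirm the version you expect is installed
    print(torch.cuda.is_available())      # True only if the environment can see a GPU
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # which GPU the job would actually use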