efg's Research Notes
Embarrassingly Parallel Computations Using the Sun Grid Engine


Cluster Array Job

Purpose.  Many jobs can be submitted to the Sun Grid Engine (SGE) with a single qsub command using the "array job" feature.  The purpose of this simple example is to explain the mechanics of an SGE array job.

Background.  When an array job is submitted to the Sun Grid Engine, SGE dispatches the same job script to various cluster nodes, with each task identified by a unique "task id," which is a simple integer.  This number is available to the job as the SGE_TASK_ID environment variable.  Each task must use this single number to decide which case to run, which file to process, or, more generally, what work to do.  The combination of JOB_ID and SGE_TASK_ID uniquely identifies the output from each task of an array job.
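As a minimal sketch of how a job script might turn the task id into a unit of work, the helper below picks the Nth line from a list of input files.  The function name pick_input and the file name inputs.txt are hypothetical and are not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helper: print line number $1 of the list file $2.
pick_input() {
    local task_id=$1 list_file=$2
    sed -n "${task_id}p" "$list_file"
}

# Inside an array job SGE sets SGE_TASK_ID; fall back to 1 if it is
# unset so the sketch can also be tried outside the cluster.
task_id=${SGE_TASK_ID:-1}

# inputs.txt is an assumed file with one input path per line.
if [ -f inputs.txt ]; then
    input=$(pick_input "$task_id" inputs.txt)
    echo "Task $task_id would process: $input"
fi
```

With this pattern, task 1 handles the first line of the list, task 2 the second, and so on, so the range given to -t should match the number of lines in the list.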

Step-by-Step Instructions and Comments

  1. Some familiarity with the Linux environment on a cluster node is assumed here.  The "get info cluster" example provides some introductory information.
  2. Study/edit these scripts from a development Linux box:

    cat -n arrayjob.bash
    1 #! /bin/bash
    2 # Name to appear in qstat
    3 #$ -N ArrayJob
    4
    5 # Merge stderr with stdout
    6 #$ -j yes
    7
    8 # Simple script to show how SGE array job works
    9
    10 echo "Begin"
    11 echo "HOSTNAME=$HOSTNAME, JOB_ID=$JOB_ID, SGE_TASK_ID=$SGE_TASK_ID"
    12 echo "End"

    cat -n submit.bash
    1 #! /bin/bash
    2 qsub -t 1-26:5 -cwd arrayjob.bash

The submit.bash script issues a single qsub command (line 2) with the array job option, -t 1-26:5.  The -t option specifies the range of task ids and the step between values.  In this case, SGE will use task ids from 1 to 26 in steps of 5, namely 1, 6, 11, 16, 21, 26.  If no step size is given, the assumed value is 1.  So, -t 1-26 would result in submitting 26 tasks with task ids from 1 to 26.
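To check which task ids a given -t range implies before submitting, the range can be expanded locally.  This is only a sketch: expand_task_ids is a hypothetical helper built on the standard seq and paste utilities, not an SGE command.

```shell
#!/bin/bash
# Hypothetical helper: list the task ids implied by "-t first-last:step".
# The step defaults to 1, mirroring the default described above.
expand_task_ids() {
    local first=$1 last=$2 step=${3:-1}
    seq "$first" "$step" "$last" | paste -sd' ' -
}

expand_task_ids 1 26 5    # prints: 1 6 11 16 21 26
expand_task_ids 1 4       # prints: 1 2 3 4
```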

SGE will schedule the arrayjob.bash script to run once per task id, with the environment variable SGE_TASK_ID set to that task's value.

Special "#$" comments (lines 3 and 6) in arrayjob.bash provide additional SGE parameters that could have been specified on the qsub command line.  The "-j yes" parameter on line 6 specifies that standard error output will also be written to standard output.  Since no "-o" parameter is present to specify standard output, SGE writes each task's standard output to its own file, using the job name, job id, and task id to make a unique name for each file.
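The default naming can be sketched as a small function.  The pattern <job name>.o<job id>.<task id> is taken from the file listing in this example (for instance ArrayJob.o84265.1); the helper name default_output_name and the off-cluster fallback values are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: build the default stdout file name for one task,
# following the pattern <job name>.o<job id>.<task id>.
default_output_name() {
    local job_name=$1 job_id=$2 task_id=$3
    printf '%s.o%s.%s\n' "$job_name" "$job_id" "$task_id"
}

default_output_name ArrayJob 84265 11    # prints: ArrayJob.o84265.11

# Inside a running task the same pieces come from the environment.
# The fallbacks are placeholders for trying the sketch off-cluster.
default_output_name "${JOB_NAME:-ArrayJob}" "${JOB_ID:-0}" "${SGE_TASK_ID:-1}"
```

A task could use the same three values to name its own result files so that they line up with the SGE output files.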

  3. Log in to the cluster head node (cluster02 in this case), change to the working directory, and run the submit.bash script:

    [65 efg cluster02 23Oct07 16:39:58 /home/efg]
    cd cluster/arrayjob

    [66 efg cluster02 23Oct07 16:40:06 /home/efg/cluster/arrayjob]
    ./submit.bash
    Your job-array 84265.1-26:5 ("ArrayJob") has been submitted

  4. Log out of the cluster head node, and log in to a development box (genekc03 in this case).  Let's look at the files created by the array job above:

    [302 efg genekc03 23Oct07 16:40:38 /home/efg/cluster/arrayjob]
    ls -AlF
    total 32
    -rw-r--r-- 1 efg efg 57 Oct 23 16:40 ArrayJob.o84265.1
    -rw-r--r-- 1 efg efg 58 Oct 23 16:40 ArrayJob.o84265.11
    -rw-r--r-- 1 efg efg 58 Oct 23 16:40 ArrayJob.o84265.16
    -rw-r--r-- 1 efg efg 58 Oct 23 16:40 ArrayJob.o84265.21
    -rw-r--r-- 1 efg efg 58 Oct 23 16:40 ArrayJob.o84265.26
    -rw-r--r-- 1 efg efg 57 Oct 23 16:40 ArrayJob.o84265.6
    -rwxr-xr-x 1 efg efg 235 Oct 23 16:25 arrayjob.bash*
    -rwxr-xr-x 1 efg efg 49 Oct 23 16:24 submit.bash*

    Note the default names used by SGE to name the standard output files since no "-o" option was specified.  Here's what's in those files:

    [303 efg genekc03 23Oct07 16:46:09 /home/efg/cluster/arrayjob]
    cat ArrayJob.o*
    Begin
    HOSTNAME=node0024, JOB_ID=84265, SGE_TASK_ID=1
    End
    Begin
    HOSTNAME=node0020, JOB_ID=84265, SGE_TASK_ID=11
    End
    Begin
    HOSTNAME=node0023, JOB_ID=84265, SGE_TASK_ID=16
    End
    Begin
    HOSTNAME=node0016, JOB_ID=84265, SGE_TASK_ID=21
    End
    Begin
    HOSTNAME=node0000, JOB_ID=84265, SGE_TASK_ID=26
    End
    Begin
    HOSTNAME=node0002, JOB_ID=84265, SGE_TASK_ID=6
    End
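The output above appears in the order 1, 11, 16, 21, 26, 6 because the shell expands ArrayJob.o* alphabetically, not numerically.  One way to read the files in task id order is sketched below; cat_by_task_id is a hypothetical helper, and it assumes the task id is the text after the last dot in each file name.

```shell
#!/bin/bash
# Hypothetical helper: print array-job output files in task-id order.
# $1 is the common prefix, e.g. ArrayJob.o84265
cat_by_task_id() {
    local prefix=$1 f
    for f in "$prefix".*; do
        [ -e "$f" ] || continue                 # skip if nothing matched
        printf '%s\t%s\n' "${f##*.}" "$f"       # "<task id><TAB><file>"
    done | sort -n | cut -f2- | while IFS= read -r f; do
        cat "$f"
    done
}

# Usage in the working directory of this example:
#   cat_by_task_id ArrayJob.o84265
```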

Summary.  This toy example shows the mechanics of how an SGE array job can be submitted to a Linux cluster for execution. [More complicated examples are planned.]

Earl F. Glynn
efg@stowers-institute.org

Updated 23 Oct 2007