|
Cluster Array Job
Purpose. Many jobs can be submitted to the Sun Grid Engine (SGE) with a single qsub command using the "array job" feature. The purpose of this simple example is to explain the mechanics of an SGE array job.
Background. When an array job is submitted to the Sun Grid Engine, it dispatches the same job script to various cluster nodes, with each copy identified by a unique "task id," which is a simple integer. The number is available to the script as the SGE_TASK_ID environment variable. A cluster job must use this single number to decide which case to run, which file to process, and in general what work to do. The combination of JOB_ID and SGE_TASK_ID can be used to uniquely identify output from an array job.
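As a minimal sketch of how a task might turn its task id into a piece of work, the script below picks the line of a list file whose line number equals the task id. The helper name pick_input, the list file, and the sample file names are hypothetical and not part of the original example.

```shell
#!/bin/bash
# Hypothetical helper: print line number $2 (1-based) of the list file $1.
pick_input() {
    sed -n "${2}p" "$1"
}

# Build a small demo list so the sketch runs anywhere (assumed contents).
list_file=$(mktemp)
printf '%s\n' sampleA.txt sampleB.txt sampleC.txt > "$list_file"

# SGE sets SGE_TASK_ID inside each array task; fall back to 1 when the
# variable is unset, for example during a dry run outside the cluster.
task_id=${SGE_TASK_ID:-1}
input=$(pick_input "$list_file" "$task_id")
echo "Task $task_id would process: $input"

rm -f "$list_file"
```

With a list of N entries, submitting with -t 1-N would give each task exactly one entry to work on.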
Step-by-Step Instructions and Comments
-
Some familiarity with the Linux environment for a cluster node is assumed here. The get info cluster example provides some introductory information.
-
Study/edit these scripts from a development Linux box:
cat -n arrayjob.bash
1 #! /bin/bash
2 # Name to appear in qstat
3 #$ -N ArrayJob
4
5 # Merge stderr with stdout
6 #$ -j yes
7
8 # Simple script to show how SGE array job works
9
10 echo "Begin"
11 echo "HOSTNAME=$HOSTNAME, JOB_ID=$JOB_ID, SGE_TASK_ID=$SGE_TASK_ID"
12 echo "End"
cat -n submit.bash
1 #! /bin/bash
2 qsub -t 1-26:5 -cwd arrayjob.bash
The submit.bash script issues a single qsub command (line 2) with the array job option, -t 1-26:5. The -t option gives the range of task ids and the step between values. In this case, SGE will use task ids from 1 to 26 in steps of 5, namely 1, 6, 11, 16, 21, 26. If no step size is given, the assumed value is 1, so -t 1-26 would submit 26 tasks with task ids from 1 to 26.
SGE will schedule the arrayjob.bash script to run once per task, with the environment variable $SGE_TASK_ID set to a different one of these values each time.
Special "#$" comments (lines 3 and 6) in arrayjob.bash provide additional SGE parameters that could instead have been specified on the qsub command line. The "-j yes" parameter on line 6 specifies that standard error will be merged into standard output. Since no "-o" parameter is present to specify where standard output goes, SGE builds a unique file name for each task from the job name, the job id, and the task id.
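The range arithmetic can be checked without a cluster. The sketch below only mimics the first-last:step notation with seq; it does not call SGE, and the function name expand_range is a hypothetical helper.

```shell
#!/bin/bash
# Expand a range written as "first-last" or "first-last:step"
# into the list of task ids it describes (step defaults to 1).
expand_range() {
    local range=$1 first last step
    first=${range%%-*}      # text before the first "-"
    last=${range#*-}        # text after the first "-"
    step=1
    case $last in
        *:*) step=${last#*:}; last=${last%%:*} ;;
    esac
    seq "$first" "$step" "$last"
}

expand_range 1-26:5 | paste -sd' ' -   # prints: 1 6 11 16 21 26
expand_range 1-3    | paste -sd' ' -   # prints: 1 2 3
```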
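For comparison, here is a sketch of the same job script with an explicit "-o" parameter. It assumes that the qsub on your cluster accepts the pseudo environment variables $JOB_NAME, $JOB_ID, and $TASK_ID in the output path, as described in the SGE qsub manual page, and that a logs directory already exists in the working directory; both points are assumptions to verify locally. Note that the output path uses $TASK_ID, while the script body uses $SGE_TASK_ID.

```shell
#!/bin/bash
# Name to appear in qstat
#$ -N ArrayJob

# Merge stderr with stdout
#$ -j yes

# Assumed layout: ./logs exists before submission (hypothetical directory).
# One log file per task, for example logs/ArrayJob.84265.6.log
#$ -o logs/$JOB_NAME.$JOB_ID.$TASK_ID.log

echo "HOSTNAME=$HOSTNAME, JOB_ID=$JOB_ID, SGE_TASK_ID=$SGE_TASK_ID"
```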
-
Log in to the cluster head node (cluster02 in this case), change to the working directory, and run the submit.bash script:
[65 efg cluster02 23Oct07 16:39:58 /home/efg]
cd cluster/arrayjob
[66 efg cluster02 23Oct07 16:40:06 /home/efg/cluster/arrayjob]
./submit.bash
Your job-array 84265.1-26:5 ("ArrayJob") has been submitted
-
Log out of the cluster head node, and log in to a development box (genekc03 in this case). Let's look at the files created by the array job above:
[302 efg genekc03 23Oct07 16:40:38 /home/efg/cluster/arrayjob]
ls -AlF
total 32
-rw-r--r-- 1 efg efg 57 Oct 23 16:40 ArrayJob.o84265.1
-rw-r--r-- 1 efg efg 58 Oct 23 16:40 ArrayJob.o84265.11
-rw-r--r-- 1 efg efg 58 Oct 23 16:40 ArrayJob.o84265.16
-rw-r--r-- 1 efg efg 58 Oct 23 16:40 ArrayJob.o84265.21
-rw-r--r-- 1 efg efg 58 Oct 23 16:40 ArrayJob.o84265.26
-rw-r--r-- 1 efg efg 57 Oct 23 16:40 ArrayJob.o84265.6
-rwxr-xr-x 1 efg efg 235 Oct 23 16:25 arrayjob.bash*
-rwxr-xr-x 1 efg efg 49 Oct 23 16:24 submit.bash*
Because no "-o" option was specified, SGE gave the standard output files default names made up of the job name, the letter "o" followed by the job id, and the task id. Here's what's in those files:
[303 efg genekc03 23Oct07 16:46:09 /home/efg/cluster/arrayjob]
cat ArrayJob.o*
Begin
HOSTNAME=node0024, JOB_ID=84265, SGE_TASK_ID=1
End
Begin
HOSTNAME=node0020, JOB_ID=84265, SGE_TASK_ID=11
End
Begin
HOSTNAME=node0023, JOB_ID=84265, SGE_TASK_ID=16
End
Begin
HOSTNAME=node0016, JOB_ID=84265, SGE_TASK_ID=21
End
Begin
HOSTNAME=node0000, JOB_ID=84265, SGE_TASK_ID=26
End
Begin
HOSTNAME=node0002, JOB_ID=84265, SGE_TASK_ID=6
End
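The output above appears in the order 1, 11, 16, 21, 26, 6 because the shell expands ArrayJob.o* in lexical order, not numeric order. A minimal sketch for listing the files by numeric task id follows; it recreates the six file names in a scratch directory (an assumption made so that the example runs anywhere) and sorts on the third dot-separated field.

```shell
#!/bin/bash
# Recreate the six output file names from the listing in a scratch directory.
demo_dir=$(mktemp -d)
for id in 1 6 11 16 21 26; do
    echo "SGE_TASK_ID=$id" > "$demo_dir/ArrayJob.o84265.$id"
done

# Field 3 (split on ".") is the task id; sort that field numerically.
ordered=$(cd "$demo_dir" && ls ArrayJob.o84265.* | sort -t . -k 3,3n)

# Print each file in task-id order: 1, 6, 11, 16, 21, 26.
for f in $ordered; do
    cat "$demo_dir/$f"
done

rm -rf "$demo_dir"
```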
Summary. This toy example shows the mechanics of how an SGE array job can be submitted to a Linux cluster for execution. [More complicated examples are planned.] |