WorkBook/NAF/Batch System

Contents

Submitting Jobs
Monitoring Jobs
Good Practice
Ganga
Documentation

The NAF uses SGE (Sun Grid Engine) as its batch system. It is similar to LSF or PBS but takes a different approach to defining resources: CPU time and memory, for example, are requested as resources.

For more details see NAF:Batch System.

Submitting Jobs

Jobs are submitted using the qsub command. The resources that matter for Athena jobs are requested with the -l option:

time: -l h_cpu=12:00:00 (hard limit for CPU time: 12 hours)
virtual memory: -l h_vmem=1G (hard limit for virtual memory: 1 GB)
stack for threads (needed for athena): -l h_stack=10M (hard limit for stack size: 10 MB)
scratch disk space on the worker node: -l h_fsize=10G (hard limit for space: 10 GB)
hostname: -l hostname=tcx040 (run job only on host with hostname tcx040)
site (hh|zn): -l site=hh (run jobs preferentially at hh; useful if you process data from /scratch)

Either put the options on the command line or into the job script using the special comment #$. For more information about resources see the general documentation or e.g. man complex. For a list of the resources available per queue use qstat -F. For additional options see man qsub.
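The command-line form can be sketched in Python as follows. This is only an illustration: the helper name build_qsub_command and the script name job.sh are hypothetical, while the resource names and values are the ones from the list above. The sketch only assembles the argument list; it does not call qsub.

```python
import shlex


def build_qsub_command(script, resources):
    """Return the qsub argument list for a job script and a dict of resources.

    Hypothetical helper for illustration: each resource becomes one
    "-l name=value" pair, in the order given.
    """
    command = ["qsub"]
    for name, value in resources.items():
        command += ["-l", f"{name}={value}"]
    command.append(script)
    return command


# Resource values taken from the list above; "job.sh" is a placeholder name.
resources = {
    "h_cpu": "12:00:00",
    "h_vmem": "1G",
    "h_stack": "10M",
    "h_fsize": "10G",
}
command = build_qsub_command("job.sh", resources)

# Print the command as it would be typed in a shell.
print(shlex.join(command))
```

On a NAF login node the list could then be handed to subprocess.run(command, check=True) to submit the job.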

Memory consumption for standard ATLAS jobs:

generation: 500M
simulation: 500M
reconstruction: 2G+
ntuple analysis: 500M

Example script (for athena):

#! /bin/zsh

echo -e "\nJob started at "`date`" on "`hostname --fqdn`"\n"

# (only accept jobs with correct resources)
#$ -w e
#
# (stderr and stdout are merged together to stdout)
#$ -j y
#
# (send mail on job's end and abort)
#$ -m ae -M max.mustermann@desy.de
#
# (put log files into current working directory)
#$ -cwd
#
# (use ATLAS project)
#$ -P atlas
#
# (choose memory and disc usage)
#$ -l h_vmem=1G
#$ -l h_stack=10M
#$ -l h_fsize=10G
#
# (choose time)
#$ -l h_cpu=12:00:00

# change to scratch directory
cd $TMPDIR

# if you need input files, you might consider copying them to the local working directory
# cp /scratch/input/bigfile .

# setup athena
echo "setting up athena"
asetup 15.6.12 --testarea=$HOME/atlas/testarea/15.6.12
ini atlas
atlas_test_config.sh

# run athena
athena.py AthExHelloWorld/AthExHelloWorld_jobOptions.py

# save your output to long-term storage (the path below is only an example)
# cp output /scratch/output/bigfile

echo -e "\nJob stopped at "`date`" on "`hostname --fqdn`"\n"

exit 0

Please note that the #$ syntax is used to specify options to the batch system. Lines starting with #$ are read at job submission time and interpreted as options to the qsub command.
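To make this concrete, the following Python sketch collects the options from the #$ lines of a script. It is a simplified model for illustration, not how qsub itself is implemented; the function name collect_directives is hypothetical, and the sample script is a shortened version of the example above.

```python
import shlex

DIRECTIVE_PREFIX = "#$"


def collect_directives(script_text):
    """Collect the options given on "#$" lines of a job script.

    Simplified model: every line that starts with "#$" contributes its
    remaining words as options; all other lines are ignored.
    """
    options = []
    for line in script_text.splitlines():
        if line.startswith(DIRECTIVE_PREFIX):
            options += shlex.split(line[len(DIRECTIVE_PREFIX):])
    return options


# Shortened version of the example script above.
script = """#! /bin/zsh
# (stderr and stdout are merged together to stdout)
#$ -j y
#
# (choose time)
#$ -l h_cpu=12:00:00
cd $TMPDIR
"""

print(collect_directives(script))
# prints ['-j', 'y', '-l', 'h_cpu=12:00:00']
```

Ordinary comments (lines starting with # but not #$) and shell commands are skipped, which is why the explanatory comments in the example script do not disturb the options.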

Monitoring Jobs

Job status can be monitored with the qstat or the qmon command. The job metadata (requested resources, used resources, job status) of all finished jobs is recorded in a database, which can be queried using a web interface.

Besides monitoring, qmon offers a graphical user interface for actions such as suspending, holding and resubmitting jobs.

Good Practice

If your job writes big files, write them to the local disk first and copy them to long-term storage at the end of the job.
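For jobs driven from Python, the pattern might look like the sketch below. The names run_with_local_output and write_example are hypothetical, and a temporary directory stands in for the long-term storage location. The sketch relies on Python's tempfile module honouring $TMPDIR, the scratch directory used in the example script above.

```python
import os
import shutil
import tempfile


def run_with_local_output(produce, destination_dir, filename):
    """Write output on the local disk first, then copy it to long-term storage.

    produce(path) is a hypothetical callback that writes the output file.
    Returns the path of the copied file.
    """
    # tempfile honours $TMPDIR, so inside a batch job this directory
    # is created on the local scratch disk.
    with tempfile.TemporaryDirectory() as workdir:
        local_path = os.path.join(workdir, filename)
        produce(local_path)  # all writes go to the local disk
        os.makedirs(destination_dir, exist_ok=True)
        # One copy at the end of the job instead of many small remote writes.
        return shutil.copy(local_path, destination_dir)


def write_example(path):
    """Stand-in for the real job; writes a small text file."""
    with open(path, "w") as handle:
        handle.write("example output\n")


# A temporary directory stands in for the long-term storage location.
destination = tempfile.mkdtemp()
saved = run_with_local_output(write_example, destination, "output.txt")
print(saved)
```

The local working directory is removed when the with block ends, so only the copied file remains.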

Ganga

Ganga can submit jobs directly to the SGE batch system. Support for Athena and AthenaMC keeps improving over time, but there are still a few limitations. In practice you need to set an SGE() object as the backend:

j.backend = SGE()
j.backend.extraopts = '-l h_cpu=0:10:00 -l h_vmem=0.5G -l site=zn'

This example also shows how to change the requested resources for this job.
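If the limits are computed rather than typed by hand, the option string can be assembled with a small helper. The sketch below is plain Python and runs outside Ganga; the function names format_h_cpu and format_extraopts are hypothetical, and the values reproduce the example above.

```python
from datetime import timedelta


def format_h_cpu(limit):
    """Format a timedelta as the H:MM:SS string used for h_cpu."""
    total = int(limit.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def format_extraopts(cpu_limit, vmem, site):
    """Build the option string for the three resources used in the example."""
    return f"-l h_cpu={format_h_cpu(cpu_limit)} -l h_vmem={vmem} -l site={site}"


extraopts = format_extraopts(timedelta(minutes=10), "0.5G", "zn")
print(extraopts)
# prints -l h_cpu=0:10:00 -l h_vmem=0.5G -l site=zn

# Inside a Ganga session the string would be assigned as in the example above:
# j.backend.extraopts = extraopts
```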

Documentation

The SGE man pages are installed at the NAF, e.g. man qsub. For further information see NAF:Batch System.

ATLAS: WorkBook/NAF/Batch System (last edited 2010-09-06 20:53:46 by WolfgangEhrenfeld)