A batch job is simply a shell script and can be written in any shell scripting
language the user desires. As a result, just about any command can be placed
into a script and submitted to NQS for execution. Batch jobs are submitted
to NQS from porsche.cis.udel.edu.
Each of the eight Alpha client machines allows a maximum of one batch job to be running at any time, thus giving the user a dedicated machine for his/her batch requests submitted via NQS. These machines do not permit direct user logins, so users desiring to run jobs on them must use NQS.
QUEUE TYPE    CPU TIME LIMIT
---------------------------------------
short         0.25 hour (15 minutes)
medium        0.50 hour (30 minutes)
long          0.75 hour (45 minutes)
extra_long    1.00 hour (60 minutes)
Queue names combine a queue type with the number of machines required, for example short1 or long4.
All queues are prioritized according to two criteria: jobs requiring fewer machines have higher priority than jobs requiring more machines, and jobs requiring less CPU time have higher priority than jobs requiring more CPU time. The NQS queues are first divided into categories based on the number of machines required, then each of these categories is further subdivided according to the CPU time required.
short1 (0.25 hour on 1 machine)
.
.
extra_long1 (1.00 hour on 1 machine)
short2 (0.25 hour on 2 machines)
.
.
extra_long2 (1.00 hour on 2 machines)
short4 (0.25 hour on 4 machines)
.
.
extra_long4 (1.00 hour on 4 machines)
short8 (0.25 hour on 8 machines)
.
.
extra_long8 (1.00 hour on 8 machines)
Actual priority assignments are based on a very rough approximation to
an exponential curve.
To choose an appropriate queue, first decide how many machines you need your job to run on and then select an appropriate maximum CPU time limit. Remember: the fewer machines and less CPU time your job needs, the higher the priority that your job receives.
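The selection rule above can be sketched as a small shell function. This is only an illustration: pick_queue is a hypothetical helper, not part of NQS, and it assumes the CPU limits and machine counts listed in this document.

```shell
#!/bin/sh
# pick_queue MACHINES CPU_MINUTES
# Prints the name of the shortest queue that fits the request, using the
# CPU limits from the table above (15, 30, 45, 60 minutes) and the machine
# counts 1, 2, 4, 8. Hypothetical helper, not an NQS command.
pick_queue() {
    machines=$1
    minutes=$2

    case "$machines" in
        1|2|4|8) ;;
        *) echo "pick_queue: machines must be 1, 2, 4, or 8" >&2; return 1 ;;
    esac

    if [ "$minutes" -le 15 ]; then
        qtype=short
    elif [ "$minutes" -le 30 ]; then
        qtype=medium
    elif [ "$minutes" -le 45 ]; then
        qtype=long
    elif [ "$minutes" -le 60 ]; then
        qtype=extra_long
    else
        echo "pick_queue: no queue allows more than 60 CPU minutes" >&2
        return 1
    fi

    echo "${qtype}${machines}"
}

pick_queue 2 20    # prints medium2
```

Choosing the shortest queue that still fits keeps the job's priority as high as possible.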
Information about queues is available through the qstat command. To find out what queues are present on
the local machine, use the following command:
qstat -x
If you add the -b switch, you will get a brief version
of the information, and if you add the -l switch, you
will get a longer version.
Here is some sample output from qstat -x:
short1@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=63 lim=40
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
short@viper, short@lamborghini, short@maserati, short@cobra};
This is queue short1. It is a pipe queue, which means that it
forwards all of its jobs to machines in the set "Destset". It has priority
63, and a total capacity ("lim") of 40 jobs.
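The fields just described (queue name, priority, and "lim") can be pulled out of this output with a short filter. The sketch below is a hypothetical helper, not part of NQS, and it assumes the header-line layout shown in the sample above.

```shell
#!/bin/sh
# queue_summary: read "qstat -x" style text on stdin and print one line per
# queue: name, priority, and job limit. Hypothetical helper; it relies on
# the header line containing "type=", "pri=" and "lim=" as in the sample.
queue_summary() {
    awk '
        /type=/ {
            name = $1
            sub(/@.*/, "", name)          # drop "@host;" from the queue name
            pri = ""; lim = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^pri=/) { pri = $i; sub(/^pri=/, "", pri) }
                if ($i ~ /^lim=/) { lim = $i; sub(/^lim=/, "", lim) }
            }
            print name, pri, lim
        }
    '
}

echo 'short1@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=63 lim=40' | queue_summary
# prints: short1 63 40
```

On a machine where NQS is installed, the same filter could be fed directly, as in "qstat -x | queue_summary".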
You can also use qstat to find out the status of jobs.
The required hosts must be listed in your ~/.rhosts file, or NQS will deny your jobs
access at the remote clients. For a complete list of hosts, please see the
DEC Alpha Cluster FAQ.
Jobs are submitted using the qsub command. qsub
accepts a script which contains the shell commands to be executed
when your job runs. In addition, qsub accepts several command line
options to modify characteristics of your job. The format of the
qsub command is:
qsub [options] <script>
Use the -q <queue> option to
specify which queue to submit your job to. For example, to submit a
script called myscript.sh to the queue medium2,
you would enter the command:
qsub -q medium2 myscript.sh
short1 is the default queue if you do not specify a queue
with the -q option.
When your job is successfully submitted, you will see a message like:
Request 42.porsche.cis.udel.edu submitted to queue: short1.
The number following the word "Request" (42 in this example) is your
request's identification number. If you wish to delete your job at any
time, you will need to remember this number!
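Rather than copying the number by hand, it can be extracted from the confirmation message. The following is a minimal sketch: request_id is a hypothetical helper, and it assumes the message has exactly the form shown above.

```shell
#!/bin/sh
# request_id: read qsub's confirmation message on stdin and print the
# numeric request id. Hypothetical helper; it assumes the message begins
# with "Request <number>.<host>" as in the example above.
request_id() {
    sed -n 's/^Request \([0-9][0-9]*\)\..*/\1/p'
}

echo 'Request 42.porsche.cis.udel.edu submitted to queue: short1.' | request_id
# prints: 42
```

Assuming qsub writes this message to standard output, the id could be saved at submission time with something like "qsub -q short1 myscript.sh | request_id > myjob.id" and read back later when the job needs to be deleted.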
NQS has the ability to notify you via email about stages your job has completed; these options are passed to NQS via qsub. A complete listing can be found in the list of mailing options later in this document, as well as in the man page.
The status of jobs can be checked with qstat. The switches
to do this are:
-a               Show all requests
-u <username>    Show requests belonging to a specific user
-o               Select jobs which originated on the local machine
-d               Show jobs on all machines within the local NQS domain
In addition, you can specify the -s switch for short
output, and the -l switch for long output.
Since no jobs are actually run on porsche and you will not know which remote
queues your job has been forwarded to, you will need to use the
-d option initially. Once you know which machine(s) and
queue(s) your job is in, you can use the command:
qstat [-s] [-l] queue@machine
to find out about your job. If you want to find out about all jobs on a
remote machine, use:
qstat [-s] [-l] @machine
For more information, see the man page.
Jobs are deleted with the qdel command. Its only
parameter is the id of the job to be deleted; this number was given to
you when you submitted your job with qsub. To delete a job
in its entirety, the command is:
qdel <id number>
But if you only want to delete a job on one remote machine, the appropriate
command is:
qdel <id number@machine>
"Occasionally, your job may be in the 'Waiting' or 'Queued' state, and it may not be clear why it is not running. Determination of the reason can be complicated. NQS allows system managers to set limits on the number of jobs that can be run in a queue at a time. There are queue run limits on the total number of jobs that can run at a time, and queue user run limits, which limit the number of jobs a particular user can run at a time. In the same manner there are global run and user run limits which determine the number of total jobs that can run on the system and the number of jobs a person can have running at any one time, respectively. An investigation of the interactions of these limits and the mix of jobs on the system should indicate the reason a particular request is not running."
The output of your job is saved in two files:
<job name>.o<job id> for stdout
<job name>.e<job id> for stderr
For example, suppose you submitted a job called myscript.sh,
and qsub informed you that your job id is 42. When
your job completes, you will have the following files in your directory:
myscript.sh.o42
myscript.sh.e42
You may notice that the stderr file contains the line
"stty: tcgetattr: Not a typewriter". This isn't anything
more than a nuisance and will have no effect on your job.
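To check a stderr file without being distracted by that line, the nuisance message can be filtered out. This is a sketch only: real_errors is a hypothetical helper, and it assumes the file naming scheme described above.

```shell
#!/bin/sh
# real_errors JOBNAME JOBID
# Prints the contents of JOBNAME.eJOBID with the harmless stty message
# removed. Hypothetical helper, not part of NQS. Like grep, it returns a
# non-zero status when nothing other than the stty line is present.
real_errors() {
    grep -v 'stty: tcgetattr: Not a typewriter' "$1.e$2"
}

# Example (run in the directory holding the output files):
#   real_errors myscript.sh 42
```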
There is a problem regarding these files which stems from our multiple job distribution setup, which NQS was not originally built to handle. If you want output from more than one run of your job, see the discussion of per-host output files later in this document.
NQS has the ability to notify you of a job's status via email.
When you submit your job with qsub, include the
-me switch to have NQS mail you when your job ends
execution. The complete list of mailing options for qsub is:
-mb               send mail when the request begins execution
-me               send mail when the request ends execution
-mr               send mail when the request is restarted
-mt               send mail when the request is transferred to the execution machine
-mu <username>    send mail for the request to the stated user
You can specify as many of these as you like. Mail will be sent to you from
user nqs (Joe NQS) to your account on porsche. To have your
mail sent to your eecis research account, either use the "-mu"
switch in addition to other mailing switches, or set up a .forward
file in your home directory.
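As an illustration of combining these switches, the sketch below builds a qsub command line that asks for mail at the start and end of the job and directs it to a named user. submit_with_mail is a hypothetical wrapper, and "username" is a placeholder; the function only prints the command so it can be inspected first.

```shell
#!/bin/sh
# submit_with_mail QUEUE USER SCRIPT
# Hypothetical wrapper: prints the qsub command that would request mail at
# job start (-mb) and job end (-me), addressed to USER (-mu).
# Remove the "echo" to actually submit on a machine where qsub is installed.
submit_with_mail() {
    echo qsub -q "$1" -mb -me -mu "$2" "$3"
}

submit_with_mail medium2 username myscript.sh
# prints: qsub -q medium2 -mb -me -mu username myscript.sh
```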
There is no way in NQS itself to prevent copies of a job running on different
machines from overwriting each other's output files. It is up
to you to control your program's output. It is important to remember that
the same copy of the job gets distributed to multiple machines, so if you
simply open a file (such as "outputfile"), you either risk
potential NFS file locking problems or overwriting of the file by another
copy of the same program running on another machine.
The workaround is fairly simple. What you need is something that is found
on every machine but unique to it, such as its hostname. A simple solution
is to redirect output to a file with the hostname appended to it. Below is
a sample script that would be submitted to qsub:
#!/bin/sh
# Send stdout and stderr to a file named after the host this copy runs on.
/usa/username/myprogram > outputfile.`hostname` 2>&1
If this ran on ferrari, jaguar, and corvette, you would have 3 files named:
outputfile.ferrari.cis.udel.edu
outputfile.jaguar.cis.udel.edu
outputfile.corvette.cis.udel.edu
which contain the output from the runs of myprogram on those
machines respectively.
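The same idea can be wrapped in a small function so that every output file in a script gets a per-host name. This is a sketch under assumptions: host_outfile is a hypothetical helper, and it uses uname -n to obtain the host name, which may or may not be fully qualified depending on how each machine is configured.

```shell
#!/bin/sh
# host_outfile BASE
# Prints BASE with this machine's node name appended, for example
# "outputfile.ferrari.cis.udel.edu". Hypothetical helper; uname -n is an
# assumed way of getting the host name from a Bourne-style shell.
host_outfile() {
    echo "$1.`uname -n`"
}

# Example use inside a job script:
#   /usa/username/myprogram > "`host_outfile outputfile`" 2>&1
```

Because each copy of the job computes the name on the machine where it runs, no two copies write to the same file.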
Here is the complete sample output from qstat -x:
short1@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=63 lim=40
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
short@viper, short@lamborghini, short@maserati, short@cobra};
short2@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=27 lim=40
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
short@viper, short@lamborghini, short@maserati, short@cobra};
short4@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=13 lim=40
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
short@viper, short@lamborghini, short@maserati, short@cobra};
short8@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=5 lim=40
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
short@viper, short@lamborghini, short@maserati, short@cobra};
medium1@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=51 lim=30
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {medium@ferrari, medium@jaguar, medium@corvette, medium@lotus,
medium@viper, medium@lamborghini, medium@maserati, medium@cobra};
medium2@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=23 lim=30
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {medium@ferrari, medium@jaguar, medium@corvette, medium@lotus,
medium@viper, medium@lamborghini, medium@maserati, medium@cobra};
medium4@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=11 lim=30
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {medium@ferrari, medium@jaguar, medium@corvette, medium@lotus,
medium@viper, medium@lamborghini, medium@maserati, medium@cobra};
medium8@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=3 lim=30
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {medium@ferrari, medium@jaguar, medium@corvette, medium@lotus,
medium@viper, medium@lamborghini, medium@maserati, medium@cobra};
long1@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=41 lim=20
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {long@ferrari, long@jaguar, long@corvette, long@lotus, long@viper,
long@lamborghini, long@maserati, long@cobra};
long2@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=18 lim=20
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {long@ferrari, long@jaguar, long@corvette, long@lotus, long@viper,
long@lamborghini, long@maserati, long@cobra};
long4@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=9 lim=20
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {long@ferrari, long@jaguar, long@corvette, long@lotus, long@viper,
long@lamborghini, long@maserati, long@cobra};
long8@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=1 lim=20
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {long@ferrari, long@jaguar, long@corvette, long@lotus, long@viper,
long@lamborghini, long@maserati, long@cobra};
extra_long1@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=31 lim=10
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {extra_long@ferrari, extra_long@jaguar, extra_long@corvette,
extra_long@lotus, extra_long@viper, extra_long@lamborghini,
extra_long@maserati, extra_long@cobra};
extra_long2@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=15 lim=10
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {extra_long@ferrari, extra_long@jaguar, extra_long@corvette,
extra_long@lotus, extra_long@viper, extra_long@lamborghini,
extra_long@maserati, extra_long@cobra};
extra_long4@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=7 lim=10
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {extra_long@ferrari, extra_long@jaguar, extra_long@corvette,
extra_long@lotus, extra_long@viper, extra_long@lamborghini,
extra_long@maserati, extra_long@cobra};
extra_long8@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=0 lim=10
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {extra_long@ferrari, extra_long@jaguar, extra_long@corvette,
extra_long@lotus, extra_long@viper, extra_long@lamborghini,
extra_long@maserati, extra_long@cobra};