NQS FAQ

Using the Network Queueing System (NQS)

This page describes how to use the Network Queueing System (NQS) which runs on our cluster of DEC Alphas. If you have any specific questions or problems that aren't answered below, please use the Help Request System.

General Questions

What is NQS?

NQS allows users to submit batch jobs to queues on local or remote machines for execution, and returns log and error files to the machine that originated the job. The version we are running has been partially rewritten by David Carver to allow batch jobs to run on multiple machines simultaneously.

A batch job is simply a shell script and can be written in any shell scripting language the user desires. As a result, just about any command can be placed into a script and submitted to NQS for execution.

Batch jobs are submitted to NQS from porsche. Each of the eight Alpha client machines allows a maximum of one batch job to be running at any time, which gives the user a dedicated machine for batch requests submitted via NQS. These machines do not permit direct user logins, so users who want to run jobs on them must use NQS.

How do I decide which queue to use?

Our queues are set up and prioritized according to two criteria: the number of machines required and the CPU time required. The CPU time limit for each queue type is:

    QUEUE TYPE    CPU TIME LIMIT
    ---------------------------------------
    short         0.25 hour (15 minutes)
    medium        0.50 hour (30 minutes)
    long          0.75 hour (45 minutes)
    extra_long    1.00 hour (60 minutes)

Queue names are a concatenation of queue type and number of machines required, such as short1 or extra_long8.

All queues are prioritized according to these criteria. Jobs requiring fewer machines have higher priority than jobs requiring more machines, and jobs requiring less CPU time have higher priority than jobs requiring more CPU time. The NQS queues are first divided into categories based on number of machines required, then each of these categories is further subdivided according to CPU time required.

    short1        (0.25 hour on 1 machine)
      .
      .
    extra_long1   (1.00 hour on 1 machine)
    short2        (0.25 hour on 2 machines)
      .
      .
    extra_long2   (1.00 hour on 2 machines)
    short4        (0.25 hour on 4 machines)
      .
      .
    extra_long4   (1.00 hour on 4 machines)
    short8        (0.25 hour on 8 machines)
      .
      .
    extra_long8   (1.00 hour on 8 machines)

Actual priority assignments are based on a very rough approximation to an exponential curve.

To choose an appropriate queue, first decide how many machines your job needs to run on, then select an appropriate maximum CPU time limit. Remember: the fewer machines and the less CPU time your job needs, the higher the priority it receives.

How do I find out what queues are available?

You can find out the status of queues and jobs using the qstat command:

    qstat -x

If you add the

Here is some sample output from qstat -x:

    short1@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=63 lim=40
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
                 short@viper, short@lamborghini, short@maserati, short@cobra};

This is queue short1 on porsche: a pipe queue (type=PIPE) with priority 63 whose destination set (Destset) is the short queue on each of the eight Alpha machines.

You can also use qstat to find out the status of jobs.

How do I submit a job to NQS?

Before submitting a job, ensure that all of the Alpha machines are in your

Jobs are submitted using the qsub command:

    qsub [options]

Use the -q option to select a queue, for example:

    qsub -q medium2 myscript.sh
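
The queue naming scheme described above (queue type followed by machine count) can be sketched as a small shell helper. This is only an illustration: the function name queue_name is hypothetical and not part of NQS, and it assumes you want the shortest queue type whose CPU time limit covers your estimate, since that choice gets the highest priority.

```shell
#!/bin/sh
# Hypothetical helper (not part of NQS): build a queue name from an
# estimated CPU time in minutes and a machine count.
# Limits are taken from the table above:
#   short 15, medium 30, long 45, extra_long 60 minutes.
queue_name() {
    minutes=$1
    machines=$2

    # Only 1, 2, 4, and 8 machine queues appear in the queue listing.
    case "$machines" in
        1|2|4|8) ;;
        *) echo "machines must be 1, 2, 4, or 8" >&2; return 1 ;;
    esac

    # Pick the shortest queue type that still covers the estimate.
    if [ "$minutes" -le 15 ]; then
        qtype=short
    elif [ "$minutes" -le 30 ]; then
        qtype=medium
    elif [ "$minutes" -le 45 ]; then
        qtype=long
    elif [ "$minutes" -le 60 ]; then
        qtype=extra_long
    else
        echo "no queue allows more than 60 CPU minutes" >&2
        return 1
    fi

    echo "${qtype}${machines}"
}

# Example: a job needing about 20 CPU minutes on 2 machines.
queue_name 20 2    # prints medium2
```

With such a helper, a submission could be written as qsub -q "$(queue_name 20 2)" myscript.sh, which is the same as the qsub -q medium2 myscript.sh example above. On a shell without $(...) support, backticks would be needed instead.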

When your job is successfully submitted, you will see a message like:

    Request 42.porsche.cis.udel.edu submitted to queue: short1.

The number following the word "Request" is your request's identification number. If you wish to delete your job at any time, you will need to remember this number!

NQS can notify you via email as your job completes each stage; the options for this are passed to NQS via qsub. A complete listing can be found in the man page.

How do I check on the status of a job?

You can check a job's status using qstat with the following options:

    -a              Show all requests
    -u <username>   Show requests belonging to a specific user
    -o              Select jobs which originated on the local machine
    -d              Show jobs on all machines within the local NQS domain

In addition, you can specify the

Since no jobs are actually run on porsche and you will not know which remote queues your job has been forwarded to, you will need to use

    qstat [-s] [-l] queue@machine

to find out about your job. If you want to find out about all jobs on a remote machine, use:

    qstat [-s] [-l] @machine

For more information, see the man page.

How do I delete a job?

You may delete a job with the qdel command:

    qdel <id number>

But if you only want to delete a job on one remote machine, the appropriate command is:

    qdel <id number@machine>

Why isn't my job running?

From the NQS setup documentation:

"Occasionally, your job may be in the 'Waiting' or 'Queued' state, and it may not be clear why it is not running. Determination of the reason can be complicated. NQS allows system managers to set limits on the number of jobs that can be run in a queue at a time. There are queue run limits on the total number of jobs that can run at a time, and queue user run limits, which limit the number of jobs a particular user can run at a time. In the same manner there are global run and user run limits which determine the number of total jobs that can run on the system and the number of jobs a person can have running at any one time, respectively.
An investigation of the interactions of these limits and the mix of jobs on the system should indicate the reason a particular request is not running."

How do I know when my job is finished?

By default, NQS will generate 2 files in the directory from which you submitted your job: one lists all output that went to stdout while your job was running, the other lists all output that went to stderr. These files take the form:

    <job name>.o<job id>   for stdout
    <job name>.e<job id>   for stderr

For example, if you submitted a job called myscript.sh and it was assigned request id 42, the files would be:

    myscript.sh.o42
    myscript.sh.e42

You may notice that the stderr file contains the line "

There is a problem regarding these files which stems from our multiple job distribution setup, which NQS was not originally built to handle. If you want output from more than one run of your job, see the question on multi-machine output below.

NQS can also notify you of a job's status via email. When you submit your job with qsub, you can add any of the following options:

    -mb              send mail when the request begins execution
    -me              send mail when the request ends execution
    -mr              send mail when the request is restarted
    -mt              send mail when the request is transferred to the execution machine
    -mu <username>   send mail for the request to the stated user

You can specify as many of these as you like. Mail will be sent to you from user

I just ran a job on multiple machines, but there's only one output file in my directory. Why?

By default, NQS is set up to provide you with two files that contain output from stdout and stderr. The problem is that each job has one unique id number, regardless of how many machines it gets distributed to. Hence when each instance of the job completes, the same two files are written, and the output files in your directory will hold only the output of the instance that finished most recently. While this is not a problem for one-machine jobs, it certainly is if you want output from jobs that run on two or more machines. There is no way in NQS itself to prevent this from happening.
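
To make the collision concrete, the sketch below builds the default file names from the <job name>.o<job id> and <job name>.e<job id> convention described above. The function name nqs_output_names is hypothetical, and the three hosts in the loop are simply examples taken from the queue listing; which machines a job actually lands on is not known in advance.

```shell
#!/bin/sh
# Hypothetical helper (not part of NQS): print the default stdout and
# stderr file names for a request, following the naming convention
# <job name>.o<job id> and <job name>.e<job id>.
nqs_output_names() {
    job_name=$1
    job_id=$2
    echo "${job_name}.o${job_id}"
    echo "${job_name}.e${job_id}"
}

# The names depend only on the job name and id, not on the host, so
# every machine running request 42 targets the same pair of files.
for host in ferrari jaguar corvette; do
    echo "instance on $host writes:"
    nqs_output_names myscript.sh 42
done
```

Each pass through the loop prints the same two names, which is why only one pair of files survives a multi-machine run.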

It is up to you to control your program's output. It is important to remember that the same copy of the job gets distributed to multiple machines, so if you simply open a file under a fixed name, every instance will write to that same file.

The workaround is fairly simple. What you need is something that is found on every machine but unique to it, such as its hostname. A simple solution is to redirect output to a file with the hostname appended to its name. Below is a sample script:

    #!/bin/sh
    # Assumes $HOST holds the executing machine's hostname; if it is not set
    # in your batch environment, substitute the output of the hostname command.
    /usa/username/myprogram > outputfile.$HOST 2>&1

If this ran on ferrari, jaguar, and corvette, you would have 3 files named:

    outputfile.ferrari.cis.udel.edu
    outputfile.jaguar.cis.udel.edu
    outputfile.corvette.cis.udel.edu

which contain the output from the runs of myprogram on each machine.

Listing of all available queues

This listing is directly from qstat -x:

    short1@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=63 lim=40
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
                 short@viper, short@lamborghini, short@maserati, short@cobra};

    short2@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=27 lim=40
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
                 short@viper, short@lamborghini, short@maserati, short@cobra};

    short4@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=13 lim=40
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
                 short@viper, short@lamborghini, short@maserati, short@cobra};

    short8@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=5 lim=40
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
                 short@viper, short@lamborghini, short@maserati, short@cobra};

    medium1@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=51 lim=30
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {medium@ferrari, medium@jaguar, medium@corvette, medium@lotus,
                 medium@viper, medium@lamborghini, medium@maserati, medium@cobra};

    medium2@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=23 lim=30
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {medium@ferrari, medium@jaguar, medium@corvette, medium@lotus,
                 medium@viper, medium@lamborghini, medium@maserati, medium@cobra};

    medium4@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=11 lim=30
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {medium@ferrari, medium@jaguar, medium@corvette, medium@lotus,
                 medium@viper, medium@lamborghini, medium@maserati, medium@cobra};

    medium8@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=3 lim=30
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {medium@ferrari, medium@jaguar, medium@corvette, medium@lotus,
                 medium@viper, medium@lamborghini, medium@maserati, medium@cobra};

    long1@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=41 lim=20
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {long@ferrari, long@jaguar, long@corvette, long@lotus,
                 long@viper, long@lamborghini, long@maserati, long@cobra};

    long2@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=18 lim=20
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {long@ferrari, long@jaguar, long@corvette, long@lotus,
                 long@viper, long@lamborghini, long@maserati, long@cobra};

    long4@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=9 lim=20
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {long@ferrari, long@jaguar, long@corvette, long@lotus,
                 long@viper, long@lamborghini, long@maserati, long@cobra};

    long8@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=1 lim=20
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {long@ferrari, long@jaguar, long@corvette, long@lotus,
                 long@viper, long@lamborghini, long@maserati, long@cobra};

    extra_long1@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=31 lim=10
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {extra_long@ferrari, extra_long@jaguar, extra_long@corvette, extra_long@lotus,
                 extra_long@viper, extra_long@lamborghini, extra_long@maserati, extra_long@cobra};

    extra_long2@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=15 lim=10
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {extra_long@ferrari, extra_long@jaguar, extra_long@corvette, extra_long@lotus,
                 extra_long@viper, extra_long@lamborghini, extra_long@maserati, extra_long@cobra};

    extra_long4@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=7 lim=10
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {extra_long@ferrari, extra_long@jaguar, extra_long@corvette, extra_long@lotus,
                 extra_long@viper, extra_long@lamborghini, extra_long@maserati, extra_long@cobra};

    extra_long8@porsche.cis.udel.edu; type=PIPE; [ENABLED, INACTIVE]; pri=0 lim=10
      0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
      Destset = {extra_long@ferrari, extra_long@jaguar, extra_long@corvette, extra_long@lotus,
                 extra_long@viper, extra_long@lamborghini, extra_long@maserati, extra_long@cobra};

David M. Carver, carver AT ee.udel.edu