
Using the Network Queueing System (NQS)

This page describes how to use the Network Queueing System (NQS) which runs on our cluster of DEC Alphas. If you have any specific questions or problems that aren't answered below, please use the Help Request System.


General Questions

What is NQS?

NQS allows users to submit batch jobs to queues on local or remote machines for execution while returning log/error files to the machine that originated the job. The version we are running has been partially rewritten by David Carver to allow batch jobs to run on multiple machines simultaneously.

A batch job is simply a shell script and can be written in any shell scripting language the user desires. As a result, just about any command can be placed into a script and submitted to NQS for execution. Batch jobs are submitted to NQS from porsche.cis.udel.edu.
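A minimal batch script might look like the following sketch (the directory and program names are placeholders; substitute your own commands):

         #!/bin/sh

         # Ordinary shell commands go here, exactly as you would type them
         cd /usa/username/project
         ./myprogram < input.dat > results.out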

Each of the eight Alpha client machines runs at most one batch job at a time, so your job effectively gets a dedicated machine for the duration of its run. These machines do not permit direct user logins, so users who want to run jobs on them must use NQS.


How do I decide which queue to use?

Our queues are set up and prioritized according to 2 criteria:

  • Number of machines. Jobs can be run on 1, 2, 4, or 8 machines simultaneously.
  • CPU time limit. Each queue has an associated CPU time limit which jobs may not exceed. The queue types and their limits are:
             QUEUE TYPE        CPU TIME LIMIT
           ---------------------------------------
             short         0.25 hour (15 minutes)
             medium        0.50 hour (30 minutes)
             long          0.75 hour (45 minutes)
             extra_long    1.00 hour (60 minutes)

Queue names are a concatenation of queue type and number of machines required, such as short1 or long4.

All queues are prioritized according to these criteria. Jobs requiring fewer machines have higher priority than jobs requiring more machines, and jobs requiring less CPU time have higher priority than jobs requiring more CPU time. The NQS queues are first divided into categories based on number of machines required, then each of these categories is further subdivided according to CPU time required.

         short1       (0.25 hour on 1 machine)
           .
           .
         extra_long1  (1.00 hour on 1 machine)
         short2       (0.25 hour on 2 machines)
           .
           .
         extra_long2  (1.00 hour on 2 machines)
         short4       (0.25 hour on 4 machines)
           .
           .
         extra_long4  (1.00 hour on 4 machines)
         short8       (0.25 hour on 8 machines)
           .
           .
         extra_long8  (1.00 hour on 8 machines)

Actual priority assignments are based on a very rough approximation to an exponential curve.

To choose an appropriate queue, first decide how many machines your job needs to run on, then select an appropriate maximum CPU time limit. Remember: the fewer machines and the less CPU time your job needs, the higher the priority your job receives.
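For example, a job expected to need roughly 40 minutes of CPU time on each of two machines exceeds medium2's 30-minute limit, so it belongs in long2. It would be submitted as described under "How do I submit a job to NQS?" below:

         qsub -q long2 myscript.sh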


How do I find out what queues are available?

You can find out the status of queues and jobs using the qstat command. To find out what queues are present on the local machine, use the following command:

         qstat -x

If you add the -b switch, you will get a brief version of the information, and if you add the -l switch, you will get a longer version.
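For instance, assuming the switches are combined in the usual way:

         qstat -x -b        (brief listing of the queues on the local machine)
         qstat -x -l        (longer, more detailed listing)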

Here is some sample output from qstat -x:

 short1@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=63  lim=40
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
              short@viper, short@lamborghini, short@maserati, short@cobra};

This is queue short1. It is a pipe queue, which means that it forwards all of its jobs to machines in the set "Destset". It has priority 63 and a total capacity ("lim") of 40 jobs.

You can also use qstat to find out the status of jobs.


How do I submit a job to NQS?

Before submitting a job, ensure that all of the Alpha machines are listed in your ~/.rhosts file, or NQS will deny your jobs access to the remote clients. For a complete list of the machines, please see the DEC Alpha Cluster FAQ.
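Each entry in ~/.rhosts is a hostname, optionally followed by your username. For example, the first two entries might look like this (repeat the pattern for every client machine named in the cluster FAQ, substituting your own username):

         ferrari.cis.udel.edu    username
         jaguar.cis.udel.edu     username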

Jobs are submitted using the qsub command. Qsub accepts a script containing the shell commands to be executed when your job runs. In addition, qsub accepts several command-line options that modify the characteristics of your job. The format of the qsub command is:

         qsub [options] <script>

Use the -q <queue> option to specify which queue to submit your job to. For example, to submit a script called myscript.sh to the queue medium2, you would enter the command:

         qsub -q medium2 myscript.sh

short1 is the default queue if you do not specify a queue with the -q option.
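In other words, the following two commands are equivalent:

         qsub myscript.sh
         qsub -q short1 myscript.sh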

When your job is successfully submitted, you will see a message like:

         Request 42.porsche.cis.udel.edu submitted to queue: short1.

The number following the word "Request" is your request's identification number. If you wish to delete your job at any time, you will need to remember this number!

NQS can notify you via email as your job passes through various stages of execution; these options are passed to NQS via qsub. A complete listing can be found in the man page.


How do I check on the status of a job?

You can check a job's status using qstat. The switches to do this are:

         -a               Show all requests
         -u <username>    Show requests belonging to a specific user
         -o               Select jobs which originated on the local machine
         -d               Show jobs on all machines within the local NQS domain

In addition, you can specify the -s switch for short output, and the -l switch for long output.

Since no jobs are actually run on porsche and you will not know which remote queues your job has been forwarded to, you will need to use the -d option initially. Once you know which machine(s) and queue(s) your job is in, you can use the command:

         qstat [-s] [-l] queue@machine

to find out about your job. If you want to find out about all jobs on a remote machine, use:

         qstat [-s] [-l] @machine
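For example, to list only your own jobs everywhere in the NQS domain immediately after submitting, you might combine the switches like this:

         qstat -d -u <username>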

For more information, see the man page.


How do I delete a job?

You may delete a job with the qdel command. Its only parameter is the id of the job to be deleted; this number was given to you when you submitted your job with qsub. To delete a job in its entirety, the command is:

        qdel <id number>

But if you only want to delete a job on one remote machine, the appropriate command is:

         qdel <id number@machine>
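For example, assuming your request id is 42 (as in the submission example above) and one copy of the job is running on ferrari, the two forms would be:

         qdel 42
         qdel 42@ferrari

(You may need to give the machine's full hostname, e.g. ferrari.cis.udel.edu.)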

Why isn't my job running?

From the NQS setup documentation:

"Occasionally, your job may be in the 'Waiting' or 'Queued' state, and it may not be clear why it is not running. Determination of the reason can be complicated. NQS allows system managers to set limits on the number of jobs that can be run in a queue at a time. There are queue run limits on the total number of jobs that can run at a time, and queue user run limits, which limit the number if jobs a particular user can run at a time. In the same manner there are global run and user run limits which determine the number of total jobs that can run on the system and the number of jobs a person can have running at any one time, respectively. An investigation of the interactions of these limits and the mix of jobs on the system should indicate the reason a particular request is not running."


How do I know when my job is finished?

By default, NQS will generate two files in the directory from which you submitted your job: one is a listing of all output that went to stdout while your job was running, the other is a listing of all output that went to stderr. These files take the form:

         <job name>.o<job id>    for stdout
         <job name>.e<job id>    for stderr

For example, suppose you submitted a job called myscript.sh and qsub informed you that your job id is 42. When your job completes, you will have the following files in your directory:

         myscript.sh.o42
         myscript.sh.e42

You may notice that the stderr file contains the line "stty: tcgetattr: Not a typewriter". This isn't anything more than a nuisance and will have no effect on your job.

There is a problem with these files which stems from our multiple-machine job distribution setup, which NQS was not originally built to handle. If you run your job on more than one machine and want the output from each run, see the question "I just ran a job on multiple machines, but there's only one output file in my directory. Why?" below.

However, NQS has the ability to notify you of a job's status via email. When you submit your job with qsub, include the -me switch to have NQS mail you when your job ends execution. The complete list of mailing options for qsub is:

         -mb               send mail when the request begins execution
         -me               send mail when the request ends execution
         -mr               send mail when the request is restarted
         -mt               send mail when the request is transferred to the
                           execution machine
         -mu <username>    send mail for the request to the stated user

You can specify as many of these as you like. Mail will be sent to you from user nqs (Joe NQS) to your account on porsche. To have your mail sent to your eecis research account, either use the "-mu" switch in addition to other mailing switches, or set up a .forward file in your home directory.
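For example, to be mailed both when the job begins and when it ends, with the mail directed to another user or address (substitute the appropriate value for the placeholder):

         qsub -q medium2 -mb -me -mu <username> myscript.sh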


I just ran a job on multiple machines, but there's only one output file in my directory. Why?

By default, NQS is set up to provide you with two files that contain the output from stdout and stderr. The problem is that each job has one unique id number, regardless of how many machines it gets distributed to. Hence, when each instance of the job completes, the same two files are written, so the output files in your directory will contain the output from the most recently finished instance. While this is not a problem for one-machine jobs, it certainly is if you want output from a job run on two or more machines.

There is no way in NQS itself to prevent this from happening; it is up to you to control your program's output. It is important to remember that the same copy of the job gets distributed to multiple machines, so if you simply open a fixed file (such as "outputfile"), you risk NFS file-locking problems or having the file overwritten by another copy of the same program running on another machine.

The workaround is fairly simple. What you need is something that is found on every machine but unique to it, such as its hostname. A simple solution is to redirect output to a file with the hostname appended to it. Below is a sample script that would be submitted to qsub:

         #!/bin/sh

         # Send both stdout and stderr to a file named after this machine
         /usa/username/myprogram > outputfile.`hostname` 2>&1

If this ran on ferrari, jaguar, and corvette, you would have three files named:

         outputfile.ferrari.cis.udel.edu
         outputfile.jaguar.cis.udel.edu
         outputfile.corvette.cis.udel.edu

which contain the output from the runs of myprogram on those machines respectively.


Listing of all available queues

This listing is directly from qstat -x.

 short1@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=63  lim=40
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
              short@viper, short@lamborghini, short@maserati, short@cobra};

 short2@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=27  lim=40
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
              short@viper, short@lamborghini, short@maserati, short@cobra};

 short4@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=13  lim=40
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
              short@viper, short@lamborghini, short@maserati, short@cobra};

 short8@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=5  lim=40
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {short@ferrari, short@jaguar, short@corvette, short@lotus,
              short@viper, short@lamborghini, short@maserati, short@cobra};

 medium1@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=51  lim=30
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {medium@ferrari, medium@jaguar, medium@corvette, medium@lotus,
              medium@viper, medium@lamborghini, medium@maserati, medium@cobra};

 medium2@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=23  lim=30
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {medium@ferrari, medium@jaguar, medium@corvette, medium@lotus,
              medium@viper, medium@lamborghini, medium@maserati, medium@cobra};

 medium4@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=11  lim=30
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {medium@ferrari, medium@jaguar, medium@corvette, medium@lotus,
              medium@viper, medium@lamborghini, medium@maserati, medium@cobra};

 medium8@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=3  lim=30
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {medium@ferrari, medium@jaguar, medium@corvette, medium@lotus,
              medium@viper, medium@lamborghini, medium@maserati, medium@cobra};

 long1@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=41  lim=20
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {long@ferrari, long@jaguar, long@corvette, long@lotus, long@viper,
              long@lamborghini, long@maserati, long@cobra};

 long2@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=18  lim=20
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {long@ferrari, long@jaguar, long@corvette, long@lotus, long@viper,
              long@lamborghini, long@maserati, long@cobra};

 long4@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=9  lim=20
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {long@ferrari, long@jaguar, long@corvette, long@lotus, long@viper,
              long@lamborghini, long@maserati, long@cobra};

 long8@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=1  lim=20
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {long@ferrari, long@jaguar, long@corvette, long@lotus, long@viper,
              long@lamborghini, long@maserati, long@cobra};

 extra_long1@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=31  lim=10
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {extra_long@ferrari, extra_long@jaguar, extra_long@corvette,
              extra_long@lotus, extra_long@viper, extra_long@lamborghini,
              extra_long@maserati, extra_long@cobra};

 extra_long2@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=15  lim=10
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {extra_long@ferrari, extra_long@jaguar, extra_long@corvette,
              extra_long@lotus, extra_long@viper, extra_long@lamborghini,
              extra_long@maserati, extra_long@cobra};

 extra_long4@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=7  lim=10
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {extra_long@ferrari, extra_long@jaguar, extra_long@corvette,
              extra_long@lotus, extra_long@viper, extra_long@lamborghini,
              extra_long@maserati, extra_long@cobra};

 extra_long8@porsche.cis.udel.edu;  type=PIPE;  [ENABLED, INACTIVE];  pri=0  lim=10
   0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
   Destset = {extra_long@ferrari, extra_long@jaguar, extra_long@corvette,
              extra_long@lotus, extra_long@viper, extra_long@lamborghini,
              extra_long@maserati, extra_long@cobra};

David M. Carver, carver AT ee.udel.edu

