Sciblade - Queuing System
For efficient use of the cluster, two monitoring and job management packages, PBS/Torque and Maui, have been installed.
After logging in to the cluster, the user is on the master node. Any program started there runs immediately on the master node itself. This "interactive mode" is convenient for simple commands such as ls and vi, or for editing and compiling a program. Long computing jobs, however, should be submitted through the queuing system. A submitted job waits in a queue for its turn and is then sent to one or more compute nodes, to which it has dedicated access until it finishes. As a result, the job runs faster and the cluster is used more efficiently.
Basic Commands
Some basic commands that every cluster user should know before running jobs on this system:
Command | Description |
qsub | To submit a job to the queuing system |
qdel | To delete a job that has been submitted to the queuing system |
qstat / showq | List all information about queues and jobs |
Sample PBS job scripts
PBS job script for Parallel MVAPICH1
PBS job script for Parallel MVAPICH2
PBS job script for Parallel OPENMPI
PBS job script for Parallel NAMD2
PBS job script for Serial job
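As a rough illustration of what a serial job script contains, a minimal sketch might look like the following. The job name, the resource values and the placeholder payload are assumptions made for this example, not this cluster's recommended settings. The `#PBS` directive syntax and the `PBS_O_WORKDIR` and `PBS_JOBID` variables are standard Torque features.

```shell
#!/bin/bash
# Minimal serial job sketch. All values below are illustrative assumptions.
# Job name (hypothetical):
#PBS -N serial_test
# One processor on one node:
#PBS -l nodes=1:ppn=1
# Wall-clock limit (assumed value):
#PBS -l walltime=00:10:00
# Merge standard error into standard output:
#PBS -j oe

# Torque sets PBS_O_WORKDIR to the directory qsub was run from.
# Fall back to the current directory so the script also runs outside PBS.
WORKDIR="${PBS_O_WORKDIR:-$PWD}"
cd "$WORKDIR" || exit 1

echo "Job ${PBS_JOBID:-unknown} started on $(uname -n) at $(date)"
# Replace the next line with the real program, e.g. ./my_program (hypothetical).
echo "payload would run here"
echo "Job finished at $(date)"
```

The script changes to the submission directory first because a job otherwise starts in the user's home directory on the compute node.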
Submit Your Jobs
Submit your batch job from the frontend with the command
$ qsub [job_script]
The job is assigned a job_name and a job_id, which can be used with various commands.
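Because qsub prints the identifier of the new job on standard output, the identifier can be captured for later use with qstat or qdel. The sketch below assumes identifiers of the form `281.sciblade`, as seen elsewhere on this page. The helper names `job_number` and `submit_job` and the file name `myjob.pbs` are hypothetical.

```shell
#!/bin/bash
# Strip the server suffix from a job identifier: "281.sciblade" -> "281".
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Submit a job script and print the numeric job id.
# Assumes qsub prints the full job identifier on standard output.
submit_job() {
    local full_id
    full_id=$(qsub "$1") || return 1
    job_number "$full_id"
}

# Example (only meaningful on the cluster; myjob.pbs is a hypothetical file):
#   id=$(submit_job myjob.pbs) && echo "Submitted job $id"
```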
Monitor Your Jobs
To check the progress of running jobs, use the commands showq (Maui) and qstat (Torque).
Both commands summarize the status of submitted jobs and queues, but they give slightly different types of information.
qstat shows a list of all running and waiting jobs in the queue, sorted by job identifier.
[user_y@sciblade myjob]$ qstat
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
256.sciblade J01_16 09432411 22:58:50 R default
258.sciblade gau-16_4 user_x 00:23:11 R default
272.sciblade cpi_test user_x 0 Q default
281.sciblade q22p128 user_y 00:00:00 R default
Here, the submitted job 281 is running (state R), while job 272, owned by user_x, is waiting in the queue (state Q).
To get more detailed information, use qstat -a or qstat -f [job id].
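When many jobs are listed, a per-state summary can be easier to read than the full table. The following sketch counts jobs by state. It assumes the default column layout of the qstat listing above, where the state is the second-to-last field and job lines begin with a numeric job id. The function name `count_states` is hypothetical.

```shell
#!/bin/bash
# Count jobs per state from default `qstat` output read on standard input.
# Header and separator lines are skipped because they do not start
# with "<number>.".
count_states() {
    awk '$1 ~ /^[0-9]+\./ { n[$(NF-1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# On the cluster:  qstat | count_states
```

For the listing above this prints `Q 1` followed by `R 3`.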
showq sorts the jobs into three categories: running, idle and blocked. Idle jobs will start when processors become available. Blocked jobs will become idle when the queue system rules allow it (e.g. when the user is no longer using the maximum allowed number of processors).
[user_y@sciblade myjob]$ showq
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
281 user_y Running 128 1:58:47:47 Thu Oct 8 15:01:11
258 user_x Running 64 3:36:44:45 Wed Oct 7 15:50:16
256 09432411 Running 16 11:01:01:26 Thu Sep 24 00:30:57
5 Active Jobs 208 of 2048 Processors Active (10.15%)
13 of 256 Nodes Active (5.08%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
272 user_x Idle 480 7:15:00:00 Wed Sep 30 17:22:21
1 Idle Jobs
BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
Total Jobs: 4 Active Jobs: 3 Idle Jobs: 1 Blocked Jobs: 0
Please note that it sometimes takes a minute for a submitted job to show up under showq.
Another difference is that qstat shows time used for running jobs, while showq displays time left until the job will be killed by the queue system. When a job has finished it will no longer appear in the qstat or showq output.
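Since a finished job disappears from the qstat listing, a script can wait for a job by polling that listing. The sketch below is one possible way to do so. The function names and the 60-second default interval are assumptions for this example. On a server configured to keep completed jobs visible for a while, the loop would end correspondingly later.

```shell
#!/bin/bash
# Return success while the given job still appears in `qstat` output.
# Only the numeric part of the id is compared, so "281" and
# "281.sciblade" both work.
job_listed() {
    local num="${1%%.*}"
    qstat 2>/dev/null | awk -v num="$num" '
        { split($1, part, ".") }
        part[1] == num { found = 1 }
        END { exit !found }'
}

# Poll until the job is no longer listed.
# Usage: wait_for_job JOBID [INTERVAL_SECONDS]
wait_for_job() {
    local job_id="$1" interval="${2:-60}"
    while job_listed "$job_id"; do
        sleep "$interval"
    done
    echo "Job $job_id is no longer listed"
}

# On the cluster:  wait_for_job 281.sciblade
```

A long polling interval keeps the load on the queue server low. Polling every few seconds is rarely necessary for jobs that run for hours.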
In addition, the web-based cluster monitor Ganglia (available at http://clustername/ganglia) is a very helpful tool for monitoring the load and status of the compute nodes.
To delete a running job, use
$ qdel [jobid]
Frequently Used PBS Commands
PBS supplies a command-line interface, which is used to submit, monitor, modify, and delete jobs. The following are some frequently used PBS user commands and their functions:
Command | Description |
qsub | Submit a job |
qstat | List all information about queues and jobs |
qdel | Delete a job |
qhold | Hold a batch job to keep it from being scheduled for running |
qmove | Move a job to a different queue or server |
qmsg | Append a message to the output of an executing job |
qrerun | Terminate an executing job and return it to a queue |
qrls | Release a held job |
qsig | Send a signal to an executing job |
Frequently Used qsub Options
Option | Action |
qsub -l list | Set job resource list |
qsub -N jobname | Set job name to jobname |
qsub -q dest | Submit to queue dest |
Resources requested on the command line take precedence over the directive lines in the script file.
For example, if a job is submitted with the command qsub -l nodes=2:ppn=4 [jobscript],
it will run on 2 compute nodes with 4 processors each, regardless of what is stated in the script file.
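The total number of processors a request reserves is the product of the two values in the resource list. The following sketch computes that total for the simple `nodes=N:ppn=M` form only. The function name `total_procs` is hypothetical, and treating a missing `nodes` or `ppn` value as 1 is an assumption of this example.

```shell
#!/bin/bash
# Compute the total processor count requested by a "nodes=N:ppn=M" spec.
# Only this simple form is handled; a missing value is treated as 1.
total_procs() {
    local spec="$1" nodes ppn
    nodes=$(printf '%s\n' "$spec" | sed -n 's/.*nodes=\([0-9][0-9]*\).*/\1/p')
    ppn=$(printf '%s\n' "$spec" | sed -n 's/.*ppn=\([0-9][0-9]*\).*/\1/p')
    echo $(( ${nodes:-1} * ${ppn:-1} ))
}

total_procs "nodes=2:ppn=4"   # prints 8
```

Checking the total before submitting helps avoid requesting more processors than the per-user limit allows, which would leave the job blocked.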
Frequently Used qstat Options
Option | Action |
qstat -a | List all jobs |
qstat -q | List all queues on the system |
qstat -n | List the nodes allocated to each job |
qstat -u userid | List all jobs owned by user userid |
qstat -r | List all running jobs |
qstat -f jobid | List all information known about the specified job (jobid) |
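The full listing from qstat -f consists of `attribute = value` lines, which makes it straightforward to pull out a single field in a script. The sketch below extracts the job state. It assumes the usual indented `job_state = R` line, and the function name `job_state` is hypothetical.

```shell
#!/bin/bash
# Print the job_state value from `qstat -f` output read on standard input.
# Assumes attribute lines of the form "    job_state = R".
job_state() {
    awk -F' = ' '$1 ~ /^[[:space:]]*job_state$/ { print $2 }'
}

# On the cluster:  qstat -f 281 | job_state
```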