Last updated: July 1, 2021
 

FAQ

1. How do I log in to the cluster?
2. When I run my MPI program, I get "semget failed for setnum = 0", and I cannot run any MPI program afterwards.
3. Could I specify which process runs on which machine?
4. Allgatherv & reduce_scatter performance?
5. Is there a way I can test to see if lamboot has been run before proceeding with my program?


1. How do I log in to the cluster?

There are two ways to log in to the cluster:

  1. SSH Login (terminal login)
    • Use your favourite SSH client software, such as PuTTY or SSH Secure Shell on Windows, or OpenSSH on Linux/UNIX.
      e.g. on all SCI workstations (sc11 to sc30), type

      ssh tdgrocks.sci.hkbu.edu.hk

      The password is your Novell account password (a fuller example with an explicit username is shown after this list).

    • Settings for PuTTY 0.54:

      Type "tdgrocks.sci.hkbu.edu.hk" in the Host Name text box.

      Select the "SSH" radio button under Protocol.

      In Category -> Connection -> SSH -> Auth, check both the "Attempt TIS or CryptoCard auth(SSH1)" and "Attempt "keyboard-interactive" auth(SSH2)" checkboxes.

      Then press the "Open" button to start the login session.

  2. VNC Login (graphical login)
    • Use vncviewer, which can be downloaded from http://www.uk.research.att.com/vnc/
      e.g. from the command line, type

      vncviewer vnc.sci.hkbu.edu.hk:51

      or, in Windows, run vncviewer and, when it asks for the server address, type

      vnc.sci.hkbu.edu.hk:51
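
Note on usernames: if you use SSH from a machine where your local username differs from your cluster account, you can give the username explicitly on the command line. A minimal example, assuming your cluster username is the same as your Novell account name (the username below is made up for illustration):

ssh chan_tm@tdgrocks.sci.hkbu.edu.hk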

2. When I run my MPI program, I get "semget failed for setnum = 0", and I cannot run any MPI program afterwards.

On the MPICH website, this problem is listed under Unsolved Problems. You may refer to http://www-unix.mcs.anl.gov/mpi/mpich/buglist-tbl.html for more information about it.

If you run into this kind of problem, you can use a command called "cleanipcs" to free the shared memory segments and semaphores. This command is normally not needed, but if an MPI program exits abnormally, it may be unable to free the System V IPCs that it held (this is a feature of System V IPCs). In that case, cleanipcs can be used to recover the IPCs that are no longer needed. Note that this command releases all IPCs held by the user. This is the correct behavior for most users, but may cause problems for users of other programs or systems that rely on the persistence of System V IPCs.

Usage:

cluster-fork /u1/local/mpich-1.2.5/sbin/cleanipcs
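
If you want to confirm that stale IPCs are the cause, you can first list the System V semaphores and shared memory segments owned by your account on every node. A minimal check, assuming the Rocks "cluster-fork" command (used above) and the standard "ipcs" utility are available on the compute nodes; the first line lists semaphore sets, the second lists shared memory segments:

cluster-fork "ipcs -s"
cluster-fork "ipcs -m"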

Reference:

http://www-unix.mcs.anl.gov/mpi/www/www1/cleanipcs.html

3. Could I specify which process runs on which machine?

If I want to set up 20 processes for an application on a cluster with 10 machines, could I specify which process runs on which machine? For example, processes 0-4 running on machine-0, processes 5-11 running on machine-1, and the remaining 8 processes running on the remaining 8 machines, one per machine.

The following will run the processes on certain machines, but those machines are determined by your hosts file: n0 = the first machine in the hosts file, n1 = the second machine, and so on.

    { process 0-4 } { process 5-11 } { the rest }

$ mpirun n0 n0 n0 n0 n0 n1 n1 n1 n1 n1 n1 n1 n2 n3 n4 n5 n6 n7 n8 n9 -v [-np 20] prog

If you want to run 20 processes on the first 5 nodes only, you can try:

$ mpirun n0-4 -v -np 20 prog
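
For reference, the n0, n1, ... labels follow the order of nodes in the boot schema (hosts file) passed to lamboot, assuming you are using LAM/MPI (which uses this nN notation). A hypothetical example, with the file name and node names made up for illustration:

# lamhosts - the order of entries defines n0, n1, n2, ...
comp-0-0     # n0
comp-0-1     # n1
comp-0-2     # n2
comp-0-3     # n3

$ lamboot -v lamhosts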

4. Allgatherv & reduce_scatter performance?

I wonder which of allgatherv and reduce_scatter is faster. Basically, they deal with the same amount of data, but in opposite directions. In my experience, reduce_scatter is much slower than allgatherv. I don't know how they differ at the implementation level. Can anybody help me understand the speed difference between allgatherv and reduce_scatter?

In LAM, Allgatherv is implemented using N calls to Gatherv, while Reduce_scatter is implemented as a Reduce followed by a Scatterv. You might want to look at the source code (/share/ssi/coll/lam_basic/src) for the above collective operations.

Whether reduce_scatter is slower depends on how you measure the completion time, on the message size, and on the number of processes. For a large number of processes and sufficiently large messages, allgatherv is likely to become slower than reduce_scatter because of this implementation. Furthermore, in allgatherv rank 0 finishes earlier than rank 1 and so on, whereas in reduce_scatter all processes finish at almost the same time.

5. Is there a way I can test to see if lamboot has been run before proceeding with my program?

You can use the "lamnodes" command for this purpose. It lists the nodes on which LAM has been booted (if any). Otherwise it reports that the LAM daemons are not running, which effectively means that "lamboot" has not been run.

Usage: at the command prompt, simply run "lamnodes".
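
In a job script you can use its exit status as a guard. A minimal sketch, assuming lamnodes returns a non-zero exit status when no LAM daemons are running, and that your boot schema is in a file called lamhosts (file name and process count chosen for illustration):

#!/bin/sh
# Boot LAM only if it is not already running, then start the MPI program.
if lamnodes > /dev/null 2>&1; then
    echo "LAM is already booted"
else
    lamboot -v lamhosts
fi
mpirun -np 20 prog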


