Well, let me explain the situation first.
Situation: I have a Beowulf cluster with two compute nodes, say node-1 [192.168.2.10] and node-2 [192.168.2.11]. Among them, node-1 acts as the NFS server, providing a common home directory for both compute nodes. The NIS server is also on node-1, allowing uniform login and authentication on every compute node.
Both compute nodes are connected to each other through a Gigabit Ethernet switch. This is how my Beowulf cluster is organized. I have installed OpenMPI-1.4.3 on both nodes at /usr/local. The environment variables are properly set in the .bashrc file of the common login ID of the cluster, for runtime linking and loading of libraries.
The compilation of HPL is done with the following libraries and software:
5) CentOS 5.2
To run MPI programs I designed a hostfile [hostfile.txt] as follows, where both systems [node-1, node-2] are dual-socket quad-core machines.
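For reference, a hostfile for two such 8-core nodes would typically look like the sketch below (the slot counts are assumed from the dual-socket quad-core hardware described above; the hostnames must resolve to the nodes' addresses):

```
# hostfile.txt - one line per node, slots = cores available on that node
node-1 slots=8
node-2 slots=8
```

With 8 slots per node, -np 16 fills both machines exactly.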
For simple testing purposes, when I run a simple hello-world program using the following command
mpiexec -np 16 -hostfile hostfile.txt ./helloWorldMPI
it runs perfectly. But when I run HPL with the following command
mpiexec -np 16 -hostfile hostfile.txt ./xhpl
it simply stalls. That is, I can see 34% to 40% usage on each core of each system [using the top command], but it continues that way and no output ever comes on screen, even for a small problem size, say N=200. It runs forever until I feel like killing the job.
Solution: As usual, I went through a lot of material and forums related to OpenMPI and HPL. Some of them are listed below. One of them talks about HPL_NO_MPI_DATATYPE [see the thread].
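For anyone who wants to try the same thing: HPL_NO_MPI_DATATYPE is a compile-time option, set in HPL's per-architecture makefile and followed by a rebuild. A sketch, assuming the stock example arch name (yours will differ):

```
# In hpl/Make.Linux_PII_CBLAS (use your own Make.<arch> file):
HPL_OPTS = -DHPL_NO_MPI_DATATYPE

# then rebuild xhpl from the HPL top directory:
#   make arch=Linux_PII_CBLAS clean_arch_all
#   make arch=Linux_PII_CBLAS
```

This makes HPL pack its messages itself instead of using MPI derived datatypes.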
I tried this, with no help. After a lot of trial and error, I came to the conclusion that, since each compute node of my cluster has multiple Ethernet ports [eth0, eth1, eth2], OpenMPI gets confused during the HPL MPI communication [send/receive] about which port to use for packet transfer. This understanding took me to the Mailing List Archives, which talk about the MCA flag "btl_tcp_if_include eth0". So I decided to give it a try, and surprisingly this solved my problem. The final command is as follows:
mpiexec -mca btl_tcp_if_include eth0 -np 16 -hostfile hostfile.txt ./xhpl
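OpenMPI also offers the opposite form, btl_tcp_if_exclude, which names the interfaces to skip rather than the one to use. A sketch of the equivalent command for my setup (I have not tested this variant myself):

```
# exclude loopback and the two unused ports instead of naming the good one
mpiexec -mca btl_tcp_if_exclude lo,eth1,eth2 -np 16 -hostfile hostfile.txt ./xhpl
```

The include form is simpler when you know exactly which interface carries the cluster traffic.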
Ahh... what a relief... thank God... after such a long, irritating round of trial and error, I am relaxed now :D. As I found almost no mention of this issue on the web, I decided to write it down. I hope this will help others too...