VASP run error under mpich2

Problems running VASP: crashes, internal errors, "wrong" results.


Moderators: Global Moderator, Moderator

Gu Chenjie
Newbie
Posts: 18
Joined: Thu Nov 25, 2010 5:41 am

VASP run error under mpich2

#1 Post by Gu Chenjie » Thu Jan 06, 2011 8:04 am

Hi all. Here I try to explain my problem clearly.
I have two nodes, each with 12 cores, and the two nodes are connected by a gigabit switch.
Now I am testing the VASP examples on these nodes.
First, I booted mpd on a single node, and all the examples run well on each node by itself.
Then I tried to run the examples on both nodes using 24 cores; after booting mpd on the two nodes, I got the following error.

Code: Select all

Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(261)..................: MPI_Waitall(count=46, req_array=0x7fffeeca46a0, status_array=0x7fffeeca4760) failed
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1709).......: Communication error
rank 23 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 23: killed by signal 9 
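
For reference, I start the mpd ring and launch the job roughly as follows (just a sketch: node1 and the path to the vasp binary are placeholders; node0 is the head node that appears in the error above):

Code: Select all

# mpd.hosts lists the other node (node1 is a placeholder name)
echo node1 > mpd.hosts
# start one mpd here and one on node1, then check that both joined the ring
mpdboot -n 2 -f mpd.hosts
mpdtrace
# run across both nodes, 12 cores each (the vasp path is a placeholder)
mpiexec -n 24 /path/to/vasp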
I am now wondering whether this error comes from the low speed of the switch or from a memory limitation, although I have set the stack and memory limits to unlimited.
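
Concretely, by "unlimited" I mean something like the following in the launching shell (bash syntax; just a sketch of my settings):

Code: Select all

# raise per-process limits before starting the job
ulimit -s unlimited   # stack size
ulimit -v unlimited   # virtual memory (address space)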
Thanks for your attention.
Last edited by Gu Chenjie on Thu Jan 06, 2011 8:04 am, edited 1 time in total.

admin
Administrator
Posts: 2921
Joined: Tue Aug 03, 2004 8:18 am
License Nr.: 458

VASP run error under mpich2

#2 Post by admin » Mon Jan 31, 2011 12:58 pm

Sorry, but this error is not related to VASP itself; it is caused by an error in your MPI setup.
Last edited by admin on Mon Jan 31, 2011 12:58 pm, edited 1 time in total.

Gu Chenjie
Newbie
Posts: 18
Joined: Thu Nov 25, 2010 5:41 am

VASP run error under mpich2

#3 Post by Gu Chenjie » Sun Feb 06, 2011 6:54 am

Hi Sir, yes, today I found where the problem is.
As I mentioned, I have two nodes. If a job runs on only one node, there is no problem. However, if the job is assigned to run across both nodes, the problem appears no matter how many cores I use, as long as there is data exchange between the two nodes; most importantly, whether it appears depends on the size of the supercell.
Let's take handson1 (1_1_O_atom) as an example; the original POSCAR is as follows:

Code: Select all

O atom in a box
 1.0          ! universal scaling parameters
 8.0 0.0 0.0  ! lattice vector  a(1)
 0.0 8.0 0.0  ! lattice vector  a(2)
 0.0 0.0 8.0  ! lattice vector  a(3)
1             ! number of atoms
cart          ! positions in cartesian coordinates
 0 0 0

This job cannot run if the two nodes are used at the same time. However, if I change the size of the supercell to 4x4x4 (the universal scaling factor multiplies all lattice vectors, so 0.5 x 8.0 Å = 4.0 Å), the new POSCAR is as follows:

Code: Select all

O atom in a box
 0.5          ! universal scaling parameters
 8.0 0.0 0.0  ! lattice vector  a(1)
 0.0 8.0 0.0  ! lattice vector  a(2)
 0.0 0.0 8.0  ! lattice vector  a(3)
1             ! number of atoms
cart          ! positions in cartesian coordinates
 0 0 0

Now it runs very well.
So now I am thinking this should be a problem with my compilation, maybe in the FFT library, or a hardware limitation such as the CPU stack or memory limit.
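
In the meantime, one thing I plan to check (a sketch, assuming mpiexec can also launch plain shell commands across the mpd ring) is whether the unlimited limits are actually in effect on both nodes, since mpd starts the remote processes from non-interactive shells:

Code: Select all

# print the host name and the effective limits; with the 2-node ring,
# mpd should place one of the two processes on each node, and both
# should report "unlimited" for the stack
mpiexec -n 2 bash -c 'hostname; ulimit -s; ulimit -v'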
I hope you can give me some suggestions. For reference, my hardware is:

Code: Select all

HP BL460c:
CPU: 2 x X5670 per node
Memory: 24 GB per node
NIC: 2 x 10G
using a Beowulf cluster structure

Thanks a lot.
Last edited by Gu Chenjie on Sun Feb 06, 2011 6:54 am, edited 1 time in total.
