local QP operation err
Posted: Mon Dec 09, 2013 4:36 pm
Hello everyone,
I have run into a problem while running DFPT calculations with VASP (IBRION=8). It turns out that some of my jobs crash with the following error:
mlx4: local QP operation err (QPN 003fb8, WQE index dce90000, vendor syndrome 77, opcode = 5e)
I also get this message from mpirun:
--------------------------------------------------------------------------
mpirun has exited due to process rank 20 with PID 27788 on
node tiger-r2c2n9 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
The crash happens at random points during the run, so it is very hard to predict. It also seems to depend on how many nodes and processors I use, and it appears more likely when I run on a larger number of nodes.
The VASP binary I use (version 5.2.11) was compiled against openmpi-1.6.3, and the code is launched with the following command line:
$MPIRUN -np `wc -l \$PBS_NODEFILE | awk '{print \$1}'` --mca btl ^tcp --bind-to-socket $VASP > logfile
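For context, this is roughly how that line sits in the PBS job script; the scheduler directives, module/path names, and node counts below are only placeholders, not my exact setup:

#!/bin/bash
#PBS -l nodes=4:ppn=16          # placeholder node/core request
#PBS -l walltime=24:00:00
#PBS -j oe

cd $PBS_O_WORKDIR

# placeholder paths for the OpenMPI 1.6.3 launcher and the VASP 5.2.11 binary
MPIRUN=/opt/openmpi-1.6.3/bin/mpirun
VASP=$HOME/bin/vasp.5.2.11

# one MPI rank per nodefile entry, TCP BTL disabled (so the InfiniBand/openib
# BTL is used), and each rank bound to a socket, exactly as in the line above
$MPIRUN -np `wc -l \$PBS_NODEFILE | awk '{print \$1}'` --mca btl ^tcp --bind-to-socket $VASP > logfile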
Has anyone else seen this problem? Thank you very much for your help!
Sincerely yours,
Kuang