We were able to run calculations for a few days, but we are now encountering an internal error in mpi.F.
The detailed error is given below for your reference.
Local host: scanmatdgx1
--------------------------------------------------------------------------
running 1 mpi-ranks, on 1 nodes
distrk: each k-point on 1 cores, 1 groups
distr: one band on 1 cores, 1 groups
OpenACC runtime initialized ... 1 GPUs detected
 -----------------------------------------------------------------------------
|                     _     ____    _    _    _____     _                     |
|                    | |   |  _ \  | |  | |  / ____|   | |                    |
|                    | |   | |_) | | |  | | | |  __    | |                    |
|                    |_|   |  _ <  | |  | | | | |_ |   |_|                    |
|                     _    | |_) | | |__| | | |__| |    _                     |
|                    (_)   |____/   \____/   \_____|   (_)                    |
|                                                                             |
|     internal error in: mpi.F at line: 898                                   |
|                                                                             |
|     M_init_nccl: Error in ncclCommInitRank                                  |
|                                                                             |
|     If you are not a developer, you should not encounter this problem.      |
|     Please submit a bug report.                                             |
|                                                                             |
 -----------------------------------------------------------------------------
Warning: ieee_inexact is signaling
1
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
Please help us resolve this issue.
Thanks in advance.
SCANMAT.