same input but different result/convergence on different nodes
Moderators: Global Moderator, Moderator
-
- Newbie
- Posts: 12
- Joined: Thu May 25, 2023 6:24 pm
same input but different result/convergence on different nodes
I repeated one calculation several times on a 48-core node and a 64-core node. The inputs are exactly the same; the only difference is the node the job was submitted to. It converged in only two cases: with my usual NPAR=4 on 48 cores, and with the default NPAR on the 64-core node, though that run needed more steps. I use NPAR=4 in my usual calculations, which gives NBANDS=56 in OUTCAR on both nodes. I also tried commenting out NPAR to use the default value, and those runs do not both converge.
I'm not sure why this happens. I've searched this forum, but it doesn't look like a problem related to NBANDS, and I didn't find a solution. I've attached the 4 outputs here.
-
- Global Moderator
- Posts: 74
- Joined: Fri Aug 04, 2023 11:07 am
Re: same input but different result/convergence on different nodes
Dear yujia_teng,
It does indeed look like a few of the calculations take more than a few steps to converge (using ALGO=All, see wiki/index.php/ALGO, might help here). It is difficult to say anything specific without the input and output files (see here: wiki/index.php/Minimal_reproducible_exa ... 20possible.)
Sudarshan
-
- Newbie
- Posts: 12
- Joined: Thu May 25, 2023 6:24 pm
Re: same input but different result/convergence on different nodes
Dear admin,
I've attached the input file below. Only NPAR and the number of nodes differ between those 4 calculations.
-
- Global Moderator
- Posts: 74
- Joined: Fri Aug 04, 2023 11:07 am
Re: same input but different result/convergence on different nodes
Could you please send all inputs, i.e. all INCAR with the corresponding NPAR and number of nodes and all OUTCAR files in accordance with the forum guidelines (forum/viewtopic.php?t=17928)? Thanks!
Sudarshan
-
- Newbie
- Posts: 12
- Joined: Thu May 25, 2023 6:24 pm
Re: same input but different result/convergence on different nodes
Dear admin,
I've attached the file below as required. All of them are on one node.
-
- Global Moderator
- Posts: 74
- Joined: Fri Aug 04, 2023 11:07 am
Re: same input but different result/convergence on different nodes
Dear yujia_teng,
I am not sure I understand the issue. With 48 and 64 cores and NBANDS fixed to 56, it looks like you converge to the same energy. All calculations that you sent me seem to reach the required EDIFF. Is there a specific error you are facing beyond this?
Sudarshan
-
- Newbie
- Posts: 12
- Joined: Thu May 25, 2023 6:24 pm
Re: same input but different result/convergence on different nodes
Dear admin,
Only 2 of the 4 converge. With 48 and 64 cores and the same NBANDS (I didn't fix it), the 48-core run converges while the 64-core run doesn't. EDIFF is 1E-08. The 64-core run only reaches 1E-06 at the maximum electronic step. So the issue is that a different number of cores gives a different convergence result: one run converges and the other does not.
The same happens with the default NPAR: the 64-core run converges while the 48-core run does not.
-
- Global Moderator
- Posts: 74
- Joined: Fri Aug 04, 2023 11:07 am
Re: same input but different result/convergence on different nodes
I am still not sure I understand. For the files that you attached it looks like all calculations reached electronic convergence and EDIFF = 0.1E-07 (grep for EDIFF on all OUTCARs). Are there some other files you are referring to?
Sudarshan
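The "grep for EDIFF on all OUTCARs" check mentioned above can be sketched in Python as follows. This is a minimal sketch, not part of the original posts: the helper names and the run directory names are hypothetical. Note that the match is a plain substring test, so a line containing EDIFFG would also be reported, and a matching line shows whatever OUTCAR prints for that tag, which by itself may not tell you whether the final electronic step met the criterion.

```python
from pathlib import Path


def find_ediff_lines(outcar_text: str) -> list[str]:
    """Return every line of an OUTCAR-like text that mentions EDIFF."""
    return [line.strip() for line in outcar_text.splitlines() if "EDIFF" in line]


def report(run_dirs: list[str]) -> None:
    """Print the EDIFF lines found in <run_dir>/OUTCAR for each directory."""
    for run_dir in run_dirs:
        outcar = Path(run_dir) / "OUTCAR"
        if not outcar.is_file():
            print(f"{run_dir}: no OUTCAR found")
            continue
        for line in find_ediff_lines(outcar.read_text(errors="replace")):
            print(f"{run_dir}: {line}")


if __name__ == "__main__":
    # Hypothetical directory names, one per run being compared.
    report(["48cores_npar4", "64cores_npar4", "48cores_default", "64cores_default"])
```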
-
- Newbie
- Posts: 12
- Joined: Thu May 25, 2023 6:24 pm
Re: same input but different result/convergence on different nodes
Dear admin,
Only 2 of the 4 calculations converge... Look at the output.log file rather than the other files, since that is where the problem is visible. Only two of them reached 1E-08 at the end. The other two only reach 1E-06. They couldn't converge within the default number of electronic steps.
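The check described above (did the last electronic step in output.log get below EDIFF?) can be sketched as a small script. This is only a sketch under stated assumptions: it assumes each electronic step is printed as a line starting with a tag such as DAV: or RMM:, followed by the step number N, the energy E, the energy change dE, and the eigenvalue change d eps, and it assumes convergence means both changes are smaller than EDIFF in magnitude. Other tags (for example with a different ALGO) are not handled, and the function names are hypothetical. For a run with several ionic steps, only the very last electronic step in the file is examined.

```python
import re

# Assumed layout of an electronic-step line in the standard output:
#   DAV:  12   -0.123456789012E+03   -0.12345E-05   -0.23456E-06   1234   0.123E-03
# i.e. tag, step number N, energy E, dE, d eps, then further columns.
STEP_LINE = re.compile(r"^\s*(DAV|RMM)\s*:\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)")


def last_electronic_step(log_text: str):
    """Return (step, dE, d_eps) for the last parsable step line, or None."""
    last = None
    for line in log_text.splitlines():
        match = STEP_LINE.match(line)
        if not match:
            continue
        try:
            last = (int(match.group(2)), float(match.group(4)), float(match.group(5)))
        except ValueError:
            # Skip lines whose numeric fields cannot be parsed.
            continue
    return last


def reached_ediff(log_text: str, ediff: float) -> bool:
    """True if the last step's |dE| and |d eps| are both below ediff."""
    last = last_electronic_step(log_text)
    if last is None:
        return False
    _, d_energy, d_eps = last
    return abs(d_energy) < ediff and abs(d_eps) < ediff
```

Calling `reached_ediff(open("output.log").read(), 1e-8)` for each of the four runs would then flag the two runs that stop around 1E-06.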
-
- Global Moderator
- Posts: 74
- Joined: Fri Aug 04, 2023 11:07 am
Re: same input but different result/convergence on different nodes
Yes indeed, I see it now. There are a couple of things to try here: (1) Does the calculation end up converging with an increased NELM? (2) If not, since it is very near convergence anyway, have you tried another ALGO (say, ALGO=All)?
Sudarshan
-
- Newbie
- Posts: 12
- Joined: Thu May 25, 2023 6:24 pm
Re: same input but different result/convergence on different nodes
Dear admin,
I didn't try that. I believe the unconverged runs can reach convergence with an increased NELM, tuned mixing parameters, or ALGO = All.
But the main point here is: with exactly the same 4 input files and only the number of cores different, why is the convergence behavior different? Is it because the way parallelization is implemented in VASP affects this?
-
- Global Moderator
- Posts: 74
- Joined: Fri Aug 04, 2023 11:07 am
Re: same input but different result/convergence on different nodes
Yes, in this context it only makes sense to me to compare converged calculations. Recall that NBANDS is altered due to the parallelization settings (wiki/index.php/NBANDS) and so comparison for a converged (both in terms of parameters and electronically) calculation is required.
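To illustrate how parallelization settings can alter NBANDS, here is a minimal sketch. It assumes the rule is that NBANDS is rounded up to the next multiple of the number of band groups (NPAR), and that the default NPAR equals the number of cores; both are assumptions to be checked against the NBANDS wiki page. The requested value of 54 is purely hypothetical and chosen only so that NPAR=4 yields the NBANDS=56 reported in this thread; the function name is hypothetical as well.

```python
def rounded_nbands(requested_nbands: int, band_groups: int) -> int:
    """Round requested_nbands up to the next multiple of band_groups."""
    if requested_nbands <= 0 or band_groups <= 0:
        raise ValueError("both arguments must be positive integers")
    # Ceiling division using integer arithmetic only.
    return -(-requested_nbands // band_groups) * band_groups


if __name__ == "__main__":
    requested = 54  # hypothetical value, not taken from the thread
    for label, groups in [("NPAR=4", 4), ("default NPAR, 48 cores", 48), ("default NPAR, 64 cores", 64)]:
        print(f"{label}: NBANDS = {rounded_nbands(requested, groups)}")
```

Under these assumptions, the two NPAR=4 runs share the same NBANDS, while the two default-NPAR runs would end up with different NBANDS, so those two are not the same calculation to begin with.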
-
- Newbie
- Posts: 12
- Joined: Thu May 25, 2023 6:24 pm
Re: same input but different result/convergence on different nodes
Dear admin,
I'm still confused here. Why does it only make sense to compare converged calculations? Just look at the 2 calculations with the same NBANDS, which is 56 (I didn't set that; it's the default value generated by the code). They have exactly the same input and the same NBANDS. The only difference is the number of cores used: one uses 48 and the other 64. But their output is different: one converges and the other does not. Why could this happen? From the VASP wiki, it looks like the same NBANDS should give the same result, but that is not the case here.