Hi,
The issue I face occurs when I restart an ML training run with a new species.
I.e., I use ML_ISTART = 1 and my ML_AB contains Zr and Cu, but the new POSCAR that I'm starting to train on has Zr, Cu and Al. In that case, in the OpenMP version,
setting
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
the initialisation of the machine learning seems to get stuck. This does not happen with
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
That is, it gets stuck only if I declare more than one process per node.
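The restart scenario described above (an existing ML_AB trained on Zr and Cu, a new POSCAR that adds Al) can be checked before submitting the job. The following is only a minimal sketch: it assumes a VASP 5 style POSCAR with the element symbols on line 6, the helper names are hypothetical, and the example structure is made up for illustration.

```python
def poscar_species(poscar_text):
    """Return the element symbols from line 6 of a VASP 5 style POSCAR."""
    lines = poscar_text.splitlines()
    return lines[5].split()


def species_missing_from_training(trained, poscar_text):
    """List species present in the POSCAR but absent from the trained set."""
    return [s for s in poscar_species(poscar_text) if s not in trained]


# Illustrative POSCAR only; lattice and positions are not from the thread.
example_poscar = """Zr-Cu-Al liquid (illustrative)
1.0
10.0 0.0 0.0
0.0 10.0 0.0
0.0 0.0 10.0
Zr Cu Al
1 1 1
Direct
0.0 0.0 0.0
0.5 0.5 0.5
0.25 0.25 0.25
"""

# Species already covered by the old ML_AB, as stated in the post.
trained_species = ["Zr", "Cu"]
print(species_missing_from_training(trained_species, example_poscar))
```

For the case in this post the check reports Al as the species that the existing training data does not cover.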
I'm training on liquid structures, so I have a high-ish ML_MB and ML_MCONF. I also have ML_LBASIS_DISCARD = True. But the issue happens irrespective of these settings. My old ML_AB does contain a large number of structures, though.
I have attached the ML_AB, INCAR and an example job submission script here. I would appreciate any suggestions on the INCAR, especially on whether ML_MB = 6000 is excessive even for a liquid.
Thank you
ML Restart Initialisation stuck when more than one process per node
Moderators: Global Moderator, Moderator
-
- Newbie
- Posts: 6
- Joined: Sun Oct 16, 2022 9:49 pm
ML Restart Initialisation stuck when more than one process per node
-
- Global Moderator
- Posts: 501
- Joined: Mon Nov 04, 2019 12:41 pm
Re: ML Restart Initialisation stuck when more than one process per node
Ok, we had a look at your input files but we need a bit more information:
Could you please share the OUTCAR and ML_LOGFILE?
Did you compile VASP with support for shared memory?
A few suggestions that you might try:
1. There is no support for OpenMP in the machine learning part of the code, so using it might not lead to a great speedup.
2. You might reduce ML_MCONF to, for example, 1500, which would significantly reduce the memory usage of your calculation.
-
- Newbie
- Posts: 6
- Joined: Sun Oct 16, 2022 9:49 pm
Re: ML Restart Initialisation stuck when more than one process per node
Hi,
Sorry for the delay. I needed some preliminary data quickly and hence kept running with a single task per node. But I think it's time to fix the issue. I have a feeling that the problem is not with the ML but with OpenMP itself, since the same issue persists without ML. I'm attaching the OUTCAR and stdout (vasp.out) for the single-task-per-node case as well as the 2-tasks-per-node test.
The attachment contains the following main files:
OUTCAR.2task - failed OUTCAR, it always gets stuck at that last line
OUTCAR.1task - I soft stopped after 3 ionic steps
The corresponding job script and stdout are named as *.2task and *.single
Then there is the makefile.include. The INCAR and POSCAR are consistent across all runs. They are the same as before, but I removed the ML tags, reduced EDIFF and increased KPAR, all for quicker runs (I also got more nodes to match KPAR).
Best
Sayan
PS. I don't think shared memory is turned on. I will make sure to ask the sys admin specifically for this on the next re-compile.
Side note: Could you quickly explain the difference between ML_MB and ML_MCONF? I don't exactly understand the difference between the items that each tag limits.
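As background to the side note: per the VASP documentation, ML_MCONF caps the number of training structures held in memory, while ML_MB caps the number of local reference configurations (basis functions) per species. A rough way to see how close an existing ML_AB is to each limit is sketched below. The section titles in the sample header are an assumption about the ML_AB layout, the helper name is hypothetical, and all counts are illustrative rather than taken from the attached files.

```python
def header_values(ml_ab_text, title):
    """Collect the whitespace-separated values listed under a header title."""
    lines = ml_ab_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == title:
            values = []
            # Skip the dashed rule under the title, then read up to the next starred rule.
            for entry in lines[i + 2:]:
                if entry.lstrip().startswith("*"):
                    break
                values.extend(entry.split())
            return values
    raise KeyError(title)


# Assumed header layout with made-up numbers.
sample_header = """ 1.0 Version
**************************************************
     The number of configurations
--------------------------------------------------
        1200
**************************************************
     The atom types in the data file
--------------------------------------------------
     Zr Cu
**************************************************
     The numbers of basis sets per atom type
--------------------------------------------------
     2500  1800
**************************************************
"""

ML_MCONF = 1500  # illustrative limit on stored training structures
ML_MB = 6000     # limit on local reference configurations, as in the first post

n_conf = int(header_values(sample_header, "The number of configurations")[0])
species = header_values(sample_header, "The atom types in the data file")
n_basis = [int(v) for v in header_values(sample_header, "The numbers of basis sets per atom type")]

print(f"structures: {n_conf} of at most {ML_MCONF}")
for name, count in zip(species, n_basis):
    print(f"{name}: {count} local reference configurations of at most {ML_MB}")
```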
-
- Newbie
- Posts: 6
- Joined: Sun Oct 16, 2022 9:49 pm
Re: ML Restart Initialisation stuck when more than one process per node
After lots of trials, I finally managed to make it work.
-
- Global Moderator
- Posts: 460
- Joined: Mon Nov 04, 2019 12:44 pm
Re: ML Restart Initialisation stuck when more than one process per node
Could you please elaborate on what the problem and the solution were?