ML Restart Initialisation stuck when more than one process per node
Posted: Mon Nov 14, 2022 9:20 pm
Hi,
The issue occurs when I restart an ML training run with a new species.
That is, I use ML_ISTART = 1 and my ML_AB contains Zr and Cu, but the new POSCAR that I'm starting to train on contains Zr, Cu and Al. In that case, with the OpenMP version,
setting
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
the initialisation of the machine learning seems to get stuck. This does not happen with
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
That is, it gets stuck only if I request more than one process per node.
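For reference, the failing case corresponds roughly to a job script like the sketch below. This is not the attached script: the executable name `vasp_std`, the `--cpus-per-task` value, and the way `OMP_NUM_THREADS` is set are assumptions for illustration only.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2   # 2 MPI ranks: initialisation hangs; with 1 it proceeds
#SBATCH --cpus-per-task=4     # assumed value: OpenMP threads per MPI rank

# Assumption: one OpenMP thread per allocated CPU, falling back to 1 outside SLURM.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

# Assumption: the binary is called vasp_std and is launched through srun.
# The guard only skips the launch when srun is not available.
if command -v srun >/dev/null 2>&1; then
    srun vasp_std
fi
```

Changing only `--ntasks-per-node` between 1 and 2 is what switches between the working and the stuck behaviour.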
I'm training on liquid structures, so I have a high-ish ML_MB and ML_MCONF. I also have ML_LBAND_DISCARD = True. However, the issue occurs irrespective of these settings. My old ML_AB does contain a large number of structures, though.
I have attached the ML_AB, INCAR and an example job submission script here. I would appreciate any suggestions on the INCAR, especially on whether ML_MB = 6000 is too diabolical even for a liquid.
Thank you