memory issue for MLFF calculation for large ML_MB

Posted: Tue Jul 19, 2022 7:57 pm
by xiaoming_wang
Hello,

I'm doing ML_FF calculations on hybrid perovskites. Since the system contains hydrogen atoms, the basis sets for ML are very big. With the default ML_MB parameter, I quickly hit the error and the hint that ML_MB was too small. I gradually increased ML_MB from 2000 to 4000 to 7000, each time the code stopped and suggested that I increase ML_MB. Now I have increased ML_MB to 9000, and the code stops without doing any SCF loops. The error is:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f0b43aded6f in ???
#1  0x7f0b49b4974d in buff2block
        at /tmp/tmp.2R3jevzSvm/gnu8.1_x86_64_build/mp/scalapack/REDIST/SRC/pdgemr.c:679
#2  0x7f0b49b4974d in Cpdgemr2d
        at /tmp/tmp.2R3jevzSvm/gnu8.1_x86_64_build/mp/scalapack/REDIST/SRC/pdgemr.c:547
#3  0x4a2108 in ???
#4  0x56df5b in ???
#5  0x578163 in ???
#6  0x5806aa in ???
#7  0xa2786f in ???
#8  0x10afe3a in ???
#9  0x10e2d33 in ???
#10  0x7f0b43ac92bc in ???
#11  0x40a719 in ???
        at ../sysdeps/x86_64/start.S:120
#12  0xffffffffffffffff in ???
srun: error: nid005553: task 0: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=2753888.0

It seems to me that this is related to memory. I checked the ML_LOGFILE for the memory usage:

Estimated memory consumption for ML force field generation (MB):

Persistent allocations for force field        :  38414.4
|
|-- CMAT for basis                            :  16453.8
|-- FMAT for basis                            :   1974.4
|-- DESC for basis                            :   1646.5
|-- DESC product matrix                       :     53.1

Persistent allocations for ab initio data     :      9.4
|
|-- Ab initio data                            :      8.9
|-- Ab initio data (new)                      :      0.4

Temporary allocations for sparsification      :    406.5
|
|-- SVD matrices                              :    405.5

Other temporary allocations                   :    609.7
|
|-- Descriptors                               :     42.3
|-- Regression                                :    519.7
|-- Prediction                                :     47.7

Total memory consumption                      :  39439.9

So, the total memory for each task is about 39 GB. I'm running my job on 8 nodes with 8 tasks each. Each node has 512 GB of memory, so I have 64 GB available for each task. I'm wondering why there are still memory issues with this setup. Btw, the problem can be solved by increasing ML_EPS_LOW, but the accuracy as tested is not acceptable. Do you have any comments or suggestions on the parameter setup for ML_FF calculations on hybrid organic-inorganic systems, I mean, to make the calculations less challenging?
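
A rough sketch of the memory arithmetic above, in Python. The node memory, the number of tasks per node, and the ML_LOGFILE estimate are the numbers from this post; the conversion 1 GB = 1024 MB and the helper names are assumptions made for illustration. The estimate is headed "for ML force field generation", so it presumably does not include what the ab initio part of the run needs on top.

```python
# Numbers quoted in the post above.
NODE_MEMORY_GB = 512      # memory per node
TASKS_PER_NODE = 8        # MPI tasks per node
ML_ESTIMATE_MB = 39439.9  # "Total memory consumption" from ML_LOGFILE


def memory_per_task_gb(node_memory_gb, tasks_per_node):
    """Memory nominally available to each task on a node."""
    return node_memory_gb / tasks_per_node


def headroom_gb(node_memory_gb, tasks_per_node, estimate_mb, mb_per_gb=1024.0):
    """Nominal per-task memory minus the estimated ML allocation.

    mb_per_gb=1024 is an assumption about the units of the log output.
    """
    available = memory_per_task_gb(node_memory_gb, tasks_per_node)
    return available - estimate_mb / mb_per_gb


available = memory_per_task_gb(NODE_MEMORY_GB, TASKS_PER_NODE)
spare = headroom_gb(NODE_MEMORY_GB, TASKS_PER_NODE, ML_ESTIMATE_MB)
print(f"available per task: {available:.1f} GB")
print(f"nominal headroom after ML estimate: {spare:.1f} GB")
```

A positive headroom only shows that the ML estimate alone fits; it says nothing about the remaining allocations of the run.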


Best,
Xiaoming

Re: memory issue for MLFF calculation for large ML_MB

Posted: Wed Jul 20, 2022 12:24 pm
by henrique_miranda
Dear Xiaoming,

Could you provide the input (INCAR, POSCAR, POTCAR and KPOINTS) and output (CONTCAR, OUTCAR and ML_AB) files you used in your calculations?
It is very strange that you need to increase ML_MB so much.
There can be many reasons for this but it is hard to say exactly without looking at your input and output files.

Re: memory issue for MLFF calculation for large ML_MB

Posted: Wed Jul 20, 2022 3:17 pm
by xiaoming_wang
Hi,

Please find my files attached: inoutfils.zip

Best,
Xiaoming

Re: memory issue for MLFF calculation for large ML_MB

Posted: Wed Jul 20, 2022 9:34 pm
by henrique_miranda
Ok, it seems that you are getting a very large number of local reference configurations for the hydrogen atoms, 8405 based on your ML_AB file.
This might be indicative of something wrong with your MD calculation. In principle, such a large number is not required.
For example, if the atoms are moving in an erratic way (because the forces are not accurate or the time step is too large) you will get a lot of local configurations.

Here are a few things you can try/check:
1. Reduce the time step POTIM to, for example, 1 fs. This might be needed because the hydrogen atoms oscillate quickly (even if you changed their mass to 8).
2. Run the MD calculation with the machine learning part turned off and check that the trajectories (XDATCAR) and temperatures (OSZICAR and OUTCAR) make sense.

And a few more questions in case none of the suggestions above works:
1. The OUTCAR file you sent is truncated almost at the beginning, so I cannot say much from it. Could you share it with a few more MD steps?
2. Without the machine learning turned on how many MD steps can you run with these settings?

Re: memory issue for MLFF calculation for large ML_MB

Posted: Thu Jul 21, 2022 4:34 pm
by xiaoming_wang
Thanks for your comments. I attached two files here. First, I reduced POTIM to 1 fs and EDIFF to 1.E-6, and the same errors appeared as before (reduced_potim.zip).
Second, I performed normal MD without ML_FF. It is still running, and no error has occurred. I checked the outputs and everything is OK. Since the OUTCAR is too big to upload, I included only two ionic steps (without_ml.zip).
Any more ideas?

Re: memory issue for MLFF calculation for large ML_MB

Posted: Fri Jul 22, 2022 6:16 am
by henrique_miranda
Note that with POTIM=1 the number of basis sets for the hydrogen atoms reported in your ML_ABN file is much lower than before.
This means that you should be able to lower the ML_MB parameter thus reducing the memory usage.
This might avoid the segmentation fault altogether.

The MD trajectory seems fine so far (at least the atoms are not jumping around erratically).
One thing I noticed is that the temperature drops a lot at the beginning of your MD simulation.
From the previous input files it seems that you are starting from a POSCAR file with velocities in it.
Do these velocities result from a run where you equilibrated the temperature to 300 K?
Otherwise you might want to remove the velocities from the POSCAR before starting your MD simulation.

Re: memory issue for MLFF calculation for large ML_MB

Posted: Fri Jul 22, 2022 9:11 am
by xiaoming_wang
Thanks. I forgot to mention that the first uploaded results with POTIM=8 were obtained by gradually increasing ML_MB until the memory problem occurred. The later uploaded results with POTIM=1 were obtained with the default ML_MB. So, you can see that there are different numbers of configurations in the ML_AB files. The much smaller number of basis sets for hydrogen may be due to the smaller number of configurations in the ML_ABN file. Anyway, I'll try to continue the calculations with POTIM=1 and see how far I can go.

For the normal MD calculations, I started with the POSCAR without velocities. The temperature drops a lot and goes back up to 300 K after about 600 steps. Btw, the normal MD calculations have run fine without any error so far.

Re: memory issue for MLFF calculation for large ML_MB

Posted: Fri Jul 22, 2022 12:19 pm
by henrique_miranda
> Anyway, I'll try to continue the calculations with POTIM=1 and see how far I can go.
Yes, hopefully, this fixes your problem.
You might want to visualize the MD trajectories from the MD run with POTIM=8 to check whether the atoms are following regular trajectories.
I used OVITO (https://www.ovito.org/manual/) to directly visualize the XDATCAR you sent me.
You might also plot, for example, the temperature as a function of MD step for POTIM=1 and POTIM=8 (with the proper time scaling) to get an idea of whether your time step is small enough.
> For the normal MD calculations, I started with the POSCAR without velocities. The temperature drops a lot and goes back up to 300 K after about 600 steps. Btw, the normal MD calculations have run fine without any error so far.
Ok, I just asked this because the POSCAR you posted had initial velocities.
The reason I suggested running an MD calculation without the ML part was so that you could verify whether the trajectories are reasonable.
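
The temperature comparison suggested above could be sketched in Python as follows. It assumes that each MD step in OSZICAR is reported on a line that starts with the step number followed by `T=`. The function name, the regular expression, and the sample lines are illustrative and not taken from the files in this thread. Multiplying the step index by POTIM puts runs with different time steps on a common time axis in fs.

```python
import re

# Assumed layout of an MD step line in OSZICAR: "<step> T= <temperature> E= ..."
MD_LINE = re.compile(r"^\s*(\d+)\s+T=\s*([0-9.]+)")


def temperatures(lines, potim_fs):
    """Return (time in fs, temperature in K) pairs for all MD step lines."""
    series = []
    for line in lines:
        match = MD_LINE.match(line)
        if match:
            step = int(match.group(1))
            series.append((step * potim_fs, float(match.group(2))))
    return series


def read_oszicar(path, potim_fs):
    """Convenience wrapper that reads the lines from a file."""
    with open(path) as handle:
        return temperatures(handle, potim_fs)


# Made-up sample lines, only to show the call; electronic-step lines are skipped.
sample = [
    "DAV:   1    -0.100000000000E+03   -0.10000E+03   -0.50000E+01   100   0.1E+01",
    "      1 T=   300. E= -.10000000E+03 F= -.10100000E+03 E0= -.10100000E+03",
    "      2 T=   287. E= -.10000000E+03 F= -.10090000E+03 E0= -.10090000E+03",
]
for time_fs, temp in temperatures(sample, potim_fs=0.5):
    print(time_fs, temp)
```

Calling `read_oszicar` once per run with the matching POTIM gives two series that can be plotted against the same time axis with any plotting tool.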

Re: memory issue for MLFF calculation for large ML_MB

Posted: Mon Jul 25, 2022 2:27 pm
by henrique_miranda
I mentioned POTIM=8 because you wrote that in the previous post.
However, in your input files, I only found POTIM=3 and POMASS=8 for hydrogen so I think you were referring to that.
My suggestion is to decrease to POTIM=1 while still using POMASS=8 for hydrogen.
Then visualize both MD trajectories and compare.

Re: memory issue for MLFF calculation for large ML_MB

Posted: Mon Jul 25, 2022 3:37 pm
by xiaoming_wang
Hi Henrique,

Thanks for your replies and suggestions. Yes, I meant POTIM=3.

I did several tests. It seems that increasing the accuracy of the forces and reducing POTIM reduces the basis set size. Unfortunately, with the settings (EDIFF=1.e-6, LREAL=F, POTIM=0.5), I still got similar errors as before, but with more configurations included in the ML_ABN file. I compared the calculations with and without MLFF. No obvious difference can be observed in the trajectories, but the plots of the temperature and energy fluctuations show obvious deviations.
Picture1.png
I checked the plot for silicon in the tutorial: there, the fluctuations with and without MLFF almost overlap with each other. I think there must be something wrong with my MLFF settings.

Best,
Xiaoming

Re: memory issue for MLFF calculation for large ML_MB

Posted: Tue Jul 26, 2022 7:55 am
by henrique_miranda
Hi Xiaoming,

I am having some difficulty understanding exactly what you have tested which makes it difficult for me to help.
Please try to post as many details as possible.

From what you wrote I understood that you tried to increase the precision of your calculation by changing
EDIFF=1.e-5, LREAL=T, POTIM=3
to
EDIFF=1.e-6, LREAL=F, POTIM=0.5
I will assume all other variables are unchanged.
I think this is a good idea, increase the precision of your MD run and see if the ML part still captures so many different local reference configurations.
When you say that you still have similar errors as before, do you mean that you are asked to increase ML_MB successively until you run out of memory?
How many local configurations do you have in your ML_ABN for the different species?

Did you visualize the MD trajectories you get in all these cases?
Do you see anything strange?

As for the comparison of the temperature and energies as a function of MD step with and without ML I should point out that the complete reproducibility of MD calculations is very hard to achieve.
Any small error in the forces at some point will lead to different trajectories.
You can verify this yourself by doing two MD runs without ML with lower and higher accuracy for example (EDIFF=1.e-5, LREAL=T) vs (EDIFF=1.e-6, LREAL=F).
My guess is that you would also get some deviations (I might be wrong though).

Machine learning tries to reproduce the ab initio forces but there is always some error involved which can lead to different trajectories.
The goal is always to keep this error under control.

Re: memory issue for MLFF calculation for large ML_MB

Posted: Tue Jul 26, 2022 7:04 pm
by xiaoming_wang
Hi Henrique,

Thanks for your reply.
> From what you wrote I understood that you tried to increase the precision of your calculation by changing
> EDIFF=1.e-5, LREAL=T, POTIM=3
> to
> EDIFF=1.e-6, LREAL=F, POTIM=0.5
> I will assume all other variables are unchanged.
> I think this is a good idea, increase the precision of your MD run and see if the ML part still captures so many different local reference configurations.
> When you say that you still have similar errors as before, do you mean that you are asked to increase ML_MB successively until you run out of memory?
Yes.
> How many local configurations do you have in your ML_ABN for the different species?
Here is the header of ML_ABN

 1.0 Version
**************************************************
     The number of configurations
--------------------------------------------------
        263
**************************************************
     The maximum number of atom type
--------------------------------------------------
       5
**************************************************
     The atom types in the data file
--------------------------------------------------
     Cu Cl C
     H  N
**************************************************
     The maximum number of atoms per system
--------------------------------------------------
            188
**************************************************
     The maximum number of atoms per atom type
--------------------------------------------------
             96
**************************************************
     Reference atomic energy (eV)
--------------------------------------------------
   0.0000000000000000        0.0000000000000000        0.0000000000000000
   0.0000000000000000        0.0000000000000000
**************************************************
     Atomic mass
--------------------------------------------------
   63.545999999999999        35.453000000000003        12.010999999999999
   1.0000000000000000        14.000999999999999
**************************************************
     The numbers of basis sets per atom type
--------------------------------------------------
       300  1806  2715
      8498   609
**************************************************
> Did you visualize the MD trajectories you get in all these cases?
> Do you see anything strange?
I visualized the trajectories with ASE and I did not see anything strange.
> As for the comparison of the temperature and energies as a function of MD step with and without ML I should point out that the complete reproducibility of MD calculations is very hard to achieve.
> Any small error in the forces at some point will lead to different trajectories.
> You can verify this yourself by doing two MD runs without ML with lower and higher accuracy for example (EDIFF=1.e-5, LREAL=T) vs (EDIFF=1.e-6, LREAL=F).
> My guess is that you would also get some deviations (I might be wrong though).
I totally agree with you. What I mean is that the average T or E should be comparable for the MD calculations with and without MLFF. For your reference, I attach the plot for silicon here.
Picture3.png
As you can see, the individual points may not lie on top of each other, but the average line should be comparable. With this in mind, if you go back and look at the previous plots, there seem to be more obvious deviations. I finally managed to get similar plots for my hydrogen-containing system with (EDIFF=1.e-6, LREAL=F, POTIM=0.5).
Picture2.png
I think the learned force field at this point may be acceptable. The problem is that the basis set size is still very large.
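
The per-species basis-set counts can also be read from the ML_ABN header with a short script. The following is a minimal Python sketch that relies only on the layout visible in the header quoted in this post (a line of asterisks, a title, a line of dashes, then the values). The function names are made up for illustration, and the sketch has not been checked against other ML_ABN versions.

```python
ATOM_TYPES = "The atom types in the data file"
BASIS_SETS = "The numbers of basis sets per atom type"


def is_rule(line, char):
    """True if the line consists only of the given character, e.g. '*****'."""
    return bool(line) and set(line) == {char}


def parse_header_sections(lines):
    """Map each header title to the whitespace-separated tokens below it."""
    sections = {}
    title = None
    state = "seek"
    for raw in lines:
        line = raw.strip()
        if is_rule(line, "*"):
            # Stop once both sections of interest have been read completely.
            if ATOM_TYPES in sections and BASIS_SETS in sections:
                break
            state = "title"
        elif state == "title":
            title = line
            sections[title] = []
            state = "dashes"
        elif state == "dashes":
            if is_rule(line, "-"):
                state = "values"
        elif state == "values" and line:
            sections[title].extend(line.split())
    return sections


def basis_sets_per_type(lines):
    """Pair each atom type with its number of basis sets."""
    sections = parse_header_sections(lines)
    counts = [int(token) for token in sections[BASIS_SETS]]
    return dict(zip(sections[ATOM_TYPES], counts))


# Excerpt of the header quoted in this post.
SAMPLE_HEADER = """\
**************************************************
     The atom types in the data file
--------------------------------------------------
     Cu Cl C
     H  N
**************************************************
     The numbers of basis sets per atom type
--------------------------------------------------
       300  1806  2715
      8498   609
**************************************************
"""

counts = basis_sets_per_type(SAMPLE_HEADER.splitlines())
largest = max(counts, key=counts.get)
print(counts)
print("largest basis set:", largest, counts[largest])
```

For a real file, the lines of an opened ML_ABN can be passed to `basis_sets_per_type` instead of the excerpt; parsing stops after the two sections of interest.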

Re: memory issue for MLFF calculation for large ML_MB

Posted: Wed Jul 27, 2022 11:21 am
by henrique_miranda
Thank you for posting detailed plots of your calculation.
This is certainly useful for other users too.

It seems that you are still getting a lot of different local reference configurations.
This might be due to the nature of the system you are trying to simulate but it would imply that it is very difficult to get accurate ML force fields for it or you would need a very large basis set.
Let us assume that this is not the case and instead there is another problem in the accuracy of the forces during the MD run.

A possible source of noise in the force computations for metallic systems is an ISMEAR tag that is not set appropriately.
I see that you set ISMEAR=0. This type of smearing is not ideal for metallic systems; are you sure your system has a gap at every MD step?

Re: memory issue for MLFF calculation for large ML_MB

Posted: Wed Jul 27, 2022 3:33 pm
by henrique_miranda
Once you are sure that you have chosen an adequate smearing method for your system and that the forces are not too noisy, I would suggest that you look at how the number of local reference configurations increases as a function of the MD step.
You can do so by grepping for 'SPRSC' in the ML_LOGFILE.
You can then compare how the number of local reference configurations increases for the different parameters of the DFT run.
If you have too much noise in the forces from your MD calculation I would expect that the machine learning will need to pick out a larger number of local reference configurations (but please check this for yourself).
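
As an alternative to grep, the SPRSC lines could be collected with a few lines of Python. This is only a sketch: it assumes that the data lines start with the tag `SPRSC` followed by integer columns and that explanatory lines start with `#`. The meaning of each column is not hard-coded here and should be taken from the header that the ML_LOGFILE itself prints. The sample lines are made up.

```python
def tagged_rows(lines, tag="SPRSC"):
    """Collect the integer columns of all data lines that start with `tag`.

    Lines whose first field is not the tag (including '#' comment lines)
    and lines with non-integer columns are skipped.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != tag:
            continue
        try:
            rows.append([int(field) for field in fields[1:]])
        except ValueError:
            continue
    return rows


def read_tagged_rows(path, tag="SPRSC"):
    """Convenience wrapper that reads the lines from a log file."""
    with open(path) as handle:
        return tagged_rows(handle, tag)


# Made-up sample lines, only to show the call.
sample = [
    "# SPRSC explanatory header line",
    "SPRSC 1 1 1 0 4",
    "STATUS 1 learning",
    "SPRSC 2 2 2 4 7",
]
for row in tagged_rows(sample):
    print(row)
```

Plotting one column against the step column for runs with different settings shows at a glance which species drives the growth.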

For the machine learning model to become more precise new local reference configurations have to be added.
However, you don't want this number to grow too much, otherwise you need a lot of memory for the design matrix and each ML force field evaluation becomes slower.

At this point, I would suggest, instead of thinking about how to allocate more memory for the design matrix, that you check whether you need so many local reference configurations to begin with.
There are a few variables that might help you to decrease the number of local reference configurations that the machine learning captures: ML_EPS_LOW, ML_AFILT2, ML_RCUT1 and ML_RCUT2 just to name a few.

Re: memory issue for MLFF calculation for large ML_MB

Posted: Thu Jul 28, 2022 2:16 pm
by xiaoming_wang
Hi Henrique,

Thank you for taking care of my case.

For the smearing, my system is a large-bandgap semiconductor, so I think ISMEAR=0 is OK. One question about the electronic degrees of freedom: the system is ferromagnetic with a magnetization of 4. However, during the MD, the magnetization can change to 2 or 0 for some steps, and after some steps it turns back to 4. I checked ERR and BEEF in the ML_LOGFILE, and there was a sudden change when the magnetization changed. To avoid the sudden change, I constrained the magnetization by setting NUPDOWN=4. Does this make sense?

The SPRSC plot is very helpful. As seen from the following figure (Picture4.png), the growth of the basis set size for the H atoms is the bottleneck for the ML procedure. I'm now trying to tune ML_EPS_LOW and ML_RCUT. I'm testing ML_EPS_LOW=1.e-8 and (ML_RCUT2=4, ML_MRB1=6) separately to see which could deliver an acceptable MLFF. Both settings can reduce the basis set size. It seems that with ML_EPS_LOW=1.e-8 the basis set size decreases more significantly. Maybe I should try 5.e-9.

As you mentioned ML_AFILT2, I checked the wiki and found that the default value is 0.002. However, from the ML_IAFILT2 wiki page (wiki/index.php/ML_IAFILT2), it seems that the default value is 0.02, for which we can safely use the default ML_LMAX2=4. I checked my calculations; the parameters from the ML_LOGFILE are ML_AFILT2=0.002 and ML_LMAX2=4. So it seems there are some inconsistencies. Could you please check that?

Best,
Xiaoming