ML force field training does not progress


akretschmer · #1 Post by **akretschmer** » Tue Aug 13, 2024 8:37 am

I am trying to train an ML force field for AIMD of graphite. When the main loop starts, however, nothing happens, and after 3 days the job aborts because it hits the cluster's time limit.

EDIT: I am using vasp.6.4.2

INCAR:

Code:

SYSTEM = graphene
ENCUT = 550
IBRION = 0
ISIF = 3
NSW = 100
EDIFF = 1e-6
EDIFFG = 1e-5
ISMEAR = 1
SIGMA = 0.2
PREC = Accurate
ALGO = FAST
LREAL  = Auto
LWAVE  = .FALSE.        !write WAVECAR (def T)
LCHARG = .TRUE.        !write CHGCAR (def T)
NCORE = 2

IVDW = 12

ML_LMLFF = .TRUE.
POTIM = 0.7
MDALGO = 3
ISYM = 0
ML_MODE = train
TEBEG = 50
TEEND = 500

I tried the same with other xc functionals, but they all fail in the same way. I relaxed the cell beforehand with a static ab initio (AI) calculation, which ran perfectly fine. I then just added the last block of tags to the INCAR file.

POSCAR:

Code:

Graphite
   1.0000000000000000
     7.4030347517527781   -0.0111052851320072    0.0159637274031600
    -3.7026587113806686    6.4131930112247382    0.0006787546983971
    -0.0000000000000000    0.0000000000000002   13.4433740916905080
   C
    72
Direct
  0.0049278469335001  0.0024455047286062  0.1252358593113222
 -0.0007109408973012 -0.0008817066211251  0.6240798432840473
  0.0057835149373610  0.3352854444422640  0.1248845911781240
 -0.0024288457598892  0.3334761305940993  0.6227157596021295
  0.0056777522820779  0.6705103407953577  0.1247834411600615
 -0.0037804529108982  0.6657771079223536  0.6244463858538861
  0.3396929031149822  0.0032408933870816  0.1252927693876168
  0.3312449051078522 -0.0032995018962406  0.6234079387789914
  0.3391128084622168  0.3339467305997547  0.1265875199713704
  0.3308643961844996  0.3322887469818847  0.6236279494970725
  0.3374444372869301  0.6675729690985164  0.1271281112655019
  0.3312824223503720  0.6659573084352595  0.6234714993320106
  0.6711900550506326  0.0026275769658387  0.1249047189506541
  0.6652487561523698 -0.0011190686765061  0.6226569736164909
  0.6741870084050987  0.3372091342224170  0.1261005097108367
  0.6634041769965733  0.3314381201595334  0.6229545484584281
  0.6713049103122155  0.6687163506957010  0.1245989025324702
  0.6634224418383139  0.6655925929113597  0.6233121542442099
  0.0085656175418687 -0.0002636903059656  0.3687874258165127
  0.0049255024652602  0.0024063734478703  0.8764352481874031
  0.0067971933649898  0.3322612763213928  0.3704382375466629
  0.0056206404654441  0.3371443372841954  0.8773704978551391
  0.0079664359150852  0.6658137344357871  0.3683180423549069
  0.0051548649432684  0.6696964301109145  0.8763691547647932
  0.3407996485864724  0.0008049412391801  0.3675210170564341
  0.3394673665100049  0.0027199387934017  0.8763994323005698
  0.3403238589293950  0.3326906770325044  0.3681057314301978
  0.3391904444225973  0.3359591581637849  0.8747233893362275
  0.3417223323281676  0.6670986666757702  0.3682899691247028
  0.3386844864779097  0.6699644799180259  0.8761319696073786
  0.6730679249343270 -0.0016036249220819  0.3693117932714712
  0.6732955261519441  0.0036141804644993  0.8769228820002187
  0.6742516243213496  0.3328447706408325  0.3681570751986207
  0.6735315505826842  0.3372005113332451  0.8758646131204946
  0.6739319882003703  0.6663518268579599  0.3680627293048768
  0.6709395841566800  0.6700250301374353  0.8761194495350685
  0.2274932196082184  0.1128086522023745  0.1266672800376033
  0.2206087137491038  0.1097578293189827  0.6248329222927036
  0.2278922090841584  0.4445912150011435  0.1267045734700731
  0.2199254165458235  0.4430531523243142  0.6240898179035792
  0.2275382483493668  0.7795187129914795  0.1268633289051972
  0.2190711442435241  0.7757190930410361  0.6245821350676810
  0.5607630741469803  0.1131138601821316  0.1258736027732899
  0.5536509258719490  0.1095526817655896  0.6234479313227643
  0.5627304975370685  0.4469026561266214  0.1270103871239664
  0.5522735812247701  0.4442735697334139  0.6234282328623040
  0.5594230164553252  0.7786408786840243  0.1261711359877175
  0.5539978387859763  0.7778729838502607  0.6236854052256970
  0.8934179450135686  0.1130363555711553  0.1257033626365640
  0.8894571620127282  0.1121126257972904  0.6232254847528574
  0.8960843208236453  0.4477623470817549  0.1246607226163121
  0.8847737148460817  0.4433698583875044  0.6237321301406060
  0.8942406257776696  0.7806786120743753  0.1240870420758763
  0.8871480091699188  0.7787414333359274  0.6250895482944877
  0.1187539064375385  0.2211299307261630  0.3690276325830065
  0.1155861219429878  0.2244436472881211  0.8760486011449945
  0.1179867036361861  0.5547247563743948  0.3693579762748144
  0.1150339138033622  0.5599802825929336  0.8769218147655224
  0.1194028179518858  0.8892487973452687  0.3685166815854137
  0.1160625222523202  0.8924670331506304  0.8754098046236934
  0.4519514524437534  0.2207169334854960  0.3685663516469894
  0.4490730911104343  0.2253180250450681  0.8750536056720808
  0.4523514101548700  0.5554240648100396  0.3685471906471366
  0.4470630976123472  0.5573373674207622  0.8749391262064576
  0.4523750051610730  0.8874116411786455  0.3685958136462503
  0.4501586553917471  0.8924734228393962  0.8761060833236406
  0.7857233558462281  0.2215002121497795  0.3700024566542869
  0.7840160470304917  0.2268659351347516  0.8764746558520714
  0.7865816953110767  0.5553536880688398  0.3681545304547837
  0.7808866041887622  0.5583601258270514  0.8759012408518100
  0.7838530910026997  0.8879276403193670  0.3690368494450083
  0.7816062435294621  0.8912044822654043  0.8774607633464009

The main-loop section of the ML_LOGFILE contains only this single line, which tells me that an ab initio step is being run, as it should be:

Code:

--------------------------------------------------------------------------------
STATUS                  0 threshold  2      T      F         0         0

The OSZICAR is likewise empty.

The OUTCAR file stops at the first iteration of the first ionic step:

Code:

 ML FREE ENERGIE OF THE ION-ELECTRON SYSTEM (eV)
  ---------------------------------------------------
  free  energy ML TOTEN  =         0.00000000 eV

  ML energy  without entropy=        0.00000000  ML energy(sigma->0) =        0.00000000

      MLFF:  cpu time      0.3316: real time     21.2640


--------------------------------------- Iteration      1(   1)  ---------------------------------------


    POTLOK:  cpu time      1.2962: real time     65.8013
    SETDIJ:  cpu time      0.1713: real time     11.0067

What am I doing wrong?

#2 Post by **ferenc_karsai** » Tue Aug 13, 2024 9:48 am

Please upload all necessary files (POSCAR, POTCAR, KPOINTS, ML_AB, INCAR, OUTCAR, OSZICAR, ML_LOGFILE and stdout) according to the forum guidelines.

akretschmer · #3 Post by **akretschmer** » Tue Aug 13, 2024 10:49 am

Here are the files. I cannot provide the ML_AB file as it was not generated.

PBE-D3.zip

#4 Post by **ferenc_karsai** » Fri Aug 16, 2024 2:27 pm

I just tried your example. I think you have too many k-points (163 in the irreducible Brillouin zone), so the calculation takes forever.
Please try a reduced number of k-points (maybe run a convergence test on a single structure, starting from a coarse mesh and working up).
Also try to parallelize over the k-points. To do that, start the calculation and let it run until NKPTS appears in the OUTCAR file (this should take a few tens of seconds).
Then stop the calculation, set KPAR in the INCAR file to the NKPTS value you looked up, and restart your calculation.
This should also noticeably speed up your calculation.
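
The lookup step described above could be scripted. The following is only a minimal sketch: the helper name `read_nkpts` is hypothetical, and the sample line is an assumed illustration of how NKPTS is printed in the OUTCAR. The regular expression simply searches for `NKPTS = <integer>`.

```python
import re
from typing import Optional


def read_nkpts(outcar_text: str) -> Optional[int]:
    """Return the first NKPTS value found in OUTCAR text, or None if absent."""
    match = re.search(r"NKPTS\s*=\s*(\d+)", outcar_text)
    return int(match.group(1)) if match else None


# Illustrative line (assumed layout, not copied from the uploaded files).
sample = " k-points           NKPTS =    163   k-points in BZ     NKDIM =    163"
print(read_nkpts(sample))
```

For a real run, the text would come from the file itself, for example `read_nkpts(open("OUTCAR").read())`. A return value of `None` means the run has not yet reached the point where NKPTS is written.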

akretschmer · #5 Post by **akretschmer** » Wed Sep 18, 2024 9:12 am

I have reduced the number of k-points, but the problem persists. When I set the KPAR tag to the value of NKPTS, I get an error and the calculation starts; KPAR = 2 is the only setting that has worked for me so far on the VSC5 (128 cores per node).

I have now tried a single cell with only 4 atoms and still not a single step is completed, so I guess there must be something else wrong. I have uploaded my latest try with the small cell.

Test.zip

#6 Post by **ferenc_karsai** » Thu Sep 26, 2024 11:09 am

I ran the calculation with 122 cores and saw no problems. I used both VASP 6.4.3 and 6.4.2 (the version from your OUTCAR). From now on, please run with VASP 6.4.3 if possible.
The OUTCAR you sent shows that you used 128 cores. I guess that OUTCAR is from a previous run.
If you run KPAR=122 on 128 cores, the code should terminate with the following error:
"M_divide: can not subdivide 128 nodes by 122".
But I guess you ran with 122 cores, so stdout and OUTCAR don't match. Always run with a number of cores that is an integer multiple of KPAR.
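
The divisibility rule can be checked before a job is submitted. The sketch below is illustrative only: the helper name `valid_kpar` is hypothetical, and capping KPAR at NKPTS is an assumption of this example (there is nothing to gain from more k-point groups than k-points). It lists every KPAR that divides the number of cores evenly.

```python
from typing import List


def valid_kpar(n_cores: int, nkpts: int) -> List[int]:
    """List KPAR values that divide n_cores evenly and do not exceed nkpts."""
    upper = min(n_cores, nkpts)  # assumption: no more k-point groups than k-points
    return [k for k in range(1, upper + 1) if n_cores % k == 0]


# The case quoted above: 128 cores with KPAR = 122 is not an even split.
print(128 % 122 == 0)
print(valid_kpar(128, 122))
print(valid_kpar(122, 122))
```

For 128 cores, only powers of two up to 64 qualify, which is consistent with KPAR = 2 being accepted on a 128-core node while KPAR = 122 is rejected.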

So it seems your installation has a problem. Please try compiling first without "-DscaLAPACK" and then without "-Duse_shmem" in the makefile.include. If the code compiled with either of these choices works, you know that the problem lies in either your scaLAPACK library or your shared memory.

I've used this toolchain:
gfortran 11.2, openmpi 4.1.2 and mkl 2022.0.1

If you have access to these compilers and libraries then please try these.
