Initialization of design matrix failed

jie_yao2
Newbie
Posts: 9
Joined: Sun Jun 26, 2022 6:32 am

#1 Post by jie_yao2 » Sun Feb 04, 2024 12:07 am

Hi VASP team,

I hope this message finds you well. I am currently working on training an MLFF on ab initio data.

When I restart from a previous ML_ABN file (renamed to ML_AB), VASP reports: ERROR, First Initialization of design matrix (FFM%FMAT) failed.

With the same input files, the job starts fine on the same type of GPU at the supercomputer center.

It may be due to a different installation. Why does the design matrix initialization fail on my local GPU? What could be the problem, and how can I solve it?
(The files are attached.)

Thank you very much for your time,
Jie

manuel_engel1
Global Moderator
Posts: 126
Joined: Mon May 08, 2023 4:08 pm

#2 Post by manuel_engel1 » Mon Feb 05, 2024 9:32 am

Hi Jie,

I'll try to assist you with your problem, but first, I would require a bit more information. Are you using the same version of VASP in both cases? What kind of GPUs are you running on? Could you please also attach the OUTCAR files and standard output of the two runs?
Manuel
VASP developer

#3 Post by jie_yao2 » Mon Feb 05, 2024 9:53 am

Hi Manuel,

Thank you for your reply.

Both VASP installations are version 6.4.2, and both GPUs are A100s.

OUTCAR1 from the local GPU is attached. The OUTCAR from the supercomputer center is too large to upload, so
I extracted the first 100,000 lines and named the result OUTCAR2.

Jie

#4 Post by manuel_engel1 » Mon Feb 05, 2024 10:07 am

Perfect, thanks. I will look into it.
Manuel
VASP developer

#5 Post by manuel_engel1 » Mon Feb 05, 2024 11:55 am

I consulted with our machine-learning experts. It is likely that you are running out of memory on your local machine. The error message you encounter is generated by a failed allocation statement in the code.

The ML_LOGFILE contains information regarding the memory requirements of the calculation. Could you please also attach this file? How much memory do you have available on your local and on the remote machine?
Manuel
VASP developer
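
As a quick way to locate the memory information mentioned above, here is a minimal sketch that scans a log file for lines mentioning memory. The helper name `find_memory_lines` is hypothetical and not part of VASP, and the sketch makes no assumption about the actual layout of ML_LOGFILE: it only does a case-insensitive text search, so the matches still need to be read by hand.

```python
import re


def find_memory_lines(path, pattern=r"memory"):
    """Return (line_number, text) pairs for lines matching `pattern`.

    Hypothetical helper, not part of VASP. It assumes only that the log is
    plain text; it does not rely on any particular ML_LOGFILE layout.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if regex.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits


# Example usage (run in the directory of the failed calculation):
# for number, text in find_memory_lines("ML_LOGFILE"):
#     print(f"{number}: {text}")
```

Comparing the matched lines from the local and remote runs against the memory actually available on each machine would show whether the failed allocation is plausible.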

#6 Post by jie_yao2 » Mon Feb 05, 2024 12:04 pm

Hi Manuel,

It should not be due to memory. I tried another local machine with less memory, and it runs well there.
The local machine has 80 GB, the same as the remote machine.

The file is attached.

Jie

#7 Post by jie_yao2 » Mon Feb 05, 2024 10:16 pm

Hi Manuel,

I am wondering whether this is due to the different versions of the HPC SDK and CUDA.

The local A100 runs with NVHPC 23.1 and CUDA 12.0, while the others (which work) use CUDA 11.x.

Could you check whether this job runs with VASP 6.4.2 built with NVHPC 23.1 and CUDA 12.0? That might
give a clue about where to look.

Thanks,
Jie

#8 Post by manuel_engel1 » Tue Feb 06, 2024 9:54 am

Unfortunately, the best advice I can give you is to not use the GPU version of VASP to run the machine-learning code. The ML code does not benefit from GPU parallelization and is, in fact, untested when running VASP on GPU. The error you encounter might be directly related to this. Could you please try to run the code on CPU only and see if the error persists?
Last edited by manuel_engel1 on Tue Feb 06, 2024 9:55 am, edited 1 time in total.
Reason: make it clear that only the GPU + ML combination is untested
Manuel
VASP developer

#9 Post by jie_yao2 » Tue Feb 06, 2024 12:09 pm

Hi Manuel,

Do you mean I can run the CPU version of VASP on multi-node CPU cores for the machine-learning code?
(Are multi-core CPUs more efficient than a GPU when running ML_MODE = select, refit, and production runs?
Do you have any recommendations for efficiency in each ML_MODE stage?)

I tried the CPU version of VASP on CPU only, and the previous error disappeared. However, the GPU is much faster for pure ab initio calculations.

Sorry, one more question about merging different ML_AB files. The VASP wiki (https://www.vasp.at/wiki/index.php/ML_AB)
strongly advises grouping structures with the same number of elements and atoms per element in the training data.
Regarding grouping structures, does this mean that for the combined ML_AB I always use one modified header and then simply list, for example, the 48-atom structures as configurations 1 to 10, followed by the 50-atom structures as configurations 11 to 20, giving 20 structures in total (Configuration num. 1 to Configuration num. 20)? Is nothing else necessary?

Thanks a lot for the help,
Jie

#10 Post by manuel_engel1 » Tue Feb 06, 2024 1:07 pm

No worries, I hope I can clear things up.

"Do you mean I can run the CPU version of VASP on multi-node CPU cores for the machine-learning code?"
Yes.

"Are multi-core CPUs more efficient than a GPU when running ML_MODE = select, refit, and production runs? Do you have any recommendations for efficiency in each ML_MODE stage?"
The ML code does not use GPU parallelization. It is currently a CPU-only code.

"However, the GPU is much faster for pure ab initio calculations."
That is true. Unfortunately, you are currently restricted to CPU with ML in VASP. And it seems that trying to run ML calculations with a GPU involved will produce errors, so I advise against it.

For the additional question regarding the merging of ML_AB files, please open a new topic in the forum.
Manuel
VASP developer
