I am curious as to what the memory requirements for ACFDT-RPA calculations are, and what they will be for the low-scaling methods that have not yet been released. My understanding is that the difficult part of ACFDT-RPA calculations is evaluating Tr(Log(I - χ0(iω)ν)), which involves diagonalizing an (NBANDS * NKPTS)x(NBANDS * NKPTS) array for NOMEGA different frequency points. If this is correct, then it would mean the memory requirements scale as O(NBANDS^2 * NKPTS^2) (assuming that only one frequency point is stored at a time) and the computational complexity is O(NBANDS^3 * NKPTS^3 * NOMEGA). However, it is commonly said that ACFDT-RPA scales as O(N^4), where N is usually left undefined but I assume refers to the number of electrons. Have I misidentified the step which determines the overall scaling, or have I made a mistake in how I am calculating the computational complexity? What are the actual memory requirements of doing a traditional ACFDT-RPA calculation?
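To make my assumptions explicit, here is the kind of back-of-the-envelope estimate I have in mind (the NBANDS, NKPTS and NOMEGA values below are placeholders for illustration only, not my actual system):

```python
# Back-of-the-envelope estimate, assuming (as argued above) that the bottleneck
# is diagonalising a D x D matrix with D = NBANDS * NKPTS, stored for one
# frequency point at a time.
NBANDS = 2000      # placeholder values, for illustration only
NKPTS = 5
NOMEGA = 12

D = NBANDS * NKPTS
bytes_per_complex = 16                    # double-precision complex

memory_bytes = D**2 * bytes_per_complex   # O(NBANDS^2 * NKPTS^2)
operations = NOMEGA * D**3                # O(NBANDS^3 * NKPTS^3 * NOMEGA)

print(f"matrix dimension:  {D}")
print(f"memory per matrix: {memory_bytes / 2**30:.1f} GiB")
print(f"diagonalisations:  ~{operations:.1e} operations in total")
```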
Furthermore, I am concerned about the memory requirements of the as-yet unreleased low-scaling ACFDT-RPA algorithm described in J. Chem. Theory Comput. 2014, 10, 2498-2507. I understand that the scaling of this algorithm is dictated by the calculation of the real-space, imaginary-time single-particle Green's function (Eq. 32), but it appears to me that the memory requirements are dictated by the cosine transform used to obtain the reciprocal-space, imaginary-frequency RPA response function (Eq. 12). To evaluate this cosine transform, the response function must be stored simultaneously at all frequency points and all time points, resulting in a memory requirement of O(NBANDS^2 * NKPTS^2 * (NTAU + NOMEGA)). The paper suggests that "more than 20 frequency and time points will be hardly ever required", even for large systems.
Large metallic systems are likely the worst-case scenario, requiring the largest value of NOMEGA for good convergence. For example, I am interested in using RPA to study the adsorption of organic molecules on the Pd(111) surface, which I model as a 3x3 slab with 4 layers and a vacuum gap of 15 Angstrom. If I simulate this system without an adsorbate, using a 450 eV plane-wave cutoff (required because of the C and O atoms of the adsorbates) and a 3x3x1 k-point mesh (5 irreducible k-points), I find a maximum of around 28800 plane waves. For 20 omega points and 20 tau points, this suggests that around 12 TB of memory would be required to evaluate the cosine transform (a 28800x28800x5x5 complex-valued array stored at 40 time and frequency points). These memory requirements could obviously be reduced by lowering the plane-wave cutoff, shrinking the vacuum gap, using a smaller unit cell, or using fewer omega and tau points. Nevertheless, this suggests that, despite its lower formal scaling, the unreleased ACFDT-RPA algorithm may not be able to treat systems substantially larger than those accessible to the traditional algorithm, because of its larger memory requirements. Could you shed some light on the memory requirements of the current and unreleased ACFDT-RPA algorithms? Will it be possible to study my large metal-adsorbate systems with the low-scaling ACFDT-RPA algorithm?
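For reference, the arithmetic behind the 12 TB figure (my own back-of-the-envelope estimate, assuming the response function really must be held at all time and frequency points simultaneously):

```python
# Arithmetic behind the ~12 TB estimate, under my assumption that the response
# function is stored at all time AND frequency points at once.
NPW = 28800        # maximum number of plane waves for the 3x3 Pd(111) slab
NKPTS = 5          # irreducible k-points of the 3x3x1 mesh
NTAU = 20
NOMEGA = 20

bytes_per_complex = 16
array_bytes = NPW**2 * NKPTS**2 * (NTAU + NOMEGA) * bytes_per_complex
print(f"{array_bytes / 2**40:.1f} TiB")   # ~12.1 TiB
```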
Scaling behavior of ACFDT-RPA and low-scaling version
-
- Newbie
- Posts: 11
- Joined: Tue May 14, 2013 5:45 pm
- License Nr.: 5-1579
Re: Scaling behavior of ACFDT-RPA and low-scaling version
I now understand that the scaling of the existing ACFDT-RPA algorithm is determined by the construction of the non-interacting response function via the Adler-Wiser formula, which involves a rank-2 tensor update (O(N^2) work) of the non-interacting imaginary-frequency response function χ0(iω) for each pair of bands (O(N^2) pairs), resulting in a total O(N^4) scaling. However, my question about the memory requirements of the current and future ACFDT-RPA implementations still stands: is it correct that the low-scaling ACFDT-RPA algorithm will require substantially more memory than the current O(N^4) algorithm?
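For what it is worth, this is how I now picture that O(N^4) step (a rough sketch of my own, certainly not the actual VASP implementation; prefactors and occupations are omitted):

```python
import numpy as np

# Schematic Adler-Wiser construction of chi0 (my own sketch, not VASP code):
# each occupied/unoccupied band pair (O(N^2) pairs) updates the full
# NPW x NPW response matrix (O(NPW^2) work), giving O(N^4) compute overall,
# while the storage per frequency point is only the O(NPW^2) matrix itself.
def chi0_adler_wiser(rho, eps_occ, eps_unocc, omega):
    """rho[i, a, :] holds the plane-wave components <psi_i|e^{-i(G+q)r}|psi_a>
    (hypothetical data layout)."""
    nocc, nunocc, npw = rho.shape
    chi0 = np.zeros((npw, npw), dtype=complex)
    for i in range(nocc):                            # O(N) occupied bands
        for a in range(nunocc):                      # O(N) unoccupied bands
            de = eps_occ[i] - eps_unocc[a]           # transition energy (< 0)
            weight = 2.0 * de / (de**2 + omega**2)   # imaginary-frequency denominator
            chi0 += weight * np.outer(rho[i, a], rho[i, a].conj())
    return chi0
```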
-
- Administrator
- Posts: 2921
- Joined: Tue Aug 03, 2004 8:18 am
- License Nr.: 458
Re: Scaling behavior of ACFDT-RPA and low-scaling version
Yes, the low-scaling ACFDT-RPA algorithm requires substantially more memory than the current O(N^4) algorithm.
-
- Newbie
- Posts: 42
- Joined: Wed Jul 27, 2011 12:22 pm
- Location: Vienna
Re: Scaling behavior of ACFDT-RPA and low-scaling version
The new algorithm needs to store the Green's functions partly in real space, which can require quite a lot of memory, depending on the system size and the number of k-points.
Specifically, for ALGO=ACFDTR you need 16*3*N_k*N_g^2*fac bytes of RAM: 16 for double complex, 3 for two Green's functions plus one temporary array, N_k is the total number of k-points, N_g is the number of G vectors, and fac is ~4 for PRECFOCK=Fast and ~8 for PRECFOCK=Normal. In your case you would then need about 1350 GB of RAM assuming PRECFOCK=Fast, or twice that for PRECFOCK=Normal, plus roughly 50% on top for other quantities, depending on the number of frequency points, etc.
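Plugging your numbers into that formula (a quick estimate on my side; I take N_g = 28800 from your post and N_k = 9 for the full 3x3x1 mesh):

```python
# Quick evaluation of the formula above for the 3x3 Pd(111) slab.
N_k = 9            # total k-points of the 3x3x1 mesh
N_g = 28800        # plane waves, taken from the earlier post
fac = 4            # ~4 for PRECFOCK=Fast, ~8 for PRECFOCK=Normal

ram_bytes = 16 * 3 * N_k * N_g**2 * fac
print(f"{ram_bytes / 2**30:.0f} GiB")   # ~1335 GiB, i.e. the ~1350 GB above
```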
For a system I am currently looking at, with a volume of ~1500 A^3, 6 k-points, ENCUT=400, ENCUTGW=200, and 8 frequency points, I get the result in ~75 minutes on 512 cores with 8 GB RAM each, about twice as long as the diagonalisation takes. If you needed more time points, it would take longer, but the memory requirements would not increase substantially: there is a loop over time points, so only one is stored at a time, and the memory for the response function, which depends on the number of time points, is not the dominating part.
You can lower the memory requirements with ALGO=ACFDTRK, which consumes about a factor of N_k less memory in this step but increases the compute time by roughly a factor of N_k. Obviously, the number of atoms at which the new algorithm becomes faster than the old one then increases.
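To put rough numbers on that trade-off (my own arithmetic, using the same estimate as above):

```python
# Rough ACFDTR vs. ACFDTRK comparison for the same system (same numbers as above).
N_k = 9
ram_acfdtr_gib = 1335                     # PRECFOCK=Fast estimate from above
ram_acfdtrk_gib = ram_acfdtr_gib / N_k    # ~150 GiB for this step
time_penalty = N_k                        # but roughly N_k times more compute
print(f"~{ram_acfdtrk_gib:.0f} GiB, ~{time_penalty}x compute time in this step")
```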