What determines what $\alpha$ should be? Let's Taylor-expand the function around our trial inputs, and write
$$ f(\mathbf{r}) \approx f(\mathbf{r}_0) + (\mathbf{r}-\mathbf{r}_0)^\dagger\nabla f(\mathbf{r}_0)+\frac{1}{2}(\mathbf{r}-\mathbf{r}_0)^\dagger\mathrm{B}.(\mathbf{r}-\mathbf{r}_0),\tag{2} $$
where $\mathrm{B}$ is the Hessian matrix, which is the matrix of second derivatives,
$$ \mathrm{B}_{ij}=\left.\frac{\partial^2 f}{\partial r_i \partial r_j}\right\vert_{\mathbf{r}=\mathbf{r}_0}.\tag{3} $$
At the minimum, $\mathbf{r}=\mathbf{r}_\mathrm{opt}$, we know that the gradient should be zero, so let's differentiate the Taylor expansion to get the gradient expression:
$$ \nabla f(\mathbf{r}) \approx \nabla f(\mathbf{r}_0)+\mathrm{B}.(\mathbf{r}-\mathbf{r}_0).\tag{4} $$
Now we set this to zero at $\mathbf{r}=\mathbf{r}_\mathrm{opt}$ and rearrange:
$$ \begin{eqnarray} & \nabla f(\mathbf{r}_0)+\mathrm{B}.(\mathbf{r}_\mathrm{opt}-\mathbf{r}_0) &= 0\tag{5}\\ \Rightarrow & \mathrm{B}.(\mathbf{r}_\mathrm{opt}-\mathbf{r}_0) &= -\nabla f(\mathbf{r}_0)\tag{6}\\ \Rightarrow & (\mathbf{r}_\mathrm{opt}-\mathbf{r}_0) &= -\mathrm{B}^{-1}\nabla f(\mathbf{r}_0)\tag{7}\\ \Rightarrow & \mathbf{r}_\mathrm{opt} &= \mathbf{r}_0 -\mathrm{B}^{-1}\nabla f(\mathbf{r}_0)\tag{8} \end{eqnarray} $$
Comparing this with equation \eqref{eq:step}, we see that the basic form is the same, except that the scalar $\alpha$ has been replaced by the matrix $\mathrm{B}^{-1}$.
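To make equation (8) concrete, here is a minimal numerical sketch in Python/NumPy. Everything in it is an illustrative assumption: the quadratic test function, the matrix `A` standing in for the Hessian $\mathrm{B}$, the vector `b`, and the starting point `r0` are arbitrary choices, not taken from any real calculation.

```python
import numpy as np

# Illustrative quadratic test function f(r) = 0.5 r^T A r - b^T r.
# Its gradient is A r - b, its Hessian is the constant matrix A,
# and its exact minimum sits at A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # plays the role of B (assumed values)
b = np.array([1.0, -1.0])         # assumed values

def grad_f(r):
    return A @ r - b              # gradient of the quadratic

r0 = np.array([5.0, 5.0])         # arbitrary starting guess

# Equation (8): r_opt = r0 - B^{-1} grad f(r0).
# Solving the linear system is preferable to forming B^{-1} explicitly.
r_opt = r0 - np.linalg.solve(A, grad_f(r0))

print(r_opt)                      # the Newton-like step...
print(np.linalg.solve(A, b))      # ...lands exactly on A^{-1} b
```

Because this test function is exactly quadratic, the single step of equation (8) lands on the true minimum; for a general $f$ the expansion (2) is only approximate, so the step takes us to the minimum of that local quadratic model instead.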
If $\mathrm{B}^{-1}$ is a diagonal matrix, and all the eigenvalues are the same, then it has the form
$$ \mathrm{B}^{-1}=\left(\begin{array}{cccc} \lambda & 0 & 0 & \ldots\\ 0 & \lambda & 0 & \ldots\\ 0 & 0 & \lambda & \\ \vdots & \vdots & & \ddots \end{array}\right) =\lambda\mathrm{I},\tag{9} $$
where $\mathrm{I}$ is the identity matrix. In this case, setting $\alpha=\lambda$ gives the ideal step length and, in fact, jumps straight to the minimum of the function in a single step (to the extent that the quadratic approximation of equation (2) holds).
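A quick numerical check of this special case, again with made-up numbers: the Hessian below is $(1/\lambda)\mathrm{I}$, so $\mathrm{B}^{-1}=\lambda\mathrm{I}$, and a plain fixed-step update with $\alpha=\lambda$ reaches the minimum in one go.

```python
import numpy as np

lam = 0.5                         # illustrative eigenvalue of B^{-1}
A = (1.0 / lam) * np.eye(2)       # Hessian B = (1/lam) I
b = np.array([2.0, -3.0])         # assumed values; minimum is at lam * b

def grad_f(r):
    return A @ r - b              # gradient of f(r) = 0.5 r^T A r - b^T r

r0 = np.array([10.0, 10.0])       # arbitrary starting guess

# One steepest-descent step with alpha = lam:
r1 = r0 - lam * grad_f(r0)

print(r1)                         # already equal to...
print(lam * b)                    # ...the exact minimum A^{-1} b = lam * b
```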
If the eigenvalues of $\mathrm{B}^{-1}$ are different, then there is no single value of $\alpha$ which can mimic $\mathrm{B}^{-1}$ perfectly. The best we can do is to choose $\alpha$ to match $\mathrm{B}^{-1}$ as closely as possible. Since we used $\mathrm{B}^{-1}$ because of the relation
$$ \mathrm{B}^{-1}\mathrm{B}=\mathrm{I},\tag{10} $$
this leads naturally to the conclusion that we want
$$ \alpha\mathrm{B} \approx \mathrm{I}.\tag{11} $$
The identity matrix $\mathrm{I}$ has all its eigenvalues equal to 1, so the best compromise is the $\alpha$ for which the average eigenvalue of $\alpha\mathrm{B}$ is one; since the eigenvalues of $\alpha\mathrm{B}$ are just $\alpha$ times the eigenvalues of $\mathrm{B}$, this means choosing $\alpha$ to be the reciprocal of the average eigenvalue of $\mathrm{B}$.
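The sketch below illustrates that prescription, with an assumed Hessian whose eigenvalues are 1 and 4. Choosing $\alpha$ as the reciprocal of the mean eigenvalue makes the average eigenvalue of $\alpha\mathrm{B}$ equal to one, and the fixed-$\alpha$ iteration then converges steadily, though no longer in a single step.

```python
import numpy as np

# Illustrative quadratic whose Hessian has unequal eigenvalues (1 and 4).
A = np.diag([1.0, 4.0])           # plays the role of B (assumed values)
b = np.array([1.0, 1.0])          # assumed values; minimum at A^{-1} b

def grad_f(r):
    return A @ r - b

# Choose alpha so the average eigenvalue of alpha*B equals one,
# i.e. alpha = 1 / mean(eigenvalues of B) = 1 / 2.5 = 0.4 here.
alpha = 1.0 / np.mean(np.linalg.eigvalsh(A))

r = np.array([10.0, -10.0])       # arbitrary starting guess
for _ in range(50):
    r = r - alpha * grad_f(r)     # fixed-step steepest descent

print(r)                          # close to...
print(np.linalg.solve(A, b))      # ...the true minimum A^{-1} b
```

No single $\alpha$ is exact here because it cannot scale the step differently along the stiff (eigenvalue 4) and soft (eigenvalue 1) directions at the same time; that mismatch is exactly what the full matrix $\mathrm{B}^{-1}$ corrects.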
"Electrons and positrons in metal vacancies", M. Manninen, R. Nieminen, P. Hautojärvi, and J. Arponen, Phys. Rev. B 12, 4012 (1975); https://doi.org/10.1103/PhysRevB.12.4012
"Efficient iteration scheme for self-consistent pseudopotential calculations", G. P. Kerker, Phys. Rev. B 23, 3082 (1981); https://doi.org/10.1103/PhysRevB.23.3082
"A class of methods for solving nonlinear simultaneous equations", C. G. Broyden, Math. Comp. 19, 577-593 (1965)
"Convergence acceleration of iterative sequences. the case of scf iteration", P. Pulay, Chem. Phys. Lett. 73, 393 (1980); https://doi.org/10.1016/0009-2614(80)80396-4
"Improved SCF convergence acceleration", P. Pulay, J. Comput. Chem. 3, 556(1982); https://doi.org/10.1002/jcc.540030413
"Iterative Procedures for Nonlinear Integral Equations", D. G. Anderson, J. ACM 12 4 (1965); https://doi.org/10.1145/321296.321305
"Electrons and positrons in metal vacancies", M. Manninen, R. Nieminen, P. Hautojärvi, and J. Arponen, Phys. Rev. B 12, 4012 (1975).
"Efficient iteration scheme for self-consistent pseudopotential calculations", G. P. Kerker, Phys. Rev. B 23, 3082 (1981).
"A class of methods for solving nonlinear simultaneous equations", C. G. Broyden, Math. Comp. 19, 577-593 (1965).
"Convergence acceleration of iterative sequences. the case of scf iteration", P. Pulay, Chem. Phys. Lett. 73, 393 (1980).
"Improved SCF convergence acceleration", P. Pulay, J. Comput. Chem. 3, 556(1982).
"Iterative Procedures for Nonlinear Integral Equations", D. G. Anderson, J. ACM 12 4 (1965).