Unconstrained Optimization
Mathwrist White Paper Series

Copyright Β©Mathwrist LLC 2023
(January 20, 2023)
Abstract

This document presents a technical overview of the unconstrained optimization feature for n-dimensional smooth functions in Mathwrist’s C++ Numerical Programming Library (NPL). NPL unconstrained optimization includes a set of line search based methods and a set of trust region based methods.

1 Introduction

Let ψ(𝐱): ℝⁿ → ℝ be a twice continuously differentiable function with gradient vector 𝐠(𝐱) and Hessian matrix 𝐇(𝐱). Unconstrained optimization solves the following minimization problem,

minπ±βˆˆβ„n⁑ψ⁒(𝐱)

For brevity, we use 𝐠 and 𝐇 as the notation of gradient and Hessian when the context has no ambiguity. Let 𝐩 be a step move at the current point 𝐱. If 𝐩ᵀ𝐠(𝐱) < 0, we call 𝐩 a descent direction. By Taylor expansion, we can write

ψ(𝐱 + 𝐩) = ψ(𝐱) + 𝐩ᵀ𝐠(𝐱) + ½ 𝐩ᵀ𝐇(𝐱 + θ𝐩)𝐩 (1)

for some 0 ≤ θ ≤ 1. It is then clear that an optimal point 𝐱* needs to satisfy the following conditions:

  1. Stationary condition: ‖𝐠(𝐱*)‖ = 0.

  2. Curvature condition: 𝐇(𝐱*) is positive semi-definite (necessary) or positive definite (sufficient).

NPL provides a set of line search based methods and a set of trust region based methods, derived from the LineSearch and TrustRegion base classes respectively. All algorithms iteratively generate a sequence of improved points 𝐱_k, k = 0, ⋯, converging to 𝐱*. The algorithm terminates at iteration k when ‖𝐠(𝐱_k)‖_∞ ≤ ε for some tolerance ε and 𝐇(𝐱_k) is at least positive semi-definite.

2 Line Search Method

At each iteration, a line search method first computes a descent direction 𝐩 and then computes a step length α along 𝐩 such that ψ(𝐱_k + α𝐩) produces sufficient improvement.

2.1 Step Length Requirement

A good step length not only generates a sufficient decrease of the objective but also produces a “flatter” gradient, i.e. one closer to stationary. These requirements can be stated as

ψ(𝐱_k + α𝐩) ≤ ψ(𝐱_k) + c1 α 𝐠(𝐱_k)ᵀ𝐩,
𝐠(𝐱_k + α𝐩)ᵀ𝐩 ≥ c2 𝐠(𝐱_k)ᵀ𝐩,

where 0 < c1 < c2 < 1. These two requirements are jointly known as the Wolfe conditions.

An overshooting step length may still satisfy the Wolfe conditions. To restrict α to a small neighborhood of the optimal step length, the strong Wolfe conditions enforce that

ψ(𝐱_k + α𝐩) ≤ ψ(𝐱_k) + c1 α 𝐠(𝐱_k)ᵀ𝐩,
|𝐠(𝐱_k + α𝐩)ᵀ𝐩| ≤ -c2 𝐠(𝐱_k)ᵀ𝐩,

where 0 < c1 < c2 < 1. The Wolfe and strong Wolfe conditions are theoretically sound. It can be shown that if ψ(𝐱) is continuously differentiable and bounded below along the descent direction 𝐩, then there exist intervals of step lengths in which both the Wolfe and strong Wolfe conditions are satisfied.
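Along a fixed direction 𝐩, the strong Wolfe tests reduce to two scalar inequalities in quantities the line search has already evaluated. The sketch below is purely illustrative (it is not NPL's internal API); the caller supplies ψ and 𝐠ᵀ𝐩 at steps 0 and α.

```cpp
#include <cmath>

// Strong Wolfe test for a trial step "alpha" along a fixed descent
// direction p, using scalar quantities evaluated by the caller:
//   phi0  = psi(x),            dphi0  = g(x)^T p  (< 0),
//   phi_a = psi(x + alpha*p),  dphi_a = g(x + alpha*p)^T p.
bool strong_wolfe(double phi0, double dphi0, double phi_a, double dphi_a,
                  double alpha, double c1 = 1e-4, double c2 = 0.9)
{
    bool sufficient_decrease = phi_a <= phi0 + c1 * alpha * dphi0;
    bool flat_enough = std::fabs(dphi_a) <= -c2 * dphi0;  // curvature test
    return sufficient_decrease && flat_enough;
}
```

For example, with ψ(x) = x² at x = 1 and p = -2, the exact minimizing step α = 0.5 passes the test, while the overshooting step α = 1 fails the sufficient-decrease inequality.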

By default, NPL tests a trial step length using the strong Wolfe conditions with c1 = 1.e-4 and c2 = 0.9. Typically c1 is recommended to be a small number so that the quadratic term in (1) does not dominate. If c2 is set to a very small number, the algorithm behaves as an “exact” line search method in the sense that α is very close to the optimal step length, at the cost of more trial step iterations. The default value c2 = 0.9 requires very little work, i.e. 1 or 2 iterations, to find an acceptable step length.

Because the setting of the Wolfe condition parameters is critical to the performance of a line search method, we do not expose it as a public user control function. Instead, concrete line search derived classes call a protected member function to set the Wolfe conditions as appropriate to the relevant context internally in the library implementation.

2.2 Step Length Algorithm

Fixing the search direction 𝐩, we can write the objective as a function of the step length, ψ(α), and search for the optimal step length α* with the Wolfe conditions as early termination criteria. An outline of the line search algorithm can be found in [1], section 3.4. Our implementation overall follows this search strategy. When computing a trial step length, we use bisection or safeguarded quadratic and cubic polynomial interpolation to handle different situations.
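The quadratic interpolation step is easy to state in closed form: the minimizer of the quadratic fitted to ψ(0), ψ′(0) and ψ(α) becomes the next trial step. The sketch below is the generic textbook formula, not a copy of NPL's safeguarded implementation, which would additionally clamp the result into a bracket.

```cpp
// Minimizer of the quadratic interpolant through phi(0) = phi0,
// phi'(0) = dphi0 (< 0) and phi(alpha) = phi_a.  A safeguarded
// implementation would also keep the result inside a trust bracket.
double quad_interp_step(double phi0, double dphi0, double phi_a, double alpha)
{
    return -dphi0 * alpha * alpha /
           (2.0 * (phi_a - phi0 - dphi0 * alpha));
}
```

For ψ(α) = (α - 1)², interpolating from ψ(0) = 1, ψ′(0) = -2 and ψ(2) = 1 recovers the exact minimizer α = 1.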

2.3 Search Directions

We provide four types of line search based methods: Steepest Descent, modified Newton, Quasi-Newton and Conjugate Gradient. They all share the same step length algorithm and differ in how the descent search direction is calculated.

2.3.1 Steepest Descent

Taking only the first order Taylor expansion ψ(𝐱_k + 𝐩) ≈ ψ(𝐱_k) + 𝐠(𝐱_k)ᵀ𝐩, Steepest Descent simply chooses the search direction 𝐩 = -𝐠(𝐱_k) at each step.
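The iteration is simple enough to sketch end to end. The following minimal example is illustrative only: it uses a plain Armijo backtracking search rather than NPL's strong Wolfe machinery, and a hypothetical 2-d test function ψ(𝐱) = (x0 - 1)² + 2(x1 + 0.5)².

```cpp
#include <algorithm>
#include <array>
#include <cmath>

using Vec2 = std::array<double, 2>;

// Hypothetical test function with analytic gradient; minimizer (1, -0.5).
double psi(const Vec2& x) {
    return (x[0] - 1) * (x[0] - 1) + 2 * (x[1] + 0.5) * (x[1] + 0.5);
}
Vec2 grad(const Vec2& x) {
    return { 2 * (x[0] - 1), 4 * (x[1] + 0.5) };
}

// Steepest descent with a simple Armijo backtracking line search.
Vec2 steepest_descent(Vec2 x, double tol = 1e-8, int max_iter = 1000) {
    for (int k = 0; k < max_iter; ++k) {
        Vec2 g = grad(x);
        if (std::max(std::fabs(g[0]), std::fabs(g[1])) <= tol)
            break;                        // stationary: ||g||_inf <= tol
        Vec2 p = { -g[0], -g[1] };        // steepest descent direction
        double gTp = g[0] * p[0] + g[1] * p[1];
        double alpha = 1.0;
        // Backtrack until the sufficient-decrease condition holds.
        while (psi({ x[0] + alpha * p[0], x[1] + alpha * p[1] }) >
               psi(x) + 1e-4 * alpha * gTp)
            alpha *= 0.5;
        x = { x[0] + alpha * p[0], x[1] + alpha * p[1] };
    }
    return x;
}
```

Started from the origin, the iteration converges to the minimizer (1, -0.5) of this well-conditioned quadratic in a handful of steps.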

2.3.2 Modified Newton

Assume 𝐇(𝐱) is positive definite and take the second order Taylor expansion ψ(𝐱_k + 𝐩) ≈ ψ(𝐱_k) + 𝐠(𝐱_k)ᵀ𝐩 + ½ 𝐩ᵀ𝐇(𝐱_k)𝐩 = f(𝐩) as a model function f(𝐩). Applying the stationary condition to f(𝐩), we get

∇f(𝐩) = 𝐠(𝐱_k) + 𝐇(𝐱_k)𝐩 = 0  ⇒  𝐩 = -𝐇(𝐱_k)⁻¹𝐠(𝐱_k)

Also, if the objective function is indeed quadratic, the Newton direction 𝐩 reaches the minimum 𝐱* in one unit step length. It can be shown that if 𝐱_k lands in a small neighborhood of 𝐱*, the Newton method has a second order convergence rate.
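For a strictly convex quadratic, the one-step claim is easy to verify numerically. The sketch below is illustrative only: a closed-form 2×2 solve by Cramer's rule stands in for a real matrix factorization.

```cpp
#include <array>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// Newton direction p = -H^{-1} g for the 2-d case, solving H p = -g
// in closed form by Cramer's rule (H assumed positive definite).
Vec2 newton_direction(const Mat2& H, const Vec2& g) {
    double det = H[0][0] * H[1][1] - H[0][1] * H[1][0];
    return { (-g[0] * H[1][1] + g[1] * H[0][1]) / det,
             (-g[1] * H[0][0] + g[0] * H[1][0]) / det };
}
```

For the quadratic ψ(𝐱) = ½𝐱ᵀ𝐇𝐱 + 𝐜ᵀ𝐱 with 𝐇 = diag(2, 4) and 𝐜 = (-2, 2), the gradient at 𝐱 = (3, 3) is (4, 14), and the Newton step (-2, -3.5) lands exactly on the minimizer (1, -0.5) in one unit step.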

When the Hessian matrix 𝐇(𝐱_k) is indefinite or close to singular, we use a modified Cholesky decomposition method, described briefly in [2] pp. 109-111 and in detail in [3], to construct a positive definite matrix

𝐁_k = 𝐇(𝐱_k) + 𝐄 = 𝐋𝐃𝐋ᵀ

where 𝐄 is a diagonal modification matrix such that ‖𝐄‖_∞ is minimized. The Newton direction is then computed as 𝐩 = -𝐁_k⁻¹𝐠(𝐱_k).

The modified Cholesky decomposition has another advantage when 𝐠(𝐱_k) vanishes at a saddle point. In this case the Newton direction is 0, but the modified Cholesky procedure is able to identify an index s such that the search direction 𝐩 = (𝐋ᵀ)⁻¹𝐞_s is a negative curvature direction, where 𝐞_s is the vector whose elements are all 0 except that the s-th element is 1.

2.3.3 Quasi Newton

Let 𝐬 be the step move from the current 𝐱_k and express the gradient 𝐠(𝐱_k + 𝐬) by its first order Taylor expansion, 𝐠(𝐱_k + 𝐬) ≈ 𝐠(𝐱_k) + 𝐇(𝐱_k)𝐬. We have the secant equation 𝐇(𝐱_k)𝐬 = 𝐲, where 𝐲 = 𝐠(𝐱_k + 𝐬) - 𝐠(𝐱_k).

Because the sequence of gradients 𝐠(𝐱_k), k = 0, 1, ⋯ carries curvature information, the idea of the Quasi-Newton method is to construct a Hessian approximation matrix 𝐁_k from this sequence such that:

  1. 𝐁_k 𝐬 = 𝐲 holds, known as the Quasi-Newton condition.

  2. Assuming a symmetric positive definite matrix 𝐁_{k-1} is given, once 𝐱_k is computed we update 𝐁_k = 𝐁_{k-1} + 𝐔_k with an update matrix 𝐔_k. We want 𝐁_k to be positive definite as well.

The Quasi-Newton direction is then computed by solving 𝐁_k 𝐩 = -𝐠(𝐱_k). The well known Broyden-Fletcher-Goldfarb-Shanno (BFGS) update scheme constructs 𝐁_k as follows,

𝐁_k = 𝐁_{k-1} + (1/(𝐲ᵀ𝐬)) 𝐲𝐲ᵀ - (1/(𝐬ᵀ𝐁_{k-1}𝐬)) 𝐁_{k-1}𝐬𝐬ᵀ𝐁_{k-1}

Because 𝐬 = α𝐩 and 𝐩 is computed from the Quasi-Newton direction, this update can also be written as

𝐁_k = 𝐁_{k-1} + (1/(𝐠(𝐱_k)ᵀ𝐩)) 𝐠(𝐱_k)𝐠(𝐱_k)ᵀ + (1/(α𝐲ᵀ𝐩)) 𝐲𝐲ᵀ

The BFGS update is a rank-two update scheme. Given a Cholesky decomposition 𝐁_{k-1} = 𝐋_{k-1}𝐋_{k-1}ᵀ, we implemented an efficient rank-two update algorithm on this Cholesky factorization to economically obtain the new Cholesky factorization of 𝐁_k. Relevant details can be found in [4] and [5].
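A minimal dense-matrix sketch of one BFGS update makes the formula concrete. It is only illustrative: NPL updates the Cholesky factors of 𝐁_k directly rather than forming the dense matrix.

```cpp
#include <array>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// One dense BFGS update
//   B_k = B_{k-1} + y y^T / (y^T s) - (B s)(B s)^T / (s^T B s).
// Under the Wolfe conditions y^T s > 0, which keeps B_k positive
// definite when B_{k-1} is.
Mat2 bfgs_update(const Mat2& B, const Vec2& s, const Vec2& y) {
    Vec2 Bs = { B[0][0] * s[0] + B[0][1] * s[1],
                B[1][0] * s[0] + B[1][1] * s[1] };
    double yTs  = y[0] * s[0] + y[1] * s[1];
    double sTBs = s[0] * Bs[0] + s[1] * Bs[1];
    Mat2 out = B;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            out[i][j] += y[i] * y[j] / yTs - Bs[i] * Bs[j] / sTBs;
    return out;
}
```

Starting from 𝐁 = 𝐈 with 𝐬 = (1, 0) and 𝐲 = (2, 1), the pair produced by the true Hessian [[2, 1], [1, 3]], the update yields 𝐁_k = [[2, 1], [1, 1.5]], and the secant equation 𝐁_k 𝐬 = 𝐲 holds exactly.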

2.3.4 Conjugate Gradient

The Conjugate Gradient (CG) method is an iterative method to solve large linear systems 𝐇𝐱 + 𝐜 = 0, where 𝐇 is symmetric positive definite. The solution 𝐱* of the linear system is the same as the minimizer of the quadratic objective function ψ(𝐱) = 𝐜ᵀ𝐱 + ½ 𝐱ᵀ𝐇𝐱.

If ψ(𝐱) is a general smooth convex function, we can approximate ψ(𝐱_k + α𝐩_k) in a small neighborhood of 𝐱_k by a quadratic model function. The search direction 𝐩_k can be computed from a CG method and the step length α moving along 𝐩_k is determined by a line search algorithm.

In the case of solving linear systems, the CG step direction 𝐩_k at the k-th iteration is computed from the previous direction 𝐩_{k-1}, the previous gradient 𝐠(𝐱_{k-1}) and the current gradient 𝐠(𝐱_k) as,

𝐩_0 = -𝐠(𝐱_0) (2)
𝐩_k = -𝐠(𝐱_k) + β 𝐩_{k-1} (3)
β = ‖𝐠(𝐱_k)‖² / ‖𝐠(𝐱_{k-1})‖² (4)

When minimizing a nonlinear convex function ψ(𝐱), the variation is in how to compute the coefficient β used in equation (3). This is because β computed from equation (4) does not necessarily produce a descent direction in the nonlinear case. To see this, we multiply both sides of equation (3) by 𝐠(𝐱_k)ᵀ and get

𝐠(𝐱_k)ᵀ𝐩_k = -‖𝐠(𝐱_k)‖² + β 𝐠(𝐱_k)ᵀ𝐩_{k-1} (5)

If the step length at the previous iteration was chosen by an exact line search, i.e. α reached the local minimum along 𝐩_{k-1}, we have 𝐠(𝐱_k)ᵀ𝐩_{k-1} = 0 and hence 𝐠(𝐱_k)ᵀ𝐩_k = -‖𝐠(𝐱_k)‖² < 0, so 𝐩_k calculated from (3) surely is a descent direction. In general, α is not an exact line search step and the second term 𝐠(𝐱_k)ᵀ𝐩_{k-1} is not zero. When this term happens to be positive and dominates, 𝐩_k from equation (3) is no longer a descent direction.

The Fletcher-Reeves (FR) method computes β exactly using equation (4). In addition, the strong Wolfe conditions with 0 < c1 < c2 < ½ are imposed on the step length search. It can be shown that such a choice of Wolfe conditions ensures (5) to be negative.

The Polak-Ribière+ (PR+) method uses the same Wolfe conditions as the FR method but chooses β differently,

β = max(𝐠(𝐱_k)ᵀ(𝐠(𝐱_k) - 𝐠(𝐱_{k-1})) / ‖𝐠(𝐱_{k-1})‖², 0)

We use the PR+ method in our line search conjugate gradient method because it is numerically more stable than the FR method.
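The PR+ formula is a one-liner in code. The sketch below (illustrative, with fixed-size 2-d vectors) computes β and the resulting direction from equation (3); the max with zero is exactly what distinguishes PR+ from plain Polak-Ribière.

```cpp
#include <algorithm>
#include <array>

using Vec2 = std::array<double, 2>;

// PR+ coefficient: beta = max(g_k^T (g_k - g_{k-1}) / ||g_{k-1}||^2, 0).
double beta_pr_plus(const Vec2& g, const Vec2& g_prev) {
    double num = g[0] * (g[0] - g_prev[0]) + g[1] * (g[1] - g_prev[1]);
    double den = g_prev[0] * g_prev[0] + g_prev[1] * g_prev[1];
    return std::max(num / den, 0.0);
}

// Conjugate gradient direction of equation (3): p_k = -g_k + beta * p_{k-1}.
Vec2 cg_direction(const Vec2& g, const Vec2& p_prev, double beta) {
    return { -g[0] + beta * p_prev[0], -g[1] + beta * p_prev[1] };
}
```

When 𝐠(𝐱_k)ᵀ(𝐠(𝐱_k) - 𝐠(𝐱_{k-1})) is negative, β is clamped to 0 and the iteration restarts along the steepest descent direction.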

3 Trust Region Method

Within a small neighborhood of the iteration point 𝐱_k, the trust region method approximates the original objective function by its second order Taylor expansion model function

ψ̂(𝐱_k + 𝐩) = ψ(𝐱_k) + 𝐠(𝐱_k)ᵀ𝐩 + ½ 𝐩ᵀ𝐇(𝐱_k)𝐩

Writing the model function in terms of 𝐩, we iteratively approach the solution of the original minimization problem by solving the subproblem below and making a series of such small step changes.

arg min_𝐩 ψ̂(𝐩) = 𝐠(𝐱_k)ᵀ𝐩 + ½ 𝐩ᵀ𝐇(𝐱_k)𝐩, s.t. ‖𝐩‖ ≤ Δ_k (6)

The trust region radius Δ_k is chosen to ensure that ψ̂ is an accurate approximation of ψ. To measure the agreement between the actual reduction and the model reduction, define

ρ_k = (ψ(𝐱_k) - ψ(𝐱_k + 𝐩)) / (ψ̂(0) - ψ̂(𝐩))

We can then adjust Δ_k based on ρ_k.

  1. If ρ_k is negative, we cannot take this step 𝐩 and also need to reduce Δ_k;

  2. If ρ_k is a small positive number, 𝐩 is still a descent direction to use, but the approximation is not accurate and we need to use a smaller Δ_{k+1};

  3. If ρ_k is close to 1, the approximation is accurate. We could try a larger Δ_{k+1} at the next iteration.
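The adjustment rules above can be sketched as a small state update. The thresholds (0.25, 0.75) and the shrink/expand factors below are conventional textbook choices, not necessarily NPL's internal settings.

```cpp
#include <algorithm>

// Trust region radius adjustment driven by the agreement ratio rho.
struct TrustRegionStep {
    double radius;  // Delta_{k+1}
    bool accept;    // whether the trial step p is taken
};

TrustRegionStep update_radius(double rho, double delta, double delta_max) {
    if (rho < 0.25)  // poor agreement: shrink; reject the step if rho <= 0
        return { 0.25 * delta, rho > 0.0 };
    if (rho > 0.75)  // good agreement: try a larger radius next time
        return { std::min(2.0 * delta, delta_max), true };
    return { delta, true };  // acceptable agreement: keep the radius
}
```

Note the asymmetry: a step is rejected only when ρ_k ≤ 0, while the radius may shrink even for accepted steps with small positive ρ_k.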

3.1 Nearly Exact Search Direction

When the problem size is reasonable, we can truly solve problem formulation (6), which has a global optimal solution 𝐩* iff there exists λ ≥ 0 such that the following conditions hold.

  1. (𝐇(𝐱_k) + λ𝐈)𝐩* = -𝐠(𝐱_k)

  2. λ(Δ_k - ‖𝐩*‖) = 0

  3. (𝐇(𝐱_k) + λ𝐈) is positive semi-definite

We write 𝐩*(λ) as a function of λ. Once λ is determined, we can solve for 𝐩*(λ) from the first condition. The details of how to find such a λ can be found in [1], section 4.2. Mathwrist NPL makes some subtle numerical choices in its implementation. First, when λ is found by root-finding, we use a numerically stable formulation of the objective function. Secondly, when 𝐇(𝐱_k) is positive semi-definite or indefinite, we use different matrix factorization methods to efficiently handle the different cases.

3.2 Conjugate Gradient-Steihaug Direction

When the objective function ψ(𝐱) involves a very large number of unknown variables, it could be costly to compute the “exact” search direction. Note that the trust region method only requires a descent direction 𝐩 that produces a significant reduction ρ_k. So for large problems, it can be more efficient to compute a sub-optimal 𝐩 instead of 𝐩*(λ).

The Conjugate Gradient (CG)-Steihaug method computes the search direction 𝐩 based on the linear CG method with certain modifications. Let (α_i, 𝐝_i), i = 0, ⋯ be the sequence of step lengths and step directions generated by a CG method, where 𝐝_0 is always the negative gradient direction. If ‖𝐝_0‖ > Δ_k, we simply scale 𝐝_0 down to hit the trust region boundary and obtain a Cauchy point solution, which is a descent step move.

Otherwise, we let the CG method continue generating the sequence. At the i-th CG iteration, if 𝐩 = Σ_{j=0}^{i-1} α_j𝐝_j + α_i𝐝_i exceeds Δ_k, we again scale down α_i and terminate the CG. Since the 𝐝_i are conjugate descent directions, each CG iteration gains more reduction and increases the norm of 𝐩 until either the trust region boundary is hit or the residual is small enough to stop the CG iterations. So the final direction is somewhere in between the Cauchy point and the exact solution.

However, the theoretical conjugate properties and descent direction property of CG methods rely on the positive definiteness of the matrix 𝐇(𝐱_k), which is not guaranteed for a general nonlinear function ψ(𝐱). So another modification made in the CG-Steihaug method is to test the curvature of 𝐇(𝐱_k) at each CG iteration. If 𝐝_jᵀ𝐇(𝐱_k)𝐝_j > 0, ∀j < i and 𝐝_iᵀ𝐇(𝐱_k)𝐝_i > 0, then the same linear independence argument for the CG method holds here as well. It follows that the sequence {𝐝_j} up to the i-th CG iteration has all the nice properties.

If 𝐝_iᵀ𝐇(𝐱_k)𝐝_i ≤ 0, the CG iteration is immediately stopped. Let 𝐪 = Σ_{j=0}^{i-1} α_j𝐝_j, which surely is a descent direction. It is possible to gain further reduction by finding a scalar τ and setting the trust region direction 𝐩 = 𝐪 + τ𝐝_i. Using the conjugate property, it can be shown that the total model reduction is

Δψ̂ = ψ̂(𝐪 + τ𝐝_i) - ψ̂(0)
    = [𝐠(𝐱_k)ᵀ𝐪 + ½ 𝐪ᵀ𝐇(𝐱_k)𝐪] + [τ 𝐠(𝐱_k)ᵀ𝐝_i + ½ τ² 𝐝_iᵀ𝐇(𝐱_k)𝐝_i]

where the first bracket is the 𝐪 reduction component and the second bracket is the 𝐝_i reduction component. The second order term in the 𝐝_i reduction component is non-positive (zero in the zero curvature case). Once τ has the correct sign, the larger τ, the bigger the reduction. We scale τ such that ‖𝐩‖ = Δ_k.
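Scaling to the boundary amounts to solving a scalar quadratic: for an interior point 𝐪 (‖𝐪‖ < Δ_k) and direction 𝐝_i, the positive root τ of ‖𝐪 + τ𝐝_i‖² = Δ_k² is available in closed form. A sketch, taking the inner products as inputs:

```cpp
#include <cmath>

// Positive root tau of ||q + tau*d||^2 = delta^2, written in terms of
// the inner products qq = q^T q, qd = q^T d, dd = d^T d.  Since
// ||q|| < delta, the discriminant is strictly positive.
double boundary_tau(double qq, double qd, double dd, double delta) {
    double disc = qd * qd - dd * (qq - delta * delta);
    return (-qd + std::sqrt(disc)) / dd;
}
```

For example, from the origin (𝐪 = 0) with a unit direction and Δ_k = 2, the boundary is reached at τ = 2; from 𝐪 = (1, 0) along 𝐝 = (1, 0) it is reached at τ = 1.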

We use two parameters, “exactness” 0 < ξ < 1 and “tolerance” δ > 0, to control the termination of solving a CG subproblem. At the i-th CG iteration, if the residual 𝐫 = 𝐇(𝐱_k)𝐪 + 𝐠(𝐱_k) satisfies ‖𝐫‖/‖𝐠(𝐱_k)‖ ≤ 1 - ξ or ‖𝐫‖_∞ ≤ δ, we stop the CG iterations and set the trust region step 𝐩 = 𝐪.

References

  • [1] Jorge Nocedal and Stephen J. Wright: Numerical Optimization, Springer, 1999
  • [2] Philip E. Gill, Walter Murray and Margaret H. Wright: Practical Optimization, Academic Press, 1981
  • [3] Philip E. Gill and Walter Murray: Newton-Type Methods for Unconstrained and Linearly Constrained Optimization, Mathematical Programming 7 (1974), pp. 311-350
  • [4] Philip E. Gill, G. H. Golub, Walter Murray and Michael A. Saunders: Methods for Modifying Matrix Factorizations, Mathematics of Computation, Volume 28, Number 126, April 1974, pp. 505-535
  • [5] Philip E. Gill, Walter Murray and Michael A. Saunders: Methods for Computing and Modifying the LDV Factors of a Matrix, Mathematics of Computation, Volume 29, Number 132, October 1975, pp. 1051-1077
  • [6] Anders Forsgren, Philip E. Gill and Walter Murray: Computing Modified Newton Directions Using a Partial Cholesky Factorization, SIAM J. Sci. Comput., Vol. 16, No. 1, pp. 139-150
  • [7] J. E. Dennis and Robert B. Schnabel: A New Derivation of Symmetric Positive Definite Secant Updates, CU-CS-185-80, Computer Science Technical Reports, University of Colorado at Boulder, August 1980