Unconstrained Optimization
Mathwrist White Paper Series

Copyright Β©Mathwrist LLC 2023
(January 20, 2023)
Abstract

This document presents a technical overview of the unconstrained optimization feature for n-dimensional smooth functions in Mathwrist’s C++ Numerical Programming Library (NPL). NPL unconstrained optimization includes a set of line search based methods and a set of trust region based methods.

1 Introduction

Let ψ(𝐱): ℝⁿ → ℝ be a twice continuously differentiable function with gradient vector 𝐠(𝐱) and Hessian matrix 𝐇(𝐱). Unconstrained optimization solves the following minimization problem,

minπ±βˆˆβ„n⁑ψ⁒(𝐱)

For brevity, we use 𝐠 and 𝐇 as the notation of gradient and Hessian when the context has no ambiguity. Let 𝐩 be a step move at the current point 𝐱. If 𝐩ᵀ𝐠(𝐱) < 0, we call 𝐩 a descent direction. By Taylor expansion, we can write

ψ(𝐱 + 𝐩) = ψ(𝐱) + 𝐩ᵀ𝐠(𝐱) + ½ 𝐩ᵀ𝐇(𝐱 + θ𝐩)𝐩 (1)

for some 0 ≤ θ ≤ 1. It is then clear that an optimal point 𝐱* needs to satisfy the following conditions:

  1. Stationary condition: ‖𝐠(𝐱*)‖ = 0.

  2. Curvature condition: 𝐇(𝐱*) is positive semi-definite (necessary) or positive definite (sufficient).

NPL provides a set of line search based methods and a set of trust region based methods, derived from the LineSearch and TrustRegion base classes respectively. All algorithms iteratively generate a sequence of improved points 𝐱_k, k = 0, ⋯, converging to 𝐱*. The algorithm terminates at iteration k when ‖𝐠(𝐱_k)‖_∞ ≤ ε for some tolerance ε and 𝐇(𝐱_k) is at least positive semi-definite.

2 Line Search Method

At each iteration, a line search method first computes a descent direction 𝐩 and then computes a step length α along 𝐩 such that ψ(𝐱_k + α𝐩) produces sufficient improvement.

2.1 Step Length Requirement

A good step length not only generates a sufficient decrease of the objective but also produces a “flatter” gradient, i.e. one closer to stationary. These requirements can be stated as

ψ(𝐱_k + α𝐩) ≤ ψ(𝐱_k) + c1 α 𝐠(𝐱_k)ᵀ𝐩,
𝐠(𝐱_k + α𝐩)ᵀ𝐩 ≥ c2 𝐠(𝐱_k)ᵀ𝐩,

where 0 < c1 < c2 < 1. These two requirements are jointly known as the Wolfe conditions.

An overshooting step length may still satisfy the Wolfe conditions. To restrict α to a small neighborhood of the optimal step length, the strong Wolfe conditions enforce that

ψ(𝐱_k + α𝐩) ≤ ψ(𝐱_k) + c1 α 𝐠(𝐱_k)ᵀ𝐩,
|𝐠(𝐱_k + α𝐩)ᵀ𝐩| ≤ -c2 𝐠(𝐱_k)ᵀ𝐩,

where 0 < c1 < c2 < 1. The Wolfe and strong Wolfe conditions are theoretically sound. It can be shown that if ψ(𝐱) is continuously differentiable and bounded below along the descent direction 𝐩, then there exist intervals of step lengths in which both the Wolfe and strong Wolfe conditions are satisfied.
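Along a fixed direction 𝐩, the strong Wolfe tests reduce to two scalar inequalities in quantities the line search has already evaluated. The sketch below is purely illustrative (it is not NPL's internal API); the caller supplies ψ and 𝐠ᵀ𝐩 at steps 0 and α.

```cpp
#include <cmath>

// Strong Wolfe test for a trial step "alpha" along a fixed descent
// direction p, using scalar quantities evaluated by the caller:
//   phi0  = psi(x),            dphi0  = g(x)^T p  (< 0),
//   phi_a = psi(x + alpha*p),  dphi_a = g(x + alpha*p)^T p.
bool strong_wolfe(double phi0, double dphi0, double phi_a, double dphi_a,
                  double alpha, double c1 = 1e-4, double c2 = 0.9)
{
    bool sufficient_decrease = phi_a <= phi0 + c1 * alpha * dphi0;
    bool flat_enough = std::fabs(dphi_a) <= -c2 * dphi0;  // curvature test
    return sufficient_decrease && flat_enough;
}
```

For example, with ψ(x) = x² at x = 1 and p = -2, the exact minimizing step α = 0.5 passes the test, while the overshooting step α = 1 fails the sufficient-decrease inequality.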

By default, NPL tests a trial step length using the strong Wolfe conditions with c1 = 1.e-4 and c2 = 0.9. Typically c1 is recommended to be a small number so that the quadratic term in (1) does not dominate. If c2 is set to a very small number, the algorithm behaves as an “exact” line search method in the sense that α is very close to the optimal step length, at the cost of more trial step iterations. The default value c2 = 0.9 requires very little work, i.e. 1 or 2 iterations, to find an acceptable step length.

Because the setting of the Wolfe condition parameters is critical to the performance of a line search method, we do not expose it as a public user control function. Instead, concrete line search derived classes call a protected member function to set the Wolfe conditions as appropriate to the relevant context internally in the library implementation.

2.2 Step Length Algorithm

Fixing the search direction 𝐩, we can write the objective as a function of the step length, ψ(α), and search for the optimal step length α* with the Wolfe conditions as early termination criteria. An outline of the line search algorithm can be found in [1], section 3.4. Our implementation overall follows this search strategy. When computing a trial step length, we use bisection or safeguarded quadratic and cubic polynomial interpolation to handle different situations.
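The quadratic interpolation step is easy to state in closed form: the minimizer of the quadratic fitted to ψ(0), ψ′(0) and ψ(α) becomes the next trial step. The sketch below is the generic textbook formula, not a copy of NPL's safeguarded implementation, which would additionally clamp the result into a bracket.

```cpp
// Minimizer of the quadratic interpolant through phi(0) = phi0,
// phi'(0) = dphi0 (< 0) and phi(alpha) = phi_a.  A safeguarded
// implementation would also keep the result inside a trust bracket.
double quad_interp_step(double phi0, double dphi0, double phi_a, double alpha)
{
    return -dphi0 * alpha * alpha /
           (2.0 * (phi_a - phi0 - dphi0 * alpha));
}
```

For ψ(α) = (α - 1)², interpolating from ψ(0) = 1, ψ′(0) = -2 and ψ(2) = 1 recovers the exact minimizer α = 1.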

2.3 Search Directions

We provide four types of line search based methods: Steepest Descent, modified Newton, Quasi-Newton and Conjugate Gradient. They all share the same step length algorithm and differ in how the descent search direction is calculated.

2.3.1 Steepest Descent

Taking only the first order Taylor expansion ψ(𝐱_k + 𝐩) ≈ ψ(𝐱_k) + 𝐠(𝐱_k)ᵀ𝐩, Steepest Descent simply chooses the search direction 𝐩 = -𝐠(𝐱_k) at each step.
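The iteration is simple enough to sketch end to end. The following minimal example is illustrative only: it uses a plain Armijo backtracking search rather than NPL's strong Wolfe machinery, and a hypothetical 2-d test function ψ(𝐱) = (x0 - 1)² + 2(x1 + 0.5)².

```cpp
#include <algorithm>
#include <array>
#include <cmath>

using Vec2 = std::array<double, 2>;

// Hypothetical test function with analytic gradient; minimizer (1, -0.5).
double psi(const Vec2& x) {
    return (x[0] - 1) * (x[0] - 1) + 2 * (x[1] + 0.5) * (x[1] + 0.5);
}
Vec2 grad(const Vec2& x) {
    return { 2 * (x[0] - 1), 4 * (x[1] + 0.5) };
}

// Steepest descent with a simple Armijo backtracking line search.
Vec2 steepest_descent(Vec2 x, double tol = 1e-8, int max_iter = 1000) {
    for (int k = 0; k < max_iter; ++k) {
        Vec2 g = grad(x);
        if (std::max(std::fabs(g[0]), std::fabs(g[1])) <= tol)
            break;                        // stationary: ||g||_inf <= tol
        Vec2 p = { -g[0], -g[1] };        // steepest descent direction
        double gTp = g[0] * p[0] + g[1] * p[1];
        double alpha = 1.0;
        // Backtrack until the sufficient-decrease condition holds.
        while (psi({ x[0] + alpha * p[0], x[1] + alpha * p[1] }) >
               psi(x) + 1e-4 * alpha * gTp)
            alpha *= 0.5;
        x = { x[0] + alpha * p[0], x[1] + alpha * p[1] };
    }
    return x;
}
```

Started from the origin, the iteration converges to the minimizer (1, -0.5) of this well-conditioned quadratic in a handful of steps.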

2.3.2 Modified Newton

Assume 𝐇(𝐱) is positive definite and take the second order Taylor expansion ψ(𝐱_k + 𝐩) ≈ ψ(𝐱_k) + 𝐠(𝐱_k)ᵀ𝐩 + ½ 𝐩ᵀ𝐇(𝐱_k)𝐩 = f(𝐩) as a model function f(𝐩). Applying the stationary condition to f(𝐩), we get

∇f(𝐩) = 𝐠(𝐱_k) + 𝐇(𝐱_k)𝐩 = 0  ⇒  𝐩 = -𝐇(𝐱_k)⁻¹𝐠(𝐱_k)

Also, if the objective function is indeed quadratic, the Newton direction 𝐩 reaches the minimum 𝐱* in one unit step length. It can be shown that if 𝐱_k lands in a small neighborhood of 𝐱*, the Newton method has a second order convergence rate.
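For a strictly convex quadratic, the one-step claim is easy to verify numerically. The sketch below is illustrative only: a closed-form 2×2 solve by Cramer's rule stands in for a real matrix factorization.

```cpp
#include <array>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// Newton direction p = -H^{-1} g for the 2-d case, solving H p = -g
// in closed form by Cramer's rule (H assumed positive definite).
Vec2 newton_direction(const Mat2& H, const Vec2& g) {
    double det = H[0][0] * H[1][1] - H[0][1] * H[1][0];
    return { (-g[0] * H[1][1] + g[1] * H[0][1]) / det,
             (-g[1] * H[0][0] + g[0] * H[1][0]) / det };
}
```

For the quadratic ψ(𝐱) = ½𝐱ᵀ𝐇𝐱 + 𝐜ᵀ𝐱 with 𝐇 = diag(2, 4) and 𝐜 = (-2, 2), the gradient at 𝐱 = (3, 3) is (4, 14), and the Newton step (-2, -3.5) lands exactly on the minimizer (1, -0.5) in one unit step.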

When the Hessian matrix 𝐇(𝐱_k) is indefinite or close to singular, we use a modified Cholesky decomposition method, described briefly in [2] pp. 109-111 and in detail in [3], to construct a positive definite matrix

𝐁_k = 𝐇(𝐱_k) + 𝐄 = 𝐋𝐃𝐋ᵀ

where 𝐄 is a diagonal modification matrix such that ‖𝐄‖_∞ is minimized. The Newton direction is then computed as 𝐩 = -𝐁_k⁻¹𝐠(𝐱_k).

The modified Cholesky decomposition has another advantage when 𝐠(𝐱_k) vanishes at a saddle point. In this case the Newton direction is 0, but the modified Cholesky procedure is able to identify an index s such that the search direction 𝐩 = (𝐋ᵀ)⁻¹𝐞_s is a negative curvature direction, where 𝐞_s is the vector whose elements are all 0 except that the s-th element is 1.

2.3.3 Quasi Newton

Let 𝐬 be the step move from the current 𝐱_k and express the gradient 𝐠(𝐱_k + 𝐬) by its first order Taylor expansion, 𝐠(𝐱_k + 𝐬) ≈ 𝐠(𝐱_k) + 𝐇(𝐱_k)𝐬. We have the secant equation 𝐇(𝐱_k)𝐬 = 𝐲, where 𝐲 = 𝐠(𝐱_k + 𝐬) - 𝐠(𝐱_k).

Because the sequence of gradients 𝐠(𝐱_k), k = 0, 1, ⋯ carries curvature information, the idea of the Quasi-Newton method is to construct a Hessian approximation matrix 𝐁_k from this sequence such that:

  1. 𝐁_k 𝐬 = 𝐲 holds, known as the Quasi-Newton condition.

  2. Assuming a symmetric positive definite matrix 𝐁_{k-1} is given, once 𝐱_k is computed we update 𝐁_k = 𝐁_{k-1} + 𝐔_k with an update matrix 𝐔_k. We want 𝐁_k to be positive definite as well.

The Quasi-Newton direction is then computed by solving 𝐁_k 𝐩 = -𝐠(𝐱_k). The well known Broyden-Fletcher-Goldfarb-Shanno (BFGS) update scheme constructs 𝐁_k as follows,

𝐁_k = 𝐁_{k-1} + (1/(𝐲ᵀ𝐬)) 𝐲𝐲ᵀ - (1/(𝐬ᵀ𝐁_{k-1}𝐬)) 𝐁_{k-1}𝐬𝐬ᵀ𝐁_{k-1}

Because 𝐬 = α𝐩 and 𝐩 is computed from the Quasi-Newton direction, this update can also be written as

𝐁_k = 𝐁_{k-1} + (1/(𝐠(𝐱_k)ᵀ𝐩)) 𝐠(𝐱_k)𝐠(𝐱_k)ᵀ + (1/(α𝐲ᵀ𝐩)) 𝐲𝐲ᵀ

The BFGS update is a rank-two update scheme. Given a Cholesky decomposition 𝐁_{k-1} = 𝐋_{k-1}𝐋_{k-1}ᵀ, we implemented an efficient rank-two update algorithm on this Cholesky factorization to economically obtain the new Cholesky factorization of 𝐁_k. Relevant details can be found in [4] and [5].
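A minimal dense-matrix sketch of one BFGS update makes the formula concrete. It is only illustrative: NPL updates the Cholesky factors of 𝐁_k directly rather than forming the dense matrix.

```cpp
#include <array>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// One dense BFGS update
//   B_k = B_{k-1} + y y^T / (y^T s) - (B s)(B s)^T / (s^T B s).
// Under the Wolfe conditions y^T s > 0, which keeps B_k positive
// definite when B_{k-1} is.
Mat2 bfgs_update(const Mat2& B, const Vec2& s, const Vec2& y) {
    Vec2 Bs = { B[0][0] * s[0] + B[0][1] * s[1],
                B[1][0] * s[0] + B[1][1] * s[1] };
    double yTs  = y[0] * s[0] + y[1] * s[1];
    double sTBs = s[0] * Bs[0] + s[1] * Bs[1];
    Mat2 out = B;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            out[i][j] += y[i] * y[j] / yTs - Bs[i] * Bs[j] / sTBs;
    return out;
}
```

Starting from 𝐁 = 𝐈 with 𝐬 = (1, 0) and 𝐲 = (2, 1), the pair produced by the true Hessian [[2, 1], [1, 3]], the update yields 𝐁_k = [[2, 1], [1, 1.5]], and the secant equation 𝐁_k 𝐬 = 𝐲 holds exactly.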

2.3.4 Conjugate Gradient

The Conjugate Gradient (CG) method is an iterative method to solve large linear systems 𝐇𝐱 + 𝐜 = 0, where 𝐇 is symmetric positive definite. The solution 𝐱* of the linear system is the same as the minimizer of the quadratic objective function ψ(𝐱) = 𝐜ᵀ𝐱 + ½ 𝐱ᵀ𝐇𝐱.

If ψ(𝐱) is a general smooth convex function, we can approximate ψ(𝐱_k + α𝐩_k) in a small neighborhood of 𝐱_k by a quadratic model function. The search direction 𝐩_k can be computed from a CG method and the step length α moving along 𝐩_k is determined by a line search algorithm.

In the case of solving linear systems, the CG step direction 𝐩_k at the k-th iteration is computed from the previous direction 𝐩_{k-1}, the previous gradient 𝐠(𝐱_{k-1}) and the current gradient 𝐠(𝐱_k) as,

𝐩_0 = -𝐠(𝐱_0) (2)
𝐩_k = -𝐠(𝐱_k) + β 𝐩_{k-1} (3)
β = ‖𝐠(𝐱_k)‖² / ‖𝐠(𝐱_{k-1})‖² (4)

When minimizing a nonlinear convex function ψ(𝐱), the variation is in how to compute the coefficient β used in equation (3). This is because β computed from equation (4) does not necessarily produce a descent direction in the nonlinear case. To see this, we multiply both sides of equation (3) by 𝐠(𝐱_k)ᵀ and get

𝐠(𝐱_k)ᵀ𝐩_k = -‖𝐠(𝐱_k)‖² + β 𝐠(𝐱_k)ᵀ𝐩_{k-1} (5)

If the step length at the previous iteration was chosen by an exact line search, i.e. α reached the local minimum along 𝐩_{k-1}, we have 𝐠(𝐱_k)ᵀ𝐩_{k-1} = 0 and hence 𝐠(𝐱_k)ᵀ𝐩_k = -‖𝐠(𝐱_k)‖² < 0, so 𝐩_k calculated from (3) surely is a descent direction. In general, α is not an exact line search step and the second term 𝐠(𝐱_k)ᵀ𝐩_{k-1} is not zero. When this term happens to be positive and dominates, 𝐩_k from equation (3) is no longer a descent direction.

The Fletcher-Reeves (FR) method computes β exactly using equation (4). In addition, the strong Wolfe conditions with 0 < c1 < c2 < ½ are imposed on the step length search. It can be shown that such a choice of Wolfe conditions ensures (5) to be negative.

The Polak-Ribière+ (PR+) method uses the same Wolfe conditions as the FR method but chooses β differently,

β = max(𝐠(𝐱_k)ᵀ(𝐠(𝐱_k) - 𝐠(𝐱_{k-1})) / ‖𝐠(𝐱_{k-1})‖², 0)

We use the PR+ method in our line search conjugate gradient method because it is numerically more stable than the FR method.
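The PR+ formula is a one-liner in code. The sketch below (illustrative, with fixed-size 2-d vectors) computes β and the resulting direction from equation (3); the max with zero is exactly what distinguishes PR+ from plain Polak-Ribière.

```cpp
#include <algorithm>
#include <array>

using Vec2 = std::array<double, 2>;

// PR+ coefficient: beta = max(g_k^T (g_k - g_{k-1}) / ||g_{k-1}||^2, 0).
double beta_pr_plus(const Vec2& g, const Vec2& g_prev) {
    double num = g[0] * (g[0] - g_prev[0]) + g[1] * (g[1] - g_prev[1]);
    double den = g_prev[0] * g_prev[0] + g_prev[1] * g_prev[1];
    return std::max(num / den, 0.0);
}

// Conjugate gradient direction of equation (3): p_k = -g_k + beta * p_{k-1}.
Vec2 cg_direction(const Vec2& g, const Vec2& p_prev, double beta) {
    return { -g[0] + beta * p_prev[0], -g[1] + beta * p_prev[1] };
}
```

When 𝐠(𝐱_k)ᵀ(𝐠(𝐱_k) - 𝐠(𝐱_{k-1})) is negative, β is clamped to 0 and the iteration restarts along the steepest descent direction.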

3 Trust Region Method

Within a small neighborhood of the iteration point 𝐱_k, the trust region method approximates the original objective function by its second order Taylor expansion model function

ψ̂(𝐱_k + 𝐩) = ψ(𝐱_k) + 𝐠(𝐱_k)ᵀ𝐩 + ½ 𝐩ᵀ𝐇(𝐱_k)𝐩

Writing the model function in terms of 𝐩, we iteratively approach the solution of the original minimization problem by solving the subproblem below and making a series of such small step changes.

arg min_𝐩 ψ̂(𝐩) = 𝐠(𝐱_k)ᵀ𝐩 + ½ 𝐩ᵀ𝐇(𝐱_k)𝐩, s.t. ‖𝐩‖ ≤ Δ_k (6)

The trust region radius Δ_k is chosen to ensure that ψ̂ is an accurate approximation of ψ. To measure the agreement between the actual reduction and the model reduction, define

ρ_k = (ψ(𝐱_k) - ψ(𝐱_k + 𝐩)) / (ψ̂(0) - ψ̂(𝐩))

We can then adjust Δ_k based on ρ_k.

  1. If ρ_k is negative, we cannot take this step 𝐩 and also need to reduce Δ_k;

  2. If ρ_k is a small positive number, 𝐩 is still a descent direction to use, but the approximation is not accurate and we need to use a smaller Δ_{k+1};

  3. If ρ_k is close to 1, the approximation is accurate. We could try a larger Δ_{k+1} at the next iteration.
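The adjustment rules above can be sketched as a small state update. The thresholds (0.25, 0.75) and the shrink/expand factors below are conventional textbook choices, not necessarily NPL's internal settings.

```cpp
#include <algorithm>

// Trust region radius adjustment driven by the agreement ratio rho.
struct TrustRegionStep {
    double radius;  // Delta_{k+1}
    bool accept;    // whether the trial step p is taken
};

TrustRegionStep update_radius(double rho, double delta, double delta_max) {
    if (rho < 0.25)  // poor agreement: shrink; reject the step if rho <= 0
        return { 0.25 * delta, rho > 0.0 };
    if (rho > 0.75)  // good agreement: try a larger radius next time
        return { std::min(2.0 * delta, delta_max), true };
    return { delta, true };  // acceptable agreement: keep the radius
}
```

Note the asymmetry: a step is rejected only when ρ_k ≤ 0, while the radius may shrink even for accepted steps with small positive ρ_k.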

3.1 Nearly Exact Search Direction

When the problem size is reasonable, we can truly solve problem formulation (6), which has a global optimal solution 𝐩* iff there exists λ ≥ 0 such that the following conditions hold.

  1. (𝐇(𝐱_k) + λ𝐈)𝐩* = -𝐠(𝐱_k)

  2. λ(Δ_k - ‖𝐩*‖) = 0

  3. (𝐇(𝐱_k) + λ𝐈) is positive semi-definite

We write 𝐩*(λ) as a function of λ. Once λ is determined, we can solve for 𝐩*(λ) from the first condition. The details of how to find such a λ can be found in [1], section 4.2. Mathwrist NPL makes some subtle numerical choices in its implementation. First, when λ is found by root-finding, we use a numerically stable formulation of the objective function. Secondly, when 𝐇(𝐱_k) is positive semi-definite or indefinite, we use different matrix factorization methods to efficiently handle the different cases.

3.2 Conjugate Gradient-Steihaug Direction

When the objective function ψ(𝐱) involves a very large number of unknown variables, it could be costly to compute the “exact” search direction. Note that the trust region method only requires a descent direction 𝐩 that produces a significant reduction ρ_k. So for large problems, it can be more efficient to compute a sub-optimal 𝐩 instead of 𝐩*(λ).

The Conjugate Gradient (CG)-Steihaug method computes the search direction 𝐩 based on the linear CG method with certain modifications. Let (α_i, 𝐝_i), i = 0, ⋯ be the sequence of step lengths and step directions generated by a CG method, where 𝐝_0 is always the negative gradient direction. If ‖𝐝_0‖ > Δ_k, we simply scale 𝐝_0 down to hit the trust region boundary and obtain a Cauchy point solution, which is a descent step move.

Otherwise, we let the CG method continue generating the sequence. At the i-th CG iteration, if 𝐩 = Σ_{j=0}^{i-1} α_j𝐝_j + α_i𝐝_i exceeds Δ_k, we again scale down α_i and terminate the CG. Since the 𝐝_i are conjugate descent directions, each CG iteration gains more reduction and increases the norm of 𝐩 until either the trust region boundary is hit or the residual is small enough to stop the CG iterations. So the final direction is somewhere in between the Cauchy point and the exact solution.

However, the theoretical conjugate properties and descent direction property of CG methods rely on the positive definiteness of the matrix 𝐇(𝐱_k), which is not guaranteed for a general nonlinear function ψ(𝐱). So another modification made in the CG-Steihaug method is to test the curvature of 𝐇(𝐱_k) at each CG iteration. If 𝐝_jᵀ𝐇(𝐱_k)𝐝_j > 0, ∀j < i and 𝐝_iᵀ𝐇(𝐱_k)𝐝_i > 0, then the same linear independence argument for the CG method holds here as well. It follows that the sequence {𝐝_j} up to the i-th CG iteration has all the nice properties.

If 𝐝_iᵀ𝐇(𝐱_k)𝐝_i ≤ 0, the CG iteration is immediately stopped. Let 𝐪 = Σ_{j=0}^{i-1} α_j𝐝_j, which surely is a descent direction. It is possible to gain further reduction by finding a scalar τ and setting the trust region direction 𝐩 = 𝐪 + τ𝐝_i. Using the conjugate property, it can be shown that the total model reduction is

Δψ̂ = ψ̂(𝐪 + τ𝐝_i) - ψ̂(0)
    = [𝐠(𝐱_k)ᵀ𝐪 + ½ 𝐪ᵀ𝐇(𝐱_k)𝐪] + [τ 𝐠(𝐱_k)ᵀ𝐝_i + ½ τ² 𝐝_iᵀ𝐇(𝐱_k)𝐝_i]

where the first bracket is the 𝐪 reduction component and the second bracket is the 𝐝_i reduction component. The second order term in the 𝐝_i reduction component is non-positive (zero in the zero curvature case). Once τ has the correct sign, the larger τ, the bigger the reduction. We scale τ such that ‖𝐩‖ = Δ_k.
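Scaling to the boundary amounts to solving a scalar quadratic: for an interior point 𝐪 (‖𝐪‖ < Δ_k) and direction 𝐝_i, the positive root τ of ‖𝐪 + τ𝐝_i‖² = Δ_k² is available in closed form. A sketch, taking the inner products as inputs:

```cpp
#include <cmath>

// Positive root tau of ||q + tau*d||^2 = delta^2, written in terms of
// the inner products qq = q^T q, qd = q^T d, dd = d^T d.  Since
// ||q|| < delta, the discriminant is strictly positive.
double boundary_tau(double qq, double qd, double dd, double delta) {
    double disc = qd * qd - dd * (qq - delta * delta);
    return (-qd + std::sqrt(disc)) / dd;
}
```

For example, from the origin (𝐪 = 0) with a unit direction and Δ_k = 2, the boundary is reached at τ = 2; from 𝐪 = (1, 0) along 𝐝 = (1, 0) it is reached at τ = 1.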

We use two parameters, “exactness” 0 < ξ < 1 and “tolerance” δ > 0, to control the termination of solving a CG subproblem. At the i-th CG iteration, if the residual 𝐫 = 𝐇(𝐱_k)𝐪 + 𝐠(𝐱_k) satisfies ‖𝐫‖/‖𝐠(𝐱_k)‖ ≤ 1 - ξ or ‖𝐫‖_∞ ≤ δ, we stop the CG iterations and set the trust region step 𝐩 = 𝐪.

References

  • [1] Jorge Nocedal and Stephen J. Wright: Numerical Optimization, Springer, 1999
  • [2] Philip E. Gill, Walter Murray and Margaret H. Wright: Practical Optimization, Academic Press, 1981
  • [3] Philip E. Gill and Walter Murray: Newton-Type Methods for Unconstrained and Linearly Constrained Optimization, Mathematical Programming 7 (1974), pp. 311-350
  • [4] Philip E. Gill, G. H. Golub, Walter Murray and Michael A. Saunders: Methods for Modifying Matrix Factorizations, Mathematics of Computation, Volume 28, Number 126, April 1974, pp. 505-535
  • [5] Philip E. Gill, Walter Murray and Michael A. Saunders: Methods for Computing and Modifying the LDV Factors of a Matrix, Mathematics of Computation, Volume 29, Number 132, October 1975, pp. 1051-1077
  • [6] Anders Forsgren, Philip E. Gill and Walter Murray: Computing Modified Newton Directions Using a Partial Cholesky Factorization, SIAM J. Sci. Comput., Vol. 16, No. 1, pp. 139-150
  • [7] J. E. Dennis and Robert B. Schnabel: A New Derivation of Symmetric Positive Definite Secant Updates, CU-CS-185-80, Computer Science Technical Reports, University of Colorado at Boulder, August 1980