Unconstrained Optimization
Mathwrist White Paper Series
Abstract
This document presents a technical overview of the unconstrained optimization feature for n-dimensional smooth functions in Mathwrist's C++ Numerical Programming Library (NPL). NPL unconstrained optimization includes a set of line search based methods and a set of trust region based methods.
1 Introduction
Let $f(x): \mathbb{R}^n \to \mathbb{R}$ be a twice continuously differentiable function with gradient vector $g(x) = \nabla f(x)$ and Hessian matrix $H(x) = \nabla^2 f(x)$. Unconstrained optimization solves the following minimization problem,

$$\min_{x \in \mathbb{R}^n} f(x).$$
For brevity, we write $g$ and $H$ for the gradient and Hessian when the context has no ambiguity. Let $p$ be a step move at the current point $x$. If $g^T p < 0$, we call $p$ a descent direction. By Taylor expansion, we can write

$$f(x + p) = f(x) + g(x)^T p + \frac{1}{2} p^T H(x + tp)\, p \qquad (1)$$

for some $t \in (0, 1)$. It is then clear that an optimal point $x^*$ needs to satisfy the following conditions:
1. Stationary condition: $g(x^*) = 0$.
2. Curvature condition: $H(x^*)$ is positive semi-definite (necessary) or positive definite (sufficient).
NPL provides a set of line search based methods and a set of trust region based methods, derived from the LineSearch and TrustRegion base classes respectively. All algorithms iteratively generate a sequence of improved data points converging to $x^*$. The algorithm terminates at iteration $k$ when $\|g(x_k)\| \le \epsilon$ for some tolerance $\epsilon$ and $H(x_k)$ is at least positive semi-definite.
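The two termination tests above can be illustrated for a 2-dimensional problem. The helper below is a toy sketch, not part of NPL's interface; it uses the fact that a symmetric 2x2 matrix is positive semi-definite iff its trace and determinant are both non-negative.

```python
import math

def is_optimal_2d(g, H, eps=1e-8):
    """Toy optimality check for a 2-D problem: stationarity plus curvature."""
    stationary = math.hypot(g[0], g[1]) < eps
    # A symmetric 2x2 matrix is PSD iff trace >= 0 and det >= 0.
    psd = (H[0][0] + H[1][1] >= 0.0) and \
          (H[0][0]*H[1][1] - H[0][1]*H[1][0] >= 0.0)
    return stationary and psd
```

For example, a zero gradient with an indefinite Hessian (a saddle point) fails the curvature test.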
2 Line Search Method
At each iteration $k$, a line search method first computes a descent direction $p_k$ and then computes a step length $\alpha_k$ along $p_k$ such that $x_{k+1} = x_k + \alpha_k p_k$ produces sufficient improvement.
2.1 Step Length Requirement
A good step length $\alpha_k$ not only generates sufficient decrease of the objective but also produces a "flatter" gradient, i.e. closer to stationary. These requirements can be stated as

$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k\, g_k^T p_k,$$
$$g(x_k + \alpha_k p_k)^T p_k \ge c_2\, g_k^T p_k,$$

with $0 < c_1 < c_2 < 1$, jointly known as the Wolfe conditions.
An overshooting step length could still satisfy the Wolfe conditions. To restrict $\alpha_k$ to a small neighborhood of the optimal step length, the strong Wolfe conditions enforce that

$$\left| g(x_k + \alpha_k p_k)^T p_k \right| \le c_2 \left| g_k^T p_k \right|,$$

where $0 < c_1 < c_2 < 1$. The Wolfe and strong Wolfe conditions are theoretically sound. It can be shown that if $f$ is continuously differentiable and bounded below along the descent direction $p_k$, then there exists an interval of step lengths in which both the Wolfe and strong Wolfe conditions are satisfied.
By default, NPL tests a trial step length using the strong Wolfe conditions. Typically $c_1$ is recommended to be a small number, so that the quadratic term in (1) does not dominate. If $c_2$ is set to a very small number, the algorithm behaves as an "exact" line search method in the sense that $\alpha_k$ is very close to the optimal step length, at the cost of more trial step iterations. The default parameter values require very little work, i.e. 1 or 2 iterations, to find an acceptable step length.
Because the setting of the Wolfe conditions is critical to the performance of a line search method, we do not expose the setting function as a public user control. Instead, concrete line search derived classes can call a protected member function to set the Wolfe conditions as appropriate to the relevant context, internally in the library implementation.
2.2 Step Length Algorithm
Fixing the search direction $p_k$, we can write the objective function as a function of the step length, $\phi(\alpha) = f(x_k + \alpha p_k)$, and search for the optimal step length with the Wolfe conditions as early termination criteria. An outline of the line search algorithm can be found in [1], section 3.4. Our implementation overall follows this search strategy. When computing a trial step length, we use bisection or safeguarded quadratic and cubic polynomial interpolation to handle different situations.
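The search strategy can be sketched as follows. This is a minimal illustration using bisection only (no polynomial interpolation), and the constants `c1`, `c2`, `alpha_max` and the helper name are assumptions for the example, not NPL's defaults.

```python
def strong_wolfe_search(phi, dphi, c1=1e-4, c2=0.1, alpha_max=10.0, max_iter=50):
    """Find a step length satisfying the strong Wolfe conditions for
    phi(a) = f(x + a*p) with derivative dphi(a); requires dphi(0) < 0."""
    phi0, dphi0 = phi(0.0), dphi(0.0)
    lo, hi = 0.0, alpha_max
    a = 1.0
    for _ in range(max_iter):
        if phi(a) > phi0 + c1 * a * dphi0:      # sufficient decrease failed
            hi = a                              # step is too long
        elif abs(dphi(a)) > c2 * abs(dphi0):    # curvature condition failed
            if dphi(a) < 0:
                lo = a                          # still going downhill
            else:
                hi = a
        else:
            return a                            # both strong Wolfe conditions hold
        a = 0.5 * (lo + hi)                     # bisection trial step
    return a
```

For instance, on $\phi(\alpha) = (\alpha - 2)^2$ the returned step lands near the exact minimizer $\alpha = 2$ after a handful of bisections.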
2.3 Search Directions
We provide four types of line search based methods: Steepest Descent, Modified Newton, Quasi-Newton and Conjugate Gradient. They all share the same step length algorithm and differ in how the descent search direction is calculated.
2.3.1 Steepest Descent
Taking only the first order Taylor expansion $f(x_k + p) \approx f(x_k) + g_k^T p$, Steepest Descent simply chooses the search direction $p_k = -g_k$ at each step.
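A complete steepest descent loop can then be sketched as below. The test problem, the backtracking (Armijo-only) step length and the iteration cap are assumptions for brevity; the text pairs the direction with the full Wolfe line search instead.

```python
def f(x):                          # example objective f(x, y) = (x-1)^2 + 4(y+2)^2
    return (x[0] - 1.0)**2 + 4.0*(x[1] + 2.0)**2

def grad(x):                       # its gradient
    return [2.0*(x[0] - 1.0), 8.0*(x[1] + 2.0)]

x = [0.0, 0.0]
for _ in range(200):
    g = grad(x)
    p = [-g[0], -g[1]]             # steepest descent direction p = -g
    a = 1.0
    while f([x[0] + a*p[0], x[1] + a*p[1]]) > \
          f(x) + 1e-4*a*(g[0]*p[0] + g[1]*p[1]):
        a *= 0.5                   # backtrack until sufficient decrease
    x = [x[0] + a*p[0], x[1] + a*p[1]]
```

The iterates converge to the minimizer $(1, -2)$ at a linear rate, as expected for steepest descent.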
2.3.2 Modified Newton
Assume $H_k$ is positive definite and take the second order Taylor expansion as a model function $m_k(p) = f(x_k) + g_k^T p + \frac{1}{2} p^T H_k p$. Applying the stationary condition to $m_k(p)$, we get the Newton direction

$$p_k = -H_k^{-1} g_k.$$
Also, if the objective function is indeed quadratic, the Newton direction reaches the minimum in one unit step length. It can be shown that if $x_k$ lands in a nearby region of $x^*$, the Newton method has a second order convergence rate.
When the Hessian matrix is indefinite or close to singular, we use a modified Cholesky decomposition method, described briefly in [2] pp. 109-111 and in detail in [3], to construct a positive definite matrix

$$\bar{H} = H + E,$$

where $E$ is a diagonal modification matrix chosen such that $\|E\|$ is minimized. The Newton direction is then computed as $p_k = -\bar{H}^{-1} g_k$.
The modified Cholesky decomposition has another advantage when $g_k$ vanishes at a saddle point. In that case the Newton direction is 0, but the modified Cholesky procedure is able to identify an index $j$. The search direction computed from $L^T p = e_j$ is a negative curvature direction, where $e_j$ is a vector of 0 elements except that the $j$-th element is 1.
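The flavor of the modified Newton direction can be sketched as below. Note this is a simplified stand-in: instead of the Gill-Murray modified Cholesky the library uses, the sketch adds a multiple $\tau I$ of the identity until a plain Cholesky factorization succeeds, then solves $\bar{H} p = -g$; the shift schedule and helper names are assumptions.

```python
import math

def cholesky(A):
    """Return lower triangular L with A = L L^T, or None if A is not SPD."""
    n = len(A)
    L = [[0.0]*n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k]*L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return None            # not positive definite
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def modified_newton_direction(H, g):
    n = len(g)
    tau = 0.0
    while True:                            # grow the diagonal shift until SPD
        Hbar = [[H[i][j] + (tau if i == j else 0.0) for j in range(n)]
                for i in range(n)]
        L = cholesky(Hbar)
        if L is not None:
            break
        tau = max(2.0*tau, 1e-3)
    y = [0.0]*n                            # forward solve L y = -g
    for i in range(n):
        y[i] = (-g[i] - sum(L[i][k]*y[k] for k in range(i))) / L[i][i]
    p = [0.0]*n                            # backward solve L^T p = y
    for i in reversed(range(n)):
        p[i] = (y[i] - sum(L[k][i]*p[k] for k in range(i + 1, n))) / L[i][i]
    return p
```

On a positive definite Hessian no shift is applied and the exact Newton direction is returned; on an indefinite Hessian the shifted direction is still a descent direction.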
2.3.3 Quasi Newton
Let $s_k = x_{k+1} - x_k$ be the step move from the current point $x_k$ and express the gradient by its first order Taylor expansion, $g(x_k + s_k) \approx g_k + H_k s_k$. We have the secant equation $H_k s_k \approx y_k$, where $y_k = g_{k+1} - g_k$.
Because the sequence of $(s_k, y_k)$ carries curvature information, the idea of the Quasi-Newton method is to construct a Hessian approximation matrix $B_{k+1}$ from the sequence of $(s_k, y_k)$ such that
1. $B_{k+1} s_k = y_k$ holds, known as the Quasi-Newton condition.
2. Assuming a symmetric positive definite matrix $B_k$ is given, once $s_k$ is computed, we update $B_{k+1} = B_k + \Delta B_k$ with an update matrix $\Delta B_k$. We want $B_{k+1}$ to be positive definite as well.
The Quasi-Newton direction is then computed by solving $B_k p_k = -g_k$. The well known "Broyden-Fletcher-Goldfarb-Shanno" (BFGS) update scheme constructs $B_{k+1}$ as the following,

$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}.$$

Because $s_k = \alpha_k p_k$ and $p_k$ is computed from the Quasi-Newton equation $B_k p_k = -g_k$, this update can also be written as

$$B_{k+1} = B_k + \frac{g_k g_k^T}{g_k^T p_k} + \frac{y_k y_k^T}{y_k^T s_k}.$$
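The BFGS update above can be sketched with plain Python lists for a small dense matrix; the helper name is hypothetical.

```python
def bfgs_update(B, s, y):
    """One BFGS update: B - (B s s^T B)/(s^T B s) + (y y^T)/(y^T s)."""
    n = len(s)
    Bs = [sum(B[i][j]*s[j] for j in range(n)) for i in range(n)]
    sBs = sum(s[i]*Bs[i] for i in range(n))     # s^T B s > 0 when B is SPD
    ys = sum(y[i]*s[i] for i in range(n))       # y^T s > 0 under Wolfe conditions
    return [[B[i][j] - Bs[i]*Bs[j]/sBs + y[i]*y[j]/ys for j in range(n)]
            for i in range(n)]
```

After the update, the new matrix satisfies the Quasi-Newton condition $B_{k+1} s_k = y_k$ by construction, and symmetry is preserved.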
2.3.4 Conjugate Gradient
The Conjugate Gradient (CG) method is an iterative method to solve large linear systems $Ax = b$, where $A$ is symmetric positive definite. The solution of the linear system is the same as the solution of minimizing the quadratic objective function $f(x) = \frac{1}{2} x^T A x - b^T x$.
If $f$ is a general smooth convex function, we can approximate $f$ in a small neighbourhood of $x_k$ by a quadratic model function. The search direction $p_k$ can be computed from a CG method and the step length moving along $p_k$ is determined by a line search algorithm.
In the case of solving linear systems, the CG step direction at the $k$-th iteration is computed from the previous direction $p_{k-1}$, the previous gradient $g_{k-1}$ and the current gradient $g_k$ as,

$$g_k = g_{k-1} + \alpha_{k-1} A p_{k-1} \qquad (2)$$
$$p_k = -g_k + \beta_k p_{k-1} \qquad (3)$$
$$\beta_k = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}} \qquad (4)$$
When minimizing a nonlinear convex function $f$, the variation is in how to compute the coefficient $\beta_k$ used in equation (3). This is because $\beta_k$ computed from equation (4) does not necessarily produce a descent direction in the nonlinear case. To see this, we multiply both sides of equation (3) by $g_k^T$ and get

$$g_k^T p_k = -\|g_k\|^2 + \beta_k\, g_k^T p_{k-1}. \qquad (5)$$

If the step length at the previous iteration was chosen by an exact line search, i.e. $\alpha_{k-1}$ reached the local minimum along $p_{k-1}$, we have $g_k^T p_{k-1} = 0$, then $g_k^T p_k = -\|g_k\|^2 < 0$, so $p_k$ calculated from (3) surely is a descent direction. In general, $\alpha_{k-1}$ is not an exact line search step; hence the second term $\beta_k\, g_k^T p_{k-1}$ is not zero. When this term happens to be positive and dominates, $p_k$ from equation (3) is no longer a descent direction.
The Fletcher-Reeves (FR) method computes $\beta_k$ exactly using equation (4). In addition, the strong Wolfe condition with $c_2 < \frac{1}{2}$ is imposed on the step length search. It can be shown that such a choice of Wolfe condition ensures (5) is negative.
The Polak-Ribiere+ (PR+) method uses the same Wolfe condition as the FR method but chooses $\beta_k$ differently,

$$\beta_k^{PR+} = \max\left(0,\; \frac{g_k^T (g_k - g_{k-1})}{g_{k-1}^T g_{k-1}}\right).$$
We use the PR+ method in our line search conjugate gradient method because it is numerically more stable than the FR method.
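A nonlinear CG loop with the PR+ coefficient can be sketched as below. The test problem, the backtracking (Armijo-only) step length and the restart safeguard are assumptions for this illustration; the library pairs PR+ with the strong Wolfe search described earlier.

```python
def f(x):                               # example convex quadratic
    return x[0]**2 + 10.0*x[1]**2

def grad(x):
    return [2.0*x[0], 20.0*x[1]]

x = [3.0, 1.0]
g = grad(x)
p = [-g[0], -g[1]]                      # start along steepest descent
for _ in range(200):
    if g[0]*p[0] + g[1]*p[1] >= 0.0:    # safeguard: restart if not descent
        p = [-g[0], -g[1]]
    a = 1.0                             # backtracking (Armijo) step length
    while f([x[0] + a*p[0], x[1] + a*p[1]]) > \
          f(x) + 1e-4*a*(g[0]*p[0] + g[1]*p[1]):
        a *= 0.5
    x = [x[0] + a*p[0], x[1] + a*p[1]]
    g_new = grad(x)
    # PR+ coefficient: Polak-Ribiere beta clipped at zero
    beta = max(0.0, (g_new[0]*(g_new[0] - g[0]) + g_new[1]*(g_new[1] - g[1]))
                    / (g[0]**2 + g[1]**2))
    p = [-g_new[0] + beta*p[0], -g_new[1] + beta*p[1]]
    g = g_new
```

The clipping at zero is what makes PR+ robust: a negative Polak-Ribiere coefficient triggers an implicit restart along the steepest descent direction.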
3 Trust Region Method
Within a small neighborhood of the iteration point $x_k$, the trust region method approximates the original objective function by its second order Taylor expansion model function

$$m_k(p) = f(x_k) + g_k^T p + \frac{1}{2} p^T H_k p.$$

Writing the model function in terms of the step move $p$, we iteratively approach the solution of the original minimization problem by solving the subproblem below and making a series of such small step changes,

$$\min_{p}\; m_k(p), \quad \text{s.t. } \|p\| \le \Delta_k. \qquad (6)$$
The trust region radius $\Delta_k$ is chosen to ensure $m_k$ is an accurate approximation to $f$. To measure the agreement between the actual reduction and the model reduction, define

$$\rho_k = \frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}.$$

We can adjust $\Delta_k$ based on $\rho_k$.
1. If $\rho_k$ is negative, we cannot take this step and also need to reduce $\Delta_k$;
2. If $\rho_k$ is a small positive number, $p_k$ is still a descent direction to use, but the approximation is not accurate; we need to use a smaller $\Delta_k$;
3. If $\rho_k$ is close to 1, the approximation is accurate. We could try a larger $\Delta_k$ at the next iteration.
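The bookkeeping above can be sketched as a small helper. The thresholds 0.25/0.75, the shrink/grow factors and the maximum radius are common textbook choices assumed for illustration, not NPL's documented settings.

```python
def update_radius(rho, delta, step_norm, delta_max=10.0):
    """Return (accept_step, new_radius) from the agreement ratio rho."""
    if rho < 0.25:                            # poor model agreement: shrink
        delta = 0.25 * delta
    elif rho > 0.75 and abs(step_norm - delta) < 1e-12:
        delta = min(2.0 * delta, delta_max)   # good agreement on the boundary: grow
    return (rho > 0.0, delta)                 # reject the step when rho <= 0
```

Note the radius only grows when the step actually reached the boundary; an interior step already satisfies the model, so a larger radius would not change it.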
3.1 Nearly Exact Search Direction
When the problem size is reasonable, we can truly solve problem formulation (6), which has a global optimal solution $p^*$ iff there exists $\lambda \ge 0$ such that the following conditions hold:

1. $(H_k + \lambda I)\, p^* = -g_k$;
2. $\lambda\, (\Delta_k - \|p^*\|) = 0$;
3. $H_k + \lambda I$ is positive semidefinite.
We write $p$ as a function $p(\lambda)$ of $\lambda$. Once $\lambda$ is determined, we can solve for $p$ from the first condition. The details of how to find such a $\lambda$ can be found in [1], section 4.2. Mathwrist NPL makes some subtle numerical choices in its implementation. First, when $\lambda$ can be found by root-finding, we use a numerically stable formulation of the objective function. Secondly, when $H_k$ is positive semi-definite or indefinite, we use different matrix factorization methods to efficiently handle the different cases.
3.2 Conjugate Gradient-Steihaug Direction
When the objective function involves a very large number of unknown variables, it could be costly to compute the "exact" search direction $p^*$. Note that the trust region method only requires a descent direction to produce a significant reduction $f(x_k) - f(x_k + p_k)$. So for large problems, it can be more efficient to compute a sub-optimal $p_k$ instead of $p^*$.
The Conjugate Gradient (CG)-Steihaug method computes the search direction based on the linear CG method with certain modifications. Let $\{\alpha_i\}$, $\{d_i\}$ be the sequences of step lengths and step directions generated by a CG method applied to the model function, where $d_0 = -g_k$ is always the negative gradient direction. So if $\|\alpha_0 d_0\| \ge \Delta_k$, we simply scale $\alpha_0 d_0$ down to hit the trust region boundary and get a Cauchy point solution, which is a descent step move.
Otherwise, we let the CG method continue generating the sequence. At the $i$-th CG iteration, if the norm of the accumulated step $p_i$ exceeds $\Delta_k$, we again scale the step down and terminate the CG. Since the $d_i$ are mutually conjugate descent directions, each CG iteration gains more reduction and increases the norm of $p_i$ until either the trust region boundary is hit or the residual is small enough to stop the CG iterations. So the final direction is somewhere between the Cauchy point and the exact solution $p^*$.
There is a caveat, though: the theoretical conjugacy and descent direction properties of CG methods rely on the positive definiteness of the matrix $H_k$. This is not guaranteed for a general nonlinear function $f$. So another modification made in the CG-Steihaug method is to test the curvature $d_i^T H_k d_i$ at each CG iteration. If $d_j^T H_k d_j > 0$ for all $j \le i$, then the same linear independence argument for the CG method holds here as well. It follows that the sequence of directions up to the $i$-th CG iteration has all the nice properties.
If $d_i^T H_k d_i \le 0$, the CG iteration is immediately stopped at the current iterate $p_i$ with direction $d_i$, which surely is a descent direction for the model. It is possible to gain further reduction by finding a scalar $\tau$ and setting the trust region direction $p = p_i + \tau d_i$. Using the conjugate property, it can be shown that the total model reduction is

$$m_k(0) - m_k(p_i + \tau d_i) = \big[m_k(0) - m_k(p_i)\big] - \tau\, r_i^T d_i - \frac{\tau^2}{2}\, d_i^T H_k d_i,$$

where $r_i = g_k + H_k p_i$ is the CG residual at $p_i$. The second order term $\frac{\tau^2}{2} d_i^T H_k d_i$ is at most zero (exactly zero in the zero curvature case), so it never hurts the reduction. Once $\tau$ has the correct sign, the larger $|\tau|$, the bigger the reduction. We therefore scale $\tau$ such that $\|p_i + \tau d_i\| = \Delta_k$.
We use two parameters, "exactness" and "tolerance", to control the termination of solving a CG subproblem. At the $i$-th CG iteration, if the residual $r_i$ is small enough relative to these two parameters, we stop the CG iterations and set the trust region step $p_k = p_i$.
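The whole truncated CG loop described in this section can be sketched as follows. This is a plain-list illustration of the CG-Steihaug scheme, not NPL's implementation; the single `tol` parameter and the iteration cap stand in for the library's exactness/tolerance controls.

```python
import math

def _to_boundary(p, d, delta):
    """Largest tau >= 0 with ||p + tau*d|| = delta (positive quadratic root)."""
    a = sum(x*x for x in d)
    b = 2.0 * sum(p[i]*d[i] for i in range(len(p)))
    c = sum(x*x for x in p) - delta*delta
    tau = (-b + math.sqrt(b*b - 4.0*a*c)) / (2.0*a)
    return [p[i] + tau*d[i] for i in range(len(p))]

def cg_steihaug(H, g, delta, tol=1e-8, max_iter=50):
    """Approximately minimize g^T p + 0.5 p^T H p subject to ||p|| <= delta."""
    n = len(g)
    p = [0.0]*n
    r = g[:]                         # residual r = H p + g, starts at g
    d = [-gi for gi in g]            # first direction: steepest descent
    if math.sqrt(sum(ri*ri for ri in r)) < tol:
        return p
    for _ in range(max_iter):
        Hd = [sum(H[i][j]*d[j] for j in range(n)) for i in range(n)]
        dHd = sum(d[i]*Hd[i] for i in range(n))
        if dHd <= 0.0:               # negative curvature: walk to the boundary
            return _to_boundary(p, d, delta)
        alpha = sum(ri*ri for ri in r) / dHd
        p_next = [p[i] + alpha*d[i] for i in range(n)]
        if math.sqrt(sum(x*x for x in p_next)) >= delta:
            return _to_boundary(p, d, delta)   # step left the trust region
        r_next = [r[i] + alpha*Hd[i] for i in range(n)]
        if math.sqrt(sum(x*x for x in r_next)) < tol:
            return p_next            # residual small enough: accept p
        beta = sum(x*x for x in r_next) / sum(x*x for x in r)
        d = [-r_next[i] + beta*d[i] for i in range(n)]
        p, r = p_next, r_next
    return p
```

With a generous radius the routine returns the exact Newton step; with a tight radius or negative curvature it returns a boundary point, matching the three exits described above.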
References
- [1] Jorge Nocedal and Stephen J. Wright: Numerical Optimization, Springer, 1999.
- [2] Philip E. Gill, Walter Murray and Margaret H. Wright: Practical Optimization, Academic Press, 1981.
- [3] Philip E. Gill and Walter Murray: Newton-Type Methods for Unconstrained and Linearly Constrained Optimization, Mathematical Programming 7 (1974), pp. 311-350.
- [4] Philip E. Gill, G. H. Golub, Walter Murray and Michael A. Saunders: Methods for Modifying Matrix Factorizations, Mathematics of Computation, Volume 28, Number 126, April 1974, pp. 505-535.
- [5] Philip E. Gill, Walter Murray and Michael A. Saunders: Methods for Computing and Modifying the LDV Factors of a Matrix, Mathematics of Computation, Volume 29, Number 132, October 1975, pp. 1051-1077.
- [6] Anders Forsgren, Philip E. Gill and Walter Murray: Computing Modified Newton Directions Using a Partial Cholesky Factorization, SIAM J. Sci. Comput., Vol. 16, No. 1, pp. 139-150.
- [7] J. E. Dennis and Robert B. Schnabel: A New Derivation of Symmetric Positive Definite Secant Updates, CU-CS-185-80, Computer Science Technical Reports, University of Colorado at Boulder, August 1980.