Data and Model Fitting
Mathwrist White Paper Series

Copyright ©Mathwrist LLC 2023
(January 20, 2023)
Abstract

This document presents a technical overview of data and model fitting features implemented in Mathwrist’s C++ Numerical Programming Library (NPL).

1 Linear Least Square

Linear regression assumes an $n$-dimensional linear model $y = \mathbf{x}^T\beta$ parameterized by a coefficient vector $\beta$. Given $m$ observations $(y_i, \mathbf{x}_i)$, $i = 0, \dots, m-1$, we can fit the linear regression model by a least squares fit,

\[ \operatorname*{argmin}_{\beta}\; (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta) \tag{1} \]

The $i$-th element of the vector $\mathbf{y}$ and the $i$-th row of the matrix $\mathbf{X}$ are the observation $y_i$ and $\mathbf{x}_i^T$, respectively.

The solution of the linear least squares problem (1) can be obtained by solving the linear system $\mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^T\mathbf{y}$. The matrix $\mathbf{X}^T\mathbf{X}$ is at least positive semi-definite. If it is rank deficient or ill-conditioned, a better way of calibrating the model parameter $\beta$ is to add a regularization term and solve,

\[ \operatorname*{argmin}_{\beta}\; (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta) + \lambda\,\beta^T\boldsymbol{\Omega}\beta \tag{2} \]

where $\lambda > 0$ is a penalty factor and the regularization matrix $\boldsymbol{\Omega}$ is positive definite. If $\boldsymbol{\Omega}$ is the identity matrix, (2) reduces to standard ridge regression. Formulation (2) is therefore sometimes called generalized ridge regression, whose solution satisfies the closed-form linear system:

\[ (\mathbf{X}^T\mathbf{X} + \lambda\boldsymbol{\Omega})\,\beta = \mathbf{X}^T\mathbf{y} \tag{3} \]

Mathwrist NPL provides a function linear_fit::grr() to solve this generalized ridge regression problem (2).
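
For a small dense problem, the closed-form system (3) can be assembled and solved directly. The following self-contained sketch is illustrative only and is not the linear_fit::grr() API; it forms $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ for the special case $\boldsymbol{\Omega} = \mathbf{I}$ and solves it with an unpivoted Cholesky factorization. The data, sizes and the choice $\lambda = 0.1$ are made up for the example.

// Minimal illustration (not NPL's API): solve the generalized ridge
// regression normal equations (X^T X + lambda * Omega) beta = X^T y
// with Omega = I, using an unpivoted Cholesky factorization.
#include <cmath>
#include <cstdio>
#include <vector>

using Mat = std::vector<std::vector<double>>;
using Vec = std::vector<double>;

// Solve A x = b for a symmetric positive definite A (in-place Cholesky).
Vec chol_solve(Mat A, Vec b) {
    const int n = (int)b.size();
    for (int j = 0; j < n; ++j) {                 // factor A = L L^T
        for (int k = 0; k < j; ++k) A[j][j] -= A[j][k] * A[j][k];
        A[j][j] = std::sqrt(A[j][j]);
        for (int i = j + 1; i < n; ++i) {
            for (int k = 0; k < j; ++k) A[i][j] -= A[i][k] * A[j][k];
            A[i][j] /= A[j][j];
        }
    }
    for (int i = 0; i < n; ++i) {                 // forward solve L z = b
        for (int k = 0; k < i; ++k) b[i] -= A[i][k] * b[k];
        b[i] /= A[i][i];
    }
    for (int i = n - 1; i >= 0; --i) {            // back solve L^T x = z
        for (int k = i + 1; k < n; ++k) b[i] -= A[k][i] * b[k];
        b[i] /= A[i][i];
    }
    return b;
}

int main() {
    // Toy data: m = 5 observations, n = 2 coefficients, Omega = I.
    Mat X = {{1, 0.0}, {1, 1.0}, {1, 2.0}, {1, 3.0}, {1, 4.0}};
    Vec y = {1.1, 2.9, 5.2, 6.8, 9.1};
    const double lambda = 0.1;
    const int m = 5, n = 2;

    Mat A(n, Vec(n, 0.0));                        // A = X^T X + lambda * I
    Vec rhs(n, 0.0);                              // rhs = X^T y
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            rhs[j] += X[i][j] * y[i];
            for (int k = 0; k < n; ++k) A[j][k] += X[i][j] * X[i][k];
        }
    for (int j = 0; j < n; ++j) A[j][j] += lambda;

    Vec beta = chol_solve(A, rhs);
    std::printf("beta = (%.4f, %.4f)\n", beta[0], beta[1]);
    return 0;
}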

1.1 Generalized Cross Validation

Under the linear regression model assumption, we can write $\mathbf{y} = \mathbf{X}\beta + \epsilon$, where $\epsilon$ is the residual (noise) vector. In generalized ridge regression (2), the penalty factor $\lambda$ could be chosen experimentally. However, it is often desirable to choose an "optimal" penalty $\lambda^*$ automatically based on the noise level of $\epsilon$.

Generalized Cross Validation (GCV) is a very successful technique to compute $\lambda^*$ under the assumption that $\epsilon$ follows a Gaussian distribution. In GCV, the optimal penalty factor $\lambda^*$ is the solution of the following one-dimensional minimization problem,

\[ \operatorname*{argmin}_{\lambda}\; V(\lambda) = \frac{\|(\mathbf{I}-\mathbf{H}(\lambda))\,\mathbf{y}\|^2}{\big[\mathrm{Tr}(\mathbf{I}-\mathbf{H}(\lambda))\big]^2} \tag{4} \]

where $\mathbf{H}(\lambda)$ is the unique symmetric influence matrix,

\[ \mathbf{H}(\lambda) = \mathbf{X}\,(\mathbf{X}^T\mathbf{X} + \lambda\boldsymbol{\Omega})^{-1}\mathbf{X}^T \]

If a penalty parameter $\lambda \le 0$ is passed to the function linear_fit::grr(), GCV is used to compute $\lambda^*$ for solving the generalized ridge regression (2).
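
For intuition, consider the special case $\boldsymbol{\Omega} = \mathbf{I}$ (standard ridge regression). With the thin SVD $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$, singular values $s_i$ and $\mathbf{u}_i$ the $i$-th column of $\mathbf{U}$, the influence matrix diagonalizes and $V(\lambda)$ can be evaluated cheaply for many trial values of $\lambda$. This simplification is standard in the GCV literature and is shown here only as an illustration; it is not necessarily how linear_fit::grr() evaluates (4).

\[
\mathbf{H}(\lambda) = \mathbf{U}\,\mathrm{diag}\!\left(\frac{s_i^2}{s_i^2+\lambda}\right)\mathbf{U}^T,
\qquad
V(\lambda) = \frac{\displaystyle\sum_i \Big(\frac{\lambda}{s_i^2+\lambda}\Big)^2 (\mathbf{u}_i^T\mathbf{y})^2 + \|(\mathbf{I}-\mathbf{U}\mathbf{U}^T)\mathbf{y}\|^2}{\Big[(m-n) + \displaystyle\sum_i \frac{\lambda}{s_i^2+\lambda}\Big]^2}.
\]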

1.2 Linear Constraints

The linear model parameter $\beta$ may be subject to general linear constraints and simple bounds. Mathwrist NPL provides a function linear_fit::lsq() to solve the following linearly constrained linear least squares problem,

\[ \operatorname*{argmin}_{\beta}\; (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta) \quad \text{s.t.} \tag{5} \]
\[ \mathbf{b}_l \le \mathbf{A}\beta \le \mathbf{b}_u \quad \text{and} \tag{6} \]
\[ \mathbf{l} \le \beta \le \mathbf{u} \tag{7} \]

which is effectively a convex quadratic programming (QP) problem.
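
To see the QP structure explicitly, expand the objective and drop the constant $\mathbf{y}^T\mathbf{y}$:

\[
(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)
= \beta^T(\mathbf{X}^T\mathbf{X})\beta - 2(\mathbf{X}^T\mathbf{y})^T\beta + \mathbf{y}^T\mathbf{y},
\]

so (5)-(7) is a QP with positive semi-definite Hessian $2\mathbf{X}^T\mathbf{X}$, linear term $-2\mathbf{X}^T\mathbf{y}$ and the same linear constraints, hence convex.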

2 Nonlinear Least Square

Given a nonlinear mapping function $y = h(\mathbf{x};\beta)$ with model parameter $\beta$ and $m$ observations $(y_i, \mathbf{x}_i)$, $i = 0, \dots, m-1$, the nonlinear least squares fit recovers the model parameter $\beta$ by minimizing the $l_2$ norm of the residual vector,

\[ \operatorname*{argmin}_{\beta}\; \psi(\beta) = \tfrac{1}{2}\|\mathbf{r}(\beta)\|_2^2 \tag{8} \]

where the $i$-th element is $r_i(\beta) = h(\mathbf{x}_i;\beta) - y_i$. Let $\mathbf{J}(\beta)$ be the Jacobian matrix of the residual vector $\mathbf{r}(\beta)$. The gradient and Hessian of $\psi(\beta)$ are

\[ \nabla\psi(\beta) = \sum_{i=0}^{m-1} r_i(\beta)\,\nabla r_i(\beta) = \mathbf{J}^T(\beta)\,\mathbf{r}(\beta) \tag{9} \]
\[ \nabla^2\psi(\beta) = \sum_{i=0}^{m-1} \nabla r_i(\beta)\,\nabla r_i^T(\beta) + \sum_{i=0}^{m-1} r_i(\beta)\,\nabla^2 r_i(\beta) \tag{10} \]
\[ \phantom{\nabla^2\psi(\beta)} = \mathbf{J}^T(\beta)\,\mathbf{J}(\beta) + \sum_{i=0}^{m-1} r_i(\beta)\,\nabla^2 r_i(\beta) \tag{11} \]

In other words, the gradient of the nonlinear least squares objective function $\psi(\beta)$ in (8) is the residual-weighted sum of the gradients of the individual residuals. The Hessian of $\psi(\beta)$ has a first-order term $\mathbf{J}^T(\beta)\mathbf{J}(\beta)$, which is at least positive semi-definite, and a second-order term $\mathbf{Q}(\beta) = \sum_{i=0}^{m-1} r_i(\beta)\,\nabla^2 r_i(\beta)$, which is a residual-weighted sum of the Hessians of the individual residual functions.
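
As a concrete (hypothetical) example, for the one-dimensional exponential model $h(x;\beta) = \beta_0 e^{\beta_1 x}$ the ingredients above are

\[
r_i(\beta) = \beta_0 e^{\beta_1 x_i} - y_i,\qquad
\nabla r_i(\beta) = \begin{pmatrix} e^{\beta_1 x_i} \\ \beta_0\, x_i\, e^{\beta_1 x_i} \end{pmatrix},\qquad
\nabla^2 r_i(\beta) = \begin{pmatrix} 0 & x_i e^{\beta_1 x_i} \\ x_i e^{\beta_1 x_i} & \beta_0\, x_i^2\, e^{\beta_1 x_i} \end{pmatrix},
\]

and the $i$-th row of $\mathbf{J}(\beta)$ is $\nabla r_i^T(\beta)$.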

NPL provides a modified Gauss-Newton method and a Levenberg-Marquardt method to solve the unconstrained nonlinear least squares problem (8). The modified Gauss-Newton method uses a line search based algorithm with additional care given to the second-order Hessian term $\mathbf{Q}(\beta)$. Our implementation of the Levenberg-Marquardt method is a special case of the general trust region based algorithm.

2.1 Modified Gauss-Newton

The classic Gauss-Newton method approximates the true Hessian matrix in equation (11) by $\mathbf{J}^T(\beta)\mathbf{J}(\beta)$. It uses a line search algorithm and computes a Newton search direction $\mathbf{p}$ at each iteration,

\[ \mathbf{J}^T(\beta)\mathbf{J}(\beta)\,\mathbf{p} = -\nabla\psi(\beta) = -\mathbf{J}^T(\beta)\,\mathbf{r}(\beta) \]

This Hessian approximation has a few limitations. First, if the Jacobian matrix $\mathbf{J}(\beta)$ does not have full rank, the model calibration becomes unstable. Second, if the residual $\mathbf{r}(\beta)$ is naturally large or non-negligible at some point of the calibration process, ignoring the second-order term $\mathbf{Q}(\beta)$ in equation (11) produces a poor search direction $\mathbf{p}$.

To address this, we implemented a modified Gauss-Newton method that follows an idea similar to [2], chapter 4. Effectively, at the $k$-th line search iteration, we recompute the SVD $\mathbf{J}(\beta) = \mathbf{U}\mathbf{S}\mathbf{V}^T$. It can be shown that the Newton direction $\mathbf{p}$ with respect to the true Hessian solves

\[ (\mathbf{S}^2\mathbf{V}^T + \mathbf{V}^T\mathbf{Q}(\beta))\,\mathbf{p} = -\mathbf{S}\,\bar{\mathbf{r}}(\beta) \tag{12} \]

where $\bar{\mathbf{r}}(\beta) = \mathbf{U}^T\mathbf{r}(\beta)$. Let $\mathbf{S}_d$ be the leading submatrix of the $d$ dominant singular values in $\mathbf{S}$. Accordingly, let $\mathbf{V}_d$ be the first $d$ columns of $\mathbf{V}$, in other words the principal components. We test whether $\|\mathbf{r}(\beta)\|$ is small enough relative to the smallest singular value in $\mathbf{S}_d$. If so, we can drop the second-order term $\mathbf{Q}(\beta)$ from the true Hessian $\nabla^2\psi(\beta)$ and just compute a direction $\mathbf{p} = \mathbf{V}_d\mathbf{p}_d$ spanned by the principal components. Let $\bar{\mathbf{r}}_d(\beta)$ be the first $d$ elements of $\bar{\mathbf{r}}(\beta)$. Equation (12) then reduces to,

\[ \mathbf{S}_d\,\mathbf{p}_d = -\bar{\mathbf{r}}_d(\beta) \]

If $\mathbf{Q}(\beta)$ cannot be ignored, we compute it by finite differences and solve for the direction $\mathbf{p}$ in the full space of $\mathbf{V}$, $\mathbf{p} = \mathbf{V}\bar{\mathbf{p}}$. Equation (12) becomes,

\[ (\mathbf{S}^2 + \mathbf{V}^T\mathbf{Q}(\beta)\mathbf{V})\,\bar{\mathbf{p}} = -\mathbf{S}\,\bar{\mathbf{r}}(\beta) \tag{13} \]

The second-order term $\mathbf{Q}(\beta)$ could be indefinite. We use a modified Cholesky factorization to solve equation (13). This is similar to the modified Newton method documented in our unconstrained optimization white paper.
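
For reference, equation (12) follows from substituting the SVD into the full Newton system: with $\mathbf{J}^T(\beta)\mathbf{J}(\beta) = \mathbf{V}\mathbf{S}^2\mathbf{V}^T$ and $\nabla\psi(\beta) = \mathbf{V}\mathbf{S}\mathbf{U}^T\mathbf{r}(\beta)$,

\[
\big(\mathbf{V}\mathbf{S}^2\mathbf{V}^T + \mathbf{Q}(\beta)\big)\,\mathbf{p} = -\mathbf{V}\mathbf{S}\,\bar{\mathbf{r}}(\beta),
\]

and left-multiplying both sides by $\mathbf{V}^T$ (using $\mathbf{V}^T\mathbf{V} = \mathbf{I}$) gives $(\mathbf{S}^2\mathbf{V}^T + \mathbf{V}^T\mathbf{Q}(\beta))\,\mathbf{p} = -\mathbf{S}\,\bar{\mathbf{r}}(\beta)$.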

Our modified Gauss-Newton method terminates whenever any of the following conditions is satisfied in a line search iteration:

  1. the residual satisfies $\|\mathbf{r}(\beta)\| < \epsilon_r$, where $\epsilon_r$ is the residual tolerance;

  2. the maximum number of iterations $M$ is reached;

  3. the gradient in equation (9) satisfies $\|\mathbf{J}^T(\beta)\mathbf{r}(\beta)\| < \epsilon_c$, where $\epsilon_c$ is the convergence tolerance.
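
For illustration, the following self-contained sketch applies the classic (unmodified) Gauss-Newton iteration with a simple halving line search to the hypothetical toy model $h(x;\beta) = \beta_0 e^{\beta_1 x}$ used above. It is not NPL's implementation; it omits the SVD-based modification, the rank handling and the $\mathbf{Q}(\beta)$ correction described in this section, and the data are made up.

// Illustrative sketch only (not NPL's implementation): classic Gauss-Newton
// for the toy model h(x; b) = b0 * exp(b1 * x), with a simple halving line
// search on psi(b) = 0.5 * ||r(b)||^2.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> x = {0.0, 0.5, 1.0, 1.5, 2.0};
    std::vector<double> y = {2.1, 3.4, 5.4, 9.2, 14.7};   // roughly 2*exp(x)
    double b0 = 1.0, b1 = 0.5;                            // initial guess

    // Objective psi(b) = 0.5 * sum of squared residuals.
    auto psi = [&](double a0, double a1) {
        double s = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            double r = a0 * std::exp(a1 * x[i]) - y[i];
            s += 0.5 * r * r;
        }
        return s;
    };

    for (int it = 0; it < 50; ++it) {
        // Accumulate the 2x2 matrix J^T J and the gradient g = J^T r.
        double A00 = 0, A01 = 0, A11 = 0, g0 = 0, g1 = 0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            double e = std::exp(b1 * x[i]);
            double r = b0 * e - y[i];
            double j0 = e, j1 = b0 * x[i] * e;             // row of J
            A00 += j0 * j0; A01 += j0 * j1; A11 += j1 * j1;
            g0 += j0 * r;   g1 += j1 * r;
        }
        if (std::sqrt(g0 * g0 + g1 * g1) < 1e-10) break;   // gradient test

        // Solve J^T J p = -g for the 2x2 system (Cramer's rule).
        double det = A00 * A11 - A01 * A01;
        double p0 = (-g0 * A11 + g1 * A01) / det;
        double p1 = (-g1 * A00 + g0 * A01) / det;

        // Halve the step until psi decreases (crude line search).
        double alpha = 1.0, base = psi(b0, b1);
        while (alpha > 1e-8 && psi(b0 + alpha * p0, b1 + alpha * p1) > base)
            alpha *= 0.5;
        b0 += alpha * p0;
        b1 += alpha * p1;
    }
    std::printf("fitted b0 = %.4f, b1 = %.4f\n", b0, b1);
    return 0;
}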

2.2 Levenberg-Marquardt

The Levenberg-Marquardt (LM) method is a special case of the trust region algorithm that uses $\mathbf{J}^T(\beta)\mathbf{J}(\beta)$ to approximate the true Hessian (11). At the $k$-th trust region iteration, we solve the subproblem

\[ \operatorname*{argmin}_{\mathbf{p}}\; \mathbf{p}^T\mathbf{J}^T(\beta)\mathbf{r}(\beta) + \tfrac{1}{2}\,\mathbf{p}^T\mathbf{J}^T(\beta)\mathbf{J}(\beta)\,\mathbf{p}, \quad \text{s.t. } \|\mathbf{p}\| \le \Delta_k \tag{14} \]

$\mathbf{p}^*$ is a solution of the trust region subproblem (14) if and only if there exists $\lambda \ge 0$ such that

  1. $(\mathbf{J}^T(\beta)\mathbf{J}(\beta) + \lambda\mathbf{I})$ is positive semidefinite,

  2. $(\mathbf{J}^T(\beta)\mathbf{J}(\beta) + \lambda\mathbf{I})\,\mathbf{p}^* = -\mathbf{J}^T(\beta)\mathbf{r}(\beta)$, and

  3. $\lambda\,(\Delta_k - \|\mathbf{p}^*\|) = 0$.

The first condition is automatically satisfied in the LM case. Write $\mathbf{p}(\lambda)$ as the function of $\lambda$ obtained from the second condition,

\[ \mathbf{p}(\lambda) = -(\mathbf{J}^T(\beta)\mathbf{J}(\beta) + \lambda\mathbf{I})^{-1}\mathbf{J}^T(\beta)\,\mathbf{r}(\beta) \tag{15} \]

If $\|\mathbf{p}(\lambda=0)\| < \Delta_k$, then $\mathbf{p}(\lambda=0)$ is an exact solution of the trust region subproblem (14). Otherwise we can always find a $\lambda \in (0, \infty)$ such that $\|\mathbf{p}(\lambda)\| = \Delta_k$. The main cost of root finding for $\lambda$ is repeatedly solving equation (15). Let $\mathbf{J}(\beta)$ have the QR decomposition $\mathbf{J}(\beta) = \mathbf{Q}\binom{\mathbf{R}}{\mathbf{0}}$. Based on the idea in [1], section 10.2, we can economically obtain an upper triangular $\mathbf{R}_\lambda$ from $\mathbf{R}$ such that

\[ \mathbf{R}_\lambda^T\mathbf{R}_\lambda = \mathbf{J}^T(\beta)\mathbf{J}(\beta) + \lambda\mathbf{I} \]
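
One standard way to obtain such an $\mathbf{R}_\lambda$ (sketched here only as an illustration) is to append $\sqrt{\lambda}\,\mathbf{I}$ below $\mathbf{R}$ and re-triangularize with a small QR factorization:

\[
\begin{pmatrix}\mathbf{R}\\ \sqrt{\lambda}\,\mathbf{I}\end{pmatrix} = \tilde{\mathbf{Q}}\,\mathbf{R}_\lambda
\quad\Longrightarrow\quad
\mathbf{R}_\lambda^T\mathbf{R}_\lambda = \mathbf{R}^T\mathbf{R} + \lambda\mathbf{I} = \mathbf{J}^T(\beta)\mathbf{J}(\beta) + \lambda\mathbf{I},
\]

so each trial value of $\lambda$ only requires triangular solves with $\mathbf{R}_\lambda$ rather than refactoring $\mathbf{J}(\beta)$.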

Also, as suggested in [1] section 4.2, we perform root finding on the more numerically stable formulation $\frac{1}{\Delta_k} - \frac{1}{\|\mathbf{p}(\lambda)\|} = 0$. The LM method terminates whenever any of the following conditions is satisfied after solving a trust region subproblem (14):

  1. the residual satisfies $\|\mathbf{r}(\beta)\| < \epsilon_r$, where $\epsilon_r$ is the residual tolerance;

  2. the maximum number of iterations $M$ is reached;

  3. the gradient in equation (9) satisfies $\|\mathbf{J}^T(\beta)\mathbf{r}(\beta)\| < \epsilon_c$, where $\epsilon_c$ is the convergence tolerance.

2.3 Regularization and Constraints

In practice, it is often desirable to regularize the model parameter $\beta$ and perhaps impose additional constraints. Mathwrist NPL offers a general nonlinear least squares fit method that recovers model parameters from the following optimization problem,

\[ \operatorname*{argmin}_{\beta}\; \psi(\beta) = \tfrac{1}{2}\|\mathbf{r}(\beta)\|_2^2 + \lambda\,\beta^T\boldsymbol{\Omega}\beta \quad \text{s.t.} \tag{16} \]
\[ \mathbf{b}_l \le \mathbf{A}\beta \le \mathbf{b}_u \quad \text{and} \tag{17} \]
\[ \mathbf{l} \le \beta \le \mathbf{u} \tag{18} \]

where again $r_i(\beta) = h(\mathbf{x}_i;\beta) - y_i$ and the second term is a regularization term. $\boldsymbol{\Omega}$ is at least positive semidefinite and typically chosen to make the model "smooth", or close to a certain desired level. The penalty factor $\lambda > 0$ can be a user input or computed automatically by the Generalized Cross Validation (GCV) method.
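
Assuming a symmetric $\boldsymbol{\Omega}$, the regularization term simply adds its own gradient and Hessian to (9)-(11):

\[
\nabla\psi(\beta) = \mathbf{J}^T(\beta)\mathbf{r}(\beta) + 2\lambda\boldsymbol{\Omega}\beta,
\qquad
\nabla^2\psi(\beta) = \mathbf{J}^T(\beta)\mathbf{J}(\beta) + \mathbf{Q}(\beta) + 2\lambda\boldsymbol{\Omega}.
\]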

If the constraint specifications (17)-(18) are not present, our general nonlinear least squares fit uses the line search modified Newton method to solve an unconstrained optimization problem. Otherwise, it solves a linearly constrained optimization problem using a combination of the line search modified Newton method and the active-set method. The algorithm terminates whenever any of the following conditions is satisfied at a major iteration:

  1. the residual satisfies $\|\mathbf{r}(\beta)\| < \epsilon_r$, where $\epsilon_r$ is the residual tolerance;

  2. the maximum number of iterations $M$ is reached;

  3. both the stationarity and convergence conditions are satisfied with tolerance $\epsilon_c$, where $\epsilon_c$ is the convergence tolerance.

3 Curve Fitting

Curve fitting is a one-dimensional function approximation problem. Given a set of data points $(x_i, y_i)$ observed from an unknown function $y = \tilde{f}(x)$, with $x_i \in [a,b]$ for $i = 0, \dots, m-1$, we want to approximate $\tilde{f}(x)$ by a smooth curve $f(x;\theta)$ parameterized by $\theta$.

Let $\phi^T(x) = (\phi_0(x), \dots, \phi_{n-1}(x))$ be a vector of $n$ basis functions of a certain form. We can then write the curve approximation function $f(x;\theta) = \phi(x)^T\theta$ as a linear combination determined by the basis vector and the coefficient vector $\theta^T = (\theta_0, \dots, \theta_{n-1})$.

To recover the model parameter $\theta$ from the observed data points $(x_i, y_i)$, we offer two curve fitting methods. The first approach is a regularized least squares fit. The second approach minimizes a curve roughness measure subject to fitting error constraints.

3.1 Formulation

In the first approach, we construct a basis matrix $\boldsymbol{\Phi}(\mathbf{x})$ whose $(i,j)$-th element is $\Phi_{i,j} = \phi_j(x_i)$. The sum of squared residuals (SSR) is

\[ \sum_{i=0}^{m-1} \big(y_i - f(x_i;\theta)\big)^2 = (\mathbf{y} - \boldsymbol{\Phi}(\mathbf{x})\theta)^T(\mathbf{y} - \boldsymbol{\Phi}(\mathbf{x})\theta) \]

We recover the curve model parameter $\theta$ from the data $(\mathbf{x}, \mathbf{y})$ with the desired "smoothness" property by solving a regularized least squares problem,

\[ \operatorname*{argmin}_{\theta}\; (\mathbf{y} - \boldsymbol{\Phi}(\mathbf{x})\theta)^T(\mathbf{y} - \boldsymbol{\Phi}(\mathbf{x})\theta) + \lambda\,\theta^T\boldsymbol{\Omega}\theta \tag{19} \]

The second term in formulation (19) is a regularization term that penalizes the roughness of the curve $f(x)$. The roughness matrix $\boldsymbol{\Omega}$ needs to be at least positive semi-definite. The penalty factor $\lambda > 0$ can be a user input or computed automatically by the GCV method.

Alternatively, we can formulate the smooth curve fitting problem as minimizing the roughness of the curve subject to the constraint that the fitting errors stay within a certain tolerance level,

\[ \operatorname*{argmin}_{\theta}\; \theta^T\boldsymbol{\Omega}\theta \quad \text{s.t.} \tag{20} \]
\[ -\epsilon < \boldsymbol{\Phi}(\mathbf{x})\theta - \mathbf{y} < \epsilon \tag{21} \]

3.2 Choices of Basis

We provide two choices of basis functions $\phi(x)$, namely B-splines and Chebyshev polynomials of the first kind. B-spline basis functions are piecewise polynomials of the same degree but different "locality". Chebyshev polynomials are orthogonal polynomials of different degrees defined on the same global canonical domain $[-1,1]$. Internally, we map the physical domain $[a,b]$ to the canonical domain $[-1,1]$. These two basis choices are represented by the C++ classes SmoothSplineCurve and SmoothChebyshevCurve, respectively.

Because B-spline basis functions are overlapping piecewise polynomials, the placement of knot points has a significant impact on the fitting result. In general, it is desirable to have data points scattered uniformly across all pieces. If the data implies dramatic curvature changes, adding an extra connecting piece could help smooth the transition. In our experience, B-splines can be vulnerable to data noise. Careful attention to regularization, perhaps with additional shape constraints, is often needed.

Chebyshev polynomials have many nice theoretical properties. Most relevant to data fitting, a truncated Chebyshev expansion is very close to the ideal "minimax" approximation polynomial. In other words, Chebyshev polynomials are somewhat naturally "smooth" and numerically more stable. Each Chebyshev basis function is a "global" polynomial over $[-1,1]$. We only need to determine the number $n$ of basis functions used in $f(x)$, which determines the highest degree of the Chebyshev series. Increasing $n$ of course increases the freedom of the fit. In practice, a very large degree $n$, e.g. greater than 7 or 8, is often unnecessary.
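
As a small illustration (standalone code, not the SmoothChebyshevCurve API), one row of the basis matrix $\boldsymbol{\Phi}(\mathbf{x})$ can be evaluated by mapping $x$ from $[a,b]$ onto $[-1,1]$ and applying the three-term recurrence $T_0(t) = 1$, $T_1(t) = t$, $T_{k+1}(t) = 2tT_k(t) - T_{k-1}(t)$:

// Standalone sketch (not NPL's API): evaluate the Chebyshev basis row
// phi(x) = (T_0(t), ..., T_{n-1}(t)), with t the affine map of x from
// [a, b] onto the canonical domain [-1, 1].
#include <cstdio>
#include <vector>

std::vector<double> chebyshev_basis_row(double x, double a, double b, int n) {
    double t = (2.0 * x - a - b) / (b - a);     // map [a,b] -> [-1,1]
    std::vector<double> phi(n);
    if (n > 0) phi[0] = 1.0;                    // T_0(t) = 1
    if (n > 1) phi[1] = t;                      // T_1(t) = t
    for (int k = 2; k < n; ++k)                 // T_k = 2 t T_{k-1} - T_{k-2}
        phi[k] = 2.0 * t * phi[k - 1] - phi[k - 2];
    return phi;
}

int main() {
    // One row of the basis matrix for x = 2.5 on [0, 10], n = 5 basis functions.
    std::vector<double> row = chebyshev_basis_row(2.5, 0.0, 10.0, 5);
    for (double v : row) std::printf("%8.4f", v);
    std::printf("\n");
    return 0;
}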

3.3 Roughness Measure

The regularization term in formulation (19) and the objective function in formulation (21) are there to make the curve function $f(x;\theta)$ "smoother". If one chooses $\boldsymbol{\Omega}$ to be the identity matrix, the roughness measure reduces the $l_2$ norm of the basis coefficients $\|\theta\|_2^2$, which tends to produce a flat curve closer and closer to $f(x;\theta) = 0$ as the penalty factor $\lambda$ increases.

3.3.1 Derivative Based

The classic definition of the roughness measure of a curve function $f(x;\theta)$ over $[a,b]$, cf. chapter XIV in [4], is

\[ \mathbf{R}(\theta) = \int_a^b f''(x;\theta)^2 \, dx \]

Let $f^{(d)}(x;\theta)$ denote the $d$-th order derivative of $f(x;\theta)$. NPL offers 4 levels of derivative-based roughness matrix construction, $d = 0, \dots, 3$,

\[ \mathbf{R}(\theta) = \int_a^b \big(f^{(d)}(x;\theta)\big)^2 \, dx \]

Once users choose $d$, we internally carry out the calculations to write the roughness measure in the form $\mathbf{R}(\theta) = \theta^T\boldsymbol{\Omega}\theta$.
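
The quadratic form arises because $f(x;\theta)$ is linear in $\theta$: substituting $f^{(d)}(x;\theta) = \sum_j \theta_j\,\phi_j^{(d)}(x)$ into the integral gives

\[
\mathbf{R}(\theta) = \int_a^b \Big(\sum_j \theta_j\,\phi_j^{(d)}(x)\Big)^2 dx = \theta^T\boldsymbol{\Omega}\theta,
\qquad
\Omega_{jk} = \int_a^b \phi_j^{(d)}(x)\,\phi_k^{(d)}(x)\,dx.
\]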

3.3.2 Divided Difference Based

The classic roughness measure $\int f''(x;\theta)^2\,dx$ favors small curvature magnitude, but has no preference regarding sign changes of the curvature. For this reason, we offer another regularization choice that penalizes the divided differences of derivatives,

\[ \mathbf{R}(\theta) = \sum_k \left(\frac{f^{(d)}(x_k;\theta) - f^{(d)}(x_{k+1};\theta)}{x_{k+1} - x_k}\right)^2, \quad d = 1, 2 \]

where the index $k$ traverses the knot points for the B-spline basis and predefined points for the Chebyshev basis. Again, users only need to choose the level $d$; we internally compute $\mathbf{R}(\theta) = \theta^T\boldsymbol{\Omega}\theta$.

3.4 Roughness Leverage

It is possible to partition $[a,b]$ into sub-intervals $(x_0, x_1, \dots, x_{k+1})$ and apply different roughness weights in different sub-intervals, i.e.

\[ \mathbf{R}(\theta) = \sum_{i=0}^{k} w_i \int_{x_i}^{x_{i+1}} \big(f^{(d)}(x;\theta)\big)^2 \, dx \]

The total roughness measure $\mathbf{R}(\theta)$ is then a weighted sum of local roughness. The sub-interval weights $\mathbf{w} = \{w_i\}$ play the role of local leverage factors. Users can set the leverage factors $\mathbf{w}$ through a piecewise constant function $w(x)$.

3.5 Shape Constraints

In both formulations (19) and (21), we may impose additional shape constraints at selected points $x_k$, $k = 0, \dots, K$,

\[ l_k \le f^{(d)}(x_k;\theta) \le u_k, \quad d = 0, 1, 2 \tag{22} \]

The shape constraints (22) effectively restrict the function value $f(x_k;\theta)$, slope $f'(x_k;\theta)$ or curvature $f''(x_k;\theta)$ to a certain range. For example, we could impose the constraints $f''(a;\theta) = 0$ and $f''(b;\theta) = 0$ to fit a curve that satisfies natural boundary conditions.
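
Because $f(x;\theta) = \phi(x)^T\theta$ is linear in the coefficients, each shape constraint in (22) is simply a linear constraint on $\theta$,

\[ l_k \le \big(\phi^{(d)}(x_k)\big)^T\theta \le u_k, \]

where $\phi^{(d)}(x_k)$ collects the $d$-th derivatives of the basis functions evaluated at $x_k$, so the constraints slot directly into the QP formulations above.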

3.6 Nonlinear Model Fit

Here, we have a mathematical model $g(x;\theta) = \phi(x)^T\theta$ that is approximated by a smooth curve. The dependent variable $y$ is connected to the independent variable $x$ through the model by a known nonlinear mapping function, i.e. $y = f(g(x;\theta), x)$.

Let $\mathbf{r}(\theta)$ be the vector of residual functions whose $i$-th element $r_i(\theta) = f(g(x_i;\theta), x_i) - y_i$ is the residual error for observation $(x_i, y_i)$. Formulation (19) then changes to the following regularized nonlinear least squares problem,

\[ \operatorname*{argmin}_{\theta}\; \|\mathbf{r}(\theta)\|_2^2 + \lambda\,\theta^T\boldsymbol{\Omega}\theta \tag{23} \]

And formulation (21) now becomes a nonlinear programming problem,

\[ \operatorname*{argmin}_{\theta}\; \theta^T\boldsymbol{\Omega}\theta \quad \text{s.t.} \tag{24} \]
\[ -\epsilon_i < r_i(\theta) < \epsilon_i, \quad i = 0, \dots, m-1 \tag{25} \]

Again, additional curve shape constraints (22) may be imposed on both formulations (23) and (25). These two model calibration methods internally use our linearly constrained (LC) optimization solver and nonlinear programming (NLP) solver, respectively. Please refer to our LC and NLP white papers for further details.

4 Surface Fitting

Given $N$ data points $(x_k, y_k, z_k)$, $k = 0, \dots, N-1$, observed from an unknown two-dimensional function $z = \tilde{f}(x,y)$, we want to approximate $\tilde{f}(x,y)$ by a smooth surface function $f(x,y;\boldsymbol{\Theta})$ written as a tensor product of two sets of basis functions $\phi(x)$ and $\psi(y)$,

\[ f(x,y;\boldsymbol{\Theta}) = \phi(x)^T\boldsymbol{\Theta}\,\psi(y) \tag{26} \]

where $\boldsymbol{\Theta}$ is the coefficient matrix of the tensor product. The objective of surface fitting is to recover $\boldsymbol{\Theta}$ from the observed data points.
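
Note that (26) is still linear in the coefficients. Stacking the columns of $\boldsymbol{\Theta}$ into $\mathrm{vec}(\boldsymbol{\Theta})$,

\[
f(x,y;\boldsymbol{\Theta}) = \phi(x)^T\boldsymbol{\Theta}\,\psi(y) = \big(\psi(y)\otimes\phi(x)\big)^T\,\mathrm{vec}(\boldsymbol{\Theta}),
\]

so the surface fitting formulations below have the same linear-in-parameters structure as the curve fitting problems of Section 3.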

In analogy to curve fitting, we offer two surface fitting methods in Mathwrist NPL. The first approach is once again a regularized least squares fit,

\[ \operatorname*{argmin}_{\boldsymbol{\Theta}}\; \sum_{k=0}^{N-1} \big(z_k - f(x_k, y_k;\boldsymbol{\Theta})\big)^2 + \lambda\,\mathbf{R}(\boldsymbol{\Theta}) \tag{27} \]

where $\mathbf{R}(\boldsymbol{\Theta})$ is some choice of roughness regularization term and $\lambda > 0$ is the roughness penalty factor, either given as user input or computed by the GCV method. In the alternative approach, we directly minimize the roughness measure subject to bounded fitting error constraints,

\[ \operatorname*{argmin}_{\boldsymbol{\Theta}}\; \mathbf{R}(\boldsymbol{\Theta}) \quad \text{s.t.} \tag{28} \]
\[ -\epsilon < f(x_k, y_k;\boldsymbol{\Theta}) - z_k < \epsilon, \quad \forall k \tag{29} \]

4.1 Choice of Basis

The basis functions $\phi(x)$ and $\psi(y)$ are of the same type, either B-splines or Chebyshev polynomials of the first kind. These two basis choices are represented by two different C++ surface class types, SmoothSplineSurface and SmoothChebyshevSurface, respectively. As a special case of the B-spline basis, when we choose $\phi(x)$ and $\psi(y)$ to be piecewise linear functions, the function $f(x,y)$ is a bilinear surface. This special case deserves separate treatment and is represented by a separate C++ surface class, SmoothBilinearSurface, for more efficient calculation.

4.2 Roughness Measure

We offer 3 different surface roughness measures: Frobenius norm, Dirichlet energy and thin-plate energy. The Frobenius norm is the analogue of the $l_2$ coefficient norm in the curve fitting case.

Let $[a,b]\times[c,d]$ be the domain of the surface. The Dirichlet energy is defined as the integral of the squared gradient norm,

\[ \mathbf{R}(\boldsymbol{\Theta}; a,b,c,d) = \int_a^b\!\!\int_c^d \|\nabla f(x,y;\boldsymbol{\Theta})\|^2 \, dy\, dx \]

The standard "thin-plate energy" is a rotation-invariant measure defined as follows,

\[ \mathbf{R}(\boldsymbol{\Theta}; a,b,c,d) = \int_a^b\!\!\int_c^d \left( \Big(\frac{\partial^2 f}{\partial x^2}\Big)^2 + 2\Big(\frac{\partial^2 f}{\partial x\,\partial y}\Big)^2 + \Big(\frac{\partial^2 f}{\partial y^2}\Big)^2 \right) dy\, dx \]

Thin-plate energy roughness is not applicable to bilinear surface.

4.3 Roughness Leverage

Users can partition the whole surface domain $[a,b]\times[c,d]$ into sub-areas $[x_i, x_{i+1}]\times[y_j, y_{j+1}]$, $i = 0, \dots, k$, $j = 0, \dots, l$, and supply a 2-d piecewise constant function $w(x,y)$ to the surface fitting.

Let $w_{i,j}$ be the weight computed from $w(x,y)$ for the area $[x_i, x_{i+1}]\times[y_j, y_{j+1}]$. We then compute the total roughness measure as the weighted sum of the roughness over those sub-surface areas,

\[ \mathbf{R}(\boldsymbol{\Theta}; a,b,c,d) = \sum_{i=0}^{k}\sum_{j=0}^{l} w_{i,j}\, \mathbf{R}(\boldsymbol{\Theta}; x_i, x_{i+1}, y_j, y_{j+1}) \]

4.4 Shape Constraints

In both formulations (27) and (29), it is possible to impose constraints directly at certain points $(x_k, y_k)$ on the surface. The supported constraint types are:

  1. Function value $f(x_k, y_k;\boldsymbol{\Theta})$ is bounded;

  2. Partial delta $\partial f/\partial x\,(x_k, y_k;\boldsymbol{\Theta})$ or $\partial f/\partial y\,(x_k, y_k;\boldsymbol{\Theta})$ is bounded;

  3. Gamma $\partial^2 f/\partial x^2\,(x_k, y_k;\boldsymbol{\Theta})$ or $\partial^2 f/\partial y^2\,(x_k, y_k;\boldsymbol{\Theta})$ is bounded;

  4. Cross gamma $\partial^2 f/(\partial x\,\partial y)\,(x_k, y_k;\boldsymbol{\Theta})$ is bounded.

Let $\delta^{(r,s)}(f(x,y;\boldsymbol{\Theta}))$ denote the differential operator relevant to the supported constraint types, e.g. $\delta^{(0,0)}$ for the function value and $\delta^{(1,0)}$ for the partial delta in $x$. The optimization problems (27) and (29) may then be subject to surface shape constraints,

\[ l_k \le \delta^{(r_k, s_k)}\big(f(x_k, y_k;\boldsymbol{\Theta})\big) \le u_k, \quad \forall k \tag{30} \]

4.5 Nonlinear Model Fit

Similar to nonlinear curve model fitting, here we have a known 2-d nonlinear mapping function $z = f(g(x,y;\boldsymbol{\Theta}), x, y)$ that connects the independent variables $(x,y)$ to the function value $z$ through a model $g(x,y;\boldsymbol{\Theta})$. It is the model function $g(x,y;\boldsymbol{\Theta})$ that is approximated by a smooth surface, i.e. $g(x,y;\boldsymbol{\Theta}) = \phi(x)^T\boldsymbol{\Theta}\,\psi(y)$.

Let $\mathbf{r}(\boldsymbol{\Theta})$ be the vector of residual functions whose $i$-th element $r_i(\boldsymbol{\Theta}) = f(g(x_i, y_i;\boldsymbol{\Theta}), x_i, y_i) - z_i$ is the residual error for observation $(x_i, y_i, z_i)$. Users can choose to calibrate the surface model parameter $\boldsymbol{\Theta}$ either by a regularized nonlinear least squares fit,

\[ \operatorname*{argmin}_{\boldsymbol{\Theta}}\; \|\mathbf{r}(\boldsymbol{\Theta})\|_2^2 + \lambda\,\mathbf{R}(\boldsymbol{\Theta}) \tag{31} \]

or by solving a nonlinear programming problem,

\[ \operatorname*{argmin}_{\boldsymbol{\Theta}}\; \mathbf{R}(\boldsymbol{\Theta}) \quad \text{s.t.} \tag{32} \]
\[ -\epsilon_i < r_i(\boldsymbol{\Theta}) < \epsilon_i, \quad i = 0, \dots, N-1 \tag{33} \]

Again, in both formulations (31) and (33), additional shape constraints (30) can be imposed.

References

  • [1] Jorge Nocedal and Stephen J. Wright: Numerical Optimization, Springer, 1999
  • [2] Philip E. Gill, Walter Murray and Margaret H. Wright: Practical Optimization, Academic Press, 1981
  • [3] Trevor Hastie, Robert Tibshirani and Jerome Friedman: The Elements of Statistical Learning, Springer, 2001
  • [4] Carl de Boor: A Practical Guide to Splines, Revised Edition, Applied Mathematical Sciences Volume 27, Springer, 2001
  • [5] John P. Boyd: Chebyshev and Fourier Spectral Methods, second edition (revised), Dover Publications, 2001