
2.9 Numerical methods for nonlinear programming problems

Nowadays, powerful and efficient mathematical programming techniques are available for solving a great variety of nonlinear problems, based on solid theoretical results and extensive numerical studies. Approximations of functions, derivatives and optimal solutions can be employed together with optimization algorithms to reduce the computational time. The aim of this section is not to describe state-of-the-art algorithms in nonlinear programming, but to explain, in a simple way, a number of modern algorithms for solving nonlinear problems. These techniques are typically iterative in the sense that, given an initial point x0, a sequence of points {xk} is obtained by repeated application of an algorithmic rule. The objective is to make this sequence converge to a point x̄ in a pre-specified solution set. Typically, the solution set is specified in terms of optimality conditions, such as those presented in 2.5 through 2.8.

We start by recalling a number of concepts in 2.9.1. Then, we discuss the principles of Newton-like algorithms for nonlinear systems in 2.9.2, and use these concepts for the solution of unconstrained optimization problems in 2.9.3. Finally, algorithms for solving general nonlinear problems are presented in 2.9.7, with emphasis on sequential unconstrained minimization (SUM) and sequential quadratic programming (SQP) techniques.

2.9.1 Preliminaries

Two essential questions must be addressed concerning iterative algorithms. The first question, which is qualitative in nature, is whether a given algorithm in some sense yields, at least in the limit, a solution to the original problem; the second question, the more quantitative one, is concerned with how fast the algorithm converges to a solution. We elaborate on these concepts in this subsection.

The convergence of an algorithm is said to be asymptotic when the solution is not achieved after a finite number of iterations; except for particular problems such as linear and quadratic programming, this is generally the case in nonlinear programming. In this context, a very desirable property of an optimization algorithm is global convergence:

Definition 2.23: Global Convergence, Local Convergence

An algorithm is said to be globally convergent if, for any initial point x0, it generates a sequence of points that converges to a point x̄ in the solution set. It is said to be locally convergent if there exists a ρ > 0 such that, for any initial point x0 with ‖x0 − x̄‖ < ρ, it generates a sequence of points converging to x̄ in the solution set.

Most modern mathematical programming algorithms are globally convergent. Locally convergent algorithms are not useful in practice because the neighborhood of convergence is not known in advance and can be arbitrarily small.

Next, what distinguishes optimization algorithms with the global convergence property is the order of convergence:

Definition 2.24: Order of Convergence

The order of convergence of a sequence {xk} → x̄ is the largest nonnegative integer p such that

lim_{k→∞} ‖xk+1 − x̄‖ / ‖xk − x̄‖^p = β < ∞

When p=1 and the convergence ratio β<1 , the convergence is said to be linear; if β=0 , the convergence is said to be superlinear. When p=2 , the convergence is said to be quadratic.

Since they involve the limit as k → ∞, p and β are measures of the asymptotic rate of convergence, i.e., as the iterates get close to the solution; yet, a sequence with a good order of convergence may be very "slow" far from the solution. Clearly, the convergence is faster when p is larger and β is smaller. Near the solution, if the convergence rate is linear, then the error is multiplied by β at each iteration. The error reduction is squared for quadratic convergence, i.e., each iteration roughly doubles the number of significant digits. The methods that will be studied hereafter have convergence rates varying between linear and quadratic.

Example 2.19. Consider the problem to minimize f(x) = x², subject to x ≥ 1.

Let the (point-to-point) algorithmic map M1 be defined as M1(x) := ½(x + 1). It is easily verified that the sequence obtained by applying the map M1, from any starting point, converges to the optimal solution x* = 1, i.e., M1 is globally convergent. For example, with x0 = 4, the algorithm generates the sequence {4, 2.5, 1.75, 1.375, 1.1875, …}. We have (xk+1 − 1) = ½(xk − 1), so that the limit in Definition 2.24 is β = ½ with p = 1; moreover, for p > 1, this limit is infinite. Consequently, xk → 1 linearly, with convergence ratio ½.

On the other hand, consider the (point-to-point) algorithmic map M2 defined as M2(x; k) := 1 + (x − 1)/2^(k+1). Again, the sequence obtained by applying M2 converges to x* = 1 from any starting point. However, we now have |xk+1 − 1|/|xk − 1| = 1/2^(k+1), which approaches 0 as k → ∞. Hence, xk → 1 superlinearly in this case. With x0 = 4, the algorithm generates the sequence {4, 2.5, 1.375, 1.046875, …}. The algorithmic maps M1 and M2 are illustrated on the left and right plots in Fig. 1.14, respectively.
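By way of illustration, the two recursions can be iterated numerically. The following minimal Python sketch (names and iteration count are illustrative, not from the text) prints the error ratios |xk+1 − 1|/|xk − 1| for both maps; the ratio stays at ½ for M1 but tends to 0 for M2:

    def m1(x):
        return 0.5 * (x + 1.0)                 # map M1: linear convergence, ratio 1/2

    def m2(x, k):
        return 1.0 + (x - 1.0) / 2**(k + 1)    # map M2: superlinear convergence

    x1 = x2 = 4.0
    for k in range(5):
        e1, e2 = abs(x1 - 1.0), abs(x2 - 1.0)  # current errors |xk - 1|
        x1, x2 = m1(x1), m2(x2, k)
        print(k, abs(x1 - 1.0) / e1, abs(x2 - 1.0) / e2)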

It should also be noted that for most algorithms, the user must set initial values for certain parameters, such as the starting point and the initial step size, as well as parameters for terminating the algorithm. Optimization procedures are often quite sensitive to these parameters, and may produce different results, or even stop prematurely, depending on their values. Therefore, it is crucial for the user to understand the principles of the algorithms used, so that he or she can select adequate values for the parameters and diagnose the reasons of a premature termination (failure).

2.9.2 Newton-like Algorithms for nonlinear Systems

The fundamental approach to most iterative schemes was suggested over 300 years ago by Newton. In fact, Newton’s method is the basis for nearly all the algorithms that are described herein.

Suppose one wants to find a value of the variable x ∈ ℝ^nx such that

ϕ(x)=0

where ϕ: ℝ^nx → ℝ^nx is continuously differentiable. Let us assume that one such solution exists, and denote it by x*. Let also x be a guess for the solution. The basic idea of Newton's method is to approximate the nonlinear function ϕ by the first two terms in its Taylor series expansion about the current point x. This yields a linear approximation for the vector function ϕ at the new point x̄:

ϕ(x̄) ≈ ϕ(x) + ∂ϕ/∂x|x [x̄ − x] = ϕ(x) + ∇ϕ(x)[x̄ − x]

Using this linear approximation, and provided that the Jacobian matrix ∇ϕ(x) is nonsingular, a new estimate for the solution x* can be computed by setting the linear approximation of ϕ(x̄) to zero and solving for x̄:

x̄ = x − [∇ϕ(x)]⁻¹ ϕ(x)

Letting d := −[∇ϕ(x)]⁻¹ ϕ(x), we get the update x̄ = x + d.

Of course, ϕ being a nonlinear function, one cannot expect that ϕ(x̄) = 0, but there is much hope that x̄ be a better estimate for the root x* than the original guess x. In other words, we might expect that

‖x̄ − x*‖ ≤ ‖x − x*‖   and   ‖ϕ(x̄)‖ ≤ ‖ϕ(x)‖

If the new point is an improvement, then it makes sense to repeat the process, thereby defining a sequence of points {x0, x1, …}. An algorithm implementing Newton's method is as follows:

Algorithm 2.1

Initialization Step:

Let 𝜖>0 be a termination scalar, and choose an initial point x0 .

Let k=0 and go to the main step.

Main Step:

  • Solve the linear system ∇ϕ(xk) dk = −ϕ(xk) for dk.
  • Compute the new estimate xk+1 = xk + dk.
  • If ‖ϕ(xk+1)‖ < 𝜖, stop; otherwise, replace k ← k + 1, and go to step 1.
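By way of illustration, a minimal Python sketch of Algorithm 2.1 might read as follows (the names newton_system, phi and jac are illustrative; ϕ and its Jacobian are user-supplied callables):

    import numpy as np

    def newton_system(phi, jac, x0, eps=1e-10, maxit=50):
        x = np.asarray(x0, dtype=float)
        for k in range(maxit):
            d = np.linalg.solve(jac(x), -phi(x))   # step 1: solve J(xk) dk = -phi(xk)
            x = x + d                              # step 2: new estimate xk+1 = xk + dk
            if np.linalg.norm(phi(x)) < eps:       # step 3: termination test
                return x, k + 1
        raise RuntimeError("Newton's method did not converge")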

It can be shown that the rate of convergence for Newton's method is quadratic (see Definition 2.24). Loosely speaking, this implies that each successive estimate of the solution doubles the number of significant digits, which is a very desirable property for an algorithm to possess.

Theorem 2.21: Quadratic convergence for Newton’s algorithm

Let ϕ: ℝ^nx → ℝ^nx be continuously differentiable, and consider Newton's algorithm defined by the map M(x) := x − [∇ϕ(x)]⁻¹ ϕ(x). Let x* be such that ϕ(x*) = 0, and suppose that ∇ϕ(x*) is nonsingular. Let the starting point x0 be sufficiently close to x*, so that there exist c1, c2 > 0 with c1 c2 ‖x0 − x*‖ < 1, and

‖[∇ϕ(x)ᵀ]⁻¹‖ ≤ c1   and   ‖ϕ(x*) − ϕ(x) − ∇ϕ(x)[x* − x]‖ ≤ c2 ‖x* − x‖²

(For a rectangular matrix A ∈ ℝ^(n×m), we define the operator norm ‖A‖ := sup{‖Ax‖ : x ∈ ℝ^m, ‖x‖ = 1}.)

for each x satisfying ‖x − x*‖ ≤ ‖x0 − x*‖. Then, Newton's algorithm converges with a quadratic rate of convergence.

Proof. See [6, Theorem 8.6.5] for a proof. □

But can anything go wrong with Newton’s method?

Lack of Global Convergence First and foremost, if the initial guess is not sufficiently close to the solution, i.e., within the region of convergence, Newton's method may diverge. Said differently, Newton's method as presented above does not have the global convergence property (see Definition 2.23 and Example 2.20 hereafter). This is because dk := −[∇ϕ(xk)]⁻¹ ϕ(xk) may not be a valid descent direction far from the solution and, even if it is, a unit step size might not give a descent in ‖ϕ‖. Globalization strategies, which aim at correcting this latter deficiency, will be presented in 2.9.4 in the context of unconstrained optimization.

Singular Jacobian Matrix A second difficulty occurs when the Jacobian matrix ∇ϕ(xk) becomes singular during the iteration process, since the Newton correction is then not defined. Note that the assumptions in Theorem 2.21 guarantee that ∇ϕ(x) cannot be singular near x*. But when the Jacobian matrix is singular at the solution point x*, Newton's method loses its quadratic convergence property.

Computational Efficiency Finally, at each iteration, Newton's method requires (i) that the Jacobian matrix ∇ϕ(xk) be computed, which may be difficult and/or costly especially when the expression of ϕ(x) is complicated, and (ii) that a linear system be solved. The analytic Jacobian can be replaced by a finite-difference approximation, yet this is costly as nx additional evaluations of ϕ are required at each iteration. With the objective of reducing the computational effort, quasi-Newton methods generate an approximation of the Jacobian matrix, based on the information gathered from the iteration progress. To avoid solving a linear system for the search direction, variants of quasi-Newton methods also exist that generate an approximation of the inverse of the Jacobian matrix. Such methods will be described in 2.9.5 in the context of unconstrained optimization.

Example 2.20. Consider the problem to find a solution to the nonlinear equation

f(x)=arctan(x)=0

The Newton iteration sequence obtained by starting from x0 = 1 is as follows:

k      xk                  |f(xk)|
0      1                   0.785398
1      −0.570796           0.518669
2      0.116860            0.116332
3      −1.061022×10⁻³      1.061022×10⁻³
4      7.963096×10⁻¹⁰      7.963096×10⁻¹⁰

Notice the very fast convergence to the solution x* = 0, as could be expected from Theorem 2.21. The first three iterations are represented in the left plot of Fig. 1.15.

However, convergence is not guaranteed for any initial guess. More precisely, it can be shown that Newton's method actually diverges when the initial guess is chosen such that |x0| > α, with α = 1.3917452002707 being a solution of arctan(z) = 2z/(1 + z²); further, the method cycles indefinitely for x0 = α. Both these situations are illustrated in the right plot and the bottom plot of Fig. 1.15, respectively.
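Both behaviors are easily reproduced in a few lines of Python (the driver newton_arctan is illustrative); since ϕ′(x) = 1/(1 + x²), the Newton step is d = −(1 + x²) arctan(x):

    import numpy as np

    def newton_arctan(x0, nit):
        x = x0
        for k in range(nit):
            x = x - (1.0 + x**2) * np.arctan(x)   # Newton step for arctan(x) = 0
            print(k + 1, x, abs(np.arctan(x)))
        return x

    newton_arctan(1.0, 4)   # converges rapidly, as in the table above
    newton_arctan(1.5, 6)   # |x0| > 1.3917... : the iterates diverge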

2.9.3 Unconstrained Optimization

We now turn to a description of basic techniques used for iteratively solving unconstrained problems of the form:

min f(x),  x ∈ ℝ^nx

Many unconstrained optimization algorithms work along the same lines. Starting from an initial point, a direction of movement is determined according to a fixed rule, and then a move is made in that direction so that the objective function value is reduced; at the new point, a new direction is determined and the process is repeated. The main difference between these algorithms rests with the rule by which successive directions of movement are selected. A distinction is usually made between those algorithms which determine the search direction without using gradient information (gradient-free methods), and those using gradient (and higher-order derivative) information (gradient-based methods). Here, we shall focus our attention on the latter class of methods, and more specifically on Newton-like algorithms.

Throughout this subsection, we shall assume that the objective function f is twice continuously differentiable. By Theorem 2.3, a necessary condition for x* to be a local minimum of f is ∇f(x*) = 0. Hence, the idea is to devise an iterative scheme that finds a point satisfying the foregoing condition. Following the techniques discussed earlier in 2.9.2, this can be done with a Newton-like algorithm, with ϕ corresponding to the gradient ∇f of f, and ∇ϕ to its Hessian matrix ∇²f.

At each iteration, a new iterate xk+1 is obtained such that the linear approximation to the gradient at that point is zero,

∇f(xk+1) ≈ ∇f(xk) + ∇²f(xk)[xk+1 − xk] = 0

The linear approximation yields the Newton search direction:

dk := xk+1 − xk = −[∇²f(xk)]⁻¹ ∇f(xk)

As discussed in 2.9.2, if it converges, Newton's method exhibits a quadratic rate of convergence when the Hessian matrix ∇²f(x*) is nonsingular at the solution point. However, since the Newton iteration is based on finding a zero of the gradient vector, there is no guarantee that the step will move towards a local minimum, rather than a saddle point or even a maximum. To preclude this, the Newton steps should be taken downhill, i.e., the following descent condition should be satisfied at each iteration:

∇f(xk)ᵀ dk < 0

Interestingly enough, with the Newton direction, the descent condition becomes

−∇f(xk)ᵀ [∇²f(xk)]⁻¹ ∇f(xk) < 0

That is, a sufficient condition to obtain a descent direction at xk is that the Hessian matrix ∇²f(xk) be positive definite. Moreover, if ∇²f(x*) is positive definite at a local minimizer x* of f, then the Newton iteration converges to x* when started sufficiently close to x*. (Recall that, by Theorem 2.8, positive definiteness of ∇²f(x*) is a sufficient condition for a local minimum x* of f to be a strict local minimum.)
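A minimal sketch of the resulting search-direction computation follows; falling back to the steepest-descent direction when the descent condition fails is one simple safeguard assumed here, not prescribed by the text:

    import numpy as np

    def newton_direction(grad, hess):
        d = np.linalg.solve(hess, -grad)   # solve (grad^2 f(xk)) d = -grad f(xk)
        if grad @ d < 0.0:                 # descent condition grad f(xk)' d < 0
            return d
        return -grad                       # safeguard: steepest-descent direction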

We now discuss two important improvements to Newton’s method, which are directly related to the issues discussed in Subsection 2.9.2, namely (i) the lack of global convergence, and (ii) computational efficiency.

2.9.4 Globalization Strategies

Up to this point, the development has focused on the application of Newton's method. However, even in the simplest one-dimensional applications, Newton's method has deficiencies (see, e.g., Example 2.20). Methods for correcting global convergence deficiencies are referred to as globalization strategies. It should be stressed that an efficient globalization strategy should only alter the iterates when a problem is encountered, and it should not impede the ultimate behavior of the method, i.e., the quadratic convergence of Newton's method should be retained.

In unconstrained optimization, problems can be detected in a very simple fashion, by monitoring whether the next iterate xk+1 satisfies a descent condition with respect to the current iterate xk, e.g., f(xk+1) < f(xk). Then, one of two globalization strategies can be used to correct the Newton step. The first strategy, known as the line search method, alters the magnitude of the step; the second, known as the trust region method, modifies both the step magnitude and its direction. We shall concentrate on the former class of globalization strategies subsequently.

A line search method proceeds by replacing the full Newton step xk+1 = xk + dk with

xk+1=xk+αdk

where the step-length α ≥ 0 is chosen such that the objective function is reduced,

f(xk+αdk)<f(xk)

Clearly, the optimal choice for α would be α = arg min{l(α) := f(xk + α dk) : α ≥ 0}. The resulting one-dimensional minimization problem can be solved by any exact minimization technique (e.g., Newton's method itself). However, such techniques are costly in the sense that they often require many iterations to converge and, therefore, many function (or even gradient) evaluations.

In response to this, most modern algorithms implement so-called inexact line search criteria, which aim to find a step-length giving an "acceptable" decrease in the objective function. By sacrificing accuracy, we might impair the convergence of the overall algorithm that iteratively employs such a line search. However, by adopting a line search that guarantees a sufficient degree of descent in the objective function, the convergence of the overall algorithm can still be established.

We now describe one popular definition of an acceptable step-length known as Armijo's rule; other popular approaches are the quadratic and cubic fit techniques, as well as Wolfe's and Goldstein's tests. Armijo's rule is driven by two parameters 0 < κ1 < 1 and κ2 > 1, which prevent the acceptable step-length from being too large and too small, respectively (typical values are κ1 = 0.2 and κ2 = 2). Define the line search function l(α) := f(xk + α dk), for α ≥ 0, and consider the modified first-order approximation l̂(α) := l(0) + κ1 α l′(0). A step-length ᾱ > 0 is deemed acceptable if the following conditions hold:

l(ᾱ) ≤ l̂(ᾱ)   and   l(κ2 ᾱ) ≥ l̂(κ2 ᾱ)

The first condition prevents the step-length ᾱ from being too large, whereas the second condition prevents ᾱ from being too small. The acceptable region defined by Armijo's rule is shown in Fig. 1.16 below.
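A minimal Python sketch of Armijo's rule as stated above might read as follows; the halving/doubling search used to locate an acceptable step-length is an assumption of this sketch (all names are illustrative):

    def armijo(f, x, d, g, k1=0.2, k2=2.0, alpha=1.0, maxit=50):
        l0, slope = f(x), g @ d                     # l(0) and l'(0) = grad f(xk)' dk
        lhat = lambda a: l0 + k1 * a * slope        # modified first-order approximation
        for _ in range(maxit):
            if f(x + alpha * d) > lhat(alpha):
                alpha /= k2                         # first condition fails: step too large
            elif f(x + k2 * alpha * d) <= lhat(k2 * alpha):
                alpha *= k2                         # second condition fails: step too small
            else:
                return alpha                        # both Armijo conditions hold
        return alpha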

2.9.5 Recursive Updates

Another limitation of Newton's method when applied to unconstrained optimization problems is that the Hessian matrix of the objective function must be computed at each iteration, and a linear system must then be solved to obtain the search direction. For many applications, this is a significant computational burden. In response to this, quasi-Newton methods attempt to construct this information recursively. By so doing, however, the quadratic rate of convergence is lost.

The basic idea of many quasi-Newton methods is that two successive iterates xk, xk+1, together with the corresponding gradients ∇f(xk), ∇f(xk+1), yield curvature information by means of the first-order approximation relation

∇f(xk+1) = ∇f(xk) + ∇²f(xk) δk + o(‖δk‖)

with δk := xk+1 − xk. In particular, given nx linearly independent iteration increments δ0, …, δnx−1, an approximation of the Hessian matrix can be obtained as

∇²f(xnx) ≈ [γ0 ⋯ γnx−1][δ0 ⋯ δnx−1]⁻¹

or for the inverse Hessian matrix as

[∇²f(xnx)]⁻¹ ≈ [δ0 ⋯ δnx−1][γ0 ⋯ γnx−1]⁻¹

where γk := ∇f(xk+1) − ∇f(xk).

Note that when the objective function is quadratic, the previous relations are exact. Many quasi-Newton methods use similar, though more sophisticated, constructions to build an approximation Bk that progressively approaches the inverse Hessian. One of the most popular classes of quasi-Newton methods, known as the Broyden family, proceeds as follows:

Bk+1 := Bk + (δk δkᵀ)/(δkᵀ γk) − (Bk γk)(Bk γk)ᵀ/(γkᵀ Bk γk) + ξ (γkᵀ Bk γk) wk wkᵀ,   where   wk := δk/(δkᵀ γk) − (Bk γk)/(γkᵀ Bk γk)

where 0 ≤ ξ ≤ 1. When supplemented with a line search strategy guaranteeing that δkᵀ γk > 0 at each k, the denominators in the update are nonzero, and hence the approximations are guaranteed to exist. Moreover, it can be shown that the successive approximations remain positive definite provided that B0 is itself positive definite.

By setting ξ = 0, the previous equation yields the Davidon-Fletcher-Powell (DFP) update, historically the first quasi-Newton method, while setting ξ = 1 gives the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update, for which there is substantial evidence that it is the best general-purpose quasi-Newton method currently known.
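A minimal sketch of the Broyden-family update above, applied to the inverse-Hessian approximation Bk, might read as follows (ξ = 0 gives DFP, ξ = 1 gives BFGS; names are illustrative):

    import numpy as np

    def broyden_update(B, delta, gamma, xi=1.0):
        Bg = B @ gamma
        dg = delta @ gamma                   # curvature delta' gamma (should be > 0)
        gBg = gamma @ Bg
        w = delta / dg - Bg / gBg
        return (B + np.outer(delta, delta) / dg
                  - np.outer(Bg, Bg) / gBg
                  + xi * gBg * np.outer(w, w))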

2.9.6 Summary

A Newton-like algorithm including both a line search method (Armijo's rule) and a recursive update of the inverse Hessian approximation (DFP update) is as follows:

Algorithm 2.2

Initialization Step:

Let 𝜖 > 0 be a termination scalar, and choose an initial point x0 ∈ ℝ^nx and a symmetric, positive definite matrix B0 ∈ ℝ^(nx×nx). Let k = 0, and go to the main step.

Main Step:

  • Search Direction - Compute dk := −Bk ∇f(xk).
  • Line Search - Find a step αk satisfying Armijo’s conditions.
  • Update - Compute the new estimates:
    xk+1 := xk + αk dk
    Bk+1 := Bk + (δk δkᵀ)/(δkᵀ γk) − (Bk γk)(Bk γk)ᵀ/(γkᵀ Bk γk)

    with δk := xk+1 − xk and γk := ∇f(xk+1) − ∇f(xk).

  • If ‖∇f(xk+1)‖ < 𝜖, stop; otherwise, replace k ← k + 1, and go to step 1.

The standard unconstrained optimization algorithm in MATLAB (i.e., the function fminunc) is an implementation of a quasi-Newton method, with a DFP or BFGS update and a line search strategy.

Example 2.21. Consider the problem to find a minimum of Rosenbrock's function

f(x) = (1 − x1)² + c (x2 − x1²)²

for x ∈ ℝ², with c := 105. We solved this problem using the function fminunc.

The results are shown in Fig. 1.17. Observe the slow convergence of the iterates far from the optimal solution x* = (1, 1), but the very fast convergence in the vicinity of x*.
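A comparable experiment can be run with scipy.optimize.minimize, whose BFGS option implements a quasi-Newton method similar to fminunc (the starting point below is an illustrative choice):

    import numpy as np
    from scipy.optimize import minimize

    c = 105.0                                             # as in the example
    f = lambda x: (1.0 - x[0])**2 + c * (x[1] - x[0]**2)**2
    res = minimize(f, x0=np.array([-1.5, 2.0]), method="BFGS")
    print(res.x)                                          # close to the minimizer (1, 1)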

2.9.7 Constrained Nonlinear Optimization

In this subsection, we turn our attention to algorithms for iteratively solving constrained problems of the form:

min f(x);  x ∈ X ⊂ ℝ^nx

Many modern deterministic algorithms for constrained NLP problems are based on the (rather natural) principle that, instead of solving a difficult problem directly, one had better solve a sequence of simpler, but related, subproblems that converges to a solution of the original problem, either in a finite number of steps or in the limit. Working along these lines, two classes of algorithms can be distinguished for the solution of NLP problems with equality and/or inequality constraints. On the one hand, penalty function and interior-point methods solve the problem as a sequence of unconstrained problems (or problems with simple constraints), so that algorithms for unconstrained optimization can be used. These methods, which do not rely on the KKT theory described earlier in 2.6 through 2.8, are briefly presented in 2.9.8 and 2.9.9. On the other hand, Newton-like methods solve NLP problems by attempting to find a point satisfying the necessary conditions of optimality (KKT conditions in general). Sequential quadratic programming (SQP), presented in 2.9.10, represents one such class of methods.

2.9.8 Penalty Function Methods

Methods using penalty functions transform a constrained problem into a single unconstrained problem or a sequence of unconstrained problems. This is done by placing the constraints into the objective function via a penalty parameter in a way that penalizes any violation of the constraints. To illustrate it, consider the NLP problem:

min f(x)   s.t.  g(x) ≤ 0,  h(x) = 0,  x ∈ X

where X ⊂ ℝ^nx, x ∈ ℝ^nx, and f: X → ℝ, g: X → ℝ^ng and h: X → ℝ^nh are defined on X.

In general, a suitable penalty function α(x) for the minimization problem is defined by:

α(x) = Σ_{k=1}^{ng} ϕ[gk(x)] + Σ_{k=1}^{nh} ψ[hk(x)]

where ϕ and ψ are continuous functions satisfying the conditions:

ϕ(z) = 0 if z ≤ 0, ϕ(z) > 0 otherwise;   and   ψ(z) = 0 if z = 0, ψ(z) > 0 otherwise

Typically, ϕ and ψ are of the forms

ϕ(z) = (max{0, z})^p   and   ψ(z) = |z|^p

with p a positive integer (taking p ≥ 2 provides continuously differentiable penalty functions). The function f(x) + μ α(x) is referred to as the auxiliary function.

Example 2.22. Consider the problem to minimize f(x) = x, subject to g(x) = −x + 2 ≤ 0. It is immediately evident that the optimal solution lies at the point x* = 2, and has objective value f(x*) = 2.

Now, consider the penalty problem to minimize f(x) + μ α(x) = x + μ (max{0, 2 − x})² in ℝ, where μ is a large number. Note first that, for any μ, the auxiliary function is convex, since it is a sum of convex functions. Thus, a necessary and sufficient condition for optimality is that the gradient of f(x) + μ α(x) be equal to zero, yielding xμ = 2 − 1/(2μ). Thus, the solution of the penalty problem can be made arbitrarily close to the solution of the original problem by choosing μ sufficiently large. Moreover, f(xμ) + μ α(xμ) = 2 − 1/(4μ), which can also be made arbitrarily close to f(x*) by taking μ sufficiently large. These considerations are illustrated in Fig. 1.18 below.

The conclusions of Example 2.22, namely that the solution of the penalty problem can be made arbitrarily close to the solution of the original problem, and the optimal auxiliary function value arbitrarily close to the optimal objective value, by choosing μ sufficiently large, are formalized in the following:

Theorem 2.22

Consider a general NLP problem, where f, g and h are continuous functions on ℝ^nx and X is a nonempty set in ℝ^nx. Suppose that the NLP problem has a feasible solution, and let α be a continuous penalty function. Suppose further that for each μ there exists a solution xμ ∈ X to the problem min{f(x) + μ α(x) : x ∈ X}, and that {xμ} is contained in a compact subset of X. Then,

min{f(x) : g(x) ≤ 0, h(x) = 0, x ∈ X} = sup_{μ≥0} θ(μ) = lim_{μ→∞} θ(μ)

with θ(μ) := f(xμ) + μ α(xμ). Furthermore, the limit x̄ of any convergent subsequence of {xμ} is an optimal solution to the original problem, and μ α(xμ) → 0 as μ → ∞.

Proof. See [6, Theorem 9.2.2] for a proof. □

Note that the assumption that X is compact is necessary, for it is otherwise possible that the optimal objective values of the original and penalty problems are not equal. Yet, this assumption is not very restrictive in most practical cases, as the variables usually lie between finite upper and lower bounds. Note also that no restriction is imposed on f, g and h other than continuity. However, the application of an efficient solution procedure for the (unconstrained) auxiliary problems may impose additional restrictions on these functions (see 2.9.3).

Under the conditions that (i) f, g, h and ϕ, ψ are continuously differentiable, and (ii) x̄ is a regular point, the solution to the penalty problem can be used to recover the Lagrange multipliers associated with the constraints at optimality. In the particular case where X = ℝ^nx, we get

νiμ = μ ϕ′[gi(xμ)], i ∈ 𝒜(x̄);   λiμ = μ ψ′[hi(xμ)], i = 1, …, nh

The larger μ , the better the approximation of the Lagrange multipliers:

νμ → ν*  and  λμ → λ*  as  μ → ∞

Example 2.23. Consider the same problem as in Example 2.22. The auxiliary function f(x) + μ α(x) = x + μ (max{0, 2 − x})² being continuously differentiable, the Lagrange multiplier associated with the inequality constraint g(x) = −x + 2 ≤ 0 can be recovered as νμ = 2μ max{0, 2 − xμ} (assuming μ > 0). Note that the exact value of the Lagrange multiplier, ν* = 1, is obtained for each μ > 0 here, because g is a linear constraint.

From a computational viewpoint, superlinear convergence rates might be achievable, in principle, by applying Newton's method (or its variants, such as quasi-Newton methods). Yet, one can expect ill-conditioning when μ is taken very large in the penalty problem. With a large μ, more emphasis is placed on feasibility, and most procedures for unconstrained optimization will move quickly towards a feasible point. Even though this point may be far from the optimum, both slow convergence and premature termination can occur, due to very small step sizes and finite-precision computations (round-off errors).

As a result of the aforementioned difficulties associated with large penalty parameters, most algorithms using penalty functions employ a sequence of increasing penalty parameters. With each new value of the penalty parameter, an optimization technique is applied, starting from the optimal solution corresponding to the previously chosen parameter value. Such an approach is often referred to as the sequential unconstrained minimization (SUM) technique. This way, a sequence of infeasible points is typically generated, whose limit is an optimal solution to the original problem (hence the term exterior penalty function approach).

To conclude our discussion of the penalty function approach, we give an algorithm to solve a general NLP problem, where the penalty functions used are of the form specified by ϕ and ψ above:

Algorithm 2.3

Initialization Step:

Let 𝜖>0 be a termination scalar, and choose an initial point x0 , a penalty parameter μ0>0 , and a scalar β>1 . Let k=0 and go to the main step.

Main Step:

  • Starting with xk , get a solution to the problem:
    xk+1 ∈ arg min{f(x) + μk α(x) : x ∈ X}
  • If μk α(xk+1) < 𝜖, stop; otherwise, let μk+1 := βμk, replace k ← k + 1, and go to step 1.
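A minimal Python sketch of Algorithm 2.3, with the quadratic penalty (p = 2) and scipy's BFGS as inner solver, might read as follows (all names and parameter values are illustrative). The usage lines replay Example 2.22, whose iterates xμ = 2 − 1/(2μ) approach x* = 2 from outside the feasible region:

    import numpy as np
    from scipy.optimize import minimize

    def penalty_method(f, g, h, x0, mu0=1.0, beta=10.0, eps=1e-8, maxit=30):
        x, mu = np.asarray(x0, float), mu0
        alpha = lambda x: np.sum(np.maximum(0.0, g(x))**2) + np.sum(h(x)**2)
        for _ in range(maxit):
            # warm-start the unconstrained subproblem at the previous solution
            x = minimize(lambda x: f(x) + mu * alpha(x), x, method="BFGS").x
            if mu * alpha(x) < eps:
                return x, mu
            mu *= beta
        return x, mu

    x, mu = penalty_method(lambda x: x[0],
                           lambda x: np.array([-x[0] + 2.0]),   # g(x) <= 0
                           lambda x: np.array([]),              # no equality constraints
                           x0=[0.0])
    print(x)   # approximately 2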

2.9.9 Interior-Point Methods

Similar to penalty functions, barrier functions can also be used to transform a constrained problem into an unconstrained problem (or into a sequence of unconstrained problems). These functions act as a barrier and prevent the iterates from leaving the feasible region. If the optimal solution occurs at the boundary of the feasible domain, the procedure moves from the interior to the boundary of the domain, hence the name interior-point methods. To illustrate these methods, consider the NLP problem:

min_x f(x)   s.t.  g(x) ≤ 0,  x ∈ X

where X is a subset of ℝ^nx, and f: X → ℝ, g: X → ℝ^ng are continuous on ℝ^nx. Note that equality constraints, if any, should be accommodated within the set X. (In the case of affine equality constraints, one can possibly eliminate them after solving for some variables in terms of the others, thereby reducing the dimension of the problem.) This treatment is necessary because barrier function methods require the set {x ∈ ℝ^nx : g(x) < 0} to be nonempty; this would obviously not be possible if the equality constraints h(x) = 0 were accommodated within the set of inequalities, as h(x) ≤ 0 and −h(x) ≤ 0.

A barrier problem is formulated as:

inf_μ θ(μ)   s.t.  μ > 0

where θ(μ) := inf{f(x) + μ b(x) : g(x) < 0, x ∈ X}. Ideally, the barrier function b should take value zero on the region {x : g(x) < 0}, and value +∞ on its boundary. This would guarantee that the iterates do not leave the domain {x : g(x) ≤ 0}, provided the minimization problem is started at an interior point. However, this discontinuity poses serious difficulties for any computational procedure. Therefore, this ideal construction of b is replaced by the more realistic requirement that b be nonnegative and continuous over the region {x : g(x) < 0}, and approach infinity as the boundary is approached from the interior:

b(x) = Σ_{k=1}^{ng} ϕ[gk(x)]

where ϕ is a continuous function over {z:z<0} that satisfies the conditions

ϕ(z) ≥ 0 if z < 0,   and   lim_{z→0⁻} ϕ(z) = +∞

In particular, μb approaches the ideal barrier function described above as μ approaches zero.

Typical barrier functions are

b(x) = −Σ_{k=1}^{ng} 1/gk(x)   or   b(x) = −Σ_{k=1}^{ng} ln[min{1, −gk(x)}]

The following barrier function, known as Frisch’s logarithmic barrier function, is also widely used

b(x) = −Σ_{k=1}^{ng} ln[−gk(x)]

Actually, Frisch's logarithmic barrier ϕ(z) = −ln(−z) does not satisfy the nonnegativity requirement for z < −1. However, the requirement on ϕ can be relaxed: it is sufficient that ϕ be positive close to z = 0.

The function f(x)+μb(x) is referred to as the auxiliary function.

Given μ > 0, evaluating θ(μ) = inf{f(x) + μ b(x) : g(x) < 0, x ∈ X} seems no simpler than solving the original problem, because of the constraint g(x) < 0. However, starting the optimization from a point in the region S := {x : g(x) < 0} ∩ X yields an optimal point in S, even when the constraint g(x) < 0 is ignored. This is because b approaches infinity as the iterates approach the boundary of {x : g(x) ≤ 0} from within S, hence preventing them from leaving the set S. This is formalized in the following:

Theorem 2.23

Consider an NLP problem with inequality constraints, where f and g are continuous functions on ℝ^nx and X is a nonempty closed set in ℝ^nx. Suppose that the minimization problem has an optimal solution x* with the property that, given any neighborhood η(x*) of x*, there exists an x ∈ X ∩ η(x*) such that g(x) < 0. Suppose further that for each μ > 0, there exists a solution xμ ∈ X to the problem min{f(x) + μ b(x) : x ∈ X}. Then,

min{f(x) : g(x) ≤ 0, x ∈ X} = lim_{μ→0⁺} θ(μ) = inf_{μ>0} θ(μ)

with θ(μ) := f(xμ) + μ b(xμ). Furthermore, the limit of any convergent subsequence of {xμ} is an optimal solution to the original problem, and μ b(xμ) → 0 as μ → 0⁺.

Proof. See [6, Theorem 9.4.3] for a proof. □

Under the conditions that (i) f, g and ϕ are continuously differentiable, and (ii) x* is a regular point, the solution to the barrier problem can be used to recover the Lagrange multipliers associated with the constraints at optimality. In the particular case where X = ℝ^nx, we get:

νiμ = μ ϕ′[gi(xμ)], i ∈ 𝒜(x*)

The approximation of the Lagrange multipliers gets better as μ gets closer to 0:

νμ → ν*  as  μ → 0⁺

Example 2.24. Consider the problem to minimize f(x) = x, subject to g(x) = −x + 2 ≤ 0, the solution of which lies at the point x* = 2 with objective value f(x*) = 2. Now, consider the barrier problem to minimize f(x) + μ b(x) = x − μ/(2 − x) in ℝ, where μ is a small positive number. Note first that, for any μ > 0, the auxiliary function is convex (indeed, μ/(2 − x) is a concave function on the convex set S := {x : x > 2}, therefore −μ/(2 − x) is convex, and thus f(x) + μ b(x) is a sum of convex functions). Thus, a necessary and sufficient condition for optimality is that the gradient of f(x) + μ b(x) be equal to zero, yielding xμ = 2 + √μ (assuming μ > 0). Thus, the solution of the barrier problem can be made arbitrarily close to the solution of the original problem by choosing μ sufficiently close to zero. Moreover, f(xμ) + μ b(xμ) = 2 + 2√μ, which can also be made arbitrarily close to f(x*) by taking μ sufficiently close to zero. These considerations are illustrated in Fig. 1.19 below. Regarding the Lagrange multiplier associated with the inequality constraint g(x) = −x + 2 ≤ 0, the objective and constraint functions being continuously differentiable, an approximate value can be obtained as νμ = μ/(2 − xμ)² = 1. Here again, the exact value of the Lagrange multiplier is obtained for each μ > 0, because g is a linear constraint.

The use of barrier functions for solving constrained NLP problems also faces several computational difficulties. First, the search must start with a point x ∈ X such that g(x) < 0, and finding such a point may not be an easy task for some problems. Also, because of the structure of the barrier function b, and for small values of the parameter μ, most search techniques may face serious ill-conditioning and difficulties with round-off errors while solving the problem to minimize f(x) + μ b(x) over x ∈ X, especially as the boundary of the region {x : g(x) ≤ 0} is approached. Accordingly, interior-point algorithms employ a sequence of decreasing barrier parameters {μk} → 0 as k → ∞; with each new value μk, an optimal solution to the barrier problem is sought, starting from the previous optimal solution. As in the exterior penalty function approach, it is highly recommended to use suitable second-order Newton or quasi-Newton methods for solving the successive barrier problems.

We describe below a scheme using barrier functions for solving an NLP problem.

Algorithm 2.4

Initialization Step:

Let 𝜖 > 0 be a termination scalar, and choose an initial point x0 with g(x0) < 0. Let μ0 > 0, β ∈ (0, 1), k = 0, and go to the main step.

Main Step:

  • Starting with xk , get a solution to the problem:
    xk+1 ∈ arg min{f(x) + μk b(x) : x ∈ X}
  • If μk b(xk+1) < 𝜖, stop; otherwise, let μk+1 := βμk, replace k ← k + 1, and go to step 1.
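A minimal Python sketch of Algorithm 2.4 with Frisch's logarithmic barrier might read as follows (names and parameter values are illustrative). Returning +∞ at infeasible trial points is one simple way to realize the feasibility check discussed below, and a derivative-free inner solver is assumed so that these +∞ returns are handled gracefully:

    import numpy as np
    from scipy.optimize import minimize

    def barrier_method(f, g, x0, mu0=1.0, beta=0.1, eps=1e-8, maxit=30):
        x, mu = np.asarray(x0, float), mu0
        def b(x):
            gx = g(x)
            # reject infeasible points; otherwise Frisch's logarithmic barrier
            return np.inf if np.any(gx >= 0.0) else -np.sum(np.log(-gx))
        for _ in range(maxit):
            x = minimize(lambda x: f(x) + mu * b(x), x, method="Nelder-Mead").x
            if abs(mu * b(x)) < eps:     # termination on |mu b(x)| (a sketch choice)
                return x, mu
            mu *= beta
        return x, mu

    # Example 2.24 analogue with the logarithmic barrier: x_mu = 2 + mu -> 2
    x, mu = barrier_method(lambda x: x[0], lambda x: np.array([-x[0] + 2.0]), x0=[3.0])
    print(x)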

Note that although the constraint g(x) < 0 may be ignored, it is considered in the problem formulation because most line search methods use discrete steps, and a step could lead to a point outside the feasible region (where the value of the barrier function is a large negative number) when close to the boundary. Therefore, the problem can effectively be treated as an unconstrained optimization problem only if an explicit check for feasibility is made.

In recent years, there has been much excitement because some variants of interior-point algorithms can be shown to be polynomial-time for many classes of convex programs. Moreover, interior-point codes are now proving to be highly competitive with codes based on other algorithms, such as the SQP algorithms presented subsequently. A number of free and commercial interior-point solvers are given in Tab. 1.1 below.

2.9.10 Sequential Quadratic Programming

Sequential quadratic programming (SQP) methods, also known as successive, or recursive, quadratic programming, employ Newton’s method (or quasi-Newton methods) to directly solve the KKT conditions for the original problem. As a result, the accompanying subproblem turns out to be the minimization of a quadratic approximation to the Lagrangian function subject to a linear approximation to the constraints. Hence, this type of process is also known as a projected Lagrangian, or the Newton-Lagrange, approach. By its nature, this method produces both primal and dual (Lagrange multiplier) solutions.

To present the concept of SQP, consider first the equality constrained nonlinear program:

min_x f(x)   s.t.  h(x) = 0

where x ∈ ℝ^nx, and f: ℝ^nx → ℝ, h: ℝ^nx → ℝ^nh are twice continuously differentiable. We shall also assume throughout that the gradients of the equality constraints are linearly independent at a solution x*. (The extension for including inequality constraints is considered subsequently.)

By Theorem 2.13, the first-order necessary conditions of optimality for an equality constrained NLP require a primal solution x* ∈ ℝ^nx and a Lagrange multiplier vector λ* ∈ ℝ^nh such that:

0 = ∇xℒ(x*, λ*) = ∇f(x*) + ∇h(x*) λ*
0 = ∇λℒ(x*, λ*) = h(x*)

where ℒ(x, λ) := f(x) + λᵀ h(x). Now, consider a Newton-like method to solve this system of nx + nh nonlinear equations. Given an iterate (xk, λk), a new iterate (xk+1, λk+1) is obtained by solving the first-order approximation:

0 = [ ∇xℒ(xk, λk) ; ∇λℒ(xk, λk) ] + [ ∇²xxℒ(xk, λk)  ∇h(xk) ; ∇h(xk)ᵀ  0 ] [ xk+1 − xk ; λk+1 − λk ]

Substituting ∇xℒ(xk, λk) = ∇f(xk) + ∇h(xk) λk, the first row of the system of equations reads:

∇²xxℒ(xk, λk)[xk+1 − xk] + ∇h(xk)[λk+1 − λk] + ∇f(xk) + ∇h(xk) λk = 0

The terms involving the Lagrange multiplier λk cancel out, and thus:

∇²xxℒ(xk, λk)[xk+1 − xk] + ∇h(xk) λk+1 + ∇f(xk) = 0

The second row of the system of equations is independent of the Lagrange multipliers; it reads:

∇h(xk)ᵀ[xk+1 − xk] + h(xk) = 0

Therefore, the system can be rewritten in the more compact form:

[ ∇²xxℒ(xk, λk)  ∇h(xk) ; ∇h(xk)ᵀ  0 ] [ dk ; λk+1 ] = [ −∇f(xk) ; −h(xk) ]

where we have defined the Newton step dk := xk+1 − xk. This system can be solved for (dk, λk+1), provided a solution exists. Setting xk+1 := xk + dk, and incrementing k by 1, we can then repeat the process until dk = 0 happens to solve the linear system. When this occurs, if at all, we shall have found a stationary point of the equality constrained NLP.

Interestingly enough, a quadratic programming (QP) subproblem can be employed in lieu of the foregoing linear system to find an optimal solution of the linearized problem. Given (xk, λk), we define the subproblem QP(xk, λk) as:

min_dk  f(xk) + ∇f(xk)ᵀ dk + ½ dkᵀ ∇²xxℒ(xk, λk) dk   s.t.  h(xk) + ∇h(xk)ᵀ dk = 0

First, note that an optimum dk of QP(xk, λk), if it exists, together with the set of Lagrange multipliers λk+1 associated with the linearized constraints, satisfies the foregoing linearized KKT conditions. Second, the objective function of QP(xk, λk) represents not just a quadratic approximation of f(x): it also incorporates the additional term ½ dkᵀ {Σ_{i=1}^{nh} λk,i ∇²hi(xk)} dk, which accounts for the curvature of the constraints.

Algorithm 2.5: SQP

Initialization Step:

Choose an initial primal/dual point (x0, λ0), let k = 0, and go to the main step.

Main Step:

  • Solve the quadratic subproblem QP(xk, λk) to obtain a solution dk along with a set of Lagrange multipliers λk+1.
  • If dk = 0, then (xk, λk+1) satisfies the stationarity conditions of the original NLP problem; stop. Otherwise, let xk+1 := xk + dk, replace k ← k + 1, and go to step 1.
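By way of illustration, one iteration of Algorithm 2.5 amounts to assembling and solving the KKT linear system derived above; a minimal Python sketch, with illustrative names and user-supplied callables for the objective gradient, constraint values, constraint Jacobian and Lagrangian Hessian, might read:

    import numpy as np

    def sqp_step(grad_f, h, jac_h, hess_L, x, lam):
        H, A = hess_L(x, lam), jac_h(x)        # A: nh-by-nx Jacobian of h
        nx, nh = H.shape[0], A.shape[0]
        # KKT system [[H, A'], [A, 0]] (d, lam+) = (-grad f, -h)
        K = np.block([[H, A.T], [A, np.zeros((nh, nh))]])
        rhs = -np.concatenate([grad_f(x), h(x)])
        sol = np.linalg.solve(K, rhs)
        d, lam_next = sol[:nx], sol[nx:]
        return x + d, lam_next                 # new iterate and multipliers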

In the case where x* is a regular stationary point for the NLP problem which, together with a set of Lagrange multipliers λ*, satisfies the second-order sufficiency conditions of Theorem 2.16, the matrix

W := [ ∇²xxℒ(x*, λ*)  ∇h(x*) ; ∇h(x*)ᵀ  0 ]

can be shown to be nonsingular. Hence, the above rudimentary SQP algorithm exhibits a quadratic rate of convergence by Theorem 2.21.

We now consider the inclusion of inequality constraints gi(x) ≤ 0, i = 1, …, ng, in the NLP problem, that is:

min_x f(x)   s.t.  h(x) = 0,  g(x) ≤ 0

where the functions gi are twice continuously differentiable.

Given an iterate (xk, λk, νk), where λk and νk ≥ 0 are the Lagrange multiplier estimates for the equality and inequality constraints, respectively, consider the following QP subproblem as a direct extension of QP(xk, λk):

min_dk  f(xk) + ∇f(xk)ᵀ dk + ½ dkᵀ ∇²xxℒ(xk, λk, νk) dk   s.t.  g(xk) + ∇g(xk)ᵀ dk ≤ 0,  h(xk) + ∇h(xk)ᵀ dk = 0

where ℒ(x, λ, ν) := f(x) + νᵀ g(x) + λᵀ h(x). Note that the KKT conditions for QP(xk, λk, νk) require that, in addition to primal feasibility, Lagrange multipliers λk+1, νk+1 be found such that:

∇f(xk) + ∇²xxℒ(xk, λk, νk) dk + ∇g(xk) νk+1 + ∇h(xk) λk+1 = 0
[g(xk) + ∇g(xk)ᵀ dk]ᵀ νk+1 = 0

with νk+1 ≥ 0 and λk+1 unrestricted in sign. Clearly, if dk = 0, then xk, together with λk+1, νk+1, yields a KKT solution to the original NLP problem. Otherwise, we set xk+1 := xk + dk as before, increment k by 1, and repeat the process. Regarding the convergence rate, it can be shown that if x* is a regular KKT point which, together with λ*, ν*, satisfies the second-order sufficient conditions of Theorem 2.18, and if (x0, λ0, ν0) is initialized sufficiently close to (x*, λ*, ν*), then the foregoing iterative procedure exhibits a quadratic convergence rate.

The SQP method, as presented thus far, obviously shares the disadvantages of Newton's method: (i) it requires the second-order derivatives ∇²xxℒ to be calculated, which in addition may fail to be positive definite, and (ii) it lacks the global convergence property.

  • Regarding second-order derivatives, a positive definite quasi-Newton approximation can be used for ∇²xxℒ: given a positive definite approximation Bk of ∇²xxℒ, the quadratic subproblem QP(xk, λk, νk) is simply solved with ∇²xxℒ replaced by Bk. Alternatively, an approximation of the inverse Hessian matrix can be obtained via a Broyden-like procedure (see 2.9.5), with γk given by
    γk := ∇xℒ(xk+1, λk+1, νk+1) − ∇xℒ(xk, λk+1, νk+1)

    It can be shown that this modification of the rudimentary SQP algorithm, similar to the quasi-Newton modification of Newton's algorithm, loses the quadratic convergence rate property. Instead, it can be shown that the convergence is superlinear when initialized sufficiently close to a solution (x*, λ*, ν*) that satisfies both regularity and second-order sufficiency conditions. However, this superlinear convergence rate relies strongly on the use of unit step sizes (see point (ii) below).

  • In order to remedy the global convergence deficiency, a globalization strategy can be used, e.g., a line search procedure. Unlike in unconstrained optimization, however, the choice of a suitable line search (merit) function providing a measure of progress is not obvious in the presence of constraints. Two popular choices of line search functions are:
    • The 1 Merit Function:
      ℓ1(x; μ) := f(x) + μ [Σ_{i=1}^{nh} |hi(x)| + Σ_{i=1}^{ng} max{0, gi(x)}]

      which satisfies the important property that x* is a local minimizer of ℓ1, provided (x*, λ*, ν*) satisfies the second-order sufficient conditions (see Theorem 2.18) and the penalty parameter μ is chosen such that μ > |λi*|, i = 1, …, nh, and μ > νi*, i = 1, …, ng. Yet, the ℓ1 merit function is not differentiable at those x with either gi(x) = 0 or hi(x) = 0, and it can be unbounded below even though x* is a local minimizer.

    • The 2 Augmented Lagrangian (ALAG) Merit Function:
      ℓ2(x, λ, ν; μ) := f(x) + Σ_{i=1}^{nh} λi hi(x) + (μ/2) Σ_{i=1}^{nh} [hi(x)]² + ½ Σ_{i=1}^{ng} ψi(x, ν; μ)

      with ψi(x, ν; μ) := (1/μ)(max{0, νi + μ gi(x)}² − νi²), has similar properties to the ℓ1 merit function, provided μ is chosen large enough, and is continuously differentiable (although its Hessian matrix is discontinuous). Yet, for x* to be a (local) minimizer of ℓ2(x, λ, ν; μ), it is necessary that λ = λ* and ν = ν*.

An SQP algorithm including the modifications discussed in (i) and (ii) is as follows:

Algorithm 2.6: Improved SQP

Initialization Step:

Choose an initial primal/dual point (x0, λ0, ν0), with ν0 ≥ 0, and a positive definite matrix B0. Let k = 0, and go to the main step.

Main Step:

  • Solve the quadratic subproblem QP(xk, λk, νk), with ∇²xxℒ(xk, λk, νk) replaced by Bk, to obtain a direction dk along with a set of Lagrange multipliers (λk+1, νk+1).
  • If dk = 0, then (xk, λk+1, νk+1) satisfies the KKT conditions of the original NLP problem; stop.
  • Find xk+1 := xk + αk dk, where αk reduces the merit function ℓ1(xk + α dk) over {α : α > 0} [or any other suitable merit function]. Update Bk to a positive definite matrix Bk+1 [e.g., according to the quasi-Newton update]. Replace k ← k + 1, and go to step 1.
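In practice, one rarely codes Algorithm 2.6 from scratch; for instance, scipy's SLSQP solver implements an SQP method with a line search and quasi-Newton updates. The small problem below is illustrative (note that scipy adopts the convention g(x) ≥ 0 for inequalities, opposite to the g(x) ≤ 0 convention used here):

    import numpy as np
    from scipy.optimize import minimize

    f = lambda x: x[0]**2 + x[1]**2
    cons = ({"type": "eq",   "fun": lambda x: x[0] + x[1] - 1.0},   # h(x) = 0
            {"type": "ineq", "fun": lambda x: x[0] - 0.1})          # scipy: g(x) >= 0
    res = minimize(f, x0=np.array([2.0, 0.0]), method="SLSQP", constraints=cons)
    print(res.x)   # approximately (0.5, 0.5)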

What Can Go Wrong?

The material presented up to this point was intended to give the reader an understanding of how SQP methods should work. Things do not always go so smoothly in practice, though. We now discuss a number of common difficulties that can be encountered, and suggest remedial actions to correct the deficiencies. Because real applications may involve more than a single difficulty, the user must be prepared to correct all problems before obtaining satisfactory performance from an optimization software.

  • Infeasible Constraints

    One of the most common difficulties occurs when the NLP problem has infeasible constraints, i.e., the constraints taken all together have no solution. Applying general-purpose SQP software to such problems typically produces one or more of the following symptoms:

    • one of the QP subproblems happens to be infeasible, which occurs when the linearized constraints have no solution;
    • many NLP iterations produce very little progress;
    • the penalty parameters μ of the merit functions grow very large;
    • the Lagrange multipliers become very large.

    Although robust SQP software attempts to diagnose this situation, ultimately the only remedy is to reformulate the NLP problem.

  • Rank-Deficient Constraints

    In contrast to the previous situation, it is possible for the constraints to be consistent, while the Jacobian matrix of the active constraints, at the solution point, is either ill-conditioned or rank deficient. This situation was illustrated in Examples 1.43 and 1.55. The application of general-purpose SQP software is likely to produce the following symptoms:

    • many NLP iterations produce very little progress;
    • the penalty parameters μ of the merit functions grow very large;
    • the Lagrange multipliers become very large;
    • the rank deficiency in the Jacobian of the active constraints is detected.

    Note that many symptoms of rank-deficient constraints are the same as those of inconsistent constraints. It is therefore quite common to confuse this deficiency with inconsistent constraints. Again, the remedy is to reformulate the problem.

  • Redundant Constraints

    A third type of difficulty occurs when the NLP problem contains redundant constraints. Two types of redundancy may be distinguished. In the first type, some of the constraints are unnecessary to the problem formulation, which typically results in the following symptoms:

    • the Lagrange multipliers are close to zero;
    • the solver has difficulty in detecting the active set.

    In the second type, the redundant constraints give rise to rank deficiency, and the problem then exhibits symptoms similar to the rank-deficient case discussed previously. Obviously, the remedy is to reformulate the problem by eliminating the redundant constraints.

  • Discontinuities

    Perhaps the biggest obstacle encountered in the practical application of SQP methods (as well as many other NLP methods, including SUM techniques) is the presence of discontinuous behavior. All of the numerical methods described herein assume continuous and differentiable objective functions and constraints. Yet, there are many common examples of discontinuous functions in practice, including: IF tests in codes; absolute value, min, and max functions; linear interpolation of data; internal iterations such as root finding; etc.

    Regarding SQP methods, the standard QP subproblems are no longer appropriate when discontinuities are present. In fact, the KKT necessary conditions simply do not apply! The most common symptoms of discontinuous functions are:

    • the iterates converge slowly or, even, diverge;
    • the line search takes very small steps (α → 0);
    • the Hessian matrix becomes badly ill-conditioned.

    The remedy consists in reformulating discontinuous problems into smooth ones: for absolute value, min, and max functions, tricks can be used that introduce slack variables and additional constraints; linear data interpolation can be replaced by higher-order interpolation schemes that are continuous through second derivatives; internal iterations can also be handled via additional NLP constraints; etc.

  • Inaccurate Gradient Estimation

    Any SQP code requires that the user supply the objective function and constraint values, as well as their gradients (and possibly their Hessians too). In general, the user is offered the option to calculate the gradients via finite differences, e.g.,

    ∇f(x) ≈ [f(x + δx) − f(x)] / δx

    However, this may cause the solver to stop prematurely. First of all, the choice of the perturbation vector δx is highly nontrivial. While too large a value clearly provides inaccurate estimates, too small a value may also result in very poor estimates due to finite-precision arithmetic. Therefore, one must try to find a trade-off between these two extreme situations. The difficulty stems from the fact that a trade-off may not necessarily exist if the requested accuracy for the gradient is too high; in other words, the error made in the finite-difference approximation of a gradient cannot be made as small as desired. Further, the maximum accuracy that can be achieved with finite differences is both problem dependent (e.g., badly scaled functions are more problematic than well-scaled ones) and machine dependent (e.g., double-precision computations provide more accurate estimates than single-precision ones). Typical symptoms of inaccurate gradient estimates in an SQP code are:

    • the iterates converge slowly, and the solver may stop prematurely at a suboptimal point (jamming);
    • the line search takes very small steps (α → 0).

    The situation can be understood as follows. Assume that the gradient estimate is contaminated with noise. Then, instead of computing the true value ∇ℒ(x), we get ∇ℒ(x) + 𝜖. But since the iteration seeks a point such that ∇ℒ(x) = 0, we can expect either a degraded rate of convergence or, worse, no convergence at all, because the gradient is ultimately dominated by noise.

    To avoid these problems, the user should always consider providing the gradients explicitly to the SQP solver, instead of relying on finite-difference estimates. For large-scale problems, this is obviously a time-consuming and error-prone task. In response to this, efficient algorithmic differentiation tools (also called automatic differentiation) have been developed within the last fifteen years. The idea behind them is that, given a piece of program calculating a number of function values (e.g., in Fortran77 or C), an auxiliary program is generated that calculates the derivatives of these functions.
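    The accuracy trade-off discussed above is easy to observe numerically. In the following Python sketch (with the illustrative choice f = sin at x = 1), the forward-difference error first decreases as the perturbation shrinks, then grows again once round-off dominates:

        import numpy as np

        f, df = np.sin, np.cos
        for dx in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12]:
            est = (f(1.0 + dx) - f(1.0)) / dx          # forward difference
            print(f"dx = {dx:.0e}   error = {abs(est - df(1.0)):.2e}")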

  • Scaling

    Scaling affects everything! Poor scaling can make a good algorithm behave badly. Scaling changes the convergence rate, termination tests, and numerical conditioning.

    The most common way of scaling a problem is by introducing scaled variables of the form

    x̃k := uk xk + rk

    for k=1,...,nx with uk and rk being scale weights and shifts, respectively. Likewise, the objective function and constraints are commonly scaled using

    f̃ := ω0 f,   g̃k := ωk gk

    for k = 1, …, ng. The idea is to let the optimization algorithm work with well-scaled quantities in order to improve performance. However, what well-scaled quantities are is hard to define, although conventional wisdom suggests the following hints:

    • normalize the independent variables to have the same range, e.g., 0 ≤ x̃k ≤ 1;
    • normalize the dependent functions to have the same magnitude, e.g., ‖f̃‖ ≈ ‖g̃1‖ ≈ … ≈ ‖g̃ng‖ ≈ 1;
    • normalize the rows and columns of the Jacobian to be of the same magnitude;
    • scale the dependent functions so that the Lagrange multipliers are close to one, e.g., |λ1| ≈ … ≈ |λng| ≈ 1; etc.
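    A minimal Python sketch of this variable scaling is given below; deriving the weights and shifts from finite variable bounds, so that 0 ≤ x̃k ≤ 1, is an assumption of this sketch (all names are illustrative):

        import numpy as np
        from scipy.optimize import minimize

        def scaled_minimize(f, x0, lower, upper):
            # scale weights u and shifts r chosen so that x~ = u*x + r lies in [0, 1]
            u = 1.0 / (np.asarray(upper, float) - np.asarray(lower, float))
            r = -u * np.asarray(lower, float)
            unscale = lambda xt: (xt - r) / u          # map scaled variables back
            res = minimize(lambda xt: f(unscale(xt)), u * np.asarray(x0, float) + r)
            return unscale(res.x)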