© University College London 2019 – F.A. Fernandez
Numerical Methods
1. Introduction

Most mathematical problems in engineering and physics as well as in many other disciplines
cannot be solved exactly (analytically) because of their complexity. They have to be tackled either
by using analytical approximations, that is by choosing a simplified (and then only approximately
correct) analytical formulation that can then be solved exactly, or by using numerical methods. In
both cases, approximations are involved but this need not represent a disadvantage since in most
practical cases, the parameters used in the mathematical model (the basic equations) will only be
known approximately, by measurements or otherwise. Furthermore, even if the accuracy is high
there is often no practical advantage of having an exact solution, when only a few decimal figures
will have practical significance for interpretation or implementation.
An ideal property sought in the design of numerical methods is that they should be "exact-in-the-limit": if we refine the discretization, take more terms in a series, or perform a larger number of iterations, etc. (depending on the nature of the method), the result should only improve, approaching the exact solution asymptotically. Of course, not all methods have this property.
Since approximations will be present, an important aspect of using numerical methods is to
be able to determine the magnitude of the error involved. Naturally, we cannot know the error
itself; that would require knowing the exact solution! But we can determine error bounds, or limits, so we
know that the error cannot exceed a certain value.
There are several sources for these errors. One of them could be an inaccurate mathematical
model; that is, the mathematical description of the problem does not represent correctly the
physical situation. Possible reasons for this are:

Incomplete or incorrect theory
Idealizations/Approximations/Uncertainty

The second could be due, for example, to idealization of the geometry of the problem or of the
properties of the materials involved, and to uncertainty in the values of material parameters, etc.
This type of error concerns the mathematical model, and the only way to reduce it is
to improve the mathematical description of the physical problem. It does not concern the
numerical calculation. It is, though, an important part of the more general problem of "numerical
or computer modelling".

Errors concerning the numerical calculations can be of three kinds:
i) Blunder: mistakes, computer or programming bugs. We can never completely rule out this type of error.
In practice, the possibility of its occurrence must always be considered, in particular when the
result is not known approximately in advance, since then there is no way to judge the
'reasonableness' of the result obtained.
ii) Discretization or Truncation error: This is an approximation error, where for example
the value of a function is calculated with a finite number of terms in an infinite series, or an integral
is estimated by the sum of a finite number of trapezoidal areas over (non-infinitesimal) segments.
It is important to have an estimate or a bound (limit) for this type of error.
iii) Round off error: (Sometimes called truncation error referring to the truncation in the
number of decimal points). Arithmetic calculations can almost never be carried out with complete accuracy. Most numbers have infinite decimal representation, which must be rounded. But even
if the data in a problem can be initially expressed exactly by finite decimal representation, divisions
may introduce numbers that must be rounded; multiplication will also introduce more digits. This
type of error has a random character that makes it difficult to deal with.
So far the error we have been discussing is the absolute error or the error defined by:
error = true value – approximation. A problem with this definition is that it doesn’t take into
account the magnitude of the value being measured, for example an absolute error of 1 cm has a
very different significance in the length of a 100 m bridge or of a 10 cm bolt. Another definition,
that reflects this significance is the relative error defined as: relative error = absolute
error/true value. In the example, the length in each case has a relative error of 10⁻⁴ and 0.1
respectively; or in percent, 0.01% and 10%.
2. Machine representation of numbers
Not only do numbers have to be truncated to be represented (for manual calculation or in a
machine), but all the intermediate results are subject to the same restriction (over the
complete calculation process). Of the continuum of real numbers, only a discrete set can be
represented. Additionally, only a limited range of numbers can be represented in a given machine.
Also, even if two numbers can be represented, it is possible that the result of an arithmetic
operation between these numbers is not representable. Additionally, there will be occasions where
results fall outside the range of representable numbers (underflow or overflow errors).
Example: Some computers accept real numbers in a floating-point format, with say, two digits for
the exponent; that is, in the form:
mantissa exponent
± 0.xxxxxxE ±xx

which can only represent numbers y in the range 10⁻¹⁰⁰ ≤ |y| < 10⁹⁹.
This is a normalized form, where the mantissa is defined such that:

0.1 ≤ mantissa < 1

We will see this in more detail next.

2.1 Floating-Point Arithmetic and Roundoff Error
Any number can be represented in a normalised format in the form x = ±a×10ᵇ in a decimal
representation. We can even use any other number as the exponent base and we have a more
general form x = ±a×βᵇ; common examples are binary, octal and hexadecimal representations.
In all of these, a is called the mantissa, a number between 0.1 and 1 in magnitude, which can have
infinitely many digits (for example for π or for 1/3 or 1/9), β is the base (10 for decimal representation) and b is the
exponent (positive or negative), which can be arbitrarily large.
If we want to represent these numbers in a computer or use them in calculations, we need to
truncate both the mantissa (to a certain number of places), and the exponent, which limits the range
(or size) of numbers that can be considered.

Now, if A is the set of numbers exactly representable in a given machine, the question arises
of how to represent a number x not belonging to A ( x ∉A ).
This is encountered not only when reading data into a computer, but also when representing
intermediate results in the computer during a calculation. Results of the elementary arithmetic
operations between two numbers need not belong to A. Let's see first how a number is represented (truncated) in a machine.
A machine representation can in most cases be obtained by rounding:

x ––––––––––––– fl(x)

Here and from now on, fl(x) will represent the rounded form of x (that is, with a rounded
mantissa and limited exponent), and not just its representation in normalized floating-point format.
For example, in a computer with t = 4 digit representation for the mantissa:

fl(π) = 0.3142E 1
fl(0.142853) = 0.1429E 0
fl(14.28437) = 0.1428E 2

In general, x is first represented in a normalized form (floating-point format): x = ±a×10ᵇ
(if we consider a decimal system), where a is the mantissa and b the exponent, with 10⁻¹ ≤ |a| < 1
(so that the first digit of a after the decimal point is not 0).
Then suppose that the decimal representation of |a| is given by:

|a| = 0.α₁α₂α₃α₄ ... αᵢ ...     with 0 ≤ αᵢ ≤ 9, α₁ ≠ 0     (2.1)

(where a can have infinitely many digits!)

Then, we form:
a' = 0.α₁α₂···αₜ            if 0 ≤ αₜ₊₁ ≤ 4
a' = 0.α₁α₂···αₜ + 10⁻ᵗ     if αₜ₊₁ ≥ 5          (2.2)

That is, only t digits are kept in the mantissa and the last one is rounded: αₜ is incremented
by 1 if the next digit αₜ₊₁ ≥ 5, and all digits after αₜ are discarded.

The machine representation (truncated form of x) will be:

fl(x) = sign(x)·a'·10ᵇ          (2.3)

The exponent b must also be limited. So a floating-point system will be characterized by
the parameters: t, length of the mantissa; β, the base (10 in the decimal case as above) and L and
U, the limits for the exponent b: L ≤ b ≤ U.
To clarify some of the features of the floating-point representation, let’s examine the system
(β, t, L, U) = (10, 2, –1, 2).
The mantissa can have only two digits: Then, there are only 90 possible forms for the
mantissa (excluding 0 and negative numbers):

0.10, 0.11, 0.12, . . . , . . , 0.98, 0.99

The exponent can vary from –1 to 2, so it can only take 4 possible values: –1, 0, 1 and 2.
Then, including now zero and negative numbers, this system can represent exactly only 2×90×4 +
1 = 721 numbers: The set of floating-point numbers is finite.
The smallest positive number in the system is 0.10×10⁻¹ = 0.01
and the largest is 0.99×10² = 99.
We can also see that the spacing between numbers in the set varies; the numbers are not equally spaced: at the small end (near 0): 0, 0.01, 0.02, etc. the spacing is 0.01, while at the other extreme: 97, 98, 99 the spacing is 1 (100 times bigger in absolute terms).
In a system like this, we are interested in the relative error produced when any number is
truncated and rounded in order to be represented in the system. (We are here excluding the
underflow or overflow errors produced when we try to represent a number which is outside the
range defined by L and U, for example 1000 or even 100 in the example above.)

It can be shown that the relative error of fl(x) is bounded by:



|fl(x) − x| / |x| ≤ 5×10⁻ᵗ = eps          (2.4)

This limit eps is defined here as the machine precision.

Demonstration:
From (2.1) and (2.2), the normalized decimal representation of x and its truncated floating-
point form, we have that the maximum possible difference between the two forms is 5 at the
decimal position t+1, that is:
|fl(x) − x| ≤ 5×10⁻⁽ᵗ⁺¹⁾ × 10ᵇ

also, since |x| ≥ 0.1×10ᵇ, or 1/|x| ≤ 10×10⁻ᵇ, we obtain the condition (2.4).

From equation (2.4) we can write:

fl(x) = x(1+ε) (2.5)

where ε ≤ eps, for all numbers x. The quantity (1+ε) in (2.5) cannot be distinguished from
1 in this machine representation, and the maximum value of ε is eps. So, we can also define the
machine precision eps as the smallest positive machine number g for which 1 + g > 1.

Definition: machine precision eps= min g g +1 >1,g > 0{ } (2.6)
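As an illustration, the following short Matlab sketch finds the machine precision by halving a trial value g until 1 + g is no longer distinguishable from 1 (a minimal sketch, assuming standard IEEE double precision arithmetic; Matlab's built-in eps is shown only for comparison):

g = 1;
while 1 + g/2 > 1
    g = g/2;            % halve g until 1 + g/2 rounds to 1
end
disp(g)                 % smallest power of 2 with 1 + g > 1 (about 2.2e-16)
disp(eps)               % Matlab's built-in machine precision, for comparison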


2.2 Error in basic operations
The result of arithmetic operations between machine numbers will also have to be
represented as machine numbers. Then, for each arithmetic operation, assuming that x and y are
already machine numbers, we will have:

x + y  –––we get–––  fl(x + y) = (x + y)(1 + ε₁)          (2.7)
x − y  ––––––––––––  fl(x − y) = (x − y)(1 + ε₂)          (2.8)
x*y   ––––––––––––  fl(x*y) = (x*y)(1 + ε₃)              (2.9)
x/y   ––––––––––––  fl(x/y) = (x/y)(1 + ε₄)   with all |εᵢ| ≤ eps          (2.10)

If x and y are not floating-point numbers (machine numbers) they will have to be converted
first giving:
x + y –––––––––––– fl(x + y) = fl(fl(x) + fl(y))


and similarly for the rest.

Let’s examine for example the subtraction of two such numbers: z = x – y , ignoring higher
order error terms:

fl(z) = fl(fl(x) – fl(y))
= (x(1 + ε1) – y(1 + ε2))(1 + ε3)
= ((x – y) + x ε1 – yε2)(1 + ε3)
= (x – y) + x ε1 – yε2 + (x – y)ε3
Then:

(fl(z) − z)/z = ε₃ + (xε₁ − yε₂)/(x − y),   so   |fl(z) − z| / |z| ≤ (1 + (|x| + |y|)/|x − y|)·eps

We can see that if x approaches y the relative error can blow up, especially for large values of x
and y. The maximal error bounds are pessimistic and in practical calculations, errors might tend
to cancel. For example, in adding 20000 numbers rounded to, say, 4 decimal places the maximum
error will be 0.5×10⁻⁴ × 20000 = 1 (imagining the maximum absolute truncation of 0.00005 in every
case) while it is extremely improbable that this case occurs. From a statistical point of view, one
can expect that in about 90% of the cases, the error will not exceed 0.005.
Example
Let’s compute the difference between a = 1200 and b = 1194 using a floating-point
system with a 3-digit mantissa:

fl(a − b) = fl(fl(1200) − fl(1194)) = fl(0.120×10⁴ − 0.119×10⁴)
= 0.001×10⁴ = 10

where the correct value is 6, giving a relative error of: 0.667 (or 66.7%).
The machine precision for this system is eps = 5×10⁻ᵗ = 5×10⁻³ and the error bound above gives a
limit for the relative error of

eps·(1 + (|a| + |b|)/|a − b|) = 5×10⁻³·(1 + 2394/6) = 2.0

2.3 Error propagation in calculations

One of the important tasks in computer modelling is to find algorithms where the error
propagation remains bounded.
In this context, an algorithm is defined as a finite sequence of elementary operations that
prescribe how to calculate the solution to a problem from given input data (as in a sequence of
computer instructions). Problems can arise when one is not careful, as is shown in the following
example:
Example
Assume that we want to calculate the sum of three floating-point numbers: a, b, c. This
has to be done in sequence, that is, using any of the next two algorithms:

i) (a + b) + c or
ii) a + (b + c)



If the numbers are in floating-point format with t = 8 decimals and their values are for example:

a = 0.23371258E-4
b = 0.33678429E 2
c = -0.33677811E 2

The two algorithms will give the results: i) 0.64100000E-3
ii) 0.64137126E-3

The exact result (which needs 14 decimal digits to calculate) is
0.641371258E-3
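The effect can be reproduced on any machine with a short mantissa. The sketch below uses single precision as a stand-in for the 8-digit decimal mantissa of the example (an illustration only; the exact digits obtained depend on the arithmetic used):

a = single(0.23371258e-4);
b = single(0.33678429e2);
c = single(-0.33677811e2);
s1 = (a + b) + c;      % algorithm i)
s2 = a + (b + c);      % algorithm ii)
fprintf('%.8e\n%.8e\n', s1, s2)   % the two sums typically differ in the last digits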
Exercise 2.1
Show, using an error analysis, why the case ii) gives a more accurate result for the numbers of the
example above. Neglect higher order error terms; that is, products of the form: ε1ε2.
Example
Determine the error propagation in the calculation of y = (x − a)² using floating-point
arithmetic, by two different algorithms, when x and a are already floating-point numbers.

a) direct calculation: y = (x − a)²

fl(y) = [(x − a)(1 + ε₁)]²(1 + ε₂)
fl(y) = (x − a)²(1 + ε₁)²(1 + ε₂)

and, preserving only first-order error terms:

fl(y) = (x − a)²(1 + ε₁)²(1 + ε₂) ≈ (x − a)²(1 + 2ε₁)(1 + ε₂) ≈ (x − a)²(1 + 2ε₁ + ε₂)

then:   fl(y) − y = (x − a)²(2ε₁ + ε₂)   or

Δy = (fl(y) − y)/y = 2ε₁ + ε₂

We can see that the relative error in the calculation of y using this algorithm is given by 2ε₁ + ε₂,
so it is less than 3 eps.
b) Using the expanded form: y = x² − 2ax + a²

fl(y) = { [x²(1 + ε₁) − 2ax(1 + ε₂)(1 + ε₃)](1 + ε₄) + a²(1 + ε₅) }(1 + ε₆)

That is, taking the square of x first (with its error) and subtracting the product 2ax (with its errors)
and the error in the subtraction, and then adding the last term with its error and the corresponding
error due to that addition. Expanding this, and keeping only first-order error terms, we get:

fl(y) = x²(1 + ε₁ + ε₄ + ε₆) − 2ax(1 + ε₂ + ε₃ + ε₄ + ε₆) + a²(1 + ε₅ + ε₆)



fl(y) = x² − 2ax + a² + x²(ε₁ + ε₄) − 2ax(ε₂ + ε₃ + ε₄) + a²ε₅ + (x² − 2ax + a²)ε₆

from where we can write:

Δy = (fl(y) − y)/y = ε₆ + [x²/(x − a)²](ε₁ + ε₄) − [2ax/(x − a)²](ε₂ + ε₃ + ε₄) + [a²/(x − a)²]ε₅

and we can see that there will be problems with this calculation if (x − a)² is too small compared
with either x² or a². The first term above is bounded by eps, while the second will be 2 eps
multiplied by the amplification factor x²/(x − a)². For example, if x = 15 and a = 14, the three
amplification factors will be respectively 225, 420 and 196, which gives a total error bound of
(1 + 450 + 1260 + 196) eps = 1907 eps, compared with 3 eps for algorithm a).
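A quick numerical check of this analysis can be made with the sketch below, again using single precision as a stand-in for a short mantissa; the values of x and a are hypothetical, chosen only to force strong cancellation in (x − a)²:

x = single(1.00007);  a = single(1.00001);
ya = (x - a)^2;                 % algorithm a): subtract first, then square
yb = x^2 - 2*a*x + a^2;         % algorithm b): expanded form
exact = (double(x) - double(a))^2;      % reference value for the stored x and a
fprintf('a): %.6e  relative error %.1e\n', ya, abs(ya - exact)/exact)
fprintf('b): %.6e  relative error %.1e\n', yb, abs(yb - exact)/exact)

With values like these, algorithm a) keeps the relative error near eps while the expanded form loses most of its significant digits, in line with the amplification factors above.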

Exercise 2.2
For y = a – b compare the error bounds when a and b are and are not already defined as
floating-point numbers.
Exercise 2.3
Determine the error propagation characteristics of two algorithms to calculate
a) y = x² − a² and b) y = (x − 1)³. Assume in both cases that x and a are floating-point
numbers.



3. Root Finding: Solution of nonlinear equations
This is a problem of very common occurrence in engineering and physics. We’ll study in
this section several numerical methods to find roots of equations. The methods can be basically
classified as “bracketing methods” and “open methods”. In the first case, the solution lies inside
an interval limited by a lower and an upper bound. The methods are always convergent as the
iterations progress. In contrast, the open methods are based on procedures that require one or two
starting points that do not necessarily enclose the root. Because of this, they sometimes diverge.
However, when they do converge, they usually do so faster than the bracketing methods.
3.1 Bracketing Methods
3.1.1 The Bisection Method
If a function is continuous in the interval [a, b] and f(a) and f(b) are of different sign
(f(a)f(b)<0), then at least one root of f(x) will lie in that interval.
The bisection method is an iterative procedure that starts with two points where f(x) has
different sign, that is, that “bracket” a root of the function.
We now define the point

c = (a + b)/2          (3.1)
as the midpoint of the interval [a, b], and examine the sign of f(c). There are now three
possibilities: f(c) = 0, in which case the solution is c; f(c) is positive; or f(c) is negative. The next
interval in which to search for the root will be either [a, c] or [c, b], according to whether f(a) and f(c),
or f(c) and f(b), are of different sign. (Equivalently, we can search for the case f(a)f(c) < 0 or f(c)f(b) < 0.)
The process then continues, each time halving the size of the search interval and "bracketing" the solution as
shown in the figure below.
Fig. 3.1 Fig. 3.2
Convergence and error
Since we know that the solution lies in the interval [a, b], the absolute error for c is bounded
by (b − a)/2. Then, we can see that after n iterations, halving the interval in each iteration, the search
interval will have reduced in length to (b − a)/2ⁿ, so the maximum error after n iterations is:

|α − cₙ| ≤ (b − a)/2ⁿ          (3.2)

where α is the exact position of the root and cₙ is the nth approximation found by this method.
Furthermore, if we want to find the solution with a tolerance ε (that is, |α − cₙ| ≤ ε), we can





calculate the maximum number of iterations required from the expression above. Naturally, if at
one stage the solution lies at the middle of the current interval the search finishes early.
An approximate relative error (or percent error) at iteration n+1 can be defined as:

ε = (cₙ₊₁ − cₙ)/cₙ₊₁

but from the figure above we can see that cₙ₊₁ − cₙ = (bₙ₊₁ − aₙ₊₁)/2 and since cₙ₊₁ = (aₙ₊₁ + bₙ₊₁)/2,
the relative error at iteration n+1 can be written as:

ε = (bₙ₊₁ − aₙ₊₁)/(bₙ₊₁ + aₙ₊₁)          (3.3)

This expression can also be used to stop iterations.

Exercise 3.1
Demonstrate that the number of iterations required to achieve a tolerance ε is the integer that
satisfies:
n ≥ [log(b − a) − log ε] / log 2          (3.4)

Example
The function f(x) = cos(3x) has one root in the interval [0, 1] (x_r = π/6 = 0.523598775598).
The following simple Matlab program implements the bisection method to find this root.

a=0; b=1;      %limits of the search interval [a,b]
eps=1e-6;      %sets tolerance to 1e-6
fa=ff(a);
fb=ff(b);
if (fa*fb > 0)   % is the root not between a and b?
    disp('root not in the interval selected')
else             % the root is between a and b
    n=ceil((log(b-a)-log(eps))/log(2));  %number of iterations, rounded up to closest integer
    disp('Iteration number        a               b               c')
    for i=1:n
        c=a+0.5*(b-a);   %c is set as the midpoint between a and b
        disp([sprintf('%8d',i),'        ',sprintf('%15.8f',a,b,c)])
        fc=ff(c);
        if (fa*fc < 0)       %the root is between a and c
            b=c;
            fb=fc;
        elseif (fa*fc > 0)   %the root is between c and b
            a=c;
            fa=fc;
        else
            return           %f(c) = 0: c is the root
        end
    end


end

together with the function definition:
function y=ff(x)
%****************************
y=cos(3*x);
%****************************

And the corresponding results are:
Iteration number a b c
1 0.00000000 1.00000000 0.50000000
2 0.50000000 1.00000000 0.75000000
3 0.50000000 0.75000000 0.62500000
4 0.50000000 0.62500000 0.56250000
5 0.50000000 0.56250000 0.53125000
6 0.50000000 0.53125000 0.51562500
7 0.51562500 0.53125000 0.52343750
8 0.52343750 0.53125000 0.52734375
9 0.52343750 0.52734375 0.52539063
10 0.52343750 0.52539063 0.52441406
11 0.52343750 0.52441406 0.52392578
12 0.52343750 0.52392578 0.52368164
13 0.52343750 0.52368164 0.52355957
14 0.52355957 0.52368164 0.52362061
15 0.52355957 0.52362061 0.52359009
16 0.52359009 0.52362061 0.52360535
17 0.52359009 0.52360535 0.52359772
18 0.52359772 0.52360535 0.52360153
19 0.52359772 0.52360153 0.52359962
20 0.52359772 0.52359962 0.52359867

Provided that the solution lies in the initial interval, and since the search interval is
continually divided by two, we can see that this method will always converge to the solution and
will find it within a required precision in a finite number of iterations.
However, due to the rather blind choice of solution (it is always chosen as the middle of the
interval), the error doesn’t vary monotonically. For the previous example, f(x) = cos(3x) = 0:
Iteration
number c f(c) error %
1 0.50000000 0.07073720 4.50703414
2 0.75000000 -0.62817362 -43.23944878
3 0.62500000 -0.29953351 -19.36620732
4 0.56250000 -0.11643894 -7.42958659
5 0.53125000 -0.02295166 -1.46127622
6 0.51562500 0.02391905 1.52287896
7 0.52343750 0.00048383 0.03080137
8 0.52734375 -0.01123469 -0.71523743
9 0.52539063 -0.00537552 -0.34221803
10 0.52441406 -0.00244586 -0.15570833
11 0.52392578 -0.00098102 -0.06245348
12 0.52368164 -0.00024860 -0.01582605
13 0.52355957 0.00011762 0.00748766


14 0.52362061 -0.00006549 -0.00416920
15 0.52359009 0.00002606 0.00165923
16 0.52360535 -0.00001971 -0.00125498
17 0.52359772 0.00000317 0.00020212
18 0.52360153 -0.00000827 -0.00052643
19 0.52359962 -0.00000255 -0.00016215
20 0.52359867 0.00000031 0.00001998

We can see that the error does not decrease monotonically, although in the end it has to become small. This
is due to the rather "brute force" nature of the algorithm. The approximation to the solution is
chosen blindly as the midpoint of the interval, without any attempt at guessing its position inside
the interval. For example, if at some iteration n the magnitudes of f(aₙ) and f(bₙ) are very
different, say |f(aₙ)| >> |f(bₙ)|, it is likely that the solution is closer to bₙ than to aₙ, if the function
is smooth.
A possible way to improve it is to select the point c by interpolating the values at a and b.
This is called the "regula falsi" method or method of false position.

3.1.2 Regula Falsi Method
In this case, the next point is obtained by linear interpolation between the values at a and b.


From the figure (left), we can see that:

f(b)/f(a) = (c − b)/(c − a)          (3.5)

from where

c = [a·f(b) − b·f(a)] / [f(b) − f(a)]

or alternatively:

c = a + f(a)(a − b)/[f(b) − f(a)]          (3.6)

Fig. 3.3
The algorithm is the same as the bisection method except for the calculation of the point c.
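A minimal sketch of the change, reusing the variables of the bisection script above (a, b, fa, fb and the function ff): only the computation of c changes, following equation (3.6).

c = (a*fb - b*fa)/(fb - fa);   % regula falsi: root of the interpolating line
% the sign test and the interval update that follow remain exactly as before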

In this case, for the same function, f(x) = cos(3x) = 0, the solution, within the same absolute
tolerance (10⁻⁶), is found in only 4 iterations:

Iteration number a b c f(c)
1 0.00000000 1.00000000 0.50251446 0.06321078
2 0.50251446 1.00000000 0.53237237 -0.02631774
3 0.50251446 0.53237237 0.52359536 0.00001025
4 0.52359536 0.53237237 0.52359878 -0.00000000

and for the error:
Iter.
number c f(c) error %
1 0.50251446 0.06321078 4.02680812
2 0.53237237 -0.02631774 -1.67563307
3 0.52359536 0.00001025 0.00065258
4 0.52359878 -0.00000000 -0.00000008

We can see that the error decreases much more rapidly than in the bisection method. The size of
the interval also decreases more rapidly. In this case the successive values are:
Iter.
number b-a b-a in bisection method
1 0.49748554 0.5
2 0.02985791 0.25
3 0.00877701 0.125
4 0.00000342 0.0625

However, this is not necessarily always the case and, on some occasions, the search interval can
remain large. In particular, one of the limits can remain stuck while the other converges to the
solution. In that case the length of the interval tends to a finite value instead of converging to zero.
In the following example, for the function f(x) = x¹⁰ − 1, the solution requires 70 iterations to reach
a tolerance of 10⁻⁶ with the regula falsi method while only 24 are needed with the bisection method.
We can also see that the right side of the interval remains stuck at 1.3 and the size of the interval
will tend to 0.3 in the limit instead of converging to zero. The figure shows the interpolating lines
at each iteration. The corresponding approximations are the points where these lines cross the x-axis.

Fig. 3.4 Standard regula falsi Fig. 3.5 Modified regula falsi
3.1.3 Modified regula falsi method
A modified version of the regula falsi method that improves this situation consists of an algorithm
that detects when one of the limits gets stuck and then divides the function value retained at that limit by 2, changing
the slope of the interpolating line. For the same example, this modified version reaches the
solution in 13 iterations. The sequence of approximations and interpolating lines is shown in the
figure above (right).

3.2 Open Methods
These methods start with only one point or two but not necessarily bracketing the root. One of the
simplest is the fixed point iteration.
3.2.1 Fixed point iteration
This method is also called one-point iteration or successive substitution. For a function of the
form f(x) = 0, the method simply consists of re-arranging it into the form x = g(x).

Example
For the function 0.5x² − 1.1x + 0.505 = 0, the algorithm can be set as x = 0.5x² − 0.1x + 0.505 or,
equivalently, x = [(x − 0.1)² + 1]/2. With an initial guess introduced in the right-hand side, a new value of x is
obtained and the iteration can continue.
Starting from the value x₀ = 0.5, the successive values are:

Iteration number x error %
0 0.5 23.40526755
1 0.58000000 11.15011036
2 0.61520000 5.757841193
3 0.63271552 3.074648057
4 0.64189291 1.668768191
5 0.64682396 0.913383012
6 0.64950822 0.502182715
7 0.65097964 0.276776655
8 0.65178928 0.152748337
9 0.65223571 0.08436105
10 0.65248214 0.046610413
11 0.65261826 0.025758511
12 0.65269347 0.014236789


Fig. 3.6 Plot of x and g(x) Fig. 3.7
Close-up showing the successive approximations
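The iteration of the table above can be reproduced with a few lines of Matlab (a minimal sketch using the form x = 0.5x² − 0.1x + 0.505 and the starting value x₀ = 0.5):

g = @(x) 0.5*x.^2 - 0.1*x + 0.505;
x = 0.5;
for k = 1:12
    x = g(x);          % successive substitution x_{k+1} = g(x_k)
    fprintf('%2d  %.8f\n', k, x)
end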

Convergence
This is a very simple method but solutions are not guaranteed. The following figures show
situations when the method converges and when it diverges.
In cases (a) and (b) the method converges, while in cases (c) and (d) it diverges.


(a) convergence

(b) convergence

(c) divergence

(d) divergence
Fig. 3.8 (a – d)

From the plots above we can see that it is relatively easy to determine when the method will
converge, so the best way of ensuring success is to plot the functions y = g(x) and y = x. In a
more rigorous way, we can also see that for convergence to occur the slope of g(x) should be less
than that of x in the region of search; that is, |g'(x)| < 1.
If divergence is predicted, a different form of re-writing the problem f(x) = 0 in the form
x = g(x) needs to be found that satisfies the condition above.
For example, for the function f(x) = 3x² + 3x − 1 = 0, with a solution at x₀ = 0.2637626, we
can separate it in the following two forms:

(a)  x = g(x) = (1 − 3x²)/3     and     (b)  x = g(x) = 3x² + 4x − 1

In the first case, g'(x) = −2x, so g'(x₀) = −0.5275252, while for the second case
g'(x) = 6x + 4 and then g'(x₀) = 5.5825757.










Fig. 3.9(a)  x = g(x) = (1 − 3x²)/3

Fig. 3.9(b)  x = g(x) = 3x² + 4x − 1
This illustrates the main deficiency of this method. Convergence often depends on how the
problem is formulated. Additionally, divergence can also occur if the initial guess is not
sufficiently close to the solution.
3.2.2 Newton-Raphson Method
This is one of the most widely used methods for root finding. It also needs only one point to start
the iterations but, unlike the fixed-point iteration, it will converge to the solution
provided the function is monotonically varying in the region of interest.
Starting from a point x0, a tangent to the function f(x) (a line with the slope of the derivative
of f) is extrapolated to find the point where it crosses the x-axis, providing a new approximation.
The same procedure is repeated until the error tolerance is achieved. The method needs repeated
evaluation of the function and its derivative and an appropriate stopping criterion is the value of
the function at the successive approximations: f(xn).


Fig. 3.10
From the figure we can see that at stage n:

f'(xₙ) = f(xₙ)/(xₙ − xₙ₊₁)

then the next
approximation is found as:

xₙ₊₁ = xₙ − f(xₙ)/f'(xₙ).          (3.7)
Example
For the same function as in the previous examples, f(x) = cos(3x) = 0, and for the same
tolerance of 10⁻⁶, the solution is found in 3 iterations starting from x₀ = 0.3 (after 3 iterations
the accuracy is better than 10⁻⁸). Starting from 0.5, only 2 iterations are sufficient.

Iteration number x f(x)
0 0.30000000 0.62160997
1 0.56451705 -0.12244676
2 0.52339200 0.00062033
3 0.52359878 -0.00000000
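A minimal Matlab sketch of these iterations (the derivative of f(x) = cos(3x) is written out by hand as f'(x) = −3 sin(3x)):

f  = @(x) cos(3*x);
df = @(x) -3*sin(3*x);
x = 0.3;
for k = 1:3
    x = x - f(x)/df(x);               % eq. (3.7)
    fprintf('%d  %.8f  %.8f\n', k, x, f(x))
end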


The method can also be derived from the Taylor series expansion. This also provides a useful
insight on the rate of convergence of the method.

Considering the Taylor expansion truncated to the first order (see Appendix):

f(xᵢ₊₁) = f(xᵢ) + f'(xᵢ)(xᵢ₊₁ − xᵢ) + [f⁽²⁾(ξ)/2!](xᵢ₊₁ − xᵢ)²          (3.8)

Considering now the exact solution x_r and the Taylor expansion evaluated at this point:

f(x_r) = 0 = f(xᵢ) + f'(xᵢ)(x_r − xᵢ) + [f⁽²⁾(ξ)/2!](x_r − xᵢ)²

and reordering (assuming a single root – first derivative ≠ 0):

x_r = xᵢ − f(xᵢ)/f'(xᵢ) − [f⁽²⁾(ξ)/(2! f'(xᵢ))](x_r − xᵢ)²          (3.9)

Using now (3.7) for xᵢ₊₁: xᵢ₊₁ = xᵢ − f(xᵢ)/f'(xᵢ), and substituting in (3.9) gives:

x_r = xᵢ₊₁ − [f⁽²⁾(ξ)/(2! f'(xᵢ))](x_r − xᵢ)²

which can be reordered as:

(x_r − xᵢ₊₁) = −[f⁽²⁾(ξ)/(2! f'(xᵢ))](x_r − xᵢ)²          (3.10)

The error at stage i can be written as the difference between x_r and xᵢ: Eᵢ = (x_r − xᵢ); then, from
(3.10) we can write:

Eᵢ₊₁ = −[f⁽²⁾(ξ)/(2 f'(xᵢ))] Eᵢ²

Assuming convergence, both ξ and xᵢ should eventually become x_r, so the previous equation can
be re-arranged in the form:

Eᵢ₊₁ = −[f⁽²⁾(x_r)/(2 f'(x_r))] Eᵢ²          (3.11)



We can see that the relation between the errors of successive iterations is quadratic. That means that on
each Newton-Raphson iteration, the number of correct decimal digits should approximately double. This is what
is called quadratic convergence.
Although the convergence rate is generally quite good, there are cases that show poor or no
convergence. An example is when there is an inflexion point near the root and in that case, the
iteration values will start to progressively diverge from the solution. Another case is when the
root is a multiple root, that is, when the first derivative is also zero.
Stopping the iterations
It can be demonstrated that the absolute error (difference between the current approximation and
the correct value) can be written as a multiple of the difference between the last two consecutive
approximations:

Eₙ = (α − xₙ) = Cₙ(xₙ − xₙ₋₁)

and the constant Cₙ tends to 1 as the method converges. Then, a convenient criterion for stopping
the iterations is when:

|Cₙ(xₙ − xₙ₋₁)| < ε,          (3.12)

choosing a small value for Cₙ, usually 1.
3.2.3 The Secant Method
One of the difficulties of the Newton-Raphson method is the need to evaluate the derivative. This
can be very inconvenient or difficult in some cases. However, the derivative can be approximated
using a finite difference expression, for example, the backward difference:

f'(xᵢ) ≈ [f(xᵢ) − f(xᵢ₋₁)] / (xᵢ − xᵢ₋₁)          (3.13)

If we now substitute this in the expression for the Newton-Raphson iterations, the following
equation is obtained:

xᵢ₊₁ = xᵢ − f(xᵢ)(xᵢ − xᵢ₋₁) / [f(xᵢ) − f(xᵢ₋₁)]          (3.14)
which is the formula for the secant method.
Example
Use the secant method to find the root of f(x) = e⁻ˣ − x. Start with the estimates x₋₁ = 0 and
x₀ = 1. The exact result is 0.56714329…
First iteration:
x₋₁ = 0,  f(x₋₁) = 1.0
x₀ = 1,  f(x₀) = −0.63212
then x₁ = 1 − (−0.63212)(1 − 0)/(−0.63212 − 1.0) = 0.61270,   ε = 8%
Second iteration:
x₀ = 1,  f(x₀) = −0.63212
x₁ = 0.61270,  f(x₁) = −0.07081
then x₂ = 0.61270 − (−0.07081)(0.61270 − 1)/(−0.07081 − (−0.63212)) = 0.56383832

Note that in this case the 2 points are at the same side of the root (not bracketing it).
Using Excel, a simple calculation can be made giving:



i xi f(xi) error %
-1 0 1
0 1 -0.63212
1 0.612700047 -0.070814271 8.032671349
2 0.563838325 0.005182455 -0.582738902
3 0.567170359 -4.24203E-05 0.004772880
4 0.567143307 -2.53813E-08 2.92795E-06
5 0.567143290 1.24234E-13 7.22401E-08
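A minimal sketch of the same secant iteration in Matlab, starting from the two estimates of the example:

f = @(x) exp(-x) - x;
x0 = 0;  x1 = 1;
for k = 1:5
    x2 = x1 - f(x1)*(x1 - x0)/(f(x1) - f(x0));   % eq. (3.14)
    x0 = x1;  x1 = x2;
    fprintf('%d  %.9f\n', k, x1)
end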


The secant method doesn’t require the evaluation of the derivative of the function as the Newton-
Raphson method does but still, it suffers from the same problems. The convergence of the method
is similar to that of Newton and similarly, it has severe problems if the derivative is zero or near
zero in the region of interest.
3.2.4 Multiple Roots
We have seen that some of the methods have poor convergence if the derivative is very small or
zero. For higher order zeros (multiple roots), the function is zero and so are the first n−1 derivatives
(n is the order of the root). In this case the Newton-Raphson method (and the secant method) will
converge poorly.
We can notice however, that if the function f(x) has a multiple root at x = α, the function:

g(x) = f(x)/f'(x)          (3.15)

has a simple root at x = α (if the root of f is of order n, the root of the derivative is of order n−1).
We can then use the standard Newton-Raphson method for the function g(x).
Exercise 3.2
Use the Newton-Raphson method to find a root of the function f(x) = 1 − x·e^(1−x). Start the
iterations with x₀ = 0.

3.2.5 Extrapolation and Acceleration
Observing how the iterations proceed in the methods we have just studied, one can expect some
advantages if an extrapolated value is calculated from the iterations in order to improve the next
guess and accelerate the convergence.
One of the best methods for extrapolation and acceleration is Aitken's δ² method.
Considering the sequence of values xₙ, the extrapolated sequence can be constructed by:

x̄ₙ = xₙ − αₙ(xₙ − xₙ₋₁)   where   αₙ = (xₙ − xₙ₋₁)/(xₙ − 2xₙ₋₁ + xₙ₋₂)          (3.16)
Example
For the function 0.5x² − 1.1x + 0.505 (the function of the fixed-point example above), using the Fixed-Point method to find a root with
and without acceleration gives the results that follow. The iterations are started with x₀ = 0.5 and
the alternative form x = g(x) = [(x − 0.1)² + 1]/2 is used.


Iter. number    Fixed Point method: x, error %    Accelerated FP method: x, error %
0 0.5 23.40526755 -- --
1 0.58 11.15011036 -- --
2 0.6152 5.757841193 0.642857143 1.521058278
3 0.63271552 3.074648057 0.650063694 0.417090524
4 0.641892913 1.668768191 0.651994046 0.121380999
5 0.646823964 0.913383012 0.652550135 0.036194034
6 0.649508224 0.502182715 0.652715129 0.010918594
7 0.650979644 0.276776655 0.652764775 0.003313424
8 0.651789284 0.152748337 0.65277982 0.001008684
9 0.652235707 0.08436105 0.652784397 0.00030759
10 0.652482138 0.046610413 0.652785792 9.38845E-05
11 0.652618256 0.025758511 0.652786217 2.86706E-05
12 0.652693469 0.014236789 0.652786347 8.75791E-06


This acceleration method can also be used embedded in the standard method; that is, after a couple
of normal iterations, the Aitken’s method can be used to jump ahead, continue from there for
another 2 iterations and use it again to further accelerate the convergence. For the Fixed-Point
iteration, for example, this can be done in the form:
Starting from a value for x0: the first two iterates are found using the standard method:
x1=g(x0);
x2=g(x1);
Now, we can use the Aitken’s extrapolation in a repeated form:
alpha=(x2-x1)/(x2-2*x1+x0)
xbar=x2-alpha*(x2-x1)
and now we can refresh the initial guess:
x0=xbar
and re-start the iterations.
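Put together as a loop, the repeated acceleration reads as follows (a minimal sketch for g(x) = ((x − 0.1)² + 1)/2 and x₀ = 0.5, the case of the table below):

g = @(x) ((x - 0.1).^2 + 1)/2;
x0 = 0.5;
for cycle = 1:4
    x1 = g(x0);
    x2 = g(x1);
    alpha = (x2 - x1)/(x2 - 2*x1 + x0);
    x0 = x2 - alpha*(x2 - x1);     % extrapolated value becomes the new guess
end
disp(x0)                           % approaches the root 0.652786...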
For the same case of the example above this would give:

Iter. number    Fixed Point method: x, error %    Accelerated FP method: x, error %
0 0.5 23.40526755 -- --
1 0.58 11.15011036 -- --
2 0.6152 5.757841193 0.642857143 1.521058278
3 0.642857143 1.521058278 -- --
4 0.647346939 0.833268844 -- --
5 0.649794336 0.458353419 0.65272704 0.009094066
6 0.65272704 0.009094066 -- --
7 0.65275359 0.005026806 -- --
8 0.652768266 0.002778668 0.652786402 3.33604E-07
9 0.652786402 3.33604E-07 -- --
10 0.652786403 1.84412E-07 -- --
11 0.652786404 1.0194E-07 0.652786405 0

Notice the acceleration of convergence occurring at the 3rd, 6th and 9th iterations.
Graphically this is shown in Fig. 3.11:


Similarly for the Newton-Raphson method, where the evaluations of x1 and x2 are replaced by the
corresponding forms for the N-R method and the function and its derivative need to be calculated
at each stage:
f0=f(x0)     % calculation of the function
df0=df(x0)   % calculation of the derivative
x1=x0-f0/df0
f1=f(x1); df1=df(x1)   % function and derivative re-evaluated at the new point
x2=x1-f1/df1
alpha=(x2-x1)/(x2-2*x1+x0)
xbar=x2-alpha*(x2-x1)
x0=xbar

3.2.6 Higher Order Methods
All the methods we have seen consist of approximating the function using a straight line and
looking for the intersection of that line with the axis. An obvious extension of this idea is to use a
higher order approximation, for example, fitting a second order curve (a parabola) to the function
and looking for the zero of this instead. A method that implements this is the Muller method.
Müller’s method
Using three points, we can find the equation of a parabola that fits the function and then, find the
zeros of the parabola. This is the Müller or Müller-Traub method.

For example, for the function y = x⁶ − 2 (used to find the value of ⁶√2), the
function (in blue) and the approximating parabola (in red) are shown in the figure.

In this case, the 3 points are chosen as x₁ = 0.5, x₂ = 1.5 and x₃ = 1.0.
In general, the points do not need to be equally spaced, in order, or bracketing a
root as in the figure, which shows a case of a real root.

Using Newton interpolation (see chapter 4) the parabola passing through the points
(x₁, y₁), (x₂, y₂) and (x₃, y₃) can be written as:

y = y₃ + c₂(x − x₃) + d₁(x − x₃)(x − x₂)

where

c₁ = (y₂ − y₁)/(x₂ − x₁),   c₂ = (y₃ − y₂)/(x₃ − x₂)   and   d₁ = (c₂ − c₁)/(x₃ − x₁)

Solving for the zero closest to x₃ gives:

x₄ = x₃ − 2y₃ / [s + sign(s)·√(s² − 4y₃d₁)]          (3.17)

where s = c₂ + d₁(x₃ − x₂).

Fig. 3.11

Fig. 3.12


The results of the iterations are:

Iteration number x error
1 1.07797900 0.07797900
2 1.37190697 0.29392797
3 1.37170268 -0.00020429
4 1.37148553 -0.00021716
5 1.17938786 -0.19209767
6 1.14443703 -0.03495083
7 1.12268990 -0.02174712
8 1.12242784 -0.00026206
9 1.12246205 0.00003420
10 1.12246205 0.00000000

The Müller method requires three points to start the iterations but it doesn’t require evaluation
of derivatives as the Newton-Raphson method does. It can also be used to find complex roots, even if the
starting values are real.
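A minimal Matlab sketch of the Müller iteration for the example y = x⁶ − 2, shifting the three points after each step:

f = @(x) x.^6 - 2;
x1 = 0.5;  x2 = 1.5;  x3 = 1.0;
for k = 1:10
    c1 = (f(x2) - f(x1))/(x2 - x1);
    c2 = (f(x3) - f(x2))/(x3 - x2);
    d1 = (c2 - c1)/(x3 - x1);
    s  = c2 + d1*(x3 - x2);
    x4 = x3 - 2*f(x3)/(s + sign(s)*sqrt(s^2 - 4*f(x3)*d1));   % eq. (3.17)
    x1 = x2;  x2 = x3;  x3 = x4;       % shift the points for the next step
end
disp(x3)        % approaches 2^(1/6) = 1.12246205...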
3.2.7 Systems of Nonlinear Equations
In the previous discussions, we have considered only the solution of single nonlinear equations.
However, there are many cases where systems of nonlinear equations require solutions. For this
we will only consider one approach, an extension of the Newton-Raphson method seen before.
Multivariate Newton Method
In the one variable Newton’s method, the expression

xₖ₊₁ = xₖ − f(xₖ)/f'(xₖ)

comes from the Taylor series truncated to first order:

f(xₖ₊₁) = f(xₖ) + f'(xₖ)(xₖ₊₁ − xₖ)

where we set (our aim): f(xₖ₊₁) = 0.
In the case of functions of more than one variable, say for example f(u, v, w), the corresponding
truncated Taylor series (to first order) would be:

f(uₖ₊₁, vₖ₊₁, wₖ₊₁) = f(uₖ, vₖ, wₖ) + (∂f/∂u)(uₖ₊₁ − uₖ) + (∂f/∂v)(vₖ₊₁ − vₖ) + (∂f/∂w)(wₖ₊₁ − wₖ)

with the derivatives evaluated at (uₖ, vₖ, wₖ), and similarly for other functions g and h of u, v and w, constituting a system of equations. We
can see that this can be written in compact form as:

F(xₖ₊₁) = F(xₖ) + DF(xₖ)(xₖ₊₁ − xₖ) + O((xₖ₊₁ − xₖ)²)          (3.18)
where x = (u, v, w)ᵀ and F(x) = (f(u,v,w), g(u,v,w), h(u,v,w))ᵀ. DF(x) is the matrix:

DF(x) = [ ∂f/∂u   ∂f/∂v   ∂f/∂w ]
        [ ∂g/∂u   ∂g/∂v   ∂g/∂w ]        known as the Jacobian matrix.          (3.19)
        [ ∂h/∂u   ∂h/∂v   ∂h/∂w ]
Following the same procedure as for the one variable case, we can set F(xₖ₊₁) = 0 in (3.18) and
we have:

0 ≈ F(xₖ) + DF(xₖ)(xₖ₊₁ − xₖ)


or   xₖ₊₁ = xₖ − DF(xₖ)⁻¹ F(xₖ)          (3.20)

which can be written as:   DF(xₖ)·sₖ = −F(xₖ);   sₖ = xₖ₊₁ − xₖ          (3.21)
so the algorithm is:
– choose a starting vector x₀
– for k = 0, 1, ….
  – calculate the matrix DF(x) and evaluate it at xₖ
  – solve the linear system of equations DF(xₖ)·sₖ = −F(xₖ) for the increment sₖ
  – calculate the next point as xₖ₊₁ = xₖ + sₖ
– end
Example
Use the multivariate Newton’s method, starting with the guess vector x₀ = (1, 2)ᵀ, to find a solution of the system:

v − u³ = 0
u² + v² − 1 = 0

The matrix DF(x) is:

DF(x) = [ −3u²   1  ]          then          DF(x₀) = [ −3   1 ]
        [  2u    2v ]                                 [  2   4 ]

Using x₀ = (1, 2)ᵀ, the first increment s₀ is found from:

[ −3   1 ] [ s₁ ]   =  −F(x₀)  =  − [ 1 ]
[  2   4 ] [ s₂ ]                    [ 4 ]

The solution of this is s₀ = (0, −1)ᵀ, so the next point is x₁ = x₀ + s₀ = (1, 1)ᵀ.
Setting a tolerance of 10⁻⁸, convergence is achieved in 6 iterations and the result is:

(u₆, v₆) = (0.826031357654187, 0.563624162161259)
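A minimal Matlab sketch of the same iteration, with F and the Jacobian DF written out by hand for this particular system:

F  = @(x) [x(2) - x(1)^3;  x(1)^2 + x(2)^2 - 1];
DF = @(x) [-3*x(1)^2, 1;  2*x(1), 2*x(2)];
x = [1; 2];
for k = 1:6
    s = DF(x) \ (-F(x));   % solve DF(x_k) s_k = -F(x_k) for the increment
    x = x + s;             % x_{k+1} = x_k + s_k
end
disp(x')                   % approaches (0.826031..., 0.563624...)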



Fig. 3.13 Successive iterative solutions


4. Interpolation and Approximation
There are many occasions when there is a need to convert discrete data, from measurements or
calculations into a smooth function. This could be necessary for example, to obtain estimates of
the values between data points. It could be also necessary if one wants to represent a function by
a simpler one that approximates its behaviour. Two approaches to this task can be used.
Interpolation consists of fitting a smooth curve exactly to the data points and creating the curve
that spans these points. In approximation, the curve doesn’t necessarily fit exactly the data points
but it approximates the overall behaviour of the data, following an established criterion.
4.1 Interpolation
4.1.1 Method of Undetermined Coefficients
The simplest and most obvious form of finding a polynomial of order n that passes through a set
of n+1 points would be to establish the n+1 equations:
pₙ(xᵢ) = yᵢ ;   i = 1, 2, …, n+1

This results in a matrix problem to solve, with the matrix rows given by:

[ 1  xᵢ  xᵢ²  ···  xᵢⁿ ],   i = 1, 2, …, n+1
However, this matrix problem is notoriously ill-conditioned (see chapter on Matrix Computations)
and the results are very sensitive to small changes in the point positions (xᵢ) or in the values (yᵢ),
and can be highly inaccurate.
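For illustration, a minimal Matlab sketch of this direct approach (using the data of Exercise 4.1 below purely as an example) builds the matrix row by row and solves for the coefficients; for larger data sets this matrix becomes severely ill-conditioned, as noted above:

xd = [0 1 2 3 4 5];   yd = [2 1 3.4 3.8 5.8 4.8];
n = numel(xd) - 1;
A = zeros(n+1);
for i = 1:n+1
    A(i,:) = xd(i).^(0:n);       % row [1, x_i, x_i^2, ..., x_i^n]
end
c = A\yd(:);                     % coefficients of 1, x, x^2, ..., x^n
disp(c')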
4.1.2 Lagrange Interpolation
The basic interpolation problem can be formulated as:
Given a set of nodes, {xi, i =0,…,n} and corresponding data values {yi, i =0,…,n}, find the
polynomial p(x) of degree less or equal to n such that p(xi) = yi.
Consider the family of functions:

Lₖ⁽ⁿ⁾(x) = ∏ⱼ₌₀, ⱼ≠ₖⁿ (x − xⱼ)/(xₖ − xⱼ),   k = 0, 1, …, n          (4.18)

We can see that they are polynomials of order n and have the property (interpolatory condition):

Lᵢ⁽ⁿ⁾(xⱼ) = δᵢⱼ = 1 if i = j,  0 if i ≠ j          (4.19)

Then, if we define the polynomial by:

pₙ(x) = Σₖ₌₀ⁿ yₖ Lₖ⁽ⁿ⁾(x)          (4.20)

then:

pₙ(xᵢ) = Σₖ₌₀ⁿ yₖ Lₖ⁽ⁿ⁾(xᵢ) = yᵢ          (4.21)
The uniqueness of this interpolation polynomial can also be demonstrated (that is, that there is
only one polynomial of order n or less that satisfies this condition).


Lagrange Polynomials
In more detail, from the general definition (4.18) the equation for the first order polynomial
(straight line) passing through two points (x₀, y₀) and (x₁, y₁) is:

p₁(x) = L₀⁽¹⁾y₀ + L₁⁽¹⁾y₁ = [(x − x₁)/(x₀ − x₁)] y₀ + [(x − x₀)/(x₁ − x₀)] y₁          (4.22)

The second order polynomial (parabola) passing through three points is:

p₂(x) = L₀⁽²⁾y₀ + L₁⁽²⁾y₁ + L₂⁽²⁾y₂

p₂(x) = [(x − x₁)(x − x₂)/((x₀ − x₁)(x₀ − x₂))] y₀ + [(x − x₀)(x − x₂)/((x₁ − x₀)(x₁ − x₂))] y₁ + [(x − x₀)(x − x₁)/((x₂ − x₀)(x₂ − x₁))] y₂          (4.23)

In general, we can see that the interpolation polynomials have the form given in (4.20) for any
order.
Each of the Lagrange interpolation functions Lₖ⁽ⁿ⁾ associated to each of the nodes xₖ (given in
general by (4.18)) is:

Lₖ⁽ⁿ⁾(x) = [(x − x₀)(x − x₁)···(x − xₖ₋₁)(x − xₖ₊₁)···(x − xₙ)] / [(xₖ − x₀)(xₖ − x₁)···(xₖ − xₖ₋₁)(xₖ − xₖ₊₁)···(xₖ − xₙ)] = N(x)/D          (4.24)

The denominator has the same form as the numerator and D = N(xₖ).
Example
Find the interpolating polynomial that passes through the three points (x₁, y₁) = (−2, 4),
(x₂, y₂) = (0, 2) and (x₃, y₃) = (2, 8).
Substituting in (4.20), or more specifically, (4.23):

p₂(x) = [(x − 0)(x − 2)/((−2 − 0)(−2 − 2))]·4 + [(x + 2)(x − 2)/((0 + 2)(0 − 2))]·2 + [(x + 2)(x − 0)/((2 + 2)(2 − 0))]·8

p₂(x) = (x² − 2x)/2 − (x² − 4)/2 + (x² + 2x) = L₁⁽²⁾y₁ + L₂⁽²⁾y₂ + L₃⁽²⁾y₃

p₂(x) = x² + x + 2


Fig. 4.1 shows the complete interpolating polynomial p₂(x) and the three Lagrange
interpolation polynomials Lₖ⁽²⁾(x), k = 1, 2, 3, corresponding to each of the nodal points.

Note that the function corresponding to one node has a value 1 at that node and 0 at the
other two.


Fig. 4.1
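The same interpolating parabola can be checked numerically; a degree-2 fit through three points reproduces the interpolant exactly (a minimal sketch using Matlab's built-in polyfit):

xd = [-2 0 2];   yd = [4 2 8];
p = polyfit(xd, yd, 2);     % coefficients in descending powers: [1 1 2]
disp(p)                     % i.e. x^2 + x + 2
disp(polyval(p, 1))         % value at x = 1 is 4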


Exercise 4.1
Find the 5th order Lagrange interpolation polynomial to fit the data: xd = {0, 1, 2, 3, 4, 5} and yd
= {2, 1, 3.4, 3.8, 5.8, 4.8}.
Exercise 4.2
Show that an arbitrary polynomial p(x) of order n can be represented exactly by

p(x) = Σᵢ₌₀ⁿ p(xᵢ)·Lᵢ(x)

using an arbitrary set of (distinct) data points xᵢ.

4.1.3 Newton Interpolation
It can be easily demonstrated that the polynomial interpolating a set of points is unique (Ex. 4.2),
and the Lagrange method allows us to find it. The Newton interpolation method gives eventually
the same result but it can be more convenient in some cases. In particular, it is simpler to extend
the interpolation adding extra points, which in the Lagrange method would need a total re-
calculation of the interpolation functions.

The general form of a polynomial used to interpolate n + 1 data points is:

f(x) = b₀ + b₁x + b₂x² + b₃x³ + ··· + bₙxⁿ          (4.25)

Newton’s method, and Lagrange’s, give us a procedure to find the coefficients bᵢ.


From the figure we can write:

(y − y₁)/(x − x₁) = (y₂ − y₁)/(x₂ − x₁)

which can be re-arranged as:

y = y₁ + [(y₂ − y₁)/(x₂ − x₁)]·(x − x₁)          (4.26)

This is Newton’s expression for the first order interpolation polynomial (or linear
interpolation).

Fig. 4.2

That is, the Newton form of the equation of a straight line that passes through 2 points (x₁, y₁) and (x₂, y₂) is:

p(x) = a₀ + a₁(x − x₁) ;   where   a₀ = y₁ ,   a₁ = (y₂ − y₁)/(x₂ − x₁)          (4.27)
Similarly, the general expression for a second order polynomial passing through the 3 data points
(x₁, y₁), (x₂, y₂) and (x₃, y₃) can be written as:

p₂(x) = b₀ + b₁x + b₂x²
or, re-arranging:   p₂(x) = a₀ + a₁(x − x₁) + a₂(x − x₁)(x − x₂)          (4.28)
Substituting the values for the 3 points, we get, after some re-arrangement:

a₀ = y₁
a₁ = (y₂ − y₁)/(x₂ − x₁)
a₂ = [ (y₃ − y₂)/(x₃ − x₂) − (y₂ − y₁)/(x₂ − x₁) ] / (x₃ − x₁)          (4.29)
The individual terms in the above expression are usually called “divided differences” and
denoted by the symbol D. That is,

Dyᵢ = (yᵢ₊₁ − yᵢ)/(xᵢ₊₁ − xᵢ) ,   D²yᵢ = (Dyᵢ₊₁ − Dyᵢ)/(xᵢ₊₂ − xᵢ) ,   D³yᵢ = (D²yᵢ₊₁ − D²yᵢ)/(xᵢ₊₃ − xᵢ) ,   etc.

In this form, (4.29) takes the form: a₀ = y₁, a₁ = Dy₁ and a₂ = D²y₁.
The general form of Newton interpolation polynomials is then an extension of (4.27) and (4.28):

pₙ(x) = a₀ + a₁(x − x₁) + a₂(x − x₁)(x − x₂) + ···          (4.30)

or,   pₙ(x) = a₀ + Σᵢ₌₁ⁿ aᵢ Wᵢ(x)   with   Wᵢ(x) = ∏ⱼ₌₁ⁱ (x − xⱼ)          (4.31)

with the coefficients:
a₀ = y₁ ,   aᵢ = Dⁱy₁
Example
We can consider the previous example of finding the interpolating polynomial that passes through
the three points (x₁, y₁) = (−2, 4), (x₂, y₂) = (0, 2) and (x₃, y₃) = (2, 8).
In this case it is usual and convenient to arrange the calculations in a table with the following
quantities in each column:

xᵢ     yᵢ     Dyᵢ                            D²yᵢ
−2     4
              Dy₁ = (2 − 4)/(0 − (−2)) = −1
 0     2                                     D²y₁ = (3 − (−1))/(2 − (−2)) = 1
              Dy₂ = (8 − 2)/(2 − 0) = 3
 2     8

Then, the coefficients are: a₀ = 4, a₁ = −1 and a₂ = 1 and the polynomial is:

p₂(x) = 4 − (x + 2) + (x + 2)(x − 0) = x² + x + 2


Note that it is the same polynomial found using Lagrange interpolation.
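The divided-difference table generalises directly to any number of points. A minimal Matlab sketch for the three points of this example (column j of D holds the (j−1)-th divided differences, and the first row gives the coefficients a₀, a₁, a₂, …):

xd = [-2 0 2];   yd = [4 2 8];
n = numel(xd);
D = zeros(n);    D(:,1) = yd(:);
for j = 2:n
    for i = 1:n-j+1
        D(i,j) = (D(i+1,j-1) - D(i,j-1)) / (xd(i+j-1) - xd(i));
    end
end
disp(D(1,:))     % a0, a1, a2 = 4, -1, 1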

One important property of Newton’s construction of the interpolating polynomial is that it
makes it easy to extend the interpolation by including more points. If additional points are included, the new higher
order polynomial can easily be constructed from the previous one:

pₙ₊₁(x) = pₙ(x) + aₙ₊₁Wₙ₊₁(x)   with   Wₙ₊₁(x) = Wₙ(x)(x − xₙ₊₁)   and   aₙ₊₁ = Dⁿ⁺¹y₁          (4.32)

In this way, it has many similarities with the Taylor expansion, where additional terms increase
the order of the polynomial. These similarities allow a treatment of the error in the same way as
it is done with Taylor expansions.
Exercise 4.3
Find the 5th order interpolation polynomial to fit the data: xd = {0, 1, 2, 3, 4, 5} and yd = {2, 1,
3.4, 3.8, 5.8, 4.8}, this time using Newton interpolation.

Some Practical Notes:
In some of the examples seen here we have used equally spaced data. This is by no means
necessary and the data can have any distribution. However, if the data is equally spaced, simpler
expressions for the divided differences can be derived (not done here).
____________________________

One of the problems with interpolation of data using a single polynomial is that this
technique is very sensitive to noisy data. A very small change in the values of the data can lead to
a drastic change in the interpolating function. This is illustrated in the following example:
The following figure shows the interpolating polynomial (in blue) for the data:
xd = {0, 1, 2, 3, 4, 5}
yd = {2, 1, 3.4, 4.8, 5.8, 4.8}
Now, if we add two more points with a slight amount of noise: xd’ = {2.2, 2.7} and yd’ = {3.5,
4.25} (shown with the filled black markers), the new interpolation polynomial (red line) shows a
dramatic difference to the first one.

Fig. 4.3
4.1.4 Hermite Interpolation
The problem arises because the extra points force a higher degree polynomial and this can have a
higher oscillatory behaviour. Another approach that avoids this problem is to use data for the
derivative of the function too. If we also ask for the derivative values to be matched at the nodes,
the oscillations will be prevented. This is done with the “Hermite Interpolation”. The
development is rather similar to that of the Newton’s method but more complicated due to the
involvement of the derivative values. It can also be constructed easily with the help of a table (as
in Newton’s) and divided differences. We will not cover here the details of the derivation but
simply, the procedure to find it.

The table is similar to that for the Newton interpolation but we enter the data points twice (see
below) and the derivative values are placed between repeated data points, in alternate rows, as the
first divided differences. The initial set-up for 2 points is marked with red circles.

i     xᵢ     yᵢ     Dyᵢ                            D²yᵢ                          D³yᵢ
1     x₁     y₁
                    y₁'
1     x₁     y₁                                    A = (Dy₁ − y₁')/(x₂ − x₁)
                    Dy₁ = (y₂ − y₁)/(x₂ − x₁)                                    C = (B − A)/(x₂ − x₁)
2     x₂     y₂                                    B = (y₂' − Dy₁)/(x₂ − x₁)
                    y₂'
2     x₂     y₂

The corresponding interpolating polynomial is:

H₂(x) = y₁ + y₁'(x − x₁) + A(x − x₁)² + C(x − x₁)²(x − x₂)          (4.33)

The coefficients of the successive terms are marked in the table with blue squares.

The figure shows the function y(x) = x·sin(πx)
(+) compared with the interpolated curves
using the Hermite, Newton and cubic splines
methods (see next section).
The data points are marked by circles.
Newton interpolation results in a polynomial
of order 4 while Hermite gives a polynomial
of order 7.
(If 7 points are used for the Newton interpolation
the results are quite similar.)

Fig. 4.4


Another approach to avoid the oscillations present when using high order polynomials is to
use lower order polynomials to interpolate subsets of the data and assemble the overall
approximating function piecewise. This is what is called “spline interpolation”.
4.1.5 Spline Interpolation
Any piecewise interpolation of data by low order functions is called spline interpolation and the
simplest and widely used is the piecewise linear interpolation, or simply, joining consecutive data
points by straight lines.
First Order Splines
If we have a set of n+1 data points (xᵢ, yᵢ), i = 0, …, n, the first order splines (straight lines)
can be defined as:

f₀(x) = y₀ + m₀(x − x₀) ;   x₀ ≤ x ≤ x₁          (4.34)
f₁(x) = y₁ + m₁(x − x₁) ;   x₁ ≤ x ≤ x₂          (4.35)
···
fᵢ(x) = yᵢ + mᵢ(x − xᵢ) ;   xᵢ ≤ x ≤ xᵢ₊₁          (4.36)

with the slopes given by:   mᵢ = (yᵢ₊₁ − yᵢ)/(xᵢ₊₁ − xᵢ)
Quadratic and Cubic Splines
First order splines are straight-forward. To fit a line in each interval between data points, we need
only 2 pieces of data: the data values at the 2 points. If we now want to fit a higher order function,
for example a second order polynomial, a parabola, we need to determine 3 coefficients, and in
general, for an m-order spline we need m+1 equations.
If we want a smooth function, over the complete domain, we should impose continuity of
the representation in contiguous intervals and as many derivatives as possible to be continuous too
at the adjoining points.
For n+1 points, there are n intervals and consequently, n functions to determine. If we use
quadratic splines (second order) their equations will be of the form: cxbxaxf iii ++=
2)( and we
need 3 equations per interval to find the 3 coefficients, that is, a total of 3n equations.
We can establish the following conditions to be satisfied:
1) The values of the functions corresponding to adjacent intervals must be equal at the common
interior nodes (no discontinuities at the nodes) and to the value of the function there.
This can be represented by:
2 21 1 1 ( )i i i i i i i i i i ia x b x c a x b x c f x− − −+ + = + + = (4.37)
for i = 2,…,n, that is, at the node xi, boundary between intervals i−1 and i, the functions defined
in each of these intervals must coincide. This gives us a total of 2n−2 equations (there are 2
equations in the above line and n−1 interior points).
2) The first and last functions must pass through the first and last points (end points). This
gives us another 2 equations:  a_1 x_1² + b_1 x_1 + c_1 = f(x_1)  and  a_n x_{n+1}² + b_n x_{n+1} + c_n = f(x_{n+1})
3) The first derivative at the interior nodes must be continuous.
This can be written as f'_{i−1}(x_i) = f'_i(x_i) for i = 2,…,n. Then:
2a_{i−1} x_i + b_{i−1} = 2a_i x_i + b_i    (4.38)
and we have another n−1 equations.
All these give us a total of 3n−1 equations when we need 3n. An additional condition must then
be established. Usually this can be chosen at the end points, for example, stating that a1 = 0 (This
corresponds to asking for the second derivative to be zero at the first point and results in the two
first points joined by a straight line).


For example, for a set of 4 data points, we can establish the necessary equations as listed above,
giving a total of 8 equations for 8 unknowns (having fixed a1 = 0 already). This can be solved by
the matrix techniques that we will study later.
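For illustration, here is a sketch of how those 3n equations can be assembled and solved in Matlab for 4 (arbitrary) data points; conditions 1) and 2) are imposed here simply by asking each parabola to match the data at both ends of its own interval:

% Quadratic splines through 4 data points: assemble and solve the 3n = 9
% equations (including the extra condition a1 = 0).  Sketch only.
xd = [0 1 2 3];   yd = [2 1 3 2.5];     % arbitrary example data
n  = length(xd) - 1;                    % number of intervals
M  = zeros(3*n);  r = zeros(3*n, 1);  row = 0;
for i = 1:n                             % each f_i(x) = a_i x^2 + b_i x + c_i matches y at both ends
    cols = 3*(i-1) + (1:3);             % columns of [a_i b_i c_i]
    row = row + 1;  M(row, cols) = [xd(i)^2   xd(i)   1];  r(row) = yd(i);
    row = row + 1;  M(row, cols) = [xd(i+1)^2 xd(i+1) 1];  r(row) = yd(i+1);
end
for i = 2:n                             % continuity of the first derivative, eq. (4.38)
    row = row + 1;
    M(row, 3*(i-2) + (1:2)) = [ 2*xd(i)  1];
    M(row, 3*(i-1) + (1:2)) = [-2*xd(i) -1];
end
row = row + 1;  M(row, 1) = 1;          % additional condition a_1 = 0
coef = M\r                              % [a1 b1 c1 a2 b2 c2 a3 b3 c3]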
Quadratic splines have some shortcomings that are not present in cubic splines, so these are
preferred. However, their calculation is even more cumbersome than that of quadratic splines. In
this case, the function and the first and second derivatives are continuous at the nodes.
Because of their popularity, cubic splines are commonly found in computer libraries and for
example, Matlab has a standard function that calculates them.

To illustrate the advantages of this
technique, Fig. 4.5 shows the cubic
interpolation splines (blue line) for the
original data set described in Fig. 4.3,
compared with the cubic splines
interpolation of the extended data set (red
line, with two points added). The figure
shows that there is little difference to the
overall result when the 2 extra points are
added, contrary to what happens with the
single polynomial interpolation shown in
Fig. 4.3.
Using the built-in Matlab function, the code
for plotting the red line in this graph is
simply:

Fig. 4.5 Comparison of splines interpolation with 2
added points

xd=[0,1,2,2.2,2.7,3,4,5];
yd=[2,1,3,3.5,4.25,4.8,5.8,4.8];
x=0:0.05:5;
y=spline(xd,yd,x);
plot(x,y,'r','Linewidth',2.5), hold
plot(xd,yd,'ok','MarkerSize',8,'MarkerFaceColor','w','LineWidth',2)
Note that these lines only include the drawing of the red line, and the command for the
markers is not exactly the one used, since the added points have been drawn separately using the
attributes 'or','MarkerSize',8,'MarkerFaceColor','r'.
Exercise 4.4
Using Matlab plot the function f(x) = 0.1 e^{1.2x} sin 2x in the interval [0, 4] and construct a cubic
spline interpolation using the values of the function at the points xd = 0: 0.5: 4 (in Matlab notation).
Use Excel to construct a table of divided differences for this function at the points xd and find the
coefficients of the Newton interpolation polynomial. Use Matlab to plot the corresponding
polynomial in the same figure as the splines and compare the results.

If the main objective is to create a smooth function to represent the data, it is sometimes
preferable to choose a function that doesn’t necessarily pass exactly through the data but
approximates its overall behaviour. This is what is called “approximation”. The problems here
are how to choose the approximating function and what is considered the best choice.



4.2 Approximation
There are many different ways to approximate a function and you have seen some of them
in detail already. Taylor expansions, least squares curve fitting and Fourier series are examples of
this.
Methods like the “least squares” look for a single function or polynomial to approximate
the desired function. Another approach, of which the Fourier series is an example, consists of
using a family of simple functions to build an expansion that approximates the given function.
The problem then is to find the appropriate set of coefficients for that expansion. Taylor series are
somehow related, the main difference is that while the other methods based on expansions attempt
to find an overall approximation, Taylor series are meant to approximate the function at one
particular point and its close vicinity.
4.2.1 Least Squares Curve Fitting
One of the most common problems of approximation is fitting a curve to experimental data. In
this case, the usual objective is to find the curve that approximates the data minimizing some
measure of the error. In the method of least squares the measure chosen is the sum of the squares
of the differences between the data and the approximating curve.
If the data is given by (x_i, y_i), i = 1, …, n, we can define the error of the approximation by:
E = Σ_{i=1}^{n} (y_i − f(x_i))²    (4.39)
The commonest choices for the approximating functions are polynomials, straight lines, parabolas,
etc. Then, the minimization of the error will give the necessary relations to determine the
coefficients.
Approximation by a Straight Line
The equation of a straight line is f(x) = a + bx, so the error to minimise is:
E(a, b) = Σ_{i=1}^{n} (y_i − (a + b x_i))²    (4.40)
The error is a function of the parameters a and b that define the straight line. Then, the
minimization of the error can be achieved by making the derivatives of E with respect to a and b
equal to zero. These conditions give:
∂E/∂a = −2 Σ_{i=1}^{n} (y_i − a − b x_i) = 0   ⇒   Σ_{i=1}^{n} y_i − n a − b Σ_{i=1}^{n} x_i = 0    (4.41)

∂E/∂b = −2 Σ_{i=1}^{n} x_i (y_i − a − b x_i) = 0   ⇒   Σ_{i=1}^{n} x_i y_i − a Σ_{i=1}^{n} x_i − b Σ_{i=1}^{n} x_i² = 0    (4.42)
which can be simplified to:
n a + b Σ_{i=1}^{n} x_i = Σ_{i=1}^{n} y_i    (4.43)

and   a Σ_{i=1}^{n} x_i + b Σ_{i=1}^{n} x_i² = Σ_{i=1}^{n} x_i y_i    (4.44)


Solving the system for a and b gives:
a = (Σ x_i² Σ y_i − Σ x_i Σ x_i y_i) / (n Σ x_i² − (Σ x_i)²)   and   b = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²)    (4.45)
Example
Fitting a straight line to the data given by:
xd = {0, 0.2, 0.8, 1, 1.2, 1.9, 2, 2.1, 2.95, 3} and
yd = {0.01, 0.22, 0.76, 1.03, 1.18, 1.94, 2.01, 2.08, 2.9, 2.95}

Then,  Σ_{i=1}^{10} xd_i = 15.15;   Σ_{i=1}^{10} yd_i = 15.08;   Σ_{i=1}^{10} xd_i² = 32.8425   and   Σ_{i=1}^{10} xd_i yd_i = 32.577

and the parameters of the straight line are:
a = 0.01742473648290 and
b = 0.98387806172746

The figure shows the approximating
straight line (red) together with the curve for
Newton interpolation (black) and for cubic
splines (blue).
The problems occurring with the use of
higher order polynomial interpolation are also evident.
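A minimal Matlab sketch of this calculation, using equations (4.43)-(4.45) with the data of the example and Matlab's polyfit as a cross-check:

% Least squares straight line fit (sketch).
xd = [0, 0.2, 0.8, 1, 1.2, 1.9, 2, 2.1, 2.95, 3];
yd = [0.01, 0.22, 0.76, 1.03, 1.18, 1.94, 2.01, 2.08, 2.9, 2.95];
n  = length(xd);
Sx = sum(xd);  Sy = sum(yd);  Sxx = sum(xd.^2);  Sxy = sum(xd.*yd);
den = n*Sxx - Sx^2;
a = (Sxx*Sy - Sx*Sxy)/den               % intercept, eq. (4.45)
b = (n*Sxy - Sx*Sy)/den                 % slope, eq. (4.45)
p = polyfit(xd, yd, 1);                 % p = [slope intercept], for comparison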
Example
The data (xd, yd) shown in the figure appears
to behave in an exponential manner. Then,
defining the variable zd_i = log(yd_i), zd
should vary linearly with xd. We can then fit
a straight line to the pair of variables (xd, zd).
If the fitted straight line is the function z(x), the
function

y = e^{z(x)}

is then a least squares fit to the original data (the fit being performed in the logarithmic domain).
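A short Matlab sketch of this idea (it assumes the data are already stored in the arrays xd and yd):

% Exponential fit via a straight-line least squares fit in the log domain (sketch).
zd = log(yd);                  % transformed data, zd_i = log(yd_i)
p  = polyfit(xd, zd, 1);       % straight line z(x) = p(1)*x + p(2)
x  = linspace(min(xd), max(xd), 200);
y  = exp(polyval(p, x));       % back-transform to the original variable
plot(xd, yd, 'o', x, y, 'r')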




Fig. 4.6

Fig. 4.7

Second Order Approximation
If a second order curve is used for the approximation, f(x) = a + bx + cx², there are 3 parameters
to find and the error is given by:
E(a, b, c) = Σ_{i=1}^{n} (y_i − (a + b x_i + c x_i²))²    (4.45)
Making the derivatives of E with respect to a, b and c equal to zero will give the necessary 3
equations for the coefficients of the parabola. The expressions are similar to those found for the
straight line fit although more complicated.
Matlab has standard functions to perform least squares approximations with polynomials of any
order. If the data is given by (xd, yd), and m is the desired order of the polynomial to fit, the
function:

coeff = polyfit(xd, yd, m)

returns the coefficients of the polynomial.
If now x is the array of values where the approximation is wanted,

y = polyval(coeff, x)

returns the values of the polynomial fit at the point x.
Exercise 4.5
Find the coefficients of a second order polynomial that approximates the function s(x) = e^x in the
interval [−1, 1] in the least squares sense using the points xd = [−1, 0, 1]. Plot the function and the
approximating polynomial together with the Taylor approximation (Taylor series truncated to
second order) for comparison.

Approximation using Continuous Functions
The same idea of "least squares", that is, to try to minimize an error expression in the least squares
sense, can also be applied to the approximation of continuous functions. Imagine that instead of discrete
data sets we have a rather complicated analytical expression representing the function of interest.
This case is rather different to all we have seen previously because now the value of the desired
variable is known exactly at every point, but it might be desirable to have a simpler expression
representing this behaviour, at least in a certain domain, for further analysis purposes. For example,
if the function is s(x) and we want to approximate it using a second order polynomial, we can
formulate the problem as minimizing the square of the difference over the domain of interest, that
is:


E = ∫ (a + b x + c x² − s(x))² dx    (4.46)
Again, asking for the derivatives of E with respect to a, b and c to be zero, will give us the
necessary equations to find these coefficients. There is one problem though, the resultant matrix
problem, particularly for large systems (high order polynomials) is normally very ill-conditioned,
so some special precautions need to be taken.
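As an illustration of this ill-conditioning (a sketch, assuming the domain [0, 1] and the monomial basis 1, x, …, x^m): the matrix of the normal equations then has entries ∫ x^{i+j} dx = 1/(i + j + 1), i.e. it is the Hilbert matrix, whose condition number grows extremely quickly with the order:

% Condition number of the normal-equation matrix for a monomial basis on [0,1].
for m = 2:2:10
    G = hilb(m+1);             % G(i,j) = 1/(i+j-1), the Hilbert matrix
    fprintf('order %2d: cond(G) = %.3g\n', m, cond(G));
end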
4.2.2 Approximation using Orthogonal Functions
We can extend the same ideas of least squares to the approximation of a given function by an
expansion in terms of a set of “basis functions”. You are already familiar with this idea from the
use of Fourier expansions to approximate functions, but the concept is not restricted to use of sines


and cosines as basis functions.
First of all, we call a base a family of functions that satisfies a set of properties. In analogy with
what you know about vectors, for example, the set of unit vectors x̂, ŷ and ẑ constitutes a base in the
3D space because any vector in that space can be expressed as a combination of these three.
Naturally, we don't actually need these vectors to be perpendicular to each other but that helps (as
we'll see later). However, the base must be complete, that is, no vector in 3D escapes this
representation. For example, if we only consider the vectors x̂ and ŷ, no combination of them
can possibly have a component along the ẑ axis and the resultant vectors are restricted to the xy-
plane. In the same sense, we want a set of basis functions to be complete (as sines and cosines
are) in order that any function can have its representation by an expansion. The difference now is
that, unlike the 3D space, we need an infinite set (that is the dimension of the space of functions).
So, if we select a complete set of basis functions φ_k(x), we can represent our function by:

f(x) ≈ f̃_n(x) = Σ_{k=1}^{n} c_k φ_k(x)    (4.47)

an expansion truncated to n terms.
In this context, the error of this approximation is the function difference between the exact and the
approximate functions, r_n(x) = f(x) − f̃_n(x), and we can use the least squares ideas again, seeking
to minimise the norm of this error. That is, the quantity (error residual):

R_n = ||r_n(x)||² = ∫ (f(x) − f̃_n(x))² dx    (4.48)

with the subscript n because the error residual above is also a function of the truncation level.
This concept of norm is analogous to that of the norm or modulus of a vector:  ||v||² = v · v = Σ_{i=1}^{N} v_i².
In order to extend it to functions we need to introduce the inner product of functions (analogous
to the dot product of vectors):
If we have two functions f and g defined over the same domain Ω, their inner product is the
quantity:

⟨f, g⟩ = ∫_Ω f(x) g(x) dx    (4.49)

The inner product satisfies the following properties:
1. ⟨f, f⟩ > 0 for all nonzero functions
2. ⟨f, g⟩ = ⟨g, f⟩ for all functions f and g.
3. ⟨αf_1 + βf_2, g⟩ = α⟨f_1, g⟩ + β⟨f_2, g⟩ for all functions f_1, f_2 and g and scalars α and β.

Note: The above definition of inner product is sometimes extended using a weighting function
w(x) in the form ⟨f, g⟩ = ∫_Ω w(x) f(x) g(x) dx, and provided that w(x) is non-negative it satisfies all
the required properties.
In a similar form as with the dot product of vectors, ||f(x)||² = ⟨f(x), f(x)⟩.
Using the inner product definition, the error expression above can be written as:

R_n = ∫ (f(x) − f̃_n(x))² dx = ||f(x) − f̃_n(x)||² = ||f(x)||² − 2⟨f(x), f̃_n(x)⟩ + ||f̃_n(x)||²    (4.50)


and if we write f̃_n(x) as f̃_n(x) = Σ_{k=1}^{n} c_k φ_k(x), as above, we get:

R_n = ||f(x)||² − 2 Σ_{k=1}^{n} c_k ⟨f(x), φ_k(x)⟩ + Σ_{j=1}^{n} Σ_{k=1}^{n} c_k c_j ⟨φ_k(x), φ_j(x)⟩    (4.51)

We can see that the error residual R_n is a function of the coefficients c_k of the expansion and then,
to find those values that minimize this error, we can make the derivatives of R_n with respect to c_k
equal to zero for all k. That is:

∂R_n/∂c_k = 0   for k = 1, …, n.    (4.52)

The first term is independent of c_k, so it will not contribute and the other two will yield the general
equation:

Σ_{j=1}^{n} c_j ⟨φ_k(x), φ_j(x)⟩ = ⟨f(x), φ_k(x)⟩   for k = 1, …, n.    (4.53)

Writing this down in detail gives:
(k = 1)   c_1⟨φ_1, φ_1⟩ + c_2⟨φ_2, φ_1⟩ + … + c_n⟨φ_n, φ_1⟩ = ⟨f, φ_1⟩
(k = 2)   c_1⟨φ_1, φ_2⟩ + c_2⟨φ_2, φ_2⟩ + … + c_n⟨φ_n, φ_2⟩ = ⟨f, φ_2⟩
⋮
(k = n)   c_1⟨φ_1, φ_n⟩ + c_2⟨φ_2, φ_n⟩ + … + c_n⟨φ_n, φ_n⟩ = ⟨f, φ_n⟩

which can be written as a matrix problem of the form: Φc = s, where the matrix Φ contains all the
inner products (in all combinations), the vector c is the list of coefficients and s is the list of values
in the right hand side (which are all known: we can calculate them all).
We can find the coefficients solving the system of equations, but we can see that this will be
a much easier task if all crossed products of basis functions yielded zero; that is, if

⟨φ_k(x), φ_j(x)⟩ = 0   for all j ≠ k    (4.54)

This is what is called orthogonality and is a very useful property of the functions in a base.
(Similar to what happens with a base of perpendicular vectors: all dot products between different
members of the base are zero.)
Then, if the basis functions φ_k(x) are orthogonal, the solution of the system above is
straightforward:

c_k = ⟨f(x), φ_k(x)⟩ / ⟨φ_k(x), φ_k(x)⟩    (4.55)

Remark:
You can surely recognise here the properties of the sine and cosine functions involved in Fourier
expansions: they form a complete set, they are orthogonal (⟨φ_k(x), φ_j(x)⟩ = 0 for j ≠ k, where φ_k is
either sin kπx or cos kπx) and the coefficients of the expansions are given by the same expression (4.55)
above.


Fourier expansions are convenient but sinusoidal functions are not the only or simplest
possibility. There are many other sets of orthogonal functions that can form a base; in particular,
we are interested in polynomials because of the convenience of calculations.
Families of Orthogonal Polynomials
There are many different families of polynomials that can be used in this manner. Some are more
useful than others in particular problems. They often originate as solutions of some differential
equations.
1. Legendre Polynomials
They are orthogonal polynomials in the interval [−1, 1], with weighting function 1. That is,

⟨P_i(x), P_j(x)⟩ = ∫_{−1}^{1} P_i(x) P_j(x) dx = 0   if i ≠ j.

They are usually normalised so that P_n(1) = 1 and their norm in this case is:

⟨P_n(x), P_n(x)⟩ = ∫_{−1}^{1} P_n(x) P_n(x) dx = 2/(2n + 1)    (4.56)

The first few are:
P_0(x) = 1
P_1(x) = x
P_2(x) = (3x² − 1)/2
P_3(x) = (5x³ − 3x)/2
P_4(x) = (35x⁴ − 30x² + 3)/8
P_5(x) = (63x⁵ − 70x³ + 15x)/8
P_6(x) = (231x⁶ − 315x⁴ + 105x² − 5)/16
P_7(x) = (429x⁷ − 693x⁵ + 315x³ − 35x)/16
etc.

Fig. 4.8

In general they can be defined by the expression:

P_n(x) = ((−1)ⁿ / (2ⁿ n!)) dⁿ/dxⁿ (1 − x²)ⁿ    (4.57)

They also satisfy the recurrence relation:

(n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) − n P_{n−1}(x)    (4.58)

2. Chebyshev Polynomials
The general, compact definition of these polynomials is:

T_n(x) = cos(n cos⁻¹(x))    (4.59)

and they satisfy the following orthogonality condition:

⟨T_i(x), T_j(x)⟩ = ∫_{−1}^{1} T_i(x) T_j(x) / √(1 − x²) dx  =  0 if i ≠ j;   π/2 if i = j ≠ 0;   π if i = j = 0    (4.60)

That is, they are orthogonal in the interval [−1, 1] with the weighting function w(x) = 1/√(1 − x²).
They are characterised by having all their oscillations of the same amplitude and in the interval
[−1, 1] and also all their zeros occur in the same interval.
The first few Chebyshev polynomials are:
T_0(x) = 1
T_1(x) = x
T_2(x) = 2x² − 1
T_3(x) = 4x³ − 3x
T_4(x) = 8x⁴ − 8x² + 1
T_5(x) = 16x⁵ − 20x³ + 5x
T_6(x) = 32x⁶ − 48x⁴ + 18x² − 1
T_7(x) = 64x⁷ − 112x⁵ + 56x³ − 7x
… etc.

Fig. 4.9
They also can be constructed from the recurrence relation:
T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x)   for n ≥ 1.    (4.61)
Notice that for deriving expansions in terms of these polynomials, the modified inner product
including the weighting function must be used.
3. Hermite Polynomials
The definition of the Hermite polynomials is:

H_n(x) = (−1)ⁿ e^{x²} dⁿ/dxⁿ e^{−x²}

and their orthogonality condition is:

∫_{−∞}^{∞} H_n(x) H_m(x) e^{−x²} dx = 2ⁿ n! √π δ_{nm}    (4.62)

That is, they are orthogonal in (−∞, ∞) with the weighting function w(x) = e^{−x²}.
The first few polynomials are:

H_0(x) = 1
H_1(x) = 2x
H_2(x) = 4x² − 2
H_3(x) = 8x³ − 12x
H_4(x) = 16x⁴ − 48x² + 12
H_5(x) = 32x⁵ − 160x³ + 120x
H_6(x) = 64x⁶ − 480x⁴ + 720x² − 120

and the rest can be derived using the recursive relation:
H_{n+1}(x) = 2x H_n(x) − 2n H_{n−1}(x)    (4.63)

There are multiple other possibilities. For example, another commonly used family of functions
are the Laguerre polynomials, which are orthogonal in [0, ∞) with weighting function e^{−x}. Also,
sets of orthogonal functions can be constructed starting from a "mother function" and considering
its successive derivatives. These can be modified one by one so the whole set is orthogonal.
For example, the set of Hermite polynomials is orthogonal but not orthonormal and needs a
weighting function. This suggests modifying this set, or creating the functions (known as Hermite
functions) in the form:

ψ_n(x) = (1/√(2ⁿ n! √π)) e^{−x²/2} H_n(x)    (4.64)

which are orthonormal; that is, they satisfy ∫_{−∞}^{∞} ψ_n(x) ψ_m(x) dx = δ_{nm}. These functions correspond
to the orthogonalized form of the derivatives of the Gaussian function (the mother function in this
case).
Example
Approximate the function f(x) = 1/(1 + a²x²) with a = 4 in the interval [−1, 1] in the least squares
sense using Legendre polynomials up to order 2.

The approximation is f̃(x) = Σ_k c_k P_k(x) and the coefficients are:

c_k = (1/||P_k(x)||²) ∫_{−1}^{1} f(x) P_k(x) dx   with   ||P_k(x)||² = 2/(2k + 1).
From the expression above we can see that the calculation of the coefficients will involve the integrals:

I_m = ∫_{−1}^{1} x^m / (1 + a²x²) dx

which satisfy the recurrence relation:

I_m = 2/(a²(m − 1)) − I_{m−2}/a²   with   I_0 = (2/a) tan⁻¹(a)
We can also see that due to symmetry Im = 0 for m odd. (Integral of an odd function over the
interval [−1, 1]). Also, because of this, only even numbered coefficients are necessary. (Odd
coefficients of the expansion are zero).
The coefficients are then:

c_0 = (1/2) ∫_{−1}^{1} f(x) dx = (1/2) ∫_{−1}^{1} 1/(1 + a²x²) dx = I_0/2

c_2 = (5/2) ∫_{−1}^{1} P_2(x) f(x) dx = (5/2) ∫_{−1}^{1} ((3x² − 1)/2) / (1 + a²x²) dx = (5/4)(3I_2 − I_0)

c_4 = (9/2) ∫_{−1}^{1} P_4(x) f(x) dx = (9/16)(35I_4 − 30I_2 + 3I_0)

c_6 = (13/2) ∫_{−1}^{1} P_6(x) f(x) dx = (13/32)(231I_6 − 315I_4 + 105I_2 − 5I_0)

c_8 = (17/2) ∫_{−1}^{1} P_8(x) f(x) dx = (17/256)(6435I_8 − 12012I_6 + 6930I_4 − 1260I_2 + 35I_0)



The results are shown in Fig. 4.10,
where the red line corresponds to the
approximating curve. The green line at the
bottom shows the absolute value of the
difference between the two curves.
The total error of the approximation
(integral of this difference, divided by the
integral of the original curve) is 10.5%.

Fig. 4.10
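A Matlab sketch that evaluates the coefficients listed above (up to c_8) using the recurrence for I_m and plots the resulting approximation; the error measure at the end is the one used in the example (integral of the difference divided by the integral of the original curve):

% Legendre least squares approximation of f(x) = 1/(1 + a^2 x^2) (sketch).
a = 4;   f = @(x) 1./(1 + a^2*x.^2);
I = zeros(1, 9);                          % I(m+1) holds I_m, m = 0..8
I(1) = (2/a)*atan(a);                     % I_0
for m = 2:2:8
    I(m+1) = 2/(a^2*(m-1)) - I(m-1)/a^2;  % recurrence (odd I_m remain zero)
end
c0 = I(1)/2;
c2 = (5/4)*(3*I(3) - I(1));
c4 = (9/16)*(35*I(5) - 30*I(3) + 3*I(1));
c6 = (13/32)*(231*I(7) - 315*I(5) + 105*I(3) - 5*I(1));
c8 = (17/256)*(6435*I(9) - 12012*I(7) + 6930*I(5) - 1260*I(3) + 35*I(1));
x  = linspace(-1, 1, 401);
P2 = (3*x.^2 - 1)/2;
P4 = (35*x.^4 - 30*x.^2 + 3)/8;
P6 = (231*x.^6 - 315*x.^4 + 105*x.^2 - 5)/16;
P8 = (6435*x.^8 - 12012*x.^6 + 6930*x.^4 - 1260*x.^2 + 35)/128;
fa = c0 + c2*P2 + c4*P4 + c6*P6 + c8*P8;          % approximating curve
err = trapz(x, abs(f(x) - fa))/trapz(x, f(x))     % relative total error
plot(x, f(x), '+', x, fa, 'r', x, abs(f(x) - fa), 'g')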
Exercise 4.6
Use cosine functions (cos(nπx)) instead of Legendre polynomials (in the form of a Fourier series)
and compare the results.

Remarks:
We can observe in Fig. 4.10 of the example above that the error oscillates through the
domain. This is typical of least squares approximation, where the overall error is minimised.
In this context, Chebyshev polynomials are the best possible choice: the error obtained with
their approximation is the smallest possible among polynomials of up to the same degree.
Furthermore, these polynomials have other useful properties. We have seen before the interpolation
of data by higher order polynomials using equally spaced data points, and the problems that this
causes were quite clear. However, one can then ask whether there is a different distribution of points
that helps in minimising the error. The answer is yes: the optimum arrangement is to locate the
data points on the zeros of the Chebyshev polynomial of the order necessary to give the required
number of data points. These occur at:

x_k = cos((2k − 1)π/(2n))   for k = 1, …, n.
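A small Matlab sketch of the effect of interpolating at these Chebyshev nodes instead of equally spaced points (the test function here is an arbitrary choice, not taken from the notes):

% Equally spaced versus Chebyshev interpolation nodes (sketch).
f  = @(x) 1./(1 + 16*x.^2);                 % arbitrary test function
n  = 11;                                    % number of data points
xe = linspace(-1, 1, n);                    % equally spaced nodes
xc = cos((2*(1:n) - 1)*pi/(2*n));           % Chebyshev nodes
x  = linspace(-1, 1, 1001);
pe = polyval(polyfit(xe, f(xe), n-1), x);   % interpolating polynomial, equispaced
pc = polyval(polyfit(xc, f(xc), n-1), x);   % interpolating polynomial, Chebyshev
fprintf('max error: %.3g (equispaced), %.3g (Chebyshev)\n', ...
        max(abs(pe - f(x))), max(abs(pc - f(x))));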
4.2.3 Approximation to a Point
The approximation methods we have studied so far attempt to reach a "global" approximation to a
function; that is, to minimise the error over a complete domain. This might be desirable in many
cases but there could be others where it is more important to achieve a highly accurate approximation
at one particular point of the domain and in its close vicinity. One such method is the Taylor
approximation, where the function is approximated by a Taylor polynomial.
Taylor Polynomial Approximation
Among polynomial approximation methods, the Taylor polynomial gives the maximum possible
“order of contact” between the function and the polynomial; that is, the n-order Taylor polynomial
coincides with the function and with its n derivatives at the point of contact.
The Taylor polynomial for approximating a function f(x) at a point a is given by:

p(x) = f(a) + f'(a)(x − a) + (f''(a)/2!)(x − a)² + … + (f⁽ⁿ⁾(a)/n!)(x − a)ⁿ    (4.65)

and the error incurred is given by:  (f⁽ⁿ⁺¹⁾(ξ)/(n + 1)!)(x − a)ⁿ⁺¹,  where ξ is a point between a and x.
(See Appendix).

The figure shows the function (blue line):

y(x) = sin πx + cos 2πx

with the Taylor approximation (red line)
using a Taylor polynomial of order 9 (Taylor
series truncated at order 9). For comparison,
the Newton interpolation using 9 equally
spaced points (indicated with * and a black
line), giving a single polynomial of order 9,
is also shown.
While the Taylor approximation is very good
in the vicinity of zero (9 derivatives are also
matched there) it deteriorates badly away
from zero.
Fig. 4.11
If a function varies rapidly or has a pole, polynomial approximations will not be able to achieve a
high degree of accuracy. In that case an approximation using rational functions will be better: the
polynomial in the denominator provides the flexibility needed to represent rapid variations and poles.
Padé Approximation
Padé Approximants are rational functions, or ratios of polynomials, that fit the value of a function
and a number of its derivatives at one point. They usually provide an approximation that is better
than that of the Taylor polynomials, in particular in the case of functions containing poles.
A Padé approximation to a function f(x) that can be represented by a Taylor series in [a, b]
(or Padé approximant) is a ratio between two polynomials pm(x) and qn(x) of orders m and n
respectively:

f(x) ≈ R_m^n(x) = p_m(x)/q_n(x) = (a_m x^m + … + a_2 x² + a_1 x + a_0) / (b_n x^n + … + b_2 x² + b_1 x + b_0)    (4.66)
For simplicity, we’ll consider only approximations to a function at x = 0. For other values, a
simple transformation of variables can be used.
If the Taylor approximation at x = 0 (Maclaurin series) to f(x) is:
t(x) = Σ_{i=0}^{k} c_i x^i = c_k x^k + … + c_2 x² + c_1 x + c_0   with   k = m + n    (4.67)
we can write:

t(x) ≈ R_m^n(x) = p_m(x)/q_n(x)   or   t(x) q_n(x) = p_m(x).    (4.68)

Considering now this equation in its expanded form:

(c_k x^k + … + c_2 x² + c_1 x + c_0)(b_n x^n + … + b_2 x² + b_1 x + b_0) = a_m x^m + … + a_2 x² + a_1 x + a_0    (4.69)

we can establish a system of equations to find the coefficients of p_m and q_n (or simply p and q).
First of all, we can force this condition to apply at x = 0 (exact matching of the function at x = 0).
This will give t(0) q_n(0) = p_m(0), or:

c_0 b_0 = a_0    (4.70)

but, since the ratio R doesn't change if we multiply the numerator and denominator by any number,
we can choose the value b_0 = 1 and this gives us the value of a_0.
Taking now the first derivative of (4.69) will give:

(k c_k x^{k−1} + … + 2c_2 x + c_1)(b_n x^n + … + b_1 x + b_0) + (c_k x^k + … + c_1 x + c_0)(n b_n x^{n−1} + … + 2b_2 x + b_1) = m a_m x^{m−1} + … + 2a_2 x + a_1    (4.71)

and again, forcing this equation to be satisfied at x = 0 (exact matching of the first derivative),
gives:

c_1 b_0 + c_0 b_1 = a_1    (4.72)

which gives an equation relating the coefficients a_1 and b_1:   a_1 − c_0 b_1 = c_1 b_0    (4.73)
To continue with the process of taking derivatives of (4.69) we can establish first some general
formulae. If we call g(x) the product on the left hand side of (4.69), g(x) = t(x) q(x), we can apply
repeatedly the product rule to form the derivatives, giving:

g'(x) = t'(x) q(x) + t(x) q'(x)
g''(x) = t''(x) q(x) + 2 t'(x) q'(x) + t(x) q''(x)
⋮
g⁽ⁱ⁾(x) = Σ_{j=0}^{i} (i!/(j!(i−j)!)) t⁽ʲ⁾(x) q⁽ⁱ⁻ʲ⁾(x)    (4.74)

and since we are interested in the values of the derivatives at x = 0, we have:

g⁽ⁱ⁾(0) = Σ_{j=0}^{i} (i!/(j!(i−j)!)) t⁽ʲ⁾(0) q⁽ⁱ⁻ʲ⁾(0)    (4.75)
The first derivative of a polynomial, say q_n(x), is: n b_n x^{n−1} + … + 2b_2 x + b_1, so the second
derivative will be: n(n−1) b_n x^{n−2} + … + (3·2) b_3 x + 2b_2, and so on. Then these derivatives evaluated
at x = 0 are successively:  b_1, 2b_2, (3·2)b_3, (4·3·2)b_4, …, j! b_j

Then, we can write (4.75) as:

g⁽ⁱ⁾(0) = Σ_{j=0}^{i} (i!/(j!(i−j)!)) (i−j)! c_{i−j} j! b_j = i! Σ_{j=0}^{i} c_{i−j} b_j

and equating this to the ith derivative of p_m(x) evaluated at x = 0, which is i! a_i, gives:



a_i = Σ_{j=0}^{i} c_{i−j} b_j,   or   a_i = c_i b_0 + Σ_{j=1}^{i} c_{i−j} b_j,   but since b_0 = 1 we can finally write:

a_i − Σ_{j=1}^{i} c_{i−j} b_j = c_i   for i = 1, …, k,  (k = m + n)    (4.76)

where we take the coefficients a_i = 0 for i > m and b_i = 0 for i > n.
Example
Consider the function f(x) = 1/√(1 − x). This function has a pole at x = 1 and polynomial
approximations will not perform well.
The Taylor coefficients of this function are given by:

c_j = (1·3·5···(2j − 1)) / (2^j j!)

So the first five are:  c_0 = 1,  c_1 = 1/2,  c_2 = 3/8,  c_3 = 5/16,  c_4 = 35/128,  which gives a polynomial of
order 4.
From the equations above we can calculate the Padé coefficients. We choose m = n = 2, so k = m
+ n = 4.

Taking b_0 = 1,  c_0 b_0 = a_0  gives  a_0 = 1.
The other equations (from asking the derivatives to fit) are:

a_i − Σ_{j=1}^{i} c_{i−j} b_j = c_i   for i = 1, …, k,  (k = m + n)

and in detail:

a_1 − c_0 b_1 = c_1
a_2 − c_1 b_1 − c_0 b_2 = c_2
a_3 − c_2 b_1 − c_1 b_2 − c_0 b_3 = c_3
a_4 − c_3 b_1 − c_2 b_2 − c_1 b_3 − c_0 b_4 = c_4
but for m = n = 2, we have a_3 = a_4 = b_3 = b_4 = 0 and the system can be written as:

a_1 − c_0 b_1 = c_1
a_2 − c_1 b_1 − c_0 b_2 = c_2
0 − c_2 b_1 − c_1 b_2 = c_3
0 − c_3 b_1 − c_2 b_2 = c_4

and we can see that the second pair of equations can be solved for the coefficients b_i. Re-writing
this system:

c_2 b_1 + c_1 b_2 = −c_3
c_3 b_1 + c_2 b_2 = −c_4
In general, when n = m = k/2, as in this case, the matrix that defines the system will be of the form:



















r_0       r_{−1}    r_{−2}    …   r_{1−n}
r_1       r_0       r_{−1}    …   r_{2−n}
r_2       r_1       r_0       …   r_{3−n}
⋮                                 ⋮
r_{n−1}   r_{n−2}   r_{n−3}   …   r_0






This is a special kind of matrix called a Toeplitz matrix, which has the same element along each
diagonal, so it is defined by a total of 2n−1 numbers. There are special methods for solving systems
with this type of matrix. Once the coefficients b_i are known, the coefficients a_i are easily found.
Solving the system will give:

a = [1, −3/4, 1/16]   and   b = [1, −5/4, 5/16]

giving:

R_2^2(x) = (a_0 + a_1 x + a_2 x²)/(b_0 + b_1 x + b_2 x²) = (1 − 0.75x + 0.0625x²)/(1 − 1.25x + 0.3125x²)
The figure shows the function f(x) = 1/√(1 − x) (blue line) and the Padé
approximant (red line)

R_2^2(x) = (1 − 0.75x + 0.0625x²)/(1 − 1.25x + 0.3125x²)

The Taylor polynomial up to 4th order,

P(x) = c_0 + c_1 x + c_2 x² + c_3 x³ + c_4 x⁴,

is shown with a green line.

Fig. 4.12

It is clear that the Padé approximant gives a better fit, particularly closer to the pole.

Increasing the order, the Padé
approximation gets closer to the singularity.
The figure shows the approximations for
m = n = 1, 2, 3 and 4.
The poles closest to 1 of these approximants are:

R_1^1(x): 1.333
R_2^2(x): 1.106
R_3^3(x): 1.052
R_4^4(x): 1.031

Fig. 4.13

We can see that even R_1^1(x), a ratio of linear functions of x, gives a better approximation than the
4th order Taylor polynomial.
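A Matlab sketch of the [2/2] Padé calculation above, solving the small Toeplitz system for b and then computing a from (4.76):

% Pade approximant R_2^2 of f(x) = 1/sqrt(1-x) from its Taylor coefficients (sketch).
j = 1:4;
c = cumprod([1, (2*j - 1)./(2*j)]);            % c_j = 1*3*...*(2j-1)/(2^j j!), j = 0..4
T = [c(3) c(2); c(4) c(3)];                    % c2*b1 + c1*b2 = -c3 ; c3*b1 + c2*b2 = -c4
b = [1; -(T\[c(4); c(5)])];                    % b = [b0; b1; b2] = [1; -5/4; 5/16]
a = [c(1);
     c(2)*b(1) + c(1)*b(2);
     c(3)*b(1) + c(2)*b(2) + c(1)*b(3)];       % a = [1; -3/4; 1/16]
x = linspace(-1, 0.95, 200);
R = polyval(flipud(a), x)./polyval(flipud(b), x);   % the Pade approximant R_2^2(x)
plot(x, 1./sqrt(1 - x), 'b', x, R, 'r')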
Exercise 4.7
Using the Taylor (Maclaurin) expansion of f(x) = cos x, truncated to order 4:

t(x) = Σ_{k=0}^{4} c_k x^k = 1 − x²/2 + x⁴/24

Find the Padé approximant:

R_2^2(x) = p_2(x)/q_2(x) = (a_2 x² + a_1 x + a_0)/(b_2 x² + b_1 x + b_0)


5. Matrix Computations
We have seen earlier that a number of issues arise when we consider errors in the calculations
dealing with machine numbers. When matrices are involved, the problems of accuracy of
representation, error propagation and sensitivity of the solutions to small variations in the data are
much more important.
Before discussing any methods of solving matrix equations, we consider first the rather
fundamental matrix property of ‘condition number’.
5.1 ‘Condition’ of a Matrix
We have seen that multiplying or dividing two floating-point numbers gives an error of the
order of the ‘last preserved bit’. If, say, two numbers are held to 8 decimal digits, the resulting
product (or quotient) will effectively have its least significant bit ‘truncated’ and therefore have a
relative uncertainty of ± 10–8.
By contrast, with matrices and vectors, multiplying (that is, evaluating y = Ax) or ‘dividing’
(that is, solving Ax = y for x) can lose in some cases ALL significant figures!
Before examining this problem we have to define matrix and vector norms.
5.1.1 Vector and matrix norms
To introduce the idea of ‘length’ into vectors and matrices, we have to consider norms:
Vector norms
If xᵀ = (x_1, x_2, …, x_n) is a real or complex vector, a general norm is denoted by ||x||_N and is defined
by:

||x||_N = (Σ_{i=1}^{n} |x_i|^N)^{1/N}    (5.1)

So the usual "Euclidian norm", or "length", is ||x||_2 or simply ||x||:

||x||_2 = √(x_1² + … + x_n²)    (5.2)

Other norms are used, e.g. ||x||_1 and ||x||_∞, the latter corresponding to the greatest in
magnitude |x_i| (show this as an exercise).
Matrix norm
If A is an n-by-n real or complex matrix, we denote its norm, defined by:

||A||_N = max_{x≠0} ( ||Ax||_N / ||x||_N )   for any choice of the vector x    (5.3)

According to our choice of N in defining the vector norms by (5.1), we have the corresponding
||A||_1, ||A||_2, ||A||_∞, the Euclidian N = 2 being the commonest. Note that (ignoring the question
"how do we find its value"), for given A and N, ||A|| has some specific numerical value ≥ 0.


5.1.2 ‘Condition’ of a Linear System Ax = y
This is in the context of Hadamard’s general concept of a ‘well-posed-problem’, which is
roughly one where the result is not too sensitive to small changes in the problem specification.

Definition: The problem of finding x, satisfying Ax = y, is well posed or well conditioned if:

(i) a unique x satisfies Ax = y, and
(ii) small changes in either A or y result in small changes in x.

For a quantitative measure of “how well conditioned” a problem is, we need to estimate the
amount of variation in x when y varies and/or the variation in x when A changes slightly or the
corresponding changes in y when either (or both) x and A vary.
Suppose A is fixed, but y changes slightly to y + δy, with the associated x changing to x + δx.
We have:
Ax = y, (5.4)
and so: A (x + δx) = y + δy

Subtracting gives:
A δx = δy or δx = A–1 δy (5.5)
From our definition (5.3), we must have for any A and z:

||Az|| / ||z|| ≤ ||A||   and so   ||Az|| ≤ ||A||·||z||    (5.6)

Taking the norm of both sides of (5.5) and using inequality (5.6) gives:

||δx|| = ||A⁻¹ δy|| ≤ ||A⁻¹||·||δy||    (5.7)

Taking the norm of (5.4) and using inequality (5.6) gives:

||y|| = ||Ax|| ≤ ||A||·||x||    (5.8)

Finally, multiplying corresponding sides of (5.7) and (5.8) and dividing by ||x||·||y|| gives our
fundamental result:

||δx|| / ||x|| ≤ ||A||·||A⁻¹|| (||δy|| / ||y||)    (5.9)

For any square matrix A we introduce its condition number and define:

cond(A) = ||A||·||A⁻¹||    (5.10)
We note that a ‘good’ condition number is small, near to 1.
5.1.3 Relevance of the Condition Number
The quantity ||δy||/||y|| can be interpreted as a measure of relative uncertainty in the vector
y. Similarly, ||δx||/||x|| is the associated relative uncertainty in the vector x.
From equations (5.9) and (5.10), cond(A) gives an upper bound (worst case) factor of
degradation of precision between y and x = A⁻¹y. Note that if we re-write the equations from (5.4)
for A⁻¹ instead of A (and reversing x and y), equation (5.9) will be the same but with x and y
reversed and (5.10) would remain unchanged. These two equations therefore give the important
result that:
If A denotes the precise transformation y = Ax and δx, δy are related small changes in x and
y, the ratio:

(||δx||/||x||) / (||δy||/||y||)

must lie between 1/cond(A) and cond(A).
Example
Here is an example using integers for total precision. Suppose:

A = [ 100   99 ]
    [  99   98 ]

We then have:

Ax = y:    A [ 1000 ]  =  [ 1000 ]
             [−1000 ]     [ 1000 ]

Shifting x slightly gives:

A [ 1001 ]  =  [ 1199 ]
  [ −999 ]     [ 1197 ]

Alternatively, shifting y slightly:

A [  803 ]  =  [ 1001 ]
  [ −801 ]     [  999 ]
So a small change in y can cause a big change in x or vice versa. We have this clear moral,
concerning any matrix multiplication or (effectively) inversion:
For given A, either multiplying (Ax) or ‘dividing’ (A–1y) can be catastrophic, the degree of
catastrophe depending on cond(A) and on the ‘direction’ of change in x or y.
In the above example, cond(A) is 39206 (approximately 40000).
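A quick Matlab check of these numbers (a sketch):

% Condition number of the example matrix and its effect (sketch).
A = [100 99; 99 98];
cond(A)                            % approximately 3.92e4
x  = [1000; -1000];   y = A*x;     % y = [1000; 1000]
dy = [1; -1];                      % small change in y
dx = A\(y + dy) - x;               % resulting (large) change in x
relative_change_x = norm(dx)/norm(x)
relative_change_y = norm(dy)/norm(y)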

5.2 Matrix Computations – Problem Types
We now consider methods for solving matrix equations. The most common problems
encountered are of the form:
A x = y (5.11)
or A x = 0, requiring det(A) = 0 (5.12)
(we will consider this a special case of (5.11)) or
A x = λ B x (5.13)
where A (and B) are known n×n matrices and x and λ are unknown.
Usually B (and sometimes A) is positive definite (meaning that xᵀBx > 0 for all x ≠ 0).
A is sometimes complex (and also B), but numerically the difference is straightforward, and so we
will consider only real matrices.


5.2.1 Types of Matrices A and B (Sparsity Patterns)
We can classify the problems (or the matrices) according to their sparsity (that is, the
proportion of zero entries). This has a strong bearing on the choice of solution method.
In this form, the main categories are: dense, where most of the matrix elements are nonzero, and
sparse, where a large proportion of them are zero. Among the sparse, we can distinguish several
types:
Banded matrices: all nonzero elements are grouped in a band around the main diagonal
(fixed and variable band).
Arbitrarily sparse matrices.
Finally, sparse matrices with any pattern, but where the elements are not stored; that is,
elements are 'generated' or calculated each time they are needed in the solving algorithm.

* * * 0 0 0
* * * * 0 0
* * * * * 0
0 * * * * *
0 0 * * * *
0 0 0 * * *
 
 
 
 
 
 
 
 
 

Fig. 5.1 Zeros and non-zeros in band matrix of semi-bandwidth 3
We can also distinguish between different solution methods, basically:
DIRECT where the solution emerges in a finite number of calculations (if we temporarily ignore
round-off error due to finite word-length).
INDIRECT or iterative, where a step-by-step procedure converges towards the correct solution.
Indirect methods can be specially suited to sparse matrices (especially when the order is
large) as they can often be implemented without the need to store the entire matrix A (or
intermediate forms of matrices) in high speed memory.
All the common direct routines are available in software libraries and in books and journals,
most commonly in Fortran or Fortran90/95, but also some in C.
5.3 Direct Methods of Solving Ax = y
5.3.1 Gauss elimination or LU decomposition
The classic solution method of (5.11) is by the Gauss method. For example, given the system:











[ 1   4   7 ] [ x_1 ]     [ 1 ]
[ 2   5   8 ] [ x_2 ]  =  [ 1 ]    (5.14)
[ 3   6  11 ] [ x_3 ]     [ 1 ]
we subtract 2 times the first row from the second row, and then we subtract 3 times the first row
from the third row, to give:












[ 1    4    7 ] [ x_1 ]     [  1 ]
[ 0   −3   −6 ] [ x_2 ]  =  [ −1 ]    (5.15)
[ 0   −6  −10 ] [ x_3 ]     [ −2 ]
and then subtracting 2 times the second row from the third row gives:













[ 1    4    7 ] [ x_1 ]     [  1 ]
[ 0   −3   −6 ] [ x_2 ]  =  [ −1 ]    (5.16)
[ 0    0    2 ] [ x_3 ]     [  0 ]
The steps from (5.14) to (5.16) are termed ‘triangulation’ or ‘forward-elimination’. The
triangular form of the left-hand matrix of (5.16) is crucial; it allows the next steps.
The third row immediately gives:
x3 = 0 (5.17a)
and substitution into row 2 gives:
(–3) x2 + 0 = –1 and so: x2 =1/3 (5.17b)
and then into row 1 gives:
x1 + 4 (1/3) + 0 = 1 and so: x1 = –1/3 (5.17c)
The steps through (5.17) are called ‘back-substitution’. At this point we ignore the complications
involved with ‘pivoting’, a technique sometimes required to improve numerical stability and which
consists of changing the order of rows and columns.
Important points about this algorithm:
1. When performed on a dense matrix, (or on a sparse matrix but not taking advantage of
the zeros), computing time is proportional to n3 (n: order of the matrix). This means that doubling
the order of the matrix will increase computation time by up to 8 times!
2. The determinant comes immediately as the product of the diagonal elements of (5.16).
3. Algorithms that take advantage of the special ‘band’ and ‘variable-band’ are very
straightforward, by changing the limits of the loops when performing row or column operations,
and some ‘book-keeping’. For example, in a matrix of ‘semi-bandwidth’ 3, the first column has
non-zero elements only in the first 3 rows as in Fig. 5.1. Then only those 3 numbers need storing,
and only the 2 elements below the diagonal need ‘eliminating’ in the first column.
4. Oddly, it turns out that, in our context, one should NEVER form the inverse matrix A⁻¹ in
order to solve Ax = y for x.
vectors, y, it is better to ‘keep a record’ of the triangular form of (5.16) and back-substitute as
necessary.
5. Other methods very similar to Gauss are due to Crout and Choleski. The latter is (only)
for use with symmetric matrices. Its advantage is that time AND storage are half that of the
orthodox Gauss.
6. There are variations in the implementation of the basic method developed to take
advantage of the special type of sparse matrices encountered in some cases, for example when
solving problems using finite differences or finite elements. One of these is the frontal method;
here, elimination takes place in carefully controlled manner, with intermediate results being kept
in backing-store. Another variation consists of only storing the ‘nonzero’ elements in the matrix,
at the expense of a great deal of ‘book-keeping’ and reordering (renumbering) rows and columns
through the process, in the search for the best compromise between numerical stability and fill-in.
5.3.2 LU Factorisation
In practice and particularly for large sparse matrices, the Gauss elimination method takes the form
of an LU factorisation. That is, the matrix A is factorised in the form A = LU, where L and U are
lower triangular and upper triangular matrices, respectively.
Once A is in that form, the problem to solve is: LUx = y. This can be written as Lz = y defining


the auxiliary vector z = Ux, which can be found simply by forward substitution (n steps). Then,
x is found from Ux = z by backward substitution. The most expensive part of this algorithm is
the factorisation itself. There are several variations here; one of the most popular is the
Choleski factorisation, where U = Lᵀ (or L = Uᵀ), applicable to symmetric positive definite
matrices.
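A minimal Matlab sketch of this procedure, using the system of the Gauss elimination example above (note that Matlab's lu may also permute rows, which is the 'pivoting' mentioned earlier):

% Solving Ax = y via LU factorisation, forward and backward substitution (sketch).
A = [1 4 7; 2 5 8; 3 6 11];
y = [1; 1; 1];
[L, U, P] = lu(A);      % P*A = L*U
z = L\(P*y);            % forward substitution:  L z = P y
x = U\z                 % backward substitution: U x = z  ->  x = [-1/3; 1/3; 0]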
5.4 Iterative Methods of Solving Ax = y
We will outline 2 types of methods: (1a) the Jacobi (or simultaneous displacement) and (1b) the
closely related Gauss-Seidel (or successive displacement) algorithm, and (2) gradient methods and
their most popular version, the conjugate gradient method.
5.4.1 Jacobi and Gauss-Seidel Methods
Two algorithms that are simple to implement are the closely related Jacobi (simultaneous
displacement) and Gauss-Seidel (successive displacement or ‘relaxation’).
a) Jacobi Method or Simultaneous Displacement
Suppose the set of equations for solution are:
a11x1 + a12x2 + a13x3 = y1
a21x1 + a22x2 + a23x3 = y2 (5.18)
a31x1 + a32x2 + a33x3 = y3
This can be re-organised to:
x1 = ( y1 – a12x2 – a13x3)/a11 (5.19a)
x2 = ( y2 – a23x3 – a21x1)/a22 (5.19b)
x3 = ( y3 – a31x1 – a32x2)/a33 (5.19c)
Suppose we had the vector x(0) = [x1, x2, x3](0), and substituted it into the right-hand side of
(5.19), to yield on the left-hand side the new vector x(1) = [x1, x2, x3](1). Successive substitutions
will give the sequence of vectors:
x(0), x(1), x(2), x(3), . . . .
Because (5.19) is merely a rearrangement of the equations for solution, the ‘correct’ solution
substituted into (5.19) must be self-consistent, i.e. yield itself! The sequence will:
either converge to the correct solution
or diverge.
If the matrix A is split in the form:
A = L + D + U (do not confuse with the L and U factors)
where L, D and U are the lower triangular, diagonal and upper triangular parts of A, the Jacobi algorithm
can be written in matrix form as:
x^{k+1} = D⁻¹(y − L x^k − U x^k).
b) Gauss-Seidel Method or Successive Displacement
Note that when eq. (5.19a) is applied, a ‘new’ value of x1 would be available, which could
be used instead of the ‘previous’ x1 value when solving (5.19b). And similarly for x1 and x2 when
applying (5.19c). This is the Gauss-Seidel or successive displacement iterative scheme, illustrated
here with an example, showing that the computer program (in Matlab) is barely more complicated


than writing down the equations.

% Example of successive displacement
x1 = 0. ;
x2 = 0. ;
x3 = 0. ; % Equations being solved are:
for i=1:10
x1 = (4. + x2 - x3)/4. ; % 4*x1 - x2 + x3 = 4
x2 = (9. - 2.*x3 - x1)/6. ; % x1 +6*x2 + 2*x3 = 9
x3 = (2. + x1 + 2.*x2)/5. ; % -x1 -2*x2 + 5*x3 = 2
fprintf('%-18.8g %-18.8g %-18.8g\n',x1, x2, x3)
end

1 1.3333333 1.1333333
1.05 0.94722222 0.98888889
0.98958333 1.0054398 1.0000926
1.0013368 0.99974633 1.0001659
0.99989511 0.99996218 0.9999639
0.99999957 1.0000121 1.0000048
1.0000018 0.99999811 0.99999961
0.99999962 1.0000002 1
1 0.99999999 1
1 1 1

Using the same split form of the matrix A, the Successive Displacement or Gauss-Seidel
algorithm can be written as:
x^{k+1} = D⁻¹(y − L x^{k+1} − U x^k)   or   x^{k+1} = (L + D)⁻¹(y − U x^k)
Whether the algorithm converges or not depends on the matrix A and (surprisingly) not on
the right-hand-side vector y of (5.18). Convergence does not even depend on the ‘starting value’
of the vector, which only affects the necessary number of iterations.
We will skip over any formal proof of convergence. But the schemes converge if-and-only-
if all the eigenvalues of the matrix:
D–1(U + L) {for simultaneous displacement}
or (D + L)–1U {for successive displacement}
lie within the unit circle.
For applications, it is simpler to use some sufficient (but not necessary!) conditions, when
possible, such as:
(a) If A is symmetric and positive-definite, then successive displacement converges.
(b) If, in addition, ai,j < 0 for all i ≠ j, then simultaneous displacement also converges.
Condition (a) is a commonly satisfied condition. Usually, the successive method converges
in about half the computer time of the simultaneous, but, strictly there are matrices where one
method converges and the other does not, and vice-versa.
c) Successive Overrelaxation (SOR) method
A variation of the Successive Displacement method is called the Successive Overrelaxation
Method or SOR.


From the matrix form of the Gauss-Seidel algorithm, we can write, subtracting x^k from both sides:

x^{k+1} − x^k = D⁻¹(y − L x^{k+1} − D x^k − U x^k)   (Gauss-Seidel)

and we can think of this difference as a "correction" to the previous value; in every iteration, a
new correction will be made. So, we can think of accelerating convergence if we go beyond this
correction; that is, to correct by a bit more or "over-relax" the system. The idea of the SOR is then
to calculate x^{k+1} = x^k + ω(x̂^{k+1} − x^k), where x̂^{k+1} is the Gauss-Seidel update and ω,
1 < ω < 2, is a correction parameter. Note that with ω = 1, the SOR reverts to the standard Gauss-Seidel.
The SOR algorithm then becomes:

x^{k+1} = x^k + ω D⁻¹(y − L x^{k+1} − D x^k − U x^k)

or   x^{k+1} = (D + ωL)⁻¹ [ω y + ((1 − ω)D − ωU) x^k];   1 < ω < 2

The algorithm to calculate each term at iteration k+1 for the three methods is then:
for i = 1, n

   x_i^{k+1} = x_i^k + (1/a_ii) [ y_i − Σ_{j=1}^{i−1} a_ij x_j^k − Σ_{j=i}^{n} a_ij x_j^k ]        Jacobi

   x_i^{k+1} = x_i^k + (1/a_ii) [ y_i − Σ_{j=1}^{i−1} a_ij x_j^{k+1} − Σ_{j=i}^{n} a_ij x_j^k ]    Gauss-Seidel

   x_i^{k+1} = x_i^k + (ω/a_ii) [ y_i − Σ_{j=1}^{i−1} a_ij x_j^{k+1} − Σ_{j=i}^{n} a_ij x_j^k ]    SOR

end
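A Matlab sketch of one possible implementation of this loop (the matrix is the one used in the successive displacement example above; with omega = 1 it reduces to Gauss-Seidel):

% Element-by-element Gauss-Seidel / SOR sweeps (sketch).
A = [4 -1 1; 1 6 2; -1 -2 5];
y = [4; 9; 2];
n = length(y);
omega = 1;                          % 1 < omega < 2 over-relaxes; omega = 1 is Gauss-Seidel
x = zeros(n, 1);
for k = 1:10
    for i = 1:n
        s = y(i) - A(i,1:i-1)*x(1:i-1) - A(i,i:n)*x(i:n);  % updated x_j used for j < i
        x(i) = x(i) + omega*s/A(i,i);
    end
end
x                                   % with omega = 1 this reproduces [1; 1; 1]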
5.4.2 Gradient Methods
Although known and used for decades before, these methods came to be adopted in the
1980’s as one of the most popular iterative algorithms for solving linear systems of equations:
A x = y. The fundamental idea of this approach is to define the residual y – A x for some trial
vector x and proceed to minimize the residual with respect to the components of x.
The equation to be solved for x:
A x = y (5.20)
can be recast as: “finding x to minimize the residual”, a column vector r defined as a function of
x by:
r = y – A x (5.21)
The general idea of this kind of methods is to search for the solution (minimum of the
residual) in a multidimensional space (spanned by the components of vector x), starting from a
point x0 and choosing a direction to move. The optimum distance to move along that direction
can then be calculated.
In the steepest descent method, the simplest form of this method, these directions are chosen
as those of the gradient of the error residual at each iteration point. Because of this, they will be
mutually orthogonal and then there will be no more than n different directions. In 2D, see Fig.
5.2, this means that every time we’ll have to make a change of direction at right angle to the
previous, but this will not always allow us to reach the minimum or at least not in an efficient way.
The norm ||r|| of this residual vector is an obvious choice for the quantity to minimise, and
also ||r||², the square of the norm of the residual, which is not negative, is only zero when the
error is zero, and requires no square roots to calculate. However, using (5.21) gives (if A is
symmetric: Aᵀ = A):

||r||² = (y − Ax)ᵀ(y − Ax) = xᵀAAx − 2xᵀAy + yᵀy    (5.22)

which is rather awkward to compute, because of the product AA. Another possible choice of error
function (measure of the error), valid for the case where the matrix A is positive definite and which
is also minimised for the correct solution, is the function h²(x) = rᵀA⁻¹r (instead of rᵀr as in
(5.22)). This gives a simpler form (see Appendix, section 5):

h² = rᵀA⁻¹r = (y − Ax)ᵀ A⁻¹ (y − Ax) = yᵀA⁻¹y − 2xᵀy + xᵀAx    (5.23)

Because the first term in the r.h.s. of (5.23) is independent of the variables and will play no part in
the minimisation, it can be omitted. Also, for convenience, we can slightly change the
definition of the error function dividing by a factor of 2, so we finally have:

h² = ½ xᵀAx − xᵀy    (5.24)

which in expanded form is:  h² = ½ Σ_i Σ_j x_i a_ij x_j − Σ_i x_i y_i.

We can also note from (5.24) that  ∇h² = Ax − y = −r    (5.25)
The Steepest Descent Method:
The algorithm proceeds by
evaluating the error function h² at a
starting point x_0 and then choosing a
search direction. It would seem obvious
that the direction to choose in this case is
the direction of the negative of the
gradient of the error function at that point.
From (5.25) we can see that this search
direction is p = −∇h² = r.
The next step is to find the minimum
value of the error function along that line.
That is, if p = r is the direction opposite
to the gradient (p pointing down), the line
is defined by all the points (x + αp),
where α is a scalar parameter. The next
step is to find the value of α that
minimizes the error. This gives the next point x_1. (Since in this case α is the only variable, it is
simple to calculate the gradient of the error function as a function of α and find the
corresponding minimum.)
(Note that h² is quadratic so the 2D contour lines will be quadratic curves – not as shown in the
figure.)
Exercise 5.1:
Show that the value of α that minimizes the error function along the line (x + αp) in the two cases
mentioned above (using the squared norm of the residual as error function, or the function h²), for
a symmetric matrix A, is given respectively by:

α = pᵀA(y − Ax) / ||Ap||²   and   α = pᵀ(y − Ax) / (pᵀAp)

Fig. 5.2


The algorithm is then:
Take x_0 = 0, r_0 = y, p_0 = r_0.  (Or x_0 = x_0, r_0 = y − Ax_0, p_0 = r_0)
for n = 1, 2, 3, …
  - Compute a step length:  α_n = (p_{n−1}ᵀ r_{n−1}) / (p_{n−1}ᵀ A p_{n−1}) = (r_{n−1}ᵀ r_{n−1}) / (r_{n−1}ᵀ A r_{n−1})
  - Update the approximate solution:  x_n = x_{n−1} + α_n p_{n−1}
  - Update the residual:  r_n = y − A x_n = y − A(x_{n−1} + α_n p_{n−1}) = r_{n−1} − α_n A p_{n−1}
  - Set the new search direction:  p_n = r_n
end
Note that it is not necessary to normalize the vector p because any normalization factor will be
cancelled in the calculation of the product αp for the next step.
It turns out that the best direction to choose is not that of the negative of the gradient, which
is the one taken in the conventional "gradient descent" or "steepest descent" method. In this
case the consecutive directions are all orthogonal to each other, as illustrated in Fig. 5.2, and this
leads to poor convergence. There are only n different directions and the algorithm will start
using them again and again in a "multidimensional zig-zag".
Several variations appear at this point. The more efficient and popular "Conjugate Gradient
Method" looks in a direction which is 'A-orthogonal' (or "conjugate") to the previous one
(choosing p_{n−1}ᵀ A p_n = 0 instead of p_{n−1}ᵀ p_n = 0) in order to avoid this. The reason why this choice
is made is beyond the scope of this course, but the idea can be understood if we see that in 2D (2×2
matrices) the contours of the error function (if A is symmetric and positive definite) are ellipses,
and the perpendicular to an ellipse from an arbitrary point on the ellipse doesn't pass through the
minimum. The choice of the A-orthogonal direction to the previous one is equivalent to transforming
the variables such that the contours are circles, in which case the perpendicular will pass exactly
through the centre (minimum). For this reason, this algorithm is guaranteed to converge in a
maximum of n iterations. Of course, if the order is large the convergence can still be very slow in
some cases.
A useful feature of these methods (as can be observed from the expressions above) is that
reference to the matrix A is only via simple matrix products; at any iteration, we need only form
the product of A times a vector (Ax0 to start and then Apn-1). These can be formed from a given
sparse matrix A without unnecessary multiplication (of non-zeros) or storage.
The Conjugate Gradient Algorithm
The basic steps of the algorithm are as follows:
- Choose a starting point x_0 and calculate r_0 = y − A x_0
- Choose the first direction to search. In the first step we choose that of the negative gradient of
  h², and this coincides with the direction of the residual r:
  ∇h² = ∇(½ x_0ᵀ A x_0 − x_0ᵀ y) = A x_0 − y = −r_0
  so we choose: p_0 = r_0.
- For n = 1, 2, 3, …
  - Calculate the step length:  α_n = (p_{n−1}ᵀ(y − A x_{n−1})) / (p_{n−1}ᵀ A p_{n−1}) = (p_{n−1}ᵀ r_{n−1}) / (p_{n−1}ᵀ A p_{n−1})
  - Calculate the new point:  x_n = x_{n−1} + α_n p_{n−1}
  - Calculate the new residual:  r_n = r_{n−1} − α_n A p_{n−1}
  - Calculate the next search direction. Here is where the methods differ. For the
    conjugate gradient the new vector p is not r but instead:
    p_n = r_n + β_n p_{n−1}   where   β_n = (r_nᵀ r_n) / (r_{n−1}ᵀ r_{n−1})   (gradient correction factor)
- end
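A compact Matlab version of these steps (a sketch; A is assumed symmetric and positive definite):

function x = cg_sketch(A, y, tol, maxit)
% Conjugate gradient iteration following the steps listed above (sketch).
x = zeros(size(y));
r = y - A*x;                          % initial residual
p = r;                                % first search direction
for n = 1:maxit
    Ap    = A*p;
    alpha = (p'*r)/(p'*Ap);           % step length
    x     = x + alpha*p;              % new point
    rnew  = r - alpha*Ap;             % new residual
    beta  = (rnew'*rnew)/(r'*r);      % gradient correction factor
    p     = rnew + beta*p;            % next (conjugate) search direction
    r     = rnew;
    if norm(r) < tol*norm(y), break, end
end
end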
In many cases it is necessary to use a preliminary ‘pre-conditioning’ of the matrix A to
improve the convergence rate, particularly when the condition number of A is large. This leads to
the popular ‘PCCG’ or Pre-Conditioned-Conjugate-Gradient algorithm, as a complete package
and to many variations in implementation that can be found as commercial packages using
different preconditioning methods.
5.4.3 Preconditioning
One of the problems encountered frequently when using iterative methods to solve the
problem Ax = y is a low convergence rate. This can happen when the condition number of the
matrix A is large, and the problem becomes more serious when the matrices are very large. The
solution to this is to modify somehow the problem to make it easier to solve. For example, if we
modify the matrix equation multiplying it by another matrix or matrices we will get a new system
Ãx̃ = ỹ which might be easier to solve.
A trivial system that is solved with no extra cost is obtained when Ã = I, the identity matrix. But
obviously, that would mean, for example, premultiplying A by its inverse, and we know that
calculating the inverse is a very costly procedure, particularly when the matrices are large and
sparse. However, we can think of an approximation to this, selecting the transformation so that
Ã is as close as possible to the identity matrix but the transformation is easier to calculate.
There are several ways of performing this transformation:
One is the left preconditioning. The idea here is to select a matrix M that resembles A such that
M⁻¹A resembles the identity matrix.

left preconditioning:   M⁻¹A x = M⁻¹y
   Ãx̃ = ỹ,   Ã = M⁻¹A,   x̃ = x,   ỹ = M⁻¹y

It can also be

right preconditioning:   AM⁻¹ Mx = y
   Ãx̃ = ỹ,   Ã = AM⁻¹,   x̃ = Mx,   ỹ = y

or two-sided preconditioning:   M_1⁻¹ A M_2⁻¹ M_2 x = M_1⁻¹ y
   Ãx̃ = ỹ,   Ã = M_1⁻¹ A M_2⁻¹,   x̃ = M_2 x,   ỹ = M_1⁻¹ y

To perform matrix-vector multiplications of the form u = Ãv, a common need in these iterative
methods, the explicit calculation of the inverse of the matrices M can be avoided. For example,
for the case of two-sided preconditioning, the product can be calculated by:
(i) find u_1 from M_2 u_1 = v
(ii) find u_2 as u_2 = A u_1
(iii) find u from M_1 u = u_2
And similarly, but simpler, for the other forms.
Splitting the matrix A in the form A = L + D + U, where L, D and U are the lower triangular,
diagonal and upper triangular parts of A, some common forms of preconditioners are:

Jacobi:   M = D
Gauss-Seidel:   M = D + L
SOR:   M = (1/ω)D + L,   where ω is a scalar parameter in [0, 2].
Symmetric Gauss-Seidel:   M = (D + L) D⁻¹ (D + U),   with U = Lᵀ for symmetric matrices
Symmetric SOR:   M = M_1 M_2,   M_1 = (1/ω)D + L,   M_2 = (ω/(2 − ω)) D⁻¹ ((1/ω)D + U)

A very common method is to use an incomplete LU decomposition, and in particular for
symmetric, positive definite matrices, an incomplete Choleski (A = LLᵀ) decomposition (or IC).
In a "full" decomposition, factorising a sparse matrix in the form A = LU leads to an increase in
the number of nonzero values or "fill-in". In the incomplete decomposition, the fill-in is partially
or totally ignored, leading to just an approximation of the matrix A.
This can be used for preconditioning, using M = M_1 M_2 = L_inc L_incᵀ for the IC:

L_inc⁻¹ A L_inc⁻ᵀ (L_incᵀ x) = L_inc⁻¹ y
Ãx̃ = ỹ,   Ã = L_inc⁻¹ A L_inc⁻ᵀ,   x̃ = L_incᵀ x,   ỹ = L_inc⁻¹ y
This is the case of the popular “ICCG” method or Incomplete Choleski Conjugate Gradients,
where the CG method is applied to a matrix preconditioned using IC.
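In MATLAB the ICCG combination can be tried directly with the built-in pcg and ichol functions. The sketch below, which assumes the same test matrix as in the example that follows, is only an indication of how such a call is typically made; the tolerance and iteration limit are arbitrary and the actual script ExamplePCG.m may differ:

n = 500;
A = sparse(diag(sqrt(1:n)) + diag(cos(1:(n-10)),10) + diag(cos(1:(n-10)),-10));
b = A*ones(n,1);
Linc = ichol(A);                              % incomplete Cholesky factor of A
[x,flag,relres,iter] = pcg(A, b, 1e-8, 200, Linc, Linc');   % preconditioned CG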
Example of Conjugate Gradients

Fig. 5.3 Comparison of Convergence
Fig. 5.3 shows the results of using the Conjugate Gradients method with and without
preconditioning to solve the problem Ax = b where the matrix A is symmetric, of order 500, and
consists of 3 nonzero diagonals defined by the Matlab command:

A=sparse(diag(sqrt(1:n))
+diag(cos(1:(n-10)),10)
+diag(cos(1:(n-10)),-10));
and the right hand side vector is
b = A*ones(500,1).
It can be seen that even the simple Jacobi preconditioner gives in this case a significant acceleration
of convergence.
These calculations were done using the script ExamplePCG.m.
5.4.4 Krylov Methods
Several iterative methods are based on the same basic strategy. They rely on the
calculation of a Krylov space that contains the solution and only differ in the definition of those
spaces and in the search method. A Krylov method solves Ax = b by repeatedly performing matrix-
vector multiplications involving A. These are the most successful methods available at present to
solve problems of linear algebra.
Consider two vectors a and b of dimension 3. If they are not parallel, they will form a plane and
we can think of all the vectors in 3D that lie on that plane. All those vectors can be found as linear
combinations of a and b, and form what is called a subspace of the full 3D space (a plane). Since
a and b generate all the vectors in the subspace, they are a base in that subspace. For convenience,
frequently they are modified so they become orthogonal, since orthogonal bases are easier to
handle, and we can denote the subspace as $S = \mathrm{span}\{\mathbf{a}, \mathbf{b}\}$.
Now, let’s consider vectors of large dimensions, as for example, the solution vector of a matrix
equation of order n. For any vector r, if it is easy to calculate the product Ar, then it is easy to
generate the sequence of vectors $\mathbf{A}^2\mathbf{r}$, $\mathbf{A}^3\mathbf{r}$, … and so on, and we can think of the sequence of
spaces of increasing order k:

$K_k(\mathbf{A}, \mathbf{r}) = \mathrm{span}\{\mathbf{r}, \mathbf{A}\mathbf{r}, \mathbf{A}^2\mathbf{r}, \mathbf{A}^3\mathbf{r}, \ldots, \mathbf{A}^{k-1}\mathbf{r}\}, \quad k = 1, 2, \ldots$   (5.26)

known as the Krylov subspaces.
Krylov subspace iterative methods normally start from an initial guess $\mathbf{x}_0 = \mathbf{0}$, with $\mathbf{r} = \mathbf{b}$, and
find the solution x of Ax = b by computing successively, by iterations, the vectors $\mathbf{x}_k$, $k = 1, 2, \ldots$
such that $\mathbf{x}_k$ lies in the subspace $K_k(\mathbf{A}, \mathbf{b})$:

$\mathbf{x}_k \in K_k(\mathbf{A}, \mathbf{b}), \quad k = 1, 2, \ldots$   (5.27)
The conjugate gradients method is one of these methods and applies when the matrix A is
symmetric and positive definite. It requires only one matrix-vector product with A and 5 simple
vector operations in each iteration to calculate iterates $\mathbf{x}_k$ satisfying (5.25), which minimise
$\|\mathbf{x} - \mathbf{x}_k\|_{\mathbf{A}}$ over the Krylov subspace. (The A-norm is defined as $\|\mathbf{v}\|_{\mathbf{A}} = \sqrt{\mathbf{v}^T\mathbf{A}\mathbf{v}}$.)
If the matrix A is symmetric but indefinite, another Krylov method, the minimum
residual method or MINRES, needs one matrix-vector product with A and 7 simple vector operations in
each iteration to calculate iterates $\mathbf{x}_k$ satisfying (5.25), which minimise the Euclidean norm of the
residual.
If the matrix A is not symmetric, the problem is more difficult and there are several methods that
can be used. One of the most common is the Generalised Minimum Residual (GMRES). This
also computes iterates that minimise the Euclidean norm of the residual but requires an increasing
amount of computation and storage at each successive iteration to achieve this.
Then, GMRES is a good method if only a few iterations are needed, which can be the case when
a good preconditioner is used. Otherwise, the memory and computing time requirements can be
excessive. One way to ameliorate this problem is to restart GMRES at regular intervals. This
reduces the memory requirements but it can also slow convergence.
As will be seen in the next section, the sequence of vectors $\mathbf{A}^k\mathbf{r}$ will become more and more
parallel to the dominant eigenvector of A as the iterations progress. So, in order to construct
appropriate bases in these subspaces, orthogonalisation of the vectors is a very important step.
The GMRES algorithm can be described as:
choose an initial guess $\mathbf{x}_0$
form the residual vector: $\mathbf{r} = \mathbf{b} - \mathbf{A}\mathbf{x}_0$
normalise: $\mathbf{q}_1 = \mathbf{r}/\|\mathbf{r}\|_2$
for $k = 1, 2, \ldots, m$
    $\mathbf{y} = \mathbf{A}\mathbf{q}_k$   (form the consecutive products)
    for $j = 1, 2, \ldots, k$
        $h_{jk} = \mathbf{q}_j^T\mathbf{y}$   (find the component of y along the previous vector $\mathbf{q}_j$)
        $\mathbf{y} = \mathbf{y} - h_{jk}\mathbf{q}_j$   (removal of this component: Gram-Schmidt orthogonalisation)
    end
    $h_{k+1,k} = \|\mathbf{y}\|_2$   (calculate the norm of y)
    $\mathbf{q}_{k+1} = \mathbf{y}/h_{k+1,k}$   (normalise the new vector)
    Find $\mathbf{c}_k$ that minimises $\|\mathbf{H}\mathbf{c}_k - \mathbf{z}\|_2$, where $\mathbf{z} = (\|\mathbf{r}\|_2, 0, 0, \ldots, 0)^T$
    $\mathbf{x}_k = \mathbf{x}_0 + \mathbf{Q}_k\mathbf{c}_k$
end

The matrix H, of dimensions $(k+1)\times k$, is formed with the elements $h_{j,k}$, and the columns of the
matrix Q are the Krylov vectors $\mathbf{q}_k$.
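A compact MATLAB sketch of the loop above might read as follows. It is a plain, unrestarted version for illustration only; the function name gmresSketch and the fixed number of steps m are assumptions, and MATLAB's backslash operator is used to solve the small least-squares problem for c_k:

function x = gmresSketch(A, b, m)
% Minimal (unrestarted) GMRES sketch following the steps described above.
n = length(b);
x0 = zeros(n,1);                 % initial guess
r  = b - A*x0;                   % residual vector
Q  = zeros(n, m+1);
H  = zeros(m+1, m);
Q(:,1) = r/norm(r);              % normalised first Krylov vector
for k = 1:m
    y = A*Q(:,k);                % next Krylov direction
    for j = 1:k                  % Gram-Schmidt orthogonalisation
        H(j,k) = Q(:,j)'*y;
        y = y - H(j,k)*Q(:,j);
    end
    H(k+1,k) = norm(y);
    Q(:,k+1) = y/H(k+1,k);
    z = zeros(k+1,1);  z(1) = norm(r);
    c = H(1:k+1,1:k)\z;          % least squares: minimise ||Hc - z||_2
    x = x0 + Q(:,1:k)*c;         % current iterate
end
end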
Example of GMRES with preconditioning

Fig. 5.4 shows the results of
using the GMRES method with
and without preconditioning to
solve the problem Ax = b where
the matrix A is of order 500 and
not symmetric. It consists of 3 nonzero diagonals defined by the Matlab command:

A=sparse(diag(sqrt(1:n))
+diag(cos(1:(n-10)),10)
+diag(sin(1:(n-10)),-10));

The right hand side vector is
given by:
b = A*ones(500,1).

The problem has been solved without restarts and uses the script ExamplePGMRes.m.
5.5 Matrix Eigenvalue Problems
The second class of matrix problem that occurs frequently in the numerical solution of
differential equations, as well as in many other areas, is the matrix eigenvalue problem. On many
occasions the matrices will be large and very sparse; in others they will be dense, and normally
the solution methods will have to take these characteristics into account in order to achieve a
solution in an efficient way. There are a number of different methods to solve matrix eigenvalue
problems involving dense or sparse matrices, each with different characteristics and better
adapted to different types of matrices and requirements.
5.5.1 Choice of method
The choice of method will depend on the characteristics of the problem (type of the matrices)
and on the solution requirements. For example, most methods suitable for dense matrices calculate
all eigenvectors and eigenvalues of the system. However, in many problems arising from the
numerical solution of PDEs one is only interested in one or just a few eigenvalues and/or
eigenvectors. Also, in many cases the matrices will be large and sparse.
In what follows we will concentrate on methods which are suitable for sparse matrices (of
course the same methods can also be applied to dense matrices).
The problem to solve can have two different forms:
$\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$   (standard eigenvalue problem)   (5.28)
$\mathbf{A}\mathbf{x} = \lambda\mathbf{B}\mathbf{x}$   (generalized eigenvalue problem)   (5.29)
Sometimes the generalized eigenvalue problem can be converted into the form (5.28) simply
by premultiplying by the inverse of B:
$\mathbf{B}^{-1}\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$

However, if A and B are symmetric, the new matrix on the left hand side ($\mathbf{B}^{-1}\mathbf{A}$) will have lost
this property. Instead, it is preferable to decompose (factorise) the matrix B in the form $\mathbf{B} = \mathbf{L}\mathbf{L}^T$
(Choleski factorisation, possible if B is positive definite). Substituting in (5.29) and
premultiplying by $\mathbf{L}^{-1}$ gives:

$\mathbf{L}^{-1}\mathbf{A}\mathbf{x} = \lambda\mathbf{L}^T\mathbf{x}$   and since $(\mathbf{L}^T)^{-1}\mathbf{L}^T = \mathbf{I}$, the identity matrix,

$\mathbf{L}^{-1}\mathbf{A}(\mathbf{L}^T)^{-1}(\mathbf{L}^T\mathbf{x}) = \lambda\mathbf{L}^T\mathbf{x}$;  putting $\mathbf{L}^T\mathbf{x} = \mathbf{y}$ and $\mathbf{L}^{-1}\mathbf{A}(\mathbf{L}^T)^{-1} = \tilde{\mathbf{A}}$ gives:

$\tilde{\mathbf{A}}\mathbf{y} = \lambda\mathbf{y}$

The matrix $\tilde{\mathbf{A}}$ is symmetric if A and B are symmetric, and the eigenvalues are not modified.
The eigenvectors x can be obtained from y simply by back-substitution. However, if the matrices
A and B are sparse, this method is not convenient because $\tilde{\mathbf{A}}$ will generally be dense. In the case
of sparse matrices, it is more convenient to solve the generalized problem directly.
Solution Methods
We can again classify solution methods as direct and iterative. Direct or transformation
methods find all eigenvectors and eigenvalues in a fixed number of steps and they are not
suitable for large and sparse matrices. These methods include
i) Jacobi rotations Converts the matrix of (5.28) to diagonal form
ii) QR (or QZ for complex) Converts the matrices to triangular form
iii) Conversion to Tri-diagonal Normally to be followed by either i) or ii)
We will not examine any of these in detail.
5.6 Vector Iteration Methods:
These are better suited for sparse matrices and also for the case when only a few eigenvalues and
eigenvectors are needed.
5.6.1 The Power Method: or Direct iteration
For the standard problem (5.28) the algorithm consists of the repeated multiplication of a
starting vector by the matrix A. This can be seen to produce a reinforcement of the component of
the trial vector along the direction of the eigenvector whose eigenvalue has the largest absolute
value, causing the trial vector to converge gradually to that eigenvector. The algorithm can be
described schematically by:

Fig. 5.5
The normalization step is necessary because otherwise the iteration vector can grow
indefinitely in length over the iterations.
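A minimal MATLAB sketch of this scheme is given below; the random starting vector, tolerance and iteration limit are illustrative assumptions, and the eigenvalue estimate uses the Rayleigh quotient discussed later in (5.33):

function [lambda, x] = powerMethod(A, tol, maxit)
% Power method sketch: dominant eigenvalue and eigenvector of A.
x = rand(size(A,1),1);           % arbitrary starting vector
lambda = 0;
for n = 1:maxit
    x = A*x;                     % multiply by A
    x = x/norm(x);               % normalise to keep the vector bounded
    lambdaNew = x'*A*x;          % Rayleigh quotient estimate (x'*x = 1)
    if abs(lambdaNew - lambda) < tol*abs(lambdaNew), break; end
    lambda = lambdaNew;
end
end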
How does this algorithm work?
If φi, i = 1, .. N are the eigenvectors of A, we can write any vector of N components as a
superposition of them (they constitute a base in the space of N dimensions). In particular, for the
starting vector:

$\mathbf{x}_0 = \sum_{i=1}^{N}\alpha_i\boldsymbol{\varphi}_i$   (5.30)
When we multiply by A we get $\mathbf{x}_1 = \mathbf{A}\mathbf{x}_0$, or:

$\tilde{\mathbf{x}}_1 = \mathbf{A}\sum_{i=1}^{N}\alpha_i\boldsymbol{\varphi}_i = \sum_{i=1}^{N}\alpha_i\mathbf{A}\boldsymbol{\varphi}_i = \sum_{i=1}^{N}\alpha_i\lambda_i\boldsymbol{\varphi}_i$   (5.31)
If λ1 is the eigenvalue of largest absolute value, we can also write this as:

$\tilde{\mathbf{x}}_1 = \lambda_1\sum_{i=1}^{N}\alpha_i\frac{\lambda_i}{\lambda_1}\boldsymbol{\varphi}_i$

This is then normalized by:

$\mathbf{x}_1 = \frac{\tilde{\mathbf{x}}_1}{\|\tilde{\mathbf{x}}_1\|}$
Then, after n iterations of multiplications by A and normalization we will get:

$\mathbf{x}_n = C\sum_{i=1}^{N}\alpha_i\left(\frac{\lambda_i}{\lambda_1}\right)^n\boldsymbol{\varphi}_i$   (5.32)

where C is a normalization constant.
From this expression (5.32) we can see that, since $|\lambda_1| \geq |\lambda_i|$ for all i, the coefficient of each
φi for i ≠ 1 will tend to zero as n increases, and then the vector $\mathbf{x}_n$ will gradually converge to φ1.
This method (the power method) finds the dominant eigenvector; that is, the eigenvector that
corresponds to the largest eigenvalue (in absolute value). To find the eigenvalue we can see that
pre-multiplying (5.28) by the transpose of φi will give:
$\boldsymbol{\varphi}_i^T\mathbf{A}\boldsymbol{\varphi}_i = \lambda_i\boldsymbol{\varphi}_i^T\boldsymbol{\varphi}_i$   where $\boldsymbol{\varphi}_i^T\mathbf{A}\boldsymbol{\varphi}_i$ and $\boldsymbol{\varphi}_i^T\boldsymbol{\varphi}_i$ are scalars, so we can write:

$\lambda_i = \frac{\boldsymbol{\varphi}_i^T\mathbf{A}\boldsymbol{\varphi}_i}{\boldsymbol{\varphi}_i^T\boldsymbol{\varphi}_i}$   (5.33)
This is known as the Rayleigh quotient and can be used to obtain the eigenvalue from the
eigenvector. This expression has interesting properties; if we only know an estimate of the
eigenvector, (5.33) will give an estimate of the eigenvalue with a higher order of accuracy than
that of the eigenvector itself.
Exercise 5.2
Write a short program to calculate the dominant eigenvector and the corresponding
eigenvalue of a matrix like that in (7.30) but of order 7, using the power method. Terminate the
iterations when the relative difference between two successive estimates of the eigenvalue is
within a tolerance of $10^{-6}$.
Exercise 5.3 (advanced)
Using the same reasoning used to explain the power method for the standard eigenvalue problem
(5.30)-(5.32), develop the corresponding algorithm for the generalized eigenvalue problem
Ax =λBx.

The power method in the form presented above can only be used to find the dominant
eigenvector. However, modified forms of the basic idea can be developed to find any eigenvector
of the matrix. One of these modifications is the inverse iteration.
5.6.2 Inverse Iteration
For the system Ax = λx we can write $\mathbf{A}^{-1}\mathbf{x} = \frac{1}{\lambda}\mathbf{x}$, from where we can see that the
eigenvalues of A⁻¹ are 1/λ, the reciprocals of the eigenvalues of A, and the eigenvectors are the
same. In this form, if we are interested in finding the eigenvector of A corresponding to the
smallest eigenvalue in absolute value (closest to zero), we can notice that for that eigenvalue λ its
reciprocal is the largest, and so it can be found using the power method on the matrix A⁻¹.

The procedure then is as follows:

1. Choose a starting vector x0.
2. Find $\mathbf{x}_1 = \mathbf{A}^{-1}\mathbf{x}_0$ or, instead, solve $\mathbf{A}\mathbf{x}_1 = \mathbf{x}_0$ (avoiding the explicit calculation of A⁻¹)
3. Normalize: $\mathbf{x}_1 \leftarrow \mathbf{x}_1/C$ (C is a normalization factor chosen so that $\|\mathbf{x}_1\| = 1$)
4. Re-start

Exercise 5.4
Write a program and calculate by inverse iteration the smallest eigenvalue of the tridiagonal
matrix A of order 7 where the elements are –4 in the main diagonal and 1 in the subdiagonals.
You can use the algorithms given in section 2 of the Appendix for the solution of the linear system
of equations.

An important extension of the inverse iteration method allows finding any eigenvalue of the
spectrum (spectrum of a matrix: its set of eigenvalues), not just the eigenvalue of smallest absolute
value.
5.6.3 Shifted Inverse Iteration
Suppose that the system Ax = λx has the set of eigenvalues {λi}. If we construct the matrix
$\tilde{\mathbf{A}}$ as $\tilde{\mathbf{A}} = \mathbf{A} - \sigma\mathbf{I}$, where I is the identity matrix and σ is a real number, we have:

$\tilde{\mathbf{A}}\mathbf{x} = (\mathbf{A} - \sigma\mathbf{I})\mathbf{x} = \mathbf{A}\mathbf{x} - \sigma\mathbf{x} = \lambda\mathbf{x} - \sigma\mathbf{x} = (\lambda - \sigma)\mathbf{x}$   (5.34)

and we can see that the matrix $\tilde{\mathbf{A}}$ has the same eigenvectors as A and its eigenvalues are {λ−σ},
that is, the same eigenvalues as A but shifted by σ. Then, if we apply the inverse iteration method
to the matrix $\tilde{\mathbf{A}}$, the procedure will yield the eigenvalue (λi−σ) closest to zero; that is, we can find
the eigenvalue λi of A closest to the real number σ.
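A minimal MATLAB sketch of shifted inverse iteration, under the same illustrative assumptions as before (random start, simple relative-change stopping test), could be:

function [lambda, x] = shiftedInverseIteration(A, sigma, tol, maxit)
% Shifted inverse iteration sketch: eigenvalue of A closest to sigma.
n = size(A,1);
x = rand(n,1);  x = x/norm(x);
Ashift = A - sigma*eye(n);       % shifted matrix A - sigma*I
lambda = sigma;
for k = 1:maxit
    x = Ashift\x;                % solve (A - sigma*I)*x_new = x_old
    x = x/norm(x);               % normalise
    lambdaNew = x'*A*x;          % Rayleigh quotient of the original A
    if abs(lambdaNew - lambda) < tol*abs(lambdaNew), break; end
    lambda = lambdaNew;
end
end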
5.6.4 Rayleigh Iteration
Another extension of this method is known as Rayleigh iteration. In this case, the Rayleigh
quotient is used to calculate an estimate of the eigenvalue at each iteration and the shift is updated
using this value.
Since the convergence rate of the power method depends on the relative values of the
coefficients of each eigenvector in the expansion of the trial vector during the iterations (as in
(5.30)), and these are governed by the ratios between the eigenvalues λi and λ1, the convergence
will be fastest when the dominant eigenvalue is largest relative to the others, as we can see from
(5.32). The same reasoning applied to the shifted inverse iteration method leads to the conclusion
that the convergence rate will be fastest when the shift is chosen as close as possible to the target
eigenvalue. In this form, Rayleigh iteration has faster convergence than the ordinary shifted
inverse iteration.

Exercise 5.5
Write a program using shifted inverse iteration to find the eigenvalue of the matrix A of the
previous exercise which lies closest to 3.5.
Then, write a modified version of this program using Rayleigh quotient update of the shift
in every iteration (Rayleigh iteration). Compare the convergence of both procedures (by the
number of iterations needed to achieve the same tolerance for the relative difference between
successive estimates of the eigenvalue). Use a tolerance of $10^{-6}$ in both programs.
6. Numerical Differentiation and Integration
6.1 Numerical Differentiation
For a function of one variable f(x), the derivative at a point x = a is defined as:

$f'(a) = \lim_{h\to 0}\frac{f(a+h) - f(a)}{h}$   (6.1)
This suggests that, choosing a small value for h, the derivative can be reasonably
approximated by the forward difference:

$f'(a) \approx \frac{f(a+h) - f(a)}{h}$   (6.2)

An equally valid approximation is the backward difference:

$f'(a) \approx \frac{f(a) - f(a-h)}{h}$   (6.3)
Graphically, we can see the meaning of each of
these expressions in the figure.
The derivative at $x_C$ is the slope of the tangent
to the curve at the point C. The slopes of the
chords between the points A and C, and B and
C, are the values of the backward and forward
difference approximations to the derivative,
respectively.
We can see that a better approximation to the
derivative is obtained by the slope of the chord
between points A and B, labelled “central
difference” in the figure.



Fig. 6.1

We can understand this better by analysing the error in each approximation through the use of Taylor
expansions.
Considering the expansions for the points a+h and a−h:

$f(a+h) = f(a) + f'(a)h + \frac{f^{(2)}(a)}{2!}h^2 + \frac{f^{(3)}(a)}{3!}h^3 + \cdots$   (6.4)

$f(a-h) = f(a) - f'(a)h + \frac{f^{(2)}(a)}{2!}h^2 - \frac{f^{(3)}(a)}{3!}h^3 + \cdots$   (6.5)

where in both cases the remainder (error) can be represented by a term of the form $\frac{f^{(4)}(\xi)}{4!}h^4$ (see
Appendix), so we could write instead

$f(a+h) = f(a) + f'(a)h + \frac{f^{(2)}(a)}{2!}h^2 + \frac{f^{(3)}(a)}{3!}h^3 + O(h^4)$

where the symbol O(h⁴) means: “a term of the order of h⁴”.
Truncating (6.4) to first order we can then see that $f(a+h) = f(a) + f'(a)h + O(h^2)$, so
for the forward difference formula we have:

$f'(a) = \frac{f(a+h) - f(a)}{h} + O(h)$   (6.6)

and we can see that the error of this approximation is of the order of h. A similar result is obtained
for the backward difference.
We can also see that by subtracting (6.5) from (6.4) and discarding terms of order h³ and
higher we can obtain a better approximation:

$f(a+h) - f(a-h) = 2f'(a)h + O(h^3)$

from where we obtain the central difference formula:

$f'(a) = \frac{f(a+h) - f(a-h)}{2h} + O(h^2)$   (6.7)

which has an error of the order of h².
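As a quick illustration of these error orders, the following MATLAB fragment (with f(x) = eˣ and a = 1 chosen arbitrarily) compares the forward and central difference errors as h is reduced; the error should fall roughly as h and h² respectively:

% Compare forward and central difference errors for f(x) = exp(x) at a = 1.
f = @(x) exp(x);  a = 1;  exact = exp(1);
for h = [0.1 0.01 0.001]
    eFwd  = abs((f(a+h) - f(a))/h       - exact);   % O(h) error
    eCent = abs((f(a+h) - f(a-h))/(2*h) - exact);   % O(h^2) error
    fprintf('h = %6.4f   forward %10.3e   central %10.3e\n', h, eFwd, eCent);
end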
Expressions (6.2) and (6.3) are “two-point” formulae for the first derivative. Many more
different expressions with different degrees of accuracy can be constructed using a similar
procedure and using more points. For example, the Taylor expansions for the function at a number
of points can be used to eliminate all the terms except the desired derivative.
Example
Considering the 3 points x0, x1 = x0 + h and x2 = x0 + 2h, and taking the Taylor expansions at x1 and
x2, we can construct a three-point forward difference formula:

$f(x_1) = f(x_0) + f'(x_0)h + \frac{f^{(2)}(x_0)}{2!}h^2 + O(h^3)$   (6.8)

$f(x_2) = f(x_0) + f'(x_0)\,2h + \frac{f^{(2)}(x_0)}{2!}\,4h^2 + O(h^3)$   (6.9)

Multiplying (6.8) by 4 and subtracting (6.9) to eliminate the second derivative term:

$4f(x_1) - f(x_2) = 3f(x_0) + 2f'(x_0)h + O(h^3)$

from where we can extract the first derivative:

$f'(x_0) = \frac{-3f(x_0) + 4f(x_1) - f(x_2)}{2h} + O(h^2)$   (6.10)
Exercise 6.1
Considering the 3 points x0, x1 = x0 − h and x2 = x0 − 2h and the Taylor expansions at x1 and x2 find
a three-point backward difference formula. What is the order of the error?
Exercise 6.2
Using the Taylor expansions for f(a+h) and f(a−h), show that a suitable formula for the
second derivative is:

$f^{(2)}(a) \approx \frac{f(a-h) - 2f(a) + f(a+h)}{h^2}$   (6.11)

Show also that the error is $O(h^2)$.
Exercise 6.3
Use the Taylor expansions for f(a+h), f(a−h), f(a+2h) and f(a−2h) to show that the following
are formulae for f′(a) and f⁽²⁾(a), and that both have an error of the order of h⁴:

$f'(a) \approx \frac{f(a-2h) - 8f(a-h) + 8f(a+h) - f(a+2h)}{12h}$

$f^{(2)}(a) \approx \frac{-f(a-2h) + 16f(a-h) - 30f(a) + 16f(a+h) - f(a+2h)}{12h^2}$

Expressions for the derivatives can also be found using other methods. For example, if the function
is interpolated with a polynomial using, say, n points, the derivative (first, second, etc) can be
estimated by calculating the derivative of the interpolating polynomial at the point of interest.
Example:
Considering the 3 points x1, x2 and x3 with x1 < x2 < x3 (this time, not necessarily equally spaced)
and respective function values y1, y2 and y3, we can use the Lagrange interpolation polynomial to
approximate y(x):

$f(x) \approx L(x) = L_1(x)y_1 + L_2(x)y_2 + L_3(x)y_3$   (6.12)

where:

$L_1(x) = \frac{(x-x_2)(x-x_3)}{(x_1-x_2)(x_1-x_3)}$,  $L_2(x) = \frac{(x-x_1)(x-x_3)}{(x_2-x_1)(x_2-x_3)}$  and  $L_3(x) = \frac{(x-x_1)(x-x_2)}{(x_3-x_1)(x_3-x_2)}$   (6.13)

so the first derivative can be approximated by:

$f'(x) \approx L'(x) = L_1'(x)y_1 + L_2'(x)y_2 + L_3'(x)y_3$

which is:

$f'(x) \approx \frac{2x - x_2 - x_3}{(x_1-x_2)(x_1-x_3)}\,y_1 + \frac{2x - x_1 - x_3}{(x_2-x_1)(x_2-x_3)}\,y_2 + \frac{2x - x_1 - x_2}{(x_3-x_1)(x_3-x_2)}\,y_3$.

This general expression will give the value of the derivative at any of the points x1, x2 or x3.
For example, at x1:

$f'(x_1) \approx \frac{2x_1 - x_2 - x_3}{(x_1-x_2)(x_1-x_3)}\,y_1 + \frac{x_1 - x_3}{(x_2-x_1)(x_2-x_3)}\,y_2 + \frac{x_1 - x_2}{(x_3-x_1)(x_3-x_2)}\,y_3$   (6.14)
Exercise 6.4
Show that if the points are equally spaced by the distance h in (6.12) and the expression is evaluated
at x1, x2 and x3, the expression reduces respectively to the 3-point forward difference formula
(6.10), the central difference and the 3-point backward difference formulae.

Central Difference Expressions for derivatives
The following expressions are central difference approximations for different derivatives:
i)  $f'(x_i) = \frac{f(x_{i+1}) - f(x_{i-1})}{2h}$

ii)  $f''(x_i) = \frac{f(x_{i+1}) - 2f(x_i) + f(x_{i-1})}{h^2}$

iii)  $f'''(x_i) = \frac{f(x_{i+2}) - 2f(x_{i+1}) + 2f(x_{i-1}) - f(x_{i-2})}{2h^3}$

iv)  $f^{(4)}(x_i) = \frac{f(x_{i+2}) - 4f(x_{i+1}) + 6f(x_i) - 4f(x_{i-1}) + f(x_{i-2})}{h^4}$
Naturally, many more expressions can be developed using more points and/or different methods.
Exercise 6.5
Derive expressions iii) and iv) above. What is the order of the error for each of the 4 expressions
above?

6.1.1 Partial Derivatives
For a function of two variables f(x, y), the partial derivative $\dfrac{\partial f(x,y)}{\partial x}$ is defined as:

$\frac{\partial f(x,y)}{\partial x} = \lim_{h\to 0}\frac{f(x+h, y) - f(x, y)}{h}$

which, clearly, is a function of y.
Again, we can approximate this expression by a difference, assuming that h is sufficiently small.
Then, for example, a central difference expression for $\dfrac{\partial f(x,y)}{\partial x}$ is:

$\frac{\partial f(x,y)}{\partial x} \approx \frac{f(x+h, y) - f(x-h, y)}{2h}$   (6.15)

Similarly,

$\frac{\partial f(x,y)}{\partial y} \approx \frac{f(x, y+h) - f(x, y-h)}{2h}$

In this form, the gradient of f is given by:

$\nabla f(x,y) = \frac{\partial f(x,y)}{\partial x}\hat{x} + \frac{\partial f(x,y)}{\partial y}\hat{y} \approx \frac{[f(x+h,y) - f(x-h,y)]\,\hat{x} + [f(x,y+h) - f(x,y-h)]\,\hat{y}}{2h}$
Exercise 6.6
Using central difference formulae for the second derivatives, derive an expression for the Laplacian
($\nabla^2 f$) of a scalar function f.

6.2 Numerical Integration
In general, numerical integration methods approximate the definite integral of a function f by a
weighted sum of function values at several points in the interval of integration. These methods
are called “quadrature” and they differ in how the points and the weights are chosen.
6.2.1 Trapezoid Rule
The simplest method to approximate the integral of a function is the trapezoid rule. In this case,
the interval of integration is divided into a number of subintervals on which the function is simply
approximated by a straight line as shown in Fig. 6.2.
The integral (area under the curve) is then approximated by the sum of the areas of the
trapezoids based on each subinterval.

Fig. 6.2

The area of the trapezoid with base in the interval [xi, xi+1] is:

$(x_{i+1} - x_i)\,\frac{f_i + f_{i+1}}{2}$

and the total area is then the sum of all the terms of this form. If we denote by $h_i = (x_{i+1} - x_i)$
the width of the subinterval [xi, xi+1], the integral can be approximated by:

$\int_{x_1}^{x_n} f(x)\,dx \approx \sum_{i=1}^{n-1}\tfrac{1}{2}h_i\,(f_i + f_{i+1})$   (6.16)
which can be written as:

$\int_{x_1}^{x_n} f(x)\,dx \approx \tfrac{1}{2}\left[f_1h_1 + f_2(h_1+h_2) + f_3(h_2+h_3) + \cdots + f_{n-1}(h_{n-2}+h_{n-1}) + f_nh_{n-1}\right] = \sum_{i=1}^{n}\omega_i f_i$

with:

$\omega_i = \begin{cases} h_1/2 & i = 1 \\ (h_{i-1} + h_i)/2 & i = 2, \ldots, n-1 \\ h_{n-1}/2 & i = n \end{cases}$
If all the subintervals are of the same width h (the points are equally spaced), this reduces to:

$\int_{x_1}^{x_n} f(x)\,dx \approx h\sum_{i=1}^{n-1}\frac{f_i + f_{i+1}}{2} = \sum_{i=1}^{n}\omega_i f_i; \qquad \omega_i = \begin{cases} h/2 & i = 1 \text{ or } n \\ h & i = 2, \ldots, n-1 \end{cases}$   (6.17)
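A minimal MATLAB sketch of the composite trapezoid rule (6.17) for equally spaced points, applied here to the integrand of Exercise 6.7(a) purely as an illustration, is:

% Composite trapezoid rule sketch for equally spaced points (eq. 6.17).
f = @(x) exp(-x.^2);             % integrand used in Exercise 6.7(a)
a = 0;  b = 2;  n = 11;          % 10 subintervals, 11 points
x = linspace(a, b, n);
h = (b - a)/(n - 1);
I = h*(sum(f(x)) - (f(x(1)) + f(x(n)))/2);   % interior points weighted h, end points h/2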
Exercise 6.7
(a) Using the trapezoid rule and 10 and 100 subintervals (11 and 101 points, respectively),
calculate the integral of exp(−x²) between 0 and 2 and compare the results.
(b) Calculate the integral of 1/x between 1 and 2 using 11 and 101 points.


It can be shown that the error incurred with the application of the trapezoid rule in one interval is
given by the term:

$E = -\frac{(b-a)^3}{12}f^{(2)}(\xi)$   (6.18)

where ξ is a point inside the interval and the error is defined as the difference between the exact
integral (I) and the area of the trapezoid (A): E = I − A.
If this is applied to the trapezoid rule using a number of subintervals, the error term changes to:

$E = -\frac{(b-a)h^2}{12}f^{(2)}(\xi_h)$   (6.19)

where now $\xi_h$ is a point in the complete interval [a, b] and depends on h. Considering the error
as the sum of the individual errors in each subinterval, we can write this term in the form:

$E = -\sum_{i=1}^{n}\frac{h^3}{12}f^{(2)}(\xi_i) = -\frac{h^2}{12}\sum_{i=1}^{n} h\,f^{(2)}(\xi_i)$   (6.20)

where the ξi are points in each subinterval. Expression (6.20), in the limit when n → ∞ and h → 0,
corresponds to the integral of f⁽²⁾ over the interval [a, b]. Then, we can write (6.20) as:

$E \approx -\frac{h^2}{12}\left(f'(b) - f'(a)\right)$   (6.21)
Two options are now open: we can use this term to estimate the error incurred (or, equivalently, to
determine the number of equally spaced points required for a given precision), or we can include this
term in the calculation to form a corrected form of the trapezoid rule:

$\int_a^b f(x)\,dx \approx h\left[\sum_{i=2}^{n-1} f_i + \frac{f_1 + f_n}{2}\right] - \frac{h^2}{12}\left(f'(b) - f'(a)\right)$   (6.22)
Example
For the function $f(x) = x\sin(\pi x)$ the exact value of the integral is:

$\int_0^1 f(x)\,dx = \frac{1}{\pi} = 0.31830988618379$

Calculating the integral with five subintervals, the results without and with the correction term are:

                        Integral            Error %
without correction      0.30855052604273    3.0660 %
with correction         0.31378651379871    1.4211 %

6.2.2 Simpson’s Rule
In the case of the trapezoid rule, the function is approximated by a straight line and this can be
done repeatedly by subdividing the interval. A higher degree of accuracy using the same number
of subintervals can be obtained with a better approximation than the straight line. For example,
choosing a quadratic approximation could give a better result. This is Simpson's rule.
Consider the function f(x) and the interval [a, b]. Define the points x0, x1 and x2 as:

$x_0 = a, \quad x_1 = \frac{a+b}{2}, \quad x_2 = b$,  and define  $h = \frac{b-a}{2}$

Using Lagrange interpolation to generate a second order polynomial approximation to f(x) gives,
as in (6.12):

$f(x) \approx f(x_0)L_0(x) + f(x_1)L_1(x) + f(x_2)L_2(x)$
where:

$L_0(x) = \frac{(x-x_1)(x-x_2)}{(x_0-x_1)(x_0-x_2)} = \frac{1}{2h^2}(x-x_1)(x-x_2)$

$L_1(x) = \frac{(x-x_0)(x-x_2)}{(x_1-x_0)(x_1-x_2)} = -\frac{1}{h^2}(x-x_0)(x-x_2)$

and  $L_2(x) = \frac{(x-x_0)(x-x_1)}{(x_2-x_0)(x_2-x_1)} = \frac{1}{2h^2}(x-x_0)(x-x_1)$
Then,

$\int_a^b f(x)\,dx \approx \int_a^{a+2h}\left[f(x_0)L_0(x) + f(x_1)L_1(x) + f(x_2)L_2(x)\right]dx$   (6.23)

or

$\int_a^b f(x)\,dx \approx f(x_0)\int_a^{a+2h}L_0(x)\,dx + f(x_1)\int_a^{a+2h}L_1(x)\,dx + f(x_2)\int_a^{a+2h}L_2(x)\,dx$
The individual integrals are:

$\int_a^{a+2h}L_0(x)\,dx = \frac{1}{2h^2}\int_a^{a+2h}(x-x_1)(x-x_2)\,dx$

and, with the change of variable $x \to t = x - x_1$,

$\int_a^{a+2h}L_0(x)\,dx = \frac{1}{2h^2}\int_{-h}^{h}t(t-h)\,dt = \frac{1}{2h^2}\left[\frac{t^3}{3} - \frac{t^2}{2}h\right]_{-h}^{h} = \frac{h}{3}$

Similarly,

$\int_a^{a+2h}L_1(x)\,dx = \frac{4h}{3}$   and   $\int_a^{a+2h}L_2(x)\,dx = \frac{h}{3}$
Then, substituting in (6.23):

$\int_a^b f(x)\,dx \approx \frac{h}{3}\left(f(a) + 4f(a+h) + f(a+2h)\right)$   (6.24)
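As a sketch, (6.24) can be evaluated directly in MATLAB; the integrand and limits below are those of the example that follows, so the result should be about 0.9714:

% Simpson's rule (eq. 6.24) on a single interval [a, b].
f = @(x) sin(pi*x) + 0.5;        % integrand of the example below
a = 0.25;  b = 1.25;  h = (b - a)/2;
I = (h/3)*(f(a) + 4*f(a + h) + f(a + 2*h));   % gives approximately 0.9714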
Example
Use Simpson's rule to calculate the integral:

$\int_{0.25}^{1.25}\left(\sin \pi x + 0.5\right)dx$

The 2nd order Lagrange interpolation polynomial used in this case is:

$p(x) = -2.8284x^2 + 3.0355x + 0.5214$

We have x0 = a = 0.25, x2 = x0 + 2h = b = 1.25, so x1 = x0 + h = 0.75 and h = 0.5.

Fig. 6.3

The figure shows the function (in blue) and the 2nd order Lagrange interpolation (red) used to
calculate the integral with Simpson's rule.
The exact value of this integral is:

$\int_{0.25}^{1.25}\left(\sin \pi x + 0.5\right)dx = 0.950158158$

Applying Simpson's rule to this function gives:

$\int_{0.25}^{1.25}\left(\sin \pi x + 0.5\right)dx \approx \frac{h}{3}\left(f(a) + 4f(a+h) + f(a+2h)\right) = 0.97140452$

As with the trapezoid rule, higher accuracy can be obtained by subdividing the interval of integration
and adding the results of the integrals over each subinterval.
Exercise 6.8
Write down the expression for the composite Simpson's rule using n equal subintervals and use it
to calculate the integral $\int_{0.25}^{1.25}(\sin \pi x + 0.5)\,dx$ of the example above using 10 subintervals.

6.2.3 The Midpoint Rule
Another simple method to approximate a definite integral is the midpoint rule, where the function
is simply approximated by a constant value over the interval: the value of the function at the
midpoint. In this way, the integral of f(x) in the interval [a, b] is approximated by the area of
the rectangle of base (b − a) and height f(c), where c = (a + b)/2.
Then, for the integral:

$I = \int_a^b f(x)\,dx \approx (b-a)f(c) + \text{Error}$   (6.25)

To estimate the error, let's consider the Taylor expansion for the function f, truncated to first order
and centred at the midpoint of the interval [a, b]:

$f(x) \approx p_1(x) = f(c) + (x-c)f'(c) + O\!\left((x-c)^2\right)$   (6.26)

The error term in (6.26) is actually $R_1(x) = \frac{1}{2}(x-c)^2 f^{(2)}(\xi)$ (see App. 1), so (6.25) can be
approximated as:
$I = \int_a^b f(x)\,dx \approx \int_a^b p_1(x)\,dx = \int_a^b\left[f(c) + (x-c)f'(c) + R_1(x)\right]dx = f(c)(b-a) + \int_a^b R_1(x)\,dx$

because the integral of the second term is $f'(c)\int_a^b(x-c)\,dx = 0$, since $c = (a+b)/2$.
Comparing with (6.25) gives:

$\text{Error} = \int_a^b R_1(x)\,dx = \frac{1}{2}f^{(2)}(\xi)\int_a^b(x-c)^2\,dx = \frac{1}{6}f^{(2)}(\xi)\left[(b-c)^3 - (a-c)^3\right]$

but since b − c = c − a = (b − a)/2, we have:

$E = \frac{1}{24}(b-a)^3 f^{(2)}(\xi)$   (6.27)

We can then write:

$\int_a^b f(x)\,dx = (b-a)f(c) + \frac{1}{24}(b-a)^3 f^{(2)}(\xi)$   (6.28)
for some ξ in the interval [a, b].
Similarly to the case of the trapezoid rule and Simpson's rule, the midpoint rule can be
used in a composite manner, after subdividing the interval of integration into a number of
subintervals, to achieve higher precision. In that case, the expression for the integral becomes:

$\int_a^b f(x)\,dx = h\sum_{i=1}^{n} f(c_i) + \frac{h^2(b-a)}{24}f^{(2)}(\eta)$   (6.29)

where the ci are the midpoints of each of the n subintervals, h = (b − a)/n is the length of each
subinterval, and η is a point between a and b.
Exercise 6.9
Use the midpoint rule to calculate the integral $\int_{0.25}^{1.25}(\sin \pi x + 0.5)\,dx$ using 2 and 10
subintervals and compare the results with that of Simpson's rule in the example above.

6.2.4 Gaussian Quadrature
In the trapezoid, Simpson’s and midpoint rule, the definite integral of a function f(x) is
approximated by the exact integral of a polynomial that approximates the function. In all these
cases, the evaluation points are chosen arbitrarily, often equally spaced. However, it is rather clear
that the precision attained is dependent on the position of these points, which gives another route
to optimisation. After that, the weighting coefficients are determined by the choice of method.
Considering again the general approach to the approximation of the definite integral, the problem
can be written in the form:

$\int_{-1}^{1} f(x)\,dx \approx G_n(f) = \sum_{i=1}^{n} w_i^n f(x_i^n)$   (6.30)
(The interval of integration is here chosen as [−1, 1], but obviously, any other interval can be
mapped into this by a change of variable.)
The objective now is to find, for a given n, the best choice of evaluation points $x_i^n$ (called here
“Gauss points”) and weights $w_i^n$ (“Gauss weights”) to get maximum precision in the
approximation. Compared to the criterion for the trapezoid and Simpson's rules, this is equivalent
to trying to find, for a fixed n, the best choice of $x_i^n$ and $w_i^n$ such that the approximation (6.30) is
exact for a polynomial of degree N, with N (> n) as large as possible. (That is, going beyond the
degree of approximation of the previous methods.)
So what we are saying is: find $x_i^n$ and $w_i^n$ such that

$\int_{-1}^{1} P_N(x)\,dx = \sum_{i=1}^{n} w_i^n P_N(x_i^n)$   exactly, for some N > n.

Choosing the coefficients of P_N as $a_i = 0$, $i \neq k$, in sequence for $k = 0, \ldots, N$, we have:

$\int_{-1}^{1} x^k\,dx = \sum_{i=1}^{n} w_i^n (x_i^n)^k$   for k = 0, 1, 2, …, N   (6.31)
with N as large as possible (note the equal sign now). We can simplify the problem to these
functions because any polynomial is just a superposition of terms of the form $x^k$, so if the
integral is exact for each of them, it will be exact for any polynomial containing those terms.
Expression (6.31) is a system of equations that the (unknown) Gauss points and weights need to
satisfy. The problem is then to find the $x_i^n$ and $w_i^n$ (n of each). This is a nonlinear problem that
cannot be solved directly. We also have to determine the value of N. It can be shown that this
number is N = 2n − 1. This is also rather natural since (6.31) consists of N + 1 equations and we
need to find 2n parameters.
Finding the weights
For a set of weights and Gauss points, let's consider the Lagrange interpolation polynomials of
order n associated with each of the Gauss points:

$L_i^n(x) = \prod_{j=1,\, j\neq i}^{n}\frac{x - x_j^n}{x_i^n - x_j^n}$   (6.32)

Then, since the expression (6.31) should be exact for any polynomial up to order N = 2n − 1, and
$L_i^n(x)$ is of order n, we have:

$\int_{-1}^{1} L_i^n(x)\,dx = \sum_{j=1}^{n} w_j^n L_i^n(x_j^n)$   for each i = 1, …, n   (6.33)

but since the $L_i^n(x)$ are interpolation polynomials, $L_i^n(x_j^n) = \delta_i^j$ (that is, they are 1 if j = i and 0
otherwise), all the terms in the sum in (6.33) are zero except for j = i and we have:

$\int_{-1}^{1} L_i^n(x)\,dx = w_i^n$   (6.34)

With this, we have the weights for a given set of Gauss points. We now need to find the best
choice for these points.

If P(x) is an arbitrary polynomial of degree ≤ 2n − 1, we can write P(x) = Pn(x)Q(x) + R(x), where Q
and R are respectively the quotient and remainder polynomials of the division of P by
Pn. Pn(x) is of order n and then Q and R are of degree n − 1 or less. Then we have:
$\int_{-1}^{1} P(x)\,dx = \int_{-1}^{1} P_n(x)Q(x)\,dx + \int_{-1}^{1} R(x)\,dx$   (6.35)
If we now define the polynomial Pn(x) by its roots and choose these as the Gauss points:

$P_n(x) = \prod_{i=1}^{n}\left(x - x_i^n\right)$,   (6.36)
then the integral of the product Pn(x)Q(x), which is a polynomial of degree ≤ 2n − 1, must be
given exactly by the quadrature expression:

$\int_{-1}^{1} P_n(x)Q(x)\,dx = \sum_{i=1}^{n} w_i^n P_n(x_i^n)\,Q(x_i^n)$   (6.37)
but since the Gauss points are the roots of Pn(x), (6.37) must be zero; that is:

$\int_{-1}^{1} P_n(x)Q(x)\,dx = 0$   (6.38)

for all polynomials Q(x) of degree n − 1 or less. This implies that Pn(x) must be a member of a
family of orthogonal polynomials.¹ In particular, Legendre polynomials are a good choice
because they are orthogonal in the interval [−1, 1] with a weighting function w(x) = 1.

Going back now to the integral of the arbitrary polynomial P(x) of degree ≤ 2n − 1 and equation
(6.35), we have that, if we choose Pn(x) as in (6.36), (6.38) is satisfied and then (6.35) reduces to:

$\int_{-1}^{1} P(x)\,dx = \int_{-1}^{1} R(x)\,dx$   (6.39)
but since R(x) is of degree n − 1 or less, the interpolation using Lagrange polynomials for the n
Gauss points will give the exact representation of R (see Exercise 4.2). That is:

$R(x) = \sum_{i=1}^{n} R(x_i^n)\,L_i^n(x)$   exactly.   (6.40)
Then,

$\int_{-1}^{1} R(x)\,dx = \int_{-1}^{1}\sum_{i=1}^{n} R(x_i^n)L_i^n(x)\,dx = \sum_{i=1}^{n} R(x_i^n)\int_{-1}^{1} L_i^n(x)\,dx$   (6.41)
but we have seen before, in (6.34), that the integral of the Lagrange interpolation polynomial for
point i is the value of $w_i^n$, so:

$\int_{-1}^{1} R(x)\,dx = \sum_{i=1}^{n} w_i^n R(x_i^n)$   (6.42)
Now, since P(x) = Pn(x)Q(x) + R(x) and $P_n(x_i^n) = 0$ (see (6.36)), $P(x_i^n) = R(x_i^n)$ and from (6.39):

¹ Remember that for orthogonal polynomials in [−1, 1], all the roots lie in [−1, 1], and they satisfy
$\int_{-1}^{1} p_i(x)p_j(x)\,dx = \delta_i^j$. Additionally, $\int_{-1}^{1} p_i(x)q(x)\,dx = 0$ for any polynomial q of degree ≤ i − 1.
$\int_{-1}^{1} P(x)\,dx = \sum_{i=1}^{n} w_i^n P(x_i^n)$   (6.43)
which tells us that the integral of the arbitrary polynomial P(x) of order 2n − 1 can be calculated
exactly using the set of Gauss points $x_i^n$ (the roots of the nth order Legendre polynomial) and the
weights $w_i^n$ determined by (6.34) – the integral over the interval [−1, 1] of the Lagrange polynomial
of order n corresponding to the Gauss point $x_i^n$.
Back to the start then: we now have a set of n Gauss points and weights that yield the exact
evaluation of the integral of a polynomial up to degree 2n − 1. These should also give the best
result for the integral of an arbitrary function f:

$\int_{-1}^{1} f(x)\,dx \approx \sum_{i=1}^{n} w_i^n f(x_i^n)$   (6.44)
Gauss nodes and weights for different orders are given in the following table.

Gaussian Quadrature: Nodes and Weights

 n   Nodes x_i^n                                   Weights w_i^n
 1   0.0                                           2.0
 2   ±√3/3 = ±0.577350269189                       1.0
 3   0                                             8/9 = 0.888888888889
     ±√15/5 = ±0.774596669241                      5/9 = 0.555555555556
 4   ±√(525 − 70√30)/35 = ±0.339981043585          (18 + √30)/36 = 0.652145154863
     ±√(525 + 70√30)/35 = ±0.861136311594          (18 − √30)/36 = 0.347854845137
 5   0                                             128/225 = 0.568888888889
     ±(1/3)√(5 − 2√(10/7)) = ±0.538469310106       (322 + 13√70)/900 = 0.478628670499
     ±(1/3)√(5 + 2√(10/7)) = ±0.906179845939       (322 − 13√70)/900 = 0.236926885056
 6   ±0.238619186                                  0.4679139
     ±0.661209386                                  0.3607616
     ±0.932469514                                  0.1713245

Example
For the integral $\int_{-1}^{1} e^{-x}\sin 5x\,dx$ the results of the calculation using Gauss quadrature are listed
in the table:

 n   Integral            Error %
 2   −0.307533965529
 3   0.634074857001
 4   0.172538331616      28.71
 5   0.247736352452      −2.35
 6   0.241785750244      0.10

The error is calculated with respect to the exact value: 0.24203832101745.
The results with few Gauss points are not very good because the function varies strongly in
the interval of integration. However, for n = 6, the error is very small.
Exercise 6.10
Compare the results of the example above with those of the Simpson’s and Trapezoid Rules for
the same number of points.

Change of Variable
The above procedure was developed for definite integrals over the interval [−1, 1]. Integrals over
other intervals can be calculated after a change of variable. For example, if the integral to calculate
is:

$\int_a^b f(t)\,dt$

then to change from the interval [a, b] to [−1, 1] the following change of variable can be made:

$x = \frac{2t - (a+b)}{b-a}$   or   $t = \frac{b-a}{2}x + \frac{a+b}{2}$   so   $dt = \frac{b-a}{2}dx$

Then, the integral can be approximated by:

$\int_a^b f(t)\,dt \approx \frac{b-a}{2}\sum_{i=1}^{n} w_i^n f\!\left(\frac{b-a}{2}x_i^n + \frac{a+b}{2}\right)$   (6.45)
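A short MATLAB sketch of (6.45), using the 3-point nodes and weights from the table above and applied to the integrand of Exercise 6.11 purely as an illustration, is:

% 3-point Gauss-Legendre quadrature on [a, b] via the change of variable (6.45).
f = @(t) sin(pi*t) + 0.5;                        % integrand of Exercise 6.11
a = 0.25;  b = 1.25;
xg = [-sqrt(15)/5, 0, sqrt(15)/5];               % Gauss points for n = 3
wg = [5/9, 8/9, 5/9];                            % corresponding weights
t  = (b - a)/2*xg + (a + b)/2;                   % map nodes from [-1, 1] to [a, b]
I  = (b - a)/2*sum(wg.*f(t));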
Exercise 6.11
Use Gaussian quadrature to calculate  $\int_{0.25}^{1.25}(\sin \pi x + 0.5)\,dx$

The same ideas can be extended to other types of integrals, for example surface integrals over
quadrilateral or triangular domains. The latter are commonly used in the solution of problems
by the finite element method.
7. Solution of Partial Differential Equations
7.1 Solution of Ordinary Differential Equations (ODEs) – Initial Value
Problems – (IVP)
The solution of ordinary differential equations (ODEs) and in particular, initial value
problems (IVPs) can be implemented using methods derived from the techniques described before
for numerical differentiation and integration.
In general, any first order ODE for a function y(t) can be written in the form:

$y' = f(t, y)$,   (7.1)

that is, expressing the derivative as a function of the independent variable t and the function y.
For example, the equation $y' + aty + b = 0$ results in $y' = f(t, y) = -(aty + b)$, with an initial
condition $y(t_0) = y_0$.

The general idea for a solution method is to start with a discretisation of the domain of interest,
defining a step h in the variable t, so that $t_i = t_0 + ih$, $i = 1, 2, \ldots, N$. Since we know the value of
y at $t = t_0$, we look for a way to predict the value at $t = t_1$, and so forth.
Denoting $y(t_i)$ by $y_i$, the exact calculation is based on the expression:

$\int_{t_i}^{t_{i+1}} y'\,dt = y_{i+1} - y_i \;\Rightarrow\; y_{i+1} = y_i + \int_{t_i}^{t_{i+1}} y'\,dt = y_i + \int_{t_i}^{t_{i+1}} f(t, y)\,dt$   (7.2)
Then, the main concern is calculating the value of the integral in each subinterval.
The simplest way of doing this is the Euler method. Although this method is rarely used in
practice due to its inferior performance, it is useful for describing the basic ideas on which many
other practical methods are based.
7.1.1 Euler Method (Explicit or Forward Euler)
Using a Taylor expansion truncated to the first order term, or the expression for the forward
difference approximation to the derivative, we have:

$y(t_i + h) = y(t_i) + hy'(t_i) + O(h^2)$   (7.3)

from where the iterations can be established as:

$y_0 = y(t_0), \qquad y_{i+1} = y_i + hf(t_i, y_i), \quad i = 1, 2, \ldots, N$   (7.4)

In this case, the integral is simply approximated by the product $hy'(t_i)$. We can see this graphically
in Fig. 7.1. The method is very simple but it is not very accurate. From (7.3) we can see that the
truncation error in one step is $O(h^2)$ (local truncation error). However, this assumes that we
start each interval with the correct value. In practice, the error will accumulate (it is not just the
sum of the errors). Ignoring this effect, the total, global error over the domain of N subintervals
will be $NO(h^2)$, but $N = L/h$, so the global error is at best $O(h)$.

Fig. 7.1 Approximation of the integral
Example
The so-called logistic equation, which describes the rate of change of a population, is an example
of a first order initial value ODE:

$y' = cy(1 - y)$

with a starting population $y(t_0) = y_0$. It shows that the rate of change of a population is
proportional to the current size of the population and to the remaining capacity (1 − y).
Discretising the variable t in the form $t_i = t_0 + ih$, $i = 1, 2, \ldots, N-1$, the iterations start with
$y(t_0) = y_0$ and then:

$y_{i+1} = y_i + hcy_i(1 - y_i), \quad i = 1, 2, \ldots, N-1$.
Fig. 7.2 shows the solution of this equation for two different step sizes and two different starting
conditions, for c = 3.
It can be seen that if the starting population is 10% of capacity, it eventually grows to the maximum
capacity. If the starting condition is a value greater than the capacity of the environment, the
population decreases towards that capacity.
The figure also shows the difference in the calculations when different step sizes are used.

The red line shows a coarse discretisation involving 10 subintervals and the black line shows the
results for a calculation using 20 subintervals.
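A minimal MATLAB sketch of these iterations is shown below; the value c = 3 and the 10% starting population follow the example, while the time span and number of steps are arbitrary illustrative choices:

% Explicit Euler sketch for the logistic equation y' = c*y*(1 - y).
c = 3;  y0 = 0.1;  t0 = 0;  T = 5;  N = 20;      % illustrative values
h = (T - t0)/N;
t = t0:h:T;
y = zeros(1, N+1);  y(1) = y0;
for i = 1:N
    y(i+1) = y(i) + h*c*y(i)*(1 - y(i));         % Euler update
end
plot(t, y)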

7.1.2 Implicit and Explicit Methods
The form of the Euler method just described, and many variations of it, are called explicit because the
new approximation $y_{i+1}$ can be calculated as an explicit expression of the previous values.
Although the calculations are simpler, for some problems they are unstable unless extremely
small step sizes are used. Systems that present this problem are normally called “stiff” and
explicit methods do not work well on them. An exact, unambiguous characterisation of
stiff differential equations is not easy to establish, but they are normally associated with situations
where small variations in some parameters lead to large changes in the solution. However, this is
a property of the differential equation and not of the underlying problem, which may be formulated
in a non-stiff form, without these difficulties.
7.1.3 Implicit or Backward Euler Method
An implicit form of the Euler method can be derived starting from the backward difference
formula, or from the Taylor expansion for $y_{k-1}$:

$y_{k-1} = y(t_k - h) = y(t_k) - hy'(t_k) + O(h^2) \;\Rightarrow\; y_{k-1} = y_k - hf(t_k, y_k)$

Changing $k \to k+1$:   $y_{k+1} = y_k + hf(t_{k+1}, y_{k+1})$   (7.5)

The problem now is that the value of $y_{k+1}$ on the r.h.s. is not known and needs to be estimated.

Fig. 7.2 Numerical solution of the logistic equation
with two different step sizes
Eqn. (7.5) is normally a nonlinear equation in $y_{k+1}$ and needs to be solved at each step. For
example, for the equation $y' - \sin(ty) = 0$ the function f is $f(t, y) = \sin(ty)$ and the iterations
are:

$y_{k+1} = y_k + h\sin(t_{k+1}y_{k+1})$

Rewriting this as $y_{k+1} - h\sin\big((t_k + h)y_{k+1}\big) - y_k = 0$, or $g(x) = 0$ with $x = y_{k+1}$:
at each step k the values of $y_k$, $t_k$ and h are known and the equation can be solved by iterations
with a root finding procedure like the fixed point iteration method or Newton-Raphson.
Formulated with the fixed point iteration method this could be:

$x^{j+1} = y_k + h\sin\big((t_k + h)\,x^j\big)$

and a suitable starting point could be $x^0 = y_k$.
If solved with Newton-Raphson, we have:

$g(x) = x - h\sin\big((t_k + h)x\big) - y_k$   and   $g'(x) = 1 - h(t_k + h)\cos\big((t_k + h)x\big)$

and the iterations are:

$x^{j+1} = x^j - \frac{g(x^j)}{g'(x^j)}$
Example
For the equation $y' + ay = 0$, $y(0) = 1$, we have $f(t, y) = -ay$ and the iterations are:

$y_{k+1} = y_k - ahy_{k+1}$   or   $y_{k+1} = \frac{y_k}{1 + ah}$

In this case the resultant
equation is linear and can be
solved without iterations.
Fig. 7.3 shows a comparison
between the results for a = 10
from the Explicit Euler method
(red) and the Implicit Euler
(blue).
The black line shows the exact
solution.
Results shown with circles are
for a = 10 and h = 0.195. The
red line marked with “x”
corresponds to the Explicit
Euler method with a step size of
0.05.
For step sizes larger than 0.2
the oscillations observed with
the Explicit Euler method increase. For this method the iterations become:
$y_{k+1} = (1 - ah)y_k$

so to avoid growing oscillations the step size should satisfy the condition $|1 - ah| < 1 \;\rightarrow\; h < 2/a$.


Fig. 7.3 Convergence of Explicit and Implicit Euler methods

There are no restrictions for the Implicit Euler method.
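The two recurrences of this example can be compared directly with a short MATLAB sketch; the values a = 10 and h = 0.195 are those quoted above, while the number of steps is an arbitrary choice:

% Explicit vs implicit Euler for y' + a*y = 0, y(0) = 1 (exact solution: y = exp(-a*t)).
a = 10;  h = 0.195;  N = 20;
yE = 1;  yI = 1;
for k = 1:N
    yE(k+1) = (1 - a*h)*yE(k);       % explicit Euler: oscillates if |1 - a*h| > 1
    yI(k+1) = yI(k)/(1 + a*h);       % implicit Euler: no step size restriction
end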

7.1.4 Heun or Explicit Trapezoid Method
The Heun or explicit trapezoid method is one of the variations of the Euler method that have
improved accuracy. While in the Euler method the integral in (7.2) is simply approximated by the
product of the interval length (h) with the value of the derivative at the start of the interval, the
trapezoid method uses an approximation based on the trapezoid rule.
In the simple case of an equation of the form $y' = f(t)$, that is, if the ODE doesn't contain the
function y itself, the Euler method would be based on $y_{i+1} = y_i + hf(t_i)$; that is (as in any other
case), the slope used for the updated value is evaluated at the point $t_i$. If, instead of using $hf(t_i)$ to
approximate the integral in (7.2), the trapezoid rule expression is used, we would have:

$y_{i+1} = y_i + \frac{h}{2}\left[f(t_i) + f(t_{i+1})\right]$
That is, the value used is the average of the values at the ends of the subinterval, and they
are both known.
However, in the general case, where the function f also depends on y, the equivalent expression
would be:

$y_{i+1} = y_i + \frac{h}{2}\left[f(t_i, y_i) + f(t_{i+1}, y_{i+1})\right]$   (7.6)

but the value of y at the end of the interval, $y_{i+1}$, on the right hand side is not available yet, so
an approximation for it must be used.
We can use the value that the Euler method would give at that point: $y_{i+1} = y_i + hf(t_i, y_i)$. This
results in the expression:

$y_{i+1} = y_i + \frac{h}{2}\left[f(t_i, y_i) + f\big(t_{i+1},\, y_i + hf(t_i, y_i)\big)\right]$   (7.7)

This method is also explicit and it can be shown that the local truncation error is $O(h^3)$, which
gives a global error of $O(h^2)$, compared with first order for the Euler method.
An implicit version of this method can be formulated if, instead of approximating $y_{i+1}$ in (7.6) by
the expression given by the forward Euler method, (7.6) is solved for the value of $y_{i+1}$. This, as
for the Implicit Euler method, will normally mean finding the root of a nonlinear function of $y_{i+1}$.
Example
Solution of the initial value problem in the interval [0, 1]:

$y' = ty + t^3, \quad y(0) = 1$

The iterations for the Euler method are:

$y_{i+1} = y_i + hf(t_i, y_i), \qquad f(t_i, y_i) = t_iy_i + t_i^3$

and for the trapezoid method:

$y_{i+1} = y_i + \frac{h}{2}\left[f(t_i, y_i) + f(t_{i+1}, y_{i+1})\right]$

Fig. 7.4 Comparison between Trapezoid and Euler methods

Now, $y_{i+1}$ in the last term is replaced by the value given by the Euler method approximation:

$y_{i+1} = y_i + \frac{h}{2}\left[f(t_i, y_i) + f\big(t_{i+1},\, y_i + hf(t_i, y_i)\big)\right]$

where $f\big(t_{i+1},\, y_i + hf(t_i, y_i)\big) = t_{i+1}\big(y_i + h(t_iy_i + t_i^3)\big) + t_{i+1}^3$, with $t_{i+1} = t_i + h$.
It can be seen that even with just 10 subintervals, the results from the trapezoid method are
indistinguishable from the exact solution.
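A minimal MATLAB sketch of the Heun (explicit trapezoid) iterations for this example, with the 10 subintervals mentioned above, could be:

% Heun (explicit trapezoid) sketch for y' = t*y + t^3, y(0) = 1, on [0, 1].
f = @(t, y) t*y + t^3;
N = 10;  h = 1/N;  t = 0;  y = 1;
for i = 1:N
    s1 = f(t, y);                    % slope at the start of the subinterval
    s2 = f(t + h, y + h*s1);         % slope at the Euler-predicted end point
    y  = y + (h/2)*(s1 + s2);        % average of the two slopes (eq. 7.7)
    t  = t + h;
end
% the exact value at t = 1 is 3*exp(0.5) - 3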

7.1.5 Midpoint method
Another variation that leads to a method of order 2 is based on the use of the midpoint rule to
approximate the integral in (7.2):

$y_{i+1} = y_i + \int_{t_i}^{t_{i+1}} y'\,dt$

The midpoint rule in this case would be $\int_{t_i}^{t_{i+1}} f\,dt = h\,f(c, y)$, where c is the midpoint of the
interval, that is, $c = t_i + h/2$. In this expression the function f needs to be evaluated at
$t = c = t_i + h/2$ and with $y = y(t_i + h/2)$, which is not directly available. We can do the same as in
the trapezoid method and use the Euler method to approximate this value:

$y(t_i + h/2) \approx y_i + \frac{h}{2}f(t_i, y_i)$

so the full expression for the midpoint method will be:

$y_{i+1} = y_i + h\,f\!\left(t_i + \frac{h}{2},\; y_i + \frac{h}{2}f(t_i, y_i)\right)$   (7.8)

It can be shown that, as for the trapezoid method, the local truncation error is $O(h^3)$ and the global
error is $O(h^2)$.
7.1.6 Taylor methods
The Euler method is derived from the Taylor expansion truncated to order one (see (7.3)). In the
same form, we can derive different methods using truncation to a higher degree. In this way, a
Taylor method of order k uses a truncation to order k. The Euler method is a Taylor method of
order one.
Using a Taylor expansion truncated to order k:

$y(t_i + h) = y(t_i) + hy'(t_i) + \frac{h^2}{2}y''(t_i) + \cdots + \frac{h^k}{k!}y^{(k)}(t_i) + O(h^{k+1})$

we can formulate the Taylor method of order k in the following form:

$y_0 = y(t_0)$

$y_{i+1} = y_i + hf(t_i, y_i) + \frac{h^2}{2}f'(t_i, y_i) + \cdots + \frac{h^k}{k!}f^{(k-1)}(t_i, y_i)$

where the derivatives of f are the total derivatives:

$f'(t_i, y_i) = \frac{\partial f(t_i, y_i)}{\partial t} + \frac{\partial f(t_i, y_i)}{\partial y}\frac{dy}{dt} = \frac{\partial f(t_i, y_i)}{\partial t} + \frac{\partial f(t_i, y_i)}{\partial y}f(t_i, y_i)$   (7.9)
and so on.
The local truncation error of these methods is $O(h^{k+1})$ and then their global error is $O(h^k)$.
For example, the second order Taylor method takes the form:

$y_{i+1} = y_i + hf(t_i, y_i) + \frac{h^2}{2}\left[\frac{\partial f}{\partial t}\bigg|_{t_i, y_i} + f(t_i, y_i)\frac{\partial f}{\partial y}\bigg|_{t_i, y_i}\right]$   (7.10)
Example
For the initial value problem $y' = ty + t^3$, $y(t_0) = y_0$:
Since $f(t, y) = ty + t^3$, it follows from (7.9) that

$f'(t, y) = y + 3t^2 + t(ty + t^3)$

and then the second order Taylor method gives:

$y_{i+1} = y_i + h(t_iy_i + t_i^3) + \frac{h^2}{2}\left(y_i + t_i^2y_i + (3 + t_i^2)\,t_i^2\right)$

The problem with these methods is that they require the calculation of the derivatives of f(t,y),
which sometimes can be complicated. Carl Runge and Wilhelm Kutta devised a procedure to
avoid these calculations.
7.1.7 Runge-Kutta methods
This is the name given to a whole family of methods derived from the Taylor expansion and
include the Euler method (Runge-Kutta of order 1), the Trapezoid method, the midpoint method
and others.
Second order Runge-Kutta methods
To avoid the calculation of the derivative in (7.10), the iteration equation is written as:
$y_{i+1} = y_i + h(c_0k_1 + c_1k_2)$   (7.11)
and it can be shown (see Appendix, section 6) that the coefficients satisfy the equations:
$c_0 + c_1 = 1, \qquad c_1p = \tfrac{1}{2}, \qquad c_1q = \tfrac{1}{2}$   (7.12)
These are three equations and there are four coefficients, so one of them can be chosen arbitrarily.
Usually $c_1$ is chosen, and the common values are:

$c_1 = 0 \;\rightarrow\; c_0 = 1,\; p = 0,\; q = 0$   (Euler method)
$c_1 = 1/2 \;\rightarrow\; c_0 = 1/2,\; p = 1,\; q = 1$   (Trapezoid method)
$c_1 = 1 \;\rightarrow\; c_0 = 0,\; p = 1/2,\; q = 1/2$   (Midpoint method)

Another method, not derived here, is Ralston's method, which is specified by the value
$c_1 = 2/3$, giving $c_0 = 1/3$, $p = 3/4$ and $q = 3/4$.
Similarly, for Taylor methods of higher order, equivalent expressions can be derived to avoid the
explicit calculation of the derivatives.
Fourth order Runge-Kutta method (RK4)
The most common of the Runge-Kutta methods is the order 4 method, called RK4 because its
global error is $O(h^4)$.
The derivation of the corresponding expression is rather involved and is omitted here.
The iterations are specified by:

$y_{i+1} = y_i + \frac{h}{6}(s_1 + 2s_2 + 2s_3 + s_4)$   (7.13)

where:

$s_1 = f(t_i, y_i)$
$s_2 = f(t_i + h/2,\; y_i + (h/2)s_1)$
$s_3 = f(t_i + h/2,\; y_i + (h/2)s_2)$
$s_4 = f(t_i + h,\; y_i + hs_3)$
(7.14)

Note that in (7.13) the term $h(s_1 + 2s_2 + 2s_3 + s_4)/6$ is used as the estimate of the integral of the
slope in the interval in a form that is more accurate than in the previous methods.
Programming of this procedure is straightforward and this is another reason for its popularity.
Example
Using the Runge-Kutta method of order 4 to solve the initial value problem $y' = ty + t^3$ with
$y(0) = 1$, which has the exact solution $y(t) = 3e^{t^2/2} - t^2 - 2$, gives:

steps step size h error at t=1
5 0.20000000 2.378807e-05
10 0.10000000 1.465466e-06
20 0.05000000 9.035429e-08
40 0.02500000 5.598290e-09
80 0.01250000 3.482015e-10
160 0.00625000 2.170353e-11
320 0.00312500 1.338929e-12
640 0.00156250 1.050271e-13

Results calculated using the script ExampleRK4.m:
L=1;
% Exact value at t=1
R=3*exp(0.5)-3;
disp('steps step size h error at t=1')
for n=[5,10,20,40,80,160,320,640]
    h=L/n;
    yk=1;
    tk=0;
    for k=1:n
        s1=functf(tk,yk);
        tk=tk+h/2;
        s2=functf(tk,yk+(h/2)*s1);
        s3=functf(tk,yk+(h/2)*s2);
        tk=tk+h/2;
        s4=functf(tk,yk+h*s3);
        yk=yk+(h/6)*(s1 + 2*s2 + 2*s3 + s4);
    end
    e=abs(R-yk);
    fprintf('%5d %12.8f %15.6e\n',n,h,e)
end

function s=functf(ti,yi)
    s=ti*yi+ti^3;
end

7.1.8 Systems of ODEs
A numerical solution of an IVP consisting of a system of ODEs can be implemented as a simple
extension of the same methodology used for a single ODE as described in (7.2) above.
For example, a system of first order ODEs for the functions: 1 2, , , ny y y , can be expressed by:

1 1 1, 2
2 2 1, 2
1, 2
' ( , , , )
' ( , , , )
' ( , , , )
n
n
n n n
y f t y y y
y f t y y y
y f t y y y
=
=
=




(7.15)

or 1 2' ( , , , , ), 1, 2, ,i i ny f t y y y i n= = 
each with its own initial conditions.
We can write the system in a compact form as the vector equation ' ( , )Y f t Y= where the vector
Y is: 1 2( , , , )
T
ny y y and ( , )f t Y is the vector 1 2( ( , ), ( , ), , ( , ))
T
nf t Y f t Y f t Y . The initial
conditions will be given by the vector 0 1 0 2 0 0( ( ), ( ), , ( ))nY y t y t y t=  .
Example
Apply the Euler method to the system of first order ODEs:

$y_1' = 2y_2 - y_1^2$
$y_2' = y_1 - y_2 - ty_2^2$

with the initial conditions $y_1(0) = 0$ and $y_2(0) = 1$.
The individual (scalar) equations for the Euler iterations are:

$y_{1,i+1} = y_{1,i} + h\left(2y_{2,i} - y_{1,i}^2\right)$
$y_{2,i+1} = y_{2,i} + h\left(y_{1,i} - y_{2,i} - t_iy_{2,i}^2\right)$

and a simple program can be written to perform these iterations starting from the initial conditions,
as sketched below.
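A minimal MATLAB sketch of these iterations is given below; the initial conditions are those of the example, while the step size and number of steps are arbitrary illustrative choices:

% Euler sketch for the system above, with y1(0) = 0 and y2(0) = 1.
h = 0.01;  N = 500;
t = 0;  y1 = 0;  y2 = 1;
for i = 1:N
    y1new = y1 + h*(2*y2 - y1^2);        % both updates use the old values
    y2new = y2 + h*(y1 - y2 - t*y2^2);
    y1 = y1new;  y2 = y2new;  t = t + h;
end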

7.1.9 Higher order ODEs
Higher order equations can also be formulated in the same form, as a system of equations. For
example, an equation of second order such as:

$y'' + aty' + by + c = 0$,  with the initial conditions  $y(t_0) = y_0$, $y'(t_0) = y_0'$

can be written as $y'' = f(t, y, y')$ where $f(t, y, y') = -aty' - by - c$.
We can define $y_1 = y$, $y_2 = y_1'\;(= y')$ and we have the system of equations:

$y_1' = y_2$
$y_2' = f(t, y_1, y_2) = -aty_2 - by_1 - c$

with the initial conditions $y_1(t_0) = y_{1,0}$, $y_2(t_0) = y_{2,0}$.
In general, for an equation of order n we have:

$y^{(n)} = f(t, y, y', y'', \ldots, y^{(n-1)})$   (7.16)

and we can treat this as a system of equations, where the vector Y is $Y = (y, y', y'', \ldots, y^{(n-1)})^T$,
so the system of equations becomes:

$\left(y^{(k-1)}\right)' = y^{(k)}, \quad k = 1, 2, \ldots, n-1$   (7.17a)

and   $y^{(n)} = f(t, y, y', y'', \ldots, y^{(n-1)})$   (7.17b)
Example
Convert the third order ODE: y''' = a (y'')^2 − y' y'' + y sin t into a system of first order equations.
Defining the functions y_1 = y, y_2 = y_1' and y_3 = y_2', we have the system of equations:
y_1' = y_2
y_2' = y_3
and y_3' = a y_3^2 − y_2 y_3 + y_1 sin t
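In the compact vector form Y' = f(t, Y), this example becomes a single vector-valued function. A minimal sketch (the function name functsys and the value a = 1 are illustrative assumptions):

function F = functsys(t, Y)      % Y = [y1; y2; y3] = [y; y'; y'']
a = 1;                           % illustrative value of the constant a
F = [ Y(2);
      Y(3);
      a*Y(3)^2 - Y(2)*Y(3) + Y(1)*sin(t) ];
end

Any of the scalar methods above then applies directly to the vector Y; for example a single Euler step is Y = Y + h*functsys(tk, Y).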

7.2 Solution of Partial Differential Equations
7.2.1 The Finite Difference Method: Solution of Differential Equations by Finite
Differences
We will study here a direct method to solve differential equations which is based on the use of
finite differences. This consists of the direct substitution of derivatives by finite differences and
thus the problem reduces to an algebraic form.
One Dimensional Problems:
Let’s consider a problem given by the equation:

Lf = g (7.18)

where L is a linear differential operator and g(x) is a known function. The problem is to find the
function f(x) satisfying equation (7.18) over a given region (interval) [a, b] and subjected to some
boundary conditions at a and b.
The basic idea is to substitute the derivatives by appropriate difference formulae like those
seen in section 6. This will convert the problem into a system of algebraic equations.
In order to apply the difference approximations systematically, we proceed as follows:


–– First, we divide the interval [a, b] into N equal subintervals of length h : h = (b–a)/N,
defining the points xi : xi = a + ih
–– Next, we approximate all derivatives in the operator L by appropriate difference formulae
(h must be sufficiently small – N large – to do this accurately).
–– Finally, we formulate the corresponding difference equation at each point x_i. This will
generate a linear system of N–1 equations in the N–1 unknown values f_i = f(x_i).
Example:
If we take Lf = g to be a general second order differential equation:
c f''(x) + d f'(x) + e f(x) = g(x)        (7.19)
with constant coefficients c, d and e, x ∈ [a, b], and boundary conditions:
known values f(a) at a and f(b) at b.        (7.20)
Using (6.11) and (6.7) at point i:
f_i'' ≈ (f_{i−1} − 2 f_i + f_{i+1}) / h^2   and   f_i' ≈ (f_{i+1} − f_{i−1}) / (2h)        (7.21)
results for the equation at point i:
c (f_{i−1} − 2 f_i + f_{i+1}) / h^2 + d (f_{i+1} − f_{i−1}) / (2h) + e f_i = g_i        (7.22)
or:
(c − (d/2) h) f_{i−1} + (e h^2 − 2c) f_i + (c + (d/2) h) f_{i+1} = g_i h^2        (7.23)
for all i except i = 1 and i = N–1, where the known boundary values enter: for i = 1, f_{i−1} = f(a), and
for i = N–1, f_{i+1} = f(b); these known terms are moved to the right-hand side.

This can be written as a matrix problem of the form: A f = g, where f = {f_i} and g = {h^2 g_i}
are vectors of order N–1. The matrix A has only 3 nonzero elements per row:
a_{i,i−1} = c − (d/2) h,   a_{i,i} = e h^2 − 2c,   a_{i,i+1} = c + (d/2) h

Exercise 7.1
Formulate the algorithm to solve a general second order differential equation over the
interval [a, b], with Dirichlet boundary conditions at a and b (known values of f at a and b). Write
a short computer program to implement it and use it to solve:
f'' + 2f' + 5f = e^(−x) sin(x) in [0, 5]; f(0) = f(5) = 0
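A minimal sketch of such a program for the general equation (7.19), assembling and solving the tridiagonal system above (the coefficients, source and boundary values below are illustrative, not those of the exercise):

% Sketch: solve c*f'' + d*f' + e*f = g on [a,b] with known f(a), f(b)
a = 0; b = 1; N = 50; h = (b-a)/N;               % interval and number of subintervals
c = 1; d = 0; e = 0; g = @(x) ones(size(x));     % illustrative coefficients and source
fa = 0; fb = 0;                                   % boundary values f(a) and f(b)
x  = a + (1:N-1)'*h;                              % interior nodes
lo = c - d*h/2;  di = e*h^2 - 2*c;  up = c + d*h/2;
A  = diag(di*ones(N-1,1)) + diag(lo*ones(N-2,1),-1) + diag(up*ones(N-2,1),1);
rhs = h^2*g(x);
rhs(1)   = rhs(1)   - lo*fa;                      % known boundary values moved to the RHS
rhs(end) = rhs(end) - up*fb;
f = A\rhs;                                        % values of f at the interior nodes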

Two Dimensional Problems
We consider now the problem Lf = g in 2 dimensions, where f is a function of 2 variables,
say: x and y. In this case, we need to approximate derivatives in x and y in L. To do this, we
superimpose a regular grid over the domain of the problem and (as for the 1–D case) we only
consider the values of f and g at the nodes of the grid.


Example:
Referring to the figure below, the problem consists of finding the potential distribution
between the inner and outer conducting surfaces of square cross-section when a fixed voltage is
applied between them. The equation describing this problem is the Laplace equation:

∇²φ = 0 with the boundary conditions φ = 0 on the outer conductor and φ = 1 on the inner
conductor.
To approximate
∇²φ = ∂²φ/∂x² + ∂²φ/∂y²        (7.24)
we can choose for convenience the same spacing h for x and y. Ignoring the symmetry of the
problem, the whole cross-section is discretized (as in Fig. 7.5). With this choice, there are 56 free
nodes (for which the potential is not known).


Fig. 7.5 Finite difference mesh for solution of the electrostatic field in square coaxial line.
On this mesh, we only consider the unknown internal nodal
values, and only those nodes are numbered. An internal point of
the grid, x_i, labelled O in Fig. 7.4, is surrounded by four others,
which for convenience are labelled N, S, E and W. For this
point we can approximate the derivatives in (7.24) by:
∂²φ/∂x² ≈ (φ_W − 2φ_O + φ_E) / h²   and   ∂²φ/∂y² ≈ (φ_N − 2φ_O + φ_S) / h²        (7.25)
Then the equation ∇²φ = 0 becomes simply:
∇²φ ≈ (φ_N + φ_S + φ_E + φ_W − 4φ_O) / h² = 0   or
φ_N + φ_S + φ_E + φ_W − 4φ_O = 0        (7.26)


Formulating this equation for each point in the grid and using the boundary conditions where
appropriate, we end up with a system of N equations, one for each of the N free points of the grid.
Applying equation (7.26) to point 1 of the mesh gives:
φ_10 + φ_2 + 0 + 0 − 4φ_1 = 0        (7.27)
and to point 2:   φ_11 + φ_3 + φ_1 + 0 − 4φ_2 = 0        (7.28)
A typical interior point such as 11 gives:
φ_2 + φ_20 + φ_12 + φ_10 − 4φ_11 = 0        (7.29)
and one near the inner conductor, 13, will give:
φ_4 + φ_14 + φ_12 + 1 − 4φ_13 = 0        (7.30)
In this way, we can assemble all 56 equations from the 56 mesh points of Fig. 7.5, in terms
of the 56 unknowns. The resulting 56 equations can be expressed as:
A x = y        (7.31a)
or, showing the typical structure (only nonzero entries are indicated; dots stand for the repeating pattern):

[ -4   1        1              ] [ φ1  ]   [ 0 ]
[  1  -4   1        1          ] [ φ2  ]   [ 0 ]
[      1  -4   1               ] [ φ3  ]   [ 0 ]
[          .    .    .         ] [  .  ] = [ . ]        (7.31b)
[               .   -4    1    ] [ φ55 ]   [ 0 ]
[                    1   -4    ] [ φ56 ]   [ 0 ]
The unknown vector x of (7.31a) is simply (φ1, φ2, . . . , φ56)^T.
The right-hand-side vector y of eqs. (7.31) consists of zeros except for the –1s coming from
equations like (7.30) corresponding to points next to the inner conductor with potential 1; namely,
from points 12 to 16, 20, 21, 24, 25, 28, 29, 32, 33, 36, 37, 41 to 45.
The 56×56 matrix A has mostly zero elements, except for ‘–4’ on the diagonal, and either
two, three or four ‘+1’ somewhere else on each matrix row, the number and distribution depending
on the geometric node location.
One has to be careful not to confuse the row-and-column numbers of the 2-D array A with
the ‘x and y coordinates’ of the physical problem. Each number 1 to 56 of the mesh points in the
figure above corresponds precisely to the row number of matrix A and the row number of column
vector x.
Equation (7.31) is standard, as given in (5.11), and could be solved with a standard library
package, with a Gauss elimination solver routine, or better, a sparse or a band-matrix version of
Gauss elimination. Better still would be the Gauss-Seidel or successive displacement. Section 2
in the Appendix shows details of such a solution.
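A minimal Gauss-Seidel sketch of equation (7.26); the grid size, the number of sweeps and the block of nodes used to represent the inner conductor are illustrative assumptions, not the actual mesh of Fig. 7.5:

% Sketch: Gauss-Seidel sweeps for phi_N + phi_S + phi_E + phi_W - 4*phi_O = 0
phi   = zeros(11,11);                              % nodal values, including boundary nodes
fixed = false(11,11);                              % nodes whose potential is prescribed
fixed([1 end],:) = true;  fixed(:,[1 end]) = true; % outer conductor, phi = 0
fixed(5:7,5:7)   = true;  phi(5:7,5:7)    = 1;     % inner conductor, phi = 1 (illustrative)
for sweep = 1:500                                  % fixed number of sweeps (illustrative)
    for i = 2:10
        for j = 2:10
            if ~fixed(i,j)
                phi(i,j) = 0.25*(phi(i-1,j) + phi(i+1,j) + phi(i,j-1) + phi(i,j+1));
            end
        end
    end
end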

Enforcing Boundary Conditions
In the description so far we have seen the implementation of the Dirichlet boundary


condition; that is a condition where the values of the desired function are known at the edges of
the region of interest (ends of the interval in 1-D or boundary of a 2-D or 3-D region). This has
been implemented in a straightforward way in equations (7.27), (7.28) and (7.30).
Frequently, other types of boundary conditions appear; for example, when the derivatives of
the function are known (instead of the values themselves) at the boundary – this is the Neumann
condition. On occasion, a mixed condition will apply, something like: f(r) + ∂f/∂n = K, where
the second term is the normal derivative (the derivative along the direction normal to the
boundary). This type of condition appears in some radiation problems (Sommerfeld radiation
condition).
We will see next some forms of dealing with these boundary conditions in the context of
finite difference calculations.

For example in the case of the ‘square coaxial problem’ studied earlier, we can see that the
solution will have symmetry properties, which makes unnecessary the calculation of the potential
over the complete cross-section. In fact only one eighth of the cross-section is actually needed.
The new region of interest can be one eighth of
the complete cross-section (the shaded region) or one
quarter of the cross-section, to avoid oblique lines.
In any case, the new boundary conditions needed on
the dashed lines that limit the new region of interest
are of the Neumann type: ∂φ/∂n = 0, since the lines
of constant potential will be perpendicular to those
edges. (n represents the normal to the boundary).

Fig. 7.6

We will need a different strategy to deal with these conditions.
For this condition it is more convenient to define the mesh in a different manner. If we place
the boundary at half the node spacing from the start of the mesh as in Fig. 7.7, we can implement
the normal derivative condition in a simple form:
(Note that the node numbers used in Fig. 7.7 do not correspond to the mesh numbering defined
earlier for the whole cross-section).


Fig. 7.7
The boundary condition that applies at point b (not
actually part of the mesh) is: ∂φ/∂n = 0. We can
approximate this by using a central difference with
the point 1 and the auxiliary point a outside the
mesh. Then:
∂φ/∂n |_b = (φ_a − φ_1)/h = 0, and then φ_a = φ_1.

So the corresponding equation for point 1 is: φN +φS + φE + φW − 4φ0 = 0
Substituting values we have:
0 +φ10 +φ1 +φ2 − 4φ1 = 0


or φ2 + φ10 − 3φ1 = 0
A Dirichlet condition can also be implemented in a similar form, if necessary. For example,
if part of a straight boundary is subjected to a Neumann condition and the rest to a Dirichlet
condition (as in the case of a conductor that does not cover the whole boundary), the
Dirichlet condition will have to be implemented on a mesh whose first row of nodes lies half a
spacing away from the edge.
Exercise 7.2:
Consider the following figure and the
point a on the boundary (not in the mesh) with
the points 5 and 15 and a virtual point (shown
in grey) above the boundary. Use Taylor
expansions up to order 2 at the 3 points ±h/2
and 3h/2 from a, to find the value of φ on the
virtual point when the function φ has a fixed
value V at the boundary. This corresponds to
the case where part of a boundary is subjected
to a Dirichlet boundary condition while the
other part has a Neumann condition.

Fig. 7.8

Using the results of Exercise 7.2, the difference equation corresponding to the discretization
of the Laplace equation at point 5, where the boundary condition is φ = V, is:
φ_4 + φ_6 + φ_15 + φ_N − 4φ_5 = 0,   but from Ex. 7.2:   φ_N = (1/3)(8V − 6φ_5 + φ_15),   so:
φ_4 + φ_6 + (4/3)φ_15 − 6φ_5 = −(8/3)V
Exercise 7.3
Using Taylor expansions, derive an equation to implement the Neumann condition using
five points along a line normal to the edge and at distances 0, h, 2h, 3h, and 4h.
Exercise 7.4 (Advanced)
Using two points like 11 and 12 in the figure of Exercise 7.2, find the equation corresponding
to the implementation of a radiation condition of the form: φ + p ∂φ/∂n = K, where p is a constant
with the same dimensions as h.
Repeat using 3 points along the same line and use the extra degree of freedom to increase
the degree of approximation (eliminating second and third derivatives).
Example: Heat conduction (diffusion) in a uniform rod.
A uniform rod of length L is initially at temperature 0. It is then connected to a heat source
at temperature 1 at one end while the other is attached to a sink at temperature 0. Find the
temperature distribution along the rod, as time varies. We need the time-domain solution
(transient). In the steady state we would expect a linear distribution of temperature between both
ends.
The equation describing the temperature variation in space and time, assuming a heat
conductivity of 1, is the normalized diffusion equation:



∂²u/∂x² = ∂u/∂t        (7.32)
where u(x,t) is the temperature. The boundary and initial conditions are:
u(0, t) = 0 and u(L, t) = 1 for all t
u(x, 0) = 0 for x < L
We can discretize the solution space (x, t) with a regular grid with spacings ∆x and ∆t. The solution
will be sought at positions x_i, i = 1, ..., M–1 (leaving out of the calculation the points at the ends
of the rod), so ∆x = L/M, and at times t_n, n = 0, 1, 2, ... .
Using this discretization, the boundary and initial conditions become:
u_0^n = 0 and u_M^n = 1 for all n
u_i^0 = 0 for i = 0, ..., M–1        (7.33)
We now discretize equation (7.32), converting the derivatives into differences:
For the second derivative in x at time n and position i, we can use the central difference formula:
∂²u/∂x² |_{i,n} = (u_{i−1}^n − 2 u_i^n + u_{i+1}^n) / ∆x²        (7.34)
In the right hand side of (7.32), we need to approximate the time derivative at the same point (i, n).
We could use the forward difference:
∂u/∂t ≈ (u_i^{n+1} − u_i^n) / ∆t        (7.35)
and in this case we get:
(u_{i−1}^n − 2 u_i^n + u_{i+1}^n) / ∆x² = (u_i^{n+1} − u_i^n) / ∆t        (7.36)
as the difference equation.
Rearranging:   u_i^{n+1} = (∆t/∆x²) u_{i−1}^n + (1 − 2∆t/∆x²) u_i^n + (∆t/∆x²) u_{i+1}^n        (7.37)
This gives us a form of calculating the temperature u at point
i and time n+1 as a function of u at points i, i–1 and i+1, at
time n. We have then a time-stepping algorithm.


Fig. 7.9
Equation (7.37) is valid for n = 0,...., N and i = 1, ...., M–1.
Special cases: for i = 1, u_{i−1}^n = 0, and for i = M–1, u_{i+1}^n = 1 for all n (at all times).
If we call b = 1 − 2∆t/∆x² and c = ∆t/∆x², we can rewrite (7.37) as:



u_1^{n+1} = b u_1^n + c u_2^n
u_2^{n+1} = c u_1^n + b u_2^n + c u_3^n
. . .
u_{M−1}^{n+1} = c u_{M−2}^n + b u_{M−1}^n + c        (7.38)
which can be written in matrix form as: u^{n+1} = A u^n + v, where the matrix A and the vector v are:

    [ b  c          ]            [ 0 ]
    [ c  b  c       ]            [ 0 ]
A = [    c  b  c    ]   and  v = [ . ]        (7.39)
    [       .  .  . ]            [ 0 ]
    [          c  b ]            [ c ]
u^n is the vector containing all values u_i^n. It is known at n = 0, so the matrix equation (7.38) can
be solved for all successive time steps.
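A minimal sketch of this time stepping (L, M, the time step and the number of steps are illustrative; ∆t is kept below ∆x²/2, see the stability comment that follows):

% Sketch: explicit time stepping of (7.37) for the heated rod
L = 1; M = 20; dx = L/M; dt = 0.4*dx^2;     % dt chosen below dx^2/2
b = 1 - 2*dt/dx^2;  c = dt/dx^2;
u = zeros(M-1,1);                            % interior temperatures, all 0 at t = 0
for n = 1:2000                               % number of time steps (illustrative)
    uL = [0; u(1:end-1)];                    % left neighbours, with u = 0 at x = 0
    uR = [u(2:end); 1];                      % right neighbours, with u = 1 at x = L
    u  = c*uL + b*u + c*uR;                  % equation (7.37) for all interior nodes
end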

Problems with this formulation:
– It is unstable unless ∆t ≤ ∆x²/2 (c ≤ 0.5 or b ≥ 0). This is due to the rather poor
approximation to the time derivative provided by the forward difference (7.35). A better scheme
can be constructed using the central difference in the RHS of (7.32) instead of the forward
difference.
The Crank-Nicolson method: Using the central difference for the time derivative
A central difference approximation to the RHS of (7.32) would be:
∂u/∂t |_{i,n} = (u_i^{n+1} − u_i^{n−1}) / (2∆t)
This would involve values of u at three time steps: n–1, n, and n+1, which would be undesirable.
A solution to this is to shrink the gap between the two values, using instead of the difference
between u_i^{n+1} and u_i^{n−1}, the difference between u_i^{n+1} and u_i^n. However, if we consider this a
'central' difference, the derivative must be evaluated at the central point, which corresponds to the
value n+1/2:

∂u/∂t |_{i,n+1/2} = (u_i^{n+1} − u_i^n) / ∆t        (7.40)
We then need to evaluate the left hand side of (7.32) at the same time, but this is not on the grid!
We will have:

∂²u/∂x² |_{i,n+1/2} = (u_{i−1}^{n+1/2} − 2 u_i^{n+1/2} + u_{i+1}^{n+1/2}) / ∆x²        (7.41)
Since the values of u are restricted to positions in the grid, we have to approximate the values in
the RHS of (7.41). Since those values are at the centre of the intervals, we can approximate them
by the average between the neighbouring grid points:

u_i^{n+1/2} ≈ (u_i^{n+1} + u_i^n) / 2   and similarly for i–1 and i+1.        (7.42)


We can now substitute (7.42) into (7.41) and evaluate the equation (7.32) at time n+1/2. After re-
arranging this gives:
u_{i−1}^{n+1} − 2d u_i^{n+1} + u_{i+1}^{n+1} = −u_{i−1}^n + 2e u_i^n − u_{i+1}^n        (7.43)
where d = 1 + ∆x²/∆t and e = 1 − ∆x²/∆t.

This form of treating the derivative (evaluating the equation between grid points in order to
have a central difference approximation for the first order derivative) is called the Crank-Nicolson
method and has several advantages over the previous formulation (using the forward difference).

Fig. 7.10 shows the scheme of calculation: the points or
values involved in each calculation. The dark spot represents
the position where the equation is evaluated. This method
provides a second order approximation for both derivatives
and also, very important, it is unconditionally stable.
Fig. 7.10 Crank-Nicolson scheme

We can now write (7.43) for each value of i as in (7.38), considering the special cases at the ends
of the rod and at t = 0, and write the corresponding matrix form. In this case we will get:
A u^{n+1} = B u^n − 2v        (7.44)
where:

    [ -2d   1            ]        [ 2e  -1           ]            [ 0 ]
    [   1  -2d   1       ]        [ -1  2e  -1       ]            [ 0 ]
A = [        .   .   .   ] ,  B = [      .   .   .   ]   and  v = [ . ]
    [            1  -2d  ]        [         -1   2e  ]            [ 1 ]
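A minimal sketch of the Crank-Nicolson stepping (7.44); the mesh size and time step are illustrative, and no stability restriction on ∆t is needed:

% Sketch: Crank-Nicolson stepping A*u(n+1) = B*u(n) - 2*v for the heated rod
L = 1; M = 20; dx = L/M; dt = 0.01;          % illustrative values
d = 1 + dx^2/dt;  e = 1 - dx^2/dt;
S = diag(ones(M-2,1),1) + diag(ones(M-2,1),-1);   % off-diagonal 1's
A = -2*d*eye(M-1) + S;
B =  2*e*eye(M-1) - S;
v = zeros(M-1,1); v(end) = 1;                % contribution of the end held at u = 1
u = zeros(M-1,1);                            % initial temperatures of the interior nodes
for n = 1:200
    u = A\(B*u - 2*v);                       % solve the tridiagonal system at each step
end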
Example
Consider now a parabolic equation in 2 space dimensions and time, like the Schrödinger
equation or the diffusion equation in 2D:

∂²u/∂x² + ∂²u/∂y² = a ∂u/∂t        (7.45)
For example, this could represent the temperature distribution over a 2-dimensional plate. Let’s
consider a square plate of length 1 in x and y and the following boundary conditions:

a) u(0,y,t) = 1
b) u(1,y,t) = 0
c) u(x,0,t) = 1
d) u(x,1,t) = 0
e) u(x,y,0) = 0 for x > 0 and y > 0
That is, the sides x = 0 and y = 0 are kept at
temperature 1 at all times while the sides x = 1
and y = 1 are kept at u = 0.
The whole plate, except the sides x = 0 and y = 0
are at temperature 0 at t = 0.

The following diagram shows the discretization for the x coordinate:





There are R+2 points with R unknown values (i = 1, ..., R). The two extreme points: i = 0 and i
= R+1 correspond to x = 0 and x = 1, respectively.
The discretization of the y coordinate can be made in similar form. For convenience, we can
also use R+2 points, where only R are unknown: (j = 1, ..., R).
Discretization of time is done by considering t = n∆ t, where ∆ t is the time step.
Approximation of the time derivative (first order):
As in the previous example, in order to use the central difference and only two time levels,
we use the Crank-Nicolson method. That is, equation (7.45) will be evaluated half way between
time grid points.
This will give:

∂u/∂t |_{i,j,n+1/2} = (u_{i,j}^{n+1} − u_{i,j}^n) / ∆t        (7.46)
We need to approximate the LHS at the same time using the average of the values at n and n+1:
∇²u^{(n+1/2)} = (1/2)(∇²u^{(n)} + ∇²u^{(n+1)})
Applying this and (7.46) to (7.45) we get:
(1/2)(∇²u^{(n+1)} + ∇²u^{(n)}) = (a/∆t)(u^{(n+1)} − u^{(n)})
or, re-writing:
∇²u^{(n+1)} − (2a/∆t) u^{(n+1)} = −[ ∇²u^{(n)} + (2a/∆t) u^{(n)} ]        (7.47)
where u is still a continuous function of position.
Approximation of the space derivative (second order):
Using the central difference for the second order derivatives and using ∆x = ∆y = h, we get:
∇²u = (u_N + u_S + u_W + u_E − 4 u_O) / h²
and using this in (7.47):
u_N^{n+1} + u_S^{n+1} + u_W^{n+1} + u_E^{n+1} − (4 + 2h²a/∆t) u_O^{n+1}
      = −[ u_N^n + u_S^n + u_W^n + u_E^n − (4 − 2h²a/∆t) u_O^n ]        (7.48)
Defining now node numbering over the grid and a vector u^{(n)} containing all the unknown values
(for all i and j), (7.48) can be written as a matrix equation of the form:
A u^{(n+1)} = − B u^{(n)}        (7.49)
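A minimal sketch of how A and B of (7.49) can be assembled with Kronecker products for the R×R grid of unknowns (R, a and ∆t are illustrative; the right-hand-side contributions of the sides held at u = 1 are only indicated):

% Sketch: matrices of (7.49) for the 2D Crank-Nicolson scheme
R = 10; h = 1/(R+1); a = 1; dt = 1e-3;                 % illustrative values
T  = spdiags(ones(R,1)*[1 -2 1], -1:1, R, R);          % 1D second-difference stencil
L2 = kron(speye(R), T) + kron(T, speye(R));            % five-point h^2*Laplacian pattern
A  = L2 - (2*h^2*a/dt)*speye(R^2);                     % left-hand matrix of (7.48)
B  = L2 + (2*h^2*a/dt)*speye(R^2);                     % so (7.48) reads A*u(n+1) = -B*u(n) + boundary terms
% The known boundary values (u = 1 on the sides x = 0 and y = 0) add constants to the RHS.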
Exercise 7.5
A length L of transmission line is terminated at both ends with a short circuit. A unit impulse
voltage is applied in the middle of the line at time t = 0.
The voltage φ(x, t) along the line satisfies the following differential equation:



p² ∂²φ/∂x² + ∂²φ/∂t² = 0
with the boundary and initial conditions:
φ(0, t) = φ(L, t) = 0,   φ(x, t) = 0 for t < 0   and   φ(L/2, 0) = 1

a) By discretizing both the coordinate x and time t as xi = (i – 1)∆x , i = 1, 2, .., R+1 and
tm = m∆t , m = 0, 1, 2, ..., use finite differences to formulate the numerical solution of the equation
above. Show that the problem reduces to a matrix problem of the form:
Φm+1 = AΦm + BΦm−1
where A and B are matrices and Φm is a vector containing the voltages at each of the discretization
points xi at time tm.

b) Choosing R = 7, find the matrices A and B giving the values of their elements, taking
special care at the edges of the grid. Show the discretized equation corresponding to points at the
edge of the grid (for i = 1 or R+1) and consider the boundary condition φ(0) = φ(L) = 0.

c) How would the matrices A and B change if the boundary condition was changed to ∂φ/∂x = 0
at x = 0, L? (Corresponding in this case to an open circuit). Show the discretized equation
corresponding to one of the edge points and propose a way to transform it so it contains only values
corresponding to points in the defined grid.



8. Solution of Boundary Value Problems
A physical problem is often described by a differential equation of the form:
( ) ( )u x s x= L on a region Ω. (8.1)
where L is a linear differential operator, ( )u x is a function to be found, ( )s x is a known function
and x is the position vector of any point in Ω.
We also need to impose boundary conditions on the values of u and/or its derivatives on Γ,
the boundary of Ω, in order to have a unique solution to the problem.
The boundary conditions can be written in general as:
( ) ( )u x t x= B on Γ. (8.2)
with B, a linear differential operator and ( )t x a known function. So we will have for example:
B = 1 :   u(x) = t(x) :   known values of u on Γ (Dirichlet condition)
B = ∂/∂n :   ∂u(x)/∂n = t(x) :   fixed values of the normal derivative (Neumann condition)
B = ∂/∂n + k :   ∂u/∂n + k u = t(x) :   mixed condition (Radiation condition)
A problem described in this form is known as a boundary value problem (differential equation
+ boundary conditions).
8.1 Strong and Weak Solutions to Boundary Value Problems
8.1.1 Strong Solution:
This is the direct approach to the solution of the problem, as specified above in (8.1)-(8.2);
for example, the finite difference solution method studied earlier is of this type.
8.1.2 Weak Solution:
This is an indirect approach. Instead of trying to solve the problem directly, we can re-
formulate it as the search for a function that satisfies some conditions also satisfied by the solution
to the original problem (8.1)-(8.2). With a proper definition of these conditions the search will
lead to the correct and unique solution.
Before expanding on the details of this search, we need to use the idea of inner product
between two functions defined earlier. This will allow us to quantify (put a number to) how close
a function is to another or how big or small is an error function.

The commonest definition of inner product between two functions f and g is:
⟨f, g⟩ = ∫ f g dΩ for real f and g,   or   ⟨f, g⟩ = ∫ f g* dΩ for complex functions f and g        (8.3)
In general, the inner product between two functions will be a real number obtained by global
operations between the functions over the domain and satisfying some defining properties (for real
functions):
i) ⟨f, g⟩ = ⟨g, f⟩
ii) ⟨αf + βg, h⟩ = α⟨f, h⟩ + β⟨g, h⟩,   α and β scalars        (8.4)


iii) ⟨f, f⟩ = 0 if and only if f = 0

From this definition we can deduce that:
If for a function r we have that ⟨r, h⟩ = 0 for any choice of h, then r ≡ 0.
We will try to use this property for testing an error residual r.
We are now in position to formulate the weak solution of (8.1)-(8.2) as:
Given the function ( )s x and an arbitrary function ( )h x in Ω; find the function ( )u x that satisfies:
hshu ,, =L or 0, =− hsuL (8.5
for any choice of the function h.
8.2 Approximate Solutions to the Weak Formulation
8.2.1 The Rayleigh-Ritz Method
In this case, instead of trying to find the exact solution ( )u x that satisfies (8.5), we only look
for an approximation. We can choose a set of basis functions and assume that the wanted function,
or rather, a suitable approximation to it, can be represented as an expansion in terms of the basis
functions (as for example with Fourier expansions using sines and cosines).
Now, once we have chosen the basis functions, the unknown is no longer the function u, but
its expansion coefficients; that is: numbers! so the problem (8.5) is converted from “finding a
function u among all possible functions defined on Ω” into the search for a set of numbers. We
now need some method to find these numbers.
We will see two methods to solve (8.5) using the Rayleigh-Ritz procedure above:

Weighted Residuals
Variational Method

8.2.2 Weighted Residuals
For this approach the problem (8.5) is rewritten in the form:   r = L u – s = 0        (8.6)
where the function r is the residual (or error): the difference between the LHS and the RHS of (8.1)
when we try any function in place of u. This residual will only be zero when the function we use is
the correct solution to the problem.
We can now look at the solution to (8.6) as an optimization (or minimization) problem; that
is: “Find the function u such that r = 0” or in approximate way:
“Find the function u such that r is as small as possible” and here we need to be able to
measure how ‘small’ is the error (or residual).
Using these ideas on the weak formulation, this is now transformed into:
Given the function ( )s x and an arbitrary function ( )h x in Ω; find the function ( )u x that satisfies:
⟨r, h⟩ = 0   (where r = L u – s)        (8.7)
for any choice of the function h.

Applying the Rayleigh-Ritz procedure, we can put:

u(x) ≈ Σ_{j=1}^{N} d_j b_j(x)        (8.8)


where d_j, j = 1, ... , N are scalar coefficients and
b_j(x), j = 1, ... , N are a set of linearly independent basis functions (normally called
trial functions – chosen).

We now need to introduce the function h. For this, we can also use an expansion, in general
using another set of functions w_i(x), for example:
h(x) = Σ_{i=1}^{N} c_i w_i(x)        (8.9)
where c_i, i = 1, ... , N are scalar coefficients and
w_i(x), i = 1, ... , N are a set of linearly independent expansion functions for h
(normally called weighting functions – also chosen).

Using now (8.9) in (8.7) we get:
⟨r, h⟩ = ⟨r(x), Σ_{i=1}^{N} c_i w_i(x)⟩ = Σ_{i=1}^{N} c_i ⟨r(x), w_i(x)⟩ = 0        (8.10)
for any choice of the coefficients c_i.
Since (8.10) has to be satisfied for any choice of the coefficients c_i (equivalent to saying 'any
choice of h'), we can deduce that:
⟨r, w_i⟩ = 0 for all i, i = 1, ... , N        (8.11)
which is simpler than (8.10). (It comes from the choice: c_i = 1 and all others = 0, in sequence.)
So, we can conclude that we don't need to test the residual against any (and all) possible
functions h; we only need to test it against each of the chosen weighting functions w_i, i = 1, ... , N.
The weak formulation can now be expressed as:
Given the function ( )s x and the weighting functions ( )iw x
 in Ω; find the function ( )u x that
satisfies:
, 0ir w = for all i, i = 1, ... , N (where r = Lu – s) (8.11)
Now, we can expand the function u in terms of the trial (basis) functions as in (8.8) and use this
expansion in r: r = Lu – s. With this, (8.11) becomes:
⟨r, w_i⟩ = ⟨L u − s, w_i⟩ = ⟨L( Σ_{j=1}^{N} d_j b_j(x) ) − s(x), w_i(x)⟩ = 0   for all i
or
⟨r, w_i⟩ = Σ_{j=1}^{N} d_j ⟨L b_j, w_i⟩ − ⟨s, w_i⟩ = 0   for all i        (8.12)
Note that in this expression the only unknowns are the coefficients dj.
We can rewrite this expression (8.12) as:   Σ_{j=1}^{N} a_ij d_j = s_i   for all i: i = 1, ..., N
which can be put in matrix notation as:


A d = s (8.13)
where A = {a_ij}, d = {d_j} and s = {s_i}, with a_ij = ⟨L b_j, w_i⟩ and s_i = ⟨s, w_i⟩.
Since the trial functions b_j are known, the matrix elements a_ij are all known and the problem has
been reduced to solving (8.13), finding the coefficients d_j of the expansion of the function u.

Different choices of the trial functions and the weighting functions define the different
variants by which this method is known:

i) w_i = b_i   –––––––>   method of Galerkin
ii) w_i(x) = δ(x − x_i)   –––––––>   Point matching method (or collocation method). This
is equivalent to asking for the residual to be zero at a fixed number of points in the domain.
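A minimal sketch of the Galerkin choice, for the illustrative problem L u = u'' = s(x) on [0, 1] with u(0) = u(1) = 0 and sine trial functions b_j(x) = sin(jπx) (these choices, and the use of trapezoidal integration, are assumptions made only for the example):

% Sketch: Galerkin (w_i = b_i) solution of u'' = s on [0,1], u(0) = u(1) = 0
N = 5;  x = linspace(0, 1, 401);
s = @(x) x.*(1 - x);                      % illustrative source term
A = zeros(N);  rhs = zeros(N, 1);
for i = 1:N
    wi = sin(i*pi*x);                     % weighting function (Galerkin: w_i = b_i)
    for j = 1:N
        Lbj = -(j*pi)^2*sin(j*pi*x);      % L b_j = b_j'' for b_j = sin(j*pi*x)
        A(i,j) = trapz(x, Lbj.*wi);       % a_ij = <L b_j, w_i>
    end
    rhs(i) = trapz(x, s(x).*wi);          % s_i = <s, w_i>
end
d = A\rhs;                                % expansion coefficients d_j of (8.8)
u = (sin(pi*(1:N)'*x)).'*d;               % reconstructed approximation u(x) on the grid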
8.2.3 Variational Method

The central idea of this method is to find a functional* (or variational expression) associated
to the boundary value problem and for which the solution of the BV problem leads to a stationary
value of the functional. Combining this idea with the Rayleigh-Ritz procedure we can develop a
systematic solution method.

Before introducing numerical techniques to solve variational formulations, let’s examine an
example to illustrate the nature of the variational approach:

Example:
The problem is to find the path of light travel between two points. We can start using a
variational approach and for this we need a variational expression related to this problem. In
particular, for this case we can use Fermat’s Principle that says that ‘light travels through the
quickest path’ (Note that this is not necessarily the shortest). We can formulate this statement in
the form:
time = min ∫_{P1}^{P2} ds / v(s)        (8.14)
where v(s) is the velocity point by point along the path. We are not really interested in the actual
time taken to go from P1 to P2, but in the conditions that this minimum imposes on the path s(x,y)
itself. That is, we want to find the actual path between these points that minimizes this time. We
can write for the velocity: v = c/n, where n(x,y) is the refractive index.
Let’s consider first a uniform medium, that is, one with n uniform (constant). In that case,
(8.14) becomes:

* A functional is simply a function of a function over a complete domain which gives as a result
a number. For example: J(φ) = ∫_Ω φ² dΩ. This expression gives a numerical value for each
function φ we use in J(φ). Note that J is not a function of x.



time = (n/c) min ∫_{P1}^{P2} ds = (n/c) min(path length)        (8.15)
The integral in this case reduces to the actual length of the path, so the above statement asks for
the path of ‘minimum length’ or the shortest path between P1 to P2. Obviously, the solution in this
case is the straight line joining the points. However, you can see that (8.14) can be applied to an
inhomogeneous medium, and the minimization process should lead to the actual trajectory.
Extending the example a little more, let’s consider an interface between two media with
different refractive indices. Without losing generality, we can consider a situation like that of the
figure 8.1.
From above, we know that in each media the path will be a straight line, but what are the
coordinates of the point on the y–axis (interface) where both straight lines meet?
Applying (8.14) gives:
time = (1/c) min [ ∫_{P1}^{P0} n_1 ds + ∫_{P0}^{P2} n_2 ds ]        (8.16)

Both integrals correspond to the respective lengths of
the two branches of the total path, but we don’t know
the coordinate y0 of the point P0. We can re-write
(8.16) in the form:


time = (1/c) min [ n_1 √(x_1² + (y_1 − y_0)²) + n_2 √(x_2² + (y_2 − y_0)²) ]        (8.17)

where the only variable (unknown) is y0. To find it we need to find the minimum, and for this we
do:
d(time)/dy_0 = 0 = n_1 (−(y_1 − y_0)) / √(x_1² + (y_1 − y_0)²) + n_2 (−(y_2 − y_0)) / √(x_2² + (y_2 − y_0)²)        (8.18)
Now, from the figure we can observe that the right hand side can be written in terms of the angles
of incidence and refraction as:
n1 sinα1 − n2 sinα2 = 0 (8.19)
as the condition the point P0 must satisfy, and we know this is right because (8.19) is the familiar
Snell’s law.

So, the general idea of this variational approach is to formulate the problem in a form that we look
for a stationary condition (maximum, minimum, inflexion point) on some parameter which
depends on the desired solution. (Like in the above problem, we looked for the time, which
depends on the actual path travelled).
An important property of a variational approach is that precisely because the solution
function produces a stationary value of the functional, this is rather insensitive to small
perturbations of the solution (approximations). This is a very desirable property, particularly for
the application of numerical methods where all solutions are only approximate. To illustrate this
property, let’s analyse another example:

Fig. 8.1


Example
Consider the problem of finding the natural resonant frequencies of a vibrating string (a
chord in a guitar). For this problem, an appropriate variational expression is (do not worry about
where this comes from, but there is a proof in the Appendix):
k² = s.v. [ ∫_a^b (dy/dx)² dx ] / [ ∫_a^b y² dx ]        (8.20)
The above expression corresponds to the k–number or resonant frequencies of a string vibrating
freely and attached at the ends at a and b.
For simplicity, let’s change the limits to –a and a. The first mode of oscillation has the
form:
y = A cos(πx/2a)   then   dy/dx = −(Aπ/2a) sin(πx/2a)
Using this exact solution in (8.20), we get for the first mode (prove this):
k² = π²/(4a²)   then   k = π/(2a) ≈ 1.571/a

Now, to show how a ‘bad’ approximation of the function y can still give a rather acceptable value
for k, let’s try a simple triangular shape (instead of the correct sinusoidal shape of the vibrating
string):







y = A(1 + x/a)   for x < 0
y = A(1 − x/a)   for x > 0

Fig. 8.2
then dy/dx = A/a for x < 0 and dy/dx = −A/a for x > 0, and using these values in (8.20) gives
(prove this):
k² = 3/a²   then   k ≈ 1.732/a
which is not too bad considering how coarse the approximation to y(x) is. If instead of the
triangular shape we try a second order (parabolic) shape:
y = A(1 − (x/a)²)   then   dy/dx = −2Ax/a²
Substituting these values in (8.20) gives now:
k² = 2.5/a²   then   k ≈ 1.581/a
which is a rather good approximation.
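This insensitivity can also be checked numerically; a minimal sketch evaluating the quotient (8.20) for the parabolic trial shape (a = 1 and the grid are illustrative):

% Sketch: numerical evaluation of the Rayleigh quotient (8.20) for a trial shape
a  = 1;                                    % illustrative half-length of the string
x  = linspace(-a, a, 2001);
y  = 1 - (x/a).^2;                         % parabolic trial function (A = 1)
dy = gradient(y, x);                       % numerical dy/dx
k2 = trapz(x, dy.^2)/trapz(x, y.^2);       % stationary-value estimate of k^2
k  = sqrt(k2);                             % close to 1.581/a, as found above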

Now, how can we use this method systematically? As said before, the use of the Rayleigh-
Ritz procedure allows us to construct a systematic numerical method. In summary, we can specify


the necessary steps as:
BV problem: L u = s   →   Find a variational expression J(u)
Use Rayleigh-Ritz: u = Σ_{j=1}^{N} d_j b_j   →   Insert in J(u)
Find the stationary value of J(u) = J({d_j}), that is, find the coefficients d_j
Reconstruct u = Σ_{j=1}^{N} d_j b_j;   then u is the solution of the BV problem
Many variational formulations are derived from fundamental principles: minimisation of
energy, reciprocity, conservation laws, etc. They can also be formulated in general for certain
types of boundary value problems. See appendix for expressions appropriate to solve problems of
the type L u = s and L u = λu and their derivation.
However, we will skip here the problem of actually finding the variational expression
corresponding to a boundary value problem, simply saying that there are systematic methods to find
them. We will be concerned here with how to solve a problem once we already have a variational
formulation.
Example
For the problem of the square coaxial (or square capacitor) seen earlier, the BV problem is
defined by the Laplace equation ∇²φ = 0 with some boundary conditions (L = ∇², u = φ, s = 0).
An appropriate functional (variational expression) for this case is:
J(φ) = ∫_Ω (∇φ)² dΩ   (given)        (8.21)
Using Rayleigh-Ritz:
J(φ) = ∫_Ω [ ∇( Σ_{j=1}^{N} d_j b_j(x, y) ) ]² dΩ = ∫_Ω [ Σ_{j=1}^{N} d_j ∇b_j(x, y) ]² dΩ
or
J(φ) = Σ_{i=1}^{N} Σ_{j=1}^{N} d_i d_j ∫_Ω ∇b_i(x, y) · ∇b_j(x, y) dΩ        (8.22)
This can be written as the matrix equation: J(d) = dᵀ A d,
where d = {d_j} and a_ij = ∫_Ω ∇b_i · ∇b_j dΩ
Now, find the stationary value:   ∂J/∂d_i = 0   for all i: i = 1, ... , N


so, applying it to (8.22):
∂J/∂d_i = 2 Σ_{j=1}^{N} a_ij d_j = 0,   for all i: i = 1, ... , N
And the problem reduces to the matrix equation:   A d = 0        (8.23)
Solving the system of equations (8.23) we obtain the coefficients d_j and the unknown
function can be obtained as:
u(x, y) = Σ_{j=1}^{N} d_j b_j(x, y)

We can see that both methods, the weighted residuals and the variational method transform
the BV problem into an algebraic, matrix problem.

One of the first steps in the implementation of either method is the choice of appropriate
expansion functions to use in the Rayleigh-Ritz procedure: basis or trial functions and weighting
functions.
The finite element method provides a simple form to construct these functions and
implementing these methods.



9. FINITE ELEMENTS
As the name suggests, the finite element method is based on the division of the domain of
interest Ω into ‘elements’, or small pieces Ei that cover Ω completely but without intersections:
They constitute a tessellation (tiling) of Ω:
∪_i E_i = Ω;   E_i ∩ E_j = ∅ (i ≠ j)


Over the subdivided domain we apply the methods discussed earlier (either weighted
residuals or variational). The basis functions are defined locally in each element and because each
element is small, these functions can be very simple and still constitute overall a good
approximation of the desired function. In this form, inside each element E_e the wanted function
u(x) in (8.1) is represented by a local approximation u_e(x) valid only in element number e: E_e.
The complete function u(x) over the whole domain Ω is then simply approximated by the
addition of all the local pieces:   u(x) ≈ Σ_e u_e(x)
An important characteristic of this method is that it is ‘exact-in-the-limit’, that is, the degree
of approximation can only improve when the number of elements increase, the solution gradually
and monotonically converging to the exact value. In this form, a solution can always be obtained
to any degree of approximation, provided the availability of computer resources.
9.1 One dimensional problems
Let’s consider first a one dimensional case with Ω as the interval [a, b]. We first divide Ω
into N subintervals, not necessarily equal in size, defined by N+1 nodes xi, as shown in Fig. 9.1.
The wanted function u(x) is then locally approximated in each subinterval by a simple function,
for example, a straight line: a function of the type: ax + b, defined differently in each subinterval.
The total approximation ˜ u (x) is then the superposition of all these locally defined functions, a
piecewise linear approximation to u(x). The amount of error in this approximation will depend on
the size of each element (and consequently on the total number of elements) and more importantly,
on their size in relation to the local variation of the desired function.


Fig. 9.1 Fig. 9.2
The function ˜ u (x) , approximation to u(x), is the addition of the locally defined functions
˜ u e(x) which are only nonzero in the subinterval e. Now, these local functions ˜ u e(x) can be defined
as the superposition of interpolation functions Ni(x) and Ni+1(x) as shown in the Fig. 9.2. From
Fig. 9.2 we can see that the function ˜ u e(x), local approximation to u(x) in element e, can be written
as:


ũ_e(x) = u_i N_i(x) + u_{i+1} N_{i+1}(x)        (9.1)
Now, if we consider the neighbouring subintervals and the associated interpolation functions
as in the figure below, we can extend the definition of these functions to form the triangular shapes
below:

Fig. 9.3
With this definition, we can write for the function ˜ u (x) in the complete domain Ω:
u(x) ≈ ũ(x) = Σ_{e=1}^{Ne} ũ_e(x) = Σ_{i=1}^{Np} u_i N_i(x)        (9.2)
(Np is the number of nodes, Ne is the number of elements). The functions Ni(x) are defined as the
(double-sided) interpolation functions at node i, i = 1, ... , Np so Ni(x) = 1 at node i and 0 at all
other nodes.

This form of expanding u(x) has a very useful property:
— The trial functions Ni(x), known in the FE method as shape functions, are interpolation
functions, and as a consequence, the coefficients of the expansion, u_i in (9.2), are the
nodal values, that is, the values of the wanted function at the nodal points. Solving the
resultant matrix equation will give these values directly.
Exercise 9.1:
Examine Fig. 9.3 and Fig. 9.2 and show that indeed (9.1) is valid in the element (xi, xi+1) and
so (9.2) is valid over the full domain.
Example
Consider the boundary value problem:   d²y/dx² + k²y = 0   for x ∈ [a, b] with y(a) = y(b) = 0

This corresponds for example, to the equation describing the envelope of the transverse
displacement of a vibrating string attached at both ends. A suitable variational expression
(corresponding to resonant frequencies) is the following, as seen in (8.20) and in the Appendix:


J = k² = [ ∫_a^b (dy/dx)² dx ] / [ ∫_a^b y² dx ]   (given)        (9.3)
Using now the expansion:   y = Σ_{j=1}^{N} y_j N_j(x)        (9.4)
where N is the number of nodes, yj are the nodal values (unknown) and Nj(x) are the shape
functions.


From (9.3) we have:   k² = k²(y_j) = Q / R,   where:
Q = ∫_a^b [ d/dx ( Σ_{j=1}^{N} y_j N_j(x) ) ]² dx = ∫_a^b [ Σ_{j=1}^{N} y_j dN_j/dx ]² dx
  = ∫_a^b ( Σ_{j=1}^{N} y_j dN_j/dx )( Σ_{k=1}^{N} y_k dN_k/dx ) dx        (9.5)
and
R = ∫_a^b [ Σ_{j=1}^{N} y_j N_j(x) ]² dx = ∫_a^b ( Σ_{j=1}^{N} y_j N_j(x) )( Σ_{k=1}^{N} y_k N_k(x) ) dx        (9.6)
To find the stationary value, we do:   dk²/dy_i = 0   for each y_i, i = 1, ... , N        (9.7)
But since k² = Q/R, then dk²/dy_i = (Q' R − Q R') / R², so dk²/dy_i = 0 ⇒ Q' R = Q R' and
Q' = (Q/R) R' = k² R',   so finally:   dQ/dy_i = k² dR/dy_i   for all y_i, i = 1, ... , N        (9.8)
We now have to evaluate these derivatives:
dQ/dy_i = ∫_a^b (dN_i/dx)( Σ_{k=1}^{N} y_k dN_k/dx ) dx + ∫_a^b ( Σ_{j=1}^{N} y_j dN_j/dx )(dN_i/dx) dx
or
dQ/dy_i = 2 ∫_a^b (dN_i/dx)( Σ_{j=1}^{N} y_j dN_j/dx ) dx = 2 Σ_{j=1}^{N} y_j ∫_a^b (dN_i/dx)(dN_j/dx) dx
which can be written as:
dQ/dy_i = 2 Σ_{j=1}^{N} a_ij y_j   where   a_ij = ∫_a^b (dN_i/dx)(dN_j/dx) dx        (9.9)
For the second term:
dR/dy_i = ∫_a^b N_i(x)( Σ_{j=1}^{N} y_j N_j(x) ) dx + ∫_a^b ( Σ_{k=1}^{N} y_k N_k(x) ) N_i(x) dx
or
dR/dy_i = 2 Σ_{j=1}^{N} y_j ∫_a^b N_i(x) N_j(x) dx = 2 Σ_{j=1}^{N} b_ij y_j   with   b_ij = ∫_a^b N_i(x) N_j(x) dx        (9.10)
Replacing (9.9) and (9.10) in (9.8), we can write the matrix equation:
A y = k2 B y (9.11)
where A = { aij} , B = { bij} and y = { yj} is the vector of nodal values.



Equation (9.11) is a matrix eigenvalue problem. The solution will give the eigenvalue k2
and the corresponding solution vector y (list of nodal values of the function y(x)).
The matrix elements are:
a_ij = ∫_a^b (dN_i/dx)(dN_j/dx) dx   and   b_ij = ∫_a^b N_i N_j dx        (9.12)
The shape functions N_i and N_j are only nonzero in the vicinity of nodes i and j respectively (see
Fig. 9.4). So, a_ij = b_ij = 0 if N_i and N_j do not overlap; that is, if j ≠ i–1, i, i+1. Then, the
matrices A and B are tridiagonal: they will have no more than 3 nonzero elements per row.


Fig. 9.4
Exercise 9.2:
Define generically the triangular functions Ni, integrate (9.12) and calculate the value of the matrix
elements aij and bij.
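A minimal sketch of the resulting assembly and eigenvalue solution, assuming equal elements of length h and using the standard closed-form element integrals for linear shape functions (deriving those values is the object of Exercise 9.2):

% Sketch: assemble A y = k^2 B y using first order (linear) elements of equal length
a = 0; b = 1; Ne = 20; h = (b-a)/Ne;  Np = Ne + 1;    % illustrative interval and mesh
A = zeros(Np);  B = zeros(Np);
ae = (1/h)*[1 -1; -1 1];                % element integrals of (dNi/dx)(dNj/dx)
be = (h/6)*[2 1; 1 2];                  % element integrals of Ni*Nj
for e = 1:Ne
    idx = [e, e+1];                     % global node numbers of element e
    A(idx,idx) = A(idx,idx) + ae;
    B(idx,idx) = B(idx,idx) + be;
end
A = A(2:end-1, 2:end-1);  B = B(2:end-1, 2:end-1);    % impose y(a) = y(b) = 0
k2 = sort(eig(A, B));                   % approximations to k^2, smallest first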

9.2 Two Dimensional Finite Elements
In this case, a two dimensional region Ω is subdivided into smaller pieces or ‘elements’ and
the subdivision satisfies the same properties as in the one-dimensional case; that is, the elements
cover completely the region of interest and there are no intersections (no overlapping). The most
common form of subdividing a 2D region is by using triangles. Quadrilateral elements are also
used and they have some useful properties, but by far the most common, versatile and easiest to use
are triangles with straight sides. There are well-developed methods to produce an appropriate
meshing (or subdivision) of a 2D region into triangles, and they have maximum flexibility to
accommodate intricate shapes of the region of interest Ω.
The process of calculation follows the same route as in the 1D case. Now, a function of two
variables u(x,y) is approximated by shape functions Nj(x,y) defined as interpolation functions over
one element (in this case a triangle). This approximation is given by:
u(x, y) ≈ Σ_{j=1}^{N} u_j N_j(x, y)        (9.13)
where N is the number of nodes in the mesh and the coefficients uj are the nodal values (unknown).
Nj(x,y) are the shape functions defined for every node of the mesh.



Fig. 9.5
Fig. 9.5 shows a rectangular region in the xy–plane subdivided into triangles, with the
corresponding piecewise planar approximation to a function u(x,y) plotted along the vertical axis.
Note that the approximation is composed of flat 'tiles' that fit exactly along the edges, so the
approximation is continuous over the entire region but its derivatives are not. The approximation
shown in the figure uses first order functions, that is, pieces of planes (flat tiles). Other
types are also possible but require defining more nodes in each triangle. (While a plane is totally
defined by 3 points – e.g. the nodal values – a second order surface, for example, will need 6 points
– for example the values at the 3 vertices and at 3 midside points.)
For a first order approximation, the function u(x,y) is approximated in each triangle by a
function of the form (first order in x and y):
ũ(x, y) = p + q x + r y = (1  x  y)(p, q, r)ᵀ        (9.14)
where p, q and r are constants with different values in each triangle. Similarly to the one
dimensional case, this function can be written in terms of shape functions (interpolation
polynomials):
u(x,y) ≈ ˜ u (x,y) = u1N1(x, y) + u2 N2 (x, y) + u3 N3(x, y) (9.15)
for a triangle with nodes numbered 1, 2 and 3, with coordinates (x1 , y1), (x2 , y2) and (x3 , y3).
The shape functions Ni are such that Ni = 1 at node i and 0 at all the others:
It can be shown that the function Ni satisfying this property is:
N_i(x, y) = (1/2A)(a_i + b_i x + c_i y)        (9.16)
where A is the area of the triangle and
a1 = x2y3 – x3y2 b1 = y2 – y3 c1 = x3 – x2
a2 = x3y1 – x1y3 b2 = y3 – y1 c2 = x1 – x3
a3 = x1y2 – x2y1 b3 = y1 – y2 c3 = x2 – x1


(See demonstration in the Appendix.)
The function N1 defined in (9.16) and shown below corresponds to the shape function
(interpolation function) for the node numbered 1 in the triangle shown. Now, the same node can
be a vertex of other neighbouring triangles in which we will also define a corresponding shape
function for node 1 (with expressions like (9.16) but with different values of the constants a, b and
c – different orientation of the plane), building up the complete shape function for node 1: N1.
This is shown in the next figure for a node number i, which belongs to five triangles.



Fig. 9.6
Joining all the facets of Ni, for each of the triangles that contain node i, we can refer to this
function as Ni(x,y) and then, considering all the nodes of the mesh, the function u can be written
as:
u(x, y) = Σ_{j=1}^{N} u_j N_j(x, y)        (9.17)
We can now use this expansion for substitution in a variational expression or in the weighted
residuals expression, to obtain the corresponding matrix problem for the expansion coefficients.
An important advantage of this method is that these coefficients are precisely the nodal values of
the wanted function u, so the result is obtained immediately when solving the matrix problem. An
additional advantage is the sparsity of the resultant matrices.


9.3 Matrix Sparsity
Considering the global shape function for node i, shown in the Fig. 9.6, we can see that it is
zero in all other nodes of the mesh. If we now consider the global shape function corresponding
to another node, say, j, we can see that products of the form Ni Nj or of derivatives of these
functions, which will appear in the definition of the matrix elements (as seen in the 1–D case), will
almost always be zero, except when the nodes i and j are either the same node or immediate
neighbours so there is an overlap (see figure above). This implies that the corresponding matrices
will be very sparse, which is very convenient in terms of computer requirements.
Example:
If we consider a simple mesh:

The matrix sparsity pattern results:



Example:
For the problem of finding the potential distribution in the square coaxial, or equivalently, the
temperature distribution (in steady state) between the two square section surfaces, an appropriate
variational expression is:


J = ∫_Ω (∇φ)² dΩ        (9.18)
and substituting (9.17):
J = ∫_Ω [ ∇( Σ_{j=1}^{N} φ_j N_j ) ]² dx dy = ∫_Ω [ Σ_{j=1}^{N} φ_j ∇N_j ]² dx dy
which can be re-written as:
J = ∫_Ω ( Σ_{i=1}^{N} φ_i ∇N_i ) · ( Σ_{j=1}^{N} φ_j ∇N_j ) dx dy
or




J = Σ_{i=1}^{N} Σ_{j=1}^{N} φ_i φ_j ∫_Ω ∇N_i · ∇N_j dx dy = Φᵀ A Φ        (9.19)
where A = {a_ij},   a_ij = ∫_Ω ∇N_i · ∇N_j dx dy   and   Φ = {φ_j}
Note that again, the coefficients aij can be calculated and the only unknowns are the nodal values
φj.
We now have to find the stationary value; that is, put:
∂J/∂φ_i = 0   for i = 1, ... , N
∂J/∂φ_i = 0 = 2 Σ_{j=1}^{N} a_ij φ_j   for i = 1, ... , N,   then:   A Φ = 0        (9.20)
Then, the resultant equation is (9.20): AΦ = 0. We need to evaluate first the elements of
the matrix A. For this, we can consider the integral over the complete domain Ω as the sum of
the integrals over each element Ω k, k = 1, ... , Ne (Ne elements in the mesh).
a_ij = ∫_Ω ∇N_i · ∇N_j dx dy = Σ_{k=1}^{Ne} ∫_{Ωk} ∇N_i · ∇N_j dx dy        (9.21)
Before calculating these values, let's consider the matrix sparsity; that is, let's see which
elements of A are actually nonzero. As discussed earlier, a_ij will only be nonzero if the
nodes i and j are both in the same triangle. In this way, the sum in (9.21) will only extend to at
most two triangles for each combination of i and j.
Inside one triangle, the shape function N_i(x, y), defined for the node i, is:
N_i(x, y) = (1/2A)(a_i + b_i x + c_i y);   then:   ∇N_i = (∂N_i/∂x) x̂ + (∂N_i/∂y) ŷ = (1/2A)(b_i x̂ + c_i ŷ)
And then:
a_ij = Σ_{k=1}^{Ne} (1/4A_k²)(b_i b_j + c_i c_j) ∫_{Ωk} dx dy = Σ_{k=1}^{Ne} (1/4A_k)(b_i b_j + c_i c_j)        (9.22)
where the sum will only have a few terms (for those values of k corresponding to the triangles
containing nodes i and j). The values of A_k, the area of the triangle, and b_i, b_j, c_i and c_j will be
different for each triangle concerned.
In particular, considering for example the element a_47 corresponding to the mesh in the
previous figure, the sum extends over the triangles containing nodes 4 and 7; that is, triangles
number 5 and 6:
a_47 = (1/4A_5)( b_4^(5) b_7^(5) + c_4^(5) c_7^(5) ) + (1/4A_6)( b_4^(6) b_7^(6) + c_4^(6) c_7^(6) )
In this case, the integral in (9.22) reduces simply to the area of the triangle and the
calculations are easy. However, in other cases the integration can be complicated because they
have to be done over the triangle, which is at any position and orientation in the x–y plane.
For example, solving a variational expression like:




J = k² = [ ∫_Ω (∇φ)² dΩ ] / [ ∫_Ω φ² dΩ ]   (corresponding to (9.3) for 1D problems)        (9.23)
This results in an eigenvalue problem A Φ = k² B Φ, where the matrix B has the elements:
b_ij = ∫_Ω N_i N_j dΩ        (9.24)
The elements of A are the same as in (9.21).
Exercise 9.3:
Apply the Rayleigh-Ritz procedure to expression (9.23) and show that indeed the matrix elements
aij and bij have the values given in (9.21) and (9.24).

For the case of integrals like those in (9.24):   b_ij = ∫_Ω N_i N_j dΩ = Σ_{k=1}^{Ne} ∫_{Ωk} N_i N_j dΩ
The sum is over all triangles containing nodes i and j. For example for the element 5,6 in the
previous mesh, these will be triangles 2 and 7 only.
If we take triangle 7, its contribution to this element is the term:
b_56 = (1/4A_7²) ∫_{Ω7} (a_5 + b_5 x + c_5 y)(a_6 + b_6 x + c_6 y) dx dy
or
b_56 = (1/4A_7²) ∫_{Ω7} [ a_5 a_6 + (a_5 b_6 + a_6 b_5) x + (a_5 c_6 + a_6 c_5) y + b_5 b_6 x²
        + (b_5 c_6 + b_6 c_5) x y + c_5 c_6 y² ] dx dy
Integrals like these need to be calculated for every pair of nodes in every triangle of the mesh.
These calculations can be cumbersome if attempted directly. However, it is much simpler to use
a transformation of coordinates into a local system. This has the advantage that the integrals can
be calculated for just one model triangle and then the result converted back to the x and y
coordinates. The most common system of local coordinates used for this purpose is the triangle
area coordinates.
9.4 Triangle Area Coordinates

For a triangle as in Fig. 9.7, any point inside can
be specified by coordinates x and y or for
example, by the coordinates ξ1 and ξ2.
These coordinates are defined with value 1
at one node and zero at the opposite side, varying
linearly between these limits. For the sake of
symmetry, we can define 3 coordinates
(ξ1, ξ2, ξ3), although obviously only two are
independent.
A formal definition of these coordinates



can be made in terms of the area of the triangles
shown in Fig. 9.7. Fig. 9.7 Triangle area coordinates
The advantage of using this local coordinate system is that the actual shape of the triangle is
not important. Any point inside is defined by its proportional distance to the sides, irrespective of
triangle shape. In this way, calculations can be made in model triangle using this system and then,
mapped back to the global coordinates.

If A1, A2 and A3 are the areas of the triangles
formed by P and two of the vertices of the
triangle, the area coordinates are defined as the
ratio:
ξ_i = A_i / A
where A is the area of the triangle.
Note that the triangle marked with the
dotted line has the same area as A1 (same base,
same height), so the line of constant ξ1 is the one
marked in Fig. 9.8.

Fig. 9.8

Since A1 + A2 + A3 = A we have that ξ1 + ξ2 + ξ3 =1, which shows their linear dependence.
The area of each of these triangles, for example A_1, can be calculated using (see Appendix):
A_1 = (1/2) det[ 1  x  y ; 1  x_2  y_2 ; 1  x_3  y_3 ]
where x and y are the coordinates of point P.
Evaluating this determinant gives:   A_1 = (1/2)[ (x_2 y_3 − x_3 y_2) + (y_2 − y_3) x + (x_3 − x_2) y ]
and using the definitions in the Appendix:
A_1 = (1/2)(a_1 + b_1 x + c_1 y)
Then,   ξ_1 = A_1 / A = (1/2A)(a_1 + b_1 x + c_1 y)   or   ξ_1 = N_1(x, y)        (9.25)
We have then, that these coordinates vary in the same form as the shape functions, which is
quite convenient for calculations.
Expression (9.25) also gives us the required relationship between the local coordinates (ξ1, ξ2, ξ3)
and the global coordinates (x, y); that is, the expression we need to convert (x, y) into (ξ1, ξ2, ξ3).
We can also find the inverse relation, that is, (x, y) in terms of (ξ1, ξ2, ξ3). This is all we need
to convert from one system of coordinates to the other.
For the inverse relationship, we have from the Appendix:
(N_1  N_2  N_3) = (ξ_1  ξ_2  ξ_3) = (1  x  y) [ 1  x_1  y_1 ; 1  x_2  y_2 ; 1  x_3  y_3 ]⁻¹
from where:


(1  x  y) = (ξ_1  ξ_2  ξ_3) [ 1  x_1  y_1 ; 1  x_2  y_2 ; 1  x_3  y_3 ]
and expanding:
1 = ξ_1 + ξ_2 + ξ_3
x = x_1 ξ_1 + x_2 ξ_2 + x_3 ξ_3        (9.26)
y = y_1 ξ_1 + y_2 ξ_2 + y_3 ξ_3
Equations (9.25) and (9.26) will allow us to change from one system to the other.
Finally, the evaluation of integrals can be made now in terms of the local coordinates in the
usual way:
∫_{Ωk} f(x, y) dx dy = ∫_{Ωk} f(ξ_1, ξ_2, ξ_3) |J| dξ_1 dξ_2
where J is the Jacobian of the transformation. In this case this is simply 2A, so the expression
we need to use to transform integrals is:
∫_{Ωk} f(x, y) dx dy = 2A ∫_{Ωk} f(ξ_1, ξ_2, ξ_3) dξ_1 dξ_2        (9.27)
Example:
The integral (9.24) of the previous exercise is difficult to calculate in terms of x and y for an arbitrary triangle, and would need to be calculated separately for each pair of nodes in each triangle of the mesh. Transforming to (local) area coordinates this is much simpler:
∫_Ωk Ni Nj dx dy = 2A ∫_Ωk ξi ξj dξ1 dξ2 (9.28)
To determine the limits of integration, we can see
that the total area of the triangle can be covered by
ribbons like that shown in Fig. 9.9, with a width dξ1
and length ξ2. Now, we can note that at the left end
of this ribbon, ξ3 = 0, so ξ2 = 1–ξ1. Moving the
ribbon from the side 2–3 of the triangle to the vertex
1 (ξ1 changing from 0 to 1) will cover the complete
triangle, so the limits of integration must be:

ξ2: 0 ––> 1 – ξ1 and
ξ1: 0 ––> 1



Fig. 9.9
So the integral of (9.28) becomes:

∫_Ωk Ni Nj dx dy = 2A ∫_0^1 ∫_0^(1−ξ1) ξi ξj dξ2 dξ1 (9.29)
Note that the above conclusion about integration limits is valid for any integral over the
triangle, not only the one used above.
We can now calculate these integrals. Taking first the case where i ≠ j:


a) choosing i = 1 and j = 2 (This is an arbitrary choice – you can check that any other choice,
e.g. 1,3 will give the same result).
Iij = 2A ∫_0^1 ∫_0^(1−ξ1) ξ1 ξ2 dξ2 dξ1 = A ∫_0^1 ξ1 (1 − ξ1)² dξ1 = A [ 1/2 − 2/3 + 1/4 ] = A/12

b) For i = j and choosing i = 1:
Iii = 2A ∫_0^1 ∫_0^(1−ξ1) ξ1² dξ2 dξ1 = 2A ∫_0^1 ξ1² (1 − ξ1) dξ1 = 2A [ 1/3 − 1/4 ] = A/6

Once calculated in this form, the result can be used for any triangle irrespective of the shape
and position. We can see that for this integral, only the area A will change when applied to different
triangles.
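As a small illustration (our own sketch, not part of the original listings; the function name is ours), the local matrix of integrals of NiNj over a first order triangle can be filled directly from these two results, since only the area A changes from triangle to triangle:

function M = LocalMassMatrix(x, y)
% Local matrix with entries integral of Ni*Nj over a first order triangle,
% using the results A/6 (diagonal) and A/12 (off-diagonal) obtained above.
% x, y: 3-element vectors with the nodal coordinates.
A = 0.5*abs(det([ones(3,1) x(:) y(:)]));   % triangle area, as in (A9.2)
M = (A/12)*(ones(3) + eye(3));             % A/6 on the diagonal, A/12 elsewhere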

9.5 Higher Order Shape Functions
In most cases involving second order differential equations, the resulting weighted residuals expression or the corresponding variational expression can be written without involving second order derivatives. In these cases, first order shape functions will be fine. However, if this is not possible, other types of elements/shape functions must be used. (Note that second order derivatives of a first order function will be zero everywhere.) We can also choose to use different shape functions even if we could use first order polynomials, for example, to get higher accuracy with fewer elements.
9.6 Second Order Shape Functions
To define a first order polynomial in x and y (planar shape functions) we needed 3 degrees of freedom (the 3 nodal values of u completely define the approximation). If we use second order shape functions, we need to fit a quadratic surface over each triangle. To do this uniquely, we need six degrees of freedom: either the 3 nodal values and the derivatives of u at those points, or the values at six points on the triangle. (This is analogous to fitting a curve over an interval: while a straight line is completely defined by 2 points, a second order curve needs 3.) Choosing the easier option, we form triangles with 6 points: the 3 vertices and 3 mid-side points:



Fig. 9.10 shows a triangle with 6 points identified by
their triangle coordinates.

We now need to specify the shape functions. If we take first the function N1(ξ1, ξ2, ξ3), corresponding to node 1, we know that it should be 1 there and zero at every other node. That is, it must be 0 at ξ1 = 0 (nodes 2, 3 and 5) and also at ξ1 = 1/2 (nodes 4 and 6). Then, we can simply write:

N1 = ξ1 (2ξ1 − 1) (9.30)
Fig. 9.10



In the same form, we can write the corresponding shape functions for the other vertices
(nodes 2 and 3). For the mid-side nodes, for example node 4, we have that N4 should be zero at
all nodes except 4 where its value is 1. We can see that all other nodes are either on the side 2–3
(where ξ1 = 0) or on the side 3–1 (where ξ2 = 0). So the function N4 should be:
N4 = 4ξ1ξ2 (9.31)
Fig. 9.11 shows the two types of second order shape functions, one for vertices and one for
mid-side points.

Fig. 9.11a

Fig. 9.11b
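For reference, the six functions can be evaluated together in a short routine of the following kind (a sketch with our own naming; we assume, consistently with (9.30)-(9.31), that nodes 4, 5 and 6 are the mid-points of sides 1-2, 2-3 and 3-1 respectively):

function N = ShapeFun2(xi1, xi2)
% Second order shape functions in triangle area coordinates.
% Nodes 1-3 are the vertices; 4, 5, 6 the mid-points of sides 1-2, 2-3, 3-1.
xi3 = 1 - xi1 - xi2;                 % the three area coordinates are dependent
N = [xi1*(2*xi1-1), xi2*(2*xi2-1), xi3*(2*xi3-1), ...
     4*xi1*xi2, 4*xi2*xi3, 4*xi3*xi1];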
Exercise 9.4:
For the mesh of second order triangles of the figure below, find the corresponding sparsity
pattern.

Exercise 9.5:
Use reasoning similar to that used to define the second order shape functions (9.30)-(9.31) to find the third order shape functions in terms of triangle area coordinates.
Note that in this case there are 3 different types of functions.





APPENDIX

1. Taylor theorem
For a continuous function we have ∫_a^x f'(t) dt = f(x) − f(a), then we can write:

f(x) = f(a) + ∫_a^x f'(t) dt     or     f(x) = f(a) + R0(x) (A1.1)

where the remainder is R0(x) = ∫_a^x f'(t) dt.
We can now integrate R0 by parts using:

u = f'(t),   du = f''(t) dt
dv = dt,     v = −(x − t)

giving:

R0(x) = ∫_a^x f'(t) dt = −f'(t)(x − t) |_a^x + ∫_a^x (x − t) f''(t) dt

which gives, after evaluating and substituting in (A1.1):

f(x) = f(a) + f'(a)(x − a) + ∫_a^x (x − t) f''(t) dt     or

f(x) = f(a) + f'(a)(x − a) + R1(x) (A1.2)
We can also integrate R1 by parts, using this time:

u = f''(t),        du = f'''(t) dt
dv = (x − t) dt,   v = −(x − t)²/2

which gives:

R1(x) = ∫_a^x (x − t) f''(t) dt = f''(a)(x − a)²/2 + ∫_a^x ((x − t)²/2) f'''(t) dt

and again, after substituting in (A1.2):

f(x) = f(a) + f'(a)(x − a) + f''(a)(x − a)²/2 + ∫_a^x ((x − t)²/2) f'''(t) dt

Proceeding in this way we get the expansion:

f(x) = f(a) + f'(a)(x − a) + f''(a)(x − a)²/2 + f'''(a)(x − a)³/3! + … + f^(n)(a)(x − a)^n/n! + Rn (A1.3)

where the remainder can be written as:

Rn(x) = ∫_a^x ((x − t)^n/n!) f^(n+1)(t) dt (A1.4)
To find a more useful form for the remainder we need to invoke some general mathematical theorems:


First Theorem for the Mean Value of Integrals
If the function g(t) is continuous and integrable in the interval [a, x], then there exists a point ξ between a and x such that:

∫_a^x g(t) dt = g(ξ)(x − a)

This simply says that the integral can be represented by an average value of the function, g(ξ), times the length of the interval. This average value must lie between the minimum and maximum values of the function in that interval, and since the function is continuous, there is a point ξ at which the function takes this value.
And in a more general form:

Second Theorem for the Mean Value of Integrals
Now if the functions g and h are continuous and integrable in the interval and h does not change sign in the interval, there exists a point ξ between a and x such that:

∫_a^x g(t) h(t) dt = g(ξ) ∫_a^x h(t) dt
If we now use this second theorem on expression (A1.4), with g(t) = f^(n+1)(t) and h(t) = (x − t)^n/n!, we get:

Rn(x) = f^(n+1)(ξ) ∫_a^x ((x − t)^n/n!) dt

which can be integrated, giving:

Rn(x) = f^(n+1)(ξ) (x − a)^(n+1)/(n + 1)! (A1.5)

for the remainder of the Taylor expansion, where ξ is a point between a and x.
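As a quick numerical illustration (our own sketch, not part of the original notes), the bound (A1.5) can be checked in Matlab for the exponential function expanded about a = 0, where every derivative equals e^t, so the remainder is bounded by e^x x^(n+1)/(n+1)! for 0 < ξ < x:

% Truncated Taylor series of exp(x) about a = 0 and the bound from (A1.5).
x = 0.8; n = 4;
approx = sum(x.^(0:n)./factorial(0:n));    % f^(k)(0) = 1 for all k
err    = abs(exp(x) - approx);             % actual truncation error
bound  = exp(x)*x^(n+1)/factorial(n+1);    % (A1.5) with the maximum of f^(n+1) on [0, x]
fprintf('error = %.2e, bound = %.2e\n', err, bound)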



2. Implementation of Gaussian Elimination

Referring to the steps (5.14) - (5.16) to eliminate the non-zero entries in the lower triangle
of the matrix, we can see that (in step (5.15)) to eliminate the unknown x1 from equation i (i = 2 –
N); – that is, to eliminate the matrix elements ai1, we need to subtract from eqn. i equation 1 times
ai1/a11. With this, all the elements of column 1 except a11 will be zero. After that, to remove x2
from equations 3 to N (or elements ai2 , i = 3 – N as in (5.16)), we need to scale the new equation
2 by the factor ai2/a22 and subtract it from row i, i = 3 – N. The pattern now repeats in the same
form, so the steps for triangularizing matrix A (keeping just the upper triangle) can be implemented
by the following code (written here in Matlab)1:

for k=1:n-1
   v(k+1:n)=A(k+1:n,k)/A(k,k);      % find multipliers
   for i=k+1:n
      A(i,k+1:n)=A(i,k+1:n)-v(i)*A(k,k+1:n);
   end
end

In fact, this can be simplified eliminating the second loop, by noting that all operations on
rows can be performed simultaneously. The simpler version of the code is then:

for k=1:n-1
   v(k+1:n)=A(k+1:n,k)/A(k,k);      % find multipliers
   A(k+1:n,k+1:n)=A(k+1:n,k+1:n)-v(k+1:n)*A(k,k+1:n);
end
U=triu(A);   % This function simply puts zeros in the lower triangle

The factorization is completed by calculating the lower triangular matrix L. The complete procedure can be implemented as follows:

function [L,U] = GE(A)
%
% Computes LU factorization of matrix A
% input: matrix A
% output: matrices L and U
%
[n,n]=size(A);
for k=1:n-1
   A(k+1:n,k)=A(k+1:n,k)/A(k,k);    % find multipliers
   A(k+1:n,k+1:n)=A(k+1:n,k+1:n)-A(k+1:n,k)*A(k,k+1:n);
end
L=eye(n,n) + tril(A,-1);
U=triu(A);

This function can be used to find the LU factors of a matrix A using dense storage. The
function eye(n,n) returns the identity matrix of order n and tril(A,-1) gives a lower triangular
matrix with the elements of A in the lower triangle, excluding the diagonal, and setting all others

1 Note that if the problem is actually solved in Matlab there is no need to perform all the steps described here. The
command x=A\b will give the solution of Ax=b using LU factorization if there is not a simpler method.


to zero (lij = aij if j ≤ i−1, and 0 otherwise).
The solution of a linear system of equations Ax = b is completed (after the LU factorization of the matrix A is performed) with a forward substitution using the matrix L, followed by a backward substitution using the matrix U. Forward substitution consists of finding the first unknown x1 directly as b1/l11, substituting this value in the second equation to find x2, and so on. This can be implemented by:

function x = LTriSol(L,b)
%
% Solves the triangular system Lx = b by forward substitution
%
n=length(b);
x=zeros(n,1);                        % a vector of zeros to start
for j=1:n-1
   x(j)=b(j)/L(j,j);
   b(j+1:n)=b(j+1:n)-x(j)*L(j+1:n,j);
end
x(n)=b(n)/L(n,n);

Backward substitution can be implemented in a similar form; this time the unknowns are found from the end upwards:

function x = UTriSol(U,b)
%
% Solves the triangular system Ux = b by backward substitution
%
n=length(b);
x=zeros(n,1);
for j=n:-1:2                         % from n to 2 one by one
   x(j)=b(j)/U(j,j);
   b(1:j-1)=b(1:j-1)-x(j)*U(1:j-1,j);
end
x(1)=b(1)/U(1,1);

With these functions the solution of the system of equations Ax = b can be performed in
three steps by the code:
[L,U] = GE(A);
y = LTriSol(L,b);
x = UTriSol(U,y);
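As a quick (hypothetical) check of these routines, a small test system can be factorized and the result compared with Matlab's backslash operator; the matrix below is made strongly diagonal since GE, as written, does not use pivoting:

% Simple test of GE, LTriSol and UTriSol against the backslash operator.
n = 5;
A = rand(n) + n*eye(n);        % strong diagonal, so no pivoting is needed
b = rand(n,1);
[L,U] = GE(A);
y = LTriSol(L,b);
x = UTriSol(U,y);
fprintf('max difference from A\\b: %.2e\n', norm(x - A\b, inf))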


Exercise
Use the functions GE, LTriSol and UTriSol to solve the system of equations generated by the finite difference modelling of the square coaxial structure, given by equation (7.30). You will first have to complete the matrix A, given only schematically in (7.30), after applying the boundary conditions. Note that, because of the geometry of the structure, not all rows will have the same pattern.
To input the matrix it might be useful to start with the Matlab command:



A = triu(tril(ones(n,n),1),-1)-5*eye(n,n)

This will generate a tridiagonal matrix of order n with –4 in the main diagonal and 1 in the two
subdiagonals. After that you will have to adjust any differences between this matrix and A.
Compare the results with those obtained by Gauss-Seidel.


3. Solution of Finite Difference Equations by Gauss-Seidel

Equation (7.30) is the matrix equation derived by applying the finite differences method to the Laplace equation. Solving this system of equations by the method of Gauss-Seidel, as shown in (5.18) and (5.19), consists of taking the original simultaneous equations, putting all diagonal matrix terms on the left-hand side as in (5.19) and effectively putting them in a DO loop (after initializing the 'starting vector').

Equation (7.25) is the typical equation, and putting the diagonal term on the left-hand side gives:

φO = (1/4)(φN + φS + φE + φW)

We could write 56 lines of code (one per equation) or, even simpler, use subscripted variables inside a loop. In this case the elements of A are all either 0, +1 or −4, and are easily "generated during the algorithm" rather than actually stored in an array. This simplifies the computer program, and instead of A, the only array needed holds the current values of the vector elements: x^T = (φ1, φ2, ..., φ56).

The program can be simplified further by keeping the vector of 56 unknowns x in a 2-D array z(11,11), to be identified spatially with the 2-D Cartesian coordinates of the physical problem (see figure). For example z(3,2) stores the value of φ10. There is obviously scope to improve efficiency since, with this arrangement, we also store values corresponding to nodes with known, fixed voltage values, including all those nodes inside the inner conductor. None of these actually need to be stored, but doing so makes the program simpler. In this case, the program (in old Fortran 77) can be as simple as:

c Gauss-Seidel to solve Laplace between square inner & outer
c conductors. Note that this applies to any potential problem,
c including potential (voltage) distribution or steady state
c current flow or heat flow.
c
      dimension z(11,11)
      data z/121*0./
      do i=4,8
        do j=4,8
          z(i,j)=1.
        enddo
      enddo
c
      do n=1,30
        do i=2,10
          do j=2,10
            if(z(i,j).lt.1.) z(i,j)=.25*(z(i-1,j)+z(i+1,j)+
     $        z(i,j-1)+z(i,j+1))
          enddo
        enddo
c in each iteration print values in the 3rd row
        write(*,6) n,(z(3,j),j=1,11)
      enddo
c
      write(*,7) z
    6 format(1x,i2,11f7.4)
    7 format('FINAL RESULTS=',//(/1x,11f7.4))
      stop
      end

In the program, first z(i,j) is initialized with zeros, then the values corresponding to the inner conductor are set to one (1 V). After this, the iterations start (to a maximum of 30) and the Gauss-Seidel equations are solved.
In order to check the convergence, the values of the potentials in one intermediate row (the third) are printed after every iteration. We can see that after about 20 iterations the values change only in the last decimal place, and from iteration 25 onwards they no longer change (within 4 decimals). Naturally, a more efficient monitoring of convergence can be implemented whereby the changes are monitored, either on a point-by-point basis or as the norm of the difference between successive iterations, and the iterations are stopped when this value falls below a prescribed precision.
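In Matlab, such a norm-based stopping test could look like the following sketch (our own illustration, using the same 11 x 11 grid and 1 V inner conductor as the Fortran program):

% Gauss-Seidel for the square coaxial problem with a norm-based stopping test.
z = zeros(11,11);
z(4:8,4:8) = 1;                          % inner conductor at 1 V, outer at 0 V
tol = 1e-4;
for n = 1:200
   zold = z;
   for i = 2:10
      for j = 2:10
         if z(i,j) < 1                   % skip nodes fixed at 1 V
            z(i,j) = 0.25*(z(i-1,j)+z(i+1,j)+z(i,j-1)+z(i,j+1));
         end
      end
   end
   if max(abs(z(:)-zold(:))) < tol, break, end
end
fprintf('converged after %d iterations\n', n)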
The results are:


 1 0.0000 0.0000 0.0000 0.2500 0.3125 0.3281 0.3320 0.3330 0.0833 0.0208 0.0000
 2 0.0000 0.0000 0.1250 0.3750 0.4492 0.4717 0.4785 0.4181 0.1895 0.0699 0.0000
 3 0.0000 0.0469 0.2109 0.4473 0.5225 0.5472 0.5399 0.4737 0.2579 0.1098 0.0000
 4 0.0000 0.0908 0.2690 0.4922 0.5653 0.5865 0.5742 0.5082 0.3016 0.1369 0.0000
 5 0.0000 0.1229 0.3076 0.5205 0.5902 0.6084 0.5944 0.5299 0.3293 0.1542 0.0000
 6 0.0000 0.1446 0.3326 0.5381 0.6047 0.6211 0.6067 0.5435 0.3465 0.1650 0.0000
 7 0.0000 0.1586 0.3484 0.5488 0.6133 0.6288 0.6143 0.5519 0.3572 0.1716 0.0000
 8 0.0000 0.1675 0.3582 0.5553 0.6184 0.6334 0.6190 0.5572 0.3637 0.1756 0.0000
 9 0.0000 0.1729 0.3641 0.5592 0.6215 0.6362 0.6218 0.5604 0.3676 0.1780 0.0000
10 0.0000 0.1763 0.3678 0.5616 0.6234 0.6379 0.6236 0.5624 0.3700 0.1794 0.0000
11 0.0000 0.1783 0.3700 0.5631 0.6245 0.6390 0.6247 0.5636 0.3713 0.1802 0.0000
12 0.0000 0.1795 0.3713 0.5639 0.6253 0.6396 0.6254 0.5643 0.3722 0.1807 0.0000
13 0.0000 0.1802 0.3721 0.5645 0.6257 0.6400 0.6258 0.5647 0.3727 0.1810 0.0000
14 0.0000 0.1807 0.3726 0.5648 0.6259 0.6403 0.6260 0.5649 0.3729 0.1812 0.0000
15 0.0000 0.1810 0.3729 0.5650 0.6261 0.6404 0.6261 0.5651 0.3731 0.1813 0.0000
16 0.0000 0.1811 0.3731 0.5651 0.6262 0.6405 0.6262 0.5652 0.3732 0.1813 0.0000
17 0.0000 0.1812 0.3732 0.5652 0.6263 0.6406 0.6263 0.5652 0.3733 0.1813 0.0000
18 0.0000 0.1813 0.3732 0.5652 0.6263 0.6406 0.6263 0.5652 0.3733 0.1814 0.0000
19 0.0000 0.1813 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
20 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
21 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
22 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
23 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
24 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
25 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
26 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
27 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
28 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
29 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
30 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000

FINAL RESULTS=

0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0907 0.1848 0.2615 0.2994 0.3099 0.2994 0.2615 0.1814 0.0907 0.0000


0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
0.0000 0.2615 0.5653 1.0000 1.0000 1.0000 1.0000 1.0000 0.5653 0.2615 0.0000
0.0000 0.2994 0.6263 1.0000 1.0000 1.0000 1.0000 1.0000 0.6263 0.2994 0.0000
0.0000 0.3099 0.6406 1.0000 1.0000 1.0000 1.0000 1.0000 0.6406 0.3099 0.0000
0.0000 0.2994 0.6263 1.0000 1.0000 1.0000 1.0000 1.0000 0.6263 0.2994 0.0000
0.0000 0.2615 0.5653 1.0000 1.0000 1.0000 1.0000 1.0000 0.5653 0.2615 0.0000
0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
0.0000 0.0907 0.1848 0.2615 0.2994 0.3099 0.2994 0.2615 0.1814 0.0907 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000




4. Matrix – vector products
Some expressions that often need to be evaluated are the products between a matrix and vectors.
For example, if the matrix A is defined with elements aij, that is A = {aij}, and the vector b = {bj}, the product Ab = c is the vector c = {ci} where

ci = Σ_(j=1)^N aij bj


In matrix form:

| a11  …  a1N | | b1 |   | c1 |
|  …   aij  … | | …  | = | …  |
| aN1  …  aNN | | bN |   | cN |

The quadratic form (scalar) s = x^T A x, for A = {aij} and x = {xj}, also appears frequently.

We can call the product Ax by c, calculated as before, so c = {ci} with ci = Σ_(j=1)^N aij xj.

Pre-multiplying the vector c by x^T gives:

s = x^T c = Σ_(i=1)^N xi ci     (equivalent to the dot product)

And replacing the expression for c:

s = Σ_(i=1)^N xi Σ_(j=1)^N aij xj = Σ_(i=1)^N Σ_(j=1)^N xi aij xj
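These index expressions translate directly into loops; the short sketch below (ours) checks them against Matlab's built-in matrix operations:

% Matrix-vector product and quadratic form written as explicit sums.
N = 4;
A = rand(N);  x = rand(N,1);  b = rand(N,1);
c = zeros(N,1);  s = 0;
for i = 1:N
   for j = 1:N
      c(i) = c(i) + A(i,j)*b(j);         % c_i = sum over j of a_ij b_j
      s    = s    + x(i)*A(i,j)*x(j);    % s = sum over i,j of x_i a_ij x_j
   end
end
disp(norm(c - A*b))                       % both differences should be ~0
disp(abs(s - x'*A*x))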




5. Error residual in gradient methods
In the solution of matrix problems using gradient methods it is convenient to minimise the functional (5.24)

h² = (1/2) x^T A x − x^T y (A5.1)

instead of the error residual. Here we demonstrate that minimising (A5.1) is equivalent to solving the system of equations Ax = y if A is symmetric and positive definite.
Consider the change of h² when x varies along the line x + αp, for a real α and a vector p of fixed direction:

h²(x + αp) = (1/2)(x + αp)^T A (x + αp) − (x + αp)^T y

h²(x + αp) = (1/2) x^T A x + (α/2) x^T A p + (α/2) p^T A x + (α²/2) p^T A p − x^T y − α p^T y

If A is symmetric (A^T = A) then x^T A p = p^T A x and we can write:

h²(x + αp) = ( (1/2) x^T A x − x^T y ) − α p^T (y − A x) + (α²/2) p^T A p

The first terms on the right hand side correspond to h²(x):

h²(x + αp) = h²(x) − α p^T (y − A x) + (α²/2) p^T A p

where the last term is always positive (A is positive definite).

We can see that h² is a quadratic function of α with a positive leading coefficient and, as such, it has a minimum along the line x + αp. The value of α that gives the minimum is found from d h²(x + αp)/dα = 0 and is:

α̂ = p^T (y − A x) / (p^T A p)     (see Exercise 5.1)

The minimum value of h² along this line is then:

h²(x + α̂ p) = h²(x) − [ p^T (y − A x) ]² / (2 p^T A p)

where the subtracted term is positive whenever p^T (y − A x) ≠ 0, that is, whenever p is not orthogonal to the residual r = y − Ax. Let's consider the two possibilities:

i) x is such that Ax = y. Then h²(x + α̂ p) = h²(x) for any direction p, and this is the minimum value.
ii) x is such that Ax ≠ y. Then a direction p can be found with p^T (y − A x) ≠ 0, so that h²(x + α̂ p) < h²(x), and h²(x) is not the minimum.

In consequence, minimising h²(x) is equivalent to finding the solution of Ax = y.
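This result is the basis of the steepest descent iteration, in which the search direction is taken as the residual itself, p = r = y − Ax, and the optimum α is computed from the formula above. A minimal Matlab sketch (ours, assuming A is symmetric and positive definite):

function x = SteepestDescent(A, y, x, niter)
% Minimises h^2 = 0.5*x'*A*x - x'*y by line searches along the residual.
% A is assumed symmetric and positive definite.
for k = 1:niter
   r = y - A*x;                 % residual (also minus the gradient of h^2)
   alpha = (r'*r)/(r'*A*r);     % optimum step: alpha = p'(y - Ax)/(p'Ap) with p = r
   x = x + alpha*r;
end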



6. Derivation of Second Order Runge Kutta
The second order Taylor expansion:

y_{i+1} = y_i + h f(t_i, y_i) + (h²/2) f'(t_i, y_i) + O(h³) (A6.1)

can be written as:

y_{i+1} = y_i + h f(t_i, y_i) + (h²/2) [ ∂f/∂t + f ∂f/∂y ]|_(t_i, y_i) + O(h³) (A6.2)

Rewriting this in the form:

y_{i+1} = y_i + h (c_0 k_1 + c_1 k_2) (A6.3)

where k_1 = f(t_i, y_i) and k_2 = f(t_i + ph, y_i + qh f(t_i, y_i)), gives:

y_{i+1} = y_i + h c_0 f(t_i, y_i) + h c_1 f(t_i + ph, y_i + qh f(t_i, y_i)) (A6.4)

Now, the function f with variations in t and y can be represented as a first order Taylor expansion in terms of the two variables as:

f(t_i + pΔt, y_i + qΔy) = f(t_i, y_i) + pΔt ∂f/∂t|_(t_i, y_i) + qΔy ∂f/∂y|_(t_i, y_i) + O(h²) (A6.5)

but the variations are Δt = h and Δy = Δt (dy/dt)|_(t_i, y_i) = h f(t_i, y_i) = k_1 h, so (A6.5) becomes:

f(t_i + ph, y_i + q k_1 h) = f(t_i, y_i) + ph ∂f/∂t|_(t_i, y_i) + q k_1 h ∂f/∂y|_(t_i, y_i) + O(h²) (A6.6)

Replacing (A6.6) in (A6.4) leads to:

y_{i+1} = y_i + h c_0 f(t_i, y_i) + h c_1 [ f(t_i, y_i) + ph ∂f/∂t|_(t_i, y_i) + q k_1 h ∂f/∂y|_(t_i, y_i) ] + O(h³) (A6.7)

which has the same form as (A6.2).
Equating terms between (A6.2) and (A6.7) gives the equations for the unknown coefficients in (A6.3):

c_0 + c_1 = 1
c_1 p = 1/2
c_1 q = 1/2 (A6.8)
There are three equations and four coefficients, so one of them can be chosen arbitrarily.
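Choosing, for instance, c_0 = c_1 = 1/2 and p = q = 1 gives Heun's (improved Euler) version of second order Runge-Kutta. A brief Matlab sketch of such a solver (our own illustration, for a scalar equation y' = f(t, y)):

function [t, y] = RK2Heun(f, t0, y0, h, nsteps)
% Second order Runge-Kutta with c0 = c1 = 1/2, p = q = 1 (Heun's method).
% f is a function handle f(t,y); t0, y0 is the initial condition.
t = t0 + h*(0:nsteps)';
y = zeros(nsteps+1,1);
y(1) = y0;
for i = 1:nsteps
   k1 = f(t(i), y(i));
   k2 = f(t(i)+h, y(i)+h*k1);
   y(i+1) = y(i) + h*(k1 + k2)/2;
end

For example, [t, y] = RK2Heun(@(t,y) -2*y, 0, 1, 0.1, 50) integrates y' = −2y from y(0) = 1.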



7. Derivation of a variational formulation

For a boundary value problem specified by the equation:

L u = s (A7.1)

where the operator L is self-adjoint†, the following expression is a corresponding variational functional:

J = ⟨L u, u⟩ − 2⟨s, u⟩ (A7.2)

Demonstration:
Allowing a variation or perturbation δu of the function u, we get a corresponding variation in J:

J + δJ = ⟨L(u + δu), u + δu⟩ − 2⟨s, u + δu⟩

Expanding the right hand side and keeping only first order perturbations (discarding terms with higher orders in δu):

J + δJ = ⟨L u, u⟩ + ⟨L u, δu⟩ + ⟨L δu, u⟩ − 2⟨s, u⟩ − 2⟨s, δu⟩

Using (A7.2) this leads to:

δJ = ⟨L u, δu⟩ + ⟨L δu, u⟩ − 2⟨s, δu⟩

Now, since L is self-adjoint, ⟨L δu, u⟩ = ⟨L u, δu⟩, so finally:

δJ = 2⟨L u, δu⟩ − 2⟨s, δu⟩ = 2⟨L u − s, δu⟩ (A7.3)

Seeking the stationary condition δJ = 2⟨L u − s, δu⟩ = 0, and since δu is arbitrary, this implies that equation (A7.1) is satisfied. That is, finding the function u that renders the functional J stationary is equivalent to finding the solution of (A7.1).
If the condition for self-adjointness of the operator L includes some boundary terms, these will also appear in (A7.3) and imply that, in addition to solving (A7.1), the function u will satisfy some boundary conditions related to these boundary terms.


† An operator L is said to be self-adjoint if it satisfies the condition: ⟨L u, v⟩ = ⟨u, L v⟩ + boundary terms


8. A variational formulation for the wave equation

Equation (8.20) shows the use of the variational method to find a solution to the one-dimensional wave equation. In this section, it will be shown that finding the stationary value of the functional (8.20) is equivalent to solving the corresponding differential equation.

Equation (8.20) states:

k² = s.v. of  [ ∫_a^b (dy/dx)² dx ] / [ ∫_a^b y² dx ] (A8.1)

Let's examine a variation of k² produced by a small perturbation δy on the solution y:

k²(y + δy) = k² + δk² = [ ∫_a^b (d(y + δy)/dx)² dx ] / [ ∫_a^b (y + δy)² dx ]

And re-writing:

(k² + δk²) ∫_a^b (y + δy)² dx = ∫_a^b (d(y + δy)/dx)² dx (A8.2)

Expanding and neglecting higher order variations:

(k² + δk²) ∫_a^b (y² + 2y δy) dx = ∫_a^b [ (dy/dx)² + 2 (dy/dx)(dδy/dx) ] dx

or

k² ∫_a^b y² dx + 2k² ∫_a^b y δy dx + δk² ∫_a^b y² dx = ∫_a^b (dy/dx)² dx + 2 ∫_a^b (dy/dx)(dδy/dx) dx (A8.3)

and now using (A8.1):

2k² ∫_a^b y δy dx + δk² ∫_a^b y² dx = 2 ∫_a^b (dy/dx)(dδy/dx) dx (A8.4)

Now, since we want k² to be stationary about the solution function y, we make δk² = 0 and examine what conditions this imposes on the function y:

δk² = 0 implies:   k² ∫_a^b y δy dx = ∫_a^b (dy/dx)(dδy/dx) dx (A8.5)

Integrating the RHS by parts:

k² ∫_a^b y δy dx = [ (dy/dx) δy ]_a^b − ∫_a^b δy (d²y/dx²) dx



Or re-arranging:

∫_a^b δy ( d²y/dx² + k² y ) dx = [ (dy/dx) δy ]_a^b (A8.6)

Since δy is arbitrary, (A8.6) can only be valid if both sides are zero. That means that the function y that makes the functional stationary should satisfy the differential equation:

d²y/dx² + k² y = 0 (A8.7)

Similarly, the boundary condition

dy/dx = 0  at  a  and  b (A8.8)

is satisfied since δy is arbitrary.

The equation (A8.7), resulting from forcing the functional to be stationary, is called the Euler equation of this functional, and the boundary condition (A8.8) is the natural boundary condition.

Summarizing, we can see that imposing the stationary condition on (A8.1) with respect to small variations of the function y leads to y satisfying the differential equation (A8.7), which is the wave equation, and the boundary condition (A8.8), that is, zero normal derivative (Neumann B.C.).
If a different boundary condition is desired it has to be imposed directly. For the Dirichlet boundary condition, that is, fixed values of y at the ends, fixing these values on the boundary implies that y is not allowed to vary there, and then δy = 0 on the boundary.
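As a quick numerical check (our own sketch), the quotient in (A8.1) can be evaluated for the function y = cos(πx) on [0, 1], which satisfies (A8.7) and the natural boundary condition (A8.8); the result should then reproduce the corresponding eigenvalue k² = π²:

% Evaluation of the quotient (A8.1) for y = cos(pi*x) on [0, 1].
x  = linspace(0, 1, 1001);
y  = cos(pi*x);
dy = gradient(y, x);                       % numerical dy/dx
k2 = trapz(x, dy.^2)/trapz(x, y.^2);
fprintf('k^2 = %.4f  (exact: %.4f)\n', k2, pi^2)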


9. Area of a Triangle

For a triangle with nodes 1, 2 and 3 with coordinates (x1, y1), (x2, y2) and (x3, y3):

The area of the triangle is: A = Area of rectangle − area(A) − area(B) − area(C)

Area of rectangle = (x2 − x1)(y3 − y2) = x2y3 + x1y2 − x1y3 − x2y2
Area (A) = (1/2)(x2 − x3)(y3 − y2) = (1/2)(x2y3 + x3y2 − x2y2 − x3y3)
Area (B) = (1/2)(x3 − x1)(y3 − y1) = (1/2)(x3y3 − x3y1 − x1y3 + x1y1)
Area (C) = (1/2)(x2 − x1)(y1 − y2) = (1/2)(x2y1 − x2y2 − x1y1 + x1y2)

Then, the area of the triangle is:

A = (1/2) [ (x2y3 − x3y2) + (x3y1 − x1y3) + (x1y2 − x2y1) ] (A9.1)

which can be written as:

A = (1/2) det | 1   x1   y1 |
              | 1   x2   y2 | (A9.2)
              | 1   x3   y3 |



10. Shape Functions (Interpolation Functions)

The function u is approximated in each triangle by a first order function (a plane). This will
be given by an expression of the form: p + qx + ry which can also be written as a vector product
(equation 9.14).
ũ(x, y) = p + qx + ry = (1  x  y) | p |
                                   | q | (A10.1)
                                   | r |

Evaluating this expression at each node of a triangle (with nodes numbered 1, 2 and 3):


u1 = p + q x1 + r y1
u2 = p + q x2 + r y2
u3 = p + q x3 + r y3

or, in matrix form:

| u1 |   | 1   x1   y1 | | p |
| u2 | = | 1   x2   y2 | | q | (A10.2)
| u3 |   | 1   x3   y3 | | r |

And from here we can calculate the value of the constants p, q and r in terms of the nodal values
and the coordinates of the nodes:


| p |   | 1   x1   y1 |^(−1) | u1 |
| q | = | 1   x2   y2 |      | u2 | (A10.3)
| r |   | 1   x3   y3 |      | u3 |

Replacing (A10.3) in (A10.1):

ũ(x, y) = (1  x  y) | 1   x1   y1 |^(−1) | u1 |
                     | 1   x2   y2 |      | u2 | (A10.4)
                     | 1   x3   y3 |      | u3 |

and comparing with (9.15), written as a vector product:

u(x, y) ≈ ũ(x, y) = u1 N1(x, y) + u2 N2(x, y) + u3 N3(x, y) = (N1  N2  N3) | u1 |
                                                                            | u2 | (A10.5)
                                                                            | u3 |
we have finally:

(N1  N2  N3) = (1  x  y) | 1   x1   y1 |^(−1)
                          | 1   x2   y2 | (A10.6)
                          | 1   x3   y3 |

Solving the right hand side (inverting the matrix and multiplying) gives the expression for each shape function (9.16):

Ni(x, y) = (1/(2A)) (ai + bi x + ci y) (A10.7)

where A is the area of the triangle and



a1 = x2y3 – x3y2 b1 = y2 – y3 c1 = x3 – x2
a2 = x3y1 – x1y3 b2 = y3 – y1 c2 = x1 – x3 (A10.8)
a3 = x1y2 – x2y1 b3 = y1 – y2 c3 = x2 – x1

Note that, from (A9.1) and (A10.8), the area of the triangle can be written as:

A = (1/2)(a1 + a2 + a3) (A10.9)
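These coefficients translate directly into a short routine; the sketch below (with our own naming, and assuming the nodes are numbered anticlockwise so that A > 0) computes the ai, bi, ci, the area A and the shape functions of (A10.7) at a point (x, y):

function [N, A] = ShapeFun1(xn, yn, x, y)
% First order shape functions of a triangle, equations (A10.7)-(A10.9).
% xn, yn: 3-element vectors of nodal coordinates; (x, y): evaluation point.
% Assumes anticlockwise node numbering so that A > 0.
a = [xn(2)*yn(3)-xn(3)*yn(2), xn(3)*yn(1)-xn(1)*yn(3), xn(1)*yn(2)-xn(2)*yn(1)];
b = [yn(2)-yn(3), yn(3)-yn(1), yn(1)-yn(2)];
c = [xn(3)-xn(2), xn(1)-xn(3), xn(2)-xn(1)];
A = 0.5*(a(1)+a(2)+a(3));               % area of the triangle, eq. (A10.9)
N = (a + b*x + c*y)/(2*A);              % N1, N2, N3 evaluated at (x, y)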

