程序辅导案例 > Program >

程序代写案例-PSL432

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

THEORETICAL
PHYSIOLOGY

PSL432
University of Toronto

DOUGLAS TWEED

i

Course outline

This course deals with the mathematical principles of
control and learning in the sensorimotor systems of the
brain.
The course director is Douglas Tweed, reachable at
[email protected] or 416-912-7497.

The teaching assistant is Ali Mojdeh, ali.mojdeh@
mail.utoronto.ca

Lectures are Tuesdays 2–4 p.m., office hours by ap-
pointment. For all these events, use the Zoom link an-
nounced on Quercus.
Format
We will run an inverted classroom: before each class
(except the first one), you should finish that week’s as-
signed readings from these handout notes. If anything in
the notes is unclear, you may ask questions about it in
class, and answer other students’ questions.
Materials
There is no text book. All course material will be cov-
ered in these handout notes and computer code posted
on Quercus.

ii

Evaluation
There will be no tests or exams. All course marks will
be based on 3 take-home assignments, worth 25%,
40%, and 35%. The assignments will be posted on Feb
1, Mar 1, and Mar 22, and will be due 2 weeks later, not
counting Reading Week, i.e. they will be due one mi-
nute before midnight on Feb 22, Mar 15, and Apr 5.
The penalty for late submission will be 5% (of the total
marks for that assignment) per day late. Assignments
may include calculations and proofs, but will mainly in-
volve writing computer simulations of sensorimotor
control and learning.
Simulations
You will write all your simulations in the Matlab pro-
gramming language. You can get Matlab for free
through
https://onesearch.library.utoronto.ca/ic/mathworks-soft-
ware-%E2%80%93-matlab-students
You need only Matlab itself, not any of its “toolboxes”.
Starting Jan 18, each class will end with a simulation
workshop, where you can apply the day’s new concepts
on your own computer, so make sure you have Matlab
installed by then.

iii

Schedule

Date Topic

Jan 11 Control Systems
Jan 18 Control Systems
Jan 25 Control Systems
Feb 1* Supervised Learning
Feb 8 Supervised Learning
Feb 15 Reading week T
Feb 22 Supervised Learning
Mar 1* Sensory Maps
Mar 8 Reinforcement Learning
Mar 15 Reinforcement Learning
Mar 22* Reinforcement Learning
Mar 29 Regularization
Apr 5 Regularization

* Days when problem sets will be assigned. You will
have 3 weeks for the first, and 2 weeks for the second
and third, i.e. they will be due one minute before mid-
night on Feb 22, Mar 15, and Apr 5.

iv

1

1 Control systems

In this course we will study how sensorimotor systems
in the brain learn to control the body. The mathematical
principles of learning and control have become clearer
in recent years thanks to research in artificial intelli-
gence and robotics, so we will borrow concepts from
these disciplines to help us understand the brain, and
conversely, we will consider how ideas from neurosci-
ence may benefit artificial systems.
State dynamics
Any sensorimotor task starts with an agent of some sort
embedded in an environment. We write st for the state
of the environment at time t, and at for the agent’s ac-
tion at this time (where both st and at are column vec-
tors).
For example, the agent might be the saccadic system in
the brain, which rotates the eyes quickly to aim them at
an object of interest. In that case st might include the
current positions of the eyes and the location of the vis-
ual target. And at might be the motor command — the
firing rates of the motoneurons to the muscles that
move the eyes.
In this course, we assume that time passes in discrete
steps of some fixed size ∆t, because that approach is
convenient for computer simulations. With each step,
the state takes a new value,

2

(1) ( , ),t t t ts f s a 
where f is called the state-dynamics function.
Simulating dynamics
In our equations, time passes in discrete steps. But real
time is continuous, and real processes obey differential
equations, such as ds/dt = g(s, a). We can approximate
such systems in our discrete-time framework using Eu-
ler integration, where we say
(2)
d
( , ) ( , ).
d
t t t t t t t t t
s
s f s a s s s t s a
t
       
In other words, we compute ds/dt at time t and pretend
that the velocity stays constant over the whole time in-
terval [t, t + Δt). This approximation will usually be
more accurate when ∆t is smaller, but small ∆t’s are
computationally expensive, because it takes a lot of
them to cover, say, 1 s of time. In this course, a good
compromise will be to choose ∆t in the range 0.1 to
0.001 s. You can find an example in the code file Euler
_integration.m, posted online (and see Getting Started
with Matlab, near the end of this chapter).
We will call ∆st/∆t the discrete-time velocity of s.
›› run EULER_INTEGRATION.M
Policy
The agent’s job is to choose actions, a, that achieve its
aims, given the rules laid out by the state dynamics. In
this course we will usually assume that the agent, based

3

on its sensory inputs, memories, and intentions, knows
st in full at every time t. That is, we consider situations
where state feedback is available, not ones where the
state is only partially observable.
The rule the agent uses to compute a’s from s’s is called
its policy,  ,
(3) ( ).t ta μ s
A policy may be stochastic, e.g. it may choose its action
at by computing some deterministic function of st and
then adding Gaussian noise, i.e. a random variable with
a Gaussian, or normal, distribution.
Rewards
An agent needs a policy that will achieve its aims, but
what are these aims? We assume the agent receives or
computes, at each time step, a signal called a reward, rt,
usually determined by the current state and action,
(4) ( , )t t tr r s a
The reward function defines what the agent wants. A
high reward indicates something the agent finds pleas-
ant, which might be light, or dark, or heat, or cold, or
some food, or having some visual target perfectly cen-
tred on the fovea, whereas a lower reward is less pleas-
ant.
The reward at any moment is a scalar: it may be com-
puted from the vectors st and at, but it expresses the
overall pleasantness of the situation in a single number.
The agent’s job is to choose actions that yield high re-
wards.

4

For example, suppose the agent wants to hold s near a
target state s = 0. We can express that wish mathemati-
cally by defining a reward that is maximal when s = 0,
and is otherwise much lower, such as
(5) ( , ) tanh( )t t t t tr s a s s 
T
This rt attains its maximum, namely 0, when s equals its
target value of 0. But if s is even a little off target then rt
is close to its minimum of –1.
It may seem odd that the reward (5) is defined with a
negative sign and has a maximal value of 0 — i.e. the
best reward we can ever get is nothing. But rewards of
this type are common in sensorimotor systems, many of
which try to eliminate some bad thing, such as being
off-target, and their highest reward is to zero that bad
thing.
Equation (5) also shows that rt doesn’t always depend
on both st and at, but I have written r(st, at) nonetheless
because in general, rewards do depend on both.
Returns
I have introduced the idea of rewards as a step toward
expressing an agent’s aims mathematically, but many
practical aims can’t be defined simply in terms of max-
imizing a reward at each moment. Instead, most purpos-
ive behaviour aims to maximize the time-integral of a
reward, called the return, which (for some reason) is
usually represented by a letter G in formulas.
For example, suppose we want to drive a mechanical
system such as an arm or leg to a target state as quickly

5

as possible and hold it there — a task called time-opti-
mal control. If the target state is 0 then (5) is an apt re-
ward function, but we should not aim to maximize this
reward at each moment. If we do, we will drive s to 0 at
high speed, so as to raise rt quickly, but then we may
have so much momentum that we overshoot the target,
and rt plummets. Essentially, if we maximize rt at each
moment then we won’t hit the brakes in a timely fash-
ion. Policies of this type, which maximize immediate
reward at each instant with no regard for future conse-
quences, are called myopic.
We don’t want a myopic policy that maximizes imme-
diate reward, we want a policy that cares about the fu-
ture, and so tries to maximize the discrete-time integral
of the reward from the present moment, t = 0, to some
future time T,
(6) 0...Δ Σt T tG t r
where T may be . This integral formulation expresses
the aim of time-optimal control, because clearly the
way to maximize G is to get s to 0 as quickly as possi-
ble without overshooting.
Vestibulo-ocular reflex
An example of a sensorimotor system is the vestibulo-
ocular reflex, or VOR, which senses head motion and
counterrotates the eyes to keep the retinal images stable.
Without your VOR, your vision would blur whenever
you moved, like a photo taken with a moving camera.
For simplicity we will consider just one eye, rotating
only horizontally, in which case the state st consists of
horizontal eye position qt and current horizontal head

6

velocity ht, the latter measured by sensors called semi-
circular canals located in the inner ears, in cavities in
the skull bones called the vestibula. We write
(7) [ ; ]t t ts q h
where the semicolon means we stack qt and ht verti-
cally, one on top of the other, to form a column vector
st.
The variables qt and ht have just one element each, and
so could be regarded as scalars (i.e. numbers), but we
will view them as one-element vectors because in most
systems, the position and input variables are vectorial.
Throughout the course, we will write q for the configu-
ration of a system, i.e. its position, location, or posture,
as opposed to its velocity, acceleration, etc. The ele-
ments of q are the system’s degrees of freedom.
What are the state dynamics of the VOR? Eye position
q evolves (to a good approximation) this way:
(8) ( ) /t t t t t t tq q q q t a κq ρ.       
Here κ and ρ are constants that reflect the mechanical
properties — stiffness and viscosity — of the eye mus-
cles. Realistic values are κ = 250 and ρ = 50, if t is
measured in seconds, q in radians, and a in mean spike
rate per motoneuron per second.
As for head velocity ht, we will measure it in radians/s
and treat it as something random, which the agent can-
not predict, but must simply read in from its sensors in
the inner ear. In reality you can predict your head mo-
tion to some extent, as it is partly under your control,
but it may also be affected by unpredictable factors, as

7

when you are riding a bucking bronco. For now, it is
simpler to treat ht as completely unpredictable.
What are the aims of the VOR? It tries to minimize the
velocity with which the visual image is slipping across
the retina, so we can define its reward as the negative of
the squared slip velocity,
(9)
2
( , ) (Δ /Δ )t t t t tr r s a q t h   
(though other choices, such as the negative of the non-
squared magnitude of the slip velocity, would also be
reasonable).
The VOR’s policy is a neural network in the brainstem
and cerebellum. It receives information about st and
uses it to compute an action at, which is another one-el-
ement vector, the mean firing of the motoneurons. (Or
more precisely, it is the difference between the mean
firing rates of the motoneurons to the rightward and
leftward-pulling eye muscles.)
What is the best policy? Using (8), we see that rt can be
maximized (that is, zeroed) if
(10) ( ) = .t t t ta μ s κq ρh 
To apply this policy, the agent must know not just st but
also the stiffness and viscosity coefficients of the mus-
cles, κ and ρ. No sensory signals can report those values
in any straightforward way, but by observing eye mo-
tions evoked by a wide variety of actions, a, the agent
can gradually deduce them. In other words, the agent
can learn estimates of κ and ρ.
To emphasize that it uses approximations rather than
exact, true values, we write the policy this way,

8

(11) ( ) = ,t t t ta μ s κ q ρ h 
where corner brackets indicate estimates. In any system
with estimates of environment properties, such as κ
and ρ, we call the set of estimates a model of the envi-
ronment.

›› run VOR.m
Saccades
We have deduced the ideal policy for the VOR, but for
most sensorimotor systems it is harder to choose opti-
mal actions. To see some of the issues, we will examine
the system that drives saccades, the rapid eye move-
ments that rotate the gaze lines to a visual target. Again
we consider horizontal movements of one eye.
For this task, the state is
(12) [ ; * ],t t ts q q
where qt is horizontal eye position as in (7), and q*t is
desired horizontal eye position, the position that will
point the fovea at the visual target. Eye position q
evolves according to (8), as it did in the VOR, and with
the same κ and ρ, because both systems use the same
eyeballs and muscles.
Target position q* we treat as random. In reality, the
saccadic system must analyse the retinal images to de-
tect objects and choose one as the visual target, but we
will assume the choice has already been made in cir-
cuits upstream, and the saccadic system receives a one-
element vector q*t.

9

For this first look at saccades, we will take as our re-
ward the negative squared difference between the actual
and desired eye positions,
(13)
2
* .t t tr q q  
(In fact, slightly more complicated rewards are more re-
alistic for saccades, but (13) will do for now.)
The policy is implemented by neurons mainly in the
midbrain and brainstem. We assume these neurons al-
ways know the current qt and q*t. But because time
passes in discrete steps, they have to wait ∆t time units
before they learn the next state and select the next ac-
tion. In other words, our discrete ∆t mirrors the fact that
real agents, even in continuous time, receive their feed-
back with a delay. For the saccadic system we set ∆t =
0.01 s because that much delay is biologically reasona-
ble (though values as small as 0.001 s may also be real-
istic).
Another issue, crucial for saccades, is that actions are
bounded. Motoneurons can’t fire billions of spikes per
second, to snap the eye to its target in a nanosecond.
Real motoneurons can’t exceed about 500 spikes/s, and
most saccades take 20–100 ms.
So what is the best policy for saccades? We might try to
find a policy by solving for the action at that zeroes rt,
as we did with the VOR, but we won’t succeed, because
nothing in (13) — the formula for rt — is a function of
at. Specifically, (8) tells us that at (together with other
variables) determines ∆qt, not qt.
We might think of solving for the at that produces a ∆qt
such that qt+∆t = q*t. But that won’t work either: because

10

at is bounded, it is almost never possible to make qt+∆t =
q*t, i.e. to reach the target in a single 0.01-s time step.
A better approach is to define some sensible desired dy-
namics. To keep the formulas compact, we will write
q(1)t for the discrete-time velocity ∆qt / ∆t, q(2)t for the ac-
celeration ∆q(1)t / ∆t, q(3)t for the jerk ∆q(2)t / ∆t, and so
on. And for qt we may also write q(0)t.
Using this notation, we might define a desired eye ve-
locity for saccade trajectories,
(14) (1)* ( * ),t t tq α q q 
where α is a negative constant. That is, the eye should
have a negative velocity when qt > q*t and a positive
velocity when qt < q*t, so q is always moving toward its
target.
We can solve (8) for the at that makes q(1)t = q(1)*t,
namely
(15) ( *).t t t ta κ q ρ α q q  
This policy will drive the eye to its target, zeroing the
reward (13), so long as κ and ρ are accurate and α is
not so large that it causes overshoot, as you can check
in the posted code file.
›› run SACCADIC_LINEAR.m
But there are many possible choices of desired dynam-
ics, and (14) is not particularly good. For one thing, sac-
cades should be fast, but (14) is slow. It makes q(1)*t a
linear function of the motor error, qt – qt*, which
means that as the error shrinks, the velocity shrinks pro-
portionally. If initial error is, say, 0.4 radians, then the

11

eye may start toward the target at high speed, but a little
later, when the error has shrunk to 0.2, velocity will be
halved.
Better to choose desired dynamics that maintain speed
until the target is reached, such as
(16) (1)* sgn( *),t t tq α q q 
where the sgn function (pronounced signum) yields 1
when its argument is positive, –1 when it is negative,
and 0 when it is 0. The resulting policy is
(17) sgn( *).t t t ta κ q ρ α q q  
This policy is a simple case of sliding-mode control,
which we will define in greater generality later.
›› run SACCADIC_SLIDING.m
As you explore this code, you will find that (17) doesn’t
actually work very well — it causes chatter, or rapid
oscillations in q and a (why?) — but the following sof-
tened version is better,
(18) tanh( ( *)).t t t ta κ q ρ α β q q  
This policy is “softened” in the sense that tanh — the
hyperbolic tangent function — is a smoothed approxi-
mation to the abrupt step function sgn.
In fact, (18) is close to what the real saccadic system
does, except that the real system uses its κ and ρ not
only as in (18), but also to update an internal estimate of
qt, by running a kind of neural simulation of the state
dynamics (8):
(19) Δ ( ) / .t t t t tq q t a κ q ρ    

12

It then uses this running estimate instead of the real qt in
its policy,
(20) tanh( ( *)).t t t ta κ q ρ α β q q  
This way, the policy neurons get (approximate) infor-
mation about eye position faster than if they waited for
visual or proprioceptive signals. This mechanism is
called internal feedback because the agent learns qt
from its own model rather than from the senses. Internal
feedback is what allows the saccadic system to achieve
a feedback delay ∆t as small as 0.01 s. Visual feedback
takes longer — about 0.1 s.
›› run SACCADIC_INTERNAL.m
Higher-order dynamics
The VOR and saccades both have (to a good approxi-
mation) first-order state dynamics, meaning that no
time-derivatives higher than the first — velocity — ap-
pear in the dynamics. But most sensorimotor systems
are at least second-order.
To see the implications, we will start not with any real
sensorimotor dynamics but with the simplest second-or-
der system, namely a frictionless bead sliding on a
straight, horizontal wire, driven by a varying force
which we will regard as an action a. The bead’s loca-
tion q obeys Newton’s second law, d2q/dt2 = a (assum-
ing the bead has mass 1 kg, and force a is measured in
newtons). The state must now include both position and
velocity, because both affect the future evolution of the
system,
(21) (1)[ ; ],t t ts q q

13

and the state dynamics are
(22) (1) (1)Δ [Δ ; Δ ] Δ [ ; ].t t t t ts q q t q a 
Notice that at determines ∆q(1)t but has no effect on ∆qt,
and therefore none on qt+∆t. It will affect qt+2∆t and later
q’s, but qt+∆t is already determined by the current veloc-
ity q(1)t. In other words, the system has inertia, and this
fact can complicate control.
Suppose we want to drive our bead-on-a-wire to state
s = 0, assuming |a| ≤ 5, and taking (5) as our reward. As
with saccades, no at will zero this reward immediately,
but we can choose desired dynamics that take s to 0
eventually.
For saccades, it was easy to choose reasonable linear
dynamics, because the state dynamics were first-order.
But for an nth-order system, how do we choose constant
coefficients αi so the linear dynamics,
(23) ( ) ( 1) (1) (0)1 1 0* ... ,
n n
nq α q α q α q

   
give the behaviour we want, e.g. driving q to 0?
We define the Hurwitz polynomial
(24) 1 11 1 0...
n n
nx α x α x α

   
(replacing the time-derivatives q(i) from (23) with pow-
ers xi of a new variable x), and choose our α’s so this
polynomial’s roots all have real parts less than 0. Then
if we use these α’s in (23), we get what are called Hur-
witz dynamics, which (it can be shown) drive s to 0, so
long as ∆t is not too large.

14

If our goal is not 0 then we make appropriate adjust-
ments, e.g. if we want to take q to some nonzero sta-
tionary target q* then we set
(25) ( ) ( 1) (1) (0)1 1 0* ... ( *).
n n
nq α q α q α q q

    
For the bead on the wire, we might choose the Hurwitz
dynamics
(26) (2) (1) (0) (1)1 0* 6 9 ,q α q α q q q    
which imply the policy
(27) (1)6 9 .t t ta q q  
Defining a policy this way, based on Hurwitz dynamics,
is called feedback linearization.
But as with saccades, linear dynamics are slow. We can
speed things up using sliding-mode control, meaning we
choose Hurwitz dynamics for q(n–1), but faster, nonlinear
dynamics for q(n). For example, given a stationary target
q*, we set
(28) ( 1) ( 2) (1) (0)2 1 0* ... ( *)
n n
nq α q α q α q q
 
    
and
(29)
     1 1
1* sgn( *).
n
n
n n
q q qα
 
 
Here α0 to αn–2 are chosen to make (28) Hurwitz, and
αn–1 is a negative number. Given (29), q(
n–1) moves rap-
idly toward q(n–1)*, reaching it quickly, and thereafter
the state dynamics obey (28), with the result that q →
q* exponentially.
For the bead on the wire, we might choose

15

(30) (1) 0* ( *) 6q α q q q   
and
(31)
       
1
1 1 12
* sgn( *) 5sgn( 6 ),q q qα q q    
which imply the policy
(32)
 1
5sgn( 6 ).tt tqa q  
But even (32) is not the time-optimal policy — the one
that gets s to 0 as quickly as possible, maximizing re-
turn (6) given reward (5). In this simple case of a fric-
tionless bead on a wire, we can deduce the time-optimal
policy analytically (see posted code), but in general,
time-optimal control is hard.
›› run INERTIAL_TIME_OPT.m
Learning
We have managed to find sensible policies for the
VOR, saccades, and inertial control. But we did it by
examining equations and reasoning about them. How
can such policies be acquired by sensorimotor networks
in the brain? They can’t be hardwired in by the genome,
because they depend on mechanical coefficients like κ
and ρ, which change as the body grows. They must be
at least partly learned.
Also, we found our policies by specific reasoning, tai-
lored to the VOR, saccades, and a bead on a wire. And
in fact our reasoning worked only because those tasks
were very simple. The brain has more likely evolved
one or a few general mechanisms that can learn policies
for a wide range of difficult tasks. In this course we will

16

explore how such mechanisms can be embodied in the
neural networks of the brain.
Getting started with Matlab
In Matlab, you write code (m-files) in the editor win-
dow, and when you run your code, you may see printed
outputs and error messages in the command window.
You can also use the command window as a calculator
(e.g. type 7*5 and get 35), and you can write and run
one-line programs there.
The following words or key-presses are useful in the
command window.
Ctrl-C will halt a running piece of Matlab code.
help will provide information about Matlab functions,
e.g. help rand will reveal how to use Matlab’s ran-
dom-number function, rand.
If you don’t know the name of the function you want,
then write lookfor; e.g. lookfor random will bring
up a lot of documentation where the word “random” ap-
pears.
If there is a bug in your code then error messages will
appear in the command window.
In the editor window, omitting the semicolon ; after a
statement will cause Matlab to display its output in the
command window, which may help you debug your
code, e.g. if you write c = a + b without a semicolon,
then the value of c will appear in the command win-
dow.

17

Start a line with % to make it a comment (i.e. ignored
by the computer but useful for humans reading your
code).
If a variable takes the value NaN (not a number) or inf
(infinite) then there is a problem in your code.
Write A(2, 7) for the element in the second row and
seventh column of a matrix A.
A(2:4, 7:12) is the 3-by-6 submatrix of A containing
its rows 2 through 4 and columns 7 through 12.
A(:, 7:12) is the submatrix of A containing all its
rows and the columns 7 through 12.
A’ is the transpose of the matrix A.
A*B is the matrix product of A and B.
A.*B is the elementwise product.
Simulation workshop
1. Save Euler_integration.m under a different name
and use the new file to experiment with various
changes.
(a) Change dur, Dt, and the initial s.
(b) Right now the state dynamics are s(1)t (i.e. ∆st / ∆t) =
st. Make the program more flexible by introducing a
variable α and setting the equation to s(1)t = αst.
(c) Plot the exact, continuous-time solution to the cor-
responding differential equation ds/dt = αs with a

18

dashed blue line in the same panel as the discrete-time,
Euler approximation.
(d) Change other aspects of the graphing by altering the
code following “% plot”.
2. To get practice with states and configurations, write
an m-file that uses Euler integration to express and
solve (approximately) the second-order differential
equation d2q/dt2 = –αq, with initial conditions q = 0 and
dq/dt = α1/2. Also plot the exact solution as a dashed
line. Try different time steps.
3. In Saccadic_linear.m, try larger and smaller values
for α, the desired-dynamics coefficient. Does anything
odd happen when α is very large? Why?
4. Write an m-file that uses feedback linearization to
control a system with state dynamics d2q/dt2 = a – αq,
for a small constant α. Make the agent drive q to a tar-
get q* that jumps every 500 ms, as in Saccadic_lin-
ear.m.
Other resources
Hanneton S et al. (1997) Does the brain use sliding var-
iables for the control of movements? Biological Cyber-
netics 77: 381-393
Khalil HK (2002) Nonlinear Systems. Prentice Hall
Slotine JJE, Li W (1991) Applied Nonlinear Control.
Prentice Hall

19

2 Supervised learning

Agents must learn their policies, based on information
from their sensors. To explore the principles, we will
first consider a simpler task, where a learner tries to de-
duce a target function τ from examples of x’s and τ(x)’s.
This task is called supervised learning; the x's are called
inputs; the τ(x)'s are target outputs, desired outputs, la-
bels, or supervisor or teacher signals; and each ordered
pair (x, τ(x)) received by the learner is an example. The
learner tries to mimic τ, i.e. to learn to predict τ(x) given
x. Usually a learner can’t mimic τ precisely but creates
an estimate τ that approximates τ.
The other main type of learning, besides supervised, is
reinforcement learning, where an agent learns a policy,
based not on teacher signals like τ(x), but on rewards.
We will get to reinforcement learning in Chapter 4, but
first we need to understand supervised learning.
Error and loss
To formulate a supervised-learning task, we define the
error vector
(33) ( ) ( ) ( ).e x x x  
Then we choose a loss function, L — a scalar function
of e, usually non-negative, with a single, global mini-
mum at e = 0, and no other minima. An example is the
squared error,
(34) .L e e T

20

Then the aim of learning is to adjust τ to minimize the
expected (i.e. average) loss, E[L], averaged across all
possible inputs x, taking into account that some x’s may
be more common than others.
Neural networks
The learners we study in this course are layered net-
works of computational units called neurons — simpli-
fied models of the real neural networks in the brain. We
write nl for the number of layers in the network, and
y{l} for the signal in layer l, i.e. the vector of the firing
rates of all the neurons in that layer. Each signal (apart
from the first layer’s) depends on the signal in the pre-
vious, upstream layer, by the formula
(35) { } ( { } { 1} { }).y l W l y l b l  
Here W{l} is the matrix of synaptic weights from layer
l – 1 to layer l; b{l} is the bias vector of layer l (repre-
senting baseline activities of the cells in the layer); and
φ is the activation function. The term in parentheses is
called the pre-activation potential,
(36) { } { } { 1} { }.v l W l y l b l  
For our activation function, we will mainly use the rec-
tified-linear-unit, or relu, function,
(37) { } ( { }) max(0, { }),y l v l v l 
in layers 2 through nl – 1, and the identity function,
y{nl} = v{nl}, in the final layer. No activation function
is applied in the first layer of the network either, so we
have also y{1} = v{1}.

21

Another popular activation function is the hyperbolic
tangent,
(38) { } tanh( { }),y l v l
which has some advantages, e.g. it is bounded between
–1 and 1, and so tanh-neuron signals can never fly off
toward infinity, as may happen with relu.
In any layered network, we regard the first-layer signal,
y{1}, as the network's input, also called x. The net-
work’s output is its final-layer signal, y{nl}, though
sometimes it is convenient to make τ(x) equal not to
y{nl} but to some function of y{nl}. Often we write
simply y for τ(x), and y* for the desired value τ(x), in
which case (33) becomes
(39) *.e y y 
So long as they have at least 3 layers, networks with
relu or tanh activation functions (or indeed with any of
a wide variety of other nonlinear activation functions)
are universal approximators, meaning we can approxi-
mate any continuous function τ arbitrarily well with a
network of this form if we have enough neurons and we
find appropriate W’s and b’s.
Gradient descent
Learning is a matter of finding weights and biases that
yield a small E[L]. Finding the very best W’s and b’s is
computationally intractable — too slow and expensive
— but it is feasible to guess W and b and then improve
those guesses by gradient descent.

22

To explain this idea, it is convenient to arrange all the
adjustable parameters — all the W{l} and b{l} — to-
gether in a single column vector, . For any one net-
work, the set of all its possible  ’s is called its parame-
ter space. E[L] is a function of , because when a net-
work has good parameters then E[L] is small, and when
it has bad parameters then E[L] is large.
The gradient E[L]/ at any locus in parameter space
is a row vector whose transpose points in the direction
of steepest increase of E[L]. Therefore we can decrease
E[L] by adjusting  in the opposite direction, setting
(40) E[ ] / ,t t t L    
T   
where θt is the parameter vector at time t, η is some
small positive number, and θt+Δt is the new parameter
vector, after the adjustment. Then we compute E[L]/
at this new spot in parameter space, subtract η times
that gradient to get θt+2Δt, and repeat over and over,
gradually improving .
We can picture the graph of E[L] versus  as a high-di-
mensional surface, the E[L]-landscape. Then gradient
descent is like walking down the slopes of that land-
scape in hopes of finding a low basin. The quantity η in
(40) scales the lengths of the strides we take, as we will
discuss in the section Faster learning.
And from now on, for simplicity, we will set the time
step Δt equal to 1; for example, if θ is adjusted once
every millisecond then we use milliseconds as our time
units, so we can set Δt = 1 and write θt+1 instead of θt+Δt.
›› run GRADIENT_DESCENT_QUADRATIC.m

23

Usually we can’t compute E[L]/, but we can esti-
mate it by computing Lt/, where Lt is the loss for the
current example. At each time point t, the network re-
ceives an example (xt, τ(xt)), and computes its estimate
τ(xt) and its error et = τ(xt) – τ(xt). From et it can com-
pute Lt/, which is an estimate of E[L]/. The esti-
mate is unbiased, meaning it equals E[L]/ on aver-
age, but of course it does vary around this mean, and
that variation is called gradient noise.
Backprop
The error-backpropagation algorithm, or backprop,
computes Lt/ for all the adjustable parameters 
in a
layered network.
As a step toward that goal, it computes the derivatives
of Lt with respect to all the network’s pre-activation po-
tentials, v{l}. Dropping the subscript t’s to avoid clutter,
we see by the chain rule of calculus that
(41)
{ 1} { }
{ } { 1} { } { }
{ }
{ 1} .
{ 1} { }
L L v l dy l
v l v l y l dv l
L dy l
W l
v l dv l
   

   

 
 

This equation tells us how to find L/v for layer l given
L/v for layer l + 1.
To get started, we need L/v for the final layer. Usu-
ally that is easy to compute, though the details depend
on the formula for L. If L is the squared-error loss, as in
(34), and the final-layer activation function is the iden-
tity, then

24

(42)
2
{ } { } { }
{ }
2 2 .
{ }
l l l
l
l
L dL e e
e
v n de v n v n
dy n
e e
dv n
  
 
  
 
T
T T

So backprop starts at the output end with (42) and
works backward layer by layer through the network
with (41) to get all the L/v{l}.
This same procedure also yields the derivatives of L
with respect to the biases b{l}, because
(43)
{ }
.
{ } { } { } { }
L L v l L
b l v l b l v l
   
 
   

To get the derivative of L with respect to the weights,
we again apply the chain rule,
(44)
{ }
{ 1} .
{ } { } { } { }
L L v l L
y l
W l v l W l v l
   
  
   

Given these derivatives, we can adjust the weights and
biases by gradient descent, setting ∆W = – (L/W)
T,
∆b = – (L/b)T.
›› run BACKPROPAGATION.m
Finding other gradients by backprop
We have seen that backprop can compute the gradient
of L with respect to many network variables, including
the input vector, y{1} — also called v{1} or x. But
given any vector α of the same dimensionality as y{nl},
we can replace L by αTy{nl} in (41) and (42). It follows
that, if we begin backprop with α instead of 2e, we get
αT y{nl}/x at the input end. In other words, we can

25

compute the gradient of any network’s output with re-
spect to its input, multiplied by an arbitrary vector α, by
backpropagating α. In brief,
(45) backpropagating { }/ .lα α y n x  
T
This principle will be crucial to reinforcement learning
in Chapter 4.
Feedback alignment
Very likely, backprop can’t operate in the brain, be-
cause of (41). That equation contains the term W{l + 1},
which means that the feedback circuits that compute the
parameter adjustments must know the weight matrices
W. That is, neurons in these feedback pathways, whose
signals represent L/v{l}, must know the weights of
vast numbers of synapses on other neurons, in the for-
ward path — the neurons whose signals are y{l}. De-
spite decades of research on inter-neuronal communica-
tion, we have no known mechanism that could convey
this synaptic data with anywhere near the speed and ca-
pacity required for backprop.
But another mechanism, very close to backprop, works
almost as well and is biologically feasible. If the W’s in
(41) are replaced by random, fixed matrices then there
is no need to communicate synaptic weights from cell
to cell, and yet (it can be shown) the forward-path
weight matrices W{l} evolve to resemble the fixed ma-
trices in the feedback circuits, so that in the end it is as
if those feedback matrices had been set equal to the
W{l}’s, as in backprop.
›› run FEEDBACK_ALIGNMENT.m

26

Minibatches
Backprop (and feedback alignment) networks learn
much faster if they receive, at each time t, not a single
example (xt, τ(xt)), but a set of nm examples called a
minibatch, with nm usually in the range 100–256. One
reason minibatches help is that with more examples we
can compute better estimates of E[L]/, reducing gra-
dient noise. Another reason is that we can arrange all
the data vectors of the minibatch side by side in a single
matrix, and computers can process that matrix of, say,
100 examples much faster than they can 100 individual
examples.
When we use a minibatch of nm examples, we are run-
ning nm different inputs through the same network, so
we get nm different sets of v’s and y’s, and nm different
outputs, errors, and losses, but all nm runs share the
same W’s and b’s. We will write Lm for the loss in ex-
ample m of the minibatch, v{l}i,m for the pre-activation
potential of neuron i of layer l in example m, and analo-
gous things for other variables. In this setting, L means
the average of all nm losses in the minibatch. Backprop
equations (42), (41), (43), and (44) then become
(46)
,
, ,
, ,
d { }
2 2 ,
{ } d { }
l i mm
i m i m
l i m l i m
y nL
e e
v n v n

 


(47)
,
,
, , ,
d { }
Σ { 1} ,
{ } { 1} d { }
i mm m
j j i
i m j m i m
y lL L
W l
v l v l v l
  
  
    

(48)
,
1
Σ ,
{ } { }
m
m
i m i m
LL
b l n v l
  
  
  

27

(49)
,
, ,
1
Σ { 1} .
{ } { }
m
m k m
i k m i m
LL
y l
W l n v l
  
  
  

Brains likely can’t use minibatches, but in most of our
simulations we will use them nonetheless, for speed.
We get much the same result we would have obtained
without minibatches, but faster.
›› run BACKPROPAGATION_MINIBATCH.m
Initialization
In many problems, backprop-based learning will suc-
ceed or fail depending on how we set the initial values
of the network’s weights and biases. One rule of thumb,
which does not always work, is that in a relu network,
the weights should start out small and the biases mainly
positive, say with a mean near 0.1. That way, most neu-
rons in the network will yield nonzero signals given
most inputs x, whereas with negative b’s, or large W’s
that overpowered the b’s, many neurons would be in the
flat part of the relu function — a problem, as neurons
can’t learn when they are in the flats because then the
derivative y/v, which drives learning in (41), is 0.
Activation functions like relu that are zero over some
stretch of their domain are said to show hard-zero satu-
ration. One might expect better learning with other acti-
vation functions, such as the hyperbolic tangent, that are
flat nowhere, but in practice, relu networks usually
learn better, maybe because the relu function, being so
simple, simplifies the E[L]-landscape. Also, most com-
mon activation functions without hard-zero saturation,
such as tanh, are almost flat over parts of their domains,

28

and so they don’t really avoid the problem. Real neu-
rons in the brain show hard-zero saturation, as their fir-
ing rates can’t be negative.
Faster learning
Speed of learning hinges on the rate constant  in (40).
If η is too small, then  improves very slowly. If η is
too large, it may overstep a minimum and increase
E[L].
Annealing
Often it helps to anneal η, shrinking it as time goes by
so the network learns rapidly at first and then slows
down to home in on an optimal . But choosing the an-
nealing schedule (the rate at which η shrinks) is a matter
of trial and error, as is choosing the initial η.
Momentum
Backprop usually learns faster with momentum, mean-
ing that the parameter adjustment ∆ is not exactly –η
times the gradient, but also depends on the ∆ in the
previous time step:
(50) / ,t t t L     
T    
where β is a number between 0 and 1, usually 0.9 or
larger. The result is that if the negative-gradient –∂L/∂θ
consistently points in some direction over several time
steps, then  will build up speed in that direction,
whereas if the negative-gradient flicks back and forth
between opposite directions from time step to time step
then that back-and-forth motion will be averaged out
and ignored.

29

This averaging is crucial because the E[L]-landscape is
often poorly conditioned, meaning its ridges, saddles,
valleys, and basins are much steeper in some dimen-
sions than in others. Imagine a deep river valley that
snakes its way downhill very slowly, i.e. with a shallow
grade, but that has very steep sides. In other words, the
land slopes steeply down to the river on either side, but
slopes downhill much more gradually in the direction of
the river’s flow. In a valley like this, it is usually a bad
idea to step exactly down the gradient: if you are stand-
ing on either side-slope, the negative-gradient will point
straight down that slope, at almost a right angle to the
riverbank, and will be huge, so unless  in (40) is tiny,
∆ will also be huge, and so  will tend to jump back
and forth across the valley, ∆ t taking it from one side
to the other, and ∆ t+1 taking it back again, rather than
snaking steadily downhill along the valley floor.
On the other hand, if  is tiny, then  will descend the
steep side in an orderly way, but then its progress along
the valley floor will be very slow.
So no single  is appropriate for the steep and the shal-
low dimensions. But with (50), the back-and-forth
jumps across the valley will tend to cancel each other
out, and any motion along the valley will compound, so
 improves more efficiently.
Nesterov momentum
In (50), we applied the momentum after computing the
gradient, i.e. we received the input xt, computed the gra-
dient, and then adjusted . Another approach, often
faster, is to adjust  to an interim value,  ', by applying
the momentum, then compute the gradient at  ', and
move down that gradient:

30

(51)
1 1
1
,
/ ( ) ,
.
t t
t
t t t
θ' θ β θ
θ θ' η L θ θ'
θ θ θ
 

  
   
  
T
The rationale is, if we are already committed to taking a
big momentum step, then we should check the lay of
the land after that step, not before.
RMSprop
We have seen that E[L]-landscapes have steep and shal-
low dimensions, and that no single  can work well for
all of them. Therefore it is useful to define a separate
learning-rate factor for each dimension, i.e. for each pa-
rameter  i. One approach is to choose a single, scalar 
as usual, but then divide each ∆i by |Lt/ i|, replacing
(40) by
(52)  1, , / / / ,t i t i t i t iη L L ε          
where  is some tiny number, added to prevent us ever
dividing by 0. This scheme keeps the sizes of the ad-
justments, |∆ i|, close to . No  i can take huge steps
down a steep dimension and fly out of control, or take
minuscule steps along a nearly-flat dimension and stag-
nate.
But because of gradient noise, it is better to scale ∆ i
not by the current |Lt/ i| alone, but by a running aver-
age of recent |Lt/ i|. In the method called RMSprop,
we introduce a vector  rms, where the superscript means
root-mean square. We initialize it to 0, we update it by
(53)
rms rms 2
, 1, (1 ) ( / ) ,t i t i t iβ β L      

31

where  rmst,i is the i-th element of  rms at time t, and we
use it to adjust ,
(54)    rms1, , ,/ / .t i t i t i t iη L ε        
In (53), setting β = 0.9 usually strikes a nice balance be-
tween too little and too much memory. If we set β = 0
(that is, we scale ∆ based only on the current mini-
batch) then we overreact to noise, but if we set β much
higher than 0.9 then we retain obsolete information
about gradients in a region of  -space we have already
left behind.
Adam
The most popular way of adjusting parameters in AI
right now is probably the method of adaptive moments,
or adam, which maintains running estimates of the
mean, or first moment of L/,
(55)
1 1
, 1 1, 1(1 ) / ,t i t i t im β m β L    
and the second moment,
(56) 2 2 2, 2 1, 2(1 )( / ) .t i t i t im β m β L    
To get started, we have to initialize m1 and m2, and for
lack of any better guess, we set them both to 0, with the
result that our running estimates in (55) and (56) are bi-
ased toward 0. Those biases may not hurt us, because
they fade with time, as we process more and more ex-
amples and the initial values have less and less effect.
But just to be safe, we can bias-correct. It can be shown
that the following corrected estimates are unbiased:

32

(57)
1 1
1
2 2
2
' / (1 ),
' / (1 ),
t
t t
t
t t
m m
m m
 
 

where β1
t and β2
t are the β’s from (55) and (56) raised to
the power t.
Finally, we adjust the parameters,
(58)  1 21, , , ,/ .t i t i t i t iηm ' m ' ε    
With adam, m1t,i in the numerator acts like momentum,
cancelling out components of the gradient that reverse
direction from one time step to the next, while m2t,i in
the denominator acts like RMSprop, bounding the sizes
of the adjustments, ∆ i. Usually we set  = 10–8,
β1 = 0.9 and β2 = 0.999, and choose  by trial and error.
›› run COMPARE_DESCENT_QUADRATIC.m
›› run COMPARE_DESCENT_COMPLEX.m
Batch normalization
Even with adam, a backprop network may still learn
slowly if its signals are uncentred, i.e. if any of the ele-
ments y{l}i has a nonzero mean across examples, or if
the elements have unequal variances. These problems
can arise even if the input signals to the network as a
whole, y{1}i, are nicely centred and have equal vari-
ances, because the signals in downstream layers may
not have those properties.
›› run UNCENTRED.m

33

We can often improve learning simply by ensuring that
each element of each signal y{l} has a mean of 0 and a
standard deviation of 1. In this method, called batch
normalization or batchnorm, we normalize each layer’s
signal y{l}: we calculate the mean μ{l}i and standard
deviation σ{l}i of each y{l}i across the current mini-
batch, then subtract off the means and divide by the
standard deviations to create a normalized signal,
(59)        
, ,
( ) / ,
i m i m i i
y' y μl l l σ l 
where y{l}i,m is the i-th element of example m in the
minibatch. (There is another version of batchnorm
which normalizes the pre-activation potential v{l} ra-
ther than y{l}, but it is more complicated and usually
doesn’t work any better.)
To backprop through this batchnorm network, we must
compute Lm/y from Lm/y',
(60)
, ,
, , , ,
/ ( /
Σ / Σ / )
/ ( ε),
i m m m i m
i m n i n n i n n n i n
m i
L y n L y'
y' y' L y' L y'
n σ
     
    


where nm is the number of examples in the minibatch,
and ε is a small number introduced to prevent us divid-
ing by 0.
Batchnorm sometimes improves learning dramatically,
though in other cases it makes it worse. It often works
better with tanh than with relu activation functions, be-
cause with relu, zeroing the mean pre-activation poten-
tial pushes it into hard-zero saturation, i.e. the flat part
of the relu curve.
›› run BATCHNORM.m

34

Brains likely can’t use batchnorm, because they likely
can’t use minibatches, but variants of batchnorm may
be feasible. For example, each neuron may monitor its
own activity through time, and alter itself to keep its
mean firing rate and variance near some target values.
Comparing algorithms
To assess learning algorithms, and judge their biologi-
cal feasibility, we have to consider how much energy
they consume, how long they take to learn, and how
much memory they need.
Energy
Computer scientists measure the complexity of their al-
gorithms by counting the number of floating-point oper-
ations, or flops. For example, multiplying or adding two
floating-point variables counts as one flop. Usually, we
will assume that methods that involve a lot of flops in a
computer also involve a lot of computational effort in a
brain.
On the other hand, there are operations that are hard in
computers but easy in brains, or vice versa, so we try to
identify these. Moreover, the energy cost of an algo-
rithm in the brain depends not only on its flop count but
also on how many neurons it needs, as we will discuss
below.
Time
In a computer, the running time of an algorithm de-
pends mainly on the number of flops, because comput-
ers work mainly serially, doing just a few operations at

35

a time in the central processing unit (CPU) or several
thousand at a time in the graphical processing unit
(GPU). But in the brain, vast arrays of neurons do bil-
lions of operations in parallel, so learning time depends
more on the number of examples required than on the
flops.
Memory
A crucial difference between brains and computers is
that brains represent data in at least two different forms:
in neuron parameters such as biases and synaptic
weights, or in signals (firing rates) of neurons.
Synapses are smaller than neurons and more numerous,
so they are a metabolically cheap, high-capacity
memory store. But they can’t send their information to
other neurons or even other places in the same neuron
— there is no known way for messages about the
strengths of synapses to travel from one place to an-
other.
Information stored in neural firing, on the other hand, is
transmissible — it can travel to other cells along axons.
Therefore in our algorithms we have to consider which
pieces of information are transmitted from place to
place, so we can deduce both the synaptic and the trans-
missible memory requirements, or in other words how
many synapses and how many neurons we need.
Scaling
Sensorimotor tasks are often high-dimensional, mean-
ing the state vectors have many elements. For example,
visual inputs to the brain run on the optic nerves, with

36

more than a million axons each, and so visual input is a
vector of more than 2 million elements.
We are interested in how complexity, time, and memory
scale with dimensionality. If the number of flops
needed for a computation increases linearly with the di-
mensionality n of its input, then we say the computation
is O(n), or on the order of n. If it grows with the second
power of n then it is O(n2). For example, computing the
dot product of two n-element vectors takes n scalar
multiplications, and about as many additions, and so the
dot product is an O(n) operation. Multiplying an n-ele-
ment vector by an n-by-n matrix is O(n2). Multiplying
two n-by-n matrices is O(n3). O(n3) operations are slow
and expensive when n is large.
Other resources
Goodfellow I, Bengio Y, Courville A (2016) Deep
Learning. MIT Press.
Kingma DP, Ba JL (2015) Adam: a method for stochas-
tic optimization. arXiv:1412.6980
Ioffe S, Szegedy C (2015) Batch normalization: acceler-
ating deep network training by reducing internal covari-
ate shift. arXiv:1502:03167
Rumelhart DE, Hinton GE, Williams RJ (1986) Learn-
ing representations by back-propagating errors. Nature
323

37

3 Sensory maps

With backprop or feedback alignment, neural networks
can learn visual tasks, such as recognizing objects in
images. To feed an image into a network, we encode it
as a vector, e.g. if the image is greyscale, with n pixels,
we string those greyscale values into an n-element vec-
tor and feed that into the network as its input, x.
But this vector does not convey the geometry of the im-
age: that each vector-element represents a pixel in a 2-D
array, and each pixel is spatially related to every other
pixel — right, left, above, or below it, and close by or
far away. This geometry matters, because most things
we want to recognize visually are contiguous objects,
which are defined by their boundaries, which in images
show up as edges, i.e. sharp differences in brightness or
colour between neighbouring pixels.
How can we convey the geometric relations of the im-
age pixels to a neural network? In this chapter we ex-
plore one approach that is used in the brain and in com-
puter vision.
Receptive fields
In the human eye, light is detected by photoreceptors
(rods and cones) in the retina. From those receptors,
signals pass through a series of layers — bipolar cells
and retinal ganglion cells in the eye, and then, in the
brain, the lateral geniculate nuclei, primary visual cor-
tex, and so on. Any one bipolar cell gets input from
only a small fraction of the photoreceptors, and those
input receptors are not scattered about the retina but are

38

all close together in a contiguous patch called the recep-
tive field of the bipolar cell.
The bipolar cell distinguishes different input patterns,
responding more to some patterns of activity among its
input receptors than to others. And because those recep-
tors are all neighbours, the cell distinguishes local fea-
tures of the image. For instance, many bipolar cells re-
spond to small spots of light or dark against contrasting
backgrounds. Owing to their input connections, then,
the cells detect geometric features, namely spots, that
are useful for identifying objects.
Feature maps
Moreover, neighbouring bipolar cells have neighbour-
ing receptive fields, and therefore the sheet of bipolar
cells is a map of visual space, just as the sheet of recep-
tors is. The loci in this second map — the individual bi-
polar cells — do not represent the brightness of points
in space as the receptors do. Rather, they represent
spots of contrast. We say the bipolars form a feature
map, each cell signaling the presence or absence of an
image feature, in this case a dark or light spot on a con-
trasting background, in a region of visual space.
Because the bipolar cells form a map, just as the recep-
tors do, the brain can apply the same principle again at
the next layer: any one retinal ganglion cell gets input
from only a small contiguous patch of bipolar cells, and
neighbouring ganglion cells get input from neighbour-
ing patches. So the ganglion cells form a new feature
map. And so on through the lateral geniculate nuclei,
primary visual cortex, and beyond.

39

Layer by layer, the features become more and more ab-
stract. So-called simple cells in primary visual cortex
combine information from neighbouring spot-detectors
in the lateral geniculate to detect straight, oriented
edges. Farther downstream, other maps combine neigh-
bouring edges into contours; and maybe neighbouring
contours into outlines of shapes, and shapes into scenes,
though the details are less clearly understood the deeper
we go into the visual system.
In neuroscience, these visual areas are called retinotopic
maps, and analogous terms exist for other senses. Pri-
mary somatosensory cortex, which gets information
from touch receptors, is a somatotopic map, where
neighbouring cells get information from neighbouring
patches of skin. And some auditory areas are tonotopic
maps, where neighbouring cells represent tones of simi-
lar frequencies.
Given any one retinal ganglion cell, its input field is the
set of bipolar cells that project to it, whereas its recep-
tive field is the set of photoreceptors that communicate
with it, via the bipolars. And similarly for any layer of
any sensory system: a cell’s receptive field is a patch of
receptors, and its input field is a patch of cells in the
previous, upstream map. Or more generally, a cell may
have several input fields, if it receives projections from
more than one map, e.g. direct projections from cells
two or more layers back, or feedback from maps farther
downstream.
Connection matrices
The next file of sample code constructs a network
where all layers except the final one are maps. It defines

40

a map width for each map layer, i.e. how many pixels
wide the map is (the code creates only square maps, so
any map’s height is the same as its width). All maps
share the same field-fraction, which sets the width of
the input fields as a fraction of the width of the up-
stream map, e.g. if the field-fraction is 0.25 and the
layer-2 map is 32 pixels wide, then each cell in layer 3
has an input field in layer 2 that is 32 × 0.25 = 8 pixels
wide (and tall).
For each cell in each map layer after the first (the input)
layer, the posted code file chooses an input field, lo-
cated so that neighbouring cells have neighbouring
fields. For each layer l, the locations and sizes of all its
cells’ fields are stored in a connection matrix, C{l}, of
ones and zeros. For example, cell 35 in layer 3 receives
a projection (an axon branch) from cell 140 in layer 2,
and so C{3}35, 140 = 1, but that same cell 35 in layer 3
receives no projection from cell 130 in layer 2, and so
C{3}35, 130 = 0. As the layer-3 map has 16 by 16 pixels,
and the layer-2 map has 32 by 32, C{3} has 162 = 256
rows (one for each neuron in layer 3) and 322 = 1024
columns (one for each neuron in layer 2). As each cell
in layer 3 has an input field 8 pixels wide (and tall), it
follows that each row in C{3} has exactly 82 = 64 en-
tries that are equal to 1, and the other 1024 – 64 = 960
elements are zeros. In short, C{3} is structured so that
each cell in layer 3 gets input from only 64 cells in layer
2, and those 64 cells form a contiguous 8-by-8 patch —
an input field. The code plots the whole connection ma-
trix C{3}, to show that most connections are absent
(matrix entries are 0s) and only a few exist (entries are
1s). It also plots row 35 of the matrix reshaped into a
32-by-32 map, to show the geometry of the input field
of cell 35.

41

The same code file ensures that the network’s weight
matrices agree with the connection matrices: the synap-
tic weight from any one neuron to any other can be non-
zero only if there is a connection between those two
cells, i.e. if C{l}i,j = 0 then W{l}i,j = 0. Of course the
converse is not true: W{l}i,j may happen to be 0 even if
C{l}i,j = 1.
Because C and W matrices between map layers have
few nonzero elements, they are said to be sparse. The
code shows how they can be converted into Matlab’s
“sparse” data type, which can make operations on them
run faster. In contrast, networks like those in Chapter 2,
where each cell in each layer projects to every cell in
the next layer, are called dense or fully-connected net-
works. In computer vision, most neural networks have
several layers of maps, with sparse weight matrices, fol-
lowed by a layer or two with dense connections.
›› run MAP_NETWORK.m
Reading handwritten numerals
One way to test computer-vision algorithms is to see
how well they learn to recognize handwritten numerals
in the MNIST (Modified National Institute of Stand-
ards) data set, put together by Yann LeCun and col-
leagues. MNIST contains a training set of 60,000 exam-
ples, each a 28-by-28-pixel image of one of the digits
“0” through “9”, and a test set of 10,000 additional ex-
amples, which we present to the network only after it
has finished learning, to assess how well it recognizes
examples it did not see during its training. For each of
the 70,000 example images, MNIST also provides a la-
bel, or in other words a desired-output vector y*, which

42

reveals what numeral is in the image. Of course the
learning network does not receive the labels as part of
its inputs, but it does use them to compute its errors, as
in equation (39), and we use them to assess whether the
network’s answers are correct.
When neural networks are trained on classification
tasks, such as recognizing handwritten digits in the
MNIST data set, the desired-output vectors (the labels)
are usually one-hot, meaning there is always one ele-
ment of the vector equal to 1 and the others are all 0s.
For example, if the net is learning to recognize the 10
digits “0” through “9”, then the desired output y* is typ-
ically a 10-element vector, set to [1; 0; 0; 0; 0; 0; 0; 0;
0; 0] when the digit is a “0”, to [0; 1; 0; 0; 0; 0; 0; 0; 0;
0] when the digit is a “1”, and so on.
In this task, therefore, the actual output y of the network
must also be a 10-element vector, where each element
of y learns to signal the presence of one of the object
classes, “0” through “9”. We say the network has given
the correct answer when the largest element of y is the
one with the correct index, e.g. if the digit is a “5” then
y*6 = 1 and all other y*i = 0, and we say the network has
answered correctly if y6 > yi for all i ≠ 6.
When y* is one-hot then it is a good idea to restrict the
elements of y to the range [0, 1] as well, e.g. by sending
the final-layer signals through a rectified tanh function,
y = max(0, tanh(y{nl})).
On the course website you can find the MNIST data set,
in the file MNIST.mat, and also MNIST_key.m, which
illustrates the format of those data.
›› run MNIST_KEY.m

43

Convolution
At present, most computer-vision networks use maps
with additional mechanisms which we will consider
briefly here but won’t cover in detail because they are
biologically implausible. Most importantly, these net-
works use convolution, meaning that all the neurons in
any one map have the same incoming synaptic weights.
For example, if those neurons have 2-by-2 input fields,
and one of the neurons has weights 2, –3, 1, 5 for its up-
per-left, upper-right, lower-left, and lower-right input
cells, respectively, then every other neuron in the map
will also have those same weights at those same loci
within its field (but the fields for different neurons will
still have different locations within the upstream map,
as in Map_network.m).
This arrangement, called weight-sharing, is not biologi-
cal. Real neurons in real sensory maps each have their
own synapses and no known mechanism to keep their
weights equal to those of other neurons, though of
course there may be some mechanism, still waiting to
be discovered, by which real neural maps could approx-
imate weight sharing.
The set of shared weights that a map applies to all its in-
put fields is called the map’s kernel or filter. So weight
sharing forces a map to apply a single filter to all its
fields. That constraint can be valuable because if a filter
is useful in one part of visual space, then it is likely also
useful elsewhere. If it is useful to detect vertical edges
in the upper-left part of visual space then it is likely also
useful to detect them in the lower-right and everywhere
else, because objects with vertical edges may appear an-
ywhere.

44

Weight-sharing drastically reduces the number of ad-
justable parameters. In a 20-by-20 map without weight-
sharing, where each of the 400 neurons has an 8-by-8
input field, there are 400 × 8 × 8 = 25,600 incoming
weights. But with weight-sharing there are only 8 × 8 =
64. That is too few to learn almost any visual task, and
therefore convolutional nets have many maps, usually 5
to 100 of them, in each layer. This way, they get reason-
able numbers of parameters, though to do it they need a
lot of maps, with a lot of neurons, i.e. convolutional
nets have a high ratio of neurons to adjustable parame-
ters.
Other resources
LeCun Y et al. (1998) Gradient-based learning applied
to document identification. Proceedings of the IEEE (a
description of convolutional nets)

45

4 Reinforcement learning

Sensorimotor agents can’t shape their policies by super-
vised learning alone, because supervised learning re-
quires teacher signals, τ(x), that say what the current
output should have been. There are no such signals for
most sensorimotor tasks, as sensory feedback usually
can’t tell an agent what its action should have been; it
can only report the consequences of the action, such as
the resulting state change ∆st and reward rt. The agent
must improve its policy based on that feedback — a
process called reinforcement learning.
As in Chapter 1, equation (6), the aim is to maximize
some return, the time-integral of a reward to some final
time T,
(61) 0...Σ .t T tG r
More generally, we may start at some time t' other than
0, and we may aim to optimize a discounted return,
(62) '' '...Σ γ ,
t t
t t t T tG r


where γ is the discount factor, between 0 and 1, say
0.99. So if γ < 1 then we care less and less about re-
wards further and further in the future. One motivation
is that we are less and less certain about those future re-
wards. Another is that discounting helps us avoid infi-
nite returns when T = .

46

Models
There are two main types of reinforcement learning:
model-based and model-free. In model-based methods,
the agent learns a model, or in other words an internal
simulation, of the state dynamics of its environment,
and uses that model to improve its policy.
But if the state dynamics are complex, the agent may
have trouble learning an accurate model. For this rea-
son, a lot of recent work in AI has instead used model-
free methods, where the agent learns not the full state
dynamics but something simpler, an action-value func-
tion, which represents the expected long-term outcome
that will follow if the agent takes a given action in a
given state.
Model-free methods have yielded impressive results in
machine learning, but the brain seems to use models for
at least some purposes: we can predict the sensory con-
sequences of our actions, and plan by running scenarios
in our minds. We will study both approaches, beginning
with model-free.
Action-value functions
Many model-free algorithms for reinforcement learning
are based on action-value- or Q-functions. A Q-function
takes as inputs a state s and an action a, and yields as
output the long-term result — the expected return —
achieved by taking action a when in state s.
Of course, that expected return depends not only on the
action a, but also on every action thereafter, till time T.
So Q makes sense only if we specify some policy μ for
all those subsequent actions.

47

Moreover, the expected return also depends on how
many time steps we have left till time T. Therefore Q-
functions are easiest to define if T = , because then
every t is equally far from T, and so we can define Q in
terms of s and a alone, regardless of what the present
time is. (Though the Q function is also useful in many
settings where T is not actually infinite but merely
somewhat large, as you will see in the sample code be-
low.)
Given a policy μ, we define Q
μ(s, a) to be the expected
return, from now on, if we are presently (at time t') in
state s, and we take action a, and from then on we
choose our actions using policy μ.
(63)
1...
( , ) ( , )
Σ γ ( , ( )).t t't t' t t
Q s a r s a
r s s  
 μ


Notice that the initial action a, at time t', need not fit the
policy μ. That is, Q
μ(s, a) still makes sense even when
a ≠ μ(s).
Policy gradients
One powerful method of model-free reinforcement
learning is to train a neural network Q to estimate Q
μ,
where μ is the current policy, implemented by another
network, and then improve the policy network by ad-
justing its parameters  μ up the gradient of Q. By the
chain rule, we have
(64) / / / .
μ μQ θ Q a a θ      
To compute the Q/a term, we backpropagate the
single-element vector [1] through the Q network —

48

just as we backpropagated the vector α in (45). Having
found Q/a in this way, we then compute Q/ μ
by backpropagating Q/a through the μ-network.
This method is called deep deterministic policy gradi-
ent, or DDPG. The next 6 sections describe its opera-
tions in detail.
Bellman equations
In DDPG, the neural network Q learns Q μ. And be-
cause the policy μ is also learning, and therefore chang-
ing, it follows that Q μ is changing as well, which means
Q must keep readjusting itself, so as to remain always
a good approximation to the current Q
μ.
How can Q learn Q
μ? Not by supervised learning, be-
cause there are no teacher signals that know the true
value of Q
μ(s, a). But useful error signals can be cre-
ated using the Bellman equation,
(65) 1 1( , ) ( , ) γE[ ( , ( ))].
μ
t t t t t tQ s a r s a Q s μ s  
μ
This equation follows from the definition of Q
μ in (63):
the left-hand side of (65) is, by definition, the expected
return from time t onward, given an initial state st and
action at, and the right-hand side is the same thing: the
reward r at time t given st and at, plus the discounted,
expected return from time t + 1 onward given policy μ.
This equation implies a Bellman error signal that can be
used to train estimates of Q
μ, namely
(66)
1 1
( , )
( , ) γ ( , ( )).
Q
t t
t t t t
e Q s a
r s a Q s μ s 

 

49

In other words, we don’t know Q
μ(st, at), so we don’t
know what Q(st, at) should be. But we do know that
Q
μ(st, at) is related to Q
μ(st+1, at+1) as laid out in equa-
tion (65), and therefore Q(st, at) should be related to
Q(st+1, at+1) in the same way, which means e
Q in (66)
should equal 0, at least on average.
In other words, eQ is a reasonable error signal because if
Q = Q
μ then eQ = 0 on average, and (it can be shown)
if eQ = 0 on average for all inputs (st, at) and  = 1 then
Q = Q
μ + c for some constant c, and that constant is
immaterial because the learning rule (64) uses the gra-
dient of Q, which doesn’t depend on c.
This technique of training Q to agree with its own fu-
ture values is called bootstrapping.
Target networks
How do we use a Bellman error such as (66) to train the
network Q? The obvious approach is to compute the
gradient of eQ-squared with respect to Q’s parameters,
 Q, and adjust  Q down that gradient. This approach
does work, but it is complicated because Q appears
twice in the formula (66) for eQ, and with different in-
puts in its two appearances. As a result, eQ/ Q is
more complicated than if Q had appeared just once,
and so the E[L]-landscape is more convoluted, and
learning is slower.
A simpler approach that often works better is to pretend
that Q appears just once in the Bellman error. We re-
place (66) by

50

(67)
1 1
( , )
( , ) γ ( , ( )),
Q
t t
t t t t
e Q s a
r s a Q' s μ s 

 

where Q' is a target network, separate from Q but with
the same structure: the same numbers of layers and neu-
rons, the same activation functions, and the same initial
weights and biases.
With (67), Q appears only once in eQ, and so it can
learn in a supervised way, by backpropagating eQ
through Q. And as learning proceeds, we also adjust
Q', nudging its parameters toward those of Q,
(68) ( ),
QQ' Q' Q'θ θ τ θ θ  
where τ is a positive number. Usually τ is small, say
0.001. Otherwise Q may fail to learn, because its error
signal eQ, in (67), is driving it toward a target function
Q' that is changing too quickly.
Replay buffers
To train Q, we need plenty of examples of eQ, for
many different states and actions. So it is useful to store
examples in memory, in a matrix called a replay buffer
or experience buffer. At each time step, we store a new
column vector in the buffer,
(69) 1[ ; ; ( , ); ].t t t t ts a r s a s 
It is common, especially when dealing with replay buff-
ers, to use s't as an alternative name for st+1, so we can
also write the vector in (69) as
(70) [ ; ; ( , ); ].t t t t ts a r s a s

51

Gradually, the buffer fills with many such examples of
the data that we need to compute eQ using (67). At each
time step, we randomly select a minibatch of examples
from the buffer, and use those to compute eQ and adjust
 Q, so Q learns from a large, varied mix of data.
When the buffer is full, the agent continues to update it,
making room for the newest column vector (69) by re-
moving the oldest one.
Off-policy learning
As time goes by, the agent’s policy, μ, changes though
learning. Therefore the columns of the replay buffer
will contain data generated by a variety of obsolete pol-
icies the agent is no longer using. If we select a column
(69) from the buffer for our current minibatch, then it
will contain an st together with an at that is probably not
the action the current policy would choose if it were in
state st:
(71) ( ).t ta μ s
But a nice feature of eQ, as defined in (66) and (67), is
that it is still useful even if at ≠ μ(st). As we saw in (63)
and the text right after it, Q
μ(s, a) is defined for any a,
not just μ(s), and so we can train Q using any a’s. This
type of learning, based on actions that don’t fit the cur-
rent μ, is called off-policy learning.
Off-policy learning is convenient because it lets us re-
use old data stored in a replay buffer. Some other rein-
forcement-learning algorithms (not covered here) need
on-policy data, and therefore can’t use replay buffers, at
least not without additional operations to try to correct
for the policy mismatch.

52

Rollouts
With off-policy learning, we can in principle learn Q
μ
with data from any policy at all, but it is better to train
on data close to those of the current policy, μ itself. The
main reason is that a Q network can almost never
match the true Q
μ(s, a) perfectly for all possible inputs
(s, a). At best, it approximates Q
μ(s, a), and the approx-
imation will be closer and more useful if its domain is
smaller and well chosen, i.e. if we don’t try to match Q
μ(s, a) over all (s, a) but instead focus on the crucial
subset, namely states that will actually be visited by an
agent using policy μ, and actions that will be chosen by
μ in those states.
Therefore we train Q based on data collected during
rollouts, which are episodes where the agent executes
its task using its current policy μ, e.g. if the agent is
learning to reach for target objects then each rollout is
one reach to one target. This method works well, and
also looks very much like biological learning, where an
animal learns a task using sense data it collects while
practicing that task.
Exploration
In DDPG the policy is stochastic. For instance, the ac-
tion at may be some deterministic function of st plus a
Gaussian random vector, usually small and with a mean
of zero. In this way, the agent can explore a range of
different actions in any one state. The amount of explo-
ration — the variance of the random vector — is chosen
by trial and error.

53

Pseudocode for DDPG
for rollout = 1 to nrollouts
choose a random initial state s0
for t = 0 to T
at = µ(st)
rt = r(st, at)
st + 1 = f (st, at)
add [st; at; rt; st + 1] to the buffer
choose a minibatch of columns [si; ai; ri; s'i]
from the buffer
for each column i in the minibatch
eQi = Q(si, ai) – ri – γ Q' (s'i, μ(s'i))
backprop eQi through Q to adjust  
Q
compute Q(si, μ(si))
backprop [1] through Q to get Q/ai
backprop Q/ai through μ to adjust 
μ
end
 Q' ←  Q'+ τ ( Q –  Q')
end
end
›› run DDPG.m
Costate equation
Here we consider a type of model-based reinforcement
learning which applies to episodic, or in other words fi-
nite-time or finite-horizon tasks, where the aim is to
maximize the return G of a movement from time 0 to a
finite end time T. For simplicity, we will assume the
discount factor γ = 1. So we have
(72) 0... 0... ( , ( )).Σ Σt T t t T t tG r r s μ s  

54

The policy μ is again a multilayer network with parame-
ters  μ, and our aim is to adjust  μ to maximize the av-
erage return E[G] over some repertoire of motions, e.g.
reaches from a variety of initial states.
To make those adjustments, we perform a lot of mo-
tions, and after each one we compute the gradient of its
return G with respect to each of its actions at, from t = 0
to T. To find those gradients, we note that at can influ-
ence G in two ways, through its effects on rt and on st+1.
Therefore by the chain rule,
(73)
1 1
1
/ / / /
/ / / .
t t t t t t
t t t t
G a r a G s s a
r a G s f a
 

         
       

This formula shows that we need G/st+1 to get G/at.
To find G/st, we again apply the chain rule,
(74)
1 1
1
/ / / /
/ /
/ / /
/ ( / / / ).
t t t t t t
t t t
t t t t t
t t t t
G s r s r a s
G s ds ds
r s r a s
G s f s f a s
 

          
 
        
        




Hence if we know (or can estimate) the functions f, r,
and μ, we can compute the final G/s,
(75) / / / / ,T T T T T TG s r s r a μ s         
and then sweep back in time, using (74) to compute all
the G/st in turn, down to G/s1. The G/st are
called costates, and (74) is the costate equation.
We plug these G/st into (73) to find the ∂G/∂at, and
use those derivatives to improve the policy, adjusting θ
μ
up the gradient

55

(76) 0.../ / / .
μ μ
t T t tG θ G a a θ      
Learning f and r
To apply the costate equation (74), the agent must esti-
mate the state dynamics f and the reward function r. It
can learn f by backprop based on the error signal
(77) 1( , )
f
t t te f s a s  
and r based on
(78) ( , ) .r t t te r s a r 
As in DDPG, we train our networks based on mini-
batches drawn from a replay buffer.
In some tasks, the agent already knows the reward func-
tion before training begins, and so doesn’t need to learn
it. In that case, we can omit the network r and equa-
tion (78). The posted code CPG.m shows a task of that
kind.
Training μ
We train the policy based on imaginary rollouts: start-
ing from a minibatch of initial states drawn from the
buffer, we run the rollouts forward in time, computing
rewards and new states at each time step using r and
the model f . (It would be cheating to use minibatches
for real rollouts, because an agent can’t make multiple
real movements at once, but for imaginary rollouts,
minibatches are allowed).
We use (74) and (73), with f  and r in place of f and
r, to compute ∂Gi/∂ati for each time point t and each

56

rollout i in the minibatch. With those derivatives we ap-
ply (76) to compute ∂G/∂θ
μ, where G is the sum of all
the returns Gi in the minibatch. And we use ∂G/∂θ
μ to
adjust μ.
Pseudocode for Costate Policy Gradient
for rollout = 1 to nrollouts

// Run a real rollout to gather data for the buffer
choose a random initial state s0
for t = 0 to T
at = µ(st)
rt = r(st, at)
st+1 = f (st, at)
add [st; at; rt; st+1] to the buffer
end

// Train  f  and r
choose a minibatch of columns [si; ai; ri; s'i]
from the buffer
for each column i in the minibatch
e
f
i =  f (si, ai) – s'i
backprop e
f
i through  f  to adjust  
f 
e
r
i = r(si, ai) – ri
backprop e
r
i through r to adjust  
r
end

// Train μ with an imaginary rollout
choose a minibatch of initial states s0 from the buffer
for t = 0 to T
at = µ(st)
st+1 =  f (st, at)
store [st; at]
end
…

57

compute r(sT, aT)
backprop through r to get rT
∂G/∂aT = ∂rT/∂aT 
compute ∂G/∂sT using (75)
set ∂G/∂θ
μ = 0
for t = T – 1 to 0 in steps of –1
compute r(st, at)
backprop through r to get rt
backprop ∂G/∂st+1 through  f  to get ∂G/∂st+1 f


compute ∂G/∂at using (73)
backprop ∂G/∂at through µ
add the result to ∂G/∂θ
μ
compute ∂G/∂st using (74)
end

adjust μ based on ∂G/∂θ
μ

end

›› run CPG.m
Other resources
Kirk DE (1998) Optimal Control Theory: an Introduc-
tion. Dover Publications (discusses the costate equa-
tion, though in continuous time)
Lillicrap TP et al. (2016) Continuous control with deep
reinforcement learning. arXiv:1509:02971 (presents
DDPG)
Silver D et al. (2014) Deterministic policy gradient al-
gorithms. Journal of Machine Learning Research
Workshop & Conference Proceedings 32.)
Sutton RS, Barto AG (2018) Reinforcement Learning:
An Introduction.

58

5 Regularization

When we learn a function relating the x’s and y*’s in a
training set of examples (a sample), we want that rela-
tion to generalize, i.e. we want it to hold true for other
x's and y*’s besides those in our sample. It is a serious
challenge: in many learning problems it is easy enough
to produce a network that predicts y*’s from x’s very
accurately when the net has already seen those specific
x’s and y*’s during its training; but faced with a new x
it has not seen before, the net’s accuracy often drops
markedly. This situation — good accuracy on the train-
ing set but poor on other data — is called overfitting.
Methods of reducing overfit are called regularization.
The central problem is that for any training set, there
will always be many different functions that fit the data
points in the set equally well but make incompatible
predictions about points not in the set. Suppose we have
a learning problem where our entire training set {(xt,
y*t)} consists of just 3 examples: (0.1, 0), (0.2, 0), (0.8,
0), i.e. our x’s take the values 0.1, 0.2, and 0.8, and our
y*’s are all 0s. Our target function may be the zero
function, which takes all x’s to 0. But then again the tar-
get may be some other function that happens to equal 0
when x is 0.1, 0.2, or 0.8 but is far from 0 elsewhere. If
we guess wrong then we will make large errors when-
ever x is not 0.1, 0.2, or 0.8.
Ensembles
One way of coping is to train several networks — an
ensemble — and have them vote on answers. We might
train 10 networks and then present each novel x to all

59

10 of them, taking the average of their predictions as
our guess. Mistakes made by the different nets often
cancel out to some extent, so their averaged guesses are
more accurate than their individual ones.
Capacity
Another coping method is to shrink our network, or oth-
erwise restrict it in some way, to reduce its capacity, or
in other words the size of its hypothesis space, which is
the set of all functions that network could ever possibly
compute, given any possible setting of its weights and
biases. The rationale is that overfitting occurs because
our hypothesis space contains functions which are not
the desired function but nonetheless fit the training data.
If the hypothesis space were smaller then there might be
fewer of these false fits. And in practice we can indeed
reduce overfitting by using networks with fewer adjust-
able parameters.
But if we make the hypothesis space too small then it
won’t contain any functions that are a decent match to
the desired one. Ideally, we want the smallest hypothe-
sis space that nevertheless contains all the functions we
will ever have to learn. For instance, if we know, some-
how, that the target function is linear then we might
choose a hypothesis space consisting only of linear
functions. But we rarely have that much prior
knowledge about our targets.
What we can do in practice is monitor our network’s
performance on both training data and a separate set of
non-training examples called the validation set, distinct
from both the training set and the test set. If our net-
work can’t learn to do a good job even on the training

60

data then its capacity is likely too low. If it does much
better on training than on validation data then it is over-
fitting, and might benefit from a smaller hypothesis
space, or from other regularization methods we will dis-
cuss in a moment.
Data Augmentation
If we can manage it, the best antidote to overfitting is to
get more training data. Consider again the case where
we had just 3 training examples, all compatible with the
zero function. If we had a million examples instead of
just 3, with a million x’s evenly packed over the range
from 0 to 1, and the y*’s all 0, then we could be more
certain that our target really was the zero function, at
least over the range [0, 1]. In other words, overfitting is
reduced when the training set is large.
Unfortunately, even sets that seem quite big can still be
too small to prevent overfitting, e.g. the MNIST data-
base of handwritten digits contains a training set of
60,000 examples, but when we use that set to train a
network to recognize handwritten digits, we get serious
overfitting. On any big problem we want truly vast
numbers of examples.
If we can’t get enough examples then it is often useful
to make fake ones, provided we take care that our fake
examples really do resemble real ones our net may face
after its training is done. This technique is called data
augmentation.
For example the MNIST training set includes about
6000 examples of handwritten “5”s. We can take those

61

6000 images and tweak each one in various ways, shift-
ing it slightly horizontally or vertically, stretching or
compressing it, rotating or shearing it, and so get vast
numbers of new examples, all of which are plausible
“5”s which a human might actually write. Applying
these transformations (for all the digits, not just the
“5”s) greatly improves the trained network’s generaliza-
tion. And we can do even better by augmenting even
more, e.g. the tweaking I just described included only
affine transformations (linear transformations plus
shifts), and we can do better by adding certain kinds of
nonlinear warping.
It is important that the transformations we use to aug-
ment our data produce new examples that are plausible.
It would be a bad idea, for instance, to create fake
MNIST examples by mirror-reflecting real ones upside-
down, because a handwritten “5” reflected upside-down
would not resemble any “5” that a real human is likely
to write; it would look more like a strange “2”. And an
upside-down “6” would look like a “9”. Used as train-
ing data, these fakes would confuse rather than aid the
network. Data augmentation should alter our training
examples so they resemble the huge set of real varia-
tions the net will encounter after its learning is done.
That said, it is often possible to reduce overfitting at
least a little by tampering with training data in ways that
are not so obviously guaranteed to produce plausible
fakes. An example is a method called dropout.

62

Dropout
Here we add noise to a network by randomly silencing
many of its hidden neurons, and sometimes input neu-
rons as well. We might stipulate that each hidden neu-
ron has a 50% chance of switching off for any one ex-
ample. As each example arrives it will evoke a re-
sponse, and drive learning, in some random set of hid-
den neurons, about 50% of the total, and the other 50%
will have zero activity. Then in the test phase, after
training is done, we let all the neurons respond to each
example, and we halve the weights in the network to
compensate for the doubled numbers of active cells.
This method improves generalization, though the rea-
sons are not entirely clear. It may be a form of ensemble
learning: in a sense each example trains a different net-
work, because a different set of neurons is switched on
at each moment, and then in the test phase we blend the
outputs of all those networks.
Or it may be a kind of augmentation, creating novel
training examples by contaminating them with noise.
Why might noise help? We can think of it as breaking
up spurious regularities. For example, it might happen
by sheer coincidence that all our training images of “5”s
have a pure black pixel at one consistent locus, and so
our network might learn to trust that pixel as an identi-
fier of “5”s. But tested on other “5”s, outside the train-
ing set, which don’t have a black pixel at that spot, the
network will fail. The likelihood of any one, specific
coincidence of this type may be tiny, but there are vast
numbers of such coincidences that might arise, and in
practice they do exist in all finite training sets. Noise
can disrupt these spurious features.

63

›› run DROPOUT.m
Simplicity
It is often useful to assume that our target functions are
simple, in the sense that their graphs are smooth rather
than wiggly. If multiple functions in our hypothesis
space fit our training data fairly well then we count the
smooth, simple functions as more probable than the jag-
ged, complicated ones. In other words, we assume the
world is simple, as the philosopher William of Occam
urged in the 14th century.
It is not clear why that is a useful assumption, but in
practice it does reduce overfitting. Part of the reason
may be that data are often distorted by noise, which
makes the relation between x and y* look more compli-
cated than it really is. In other words, complexity can be
a figment of noise, and should be viewed with skepti-
cism, so we do well to favor simple solutions. We will
see some examples shortly.
Simplicity can be defined in different ways. The most
common approach is called Tikhonov regularization,
where we replace our old, least-squares loss (34) with
(79)
2 2
* Σ { } ,lL y y λ W l  
where λ > 0 and ‖W{l}‖2 is the sum of the squared ele-
ments of W{l}. So now we are minimizing not just our
approximation errors but also the sum of the squared
weights, which is a measure of complexity in the sense
that larger weights yield wigglier functions while
smaller weights yield flatter ones.

64

To see the advantages of Tikhonov regularization, run
the m-file Occam.m with different settings of its varia-
bles pdt, ns, nsd, pda, and lam. Occam defines a target
function y* = τ(x) — different each time it is run —
which is a polynomial of degree pdt. It chooses n_ex
examples of x’s and y*’s, adds Gaussian noise with
standard deviation nsd to the y*’s, and presents this cor-
rupted sample to a learning algorithm. The algorithm
doesn't know the target function or pdt, but finds the
polynomial approximation of degree pda that minimizes
the expected Tikhonov-regularized loss (79) with lam
playing the role of λ. You will see that when the data
are few and noisy, and pda is large and lam is 0, then
your approximations will be poor between the data
points, but you can improve them with regularization,
either by shrinking pda (so as to permit only lower-de-
gree, i.e. simpler, polynomials) or by increasing lam.
›› run OCCAM.m
Occam_multitrial.m runs large numbers of trials like
those in Occam.m, to confirm that on average we get
better generalization with a simpler hypothesis space or
Tikhonov regularization.
›› run OCCAM_MULTITRIAL.m
Tikhonov regularization is easy to implement in a net-
work. It involves descending the gradient of (79) in-
stead of (34), but the only difference between the two is
the second addend in (79), namely λ Σl ‖W{l}‖2, whose
derivative with respect to any one weight, W{l}i,j, is
2λW{l}i,j, a scalar multiple of the weight itself. To de-
scend this new gradient, then, all we do is add –2λW{l}
to the gradient computed by backprop, i.e. learn by
backprop as usual, and also have each weight shrink

65

slightly toward 0 at each time step, at a rate propor-
tional to its own size. This mechanism is called weight
decay.
Autoencoders
One way to simplify and regularize networks is to re-
duce the dimensionalities of the signals they process.
Biological data are often very high-dimensional, but
they are also usually compressible. For instance, your
visual information is coded by the quarter-billion recep-
tor cells (rods and cones) in your retinas, but that infor-
mation is then processed for transmission to visual cor-
tex on the mere 2.5 million axons of your optic nerves,
i.e. the original signal, of a quarter-billion dimensions,
is condensed 100-fold. In general, reducing dimension-
ality lets an agent learn faster and with smaller net-
works.
One straightforward and perhaps brainlike approach is
to use an autoencoder, a neural network that creates
low-dimensional representations of data by learning to
make its output as similar as possible to its input. For
instance, we might build an autoencoder to create low-
dimensional representations of MNIST images of hand-
written digits. Its first layer would have 784 neurons, to
represent the greyscales values of the 784 pixels in an
MNIST image. The other 4 layers might have neuron
counts of 300, 30, 300, and 784 respectively — the cru-
cial thing here is the tight middle layer, or bottleneck. If
the network learns by backprop to make the activities in
its final layer resemble those in its first layer, then that
means it has squeezed a lot of information about the
original, 784-pixel image through a bottleneck of just
30 neurons.

66

Why use 5 layers in this example? Why not just 3, with
neuron counts of 784, 30, 784? In practice it works bet-
ter to compress the data in stages, maybe because it
breaks a difficult compression problem into easier parts.
›› run AUTOENCODER.m
Other resources
Goodfellow I, Bengio Y, Courville A (2016) Deep
Learning. MIT Press.
Srivastava N et al. (2014) Dropout: a simple way to pre-
vent neural networks from overfitting. Journal of Ma-
chine Learning Research 15: 1929–1958

67

Glossary

action. In reinforcement learning, the agent’s output, a.
action-value function. Given the current state s, an ac-
tion a, and a policy , the action-value Q(s, a) is the
expected return (i.e. the time-integral of the rewards we
will accumulate from now to some final time T) if we
start in state s, we choose action a (which need not
agree with policy ) at the first time point, and from
then on we apply policy .
adam. a formula for updating network weights and bi-
ases based on present and past gradients computed by
backprop.
adversarial training. a method where a learning net-
work improves by training on challenging examples
provided by a second, “adversary” net. The adversary
has full knowledge of the learner’s structure and param-
eters, and uses it to create examples that are likely to
fool the learner, i.e. it identifies the learner’s weak-
nesses and provides special training to overcome them.
agent. in reinforcement learning, the entity that per-
forms actions to achieve goals; called the controller in
control theory.
Bellman equation. Given any deterministic policy ,
the (one-step) Bellman equation is Q(st, at) = r(st, at) +
∆t E[Q(st+1, (st+1))], where Q is the action-value
function,  is the discounting factor, and st and at are the
state and action at time t. The multistep or n-step Bell-
man equation is Q(st, at) = r(st, at) + E[ i=1..n–1 
i r(st+i,
(st+i)) + 
n Q(st+n, (st+n)) ].

68

Bellman error. an error signal derived from a Bellman
equation, used for training estimates of Qμ. For instance,
the one-step Bellman Q-equation expresses a property
of the Q-function, Q(st, at) = r(st, at) +  E[Q(st+1,
(st+1))], which can be turned into an error signal to
train estimates of Q, namely eQ = Q(st, at) – r(st, at) –
Q(st+1, (st+1)).
Bellman optimality equation. the Bellman equation
for the optimal policy *, e.g. the one-step version is
Q*(st, at) = r(st, at) +  E[mina' Q*(st+1, a') ], where Q*
is the optimal action-value function.
bootstrap. train an estimate of the action-value, Qμ, or a
related function such as the return or value, using a
Bellman error.
conditioning. Any smooth function is bowl-shaped
near its minima, i.e. it approximates a quadratic func-
tion, of the form α + βT +  TB, where  is the func-
tion argument, α is a scalar, β is a vector, and B is a pos-
itive-semidefinite matrix. If the eigenvalues of B differ
greatly in size (i.e. if the bowl is much narrower and
steeper in some dimensions than in others) then gradi-
ent-descent methods take a long time to reach the mini-
mum, because no one learning rate constant η is well
suited for both the shallow and the steep dimensions. In
that case, the function and the learning problem are said
be poorly conditioned.
controller. an agent, in the terminology of control the-
ory.
control signal. in control-theory terminology, the out-
put, often called u, of a controller (i.e. an agent) —
analogous to the action, a, in reinforcement learning.

69

convex optimization. finding the optimum of a convex
function, i.e. one with only a single optimum, rather
than many local optima. More precisely, a convex func-
tion may have multiple optima, but they form a con-
nected set and all have the same value, i.e. none is bet-
ter than any other. Examples of convex optimization
methods are least-mean squares learning (LMS), nor-
malized least-mean squares learning (NLMS), recursive
least squares (RLS), and backprop in the final layer of a
network. An example of non-convex learning is back-
prop through a network of 3 or more layers. Convex op-
timization is usually faster, and easier to prove theo-
rems about, than non-convex optimization.
convolutional network. a network where neurons form
maps, i.e. each neuron in any one layer l receives input
from only a small contiguous patch of neurons in layer l
– 1, and neighbouring neurons in layer l receive their
input from neighbouring patches; and further, all neu-
rons in any one map share their weights, i.e. their in-
coming synaptic weights have the same numerical val-
ues, arranged in the same spatial pattern within their in-
put fields.
cross-entropy loss. a loss function L = –y*T log(y) that
is useful for networks with softmax outputs.
DDPG. The deep deterministic policy gradient algo-
rithm.
decision boundary. Many neural networks receive
vectors x, belonging to some high-dimensional space X,
as inputs (e.g. x might be a set of greyscale values of
pixels in an image, and X the space of all such images
of that size and shape), and they learn to classify the
vectors (e.g. identifying x as an image of a handwritten

70

“7”). A decision boundary is a hypersurface in X sepa-
rating one class from another, e.g. separating the x-vec-
tors representing “3”s from x’s representing “8”s.
deep deterministic policy gradient (DDPG). an algo-
rithm for learning policies in tasks where the state and
action spaces are continuous (i.e. the sets of all possible
states and actions are continua, not finite sets). The al-
gorithm uses layered networks for the policy and to es-
timate the action-value function.
deep learning. backprop-based learning in networks of
3 or more layers (or 4 or more according to some au-
thors).
dense network. a network where each neuron in each
layer projects to all neurons in the next layer. Also
called a fully-connected network.
discounting. In reinforcement learning, we try to max-
imize the expected return, where the return is the time-
integral of discounted future rewards up to some time T,
i.e. R = rt +  rt+1 + 2 rt+2 + …, and  is a scalar dis-
counting factor between 0 and 1, typically about 0.99.
Discounting implies that we care less and less about re-
wards further and further in the future. One motivation
is that we are less and less certain about those future re-
wards. Another is that discounting helps us avoid infi-
nite returns.
dropout. a method of reducing overfitting by randomly
switching off neurons in a network during learning.
early stopping. a method of reducing overfitting by
halting learning when performance begins to decline on
a validation set.

71

feedforward network. a neural network with no loop-
ing connections, i.e. where no neuron projects to itself
directly or via other neurons.
fully-connected network. a dense network.
generative model. a network that learns to output fake
but realistic data. Given a set of examples of, say, hand-
written numerals, the model learns to produce new im-
ages, different from those in the training examples, that
look like handwritten numerals, with variants appearing
with realistic probabilities, e.g. if 5% of the “7”s in the
training examples have crossbars then 5% of the
model’s “7”s should also have crossbars.
GPU. the graphical processing unit, a part of a com-
puter, separate from the CPU (central processing unit),
designed to handle computations for fast graphical out-
put, e.g. in computer games. GPUs are fast because they
perform thousands of operations in parallel, whereas
most current CPUs manage only about 8 operations in
parallel. Though originally designed for graphics, GPUs
are now also used for non-graphical machine-learning
operations, such as matrix multiplications, which bene-
fit from parallelization.
gradient noise. In most learning methods, we would
like to know the gradient of the expected loss or return,
but in fact we have only an estimate of that gradient
based on one or a few examples (a minibatch). These
estimates will vary around the true gradient depending
on the specific examples. This variance of the estimate
is called gradient noise.
hard-zero saturation. the fact that some types of artifi-
cial neurons have activation functions that are flat over

72

some range of inputs, e.g. relu neurons output 0 when-
ever their inputs are below zero. As a result, the deriva-
tives that drive learning in the neuron are also 0 when
its example inputs are in the flat range, and so the cell
cannot learn from those examples.
horizon. In some reinforcement learning tasks, we inte-
grate rewards over only a finite stretch of time, i.e. to a
finite final time T. These are called finite-horizon (or
episodic) tasks. Other tasks have an infinite horizon,
though these often use discounting, and then authors
may still write loosely of a “horizon”, e.g. saying that a
small discounting factor  results in a close horizon.
Still other tasks maximize an integral that runs always
some finite time into the future, e.g. always 100 ms for-
ward from the current time; these are receding-horizon
tasks.
KL divergence. Kullback-Leibler divergence, a meas-
ure of the similarity of two probability distributions,
sometimes used with softmax networks, whose outputs
can be regarded as probabilities.
L2 loss. the squared error loss function, defined as
L = (y – y*)T(y – y*).
LSTM. long short-term memory, a kind of neuron used
in some recurrent nets to control how certain signals
persist or fade away as time passes.
maxout unit. a neuron whose activation function con-
sists of multiple affine segments.
mel scale. a series of pitches which humans judge to be
equally spaced, used in studies of phoneme recognition.

73

MNIST. a data set used to train networks to read hand-
written numerals, with 60,000 training examples, each a
28-by-28-pixel image of one of the digits “0” through
“9”, and additional test examples.
objective function. the thing we want our learning al-
gorithm to optimize, e.g. expected loss or return.
observation. In this course, we assume for simplicity
that the agent knows the state, s, of the environment at
each time point. In reality, an agent may not know s in
its entirety, but may instead make a more limited obser-
vation o (i.e. gather some sensory or other data) at each
time point, where o may not contain enough infor-
mation to specify s.
off-policy training. In reinforcement learning, we often
learn the action-value function Q of a policy  by gra-
dient descent, using the Bellman error eQ = Q(st, at) –
r(st, at) + Q(st+1, (st+1)). This choice of error makes
sense because if the estimate Q equals the true Q then
E[eQ] = 0 on average. Further, that equality holds
whether or not at equals (st), which means we can, if
we wish, learn Q using at’s that do not fit the policy .
If we do use at  (st), we are training off-policy. Our
motivation might be to explore a wider range of actions
than those chosen by , or to reuse training data r(st, at)
that were sampled in the past, when the agent was using
a different policy.
one-hot. a kind of desired output vector for a network,
where there is always one element of the vector equal to
1 and the others are all 0s.

74

overfitting. a situation where a learner improves its ac-
curacy on its training examples, but gets worse on test
data, i.e. on data it has not seen during its training.
pooling. Convolutional nets often contain pooling lay-
ers, which combine data from multiple pixels, discard-
ing some information, e.g. a max-pooling layer might
act on a 24-by-24 pixel greyscale image to create a 12-
by-12 image where the grey level of each pixel in the
smaller image is the maximum over the 4 grey levels of
4 adjacent pixels in the larger image.
prior. prior probability, i.e. the probability of an event
judged before some new data arrive, e.g. I might judge
it unlikely that it rained last night, but then look out the
window and see that the roads are wet; the prior proba-
bility that it rained was low, but the posterior probabil-
ity, after I have looked at the roads, is high. More gen-
erally, priors can be any information or assumptions
which I build into a network before it learns, for regu-
larization e.g. if I have prior knowledge that some func-
tion I want to learn is quadratic, then I can design a net-
work whose hypothesis space consists only of quadratic
functions.
Q-function. the action-value function.
Q-learning. a method of reinforcement learning where
the agent learns the optimal Q-function directly, rather
than through a series of policy-specific Q’s, using e =
Q*(s, a) – r(s, a) – maxa'Q
*(s', a') or e = Q*(s, a) –
r(s, a) – maxa'Q–(s', a'), where Q– is a target function.
The agent’s policy is μ(s) = argmaxaQ(s, a). Hence Q-
learning works only in problems where it is possible to
compute that argmax efficiently. It is likely not useful
for sensorimotor systems in the brain, where the argmax

75

would be hard to compute because Q is a deep net-
work.
recurrent network. a neural network with looping con-
nections, i.e. where one or more neurons project to
themselves, either directly or via other neurons.
regularization. any method used to reduce overfitting.
reinforcement learning. learning based on rewards, as
opposed to supervised learning.
relu. rectified linear unit, i.e. a neuron whose activation
function is linear (and in fact equals the identity func-
tion) except for a cut-off at zero — y{l}i = max(0,
v{l}i).
replay buffer. In reinforcement learning there is often a
network that learns a function (such as the action-value)
based on training examples showing how various ac-
tions in various situations yielded specific rewards. In
some methods, the control system uses these training
data — states, actions, rewards — as it collects them,
e.g. it might carry out a reaching movement, collect the
current state, action, and reward at each moment as the
reach proceeds, immediately use those data to adjust its
network parameters, and then discard the data to make
room for the next state, action, and reward. A problem
with this approach is that, during a smooth movement
like a reach, the successive states are very close to-
gether, and so the learner gets a series of very similar
training examples. These examples may push the
learner’s parameters toward values that are optimal for
that small region of state space but not very good over-
all. We can avoid that problem by storing many training
examples in a replay buffer, and adjusting parameters at

76

each time step based not on the current example but on
a minibatch of training data from many different times
and reaches, drawn randomly from the buffer.
return. time-integral of rewards.
reward. Any reinforcement learning task begins with a
reward function r(s, a) which defines what the agent is
trying to achieve. Specifically, the agent’s aim is to
maximize the return, which is a time-integral of the re-
ward.
SARSA. a method of reinforcement learning, identical
to Q-learning except that the agent learns using e =
Q(s, a) – r(s, a) – Q(s', a') or e = Q(s, a) – r(s, a) –
Q–(s', a'), where the next-step s' and a' are both sam-
pled, and Q– is a target function. So SARSA, unlike Q-
learning, doesn’t compute argmaxa'Q for its learning,
but it does still use the policy μ(s) = argmaxaQ(s, a). It
is called SARSA because it computes its e based on a
state, action, reward, and the subsequent state and ac-
tion.
self play. training agents by having them compete
against other agents, usually in a simulated environ-
ment.
SIFT. the scale-invariant feature transform, a computer-
vision algorithm that detects certain image features that
are useful for object recognition. These features were
designed by computer scientists, whereas convolutional
nets instead learn useful features by backprop.
softmax. When training a one-hot network, it is often
helpful to let the final-layer neurons be affine (even if
the upstream neurons are, say, relu or tanh), and then

77

define the net’s output y to be a softmax function of the
final-layer signal y{nl}, i.e. define the i-th element of y
to be yi = exp(y{nl}
i)/j exp(y{nl}
j). This formula en-
sures that y’s elements are all between 0 and 1, and sum
to 1, like the elements of a one-hot y* (and also like
probabilities, so we can regard each element yi of the
softmax output as the network’s estimate of the proba-
bility that the input belongs to class i).
state. roughly, a vector which contains maximal infor-
mation about the future. 1) In a deterministic environ-
ment, the state is a vector s which contains enough in-
formation about the world that, given s at just one time
t, and the actions at time t and all subsequent time
points, it is possible in principle to predict the whole fu-
ture evolution of s. 2) In a stochastic environment, the
state contains enough information that, given s at time t,
and the actions at time t and subsequently, we can com-
pute the probabilities of all future s, and those probabili-
ties would not be altered if we had any additional infor-
mation from time t or earlier.
stochastic gradient descent. online gradient-descent
learning where parameters are adjusted after each single
example or minibatch, using a gradient estimate based
only on that example or minibatch.
supervised learning. a method where a learner receives
input vectors x, computes outputs y, and gets error feed-
back e that specifies the difference between its output
and some desired output y*, e.g. e = y – y*. The learner
adjusts its parameters to reduce some non-negative sca-
lar function of e, such as the squared error eTe.
tanh. the hyperbolic tangent function.

78

target function. 1) the function τ a network is trying to
learn. 2) a function such as Q– in DDPG, used to sim-
plify gradient descent.
u. in control theory, the control signal, analogous to the
action, a, in reinforcement learning.
unsupervised learning. any method where a learner
adjusts at least some of its parameters based on a crite-
rion that does not involve its output error or reward, e.g.
it might look for a way to compress its input data into a
lower-dimensional representation before worrying
about reducing its output errors or maximizing its re-
turns.
validation set. To combat overfitting, we may withhold
a subset of the training data called the validation set, so
the learner cannot use those examples to adjust its pa-
rameters. We use the validation set to monitor for over-
fitting, periodically testing the learner on those withheld
data to see whether its performance is getting worse. If
it does begin to worsen on the validation set then we
may halt its learning, even though it is still improving
on the non-withheld training data, to avoid overfitting
— a method called early stopping.
value function. The value V μ(s) of a state s, given a
policy μ and a reward function r, is the expected time-
integral of r(s, a), possibly discounted, from now to
some final time T, if we start in state s and use policy μ.
One way to state the aim of reinforcement learning is
that it tries to find policies that maximize the value
V μ(s) at every state s.

79

whitening. applying a linear transformation T to a ran-
dom vector v so that the covariance matrix of w = Tv is
the identity matrix, i.e. the elements are w are uncorre-
lated with each other, and each has variance 1. As a re-
sult, even if the vectors v form an elongated, rotated
cloud, the w’s will form a spherical ball. If v has dimen-
sionality > 1 then there are many different linear trans-
formations that will achieve whitening, because if T
works then so does OT, where O is any orthogonal ma-
trix (because rotating a ball of w vectors leaves it in the
form of a ball). Whitening input vectors, or vector sig-
nals between layers in a deep net, can speed up learning
by improving conditioning.
ZCA whitening. zero-phase component analysis whit-
ening, a form of whitening where the linear transfor-
mation T is the square root of the covariance of v. It is
the version of whitening that yields basis vectors as
close as possible to the basis vectors originally used to
represent v.

欢迎咨询51作业君