THEORETICAL PHYSIOLOGY PSL432 University of Toronto DOUGLAS TWEED i Course outline This course deals with the mathematical principles of control and learning in the sensorimotor systems of the brain. The course director is Douglas Tweed, reachable at
[email protected] or 416-912-7497. The teaching assistant is Ali Mojdeh, ali.mojdeh@ mail.utoronto.ca Lectures are Tuesdays 2–4 p.m., office hours by ap- pointment. For all these events, use the Zoom link an- nounced on Quercus. Format We will run an inverted classroom: before each class (except the first one), you should finish that week’s as- signed readings from these handout notes. If anything in the notes is unclear, you may ask questions about it in class, and answer other students’ questions. Materials There is no text book. All course material will be cov- ered in these handout notes and computer code posted on Quercus. ii Evaluation There will be no tests or exams. All course marks will be based on 3 take-home assignments, worth 25%, 40%, and 35%. The assignments will be posted on Feb 1, Mar 1, and Mar 22, and will be due 2 weeks later, not counting Reading Week, i.e. they will be due one mi- nute before midnight on Feb 22, Mar 15, and Apr 5. The penalty for late submission will be 5% (of the total marks for that assignment) per day late. Assignments may include calculations and proofs, but will mainly in- volve writing computer simulations of sensorimotor control and learning. Simulations You will write all your simulations in the Matlab pro- gramming language. You can get Matlab for free through https://onesearch.library.utoronto.ca/ic/mathworks-soft- ware-%E2%80%93-matlab-students You need only Matlab itself, not any of its “toolboxes”. Starting Jan 18, each class will end with a simulation workshop, where you can apply the day’s new concepts on your own computer, so make sure you have Matlab installed by then. iii Schedule Date Topic Jan 11 Control Systems Jan 18 Control Systems Jan 25 Control Systems Feb 1* Supervised Learning Feb 8 Supervised Learning Feb 15 Reading week T Feb 22 Supervised Learning Mar 1* Sensory Maps Mar 8 Reinforcement Learning Mar 15 Reinforcement Learning Mar 22* Reinforcement Learning Mar 29 Regularization Apr 5 Regularization * Days when problem sets will be assigned. You will have 3 weeks for the first, and 2 weeks for the second and third, i.e. they will be due one minute before mid- night on Feb 22, Mar 15, and Apr 5. iv 1 1 Control systems In this course we will study how sensorimotor systems in the brain learn to control the body. The mathematical principles of learning and control have become clearer in recent years thanks to research in artificial intelli- gence and robotics, so we will borrow concepts from these disciplines to help us understand the brain, and conversely, we will consider how ideas from neurosci- ence may benefit artificial systems. State dynamics Any sensorimotor task starts with an agent of some sort embedded in an environment. We write st for the state of the environment at time t, and at for the agent’s ac- tion at this time (where both st and at are column vec- tors). For example, the agent might be the saccadic system in the brain, which rotates the eyes quickly to aim them at an object of interest. In that case st might include the current positions of the eyes and the location of the vis- ual target. And at might be the motor command — the firing rates of the motoneurons to the muscles that move the eyes. In this course, we assume that time passes in discrete steps of some fixed size ∆t, because that approach is convenient for computer simulations. With each step, the state takes a new value, 2 (1) ( , ),t t t ts f s a where f is called the state-dynamics function. Simulating dynamics In our equations, time passes in discrete steps. But real time is continuous, and real processes obey differential equations, such as ds/dt = g(s, a). We can approximate such systems in our discrete-time framework using Eu- ler integration, where we say (2) d ( , ) ( , ). d t t t t t t t t t s s f s a s s s t s a t In other words, we compute ds/dt at time t and pretend that the velocity stays constant over the whole time in- terval [t, t + Δt). This approximation will usually be more accurate when ∆t is smaller, but small ∆t’s are computationally expensive, because it takes a lot of them to cover, say, 1 s of time. In this course, a good compromise will be to choose ∆t in the range 0.1 to 0.001 s. You can find an example in the code file Euler _integration.m, posted online (and see Getting Started with Matlab, near the end of this chapter). We will call ∆st/∆t the discrete-time velocity of s. ›› run EULER_INTEGRATION.M Policy The agent’s job is to choose actions, a, that achieve its aims, given the rules laid out by the state dynamics. In this course we will usually assume that the agent, based 3 on its sensory inputs, memories, and intentions, knows st in full at every time t. That is, we consider situations where state feedback is available, not ones where the state is only partially observable. The rule the agent uses to compute a’s from s’s is called its policy, , (3) ( ).t ta μ s A policy may be stochastic, e.g. it may choose its action at by computing some deterministic function of st and then adding Gaussian noise, i.e. a random variable with a Gaussian, or normal, distribution. Rewards An agent needs a policy that will achieve its aims, but what are these aims? We assume the agent receives or computes, at each time step, a signal called a reward, rt, usually determined by the current state and action, (4) ( , )t t tr r s a The reward function defines what the agent wants. A high reward indicates something the agent finds pleas- ant, which might be light, or dark, or heat, or cold, or some food, or having some visual target perfectly cen- tred on the fovea, whereas a lower reward is less pleas- ant. The reward at any moment is a scalar: it may be com- puted from the vectors st and at, but it expresses the overall pleasantness of the situation in a single number. The agent’s job is to choose actions that yield high re- wards. 4 For example, suppose the agent wants to hold s near a target state s = 0. We can express that wish mathemati- cally by defining a reward that is maximal when s = 0, and is otherwise much lower, such as (5) ( , ) tanh( )t t t t tr s a s s T This rt attains its maximum, namely 0, when s equals its target value of 0. But if s is even a little off target then rt is close to its minimum of –1. It may seem odd that the reward (5) is defined with a negative sign and has a maximal value of 0 — i.e. the best reward we can ever get is nothing. But rewards of this type are common in sensorimotor systems, many of which try to eliminate some bad thing, such as being off-target, and their highest reward is to zero that bad thing. Equation (5) also shows that rt doesn’t always depend on both st and at, but I have written r(st, at) nonetheless because in general, rewards do depend on both. Returns I have introduced the idea of rewards as a step toward expressing an agent’s aims mathematically, but many practical aims can’t be defined simply in terms of max- imizing a reward at each moment. Instead, most purpos- ive behaviour aims to maximize the time-integral of a reward, called the return, which (for some reason) is usually represented by a letter G in formulas. For example, suppose we want to drive a mechanical system such as an arm or leg to a target state as quickly 5 as possible and hold it there — a task called time-opti- mal control. If the target state is 0 then (5) is an apt re- ward function, but we should not aim to maximize this reward at each moment. If we do, we will drive s to 0 at high speed, so as to raise rt quickly, but then we may have so much momentum that we overshoot the target, and rt plummets. Essentially, if we maximize rt at each moment then we won’t hit the brakes in a timely fash- ion. Policies of this type, which maximize immediate reward at each instant with no regard for future conse- quences, are called myopic. We don’t want a myopic policy that maximizes imme- diate reward, we want a policy that cares about the fu- ture, and so tries to maximize the discrete-time integral of the reward from the present moment, t = 0, to some future time T, (6) 0...Δ Σt T tG t r where T may be . This integral formulation expresses the aim of time-optimal control, because clearly the way to maximize G is to get s to 0 as quickly as possi- ble without overshooting. Vestibulo-ocular reflex An example of a sensorimotor system is the vestibulo- ocular reflex, or VOR, which senses head motion and counterrotates the eyes to keep the retinal images stable. Without your VOR, your vision would blur whenever you moved, like a photo taken with a moving camera. For simplicity we will consider just one eye, rotating only horizontally, in which case the state st consists of horizontal eye position qt and current horizontal head 6 velocity ht, the latter measured by sensors called semi- circular canals located in the inner ears, in cavities in the skull bones called the vestibula. We write (7) [ ; ]t t ts q h where the semicolon means we stack qt and ht verti- cally, one on top of the other, to form a column vector st. The variables qt and ht have just one element each, and so could be regarded as scalars (i.e. numbers), but we will view them as one-element vectors because in most systems, the position and input variables are vectorial. Throughout the course, we will write q for the configu- ration of a system, i.e. its position, location, or posture, as opposed to its velocity, acceleration, etc. The ele- ments of q are the system’s degrees of freedom. What are the state dynamics of the VOR? Eye position q evolves (to a good approximation) this way: (8) ( ) /t t t t t t tq q q q t a κq ρ. Here κ and ρ are constants that reflect the mechanical properties — stiffness and viscosity — of the eye mus- cles. Realistic values are κ = 250 and ρ = 50, if t is measured in seconds, q in radians, and a in mean spike rate per motoneuron per second. As for head velocity ht, we will measure it in radians/s and treat it as something random, which the agent can- not predict, but must simply read in from its sensors in the inner ear. In reality you can predict your head mo- tion to some extent, as it is partly under your control, but it may also be affected by unpredictable factors, as 7 when you are riding a bucking bronco. For now, it is simpler to treat ht as completely unpredictable. What are the aims of the VOR? It tries to minimize the velocity with which the visual image is slipping across the retina, so we can define its reward as the negative of the squared slip velocity, (9) 2 ( , ) (Δ /Δ )t t t t tr r s a q t h (though other choices, such as the negative of the non- squared magnitude of the slip velocity, would also be reasonable). The VOR’s policy is a neural network in the brainstem and cerebellum. It receives information about st and uses it to compute an action at, which is another one-el- ement vector, the mean firing of the motoneurons. (Or more precisely, it is the difference between the mean firing rates of the motoneurons to the rightward and leftward-pulling eye muscles.) What is the best policy? Using (8), we see that rt can be maximized (that is, zeroed) if (10) ( ) = .t t t ta μ s κq ρh To apply this policy, the agent must know not just st but also the stiffness and viscosity coefficients of the mus- cles, κ and ρ. No sensory signals can report those values in any straightforward way, but by observing eye mo- tions evoked by a wide variety of actions, a, the agent can gradually deduce them. In other words, the agent can learn estimates of κ and ρ. To emphasize that it uses approximations rather than exact, true values, we write the policy this way, 8 (11) ( ) = ,t t t ta μ s κ q ρ h where corner brackets indicate estimates. In any system with estimates of environment properties, such as κ and ρ, we call the set of estimates a model of the envi- ronment. ›› run VOR.m Saccades We have deduced the ideal policy for the VOR, but for most sensorimotor systems it is harder to choose opti- mal actions. To see some of the issues, we will examine the system that drives saccades, the rapid eye move- ments that rotate the gaze lines to a visual target. Again we consider horizontal movements of one eye. For this task, the state is (12) [ ; * ],t t ts q q where qt is horizontal eye position as in (7), and q*t is desired horizontal eye position, the position that will point the fovea at the visual target. Eye position q evolves according to (8), as it did in the VOR, and with the same κ and ρ, because both systems use the same eyeballs and muscles. Target position q* we treat as random. In reality, the saccadic system must analyse the retinal images to de- tect objects and choose one as the visual target, but we will assume the choice has already been made in cir- cuits upstream, and the saccadic system receives a one- element vector q*t. 9 For this first look at saccades, we will take as our re- ward the negative squared difference between the actual and desired eye positions, (13) 2 * .t t tr q q (In fact, slightly more complicated rewards are more re- alistic for saccades, but (13) will do for now.) The policy is implemented by neurons mainly in the midbrain and brainstem. We assume these neurons al- ways know the current qt and q*t. But because time passes in discrete steps, they have to wait ∆t time units before they learn the next state and select the next ac- tion. In other words, our discrete ∆t mirrors the fact that real agents, even in continuous time, receive their feed- back with a delay. For the saccadic system we set ∆t = 0.01 s because that much delay is biologically reasona- ble (though values as small as 0.001 s may also be real- istic). Another issue, crucial for saccades, is that actions are bounded. Motoneurons can’t fire billions of spikes per second, to snap the eye to its target in a nanosecond. Real motoneurons can’t exceed about 500 spikes/s, and most saccades take 20–100 ms. So what is the best policy for saccades? We might try to find a policy by solving for the action at that zeroes rt, as we did with the VOR, but we won’t succeed, because nothing in (13) — the formula for rt — is a function of at. Specifically, (8) tells us that at (together with other variables) determines ∆qt, not qt. We might think of solving for the at that produces a ∆qt such that qt+∆t = q*t. But that won’t work either: because 10 at is bounded, it is almost never possible to make qt+∆t = q*t, i.e. to reach the target in a single 0.01-s time step. A better approach is to define some sensible desired dy- namics. To keep the formulas compact, we will write q(1)t for the discrete-time velocity ∆qt / ∆t, q(2)t for the ac- celeration ∆q(1)t / ∆t, q(3)t for the jerk ∆q(2)t / ∆t, and so on. And for qt we may also write q(0)t. Using this notation, we might define a desired eye ve- locity for saccade trajectories, (14) (1)* ( * ),t t tq α q q where α is a negative constant. That is, the eye should have a negative velocity when qt > q*t and a positive velocity when qt < q*t, so q is always moving toward its target. We can solve (8) for the at that makes q(1)t = q(1)*t, namely (15) ( *).t t t ta κ q ρ α q q This policy will drive the eye to its target, zeroing the reward (13), so long as κ and ρ are accurate and α is not so large that it causes overshoot, as you can check in the posted code file. ›› run SACCADIC_LINEAR.m But there are many possible choices of desired dynam- ics, and (14) is not particularly good. For one thing, sac- cades should be fast, but (14) is slow. It makes q(1)*t a linear function of the motor error, qt – qt*, which means that as the error shrinks, the velocity shrinks pro- portionally. If initial error is, say, 0.4 radians, then the 11 eye may start toward the target at high speed, but a little later, when the error has shrunk to 0.2, velocity will be halved. Better to choose desired dynamics that maintain speed until the target is reached, such as (16) (1)* sgn( *),t t tq α q q where the sgn function (pronounced signum) yields 1 when its argument is positive, –1 when it is negative, and 0 when it is 0. The resulting policy is (17) sgn( *).t t t ta κ q ρ α q q This policy is a simple case of sliding-mode control, which we will define in greater generality later. ›› run SACCADIC_SLIDING.m As you explore this code, you will find that (17) doesn’t actually work very well — it causes chatter, or rapid oscillations in q and a (why?) — but the following sof- tened version is better, (18) tanh( ( *)).t t t ta κ q ρ α β q q This policy is “softened” in the sense that tanh — the hyperbolic tangent function — is a smoothed approxi- mation to the abrupt step function sgn. In fact, (18) is close to what the real saccadic system does, except that the real system uses its κ and ρ not only as in (18), but also to update an internal estimate of qt, by running a kind of neural simulation of the state dynamics (8): (19) Δ ( ) / .t t t t tq q t a κ q ρ 12 It then uses this running estimate instead of the real qt in its policy, (20) tanh( ( *)).t t t ta κ q ρ α β q q This way, the policy neurons get (approximate) infor- mation about eye position faster than if they waited for visual or proprioceptive signals. This mechanism is called internal feedback because the agent learns qt from its own model rather than from the senses. Internal feedback is what allows the saccadic system to achieve a feedback delay ∆t as small as 0.01 s. Visual feedback takes longer — about 0.1 s. ›› run SACCADIC_INTERNAL.m Higher-order dynamics The VOR and saccades both have (to a good approxi- mation) first-order state dynamics, meaning that no time-derivatives higher than the first — velocity — ap- pear in the dynamics. But most sensorimotor systems are at least second-order. To see the implications, we will start not with any real sensorimotor dynamics but with the simplest second-or- der system, namely a frictionless bead sliding on a straight, horizontal wire, driven by a varying force which we will regard as an action a. The bead’s loca- tion q obeys Newton’s second law, d2q/dt2 = a (assum- ing the bead has mass 1 kg, and force a is measured in newtons). The state must now include both position and velocity, because both affect the future evolution of the system, (21) (1)[ ; ],t t ts q q 13 and the state dynamics are (22) (1) (1)Δ [Δ ; Δ ] Δ [ ; ].t t t t ts q q t q a Notice that at determines ∆q(1)t but has no effect on ∆qt, and therefore none on qt+∆t. It will affect qt+2∆t and later q’s, but qt+∆t is already determined by the current veloc- ity q(1)t. In other words, the system has inertia, and this fact can complicate control. Suppose we want to drive our bead-on-a-wire to state s = 0, assuming |a| ≤ 5, and taking (5) as our reward. As with saccades, no at will zero this reward immediately, but we can choose desired dynamics that take s to 0 eventually. For saccades, it was easy to choose reasonable linear dynamics, because the state dynamics were first-order. But for an nth-order system, how do we choose constant coefficients αi so the linear dynamics, (23) ( ) ( 1) (1) (0)1 1 0* ... , n n nq α q α q α q give the behaviour we want, e.g. driving q to 0? We define the Hurwitz polynomial (24) 1 11 1 0... n n nx α x α x α (replacing the time-derivatives q(i) from (23) with pow- ers xi of a new variable x), and choose our α’s so this polynomial’s roots all have real parts less than 0. Then if we use these α’s in (23), we get what are called Hur- witz dynamics, which (it can be shown) drive s to 0, so long as ∆t is not too large. 14 If our goal is not 0 then we make appropriate adjust- ments, e.g. if we want to take q to some nonzero sta- tionary target q* then we set (25) ( ) ( 1) (1) (0)1 1 0* ... ( *). n n nq α q α q α q q For the bead on the wire, we might choose the Hurwitz dynamics (26) (2) (1) (0) (1)1 0* 6 9 ,q α q α q q q which imply the policy (27) (1)6 9 .t t ta q q Defining a policy this way, based on Hurwitz dynamics, is called feedback linearization. But as with saccades, linear dynamics are slow. We can speed things up using sliding-mode control, meaning we choose Hurwitz dynamics for q(n–1), but faster, nonlinear dynamics for q(n). For example, given a stationary target q*, we set (28) ( 1) ( 2) (1) (0)2 1 0* ... ( *) n n nq α q α q α q q and (29) 1 1 1* sgn( *). n n n n q q qα Here α0 to αn–2 are chosen to make (28) Hurwitz, and αn–1 is a negative number. Given (29), q( n–1) moves rap- idly toward q(n–1)*, reaching it quickly, and thereafter the state dynamics obey (28), with the result that q → q* exponentially. For the bead on the wire, we might choose 15 (30) (1) 0* ( *) 6q α q q q and (31) 1 1 1 12 * sgn( *) 5sgn( 6 ),q q qα q q which imply the policy (32) 1 5sgn( 6 ).tt tqa q But even (32) is not the time-optimal policy — the one that gets s to 0 as quickly as possible, maximizing re- turn (6) given reward (5). In this simple case of a fric- tionless bead on a wire, we can deduce the time-optimal policy analytically (see posted code), but in general, time-optimal control is hard. ›› run INERTIAL_TIME_OPT.m Learning We have managed to find sensible policies for the VOR, saccades, and inertial control. But we did it by examining equations and reasoning about them. How can such policies be acquired by sensorimotor networks in the brain? They can’t be hardwired in by the genome, because they depend on mechanical coefficients like κ and ρ, which change as the body grows. They must be at least partly learned. Also, we found our policies by specific reasoning, tai- lored to the VOR, saccades, and a bead on a wire. And in fact our reasoning worked only because those tasks were very simple. The brain has more likely evolved one or a few general mechanisms that can learn policies for a wide range of difficult tasks. In this course we will 16 explore how such mechanisms can be embodied in the neural networks of the brain. Getting started with Matlab In Matlab, you write code (m-files) in the editor win- dow, and when you run your code, you may see printed outputs and error messages in the command window. You can also use the command window as a calculator (e.g. type 7*5 and get 35), and you can write and run one-line programs there. The following words or key-presses are useful in the command window. Ctrl-C will halt a running piece of Matlab code. help will provide information about Matlab functions, e.g. help rand will reveal how to use Matlab’s ran- dom-number function, rand. If you don’t know the name of the function you want, then write lookfor; e.g. lookfor random will bring up a lot of documentation where the word “random” ap- pears. If there is a bug in your code then error messages will appear in the command window. In the editor window, omitting the semicolon ; after a statement will cause Matlab to display its output in the command window, which may help you debug your code, e.g. if you write c = a + b without a semicolon, then the value of c will appear in the command win- dow. 17 Start a line with % to make it a comment (i.e. ignored by the computer but useful for humans reading your code). If a variable takes the value NaN (not a number) or inf (infinite) then there is a problem in your code. Write A(2, 7) for the element in the second row and seventh column of a matrix A. A(2:4, 7:12) is the 3-by-6 submatrix of A containing its rows 2 through 4 and columns 7 through 12. A(:, 7:12) is the submatrix of A containing all its rows and the columns 7 through 12. A’ is the transpose of the matrix A. A*B is the matrix product of A and B. A.*B is the elementwise product. Simulation workshop 1. Save Euler_integration.m under a different name and use the new file to experiment with various changes. (a) Change dur, Dt, and the initial s. (b) Right now the state dynamics are s(1)t (i.e. ∆st / ∆t) = st. Make the program more flexible by introducing a variable α and setting the equation to s(1)t = αst. (c) Plot the exact, continuous-time solution to the cor- responding differential equation ds/dt = αs with a 18 dashed blue line in the same panel as the discrete-time, Euler approximation. (d) Change other aspects of the graphing by altering the code following “% plot”. 2. To get practice with states and configurations, write an m-file that uses Euler integration to express and solve (approximately) the second-order differential equation d2q/dt2 = –αq, with initial conditions q = 0 and dq/dt = α1/2. Also plot the exact solution as a dashed line. Try different time steps. 3. In Saccadic_linear.m, try larger and smaller values for α, the desired-dynamics coefficient. Does anything odd happen when α is very large? Why? 4. Write an m-file that uses feedback linearization to control a system with state dynamics d2q/dt2 = a – αq, for a small constant α. Make the agent drive q to a tar- get q* that jumps every 500 ms, as in Saccadic_lin- ear.m. Other resources Hanneton S et al. (1997) Does the brain use sliding var- iables for the control of movements? Biological Cyber- netics 77: 381-393 Khalil HK (2002) Nonlinear Systems. Prentice Hall Slotine JJE, Li W (1991) Applied Nonlinear Control. Prentice Hall 19 2 Supervised learning Agents must learn their policies, based on information from their sensors. To explore the principles, we will first consider a simpler task, where a learner tries to de- duce a target function τ from examples of x’s and τ(x)’s. This task is called supervised learning; the x's are called inputs; the τ(x)'s are target outputs, desired outputs, la- bels, or supervisor or teacher signals; and each ordered pair (x, τ(x)) received by the learner is an example. The learner tries to mimic τ, i.e. to learn to predict τ(x) given x. Usually a learner can’t mimic τ precisely but creates an estimate τ that approximates τ. The other main type of learning, besides supervised, is reinforcement learning, where an agent learns a policy, based not on teacher signals like τ(x), but on rewards. We will get to reinforcement learning in Chapter 4, but first we need to understand supervised learning. Error and loss To formulate a supervised-learning task, we define the error vector (33) ( ) ( ) ( ).e x x x Then we choose a loss function, L — a scalar function of e, usually non-negative, with a single, global mini- mum at e = 0, and no other minima. An example is the squared error, (34) .L e e T 20 Then the aim of learning is to adjust τ to minimize the expected (i.e. average) loss, E[L], averaged across all possible inputs x, taking into account that some x’s may be more common than others. Neural networks The learners we study in this course are layered net- works of computational units called neurons — simpli- fied models of the real neural networks in the brain. We write nl for the number of layers in the network, and y{l} for the signal in layer l, i.e. the vector of the firing rates of all the neurons in that layer. Each signal (apart from the first layer’s) depends on the signal in the pre- vious, upstream layer, by the formula (35) { } ( { } { 1} { }).y l W l y l b l Here W{l} is the matrix of synaptic weights from layer l – 1 to layer l; b{l} is the bias vector of layer l (repre- senting baseline activities of the cells in the layer); and φ is the activation function. The term in parentheses is called the pre-activation potential, (36) { } { } { 1} { }.v l W l y l b l For our activation function, we will mainly use the rec- tified-linear-unit, or relu, function, (37) { } ( { }) max(0, { }),y l v l v l in layers 2 through nl – 1, and the identity function, y{nl} = v{nl}, in the final layer. No activation function is applied in the first layer of the network either, so we have also y{1} = v{1}. 21 Another popular activation function is the hyperbolic tangent, (38) { } tanh( { }),y l v l which has some advantages, e.g. it is bounded between –1 and 1, and so tanh-neuron signals can never fly off toward infinity, as may happen with relu. In any layered network, we regard the first-layer signal, y{1}, as the network's input, also called x. The net- work’s output is its final-layer signal, y{nl}, though sometimes it is convenient to make τ(x) equal not to y{nl} but to some function of y{nl}. Often we write simply y for τ(x), and y* for the desired value τ(x), in which case (33) becomes (39) *.e y y So long as they have at least 3 layers, networks with relu or tanh activation functions (or indeed with any of a wide variety of other nonlinear activation functions) are universal approximators, meaning we can approxi- mate any continuous function τ arbitrarily well with a network of this form if we have enough neurons and we find appropriate W’s and b’s. Gradient descent Learning is a matter of finding weights and biases that yield a small E[L]. Finding the very best W’s and b’s is computationally intractable — too slow and expensive — but it is feasible to guess W and b and then improve those guesses by gradient descent. 22 To explain this idea, it is convenient to arrange all the adjustable parameters — all the W{l} and b{l} — to- gether in a single column vector, . For any one net- work, the set of all its possible ’s is called its parame- ter space. E[L] is a function of , because when a net- work has good parameters then E[L] is small, and when it has bad parameters then E[L] is large. The gradient E[L]/ at any locus in parameter space is a row vector whose transpose points in the direction of steepest increase of E[L]. Therefore we can decrease E[L] by adjusting in the opposite direction, setting (40) E[ ] / ,t t t L T where θt is the parameter vector at time t, η is some small positive number, and θt+Δt is the new parameter vector, after the adjustment. Then we compute E[L]/ at this new spot in parameter space, subtract η times that gradient to get θt+2Δt, and repeat over and over, gradually improving . We can picture the graph of E[L] versus as a high-di- mensional surface, the E[L]-landscape. Then gradient descent is like walking down the slopes of that land- scape in hopes of finding a low basin. The quantity η in (40) scales the lengths of the strides we take, as we will discuss in the section Faster learning. And from now on, for simplicity, we will set the time step Δt equal to 1; for example, if θ is adjusted once every millisecond then we use milliseconds as our time units, so we can set Δt = 1 and write θt+1 instead of θt+Δt. ›› run GRADIENT_DESCENT_QUADRATIC.m 23 Usually we can’t compute E[L]/, but we can esti- mate it by computing Lt/, where Lt is the loss for the current example. At each time point t, the network re- ceives an example (xt, τ(xt)), and computes its estimate τ(xt) and its error et = τ(xt) – τ(xt). From et it can com- pute Lt/, which is an estimate of E[L]/. The esti- mate is unbiased, meaning it equals E[L]/ on aver- age, but of course it does vary around this mean, and that variation is called gradient noise. Backprop The error-backpropagation algorithm, or backprop, computes Lt/ for all the adjustable parameters in a layered network. As a step toward that goal, it computes the derivatives of Lt with respect to all the network’s pre-activation po- tentials, v{l}. Dropping the subscript t’s to avoid clutter, we see by the chain rule of calculus that (41) { 1} { } { } { 1} { } { } { } { 1} . { 1} { } L L v l dy l v l v l y l dv l L dy l W l v l dv l This equation tells us how to find L/v for layer l given L/v for layer l + 1. To get started, we need L/v for the final layer. Usu- ally that is easy to compute, though the details depend on the formula for L. If L is the squared-error loss, as in (34), and the final-layer activation function is the iden- tity, then 24 (42) 2 { } { } { } { } 2 2 . { } l l l l l L dL e e e v n de v n v n dy n e e dv n T T T So backprop starts at the output end with (42) and works backward layer by layer through the network with (41) to get all the L/v{l}. This same procedure also yields the derivatives of L with respect to the biases b{l}, because (43) { } . { } { } { } { } L L v l L b l v l b l v l To get the derivative of L with respect to the weights, we again apply the chain rule, (44) { } { 1} . { } { } { } { } L L v l L y l W l v l W l v l Given these derivatives, we can adjust the weights and biases by gradient descent, setting ∆W = – (L/W) T, ∆b = – (L/b)T. ›› run BACKPROPAGATION.m Finding other gradients by backprop We have seen that backprop can compute the gradient of L with respect to many network variables, including the input vector, y{1} — also called v{1} or x. But given any vector α of the same dimensionality as y{nl}, we can replace L by αTy{nl} in (41) and (42). It follows that, if we begin backprop with α instead of 2e, we get αT y{nl}/x at the input end. In other words, we can 25 compute the gradient of any network’s output with re- spect to its input, multiplied by an arbitrary vector α, by backpropagating α. In brief, (45) backpropagating { }/ .lα α y n x T This principle will be crucial to reinforcement learning in Chapter 4. Feedback alignment Very likely, backprop can’t operate in the brain, be- cause of (41). That equation contains the term W{l + 1}, which means that the feedback circuits that compute the parameter adjustments must know the weight matrices W. That is, neurons in these feedback pathways, whose signals represent L/v{l}, must know the weights of vast numbers of synapses on other neurons, in the for- ward path — the neurons whose signals are y{l}. De- spite decades of research on inter-neuronal communica- tion, we have no known mechanism that could convey this synaptic data with anywhere near the speed and ca- pacity required for backprop. But another mechanism, very close to backprop, works almost as well and is biologically feasible. If the W’s in (41) are replaced by random, fixed matrices then there is no need to communicate synaptic weights from cell to cell, and yet (it can be shown) the forward-path weight matrices W{l} evolve to resemble the fixed ma- trices in the feedback circuits, so that in the end it is as if those feedback matrices had been set equal to the W{l}’s, as in backprop. ›› run FEEDBACK_ALIGNMENT.m 26 Minibatches Backprop (and feedback alignment) networks learn much faster if they receive, at each time t, not a single example (xt, τ(xt)), but a set of nm examples called a minibatch, with nm usually in the range 100–256. One reason minibatches help is that with more examples we can compute better estimates of E[L]/, reducing gra- dient noise. Another reason is that we can arrange all the data vectors of the minibatch side by side in a single matrix, and computers can process that matrix of, say, 100 examples much faster than they can 100 individual examples. When we use a minibatch of nm examples, we are run- ning nm different inputs through the same network, so we get nm different sets of v’s and y’s, and nm different outputs, errors, and losses, but all nm runs share the same W’s and b’s. We will write Lm for the loss in ex- ample m of the minibatch, v{l}i,m for the pre-activation potential of neuron i of layer l in example m, and analo- gous things for other variables. In this setting, L means the average of all nm losses in the minibatch. Backprop equations (42), (41), (43), and (44) then become (46) , , , , , d { } 2 2 , { } d { } l i mm i m i m l i m l i m y nL e e v n v n (47) , , , , , d { } Σ { 1} , { } { 1} d { } i mm m j j i i m j m i m y lL L W l v l v l v l (48) , 1 Σ , { } { } m m i m i m LL b l n v l 27 (49) , , , 1 Σ { 1} . { } { } m m k m i k m i m LL y l W l n v l Brains likely can’t use minibatches, but in most of our simulations we will use them nonetheless, for speed. We get much the same result we would have obtained without minibatches, but faster. ›› run BACKPROPAGATION_MINIBATCH.m Initialization In many problems, backprop-based learning will suc- ceed or fail depending on how we set the initial values of the network’s weights and biases. One rule of thumb, which does not always work, is that in a relu network, the weights should start out small and the biases mainly positive, say with a mean near 0.1. That way, most neu- rons in the network will yield nonzero signals given most inputs x, whereas with negative b’s, or large W’s that overpowered the b’s, many neurons would be in the flat part of the relu function — a problem, as neurons can’t learn when they are in the flats because then the derivative y/v, which drives learning in (41), is 0. Activation functions like relu that are zero over some stretch of their domain are said to show hard-zero satu- ration. One might expect better learning with other acti- vation functions, such as the hyperbolic tangent, that are flat nowhere, but in practice, relu networks usually learn better, maybe because the relu function, being so simple, simplifies the E[L]-landscape. Also, most com- mon activation functions without hard-zero saturation, such as tanh, are almost flat over parts of their domains, 28 and so they don’t really avoid the problem. Real neu- rons in the brain show hard-zero saturation, as their fir- ing rates can’t be negative. Faster learning Speed of learning hinges on the rate constant in (40). If η is too small, then improves very slowly. If η is too large, it may overstep a minimum and increase E[L]. Annealing Often it helps to anneal η, shrinking it as time goes by so the network learns rapidly at first and then slows down to home in on an optimal . But choosing the an- nealing schedule (the rate at which η shrinks) is a matter of trial and error, as is choosing the initial η. Momentum Backprop usually learns faster with momentum, mean- ing that the parameter adjustment ∆ is not exactly –η times the gradient, but also depends on the ∆ in the previous time step: (50) / ,t t t L T where β is a number between 0 and 1, usually 0.9 or larger. The result is that if the negative-gradient –∂L/∂θ consistently points in some direction over several time steps, then will build up speed in that direction, whereas if the negative-gradient flicks back and forth between opposite directions from time step to time step then that back-and-forth motion will be averaged out and ignored. 29 This averaging is crucial because the E[L]-landscape is often poorly conditioned, meaning its ridges, saddles, valleys, and basins are much steeper in some dimen- sions than in others. Imagine a deep river valley that snakes its way downhill very slowly, i.e. with a shallow grade, but that has very steep sides. In other words, the land slopes steeply down to the river on either side, but slopes downhill much more gradually in the direction of the river’s flow. In a valley like this, it is usually a bad idea to step exactly down the gradient: if you are stand- ing on either side-slope, the negative-gradient will point straight down that slope, at almost a right angle to the riverbank, and will be huge, so unless in (40) is tiny, ∆ will also be huge, and so will tend to jump back and forth across the valley, ∆ t taking it from one side to the other, and ∆ t+1 taking it back again, rather than snaking steadily downhill along the valley floor. On the other hand, if is tiny, then will descend the steep side in an orderly way, but then its progress along the valley floor will be very slow. So no single is appropriate for the steep and the shal- low dimensions. But with (50), the back-and-forth jumps across the valley will tend to cancel each other out, and any motion along the valley will compound, so improves more efficiently. Nesterov momentum In (50), we applied the momentum after computing the gradient, i.e. we received the input xt, computed the gra- dient, and then adjusted . Another approach, often faster, is to adjust to an interim value, ', by applying the momentum, then compute the gradient at ', and move down that gradient: 30 (51) 1 1 1 , / ( ) , . t t t t t t θ' θ β θ θ θ' η L θ θ' θ θ θ T The rationale is, if we are already committed to taking a big momentum step, then we should check the lay of the land after that step, not before. RMSprop We have seen that E[L]-landscapes have steep and shal- low dimensions, and that no single can work well for all of them. Therefore it is useful to define a separate learning-rate factor for each dimension, i.e. for each pa- rameter i. One approach is to choose a single, scalar as usual, but then divide each ∆i by |Lt/ i|, replacing (40) by (52) 1, , / / / ,t i t i t i t iη L L ε where is some tiny number, added to prevent us ever dividing by 0. This scheme keeps the sizes of the ad- justments, |∆ i|, close to . No i can take huge steps down a steep dimension and fly out of control, or take minuscule steps along a nearly-flat dimension and stag- nate. But because of gradient noise, it is better to scale ∆ i not by the current |Lt/ i| alone, but by a running aver- age of recent |Lt/ i|. In the method called RMSprop, we introduce a vector rms, where the superscript means root-mean square. We initialize it to 0, we update it by (53) rms rms 2 , 1, (1 ) ( / ) ,t i t i t iβ β L 31 where rmst,i is the i-th element of rms at time t, and we use it to adjust , (54) rms1, , ,/ / .t i t i t i t iη L ε In (53), setting β = 0.9 usually strikes a nice balance be- tween too little and too much memory. If we set β = 0 (that is, we scale ∆ based only on the current mini- batch) then we overreact to noise, but if we set β much higher than 0.9 then we retain obsolete information about gradients in a region of -space we have already left behind. Adam The most popular way of adjusting parameters in AI right now is probably the method of adaptive moments, or adam, which maintains running estimates of the mean, or first moment of L/, (55) 1 1 , 1 1, 1(1 ) / ,t i t i t im β m β L and the second moment, (56) 2 2 2, 2 1, 2(1 )( / ) .t i t i t im β m β L To get started, we have to initialize m1 and m2, and for lack of any better guess, we set them both to 0, with the result that our running estimates in (55) and (56) are bi- ased toward 0. Those biases may not hurt us, because they fade with time, as we process more and more ex- amples and the initial values have less and less effect. But just to be safe, we can bias-correct. It can be shown that the following corrected estimates are unbiased: 32 (57) 1 1 1 2 2 2 ' / (1 ), ' / (1 ), t t t t t t m m m m where β1 t and β2 t are the β’s from (55) and (56) raised to the power t. Finally, we adjust the parameters, (58) 1 21, , , ,/ .t i t i t i t iηm ' m ' ε With adam, m1t,i in the numerator acts like momentum, cancelling out components of the gradient that reverse direction from one time step to the next, while m2t,i in the denominator acts like RMSprop, bounding the sizes of the adjustments, ∆ i. Usually we set = 10–8, β1 = 0.9 and β2 = 0.999, and choose by trial and error. ›› run COMPARE_DESCENT_QUADRATIC.m ›› run COMPARE_DESCENT_COMPLEX.m Batch normalization Even with adam, a backprop network may still learn slowly if its signals are uncentred, i.e. if any of the ele- ments y{l}i has a nonzero mean across examples, or if the elements have unequal variances. These problems can arise even if the input signals to the network as a whole, y{1}i, are nicely centred and have equal vari- ances, because the signals in downstream layers may not have those properties. ›› run UNCENTRED.m 33 We can often improve learning simply by ensuring that each element of each signal y{l} has a mean of 0 and a standard deviation of 1. In this method, called batch normalization or batchnorm, we normalize each layer’s signal y{l}: we calculate the mean μ{l}i and standard deviation σ{l}i of each y{l}i across the current mini- batch, then subtract off the means and divide by the standard deviations to create a normalized signal, (59) , , ( ) / , i m i m i i y' y μl l l σ l where y{l}i,m is the i-th element of example m in the minibatch. (There is another version of batchnorm which normalizes the pre-activation potential v{l} ra- ther than y{l}, but it is more complicated and usually doesn’t work any better.) To backprop through this batchnorm network, we must compute Lm/y from Lm/y', (60) , , , , , , / ( / Σ / Σ / ) / ( ε), i m m m i m i m n i n n i n n n i n m i L y n L y' y' y' L y' L y' n σ where nm is the number of examples in the minibatch, and ε is a small number introduced to prevent us divid- ing by 0. Batchnorm sometimes improves learning dramatically, though in other cases it makes it worse. It often works better with tanh than with relu activation functions, be- cause with relu, zeroing the mean pre-activation poten- tial pushes it into hard-zero saturation, i.e. the flat part of the relu curve. ›› run BATCHNORM.m 34 Brains likely can’t use batchnorm, because they likely can’t use minibatches, but variants of batchnorm may be feasible. For example, each neuron may monitor its own activity through time, and alter itself to keep its mean firing rate and variance near some target values. Comparing algorithms To assess learning algorithms, and judge their biologi- cal feasibility, we have to consider how much energy they consume, how long they take to learn, and how much memory they need. Energy Computer scientists measure the complexity of their al- gorithms by counting the number of floating-point oper- ations, or flops. For example, multiplying or adding two floating-point variables counts as one flop. Usually, we will assume that methods that involve a lot of flops in a computer also involve a lot of computational effort in a brain. On the other hand, there are operations that are hard in computers but easy in brains, or vice versa, so we try to identify these. Moreover, the energy cost of an algo- rithm in the brain depends not only on its flop count but also on how many neurons it needs, as we will discuss below. Time In a computer, the running time of an algorithm de- pends mainly on the number of flops, because comput- ers work mainly serially, doing just a few operations at 35 a time in the central processing unit (CPU) or several thousand at a time in the graphical processing unit (GPU). But in the brain, vast arrays of neurons do bil- lions of operations in parallel, so learning time depends more on the number of examples required than on the flops. Memory A crucial difference between brains and computers is that brains represent data in at least two different forms: in neuron parameters such as biases and synaptic weights, or in signals (firing rates) of neurons. Synapses are smaller than neurons and more numerous, so they are a metabolically cheap, high-capacity memory store. But they can’t send their information to other neurons or even other places in the same neuron — there is no known way for messages about the strengths of synapses to travel from one place to an- other. Information stored in neural firing, on the other hand, is transmissible — it can travel to other cells along axons. Therefore in our algorithms we have to consider which pieces of information are transmitted from place to place, so we can deduce both the synaptic and the trans- missible memory requirements, or in other words how many synapses and how many neurons we need. Scaling Sensorimotor tasks are often high-dimensional, mean- ing the state vectors have many elements. For example, visual inputs to the brain run on the optic nerves, with 36 more than a million axons each, and so visual input is a vector of more than 2 million elements. We are interested in how complexity, time, and memory scale with dimensionality. If the number of flops needed for a computation increases linearly with the di- mensionality n of its input, then we say the computation is O(n), or on the order of n. If it grows with the second power of n then it is O(n2). For example, computing the dot product of two n-element vectors takes n scalar multiplications, and about as many additions, and so the dot product is an O(n) operation. Multiplying an n-ele- ment vector by an n-by-n matrix is O(n2). Multiplying two n-by-n matrices is O(n3). O(n3) operations are slow and expensive when n is large. Other resources Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press. Kingma DP, Ba JL (2015) Adam: a method for stochas- tic optimization. arXiv:1412.6980 Ioffe S, Szegedy C (2015) Batch normalization: acceler- ating deep network training by reducing internal covari- ate shift. arXiv:1502:03167 Rumelhart DE, Hinton GE, Williams RJ (1986) Learn- ing representations by back-propagating errors. Nature 323 37 3 Sensory maps With backprop or feedback alignment, neural networks can learn visual tasks, such as recognizing objects in images. To feed an image into a network, we encode it as a vector, e.g. if the image is greyscale, with n pixels, we string those greyscale values into an n-element vec- tor and feed that into the network as its input, x. But this vector does not convey the geometry of the im- age: that each vector-element represents a pixel in a 2-D array, and each pixel is spatially related to every other pixel — right, left, above, or below it, and close by or far away. This geometry matters, because most things we want to recognize visually are contiguous objects, which are defined by their boundaries, which in images show up as edges, i.e. sharp differences in brightness or colour between neighbouring pixels. How can we convey the geometric relations of the im- age pixels to a neural network? In this chapter we ex- plore one approach that is used in the brain and in com- puter vision. Receptive fields In the human eye, light is detected by photoreceptors (rods and cones) in the retina. From those receptors, signals pass through a series of layers — bipolar cells and retinal ganglion cells in the eye, and then, in the brain, the lateral geniculate nuclei, primary visual cor- tex, and so on. Any one bipolar cell gets input from only a small fraction of the photoreceptors, and those input receptors are not scattered about the retina but are 38 all close together in a contiguous patch called the recep- tive field of the bipolar cell. The bipolar cell distinguishes different input patterns, responding more to some patterns of activity among its input receptors than to others. And because those recep- tors are all neighbours, the cell distinguishes local fea- tures of the image. For instance, many bipolar cells re- spond to small spots of light or dark against contrasting backgrounds. Owing to their input connections, then, the cells detect geometric features, namely spots, that are useful for identifying objects. Feature maps Moreover, neighbouring bipolar cells have neighbour- ing receptive fields, and therefore the sheet of bipolar cells is a map of visual space, just as the sheet of recep- tors is. The loci in this second map — the individual bi- polar cells — do not represent the brightness of points in space as the receptors do. Rather, they represent spots of contrast. We say the bipolars form a feature map, each cell signaling the presence or absence of an image feature, in this case a dark or light spot on a con- trasting background, in a region of visual space. Because the bipolar cells form a map, just as the recep- tors do, the brain can apply the same principle again at the next layer: any one retinal ganglion cell gets input from only a small contiguous patch of bipolar cells, and neighbouring ganglion cells get input from neighbour- ing patches. So the ganglion cells form a new feature map. And so on through the lateral geniculate nuclei, primary visual cortex, and beyond. 39 Layer by layer, the features become more and more ab- stract. So-called simple cells in primary visual cortex combine information from neighbouring spot-detectors in the lateral geniculate to detect straight, oriented edges. Farther downstream, other maps combine neigh- bouring edges into contours; and maybe neighbouring contours into outlines of shapes, and shapes into scenes, though the details are less clearly understood the deeper we go into the visual system. In neuroscience, these visual areas are called retinotopic maps, and analogous terms exist for other senses. Pri- mary somatosensory cortex, which gets information from touch receptors, is a somatotopic map, where neighbouring cells get information from neighbouring patches of skin. And some auditory areas are tonotopic maps, where neighbouring cells represent tones of simi- lar frequencies. Given any one retinal ganglion cell, its input field is the set of bipolar cells that project to it, whereas its recep- tive field is the set of photoreceptors that communicate with it, via the bipolars. And similarly for any layer of any sensory system: a cell’s receptive field is a patch of receptors, and its input field is a patch of cells in the previous, upstream map. Or more generally, a cell may have several input fields, if it receives projections from more than one map, e.g. direct projections from cells two or more layers back, or feedback from maps farther downstream. Connection matrices The next file of sample code constructs a network where all layers except the final one are maps. It defines 40 a map width for each map layer, i.e. how many pixels wide the map is (the code creates only square maps, so any map’s height is the same as its width). All maps share the same field-fraction, which sets the width of the input fields as a fraction of the width of the up- stream map, e.g. if the field-fraction is 0.25 and the layer-2 map is 32 pixels wide, then each cell in layer 3 has an input field in layer 2 that is 32 × 0.25 = 8 pixels wide (and tall). For each cell in each map layer after the first (the input) layer, the posted code file chooses an input field, lo- cated so that neighbouring cells have neighbouring fields. For each layer l, the locations and sizes of all its cells’ fields are stored in a connection matrix, C{l}, of ones and zeros. For example, cell 35 in layer 3 receives a projection (an axon branch) from cell 140 in layer 2, and so C{3}35, 140 = 1, but that same cell 35 in layer 3 receives no projection from cell 130 in layer 2, and so C{3}35, 130 = 0. As the layer-3 map has 16 by 16 pixels, and the layer-2 map has 32 by 32, C{3} has 162 = 256 rows (one for each neuron in layer 3) and 322 = 1024 columns (one for each neuron in layer 2). As each cell in layer 3 has an input field 8 pixels wide (and tall), it follows that each row in C{3} has exactly 82 = 64 en- tries that are equal to 1, and the other 1024 – 64 = 960 elements are zeros. In short, C{3} is structured so that each cell in layer 3 gets input from only 64 cells in layer 2, and those 64 cells form a contiguous 8-by-8 patch — an input field. The code plots the whole connection ma- trix C{3}, to show that most connections are absent (matrix entries are 0s) and only a few exist (entries are 1s). It also plots row 35 of the matrix reshaped into a 32-by-32 map, to show the geometry of the input field of cell 35. 41 The same code file ensures that the network’s weight matrices agree with the connection matrices: the synap- tic weight from any one neuron to any other can be non- zero only if there is a connection between those two cells, i.e. if C{l}i,j = 0 then W{l}i,j = 0. Of course the converse is not true: W{l}i,j may happen to be 0 even if C{l}i,j = 1. Because C and W matrices between map layers have few nonzero elements, they are said to be sparse. The code shows how they can be converted into Matlab’s “sparse” data type, which can make operations on them run faster. In contrast, networks like those in Chapter 2, where each cell in each layer projects to every cell in the next layer, are called dense or fully-connected net- works. In computer vision, most neural networks have several layers of maps, with sparse weight matrices, fol- lowed by a layer or two with dense connections. ›› run MAP_NETWORK.m Reading handwritten numerals One way to test computer-vision algorithms is to see how well they learn to recognize handwritten numerals in the MNIST (Modified National Institute of Stand- ards) data set, put together by Yann LeCun and col- leagues. MNIST contains a training set of 60,000 exam- ples, each a 28-by-28-pixel image of one of the digits “0” through “9”, and a test set of 10,000 additional ex- amples, which we present to the network only after it has finished learning, to assess how well it recognizes examples it did not see during its training. For each of the 70,000 example images, MNIST also provides a la- bel, or in other words a desired-output vector y*, which 42 reveals what numeral is in the image. Of course the learning network does not receive the labels as part of its inputs, but it does use them to compute its errors, as in equation (39), and we use them to assess whether the network’s answers are correct. When neural networks are trained on classification tasks, such as recognizing handwritten digits in the MNIST data set, the desired-output vectors (the labels) are usually one-hot, meaning there is always one ele- ment of the vector equal to 1 and the others are all 0s. For example, if the net is learning to recognize the 10 digits “0” through “9”, then the desired output y* is typ- ically a 10-element vector, set to [1; 0; 0; 0; 0; 0; 0; 0; 0; 0] when the digit is a “0”, to [0; 1; 0; 0; 0; 0; 0; 0; 0; 0] when the digit is a “1”, and so on. In this task, therefore, the actual output y of the network must also be a 10-element vector, where each element of y learns to signal the presence of one of the object classes, “0” through “9”. We say the network has given the correct answer when the largest element of y is the one with the correct index, e.g. if the digit is a “5” then y*6 = 1 and all other y*i = 0, and we say the network has answered correctly if y6 > yi for all i ≠ 6. When y* is one-hot then it is a good idea to restrict the elements of y to the range [0, 1] as well, e.g. by sending the final-layer signals through a rectified tanh function, y = max(0, tanh(y{nl})). On the course website you can find the MNIST data set, in the file MNIST.mat, and also MNIST_key.m, which illustrates the format of those data. ›› run MNIST_KEY.m 43 Convolution At present, most computer-vision networks use maps with additional mechanisms which we will consider briefly here but won’t cover in detail because they are biologically implausible. Most importantly, these net- works use convolution, meaning that all the neurons in any one map have the same incoming synaptic weights. For example, if those neurons have 2-by-2 input fields, and one of the neurons has weights 2, –3, 1, 5 for its up- per-left, upper-right, lower-left, and lower-right input cells, respectively, then every other neuron in the map will also have those same weights at those same loci within its field (but the fields for different neurons will still have different locations within the upstream map, as in Map_network.m). This arrangement, called weight-sharing, is not biologi- cal. Real neurons in real sensory maps each have their own synapses and no known mechanism to keep their weights equal to those of other neurons, though of course there may be some mechanism, still waiting to be discovered, by which real neural maps could approx- imate weight sharing. The set of shared weights that a map applies to all its in- put fields is called the map’s kernel or filter. So weight sharing forces a map to apply a single filter to all its fields. That constraint can be valuable because if a filter is useful in one part of visual space, then it is likely also useful elsewhere. If it is useful to detect vertical edges in the upper-left part of visual space then it is likely also useful to detect them in the lower-right and everywhere else, because objects with vertical edges may appear an- ywhere. 44 Weight-sharing drastically reduces the number of ad- justable parameters. In a 20-by-20 map without weight- sharing, where each of the 400 neurons has an 8-by-8 input field, there are 400 × 8 × 8 = 25,600 incoming weights. But with weight-sharing there are only 8 × 8 = 64. That is too few to learn almost any visual task, and therefore convolutional nets have many maps, usually 5 to 100 of them, in each layer. This way, they get reason- able numbers of parameters, though to do it they need a lot of maps, with a lot of neurons, i.e. convolutional nets have a high ratio of neurons to adjustable parame- ters. Other resources LeCun Y et al. (1998) Gradient-based learning applied to document identification. Proceedings of the IEEE (a description of convolutional nets) 45 4 Reinforcement learning Sensorimotor agents can’t shape their policies by super- vised learning alone, because supervised learning re- quires teacher signals, τ(x), that say what the current output should have been. There are no such signals for most sensorimotor tasks, as sensory feedback usually can’t tell an agent what its action should have been; it can only report the consequences of the action, such as the resulting state change ∆st and reward rt. The agent must improve its policy based on that feedback — a process called reinforcement learning. As in Chapter 1, equation (6), the aim is to maximize some return, the time-integral of a reward to some final time T, (61) 0...Σ .t T tG r More generally, we may start at some time t' other than 0, and we may aim to optimize a discounted return, (62) '' '...Σ γ , t t t t t T tG r where γ is the discount factor, between 0 and 1, say 0.99. So if γ < 1 then we care less and less about re- wards further and further in the future. One motivation is that we are less and less certain about those future re- wards. Another is that discounting helps us avoid infi- nite returns when T = . 46 Models There are two main types of reinforcement learning: model-based and model-free. In model-based methods, the agent learns a model, or in other words an internal simulation, of the state dynamics of its environment, and uses that model to improve its policy. But if the state dynamics are complex, the agent may have trouble learning an accurate model. For this rea- son, a lot of recent work in AI has instead used model- free methods, where the agent learns not the full state dynamics but something simpler, an action-value func- tion, which represents the expected long-term outcome that will follow if the agent takes a given action in a given state. Model-free methods have yielded impressive results in machine learning, but the brain seems to use models for at least some purposes: we can predict the sensory con- sequences of our actions, and plan by running scenarios in our minds. We will study both approaches, beginning with model-free. Action-value functions Many model-free algorithms for reinforcement learning are based on action-value- or Q-functions. A Q-function takes as inputs a state s and an action a, and yields as output the long-term result — the expected return — achieved by taking action a when in state s. Of course, that expected return depends not only on the action a, but also on every action thereafter, till time T. So Q makes sense only if we specify some policy μ for all those subsequent actions. 47 Moreover, the expected return also depends on how many time steps we have left till time T. Therefore Q- functions are easiest to define if T = , because then every t is equally far from T, and so we can define Q in terms of s and a alone, regardless of what the present time is. (Though the Q function is also useful in many settings where T is not actually infinite but merely somewhat large, as you will see in the sample code be- low.) Given a policy μ, we define Q μ(s, a) to be the expected return, from now on, if we are presently (at time t') in state s, and we take action a, and from then on we choose our actions using policy μ. (63) 1... ( , ) ( , ) Σ γ ( , ( )).t t't t' t t Q s a r s a r s s μ Notice that the initial action a, at time t', need not fit the policy μ. That is, Q μ(s, a) still makes sense even when a ≠ μ(s). Policy gradients One powerful method of model-free reinforcement learning is to train a neural network Q to estimate Q μ, where μ is the current policy, implemented by another network, and then improve the policy network by ad- justing its parameters μ up the gradient of Q. By the chain rule, we have (64) / / / . μ μQ θ Q a a θ To compute the Q/a term, we backpropagate the single-element vector [1] through the Q network — 48 just as we backpropagated the vector α in (45). Having found Q/a in this way, we then compute Q/ μ by backpropagating Q/a through the μ-network. This method is called deep deterministic policy gradi- ent, or DDPG. The next 6 sections describe its opera- tions in detail. Bellman equations In DDPG, the neural network Q learns Q μ. And be- cause the policy μ is also learning, and therefore chang- ing, it follows that Q μ is changing as well, which means Q must keep readjusting itself, so as to remain always a good approximation to the current Q μ. How can Q learn Q μ? Not by supervised learning, be- cause there are no teacher signals that know the true value of Q μ(s, a). But useful error signals can be cre- ated using the Bellman equation, (65) 1 1( , ) ( , ) γE[ ( , ( ))]. μ t t t t t tQ s a r s a Q s μ s μ This equation follows from the definition of Q μ in (63): the left-hand side of (65) is, by definition, the expected return from time t onward, given an initial state st and action at, and the right-hand side is the same thing: the reward r at time t given st and at, plus the discounted, expected return from time t + 1 onward given policy μ. This equation implies a Bellman error signal that can be used to train estimates of Q μ, namely (66) 1 1 ( , ) ( , ) γ ( , ( )). Q t t t t t t e Q s a r s a Q s μ s 49 In other words, we don’t know Q μ(st, at), so we don’t know what Q(st, at) should be. But we do know that Q μ(st, at) is related to Q μ(st+1, at+1) as laid out in equa- tion (65), and therefore Q(st, at) should be related to Q(st+1, at+1) in the same way, which means e Q in (66) should equal 0, at least on average. In other words, eQ is a reasonable error signal because if Q = Q μ then eQ = 0 on average, and (it can be shown) if eQ = 0 on average for all inputs (st, at) and = 1 then Q = Q μ + c for some constant c, and that constant is immaterial because the learning rule (64) uses the gra- dient of Q, which doesn’t depend on c. This technique of training Q to agree with its own fu- ture values is called bootstrapping. Target networks How do we use a Bellman error such as (66) to train the network Q? The obvious approach is to compute the gradient of eQ-squared with respect to Q’s parameters, Q, and adjust Q down that gradient. This approach does work, but it is complicated because Q appears twice in the formula (66) for eQ, and with different in- puts in its two appearances. As a result, eQ/ Q is more complicated than if Q had appeared just once, and so the E[L]-landscape is more convoluted, and learning is slower. A simpler approach that often works better is to pretend that Q appears just once in the Bellman error. We re- place (66) by 50 (67) 1 1 ( , ) ( , ) γ ( , ( )), Q t t t t t t e Q s a r s a Q' s μ s where Q' is a target network, separate from Q but with the same structure: the same numbers of layers and neu- rons, the same activation functions, and the same initial weights and biases. With (67), Q appears only once in eQ, and so it can learn in a supervised way, by backpropagating eQ through Q. And as learning proceeds, we also adjust Q', nudging its parameters toward those of Q, (68) ( ), QQ' Q' Q'θ θ τ θ θ where τ is a positive number. Usually τ is small, say 0.001. Otherwise Q may fail to learn, because its error signal eQ, in (67), is driving it toward a target function Q' that is changing too quickly. Replay buffers To train Q, we need plenty of examples of eQ, for many different states and actions. So it is useful to store examples in memory, in a matrix called a replay buffer or experience buffer. At each time step, we store a new column vector in the buffer, (69) 1[ ; ; ( , ); ].t t t t ts a r s a s It is common, especially when dealing with replay buff- ers, to use s't as an alternative name for st+1, so we can also write the vector in (69) as (70) [ ; ; ( , ); ].t t t t ts a r s a s 51 Gradually, the buffer fills with many such examples of the data that we need to compute eQ using (67). At each time step, we randomly select a minibatch of examples from the buffer, and use those to compute eQ and adjust Q, so Q learns from a large, varied mix of data. When the buffer is full, the agent continues to update it, making room for the newest column vector (69) by re- moving the oldest one. Off-policy learning As time goes by, the agent’s policy, μ, changes though learning. Therefore the columns of the replay buffer will contain data generated by a variety of obsolete pol- icies the agent is no longer using. If we select a column (69) from the buffer for our current minibatch, then it will contain an st together with an at that is probably not the action the current policy would choose if it were in state st: (71) ( ).t ta μ s But a nice feature of eQ, as defined in (66) and (67), is that it is still useful even if at ≠ μ(st). As we saw in (63) and the text right after it, Q μ(s, a) is defined for any a, not just μ(s), and so we can train Q using any a’s. This type of learning, based on actions that don’t fit the cur- rent μ, is called off-policy learning. Off-policy learning is convenient because it lets us re- use old data stored in a replay buffer. Some other rein- forcement-learning algorithms (not covered here) need on-policy data, and therefore can’t use replay buffers, at least not without additional operations to try to correct for the policy mismatch. 52 Rollouts With off-policy learning, we can in principle learn Q μ with data from any policy at all, but it is better to train on data close to those of the current policy, μ itself. The main reason is that a Q network can almost never match the true Q μ(s, a) perfectly for all possible inputs (s, a). At best, it approximates Q μ(s, a), and the approx- imation will be closer and more useful if its domain is smaller and well chosen, i.e. if we don’t try to match Q μ(s, a) over all (s, a) but instead focus on the crucial subset, namely states that will actually be visited by an agent using policy μ, and actions that will be chosen by μ in those states. Therefore we train Q based on data collected during rollouts, which are episodes where the agent executes its task using its current policy μ, e.g. if the agent is learning to reach for target objects then each rollout is one reach to one target. This method works well, and also looks very much like biological learning, where an animal learns a task using sense data it collects while practicing that task. Exploration In DDPG the policy is stochastic. For instance, the ac- tion at may be some deterministic function of st plus a Gaussian random vector, usually small and with a mean of zero. In this way, the agent can explore a range of different actions in any one state. The amount of explo- ration — the variance of the random vector — is chosen by trial and error. 53 Pseudocode for DDPG for rollout = 1 to nrollouts choose a random initial state s0 for t = 0 to T at = µ(st) rt = r(st, at) st + 1 = f (st, at) add [st; at; rt; st + 1] to the buffer choose a minibatch of columns [si; ai; ri; s'i] from the buffer for each column i in the minibatch eQi = Q(si, ai) – ri – γ Q' (s'i, μ(s'i)) backprop eQi through Q to adjust Q compute Q(si, μ(si)) backprop [1] through Q to get Q/ai backprop Q/ai through μ to adjust μ end Q' ← Q'+ τ ( Q – Q') end end ›› run DDPG.m Costate equation Here we consider a type of model-based reinforcement learning which applies to episodic, or in other words fi- nite-time or finite-horizon tasks, where the aim is to maximize the return G of a movement from time 0 to a finite end time T. For simplicity, we will assume the discount factor γ = 1. So we have (72) 0... 0... ( , ( )).Σ Σt T t t T t tG r r s μ s 54 The policy μ is again a multilayer network with parame- ters μ, and our aim is to adjust μ to maximize the av- erage return E[G] over some repertoire of motions, e.g. reaches from a variety of initial states. To make those adjustments, we perform a lot of mo- tions, and after each one we compute the gradient of its return G with respect to each of its actions at, from t = 0 to T. To find those gradients, we note that at can influ- ence G in two ways, through its effects on rt and on st+1. Therefore by the chain rule, (73) 1 1 1 / / / / / / / . t t t t t t t t t t G a r a G s s a r a G s f a This formula shows that we need G/st+1 to get G/at. To find G/st, we again apply the chain rule, (74) 1 1 1 / / / / / / / / / / ( / / / ). t t t t t t t t t t t t t t t t t t G s r s r a s G s ds ds r s r a s G s f s f a s Hence if we know (or can estimate) the functions f, r, and μ, we can compute the final G/s, (75) / / / / ,T T T T T TG s r s r a μ s and then sweep back in time, using (74) to compute all the G/st in turn, down to G/s1. The G/st are called costates, and (74) is the costate equation. We plug these G/st into (73) to find the ∂G/∂at, and use those derivatives to improve the policy, adjusting θ μ up the gradient 55 (76) 0.../ / / . μ μ t T t tG θ G a a θ Learning f and r To apply the costate equation (74), the agent must esti- mate the state dynamics f and the reward function r. It can learn f by backprop based on the error signal (77) 1( , ) f t t te f s a s and r based on (78) ( , ) .r t t te r s a r As in DDPG, we train our networks based on mini- batches drawn from a replay buffer. In some tasks, the agent already knows the reward func- tion before training begins, and so doesn’t need to learn it. In that case, we can omit the network r and equa- tion (78). The posted code CPG.m shows a task of that kind. Training μ We train the policy based on imaginary rollouts: start- ing from a minibatch of initial states drawn from the buffer, we run the rollouts forward in time, computing rewards and new states at each time step using r and the model f . (It would be cheating to use minibatches for real rollouts, because an agent can’t make multiple real movements at once, but for imaginary rollouts, minibatches are allowed). We use (74) and (73), with f and r in place of f and r, to compute ∂Gi/∂ati for each time point t and each 56 rollout i in the minibatch. With those derivatives we ap- ply (76) to compute ∂G/∂θ μ, where G is the sum of all the returns Gi in the minibatch. And we use ∂G/∂θ μ to adjust μ. Pseudocode for Costate Policy Gradient for rollout = 1 to nrollouts // Run a real rollout to gather data for the buffer choose a random initial state s0 for t = 0 to T at = µ(st) rt = r(st, at) st+1 = f (st, at) add [st; at; rt; st+1] to the buffer end // Train f and r choose a minibatch of columns [si; ai; ri; s'i] from the buffer for each column i in the minibatch e f i = f (si, ai) – s'i backprop e f i through f to adjust f e r i = r(si, ai) – ri backprop e r i through r to adjust r end // Train μ with an imaginary rollout choose a minibatch of initial states s0 from the buffer for t = 0 to T at = µ(st) st+1 = f (st, at) store [st; at] end … 57 compute r(sT, aT) backprop through r to get rT ∂G/∂aT = ∂rT/∂aT compute ∂G/∂sT using (75) set ∂G/∂θ μ = 0 for t = T – 1 to 0 in steps of –1 compute r(st, at) backprop through r to get rt backprop ∂G/∂st+1 through f to get ∂G/∂st+1 f compute ∂G/∂at using (73) backprop ∂G/∂at through µ add the result to ∂G/∂θ μ compute ∂G/∂st using (74) end adjust μ based on ∂G/∂θ μ end ›› run CPG.m Other resources Kirk DE (1998) Optimal Control Theory: an Introduc- tion. Dover Publications (discusses the costate equa- tion, though in continuous time) Lillicrap TP et al. (2016) Continuous control with deep reinforcement learning. arXiv:1509:02971 (presents DDPG) Silver D et al. (2014) Deterministic policy gradient al- gorithms. Journal of Machine Learning Research Workshop & Conference Proceedings 32.) Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction. 58 5 Regularization When we learn a function relating the x’s and y*’s in a training set of examples (a sample), we want that rela- tion to generalize, i.e. we want it to hold true for other x's and y*’s besides those in our sample. It is a serious challenge: in many learning problems it is easy enough to produce a network that predicts y*’s from x’s very accurately when the net has already seen those specific x’s and y*’s during its training; but faced with a new x it has not seen before, the net’s accuracy often drops markedly. This situation — good accuracy on the train- ing set but poor on other data — is called overfitting. Methods of reducing overfit are called regularization. The central problem is that for any training set, there will always be many different functions that fit the data points in the set equally well but make incompatible predictions about points not in the set. Suppose we have a learning problem where our entire training set {(xt, y*t)} consists of just 3 examples: (0.1, 0), (0.2, 0), (0.8, 0), i.e. our x’s take the values 0.1, 0.2, and 0.8, and our y*’s are all 0s. Our target function may be the zero function, which takes all x’s to 0. But then again the tar- get may be some other function that happens to equal 0 when x is 0.1, 0.2, or 0.8 but is far from 0 elsewhere. If we guess wrong then we will make large errors when- ever x is not 0.1, 0.2, or 0.8. Ensembles One way of coping is to train several networks — an ensemble — and have them vote on answers. We might train 10 networks and then present each novel x to all 59 10 of them, taking the average of their predictions as our guess. Mistakes made by the different nets often cancel out to some extent, so their averaged guesses are more accurate than their individual ones. Capacity Another coping method is to shrink our network, or oth- erwise restrict it in some way, to reduce its capacity, or in other words the size of its hypothesis space, which is the set of all functions that network could ever possibly compute, given any possible setting of its weights and biases. The rationale is that overfitting occurs because our hypothesis space contains functions which are not the desired function but nonetheless fit the training data. If the hypothesis space were smaller then there might be fewer of these false fits. And in practice we can indeed reduce overfitting by using networks with fewer adjust- able parameters. But if we make the hypothesis space too small then it won’t contain any functions that are a decent match to the desired one. Ideally, we want the smallest hypothe- sis space that nevertheless contains all the functions we will ever have to learn. For instance, if we know, some- how, that the target function is linear then we might choose a hypothesis space consisting only of linear functions. But we rarely have that much prior knowledge about our targets. What we can do in practice is monitor our network’s performance on both training data and a separate set of non-training examples called the validation set, distinct from both the training set and the test set. If our net- work can’t learn to do a good job even on the training 60 data then its capacity is likely too low. If it does much better on training than on validation data then it is over- fitting, and might benefit from a smaller hypothesis space, or from other regularization methods we will dis- cuss in a moment. Data Augmentation If we can manage it, the best antidote to overfitting is to get more training data. Consider again the case where we had just 3 training examples, all compatible with the zero function. If we had a million examples instead of just 3, with a million x’s evenly packed over the range from 0 to 1, and the y*’s all 0, then we could be more certain that our target really was the zero function, at least over the range [0, 1]. In other words, overfitting is reduced when the training set is large. Unfortunately, even sets that seem quite big can still be too small to prevent overfitting, e.g. the MNIST data- base of handwritten digits contains a training set of 60,000 examples, but when we use that set to train a network to recognize handwritten digits, we get serious overfitting. On any big problem we want truly vast numbers of examples. If we can’t get enough examples then it is often useful to make fake ones, provided we take care that our fake examples really do resemble real ones our net may face after its training is done. This technique is called data augmentation. For example the MNIST training set includes about 6000 examples of handwritten “5”s. We can take those 61 6000 images and tweak each one in various ways, shift- ing it slightly horizontally or vertically, stretching or compressing it, rotating or shearing it, and so get vast numbers of new examples, all of which are plausible “5”s which a human might actually write. Applying these transformations (for all the digits, not just the “5”s) greatly improves the trained network’s generaliza- tion. And we can do even better by augmenting even more, e.g. the tweaking I just described included only affine transformations (linear transformations plus shifts), and we can do better by adding certain kinds of nonlinear warping. It is important that the transformations we use to aug- ment our data produce new examples that are plausible. It would be a bad idea, for instance, to create fake MNIST examples by mirror-reflecting real ones upside- down, because a handwritten “5” reflected upside-down would not resemble any “5” that a real human is likely to write; it would look more like a strange “2”. And an upside-down “6” would look like a “9”. Used as train- ing data, these fakes would confuse rather than aid the network. Data augmentation should alter our training examples so they resemble the huge set of real varia- tions the net will encounter after its learning is done. That said, it is often possible to reduce overfitting at least a little by tampering with training data in ways that are not so obviously guaranteed to produce plausible fakes. An example is a method called dropout. 62 Dropout Here we add noise to a network by randomly silencing many of its hidden neurons, and sometimes input neu- rons as well. We might stipulate that each hidden neu- ron has a 50% chance of switching off for any one ex- ample. As each example arrives it will evoke a re- sponse, and drive learning, in some random set of hid- den neurons, about 50% of the total, and the other 50% will have zero activity. Then in the test phase, after training is done, we let all the neurons respond to each example, and we halve the weights in the network to compensate for the doubled numbers of active cells. This method improves generalization, though the rea- sons are not entirely clear. It may be a form of ensemble learning: in a sense each example trains a different net- work, because a different set of neurons is switched on at each moment, and then in the test phase we blend the outputs of all those networks. Or it may be a kind of augmentation, creating novel training examples by contaminating them with noise. Why might noise help? We can think of it as breaking up spurious regularities. For example, it might happen by sheer coincidence that all our training images of “5”s have a pure black pixel at one consistent locus, and so our network might learn to trust that pixel as an identi- fier of “5”s. But tested on other “5”s, outside the train- ing set, which don’t have a black pixel at that spot, the network will fail. The likelihood of any one, specific coincidence of this type may be tiny, but there are vast numbers of such coincidences that might arise, and in practice they do exist in all finite training sets. Noise can disrupt these spurious features. 63 ›› run DROPOUT.m Simplicity It is often useful to assume that our target functions are simple, in the sense that their graphs are smooth rather than wiggly. If multiple functions in our hypothesis space fit our training data fairly well then we count the smooth, simple functions as more probable than the jag- ged, complicated ones. In other words, we assume the world is simple, as the philosopher William of Occam urged in the 14th century. It is not clear why that is a useful assumption, but in practice it does reduce overfitting. Part of the reason may be that data are often distorted by noise, which makes the relation between x and y* look more compli- cated than it really is. In other words, complexity can be a figment of noise, and should be viewed with skepti- cism, so we do well to favor simple solutions. We will see some examples shortly. Simplicity can be defined in different ways. The most common approach is called Tikhonov regularization, where we replace our old, least-squares loss (34) with (79) 2 2 * Σ { } ,lL y y λ W l where λ > 0 and ‖W{l}‖2 is the sum of the squared ele- ments of W{l}. So now we are minimizing not just our approximation errors but also the sum of the squared weights, which is a measure of complexity in the sense that larger weights yield wigglier functions while smaller weights yield flatter ones. 64 To see the advantages of Tikhonov regularization, run the m-file Occam.m with different settings of its varia- bles pdt, ns, nsd, pda, and lam. Occam defines a target function y* = τ(x) — different each time it is run — which is a polynomial of degree pdt. It chooses n_ex examples of x’s and y*’s, adds Gaussian noise with standard deviation nsd to the y*’s, and presents this cor- rupted sample to a learning algorithm. The algorithm doesn't know the target function or pdt, but finds the polynomial approximation of degree pda that minimizes the expected Tikhonov-regularized loss (79) with lam playing the role of λ. You will see that when the data are few and noisy, and pda is large and lam is 0, then your approximations will be poor between the data points, but you can improve them with regularization, either by shrinking pda (so as to permit only lower-de- gree, i.e. simpler, polynomials) or by increasing lam. ›› run OCCAM.m Occam_multitrial.m runs large numbers of trials like those in Occam.m, to confirm that on average we get better generalization with a simpler hypothesis space or Tikhonov regularization. ›› run OCCAM_MULTITRIAL.m Tikhonov regularization is easy to implement in a net- work. It involves descending the gradient of (79) in- stead of (34), but the only difference between the two is the second addend in (79), namely λ Σl ‖W{l}‖2, whose derivative with respect to any one weight, W{l}i,j, is 2λW{l}i,j, a scalar multiple of the weight itself. To de- scend this new gradient, then, all we do is add –2λW{l} to the gradient computed by backprop, i.e. learn by backprop as usual, and also have each weight shrink 65 slightly toward 0 at each time step, at a rate propor- tional to its own size. This mechanism is called weight decay. Autoencoders One way to simplify and regularize networks is to re- duce the dimensionalities of the signals they process. Biological data are often very high-dimensional, but they are also usually compressible. For instance, your visual information is coded by the quarter-billion recep- tor cells (rods and cones) in your retinas, but that infor- mation is then processed for transmission to visual cor- tex on the mere 2.5 million axons of your optic nerves, i.e. the original signal, of a quarter-billion dimensions, is condensed 100-fold. In general, reducing dimension- ality lets an agent learn faster and with smaller net- works. One straightforward and perhaps brainlike approach is to use an autoencoder, a neural network that creates low-dimensional representations of data by learning to make its output as similar as possible to its input. For instance, we might build an autoencoder to create low- dimensional representations of MNIST images of hand- written digits. Its first layer would have 784 neurons, to represent the greyscales values of the 784 pixels in an MNIST image. The other 4 layers might have neuron counts of 300, 30, 300, and 784 respectively — the cru- cial thing here is the tight middle layer, or bottleneck. If the network learns by backprop to make the activities in its final layer resemble those in its first layer, then that means it has squeezed a lot of information about the original, 784-pixel image through a bottleneck of just 30 neurons. 66 Why use 5 layers in this example? Why not just 3, with neuron counts of 784, 30, 784? In practice it works bet- ter to compress the data in stages, maybe because it breaks a difficult compression problem into easier parts. ›› run AUTOENCODER.m Other resources Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press. Srivastava N et al. (2014) Dropout: a simple way to pre- vent neural networks from overfitting. Journal of Ma- chine Learning Research 15: 1929–1958 67 Glossary action. In reinforcement learning, the agent’s output, a. action-value function. Given the current state s, an ac- tion a, and a policy , the action-value Q(s, a) is the expected return (i.e. the time-integral of the rewards we will accumulate from now to some final time T) if we start in state s, we choose action a (which need not agree with policy ) at the first time point, and from then on we apply policy . adam. a formula for updating network weights and bi- ases based on present and past gradients computed by backprop. adversarial training. a method where a learning net- work improves by training on challenging examples provided by a second, “adversary” net. The adversary has full knowledge of the learner’s structure and param- eters, and uses it to create examples that are likely to fool the learner, i.e. it identifies the learner’s weak- nesses and provides special training to overcome them. agent. in reinforcement learning, the entity that per- forms actions to achieve goals; called the controller in control theory. Bellman equation. Given any deterministic policy , the (one-step) Bellman equation is Q(st, at) = r(st, at) + ∆t E[Q(st+1, (st+1))], where Q is the action-value function, is the discounting factor, and st and at are the state and action at time t. The multistep or n-step Bell- man equation is Q(st, at) = r(st, at) + E[ i=1..n–1 i r(st+i, (st+i)) + n Q(st+n, (st+n)) ]. 68 Bellman error. an error signal derived from a Bellman equation, used for training estimates of Qμ. For instance, the one-step Bellman Q-equation expresses a property of the Q-function, Q(st, at) = r(st, at) + E[Q(st+1, (st+1))], which can be turned into an error signal to train estimates of Q, namely eQ = Q(st, at) – r(st, at) – Q(st+1, (st+1)). Bellman optimality equation. the Bellman equation for the optimal policy *, e.g. the one-step version is Q*(st, at) = r(st, at) + E[mina' Q*(st+1, a') ], where Q* is the optimal action-value function. bootstrap. train an estimate of the action-value, Qμ, or a related function such as the return or value, using a Bellman error. conditioning. Any smooth function is bowl-shaped near its minima, i.e. it approximates a quadratic func- tion, of the form α + βT + TB, where is the func- tion argument, α is a scalar, β is a vector, and B is a pos- itive-semidefinite matrix. If the eigenvalues of B differ greatly in size (i.e. if the bowl is much narrower and steeper in some dimensions than in others) then gradi- ent-descent methods take a long time to reach the mini- mum, because no one learning rate constant η is well suited for both the shallow and the steep dimensions. In that case, the function and the learning problem are said be poorly conditioned. controller. an agent, in the terminology of control the- ory. control signal. in control-theory terminology, the out- put, often called u, of a controller (i.e. an agent) — analogous to the action, a, in reinforcement learning. 69 convex optimization. finding the optimum of a convex function, i.e. one with only a single optimum, rather than many local optima. More precisely, a convex func- tion may have multiple optima, but they form a con- nected set and all have the same value, i.e. none is bet- ter than any other. Examples of convex optimization methods are least-mean squares learning (LMS), nor- malized least-mean squares learning (NLMS), recursive least squares (RLS), and backprop in the final layer of a network. An example of non-convex learning is back- prop through a network of 3 or more layers. Convex op- timization is usually faster, and easier to prove theo- rems about, than non-convex optimization. convolutional network. a network where neurons form maps, i.e. each neuron in any one layer l receives input from only a small contiguous patch of neurons in layer l – 1, and neighbouring neurons in layer l receive their input from neighbouring patches; and further, all neu- rons in any one map share their weights, i.e. their in- coming synaptic weights have the same numerical val- ues, arranged in the same spatial pattern within their in- put fields. cross-entropy loss. a loss function L = –y*T log(y) that is useful for networks with softmax outputs. DDPG. The deep deterministic policy gradient algo- rithm. decision boundary. Many neural networks receive vectors x, belonging to some high-dimensional space X, as inputs (e.g. x might be a set of greyscale values of pixels in an image, and X the space of all such images of that size and shape), and they learn to classify the vectors (e.g. identifying x as an image of a handwritten 70 “7”). A decision boundary is a hypersurface in X sepa- rating one class from another, e.g. separating the x-vec- tors representing “3”s from x’s representing “8”s. deep deterministic policy gradient (DDPG). an algo- rithm for learning policies in tasks where the state and action spaces are continuous (i.e. the sets of all possible states and actions are continua, not finite sets). The al- gorithm uses layered networks for the policy and to es- timate the action-value function. deep learning. backprop-based learning in networks of 3 or more layers (or 4 or more according to some au- thors). dense network. a network where each neuron in each layer projects to all neurons in the next layer. Also called a fully-connected network. discounting. In reinforcement learning, we try to max- imize the expected return, where the return is the time- integral of discounted future rewards up to some time T, i.e. R = rt + rt+1 + 2 rt+2 + …, and is a scalar dis- counting factor between 0 and 1, typically about 0.99. Discounting implies that we care less and less about re- wards further and further in the future. One motivation is that we are less and less certain about those future re- wards. Another is that discounting helps us avoid infi- nite returns. dropout. a method of reducing overfitting by randomly switching off neurons in a network during learning. early stopping. a method of reducing overfitting by halting learning when performance begins to decline on a validation set. 71 feedforward network. a neural network with no loop- ing connections, i.e. where no neuron projects to itself directly or via other neurons. fully-connected network. a dense network. generative model. a network that learns to output fake but realistic data. Given a set of examples of, say, hand- written numerals, the model learns to produce new im- ages, different from those in the training examples, that look like handwritten numerals, with variants appearing with realistic probabilities, e.g. if 5% of the “7”s in the training examples have crossbars then 5% of the model’s “7”s should also have crossbars. GPU. the graphical processing unit, a part of a com- puter, separate from the CPU (central processing unit), designed to handle computations for fast graphical out- put, e.g. in computer games. GPUs are fast because they perform thousands of operations in parallel, whereas most current CPUs manage only about 8 operations in parallel. Though originally designed for graphics, GPUs are now also used for non-graphical machine-learning operations, such as matrix multiplications, which bene- fit from parallelization. gradient noise. In most learning methods, we would like to know the gradient of the expected loss or return, but in fact we have only an estimate of that gradient based on one or a few examples (a minibatch). These estimates will vary around the true gradient depending on the specific examples. This variance of the estimate is called gradient noise. hard-zero saturation. the fact that some types of artifi- cial neurons have activation functions that are flat over 72 some range of inputs, e.g. relu neurons output 0 when- ever their inputs are below zero. As a result, the deriva- tives that drive learning in the neuron are also 0 when its example inputs are in the flat range, and so the cell cannot learn from those examples. horizon. In some reinforcement learning tasks, we inte- grate rewards over only a finite stretch of time, i.e. to a finite final time T. These are called finite-horizon (or episodic) tasks. Other tasks have an infinite horizon, though these often use discounting, and then authors may still write loosely of a “horizon”, e.g. saying that a small discounting factor results in a close horizon. Still other tasks maximize an integral that runs always some finite time into the future, e.g. always 100 ms for- ward from the current time; these are receding-horizon tasks. KL divergence. Kullback-Leibler divergence, a meas- ure of the similarity of two probability distributions, sometimes used with softmax networks, whose outputs can be regarded as probabilities. L2 loss. the squared error loss function, defined as L = (y – y*)T(y – y*). LSTM. long short-term memory, a kind of neuron used in some recurrent nets to control how certain signals persist or fade away as time passes. maxout unit. a neuron whose activation function con- sists of multiple affine segments. mel scale. a series of pitches which humans judge to be equally spaced, used in studies of phoneme recognition. 73 MNIST. a data set used to train networks to read hand- written numerals, with 60,000 training examples, each a 28-by-28-pixel image of one of the digits “0” through “9”, and additional test examples. objective function. the thing we want our learning al- gorithm to optimize, e.g. expected loss or return. observation. In this course, we assume for simplicity that the agent knows the state, s, of the environment at each time point. In reality, an agent may not know s in its entirety, but may instead make a more limited obser- vation o (i.e. gather some sensory or other data) at each time point, where o may not contain enough infor- mation to specify s. off-policy training. In reinforcement learning, we often learn the action-value function Q of a policy by gra- dient descent, using the Bellman error eQ = Q(st, at) – r(st, at) + Q(st+1, (st+1)). This choice of error makes sense because if the estimate Q equals the true Q then E[eQ] = 0 on average. Further, that equality holds whether or not at equals (st), which means we can, if we wish, learn Q using at’s that do not fit the policy . If we do use at (st), we are training off-policy. Our motivation might be to explore a wider range of actions than those chosen by , or to reuse training data r(st, at) that were sampled in the past, when the agent was using a different policy. one-hot. a kind of desired output vector for a network, where there is always one element of the vector equal to 1 and the others are all 0s. 74 overfitting. a situation where a learner improves its ac- curacy on its training examples, but gets worse on test data, i.e. on data it has not seen during its training. pooling. Convolutional nets often contain pooling lay- ers, which combine data from multiple pixels, discard- ing some information, e.g. a max-pooling layer might act on a 24-by-24 pixel greyscale image to create a 12- by-12 image where the grey level of each pixel in the smaller image is the maximum over the 4 grey levels of 4 adjacent pixels in the larger image. prior. prior probability, i.e. the probability of an event judged before some new data arrive, e.g. I might judge it unlikely that it rained last night, but then look out the window and see that the roads are wet; the prior proba- bility that it rained was low, but the posterior probabil- ity, after I have looked at the roads, is high. More gen- erally, priors can be any information or assumptions which I build into a network before it learns, for regu- larization e.g. if I have prior knowledge that some func- tion I want to learn is quadratic, then I can design a net- work whose hypothesis space consists only of quadratic functions. Q-function. the action-value function. Q-learning. a method of reinforcement learning where the agent learns the optimal Q-function directly, rather than through a series of policy-specific Q’s, using e = Q*(s, a) – r(s, a) – maxa'Q *(s', a') or e = Q*(s, a) – r(s, a) – maxa'Q–(s', a'), where Q– is a target function. The agent’s policy is μ(s) = argmaxaQ(s, a). Hence Q- learning works only in problems where it is possible to compute that argmax efficiently. It is likely not useful for sensorimotor systems in the brain, where the argmax 75 would be hard to compute because Q is a deep net- work. recurrent network. a neural network with looping con- nections, i.e. where one or more neurons project to themselves, either directly or via other neurons. regularization. any method used to reduce overfitting. reinforcement learning. learning based on rewards, as opposed to supervised learning. relu. rectified linear unit, i.e. a neuron whose activation function is linear (and in fact equals the identity func- tion) except for a cut-off at zero — y{l}i = max(0, v{l}i). replay buffer. In reinforcement learning there is often a network that learns a function (such as the action-value) based on training examples showing how various ac- tions in various situations yielded specific rewards. In some methods, the control system uses these training data — states, actions, rewards — as it collects them, e.g. it might carry out a reaching movement, collect the current state, action, and reward at each moment as the reach proceeds, immediately use those data to adjust its network parameters, and then discard the data to make room for the next state, action, and reward. A problem with this approach is that, during a smooth movement like a reach, the successive states are very close to- gether, and so the learner gets a series of very similar training examples. These examples may push the learner’s parameters toward values that are optimal for that small region of state space but not very good over- all. We can avoid that problem by storing many training examples in a replay buffer, and adjusting parameters at 76 each time step based not on the current example but on a minibatch of training data from many different times and reaches, drawn randomly from the buffer. return. time-integral of rewards. reward. Any reinforcement learning task begins with a reward function r(s, a) which defines what the agent is trying to achieve. Specifically, the agent’s aim is to maximize the return, which is a time-integral of the re- ward. SARSA. a method of reinforcement learning, identical to Q-learning except that the agent learns using e = Q(s, a) – r(s, a) – Q(s', a') or e = Q(s, a) – r(s, a) – Q–(s', a'), where the next-step s' and a' are both sam- pled, and Q– is a target function. So SARSA, unlike Q- learning, doesn’t compute argmaxa'Q for its learning, but it does still use the policy μ(s) = argmaxaQ(s, a). It is called SARSA because it computes its e based on a state, action, reward, and the subsequent state and ac- tion. self play. training agents by having them compete against other agents, usually in a simulated environ- ment. SIFT. the scale-invariant feature transform, a computer- vision algorithm that detects certain image features that are useful for object recognition. These features were designed by computer scientists, whereas convolutional nets instead learn useful features by backprop. softmax. When training a one-hot network, it is often helpful to let the final-layer neurons be affine (even if the upstream neurons are, say, relu or tanh), and then 77 define the net’s output y to be a softmax function of the final-layer signal y{nl}, i.e. define the i-th element of y to be yi = exp(y{nl} i)/j exp(y{nl} j). This formula en- sures that y’s elements are all between 0 and 1, and sum to 1, like the elements of a one-hot y* (and also like probabilities, so we can regard each element yi of the softmax output as the network’s estimate of the proba- bility that the input belongs to class i). state. roughly, a vector which contains maximal infor- mation about the future. 1) In a deterministic environ- ment, the state is a vector s which contains enough in- formation about the world that, given s at just one time t, and the actions at time t and all subsequent time points, it is possible in principle to predict the whole fu- ture evolution of s. 2) In a stochastic environment, the state contains enough information that, given s at time t, and the actions at time t and subsequently, we can com- pute the probabilities of all future s, and those probabili- ties would not be altered if we had any additional infor- mation from time t or earlier. stochastic gradient descent. online gradient-descent learning where parameters are adjusted after each single example or minibatch, using a gradient estimate based only on that example or minibatch. supervised learning. a method where a learner receives input vectors x, computes outputs y, and gets error feed- back e that specifies the difference between its output and some desired output y*, e.g. e = y – y*. The learner adjusts its parameters to reduce some non-negative sca- lar function of e, such as the squared error eTe. tanh. the hyperbolic tangent function. 78 target function. 1) the function τ a network is trying to learn. 2) a function such as Q– in DDPG, used to sim- plify gradient descent. u. in control theory, the control signal, analogous to the action, a, in reinforcement learning. unsupervised learning. any method where a learner adjusts at least some of its parameters based on a crite- rion that does not involve its output error or reward, e.g. it might look for a way to compress its input data into a lower-dimensional representation before worrying about reducing its output errors or maximizing its re- turns. validation set. To combat overfitting, we may withhold a subset of the training data called the validation set, so the learner cannot use those examples to adjust its pa- rameters. We use the validation set to monitor for over- fitting, periodically testing the learner on those withheld data to see whether its performance is getting worse. If it does begin to worsen on the validation set then we may halt its learning, even though it is still improving on the non-withheld training data, to avoid overfitting — a method called early stopping. value function. The value V μ(s) of a state s, given a policy μ and a reward function r, is the expected time- integral of r(s, a), possibly discounted, from now to some final time T, if we start in state s and use policy μ. One way to state the aim of reinforcement learning is that it tries to find policies that maximize the value V μ(s) at every state s. 79 whitening. applying a linear transformation T to a ran- dom vector v so that the covariance matrix of w = Tv is the identity matrix, i.e. the elements are w are uncorre- lated with each other, and each has variance 1. As a re- sult, even if the vectors v form an elongated, rotated cloud, the w’s will form a spherical ball. If v has dimen- sionality > 1 then there are many different linear trans- formations that will achieve whitening, because if T works then so does OT, where O is any orthogonal ma- trix (because rotating a ball of w vectors leaves it in the form of a ball). Whitening input vectors, or vector sig- nals between layers in a deep net, can speed up learning by improving conditioning. ZCA whitening. zero-phase component analysis whit- ening, a form of whitening where the linear transfor- mation T is the square root of the covariance of v. It is the version of whitening that yields basis vectors as close as possible to the basis vectors originally used to represent v.
欢迎咨询51作业君