## Abstract

The view of artificial neural networks as adaptive systems has led to the development of ad hoc generic procedures known as learning rules. The first of these is the Perceptron Rule (Rosenblatt, 1962), useful for single-layer feed-forward networks and linearly separable problems. Its simplicity and beauty, together with the existence of a convergence theorem, made it a basic departure point for neural learning algorithms. This algorithm is a particular case of the Widrow-Hoff or delta rule (Widrow & Hoff, 1960), applicable to continuous networks with no hidden layers whose error function is quadratic in the parameters.
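As a concrete illustration (not part of the original chapter), the following Python/NumPy sketch implements the Perceptron Rule update and a single Widrow-Hoff step; the function names, the ±1 label convention, and the default learning rates are illustrative assumptions.

```python
import numpy as np

def perceptron_train(X, y, epochs=100, lr=1.0):
    """Perceptron Rule (Rosenblatt) for a single-layer network.
    Labels y are +1/-1; by the convergence theorem, the loop
    terminates when the data are linearly separable."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:  # misclassified example
                w += lr * y_i * x_i       # move w toward the example
                b += lr * y_i
                errors += 1
        if errors == 0:  # all examples classified correctly
            break
    return w, b

def delta_rule_step(w, x, t, lr=0.1):
    """Widrow-Hoff (delta) rule for a linear unit y = w @ x,
    descending the quadratic error (t - y)^2 / 2."""
    return w + lr * (t - w @ x) * x
```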

## Background

The first truly useful algorithm for feed-forward multilayer networks is the *backpropagation* algorithm (Rumelhart, Hinton & Williams, 1986), reportedly proposed first by Werbos (1974) and Parker (1982). Many efforts have been devoted to enhancing it in a number of ways, especially concerning speed and reliability of convergence (Haykin, 1994; Hecht-Nielsen, 1990). The backpropagation algorithm serves, in general, to compute the gradient vector in all the first-order methods reviewed below.
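To make the role of backpropagation concrete, here is a minimal Python/NumPy sketch of one gradient evaluation, assuming a single tanh hidden layer, a linear output, and a quadratic error; the shapes and names are assumptions for illustration, not the chapter's notation.

```python
import numpy as np

def backprop_gradients(w1, w2, x, t):
    """One gradient evaluation for a network with one tanh hidden
    layer and a linear output, under the quadratic error
    E = ||y - t||^2 / 2. Shapes: w1 (h, d), w2 (o, h), x (d,), t (o,)."""
    # Forward pass.
    a1 = w1 @ x          # hidden pre-activations
    z1 = np.tanh(a1)     # hidden activations
    y = w2 @ z1          # linear output
    # Backward pass: propagate the error signal layer by layer.
    delta2 = y - t                          # dE/dy for quadratic error
    g_w2 = np.outer(delta2, z1)             # dE/dw2
    delta1 = (w2.T @ delta2) * (1 - z1**2)  # chain rule through tanh
    g_w1 = np.outer(delta1, x)              # dE/dw1
    return g_w1, g_w2
```

Any of the first-order methods reviewed below can consume the gradient vector returned by such a routine.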

Neural networks are trained by setting values for the network parameters *w* so as to minimize an error function *E*(*w*). If this function is quadratic in *w*, then the solution can be found by solving a linear system of equations (e.g. with Singular Value Decomposition (Press, Teukolsky, Vetterling & Flannery, 1992)) or iteratively with the delta rule. In the general case, the minimization is realized by a variant of a *gradient descent* procedure, whose ultimate outcome is a local minimum: a *w*\* from which any infinitesimal change makes *E*(*w*\*) increase, and which may not correspond to one of the global minima. Different solutions are found by starting at different initial states, and the process is also perturbed by roundoff errors.

Given *E*(*w*) to be minimized and an initial state *w*^{0}, these methods perform at each iteration the updating step

$$w^{i+1} = w^{i} + \alpha_i\, u^{i} \qquad (1)$$

where *u*^{i} is the *minimization direction* (the direction in which to move) and α_{i} ∈ ℝ is the *step size* (how far to move along *u*^{i}), also known as the *learning rate* in earlier contexts. For convenience, define Δ*w*^{i} = *w*^{i+1} − *w*^{i}. Common stopping criteria, illustrated in the sketch after the list below, are:

1. A maximum number of presentations of the training set *D* (*epochs*) is reached.

2. A maximum amount of computing time has been exceeded.

3. The error function has fallen below a certain tolerance.

4. The gradient norm has fallen below a certain tolerance.
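Putting the update step (1) and the four criteria together, a minimal steepest-descent sketch in Python/NumPy might look as follows. The direction *u*^{i} = −∇*E*(*w*^{i}) corresponds to plain gradient descent; all names, defaults, and tolerances are illustrative assumptions rather than the chapter's prescription.

```python
import time
import numpy as np

def gradient_descent(E, grad_E, w0, alpha=0.01, max_epochs=1000,
                     max_seconds=60.0, e_tol=1e-8, g_tol=1e-6):
    """Iterates the update step (1), w <- w + alpha * u, with the
    steepest-descent direction u = -grad E(w), stopping on any of
    the four criteria listed above."""
    start = time.time()
    w = np.asarray(w0, dtype=float)
    for _ in range(max_epochs):                 # criterion 1
        if time.time() - start > max_seconds:   # criterion 2
            break
        if E(w) < e_tol:                        # criterion 3
            break
        g = grad_E(w)
        if np.linalg.norm(g) < g_tol:           # criterion 4
            break
        w = w + alpha * (-g)                    # update step (1)
    return w

# Usage on a toy quadratic E(w) = ||w||^2 / 2, whose minimum is w = 0.
w_min = gradient_descent(lambda w: 0.5 * (w @ w), lambda w: w,
                         w0=np.array([1.0, -2.0]), alpha=0.1)
```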

## Key Terms in this Chapter

First-Order Method: A training algorithm using the objective function and its gradient vector.

Second-Order Method: A training algorithm using the objective function, its gradient vector and Hessian matrix.

Feed-Forward Artificial Neural Network: Artificial Neural Network whose directed graph has no cycles.

Back-Propagation: Algorithm for feed-forward multilayer networks that can be used to efficiently compute the gradient vector in all the first-order methods.

Learning Algorithm: Method or algorithm by virtue of which an Artificial Neural Network develops a representation of the information present in the learning examples, by modification of the weights.

Weight: A free parameter of an Artificial Neural Network, modified through the action of a Learning Algorithm to obtain desired responses to certain input stimuli.

Artificial Neural Network: Information processing structure without global or shared memory that takes the form of a directed graph where each of the computing elements (“neurons”) is a simple processor with internal, adjustable parameters that operates only when all its incoming information is available.