The view of artificial neural networks as adaptive systems has led to the development of ad hoc generic procedures known as learning rules. The first of these is the Perceptron Rule (Rosenblatt, 1962), useful for single-layer feed-forward networks and linearly separable problems. Its simplicity and beauty, together with the existence of a convergence theorem, made it a basic point of departure for neural learning algorithms. This algorithm is a particular case of the Widrow-Hoff or delta rule (Widrow & Hoff, 1960), applicable to continuous networks with no hidden layers and an error function that is quadratic in the parameters.
The first truly useful algorithm for feed-forward multilayer networks is the backpropagation algorithm (Rumelhart, Hinton & Williams, 1986), reportedly first proposed by Werbos (1974) and Parker (1982). Much effort has been devoted to enhancing it in a number of ways, especially concerning the speed and reliability of convergence (Haykin, 1994; Hecht-Nielsen, 1990). In general, the backpropagation algorithm serves to compute the gradient vector in all the first-order methods reviewed below.
Neural networks are trained by setting values for the network parameters w so as to minimize an error function E(w). If this function is quadratic in w, the solution can be found by solving a linear system of equations (e.g. with Singular Value Decomposition (Press, Teukolsky, Vetterling & Flannery, 1992)) or iteratively with the delta rule. In the general case, the minimization is realized by a variant of a gradient descent procedure, whose ultimate outcome is a local minimum: a point w* from which any infinitesimal change makes E increase, and which may not correspond to one of the global minima. Different solutions may be found by starting at different initial states. The process is also perturbed by roundoff errors. Given E(w) to be minimized and an initial state $w_0$, these methods perform the following updating step at each iteration:
$$w_{i+1} = w_i + \alpha_i u_i \qquad (1)$$
where $u_i$ is the minimization direction (the direction in which to move) and $\alpha_i \in \mathbb{R}$ is the step size (how far to move along $u_i$), also known as the learning rate in earlier contexts. For convenience, define $\Delta w_i = w_{i+1} - w_i$. Common stopping criteria are:
A maximum number of presentations of the training set D (epochs) has been reached.
A maximum amount of computing time has been exceeded.
The error E(w) has fallen below a certain tolerance.
The gradient norm has fallen below a certain tolerance.
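The updating step (1) and the stopping criteria listed above can be illustrated with a minimal sketch. It is not taken from the chapter and makes several assumptions: the model is linear, so E(w) is quadratic in w; the direction is plain steepest descent, u_i = -∇E(w_i); the step size α is constant; and each iteration uses the whole data set, so one iteration counts as one epoch. The function names, tolerances and toy data are hypothetical. For comparison, the same quadratic problem is also solved directly with NumPy's `lstsq`, which relies on an SVD-based LAPACK routine.

```python
import time
import numpy as np


def error(w, X, t):
    """Quadratic error E(w) = 1/2 * sum of squared residuals of a linear model."""
    r = X @ w - t
    return 0.5 * float(r @ r)


def gradient(w, X, t):
    """Gradient of E(w) with respect to w."""
    return X.T @ (X @ w - t)


def train(X, t, w0, alpha=0.01, max_epochs=10_000, max_seconds=5.0,
          error_tol=1e-12, grad_tol=1e-8):
    """Iterate w_{i+1} = w_i + alpha * u_i with u_i = -gradient (assumed choice).

    Returns the final weights and the stopping criterion that fired.
    """
    w = np.asarray(w0, dtype=float).copy()
    start = time.monotonic()
    for _ in range(max_epochs):
        g = gradient(w, X, t)
        if np.linalg.norm(g) < grad_tol:
            return w, "gradient norm below tolerance"
        if error(w, X, t) < error_tol:
            return w, "error below tolerance"
        if time.monotonic() - start > max_seconds:
            return w, "time limit exceeded"
        u = -g               # minimization direction: steepest descent
        w = w + alpha * u    # updating step (1) with constant step size
    return w, "maximum number of epochs reached"


# Hypothetical toy data: targets follow t = 1 + 2x exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])

w_iter, reason = train(X, t, w0=np.zeros(2))

# Direct solution of the same quadratic problem (least squares via SVD).
w_direct, *_ = np.linalg.lstsq(X, t, rcond=None)

print("iterative:", w_iter, "stopped because:", reason)
print("direct:   ", w_direct)
```

On this toy problem both routes should agree closely on the weights. The constant step size is the simplest possible choice; it must be small enough for the iteration to remain stable, which is one reason why methods that adapt α_i or choose better directions u_i are of interest.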
Key Terms in this Chapter
First-Order Method: A training algorithm using the objective function and its gradient vector.
Second-Order Method: A training algorithm using the objective function, its gradient vector and Hessian matrix.
Feed-Forward Artificial Neural Network: Artificial Neural Network whose graph has no cycles.
Back-Propagation: Algorithm for feed-forward multilayer networks that can be used to efficiently compute the gradient vector in all the first-order methods.
Learning Algorithm: Method or algorithm by virtue of which an Artificial Neural Network develops a representation of the information present in the learning examples, by modification of the weights.
Weight: A free parameter of an Artificial Neural Network, modified through the action of a Learning Algorithm to obtain desired responses to certain input stimuli.
Artificial Neural Network: Information processing structure without global or shared memory that takes the form of a directed graph where each of the computing elements (“neurons”) is a simple processor with internal and adjustable parameters, that operates only when all its incoming information is available.