Introduction
A first wave of interest in neural networks (also known as `connectionist models' or `parallel distributed processing') emerged after the introduction of simplified neurons by McCulloch and Pitts in 1943 (McCulloch & Pitts, 1943). These neurons were presented as models of biological neurons and as conceptual components for circuits that could perform computational tasks.
When Minsky and Papert published their book Perceptrons in 1969 (Minsky & Papert, 1969), in which they showed the deficiencies of perceptron models, most neural network funding was redirected and researchers left the field. Only a few researchers continued their efforts, most notably Teuvo Kohonen, Stephen Grossberg, James Anderson, and Kunihiko Fukushima.
The interest in neural networks re-emerged only after some important theoretical results were attained in the early eighties (most notably the discovery of error back-propagation), and new hardware developments increased the processing capacities. This renewed interest is reflected in the number of scientists, the amounts of funding, the number of large conferences, and the number of journals associated with neural networks. Nowadays most universities have a neural networks group, within their psychology, physics, computer science, or biology departments.
Artificial neural networks can be most adequately characterised as `computational models' with particular properties such as the ability to adapt or learn, to generalise, or to cluster or organise data, and whose operation is based on parallel processing. However, many of the above-mentioned properties can also be attributed to existing (non-neural) models; the intriguing question is to what extent the neural approach proves to be better suited for certain applications than existing models. To date, no unequivocal answer to this question has been found.
Often parallels with biological systems are described. However, there is still so little known (even at the lowest cell level) about biological systems, that the models we are using for our artificial neural systems seem to introduce an oversimplification of the `biological' models.
In this course we give an introduction to artificial neural networks. The point of view we take is that of a computer scientist. We are not concerned with the psychological implication of the networks, and we will at most occasionally refer to biological neural models. We consider neural networks as an alternative computational scheme rather than anything else.
Fundamentals
The artificial neural networks which we describe in this course are all variations on the parallel distributed processing (PDP) idea. The architecture of each network is based on very similar building blocks which perform the processing. Here we first discuss these processing units and discuss different network topologies. Learning strategies (as a basis for an adaptive system) will be presented in the last section.
A framework for distributed representation
An artificial network consists of a pool of simple processing units which communicate by sending signals to each other over a large number of weighted connections.
A set of major aspects of a parallel distributed model can be distinguished (cf. Rumelhart and McClelland, 1986 (McClelland & Rumelhart, 1986; Rumelhart & McClelland, 1986)):
- a set of processing units (`neurons,' `cells')
- a state of activation yk for every unit, which is equivalent to the output of the unit
- connections between the units. Generally each connection is defined by a weight wjk which determines the effect which the signal of unit j has on unit k
- a propagation rule, which determines the effective input sk of a unit from its external inputs
- an activation function Fk, which determines the new level of activation based on the effective input sk(t) and the current activation yk(t) (i.e., the update)
- an external input (aka bias, offset) θk for each unit
- a method for information gathering (the learning rule)
- an environment within which the system must operate, providing input signals and (if necessary) error signals.
Each unit performs a relatively simple job:
- receive input from neighbours or external sources and use this to compute an output signal which is propagated to other units.
- apart from this processing, a second task is the adjustment of the weights.
Within neural systems it is useful to distinguish three types of units:
- input units (indicated by an index i) which receive data from outside the neural network,
- output units (indicated by an index o) which send data out of the neural network, and
- hidden units (indicated by an index h) whose input and output signals remain within the neural network.
During operation, units can be updated either synchronously or asynchronously:
- with synchronous updating, all units update their activation simultaneously
- with asynchronous updating, each unit has a (usually fixed) probability of updating its activation at a time t, and usually only one unit will be able to do this at a time. In some cases the latter model has some advantages.
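The two updating schemes can be sketched as follows. This is a minimal illustration, not code from the text; the `update` rule passed in (here a hypothetical bit-flip) stands in for whatever activation function the network uses.

```python
import random

def synchronous_step(y, update):
    """Synchronous updating: every unit computes its new activation
    from the current (old) state, and all units change at once."""
    return [update(y, k) for k in range(len(y))]

def asynchronous_step(y, update, rng=random):
    """Asynchronous updating: one randomly chosen unit recomputes its
    activation; the remaining units keep their current values."""
    k = rng.randrange(len(y))
    y = list(y)
    y[k] = update(y, k)
    return y

# Toy update rule (hypothetical): each unit flips its binary activation.
flip = lambda y, k: 1 - y[k]
print(synchronous_step([0, 1, 0], flip))   # every unit flips: [1, 0, 1]
```

Note that an asynchronous step changes at most one unit, so the trajectory of the network state can differ markedly from the synchronous case.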
In most cases we assume that each unit provides an additive contribution to the input of the unit with which it is connected. The total input to unit k is simply the weighted sum of the separate outputs from each of the connected units plus a bias (or offset) term θk :
sk(t) = Σj wjk(t) yj(t) + θk(t)    (2.1)

The contribution for positive wjk is considered as an excitation and for negative wjk as inhibition. In some cases more complex rules for combining inputs are used, in which a distinction is made between excitatory and inhibitory inputs. We call units with a propagation rule (2.1) sigma units.
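The propagation rule (2.1) is just a weighted sum plus a bias; a minimal sketch (the weights and activations below are hypothetical values, not from the text):

```python
import numpy as np

def sigma_unit_input(w_k, y, theta_k):
    """Effective input s_k of a sigma unit (equation 2.1): the weighted
    sum of the outputs of the connected units plus a bias term."""
    return np.dot(w_k, y) + theta_k

# Hypothetical values: three incoming connections into unit k.
y = np.array([1.0, 0.0, 0.5])      # activations y_j of the sending units
w_k = np.array([0.4, -0.2, 0.6])   # weights w_jk into unit k
theta_k = 0.1                      # bias (offset) term

s_k = sigma_unit_input(w_k, y, theta_k)
# 0.4*1.0 + (-0.2)*0.0 + 0.6*0.5 + 0.1 = 0.8
```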
A different propagation rule, introduced by Feldman and Ballard (Feldman & Ballard, 1982), is known as the propagation rule for the sigma-pi unit:
sk(t) = Σj wjk(t) Πm yjm(t) + θk(t)    (2.2)

Often, the yjm are weighted before multiplication. Although these units are not frequently used, they have their value for gating of input, as well as implementation of lookup tables (Mel, 1990).
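In a sigma-pi unit (2.2), each weight multiplies the product of a group of unit outputs, which is what makes gating possible: a zero anywhere in a group switches that whole term off. A minimal sketch with hypothetical values:

```python
import numpy as np

def sigma_pi_input(weights, input_groups, theta_k):
    """Effective input of a sigma-pi unit (equation 2.2): each weight
    w_jk multiplies the *product* of a group of unit outputs y_jm, and
    the weighted products are summed."""
    return sum(w * np.prod(g) for w, g in zip(weights, input_groups)) + theta_k

# Hypothetical example: the second input of each group acts as a gate.
weights = [0.5, 1.0]
groups = [[0.8, 1.0],   # contributes 0.5 * 0.8 * 1.0 = 0.4
          [0.9, 0.0]]   # gated off:  1.0 * 0.9 * 0.0 = 0.0
s_k = sigma_pi_input(weights, groups, theta_k=0.0)   # 0.4
```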
Activation and output rules
We also need a rule which gives the effect of the total input on the activation of the unit. We need a function Fk which takes the total input sk (t) and the current activation yk (t) and produces a new value of the activation of the unit k:
yk(t + 1) = Fk(yk(t), sk(t))    (2.3)
Often, the activation function is a non-decreasing function of the total input of the unit:
yk(t + 1) = Fk(sk(t)) = Fk( Σj wjk(t) yj(t) + θk(t) )    (2.4)

although activation functions are not restricted to non-decreasing functions. Generally, some sort of threshold function is used: a hard limiting threshold function (a sgn function), a linear or semi-linear function, or a smoothly limiting threshold (see figure 2.2). For this smoothly limiting function often a sigmoid (S-shaped) function like
yk = F(sk) = 1/(1 + e^(-sk))    (2.5)

is used. In some applications a hyperbolic tangent is used, yielding output values in the range [-1, +1].
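Both smoothly limiting activation functions are one-liners; a minimal sketch:

```python
import math

def sigmoid(s):
    """Smoothly limiting threshold (equation 2.5); output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def tanh_activation(s):
    """Hyperbolic-tangent variant; output in (-1, +1)."""
    return math.tanh(s)

# The sigmoid is 0.5 at s = 0 and saturates for large |s|;
# tanh is 0 at s = 0 and saturates towards -1 and +1.
print(sigmoid(0.0), tanh_activation(0.0))
```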
In some cases, the output of a unit can be a stochastic function of the total input of the unit. In that case the activation is not deterministically determined by the neuron input; rather, the neuron input determines the probability p that the neuron gets a high activation value:
p(yk ← 1) = 1/(1 + e^(-sk/T))    (2.6)

in which T (cf. temperature) is a parameter which determines the slope of the probability function. This type of unit will be discussed more extensively later.
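Rule (2.6) can be sampled directly; a minimal sketch (the high/low activation values 1 and 0 and the parameter values are assumptions for illustration):

```python
import math
import random

def stochastic_unit(s_k, T, rng=random):
    """Stochastic activation (equation 2.6): the unit takes the high
    activation value (here 1, otherwise 0) with probability
    p = 1 / (1 + exp(-s_k / T)). T acts as a temperature: a large T
    flattens the probability curve towards 0.5."""
    p = 1.0 / (1.0 + math.exp(-s_k / T))
    return 1 if rng.random() < p else 0

# With a very low temperature the unit behaves almost deterministically.
random.seed(0)
samples = [stochastic_unit(2.0, T=0.1) for _ in range(100)]
```

For s_k = 2 and T = 0.1 the probability of a high activation is essentially 1, so nearly every sample comes out as 1.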
In all networks we describe we consider the output of a neuron to be identical to its activation level.
Network topologies
In the previous section we discussed the properties of the basic processing unit in an artificial neural network. This section focuses on the pattern of connections between the units and the propagation of data. As for this pattern of connections, the main distinction we can make is between:
- Feed-forward networks, where the data flow from input to output units is strictly feed-forward. The data processing can extend over multiple (layers of) units, but no feedback connections are present, that is, connections extending from outputs of units to inputs of units in the same layer or previous layers.
- Recurrent networks that do contain feedback connections. Contrary to feed-forward networks, the dynamical properties of the network are important. In some cases, the activation values of the units undergo a relaxation process such that the network will evolve to a stable state in which these activations do not change anymore. In other applications, the changes of the activation values of the output neurons are significant, such that the dynamical behaviour constitutes the output of the network (Pearlmutter, 1990).
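A strictly feed-forward pass can be sketched in a few lines; the layer sizes, weights, and the choice of the sigmoid as activation function below are illustrative assumptions, not from the text:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def feed_forward(x, layers):
    """One strictly feed-forward pass: data flows from the input units
    through successive layers, with no feedback connections. `layers`
    is a list of (W, theta) pairs; each layer computes
    y = F(W y_prev + theta), with the sigmoid as activation function F."""
    y = x
    for W, theta in layers:
        y = sigmoid(W @ y + theta)
    return y

# Hypothetical network with two input units and one output unit.
x = np.array([1.0, 0.0])
layers = [(np.array([[0.5, -0.5]]), np.array([0.0]))]
y = feed_forward(x, layers)   # sigmoid(0.5) ≈ 0.622
```

A recurrent network, by contrast, would feed `y` back into the same units on the next time step, so the interesting object is the trajectory of activations over time rather than a single pass.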
Training of artificial neural networks
A neural network has to be configured such that the application of a set of inputs produces (either `direct' or via a relaxation process) the desired set of outputs. Various methods to set the strengths of the connections exist.
- One way is to set the weights explicitly, using a priori knowledge.
- Another way is to `train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule.
We can categorise the learning situations in two distinct sorts. These are:
- Supervised learning or Associative learning in which the network is trained by providing it with input and matching output patterns. These input-output pairs can be provided by an external teacher, or by the system which contains the network (self-supervised).
- Unsupervised learning or Self-organisation in which an (output) unit is trained to respond to clusters of pattern within the input. In this paradigm the system is supposed to discover statistically salient features of the input population. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather, the system must develop its own representation of the input stimuli.
Both learning paradigms discussed above result in an adjustment of the weights of the connections between units, according to some modification rule. Virtually all learning rules for models of this type can be considered as a variant of the Hebbian learning rule suggested by Hebb in his classic book Organization of Behaviour (1949) (Hebb, 1949).
The basic idea is that if two units j and k are active simultaneously, their interconnection must be strengthened. If j receives input from k, the simplest version of Hebbian learning prescribes to modify the weight wjk with:
Δwjk = γ yj yk    (2.7)

where γ is a positive constant of proportionality representing the learning rate. Another common rule uses not the actual activation of unit k but the difference between the actual and desired activation for adjusting the weights:
Δwjk = γ yj (dk − yk)    (2.8)

in which dk is the desired activation provided by a teacher. This is often called the Widrow-Hoff rule or the delta rule, and will be discussed in the next section. Many variants (often very exotic ones) have been published in the last few years. In the next sections some of these update rules will be discussed.
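Both update rules are single multiply-accumulate steps; a minimal sketch with hypothetical activations and learning rate:

```python
def hebbian_update(w_jk, y_j, y_k, gamma):
    """Hebbian rule (equation 2.7): strengthen the connection when
    units j and k are active simultaneously."""
    return w_jk + gamma * y_j * y_k

def delta_update(w_jk, y_j, y_k, d_k, gamma):
    """Widrow-Hoff / delta rule (equation 2.8): adjust the weight in
    proportion to the difference between the desired and the actual
    activation of unit k."""
    return w_jk + gamma * y_j * (d_k - y_k)

# Both units active: Hebbian learning increases the weight.
w1 = hebbian_update(0.0, y_j=1.0, y_k=1.0, gamma=0.1)            # 0.1
# Output too low (d_k > y_k): the delta rule also increases the weight.
w2 = delta_update(0.0, y_j=1.0, y_k=0.2, d_k=1.0, gamma=0.1)     # 0.08
```

Note the key difference: the Hebbian rule keeps growing the weight as long as both units are active, whereas the delta rule stops changing it once the actual activation matches the desired one.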
Notation and terminology
Throughout the years researchers from different disciplines have come up with a vast number of terms applicable in the field of neural networks. Our computer scientist point-of-view enables us to adhere to a subset of the terminology which is less biologically inspired, yet conflicts still arise. Our conventions are discussed below.
We use the following notation in our formulae. Note that not all symbols are meaningful for all networks, and that in some cases subscripts or superscripts may be left out (e.g., p is often not necessary) or added (e.g., vectors can, contrariwise to the notation below, have indices) where necessary. Vectors are indicated with a bold non-slanted font:
- j , k, ... the unit j , k, ...;
- i an input unit
- h a hidden unit
- o an output unit
- xp the pth input pattern vector
- xpj the jth element of the pth input pattern vector
- sp the input to a set of neurons when input pattern vector p is clamped (i.e., presented to the network); often: the input of the network by clamping input pattern vector p
- dp the desired output of the network when input pattern vector p was input to the network
- dpj the jth element of the desired output of the network when input pattern vector p was input to the network
- yp the activation values of the network when input pattern vector p was input to the network
- ypj the activation values of element j of the network when input pattern vector p was input to the network
- W the matrix of connection weights
- wj the weights of the connections which feed into unit j
- wjk the weight of the connection from unit j to unit k
- Fj the activation function associated with unit j
- γjk the learning rate associated with weight wjk
- θ the biases to the units
- θj the bias input to unit j
- Uj the threshold of unit j in Fj
- Ep the error in the output of the network when input pattern vector p is input
- E the energy of the network.
Output vs. activation of a unit. Since there is no need to do otherwise, we consider the output and the activation value of a unit to be one and the same thing. That is, the output of each neuron equals its activation value.
Bias, offset, threshold. These terms all refer to a constant (i.e., independent of the network input but adapted by the learning rule) term which is input to a unit. They may be used interchangeably, although the latter two terms are often envisaged as a property of the activation function. Furthermore, this external input is usually implemented (and can be written) as a weight from a unit with activation value 1.
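The observation that the bias can be written as a weight from a unit with activation value 1 is easy to verify numerically; the weights and activations below are hypothetical:

```python
import numpy as np

# The bias theta_k can be absorbed into the weight vector by appending a
# pseudo-unit whose activation is permanently 1. Both formulations give
# the same effective input s_k.
w = np.array([0.4, -0.2])   # weights into unit k
theta = 0.3                 # bias term
y = np.array([1.0, 0.5])    # activations of the sending units

s_explicit = np.dot(w, y) + theta                              # bias added separately
s_absorbed = np.dot(np.append(w, theta), np.append(y, 1.0))    # bias as extra weight
print(s_explicit, s_absorbed)   # both 0.6
```

This trick is convenient in practice because the learning rule can then treat the bias exactly like any other weight.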
Number of layers. In a feed-forward network, the inputs perform no computation and their layer is therefore not counted. Thus a network with one input layer, one hidden layer, and one output layer is referred to as a network with two layers. This convention is widely though not yet universally used.
Representation vs. learning. When using a neural network one has to distinguish two issues which influence the performance of the system. The first one is the representational power of the network, the second one is the learning algorithm.
The representational power of a neural network refers to the ability of a neural network to represent a desired function. Because a neural network is built from a set of standard functions, in most cases the network will only approximate the desired function, and even for an optimal set of weights the approximation error is not zero.
The second issue is the learning algorithm. Given that there exists a set of optimal weights in the network, is there a procedure to (iteratively) find this set of weights?