Introduction


Neural networks, by which I mean applications of the computational model inspired by the human brain, have received much attention in the fifty-some years since their conception. In 1943, following work by Nicolas Rashevsky in the 1930s, Warren McCulloch and Walter Pitts produced a paper on how neurons might work, using a simple electrical circuit model as the basis for their idea. This was followed in 1957 by the work of Frank Rosenblatt at Cornell, who created a hardware network called the Perceptron, inspired by the operation of the mosquito eye. Following the disillusionment that resulted from the 1969 publication of Marvin Minsky and Seymour Papert's book Perceptrons, which, among other things, showed that single-layer perceptrons could not solve even the simple XOR problem, the field saw a revival in the early 1980s, sparked by the work of Caltech physicist John Hopfield (Dworman98; Latter99).
To date, a number of visualization tools have been developed for neural networks, the purpose of which is to ease the understanding both of the learning processes used to train a network and of the computation performed by the network itself. Hinton and Bond diagrams, among the most basic and popular techniques, employ simple two-dimensional figures that depict the network's edge weights and, in the case of Bond diagrams, topology. Hyperplane diagrams and response function plots attempt to display the way in which each node of a network partitions its multidimensional input space (Craven91). Additionally, a number of shareware and commercial programs have been developed, such as EasyNN (Wolstenholme00), QwikNet (Jensen98), and Trajan (Hunter99), most of which employ Bond-like visualizations.
The purpose of this report is to present a new visualization tool designed to overcome several limitations associated with existing packages. In particular, this paper presents the only visualization tool of which I know that attempts to display the hereditary and genealogical relationships involved in the use of genetic algorithms and evolution strategies to train neural networks. The goal of this project was to create a tool that would allow the user to understand how neural networks evolve under a genetic algorithm, to more quickly and easily solve problems using neural networks, and to understand the "algorithm" that a specific neural network is using to solve a problem.

I. What is a Neural Network?

Figure 1.1: A simple neural network
All neural networks consist of a set of nodes, each of which takes a real-valued input and produces a real-valued output (also called an activation), and a set of weighted edges that connect them. The way in which the nodes are connected is a function of the network architecture; feedforward networks are the most common. Figure 1.1 (generated using NVIS, the visualization tool presented in this report) illustrates a feedforward neural network with 4 "input nodes", 1 "output node", and two hidden layers of 2 nodes each.
The purpose of a neural network is to take a set of inputs, which we shall call problem inputs to distinguish them from the real-valued inputs to a single node, and produce a set of desired problem outputs, where each problem input and output is again a real value. Thus, if the problem inputs consisted of four real values, we would have four nodes in the input layer, defined as the set of nodes whose individual inputs are assigned to the problem inputs. These nodal inputs are processed and passed down through the hidden layers, and finally to the output layer, whose nodes' outputs give the final result. An individual node converts its input to output by means of a layer-specific activation function (also called a threshold function). The way in which the inputs to the nodes of one layer are determined from the activations of the nodes in the previous layer is described below.
In the figure, the activations are represented by the diameters of the white circles inside each node, while edge weights are proportional to the lengths of the colored line segments, with cyan and red indicating positive and negative sign, respectively. When a neural network is "run", each node's output is multiplied by the weight of every edge attached to it before being sent over that edge into the next layer. Thus, for a node with activation 1.5 and edges of weights 2.0 and -2.0 emanating from it, the signals sent across these edges will be 3.0 and -3.0, respectively.
In a feedforward network, the signals always flow from the input layer toward the output layer, i.e. from top to bottom in the diagram. Thus, while the inputs to the nodes in the input layer are the problem inputs themselves, the inputs into the first hidden layer will be a linear combination of the input layer's outputs. The values for each layer, therefore, can be determined from the previous layer until the final (output) layer is reached.
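To make the forward pass concrete, here is a minimal sketch in Python of the computation just described, using the 4-2-2-1 topology of Figure 1.1. The random weights and the use of a sigmoid activation on every layer are illustrative assumptions, not the trained values or the exact layer functions of the network in the figure.

    import math
    import random

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Layer sizes for the 4-2-2-1 network of Figure 1.1.
    layer_sizes = [4, 2, 2, 1]

    # weights[k][i][j] is the weight of the edge from node i of layer k
    # to node j of layer k+1; random values stand in for trained ones.
    weights = [[[random.uniform(-1.0, 1.0) for _ in range(n_out)]
                for _ in range(n_in)]
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

    def forward(problem_inputs):
        # The input layer's activations are the problem inputs themselves.
        activations = list(problem_inputs)
        for layer in weights:
            # Each node of the next layer receives a linear combination of
            # the previous layer's activations, then applies its activation
            # function (sigmoid on every layer here, for simplicity).
            activations = [sigmoid(sum(a * layer[i][j]
                                       for i, a in enumerate(activations)))
                           for j in range(len(layer[0]))]
        return activations

    print(forward([1.0, 0.0, 0.5, -0.5]))  # a single output value
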
In principle, a feedforward network can compute any computable function (Sarle99). With suitable choices of network topology, activation functions, and edge weights, networks can be created to find the cosine of an angle, to square real numbers, to determine the parity of an integer, and to solve a huge variety of input-output mapping problems. Common choices for activation functions include linear, piecewise-linear, hyperbolic tangent, and sigmoid functions. The network depicted has been trained to produce an output of 1 if its first two inputs are 1 and 0, and an output of 0 otherwise. The activation functions for the four layers of this network are linear, sigmoid, sigmoid, and linear, respectively.
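For reference, the four common activation functions named above might be written as follows; the clipping bounds in the piecewise-linear version are an arbitrary illustrative choice.

    import math

    def linear(x):
        return x

    def piecewise_linear(x, lo=-1.0, hi=1.0):
        # Linear in the middle, clipped at the chosen bounds.
        return max(lo, min(hi, x))

    def hyperbolic_tangent(x):
        # Squashes any real input into the interval (-1, 1).
        return math.tanh(x)

    def sigmoid(x):
        # Squashes any real input into the interval (0, 1).
        return 1.0 / (1.0 + math.exp(-x))
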

For more information on neural networks, see the Neural Network FAQ (Sarle99).

II. Training Neural Networks to Solve Problems

Neural networks, however accurately they may model the human brain, and however powerful they become, are of little use without the aid of automated training methods. If the algorithm to solve a problem is known, it is certainly easier to write a C or C++ program to implement it than to fiddle with network weights and topology until an acceptable solution is found. If no such algorithm is known, however, it is nearly impossible to arrive at a correct C program through any heuristic or probabilistic technique. In the problem of handwriting recognition, for example, though the inputs and desired outputs are clearly known, it is difficult to define precisely the rules that differentiate one character from another. It is this type of problem that neural networks are designed to solve.
Training neural networks usually takes the form of finding the best edge weights for a given network topology, though it is conceivable that a training scheme could modify the topology as well. The most common training algorithm, called error back-propagation, attempts to improve the edge weights (i.e. to modify the weights in a way that reduces the difference between actual and expected output over a given set of problem samples) using a gradient-based search in weight space.
By "weight space" I mean the n-dimensional space defined by the real-valued edge weights of a network with n edges. For simplicity, we will consider a network with only 2 edges. Now if we label these edges e1 and e2, then for any two set values of e1 and e2 there is an associated error, E, usually calculated as the root-mean-squared of the differences between expected and actual output for each problem sample in the "training set". What is desired is to minimize E by choosing appropriate values of e1 and e2. Ideally, we would be able to reduce E to zero. One approach might be to solve the problem analytically, that is to determine the equation for E in terms of the edge weights and minimize using partial derivatives. Though this may work for suitably small numbers of edges (and simple activation functions), the approach does not generalize to practical networks. The equations soon become far too complex to solve analytically. What is needed is a heuristic approach, whereby an initial guess for e1 and e2 is made, and then iteratively refined to reduce the value of E. The back-propagation algorithm is one such approach.
If we imagine e1 and e2 as assigned to the x and y axes of a Cartesian plane, and E plotted on the z axis as a function of the two, we can envision a two-dimensional "error surface", with hills, valleys, and plateaus. What is desired is to find the lowest point on this surface. Starting from a random initial guess (xi, yi), we take partial derivatives of the error surface to find the direction of steepest descent, and move in this direction a distance proportional both to the magnitude of the gradient vector and to a parameter called the learning rate. Then, from this new location, we take the partial derivatives again, and the process continues until we reach the bottom of a valley in the error surface. This algorithm, referred to as "hill climbing" when the sign of E is reversed, is guaranteed to find a local minimum of the error E, but will not necessarily find a global minimum. Once it reaches the bottom of a valley, the algorithm is stuck; it makes no attempt to see whether there might be deeper valleys elsewhere on the error surface. This is one of the primary drawbacks of the error back-propagation algorithm.
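The following sketch illustrates the descent procedure just described, estimating the partial derivatives numerically rather than by back-propagation proper; the error function, step count, and learning rate are illustrative stand-ins.

    def gradient_descent(error, weights, learning_rate=0.1, steps=1000, h=1e-6):
        # error maps a list of weights (a point in weight space) to E.
        weights = list(weights)
        for _ in range(steps):
            base = error(weights)
            # Estimate each partial derivative with a finite difference.
            gradient = []
            for i in range(len(weights)):
                bumped = list(weights)
                bumped[i] += h
                gradient.append((error(bumped) - base) / h)
            # Step against the gradient: the direction of steepest descent,
            # a distance proportional to its magnitude and the learning rate.
            weights = [w - learning_rate * g
                       for w, g in zip(weights, gradient)]
        return weights

    # Example: a bowl-shaped error surface in (e1, e2) with its minimum
    # at (1, -2); the descent settles into that valley.
    print(gradient_descent(lambda w: (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2,
                           [0.0, 0.0]))
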
The error back-propagation algorithm gets its name from the way in which it finds the gradient on the error surface. The error is first computed for the output nodes, and then "propagated" back through the previous layers. The details of this algorithm will not be presented here; the interested reader is referred to Chapter 22 of the textbook Artificial Intelligence (Winston92) and The Backprop Algorithm (Tveter95) on the web.
Another common training method is the genetic algorithm, a variant of which will be discussed in the next section.

III. The Basics of Evolution Strategies

In the standard genetic algorithm, an initial population pool is created of networks with the same architecture and randomly initialized weights; that is, a set of random points is selected in weight space. The algorithm is run for a certain number of "generations", and for each generation of n networks, the best m are selected and the remainder killed off. The "best" m networks are defined as those with the lowest fitness scores, where a fitness score is calculated as the RMS error between expected and actual output over the problem samples in the training set. These m networks then "breed" to produce n-m children, which, together with their parents, form the next generation, and the process continues.
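In outline, one generation of this scheme might look like the following sketch, where the fitness and breed functions are placeholders for the operations described in this section:

    import random

    def next_generation(population, fitness, breed, m):
        # Keep the m networks with the lowest fitness scores (RMS error)...
        survivors = sorted(population, key=fitness)[:m]
        # ...and let them breed (two distinct parents each time, assuming
        # m >= 2) to restore the population to its former size n.
        children = [breed(*random.sample(survivors, 2))
                    for _ in range(len(population) - m)]
        return survivors + children
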
In the notation used in the evolution strategies algorithm, this scheme would be known as a (P+C) ES, where P is the number of parents (in this case m) and C = n-m is the number of children. The "+" symbol indicates that the parents of one generation survive into the next, so that the survivors of each generation are selected from a pool of size P+C. There is another type of ES, written (P,C), in which only the children survive into the next generation, so that the new parents are always selected from a pool of size C; this type of ES requires P <= C.
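The difference between the two selection schemes is small in code; a sketch, assuming fitness scores are to be minimized:

    def select_plus(parents, children, fitness, P):
        # (P+C): the next generation's parents are the best P drawn from
        # parents and children together.
        return sorted(parents + children, key=fitness)[:P]

    def select_comma(parents, children, fitness, P):
        # (P,C): the old parents always die; the best P are drawn from
        # the children alone, which requires P <= C.
        return sorted(children, key=fitness)[:P]
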
When breeding two networks to produce a child, a common approach is to simply average the corresponding weights, then modify them with a Gaussian mutation. In the usual genetic algorithm, the standard deviation of this mutation is set to some predefined constant value. In evolution strategies, however, each weight has an associated mutator value which, when the parents breed, is used to determine the degree of mutation in the child. These mutators, like the edge weights themselves, are passed down from parent to child, so that certain edge weights may become especially volatile over time, while others settle down to a pattern of little mutation.
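A sketch of this breeding step, representing each network as a flat list of weights paired with a parallel list of mutator values; the log-normal rule used here to perturb the mutators themselves is one common choice from the ES literature, shown for illustration rather than as the specific method used by NVIS.

    import math
    import random

    def breed(parent_a, parent_b, tau=0.1):
        # Each parent is a (weights, mutators) pair of parallel lists.
        (wa, ma), (wb, mb) = parent_a, parent_b
        child_w, child_m = [], []
        for w1, w2, s1, s2 in zip(wa, wb, ma, mb):
            # Average the corresponding weights and mutators of the parents.
            sigma = (s1 + s2) / 2.0
            # Let the mutator itself mutate (a log-normal perturbation), so
            # some weights can grow volatile over time while others settle.
            sigma *= math.exp(tau * random.gauss(0.0, 1.0))
            # Gaussian mutation whose spread is this weight's own mutator.
            child_w.append((w1 + w2) / 2.0 + random.gauss(0.0, sigma))
            child_m.append(sigma)
        return child_w, child_m
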
An alternative breeding method is to select each weight at random from one of the parents, instead of averaging them. A randomly weighted average can also be used, as can a randomly weighted probability of choosing from one parent or the other. It is even possible to breed using more than two parents, or just one. It is left to the reader to consider the ethical implications of such methods.
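These variants amount to swapping out the averaging step; for example, discrete recombination (taking each weight wholesale from one parent) and a randomly weighted average might look like:

    import random

    def recombine_discrete(w1, w2):
        # Take the weight wholesale from one parent or the other.
        return w1 if random.random() < 0.5 else w2

    def recombine_blend(w1, w2):
        # A randomly weighted average of the two parents' weights.
        u = random.random()
        return u * w1 + (1.0 - u) * w2
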

For a more detailed treatment of evolution strategies and genetic algorithms, the reader is referred to Evolutionary Algorithms in Theory and Practice by Thomas Bäck (Back96).