Machine Learning - Artificial Neural Network



An Artificial Neural Network (ANN), inspired by the neural structure of the human brain, provides a general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples. ANN algorithms such as backpropagation use gradient descent to tune network parameters to best fit a training set of input-output pairs.

1.How it works


Figure 1: Artificial Neural Network (picture from Machine Learning by Zhou Zhihua)

The basic component of an ANN is a node unit called a neuron. Neurons connect to the neurons of the previous layer through links with different weights. Each neuron receives signals over these links, evaluates their weighted sum with an activation function, and produces an output. In the ideal situation the activation function looks like Figure 2: it takes the value from the previous layer and returns 0 or 1:

Figure 2: Output in the ideal situation
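The ideal step activation in Figure 2 can be sketched as:

```python
import numpy as np

def step(x):
    # Ideal activation: outputs 1 once the input reaches the threshold (0), else 0.
    return np.where(x >= 0, 1, 0)

print(step(np.array([-1.5, -0.1, 0.0, 2.3])))  # → [0 0 1 1]
```

In practice this function is not differentiable at 0, which is why smooth sigmoid-shaped functions are preferred for gradient-based training.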

Besides the logistic function, sigmoid functions include the ordinary arctangent, the hyperbolic tangent, the Gudermannian function, and the error function, as well as the generalised logistic function and certain algebraic functions (Figure 3):

Figure 3: Some sigmoid functions. In the drawing all functions are normalized so that their slope at the origin is 1. (picture from Wikipedia)

2.Multilayer perceptron

An ANN consists of at least an input layer, an output layer, and one or more hidden layers; together these form a multilayer perceptron (MLP), a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs.

Figure 4: Multilayer perceptron model

A single perceptron can represent many boolean functions (such as AND/OR/NOT). For example:

AND (X1 ∧ X2): suppose w1 = w2 = 1, θ = 2, X0 (bias) = 0, and y = f(1 · X1 + 1 · X2 - 2); then y = 1 only when X1 = X2 = 1
OR (X1 ∨ X2): suppose w1 = w2 = 1, θ = 0.5, X0 (bias) = 0, and y = f(1 · X1 + 1 · X2 - 0.5); then y = 1 when X1 = 1 or X2 = 1
NOT (¬X1): suppose w1 = -1, w2 = 0, θ = -0.5, X0 (bias) = 0, and y = f(-1 · X1 + 0 · X2 + 0.5); then y = 0 when X1 = 1, and y = 1 when X1 = 0
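The three gates above can be checked with a tiny threshold-unit sketch (assuming, as in Figure 2, that the unit fires when the weighted sum reaches the threshold):

```python
import numpy as np

def perceptron(x, w, theta):
    # Fires (returns 1) when the weighted sum w·x reaches the threshold theta.
    return 1 if np.dot(w, x) - theta >= 0 else 0

AND = lambda x1, x2: perceptron([x1, x2], [1, 1], 2)
OR = lambda x1, x2: perceptron([x1, x2], [1, 1], 0.5)
NOT = lambda x1: perceptron([x1], [-1], -0.5)

print([AND(*p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # → [0, 0, 0, 1]
print([OR(*p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # → [0, 1, 1, 1]
print([NOT(0), NOT(1)])                                     # → [1, 0]
```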

Unfortunately, however, some boolean functions cannot be represented by a single perceptron, such as the XOR function, whose value is 1 if and only if X1 ≠ X2. Note that the set of linearly nonseparable training examples shown in Figure 5 corresponds to this XOR function:

Figure 5: AND/OR/XOR/NOT boolean functions (picture from Machine Learning by Zhou Zhihua)

The ability of perceptrons to represent AND, OR, NAND, and NOR is important because every boolean function can be represented by some network of interconnected units based on these primitives. In fact, every boolean function can be represented by some network of perceptrons only two levels deep, in which the inputs are fed to multiple units, and the outputs of these units are then input to a second, final stage. One way is to represent the boolean function in disjunctive normal form (i.e., as the disjunction (OR) of a set of conjunctions (ANDs) of the inputs and their negations). Note that the input to an AND perceptron can be negated simply by changing the sign of the corresponding input weight.
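A two-level network of such units does represent XOR, following the disjunctive-normal-form construction above (weights and thresholds here are chosen by hand for illustration):

```python
import numpy as np

def perceptron(x, w, theta):
    # Fires (returns 1) when the weighted sum w·x reaches the threshold theta.
    return 1 if np.dot(w, x) - theta >= 0 else 0

def xor(x1, x2):
    # Hidden layer: two AND-style units detect (x1 AND NOT x2) and (NOT x1 AND x2).
    h1 = perceptron([x1, x2], [1, -1], 1)
    h2 = perceptron([x1, x2], [-1, 1], 1)
    # Output layer: OR of the two hidden units.
    return perceptron([h1, h2], [1, 1], 0.5)

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # → [0, 1, 1, 0]
```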

3.Gradient descent

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. (from Wikipedia)
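A minimal one-dimensional sketch of this rule (the objective function and step size are made-up for illustration):

```python
def gradient_descent(df, x0, eta=0.1, steps=100):
    # Step against the gradient, proportional to its magnitude.
    x = x0
    for _ in range(steps):
        x -= eta * df(x)
    return x

# Made-up objective: f(x) = (x - 3)^2, with gradient 2(x - 3); minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # → 3.0
```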

Figure 6: Global minimum and local minimum (picture from Machine Learning by Zhou Zhihua)

To sum up, we can use gradient descent to approach the minimum of a function. Gradient descent may land in either a local minimum or the global minimum, so how can we check whether the value we computed is the global minimum rather than merely a local one? Some common workarounds:

  • Add noise. This reduces the precision of the parameters you’ve found, which can “blur” out local minima. The search can then jump out of local minima that are small compared to the noise, while still being trapped in deeper minima. A well-known approach for adding noise is simulated annealing.
  • Add momentum. Along with using the current gradient to define the step, also continue in the same direction as the previous step. If you take a fraction of the previous step as the momentum term, there is a tendency to keep going, which can take the search past the local minimum. By using a fraction, the steps decay exponentially, so poor steps aren’t a big problem. This was always a popular modification to gradient descent when used to train neural networks, where gradient descent is known as backpropagation.
  • Use a hybrid search. First use a global search (e.g., genetic algorithms, various Monte Carlo methods) to find some good starting points, then apply gradient descent to take advantage of the gradient information in the function.
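The momentum idea from the list above can be sketched as follows (the objective and hyperparameters are illustrative):

```python
def gd_momentum(df, x0, eta=0.1, beta=0.9, steps=300):
    # Keep a fraction beta of the previous step; this "momentum" can carry
    # the search past small local minima instead of stopping in them.
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - eta * df(x)
        x += v
    return x

# Same made-up objective as before: f(x) = (x - 3)^2, gradient 2(x - 3).
x_min = gd_momentum(lambda x: 2 * (x - 3), x0=0.0)
```

Because only a fraction of the previous step is kept, the contribution of old steps decays exponentially, so a few poor steps do little harm.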


4.BackPropagation

The backward propagation of errors, or backpropagation, is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent.

Backpropagation follows two steps:

  • step 1: Propagation

    • Forward propagation of a training pattern’s input through the neural network in order to generate the propagation’s output activations.
    • Backward propagation of the propagation’s output activations through the neural network using the training pattern target in order to generate the deltas of all output and hidden neurons.
  • step 2: Weight update

    • Multiply its output delta and input activation to get the gradient of the weight.
    • Subtract a ratio (percentage) of the gradient from the weight.

Figure 7: BackPropagation algorithm

Step one computes each node's output from the given weights, bias, and squashing function, producing the output node values. Step two compares the predicted value with the target value to obtain an error, and then uses the gradient descent algorithm to propagate that error backward and update the weights.
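A single forward/backward step for one sigmoid neuron makes the two steps concrete; all the numbers here are made-up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# One neuron with two inputs -- illustrative values only.
x = np.array([1.0, 0.5])
w = np.array([0.4, -0.2])
target, eta = 1.0, 0.5

# Step 1: forward pass through the squashing function.
out = sigmoid(np.dot(w, x))
# Step 2: backward pass -- delta = (target - out) * f'(net), with f' = f * (1 - f).
delta = (target - out) * out * (1 - out)
# Weight update: add learning rate * delta * input activation.
w += eta * delta * x
```

Since the prediction was below the target, both weights move so that the next forward pass produces a larger output.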

Formula 1: Neuron node sum: S = Σi wi · xi

Let's return to the case of a single neuron N with weights W = (w0, …, wn) and an input X = (x1, x2, …, xn), as shown in Formula 1. For the moment, let us add an activation function from Figure 3.

Supposing we denote the predicted output value by Yj and take f(x) to be the sigmoid function, the output is:

Formula 2: Neuron node predicted value: Yj = f(Σi wi · xi - θ)

where θ is the bias of the node and the sum inside f is the result of Formula 1, written out for simplicity. After a full forward pass we obtain all the output values, so it is easy to compute the squared prediction error against the target values Tj:

Formula 3: Neuron prediction squared error: E = ½ Σj (Tj - Yj)²

For convenience we add a factor of 1/2 to the raw error and drop the subscript N from f(N). Since minimizing E is the same as minimizing 0.5 · E, this changes nothing about the minima of the function.

Note that E is never negative, and so it has a global minimum value at or near 0 (if the neuron can represent the target function perfectly, it will be exactly zero). Since gradient descent moves against the gradient, our update rule should be:

Formula 4: Weight update rule: wi ← wi - η · ∂E/∂wi

where η is some fixed parameter between 0 and 1 that represents the "learning rate", and ∂E/∂wi is the gradient of the squared prediction error with respect to the weight. We will not say much more about η except that, as long as it is sufficiently small and we allow ourselves enough time to learn, we are guaranteed a good approximation of some local minimum (though it might not be a global one).

Furthermore, since the backpropagation algorithm is based on gradient descent, we need to differentiate E with respect to each weight:

Formula 5: Gradient of E: ∂E/∂wi

Because the network is a composition of differentiable functions, we can split Formula 5 using the chain rule:

Formula 6: Chain-rule expansion of Formula 5: ∂E/∂wi = (∂E/∂Yj) · (∂Yj/∂Sj) · (∂Sj/∂wi), where Sj = Σi wi · xi - θ

Here the sigmoid f(x) = 1 / (1 + e^(-x)) has a useful property:

Formula 7: Sigmoid function's useful property: f′(x) = f(x) · (1 - f(x))
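This property is easy to verify numerically against a central-difference approximation of the derivative:

```python
import numpy as np

def f(x):
    return 1 / (1 + np.exp(-x))

x, h = 0.7, 1e-6
# Central-difference approximation of f'(x).
numeric = (f(x + h) - f(x - h)) / (2 * h)
# The claimed closed form: f'(x) = f(x) * (1 - f(x)).
analytic = f(x) * (1 - f(x))
print(abs(numeric - analytic) < 1e-8)  # → True
```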

Finally, combining Formulas 5, 6, and 7, we arrive at a new formula:

Formula 8: Combining Formulas 5/6/7: ∂E/∂wi = -(Tj - Yj) · Yj · (1 - Yj) · xi

And now we can write the new weight formula:

Formula 9: New weight formula: wi ← wi + η · (Tj - Yj) · Yj · (1 - Yj) · xi

5.Cross Validation

Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It can be easily described by Figure 8:

Figure 8: Cross Validation

As Figure 8 shows, we choose different groups of data in turn and finally calculate the average of their results; more information is available on Wikipedia.
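A minimal k-fold split can be sketched as follows; `train_and_score` is a hypothetical callback that trains a model on one split and returns its validation score:

```python
import numpy as np

def k_fold_scores(X, y, k, train_and_score):
    # Split the indices into k groups; each group serves once as the
    # validation set while the remaining groups train the model.
    idx = np.arange(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.hstack([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train], y[train], X[val], y[val]))
    # Average the k validation scores, as the figure suggests.
    return np.mean(scores)
```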


GitHub code:

import numpy as np

class NeuralNetwork(object):
    """
    :param layers: A list containing the number of units in each layer.
                   Should contain at least two values.
    :param activation: The activation function to be used. Can be
                       "logistic" or "tanh".
    """
    def __init__(self, layers, activation='tanh'):
        if activation == 'logistic':
            self.activation = self.logistic
            self.activation_deriv = self.logistic_derivative
        elif activation == 'tanh':
            self.activation = self.tanh
            self.activation_deriv = self.tanh_deriv
        # generate weight matrices with random floats in [-0.25, 0.25)
        self.weights = []
        for i in range(1, len(layers) - 1):
            self.weights.append((2 * np.random.random((layers[i - 1] + 1, layers[i] + 1)) - 1) * 0.25)
            self.weights.append((2 * np.random.random((layers[i] + 1, layers[i + 1])) - 1) * 0.25)

    def tanh(self, x):
        return np.tanh(x)

    def tanh_deriv(self, a):
        # a is already tanh(x), so the derivative is 1 - a^2
        return 1.0 - a * a

    def logistic(self, x):
        return 1 / (1 + np.exp(-x))

    def logistic_derivative(self, a):
        # a is already logistic(x), so the derivative is a * (1 - a)
        return a * (1 - a)

    def fit(self, X, y, learning_rate=0.2, epochs=10000):
        """
        :param X: numpy.array, training matrix
        :param y: numpy.array, target labels
        :param learning_rate: float
        :param epochs: int, number of backpropagation iterations
        """
        X = np.atleast_2d(X)
        # append a bias column of ones to the input
        temp = np.ones([X.shape[0], X.shape[1] + 1])
        temp[:, 0:-1] = X
        X = temp
        y = np.array(y)

        for k in range(epochs):
            # select a random row from X for training
            i = np.random.randint(X.shape[0])
            a = [X[i]]

            # forward pass: compute each layer's output with the activation function
            for l in range(len(self.weights)):
                a.append(self.activation(np.dot(a[l], self.weights[l])))
            # error at the output layer
            error = y[i] - a[-1]
            deltas = [error * self.activation_deriv(a[-1])]  # output-layer delta

            # backward pass: propagate deltas from the top layer toward the input layer
            for l in range(len(a) - 2, 0, -1):
                deltas.append(deltas[-1].dot(self.weights[l].T) * self.activation_deriv(a[l]))
            deltas.reverse()
            for j in range(len(self.weights)):
                layer = np.atleast_2d(a[j])
                delta = np.atleast_2d(deltas[j])
                # gradient of the weight = input activation (transposed) times output delta
                self.weights[j] += learning_rate * layer.T.dot(delta)

    def predict(self, x):
        x = np.array(x)
        # append the bias unit to the input, then feed forward
        temp = np.ones(x.shape[0] + 1)
        temp[0:-1] = x
        a = temp
        for l in range(0, len(self.weights)):
            a = self.activation(np.dot(a, self.weights[l]))
        return a
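A self-contained usage sketch in the same spirit: a tiny 2-3-1 network trained on XOR with full-batch gradient descent. The layer sizes, learning rate, epoch count, and random seed are illustrative choices, not part of the original code:

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

# Tiny 2-3-1 network: tanh hidden layer, sigmoid output.
W1 = rng.randn(2, 3)
b1 = np.zeros(3)
W2 = rng.randn(3)
b2 = 0.0

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for _ in range(20000):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: output delta uses f' = f * (1 - f), hidden delta uses 1 - tanh^2
    d_out = (out - y) * out * (1 - out)
    d_h = np.outer(d_out, W2) * (1 - h ** 2)
    # weight update with learning rate 0.5
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum()
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0)

print(np.round(out, 2))
```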



This article is from 夏日小草 (grasses); please credit the source when reposting:

-by grasses

2017-04-30 16:52:34
