Shallow Neural Networks

A neural network with a single hidden layer and its implementation in Python.

Neural Network Overview

The model:

Neural Network with 1 hidden layer

Mathematically:

For one example $x^{(i)}$:

$z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$, $\quad a^{[1](i)} = \tanh(z^{[1](i)})$

$z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$, $\quad \hat{y}^{(i)} = a^{[2](i)} = \sigma(z^{[2](i)})$

Given the predictions on all the examples, you can also compute the cost $J$ as follows:

$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log a^{[2](i)} + (1 - y^{(i)}) \log(1 - a^{[2](i)}) \right)$

Reminder: The general methodology to build a Neural Network is to:

  1. Define the neural network structure ( # of input units, # of hidden units, etc).

  2. Initialize the model's parameters

  3. Loop: implement forward propagation, compute the loss, implement backward propagation to get the gradients, and update the parameters (gradient descent)

Activation Functions

In simple terms, an artificial neuron calculates the weighted sum of its inputs and adds a bias; this quantity is called the net input, as shown in the figure below.

Types of Activation Functions: several different types of activation functions are used in deep learning. Some of them are explained below:

  • Step Function: the step function is one of the simplest kinds of activation functions. We choose a threshold value, and if the net input, say y, is greater than the threshold, the neuron is activated.

  • Sigmoid Function: the sigmoid function is smooth and continuously differentiable. It has an S shape and its outputs range from 0 to 1.

  • ReLU: the Rectified Linear Unit is the most widely used activation function. Its main advantage over other activation functions is that it does not activate all the neurons at the same time: if the input is negative, ReLU converts it to zero and the neuron is not activated.

  • Leaky ReLU: the Leaky ReLU function is an improved version of ReLU. Instead of defining the function as 0 for x less than 0, we define it as a small linear component of x. The slope coefficient is fixed before training, i.e. it is not learned. This type of activation function is popular in tasks that may suffer from sparse gradients, for example training generative adversarial networks.
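The four activations above can be sketched in NumPy. The step threshold of 0 and the Leaky ReLU slope of 0.01 are illustrative defaults, not values mandated by the text.

```python
import numpy as np

def step(z, threshold=0.0):
    """Fires 1 when the net input exceeds the threshold, else 0."""
    return np.where(z > threshold, 1.0, 0.0)

def sigmoid(z):
    """Smooth S-shaped curve with outputs in (0, 1)."""
    return 1 / (1 + np.exp(-z))

def relu(z):
    """Zeroes out negative inputs, passes positive inputs through."""
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but keeps a small fixed slope for z < 0."""
    return np.where(z >= 0, z, slope * z)
```

All four accept scalars or NumPy arrays, so they can be applied elementwise to a whole layer's pre-activations at once.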

Gradient Descent for Neural Networks

Backward Propagation

Figure 1: Backpropagation, summarized by six gradient equations.

When performing backpropagation in neural networks, especially in the context of the final output layer, you may encounter expressions that involve the difference between the network’s output and the target output, $(a - y)$. This can be seen in some simplified derivations of the gradient for backpropagation, particularly when the cost function is related to the final activation function.

Detailed Breakdown:

  1. Activation Value Minus Desired Output (a - y):

    • Consider a network where the output is a single value (or a vector of values in the case of multiple outputs).

    • Let’s denote the final-layer activation output as $a$, and the desired output (target value) as $y$.

    In backpropagation, the term $(a - y)$ often appears. Here’s why:

  2. Simplifying Gradient Computation:

    • When using a cost function like Mean Squared Error (MSE), defined as $\frac{1}{2}(a - y)^2$, the derivative with respect to the activation $a$ is $(a - y)$.

    • For cross-entropy loss with a sigmoid activation, the derivative with respect to the pre-activation $z$ also simplifies to $(a - y)$.

    This means that the gradient of the cost function at the output layer is proportional to the difference between the predicted output (activation) and the actual target $y$.

  3. Chain Rule in Backpropagation:

    • Backpropagation involves using the chain rule to compute the gradient of the cost function with respect to the weights.

    • The term $(a - y)$ represents the gradient of the cost function with respect to the output layer.

    • This gradient is then used to propagate errors backward through the network, adjusting the weights to minimize the cost function.
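As a sanity check, the claim that this gradient reduces to $(a - y)$ for cross-entropy with a sigmoid output can be verified numerically with a finite-difference sketch; the particular values of z and y below are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy(z, y):
    """Cross-entropy loss of a sigmoid output a = sigmoid(z) against target y."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6
# Central-difference approximation of dC/dz
numeric = (cross_entropy(z + eps, y) - cross_entropy(z - eps, y)) / (2 * eps)
# The (a - y) term from the text
analytic = sigmoid(z) - y
assert abs(numeric - analytic) < 1e-6
```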

Update Parameters

General gradient descent rule: $\theta = \theta - \alpha \frac{\partial J}{\partial \theta}$, where $\alpha$ is the learning rate and $\theta$ represents a parameter.
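A minimal sketch of this update rule on the one-parameter cost J(theta) = theta^2 also illustrates the converging and diverging behavior; the learning rates 0.1 and 1.1 are illustrative choices.

```python
def gradient_descent(theta, grad, alpha, steps=100):
    """Repeatedly apply theta <- theta - alpha * dJ/dtheta."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

# J(theta) = theta**2, so dJ/dtheta = 2 * theta; the minimum is at theta = 0
grad = lambda theta: 2 * theta

good = gradient_descent(5.0, grad, alpha=0.1)  # good learning rate: converges toward 0
bad = gradient_descent(5.0, grad, alpha=1.1)   # learning rate too large: diverges
```

With alpha = 0.1 each step multiplies theta by 0.8, shrinking it toward the minimum; with alpha = 1.1 each step multiplies it by -1.2, so the iterate overshoots and grows without bound.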

The gradient descent algorithm with a good learning rate (converging)
The gradient descent algorithm with a bad learning rate (diverging). Images courtesy of Adam Harley.

Random Initialization

Random initialization of weights is an essential step in training neural networks. It is the process of assigning initial weights to the neurons of the network before training. The initial weights are usually chosen randomly from a specified distribution, and their values can significantly impact the network's performance.

The goal of random initialization is to ensure that each neuron in the network starts with different initial weights. If all the neurons start with the same weights, they will compute the same outputs and receive the same gradient updates, so they will learn the same features and the network will not be able to learn complex patterns. Moreover, random initialization helps prevent the network from getting stuck in a poor local minimum during training.

Ways to initialize the weights in a neural network:
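Common schemes include small random values, Xavier/Glorot initialization, and He initialization. A sketch follows; the layer sizes are illustrative, and the scaling constants are the usual conventions rather than values taken from this article.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3  # illustrative layer sizes (fan-in, fan-out)

# Small random values: breaks symmetry; the 0.01 factor keeps pre-activations
# away from the saturated tails of sigmoid/tanh
W_small = rng.standard_normal((n_out, n_in)) * 0.01

# Xavier/Glorot: variance scaled by 1/fan-in, well suited to tanh layers
W_xavier = rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

# He: variance scaled by 2/fan-in, well suited to ReLU layers
W_he = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

# Biases can safely start at zero; symmetry is already broken by the weights
b = np.zeros((n_out, 1))
```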


---------------------------------------------------------------------------------------------------------------

Lecture Note on June 18th 2024:

Neural Network Representation

  • Input layer

  • Hidden layer

  • Output layer

Activation Function

  • Sigmoid Function (used in output layer)

  • Tanh Function (generally better than sigmoid for hidden layers, since its outputs are zero-centered; most commonly used in the hidden layer)

  • RELU Function

  • leaky RELU

binary classification (output layer) -> sigmoid

otherwise -> ReLU

Why do we need Non-Linear Activation Functions?

-> without a non-linear activation in the hidden layer, the composition of layers collapses into a single linear function of the input, so the hidden layer adds no expressive power

Derivatives of Activation Functions

  • sigmoid -> g'(z) = g(z)(1 - g(z))

  • tanh -> g'(z) = 1 - tanh(z)^2

  • ReLU: g(z) = max(0, z) -> g'(z) = 0 if z < 0; 1 if z >= 0

  • Leaky ReLU: g(z) = max(0.01z, z) -> g'(z) = 0.01 if z < 0; 1 if z >= 0
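These derivative formulas can be checked numerically with finite differences. A quick sketch; the sample points are arbitrary and deliberately avoid the kink at z = 0 for ReLU.

```python
import numpy as np

def num_grad(f, z, eps=1e-6):
    """Central-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

sigmoid = lambda z: 1 / (1 + np.exp(-z))
relu = lambda z: np.maximum(0.0, z)
leaky_relu = lambda z: np.where(z >= 0, z, 0.01 * z)

for z in (-2.0, 0.5, 3.0):
    g = sigmoid(z)
    assert abs(num_grad(sigmoid, z) - g * (1 - g)) < 1e-6            # g'(z) = g(z)(1 - g(z))
    assert abs(num_grad(np.tanh, z) - (1 - np.tanh(z) ** 2)) < 1e-6  # 1 - tanh(z)^2

# Piecewise slopes, evaluated away from z = 0
assert abs(num_grad(relu, -1.0) - 0.0) < 1e-9
assert abs(num_grad(relu, 1.0) - 1.0) < 1e-9
assert abs(num_grad(leaky_relu, -1.0) - 0.01) < 1e-9
```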

Gradient Descent for Neural Networks

parameters, cost functions ...

Forward propagation

Backward propagation

Backpropagation calculation

Random Initialization

-> multiply the random values by a small constant (e.g. 0.01), so activations start in the non-saturated region of tanh/sigmoid
