Logistic Regression in Python & NumPy

A logistic regression model implemented in Python with NumPy.

Packages

  • numpy is the fundamental package for scientific computing with Python.

  • h5py is a common package for interacting with a dataset stored in an H5 file.

  • matplotlib is a widely used library for plotting graphs in Python.

  • PIL and scipy are used here to test your model with your own picture at the end.

import numpy as np
import copy
import matplotlib.pyplot as plt
import h5py
import scipy
from PIL import Image
from scipy import ndimage
from lr_utils import load_dataset
from public_tests import *

Problem Set

# Loading the data (cat/non-cat)
train_set_x_orig, train_set_y, test_set_x_orig, test_set_y, classes = load_dataset()
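The functions later in this write-up expect X with shape (num_px * num_px * 3, number of examples), so the loaded images need to be flattened first. The following is a minimal sketch of that preprocessing, assuming the image arrays come back with shape (m, num_px, num_px, 3) and 8-bit pixel values; both are assumptions inferred from the docstrings below, and flatten_and_scale is a hypothetical helper name. Synthetic data stands in for the real dataset so the block runs on its own.

```python
import numpy as np

# Synthetic stand-in for train_set_x_orig (assumed layout): 5 RGB images
# of 4x4 pixels, stored as (m, num_px, num_px, 3) with values in [0, 255].
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 4, 4, 3), dtype=np.uint8)

def flatten_and_scale(x_orig):
    """Reshape (m, h, w, c) images to (h*w*c, m) columns scaled to [0, 1]."""
    m = x_orig.shape[0]
    x_flat = x_orig.reshape(m, -1).T   # one column per example
    return x_flat / 255.0              # 8-bit pixel values are at most 255

X = flatten_and_scale(images)
print(X.shape)  # (48, 5)
```

Dividing by 255 keeps every feature in the same range, which tends to make gradient descent better behaved than working with raw pixel values.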

General Architecture of the learning algorithm

Mathematical expression of the algorithm:

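For one example $x^{(i)}$:

$z^{(i)} = w^T x^{(i)} + b$

$\hat{y}^{(i)} = a^{(i)} = \sigma(z^{(i)})$

$\mathcal{L}(a^{(i)}, y^{(i)}) = -y^{(i)}\log(a^{(i)}) - (1-y^{(i)})\log(1-a^{(i)})$

The cost is then computed by averaging the loss over all training examples:

$J = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(a^{(i)}, y^{(i)})$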
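As a rough illustration of the per-example computation and the resulting cost, the sketch below uses the same sigmoid and cross-entropy expressions that propagate() implements further down. The helper names sigmoid and example_loss, and all numeric values, are made up for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def example_loss(a, y):
    """Cross-entropy loss for a prediction a in (0, 1) and a label y in {0, 1}."""
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Toy values (made up for illustration): 2 features, 3 examples.
w = np.array([[0.5], [-0.25]])
b = 0.1
X = np.array([[1.0, 2.0, -1.0],
              [0.0, 4.0, 2.0]])
Y = np.array([[1, 0, 1]])

A = sigmoid(w.T @ X + b)      # shape (1, 3), one activation per example
losses = example_loss(A, Y)   # shape (1, 3), one loss per example
J = losses.mean()             # the cost is the average per-example loss
print(A.shape, float(J))
```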

Building algorithm

The main steps for building a Neural Network are:

  1. Define the model structure (such as number of input features)

  2. Initialize the model's parameters

  3. Loop:

    • Calculate current loss (forward propagation)

    • Calculate current gradient (backward propagation)

    • Update parameters (gradient descent)

You often build steps 1-3 separately and then integrate them into one function called model().

Initializing parameters

def initialize_with_zeros(dim):
    """
    This function creates a vector of zeros of shape (dim, 1) for w and initializes b to 0.
    
    Argument:
    dim -- size of the w vector we want (or number of parameters in this case)
    
    Returns:
    w -- initialized vector of shape (dim, 1)
    b -- initialized scalar (corresponds to the bias) of type float
    """
    w = np.zeros((dim, 1))
    
    b = 0.0
  
    return w, b
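A quick usage check, as a sketch: the function is restated in compact form so the block runs on its own, and the dimension 3 is arbitrary.

```python
import numpy as np

def initialize_with_zeros(dim):
    """Same behavior as the function above: zero weights and a zero bias."""
    return np.zeros((dim, 1)), 0.0

w, b = initialize_with_zeros(3)
print(w.shape)   # (3, 1)
print(type(b))   # <class 'float'>
```

Starting from zeros is fine for logistic regression because its cost function is convex, so gradient descent is not sensitive to the starting point in the way a multi-layer network is.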

Forward and Backward propagation

Implement a function propagate() that computes the cost function and its gradient.

Forward Propagation:

  • get X

  • compute $A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, \ldots, a^{(m-1)}, a^{(m)})$

  • calculate the cost function: $J = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log(a^{(i)}) + (1-y^{(i)})\log(1-a^{(i)})\right)$

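Here are the two formulas you will be using for the backward pass:

$\frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T$

$\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(a^{(i)}-y^{(i)})$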

import numpy as np
from public_tests import *

def propagate(w, b, X, Y):
    """
    Implement the cost function and its gradient for the propagation explained above

    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat) of size (1, number of examples)

    Return:
    grads -- dictionary containing the gradients of the weights and bias
            (dw -- gradient of the loss with respect to w, thus same shape as w)
            (db -- gradient of the loss with respect to b, thus same shape as b)
    cost -- negative log-likelihood cost for logistic regression
    
    Tips:
    - Write your code step by step for the propagation. np.log(), np.dot()
    """
    
    m = X.shape[1]
    
    # FORWARD PROPAGATION (FROM X TO COST)
    Z = np.dot(w.T, X) + b                    # linear scores, shape (1, m)
    A = 1 / (1 + np.exp(-Z))                  # sigmoid activations, shape (1, m)
    cost = (-1 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    
    # BACKWARD PROPAGATION (TO FIND GRAD)
    dZ = A - Y
    dw = (1/m)*np.dot(X, dZ.T)
    db = (1/m)*np.sum(dZ)
    
    cost = np.squeeze(np.array(cost))
    
    grads = {"dw": dw,
             "db": db}
    
    return grads, cost
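One way to gain confidence in the analytic gradients is to compare them with central finite differences of the cost. The sketch below does this on made-up toy inputs. numerical_grads is a hypothetical helper, and propagate() is restated in compact form only so the block is self-contained.

```python
import numpy as np

def propagate(w, b, X, Y):
    """Compact copy of propagate() above so this block runs on its own."""
    m = X.shape[1]
    A = 1 / (1 + np.exp(-(np.dot(w.T, X) + b)))
    cost = (-1 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y
    return {"dw": np.dot(X, dZ.T) / m, "db": np.sum(dZ) / m}, cost

def numerical_grads(w, b, X, Y, eps=1e-6):
    """Central finite differences of the cost with respect to w and b."""
    dw = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j, 0] = eps
        _, c_plus = propagate(w + step, b, X, Y)
        _, c_minus = propagate(w - step, b, X, Y)
        dw[j, 0] = (c_plus - c_minus) / (2 * eps)
    _, c_plus = propagate(w, b + eps, X, Y)
    _, c_minus = propagate(w, b - eps, X, Y)
    return dw, (c_plus - c_minus) / (2 * eps)

# Toy inputs, made up for illustration.
w = np.array([[1.0], [2.0]])
b = 1.5
X = np.array([[1.0, -2.0, -1.0], [3.0, 0.5, -3.2]])
Y = np.array([[1, 1, 0]])

grads, cost = propagate(w, b, X, Y)
num_dw, num_db = numerical_grads(w, b, X, Y)
max_diff = max(np.max(np.abs(grads["dw"] - num_dw)), abs(grads["db"] - num_db))
print("largest gradient mismatch:", max_diff)
```

A mismatch many orders of magnitude smaller than the gradients themselves suggests the backward pass matches the cost being computed.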

Optimization

  • You have initialized your parameters.

  • You are also able to compute a cost function and its gradient.

  • Now, you want to update the parameters using gradient descent.

The goal is to learn $w$ and $b$ by minimizing the cost function $J$. For a parameter $\theta$, the update rule is $\theta = \theta - \alpha \, d\theta$, where $\alpha$ is the learning rate.
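To see the update rule in isolation, here is a small sketch that applies it to a one-dimensional quadratic instead of the logistic regression cost. The function, starting point, learning rate, and iteration count are all chosen arbitrarily for illustration.

```python
# Minimize J(theta) = (theta - 3)^2, whose derivative is dJ/dtheta = 2 * (theta - 3).
theta = 0.0
alpha = 0.1   # learning rate, arbitrary for this illustration

history = [theta]
for _ in range(50):
    dtheta = 2 * (theta - 3)
    theta = theta - alpha * dtheta   # the update rule from the text
    history.append(theta)

print(theta)  # close to the minimizer, theta = 3
```

Each step shrinks the distance to the minimizer by a constant factor of 1 - 2 * alpha, which is why theta approaches 3 steadily for this choice of alpha.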

import copy
from public_tests import *

def optimize(w, b, X, Y, num_iterations=100, learning_rate=0.009, print_cost=False):
    """
    This function optimizes w and b by running a gradient descent algorithm
    
    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of shape (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat), of shape (1, number of examples)
    num_iterations -- number of iterations of the optimization loop
    learning_rate -- learning rate of the gradient descent update rule
    print_cost -- True to print the loss every 100 steps
    
    Returns:
    params -- dictionary containing the weights w and bias b
    grads -- dictionary containing the gradients of the weights and bias with respect to the cost function
    costs -- list of all the costs computed during the optimization, this will be used to plot the learning curve.
    
    Tips:
    You basically need to write down two steps and iterate through them:
        1) Calculate the cost and the gradient for the current parameters. Use propagate().
        2) Update the parameters using gradient descent rule for w and b.
    """
    
    w = copy.deepcopy(w)
    b = copy.deepcopy(b)
    
    costs = []
    
    for i in range(num_iterations):
        # Cost and gradient calculation 
        grads, cost = propagate(w, b, X, Y)
        
        # Retrieve derivatives from grads
        dw = grads["dw"]
        db = grads["db"]
        
        # update rule 
        w -= learning_rate * dw
        b -= learning_rate * db
        
        # Record the costs
        if i % 100 == 0:
            costs.append(cost)
        
            # Print the cost every 100 training iterations
            if print_cost:
                print("Cost after iteration %i: %f" % (i, cost))
    
    params = {"w": w,
              "b": b}
    
    grads = {"dw": dw,
             "db": db}
    
    return params, grads, costs

Predict

The previous function will output the learned w and b. We are able to use w and b to predict the labels for a dataset X. Implement the predict() function. There are two steps to computing predictions:

  1. Calculate $\hat{Y} = A = \sigma(w^T X + b)$

  2. Convert the entries of A into 0 (if activation <= 0.5) or 1 (if activation > 0.5), and store the predictions in a vector Y_prediction. If you wish, you can use an if/else statement in a for loop (though there is also a way to vectorize this).

from public_tests import *
def predict(w, b, X):
    '''
    Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b)
    
    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)
    
    Returns:
    Y_prediction -- a numpy array (vector) containing all predictions (0/1) for the examples in X
    '''
    
    m = X.shape[1]
    Y_prediction = np.zeros((1, m))
    w = w.reshape(X.shape[0], 1)
    
    # Compute vector "A" predicting the probabilities of a cat being present in the picture
    Z = np.dot(w.T, X)+b
    A = 1 / (1 + np.exp(-Z))
    
    for i in range(A.shape[1]):
        
        # Convert probabilities A[0,i] to actual predictions p[0,i]
        if A[0, i] > 0.5 :
            Y_prediction[0,i] = 1
        else:
            Y_prediction[0,i] = 0
    
    return Y_prediction
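The thresholding loop can also be written as a single array operation. The following is a sketch of one such vectorized variant. predict_vectorized is a hypothetical name, and the inputs are toy values made up for illustration.

```python
import numpy as np

def predict_vectorized(w, b, X):
    """Hypothetical vectorized variant of predict(): threshold A at 0.5 in one step."""
    w = w.reshape(X.shape[0], 1)
    A = 1 / (1 + np.exp(-(np.dot(w.T, X) + b)))
    return (A > 0.5).astype(float)   # shape (1, m), entries 0.0 or 1.0

# Toy inputs, made up for illustration.
w = np.array([[0.1124579], [0.23106775]])
b = -0.3
X = np.array([[1.0, -1.1, -3.2], [1.2, 2.0, 0.1]])
print(predict_vectorized(w, b, X))  # [[1. 1. 0.]]
```

The comparison A > 0.5 yields a boolean array, and astype(float) converts it to the same 0/1 float format that the loop version produces.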

So far, you have implemented functions to:

  • Initialize (w,b)

  • Optimize the loss iteratively to learn parameters (w,b):

    • Computing the cost and its gradient

    • Updating the parameters using gradient descent

  • Use the learned (w,b) to predict the labels for a given set of examples

Merge all functions into a model

def model(X_train, Y_train, X_test, Y_test, num_iterations=2000, learning_rate=0.5, print_cost=False):
    """
    Builds the logistic regression model by calling the function you've implemented previously
    
    Arguments:
    X_train -- training set represented by a numpy array of shape (num_px * num_px * 3, m_train)
    Y_train -- training labels represented by a numpy array (vector) of shape (1, m_train)
    X_test -- test set represented by a numpy array of shape (num_px * num_px * 3, m_test)
    Y_test -- test labels represented by a numpy array (vector) of shape (1, m_test)
    num_iterations -- hyperparameter representing the number of iterations to optimize the parameters
    learning_rate -- hyperparameter representing the learning rate used in the update rule of optimize()
    print_cost -- Set to True to print the cost every 100 iterations
    
    Returns:
    d -- dictionary containing information about the model.
    """
    # initialize parameters with zeros
    # and use the "shape" function to get the first dimension of X_train
    w, b = initialize_with_zeros(X_train.shape[0])
    # Gradient descent 
    params, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost)
    # Retrieve parameters w and b from dictionary "params"
    w = params["w"]
    b = params["b"]
    # Predict test/train set examples
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)

    # Print train/test accuracy
    if print_cost:
        print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
        print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))

    
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test, 
         "Y_prediction_train" : Y_prediction_train, 
         "w" : w, 
         "b" : b,
         "learning_rate" : learning_rate,
         "num_iterations": num_iterations}
    
    return d
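The accuracy expression inside model() works because predictions and labels are both 0/1, so the mean absolute difference is the fraction of wrong predictions. The sketch below checks that it agrees with a direct comparison. accuracy_percent is a hypothetical helper, and the arrays are made up.

```python
import numpy as np

def accuracy_percent(Y_prediction, Y_true):
    """Percentage of predictions that match the labels (both of shape (1, m))."""
    return float(np.mean(Y_prediction == Y_true) * 100)

# Toy predictions and labels, made up for illustration.
Y_prediction = np.array([[1.0, 0.0, 1.0, 1.0]])
Y_true = np.array([[1, 0, 0, 1]])

via_abs = 100 - np.mean(np.abs(Y_prediction - Y_true)) * 100   # expression used in model()
via_eq = accuracy_percent(Y_prediction, Y_true)
print(via_abs, via_eq)  # 75.0 75.0
```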

Further Analysis

Choice of learning rate

Reminder: In order for Gradient Descent to work you must choose the learning rate wisely. The learning rate $\alpha$ determines how rapidly we update the parameters. If the learning rate is too large we may "overshoot" the optimal value. Similarly, if it is too small we will need too many iterations to converge to the best values. That's why it is crucial to use a well-tuned learning rate.
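To make the effect of the learning rate concrete, the sketch below runs the same gradient descent loop with three rates on synthetic data rather than the cat dataset. train_cost_curve is a hypothetical compact helper, not the model() function from this write-up, and the data, rates, and iteration count are made up. The clipping of A is an added safeguard to keep np.log finite.

```python
import numpy as np

def train_cost_curve(X, Y, learning_rate, num_iterations=200):
    """Run plain gradient descent on logistic regression; return the cost per iteration."""
    m = X.shape[1]
    w, b = np.zeros((X.shape[0], 1)), 0.0
    costs = []
    for _ in range(num_iterations):
        A = 1 / (1 + np.exp(-(np.dot(w.T, X) + b)))
        A = np.clip(A, 1e-12, 1 - 1e-12)   # added safeguard: keep np.log finite
        costs.append(float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))))
        dZ = A - Y
        w = w - learning_rate * np.dot(X, dZ.T) / m
        b = b - learning_rate * np.sum(dZ) / m
    return costs

# Synthetic, made-up data: 2 features, 100 examples, label depends on the feature sum.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)

for lr in (0.001, 0.01, 0.1):
    costs = train_cost_curve(X, Y, lr)
    print(f"learning_rate={lr}: first cost {costs[0]:.4f}, last cost {costs[-1]:.4f}")
```

On this toy problem all three rates are small enough to be stable, so the run mainly illustrates the "too small" side: the smaller the rate, the less the cost drops in the same number of iterations. It does not reproduce the oscillation that an overly large rate can cause.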

Interpretation

  • Different learning rates give different costs and thus different prediction results.

  • If the learning rate is too large (0.01), the cost may oscillate up and down. It may even diverge (though in this example, using 0.01 still eventually ends up at a good value for the cost).

  • A lower cost doesn't mean a better model. You have to check for possible overfitting, which shows up when the training accuracy is much higher than the test accuracy.

  • In deep learning, we usually recommend that you:

    • Choose the learning rate that better minimizes the cost function.

    • If your model overfits, use other techniques to reduce overfitting.
