
$$\theta := \theta - \frac{dJ}{d\theta}$$ So every iteration, or "epoch", we decrement \(\theta\) by the gradient of the cost \(J\) with respect to \(\theta\) at that point (\(:=\) is the assignment operator).

In practice, the gradient descent update looks like this:

$$ \theta := \theta - \alpha\left(\frac{dJ}{d\theta}\right)$$ where \(\alpha\) is the learning rate, a hyperparameter that we pick.
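To make this concrete, here is a minimal sketch of one-variable gradient descent on a toy cost \(J(\theta) = (\theta - 3)^2\) (the cost, starting point, and learning rate here are made up for illustration):

theta = 0.0  # arbitrary starting point
alpha = 0.1  # learning rate, a hyperparameter we pick

for epoch in range(100):
    grad = 2 * (theta - 3)        # dJ/dtheta for J(theta) = (theta - 3)^2
    theta = theta - alpha * grad  # the update rule from above

print(theta)  # approaches 3, the minimizer of J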

Mean Squared Error: $$\frac{1}{M} \sum_{i=1}^{M}(y^{(i)} - a^{(L)}_i)^2$$

Let's break this down: \(M\) is the number of samples, \(a^{(L)}_i\) is the output of the neural network for the \(i\)-th sample, and \(y^{(i)}\) is the \(i\)-th label. So we square the difference between the prediction and the label for each sample and take the average over all samples. Why square it? Because if we didn't, positive and negative errors would cancel each other out. Squaring makes sure the error is always positive for every sample.
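As a quick sketch of this formula in NumPy (the labels and outputs below are made-up numbers):

import numpy as np

y = np.array([1.0, 0.0, 1.0, 1.0])  # labels y^(i)
a = np.array([0.9, 0.2, 0.8, 0.4])  # network outputs a_i^(L)

mse = np.mean((y - a) ** 2)  # square each difference, then average over M samples
print(mse)  # 0.1125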

This is a linear regression function, but it still does the job of demonstrating how error works. The red squares in this image represent the error from each data sample, the blue line is the prediction function, and the black dots are the data samples. To find the area of each square, we square the difference between the label and the prediction. The areas of these squares are averaged to find the error of the model.

$$ \frac{dJ}{dW} = \frac{dJ}{da^{(L)}} * \frac{da^{(L)}}{dz} * \frac{dz}{dW}$$

This chain-rule expansion is for a single-layer perceptron; it tells us how the cost changes with respect to the weight parameters.
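For instance, with a sigmoid activation \(a^{(L)} = \sigma(z)\), \(z = Wx + b\), and the squared error from above, each factor works out as follows (a worked example under those assumptions):

$$ \displaylines{\frac{dJ}{da^{(L)}} = 2(a^{(L)} - y) \\ \frac{da^{(L)}}{dz} = a^{(L)}(1 - a^{(L)}) \\ \frac{dz}{dW} = x \\ \frac{dJ}{dW} = 2(a^{(L)} - y) * a^{(L)}(1 - a^{(L)}) * x} $$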

Let's go back to the neural network we used in the other blog:

# layer 1
Z1 = W1 @ X + B1
A1 = sigmoid(Z1)

# layer 2
Z2 = W2 @ A1 + B2
A2 = sigmoid(Z2)
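These snippets assume sigmoid (and, later, d_sigmoid) helpers. A minimal sketch of how they could be defined:

import numpy as np

def sigmoid(z):
    # elementwise logistic function: 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

def d_sigmoid(z):
    # derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)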

$$ \displaylines{J(\theta) = \frac{1}{M}\sum_{i=1}^{M}(A_2 - y)^2 \\ \\
\frac{dJ}{dW_2} = \frac{dJ}{dA_2} * \frac{dA_2}{dZ_2} * \frac{dZ_2}{dW_2}\\
\frac{dJ}{dB_2} = \frac{dJ}{dA_2} * \frac{dA_2}{dZ_2} * \frac{dZ_2}{dB_2}\\
\frac{dJ}{dW_1} = \frac{dJ}{dA_2} * \frac{dA_2}{dZ_2} * \frac{dZ_2}{dA_1} * \frac{dA_1}{dZ_1} * \frac{dZ_1}{dW_1}\\
\frac{dJ}{dB_1} = \frac{dJ}{dA_2} * \frac{dA_2}{dZ_2} * \frac{dZ_2}{dA_1} * \frac{dA_1}{dZ_1} * \frac{dZ_1}{dB_1}} $$

Now, let's compute each derivative. I am going to leave out the summation notation for simplicity.

$$ \displaylines{\frac{dJ}{dA_2} = 2(A_2 - y) \\
\frac{dA_2}{dZ_2} = A_2(1 - A_2)\longrightarrow\textrm{simplified} \\
\frac{dZ_2}{dW_2} = A_1 \\
\frac{dZ_2}{dB_2} = 1} $$
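The "simplified" note refers to the sigmoid derivative identity; spelling that step out:

$$ \frac{d}{dz}\sigma(z) = \frac{d}{dz}\left(\frac{1}{1 + e^{-z}}\right) = \frac{e^{-z}}{(1 + e^{-z})^2} = \sigma(z)(1 - \sigma(z)) $$

Since \(A_2 = \sigma(Z_2)\), this gives \(\frac{dA_2}{dZ_2} = A_2(1 - A_2)\).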

Some derivatives we need are already calculated above.

$$ \displaylines{\frac{dZ_2}{dA_1} = W_2 \\
\frac{dA_1}{dZ_1} = A_1(1 - A_1) \\
\frac{dZ_1}{dW_1} = X \\
\frac{dZ_1}{dB_1} = 1 }$$ Now, you can plug in these values to get \(\frac{dJ}{dW_2}\), \(\frac{dJ}{dB_2}\), \(\frac{dJ}{dW_1}\), and \(\frac{dJ}{dB_1}\). In NumPy, the backward pass and the gradient descent update look like this:

# backward pass (gradients of J)
dA2 = 2 * (A2 - y)                        # dJ/dA2
dZ2 = dA2 * d_sigmoid(Z2)                 # dJ/dZ2 = dJ/dA2 * dA2/dZ2
dW2 = dZ2.dot(A1.T)                       # dJ/dW2 = dJ/dZ2 * dZ2/dW2
dB2 = np.sum(dZ2, axis=1, keepdims=True)  # dJ/dB2, summed over samples
dZ1 = W2.T.dot(dZ2) * d_sigmoid(Z1)       # propagate the error back through layer 2
dW1 = dZ1.dot(X.T)                        # dJ/dW1
dB1 = np.sum(dZ1, axis=1, keepdims=True)  # dJ/dB1

# gradient descent update
W1 -= dW1 * lr
B1 -= dB1 * lr
W2 -= dW2 * lr
B2 -= dB2 * lr

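Putting it all together, here is a minimal training loop. This is a sketch rather than the exact code from the blog: the data, network shape, epoch count, and learning rate are placeholder assumptions, and it reuses the sigmoid/d_sigmoid helpers from earlier.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 100))                   # made-up data: 2 features, 100 samples
y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)  # made-up binary labels

W1 = rng.standard_normal((4, 2)) * 0.1  # layer 1: 4 hidden units
B1 = np.zeros((4, 1))
W2 = rng.standard_normal((1, 4)) * 0.1  # layer 2: 1 output unit
B2 = np.zeros((1, 1))
lr = 0.01

for epoch in range(1000):
    # forward pass
    Z1 = W1 @ X + B1
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + B2
    A2 = sigmoid(Z2)
    # backward pass (same gradients as above)
    dA2 = 2 * (A2 - y)
    dZ2 = dA2 * d_sigmoid(Z2)
    dW2 = dZ2.dot(A1.T)
    dB2 = np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = W2.T.dot(dZ2) * d_sigmoid(Z1)
    dW1 = dZ1.dot(X.T)
    dB1 = np.sum(dZ1, axis=1, keepdims=True)
    # gradient descent update
    W1 -= dW1 * lr
    B1 -= dB1 * lr
    W2 -= dW2 * lr
    B2 -= dB2 * lr

print(np.mean((A2 - y) ** 2))  # MSE after training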