Neural networks are one of the biggest black boxes out there, yet they are present in every corner of our lives. In essence, they are functions that are able to learn whacky patterns. But what makes them such a challenge to understand? Is it the heavy calculus involved? “There are so many derivatives to compute.” Okay, but if you know this one thing called the Chain Rule, you have the weapon you need to defeat the mini-boss that is calculus.
Let’s say each sample in our training data has 3 features and we have 100 samples. Our hidden layer has 5 neurons and finally our output layer has 1 neuron. One thing to keep in mind is that each neuron produces one output and one output only.
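If you like seeing the dimensions written out, here is a rough sketch of that setup in numpy. The variable names (`X_raw`, `n_hidden`, and so on) are just mine for illustration; nothing above fixes them, and the random data is only a stand-in.

```python
import numpy as np

# Hypothetical names, used only for this walkthrough.
n_samples, n_features = 100, 3   # 100 samples, 3 features each
n_hidden, n_output = 5, 1        # 5 hidden neurons, 1 output neuron

# Raw data the way most datasets ship it: one row per sample -> shape (100, 3)
X_raw = np.random.randn(n_samples, n_features)
print(X_raw.shape)  # (100, 3)
```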
There are 5 neurons in our hidden layer, stacked vertically. The weights we initialized are also 5 vertically stacked rows of weights, so each row represents one neuron, and consequently there is 1 bias per neuron/row. You might also be wondering why the input is 3 x 100 and not 100 x 3. Well, in the visual above, the input features are stacked vertically, so each sample is a column. That means you will often want to transpose your data when you first obtain it raw, because each sample is a row in most datasets. Luckily, transposing is super easy in numpy (read the docs).
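Continuing the sketch from above, the transpose and the hidden layer's parameters could look like this (again, the names are my own):

```python
# Transpose so each *column* is one sample, matching the vertical stacking above.
X = X_raw.T                                  # shape (3, 100)

# One row of weights per hidden neuron: 5 rows x 3 input features.
W1 = np.random.randn(n_hidden, n_features)   # shape (5, 3)
b1 = np.zeros((n_hidden, 1))                 # one bias per neuron -> shape (5, 1)
```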
Next, we will need an activation function for this layer to introduce non-linearity. We will be using the sigmoid activation function. It is applied element-wise, so it does not change the dimensions of the layer's output.
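A minimal sigmoid, just to make the element-wise point concrete (shapes pass straight through):

```python
def sigmoid(z):
    # Element-wise: 1 / (1 + e^(-z)); the output has the same shape as the input.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.zeros((5, 100))).shape)  # (5, 100) -- dimensions are untouched
```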
Now we will initialize the second/output layer.
So in this case, we have one neuron in our output layer. Thus, we use a single row of 5 weights, one weight for each of the 5 outputs from the previous layer, plus a single bias. Now that we have initialized our weights, we can start forward propagating.
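Sticking with the names from the earlier snippets, the output layer's parameters and one forward pass might look like the sketch below. I'm assuming here that the output neuron also gets a sigmoid, which the text hasn't committed to yet.

```python
# Output layer: 1 neuron, so 1 row of 5 weights and a single bias.
W2 = np.random.randn(n_output, n_hidden)   # shape (1, 5)
b2 = np.zeros((n_output, 1))               # shape (1, 1)

# Forward propagation through both layers:
Z1 = W1 @ X + b1     # (5, 3) @ (3, 100) + (5, 1) -> (5, 100)
A1 = sigmoid(Z1)     # (5, 100), applied element-wise
Z2 = W2 @ A1 + b2    # (1, 5) @ (5, 100) + (1, 1) -> (1, 100)
A2 = sigmoid(Z2)     # assumption: sigmoid on the output too -> one prediction per sample
```

Note how every shape falls out of the rule from earlier: one row of weights per neuron, one bias per row, and one output per neuron per sample.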