Derivatives, Gradients and Backpropagation Interconnection
In this post, we will try to explain, in simple terms, the relation between derivatives, gradients and backpropagation in neural network circuits.
What are derivatives?
Derivatives denote the magnitude and direction of a slope, which indicates the rate of change. The derivative of an output value with respect to an input value can be thought of as the force to apply on that input value so as to make the output stronger (higher).
For example, take the multiply gate \(f(x,y)=x∗y\) with given values \(x=−2, y=3\), so that \(f(x,y)=−6\). If we want to move the output in the positive direction, we intuitively need to think about how to tweak the input parameters x and y so as to make the output stronger.
So, if we set x=−1 and y=2 (increment x and decrement y), then f(x,y)=−2, which is larger than the previous output. What we just did intuitively, tweaking each input by a small amount and observing the change in the output, is exactly what derivatives compute mathematically.
A derivative indicates the rate of change of a function w.r.t a variable in an infinitesimally small region around a particular point; it can be estimated by tweaking that input by a small amount and observing the change in the output value.
Note:- By *derivatives* I mean *partial derivatives*, called so because the output's behaviour is observed by differentiating the function w.r.t one variable independently, keeping the other variables constant. A partial derivative is technically different from a derivative in that it is incomplete, or partial: we compute the derivative of the function over only a portion of the total input space of the function, whereas the *complete (total) derivative* is the cumulative derivative of the function with respect to all input variables.
The following equation is the partial derivative of the function f(x,y) w.r.t the variable x:
$$\frac{\partial f(x,y)}{\partial x} = \lim_{h \to 0} \frac{f(x+h,y) - f(x,y)}{h}$$
So, the derivative with respect to each variable tells you the sensitivity (and its direction, positive or negative) of the whole expression to the value of that variable.
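As a quick sanity check, the limit above can be approximated numerically by choosing a small h. A minimal sketch, using the multiply gate f(x,y) = x*y and the values from the example above (the step size h is an arbitrary choice here):

# numerically approximate the partial derivatives of f(x, y) = x * y at (-2, 3)
def f(x, y):
    return x * y

x, y = -2.0, 3.0
h = 1e-5                                # a small step

df_dx = (f(x + h, y) - f(x, y)) / h     # approaches y = 3
df_dy = (f(x, y + h) - f(x, y)) / h     # approaches x = -2

print(df_dx, df_dy)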
Some important Differentiation Rules
Addition Rule of Differentiation :- $$ f(x,y) = (x+ y) \Rightarrow \frac{\partial f}{\partial x} = 1 ,\ \frac{\partial f}{\partial y} = 1 $$
Multiplicative Rule of Differentiation:- $$ f(x,y) = x*y \Rightarrow \frac{\partial f}{\partial x } = y ,\ \frac{\partial f}{\partial y } = x $$
Quotient Rule of Differentiation:- $$ h(x) = \frac{f(x)}{g(x)} \Rightarrow \frac{\partial h(x)}{\partial x} = \frac{g(x)f'(x) - f(x)g'(x)}{(g(x))^2} $$
Chain Rule of Differentiation:- $$ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial u} * \frac{\partial u}{\partial x} , u\to \text{some interim variable} $$
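To see the chain rule in action, here is a small sketch (the function and the interim variable u are chosen purely for illustration) that differentiates f(x,y) = (x+y)^2 through u = x + y and compares the analytic result against a numerical estimate:

# chain rule for f(x, y) = (x + y)**2 with interim variable u = x + y
def f(x, y):
    return (x + y) ** 2

x, y = 3.0, -4.0
u = x + y

# analytic: df/dx = (df/du) * (du/dx) = 2*u * 1
df_dx_chain = 2 * u * 1

# numerical check with a small step h
h = 1e-5
df_dx_numeric = (f(x + h, y) - f(x, y)) / h

print(df_dx_chain, df_dx_numeric)   # both approximately -2.0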
What are Gradients?
The gradient is a vector made up of the partial derivatives with respect to all of the inputs (the feature vector), concatenated into a single vector.
The gradient always gives the direction in which the function value (output) increases. So if computed on a loss function, it gives the direction in which the loss increases; to find the direction that minimizes the loss, we need to take the negative of the computed gradient.
Denoted by ∇f(x), the technically correct phrase for each component is the partial derivative on x for the function f(x,y), but for simplicity we will also call this term the gradient on x for f(x,y).
$$ f(x,y,z) = (x + y)z ,\quad q = (x+ y) \Rightarrow \frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} * \frac{\partial q}{\partial x} = z * 1 = z \quad \therefore \nabla f(x,y,z) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z} \right) = (z,\ z,\ x+y) $$
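A minimal sketch of this computation, with illustrative values for x, y and z: one forward pass through the interim variable q, then the local gradients are chained backwards to the inputs.

# forward and backward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0     # illustrative values

# forward pass with interim variable q = x + y
q = x + y                     # q = 3.0
f = q * z                     # f = -12.0

# backward pass: chain the local gradients back to the inputs
dq = z                        # df/dq = z
dz = q                        # df/dz = q
dx = 1.0 * dq                 # df/dx = df/dq * dq/dx = z * 1
dy = 1.0 * dq                 # df/dy = df/dq * dq/dy = z * 1

print((dx, dy, dz))           # gradient of f: (-4.0, -4.0, 3.0)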
Backpropagation
Backpropagation can intuitively be thought of as a procedure of communication between the gates of a neural network circuit, which tells each gate how strongly (in magnitude and direction) it needs to tweak its inputs in order to make the final output value higher. This communication happens through a gradient signal, computed at the output layer and passed back to the input layer by applying the chain rule recursively backwards over the intermediate values stored during the forward pass. To continue the recurrence and chain the gradient, the backpropagated gradient is multiplied with the local gradients of the interim gates, until it finally reaches all the input features, which can then be adjusted or tweaked as per the intent signalled from the output layer.
The learning in a neural network is essentially the adjustment of weights and biases, which happens during backpropagation. The network does not change (or learn) during the feedforward pass.
Also, gradients are calculated only while backpropagating. In the forward pass, we only compute the output from the input values.
It should be noted that, to compute the gradient of a given NN circuit equation, we should not try to symbolically differentiate the complete function w.r.t x or y; we would end up with very large and complex expressions.
So, to avoid this, backpropagation learns the weights and biases by iterating through every data point:
1. Feed forward the neural network circuit equation till the output layer
2. Define a loss function, based on the *gap* between the desired output and the feedforward output.
3. Update the weights and biases of each layer by computing the local gradients (the gradients of that layer) and chaining them with the gradient flowing back into that layer from the layer after it, so that the network circuit gets tweaked towards the direction of the desired output (a rough sketch of this loop follows below).
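As a rough sketch of this loop for a single sigmoid neuron with a squared-error loss (the data point, initial weight, bias and learning rate below are purely illustrative, not part of the worked example that follows):

import math

# one sigmoid neuron trained on a single data point (illustrative values)
x, target = 1.5, 1.0              # input and desired output
w, b = 0.1, 0.0                   # weight and bias to be learned
lr = 0.5                          # learning rate

for step in range(100):
    # 1. feed forward
    z = w * x + b
    out = 1.0 / (1.0 + math.exp(-z))       # sigmoid activation

    # 2. loss based on the gap between desired and feedforward output
    loss = 0.5 * (out - target) ** 2

    # 3. backpropagate: chain the local gradients back to w and b, then update
    dout = out - target                    # dloss/dout
    dz = dout * out * (1.0 - out)          # dloss/dz (sigmoid's local gradient)
    dw = dz * x                            # dloss/dw
    db = dz * 1.0                          # dloss/db
    w -= lr * dw
    b -= lr * db

print(w, b, loss)                          # loss shrinks as w and b are adjusted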
For example, take the following equation,
$$ \boxed{f(x,y)=\frac{x+\sigma(y)}{\sigma(x)+(x+y)^2}} \ \text{ where } \sigma \to \text{sigmoid fn} \Rightarrow \sigma(x) =\frac{1}{1+e^{-x}} $$
Backpropagation for it can be expressed in code like this:
import math

# assuming some random data input
# data point dp = (3, -4)
x = 3
y = -4

# forward pass
# sigmoid(z) = 1/(1 + e^(-z))
# sigmoid in the numerator
sigmoid_y = 1 / (1 + math.exp(-y))      #(1)
numerator = x + sigmoid_y               #(2)
# sigmoid in the denominator
sigmoid_x = 1 / (1 + math.exp(-x))      #(3)
xplusy = x + y                          #(4)
xplusy_sqr = xplusy ** 2                #(5)
denominator = sigmoid_x + xplusy_sqr    #denominator (6)
inv_denominator = 1 / denominator       #(7)
f = numerator * inv_denominator         #(8)


# now calculating backpropagation for every variable (eqn)
# along the way in the forward pass

# starting from the end, last variable;
# the gradient of f with respect to itself is 1, so it is left implicit
# backprop f = numerator * inv_denominator
dinv_denominator = numerator            # from eqn (8)
dnumerator = inv_denominator            # from eqn (8)

# backprop inv_denominator = 1/denominator by solving the local gradient first
# and then chaining:
# d(f)/d(denominator) = d(f)/d(inv_denominator) * d(inv_denominator)/d(denominator)
ddenominator = (-1.0 / denominator**2) * dinv_denominator   # from eqn (7)

# backprop denominator = sigmoid_x + xplusy_sqr
dsigmoid_x = 1 * ddenominator           # from eqn (6)
dxplusy_sqr = 1 * ddenominator          # from eqn (6)

# backprop xplusy_sqr = xplusy ** 2
dxplusy = (2 * xplusy) * dxplusy_sqr    # from eqn (5)

# backprop xplusy = x + y
dx = 1 * dxplusy                        # from eqn (4)
dy = 1 * dxplusy                        # from eqn (4)

# backprop sigmoid_x = 1/(1 + math.exp(-x))
# we need to accumulate the gradients, not replace them
# so, += and not =
dx += ((1 - sigmoid_x) * sigmoid_x) * dsigmoid_x   # from eqn (3)

# backprop numerator = x + sigmoid_y
dx += 1 * dnumerator                    # from eqn (2)
dsigmoid_y = 1 * dnumerator             # from eqn (2)

# backprop sigmoid_y = 1/(1 + math.exp(-y))
dy += ((1 - sigmoid_y) * sigmoid_y) * dsigmoid_y   # from eqn (1)

# done
Breaking the forward pass into stages (i.e. storing the forward-pass value of each stage into an intermediate variable) that can easily be backpropagated through, as done above, is called staged backpropagation.
Conclusion
I have tried to explain how backpropagation works and what role derivatives and gradients play in it. However, this vanilla form of backpropagation is computationally expensive, since every update processes all n data points, and it also suffers from the exploding/vanishing gradients problem as the number of layers grows.
So, to remove some of these limitations, the stochastic gradient descent technique was introduced: it performs backpropagation on a mini-batch of data points at one go, and in doing so it brings up the term epochs (one epoch is a full pass over all the mini-batches).
Stochastic gradient descent is a topic of its own, and I have purposely included only a short introduction to it in this article in order to keep the article concise.