Derivatives, Gradients and Backpropagation Interconnection
In this post, we will try to explain, in simple terms, the relation between derivatives, gradients and backpropagation in neural network circuits.
What are derivatives?
Derivatives denote the magnitude and direction of a slope, which indicates the rate of change. The derivative of an output value with respect to an input value can be thought of as the force to apply on that input value so as to make the output stronger (higher).
For example, take the multiply gate \(f(x,y)=x∗y\) with given values \(x=−2, y=3\), so that \(f(x,y)=−6\). If we want to move the output in the positive direction, we intuitively need to think about how to tweak the input parameters x and y so as to make the output stronger.
So, if we set x=−1 and y=2 (increment x and decrement y), then f(x,y)=−2, which is larger than the previous output. What we just did intuitively, tweaking each input by a small amount and observing the change in the output, is exactly what derivatives compute mathematically.
A derivative indicates the rate of change of a function w.r.t a variable in an infinitesimally small region around a particular point; it can be estimated by tweaking that input by a small amount and observing the change in the output value.
Note:- By *derivatives* I mean *partial derivatives*, called so because the output's behaviour is observed by differentiating the function w.r.t one variable independently, keeping the other variables constant. A partial derivative is technically different from a derivative in that it is incomplete, or partial: we compute the derivative of the function over only a portion of the total input space of the function, whereas the *complete (total) derivative* is the cumulative derivative of the function with respect to all input variables.
The following equation is the partial derivative of the function f(x,y) w.r.t the variable x:
$$\frac{\partial f(x,y)}{\partial x} = \lim_{h \to 0} \frac{f(x+h,y) - f(x,y)}{h}$$
So, the derivative with respect to each variable tells you the sensitivity (and its direction, positive or negative) of the whole expression to the value of that variable.
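As a quick sanity check, the limit above can be approximated numerically by choosing a small h. A minimal sketch, using the multiply gate f(x,y) = x*y and the values from the example above (the step size h is an arbitrary choice here):

# numerically approximate the partial derivatives of f(x, y) = x * y at (-2, 3)
def f(x, y):
    return x * y

x, y = -2.0, 3.0
h = 1e-5                                # a small step

df_dx = (f(x + h, y) - f(x, y)) / h     # approaches y = 3
df_dy = (f(x, y + h) - f(x, y)) / h     # approaches x = -2

print(df_dx, df_dy)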
Some important Differentiation Rules
Addition Rule of Differentiation :- $$ f(x,y) = (x+ y) \Rightarrow \frac{\partial f}{\partial x} = 1 ,\ \frac{\partial f}{\partial y} = 1 $$
Multiplicative Rule of Differentiation:- $$ f(x,y) = x*y \Rightarrow \frac{\partial f}{\partial x } = y ,\ \frac{\partial f}{\partial y } = x $$
Quotient Rule of Differentiation:- $$ h(x) = \frac{f(x)}{g(x)} \Rightarrow \frac{\partial h(x)}{\partial x} = \frac{g(x)f'(x) - f(x)g'(x)}{(g(x))^2} $$
Chain Rule of Differentiation:- $$ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial u} * \frac{\partial u}{\partial x} , u\to \text{some interim variable} $$
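To see the chain rule in action, here is a small sketch (the function and the interim variable u are chosen purely for illustration) that differentiates f(x,y) = (x+y)^2 through u = x + y and compares the analytic result against a numerical estimate:

# chain rule for f(x, y) = (x + y)**2 with interim variable u = x + y
def f(x, y):
    return (x + y) ** 2

x, y = 3.0, -4.0
u = x + y

# analytic: df/dx = (df/du) * (du/dx) = 2*u * 1
df_dx_chain = 2 * u * 1

# numerical check with a small step h
h = 1e-5
df_dx_numeric = (f(x + h, y) - f(x, y)) / h

print(df_dx_chain, df_dx_numeric)   # both approximately -2.0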
What are Gradients?
The gradient is a vector made up of the partial derivatives with respect to all of the inputs (the feature vector), concatenated into a single vector.
The gradient always gives the direction in which the function value (output) increases. So if computed on a loss function, it gives the direction in which the loss increases; to find the direction that minimizes the loss, we need to take the negative of the computed gradient.
Denoted by ∇f(x), the technically correct phrase for each component is the partial derivative on x for the function f(x,y), but for simplicity we will also call this term the gradient on x for f(x,y).
$$ f(x,y,z) = (x + y)z ,\quad q = (x+ y) \Rightarrow \frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} * \frac{\partial q}{\partial x} = z * 1 = z \quad \therefore \nabla f(x,y,z) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z} \right) = (z,\ z,\ x+y) $$
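A minimal sketch of this computation, with illustrative values for x, y and z: one forward pass through the interim variable q, then the local gradients are chained backwards to the inputs.

# forward and backward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0     # illustrative values

# forward pass with interim variable q = x + y
q = x + y                     # q = 3.0
f = q * z                     # f = -12.0

# backward pass: chain the local gradients back to the inputs
dq = z                        # df/dq = z
dz = q                        # df/dz = q
dx = 1.0 * dq                 # df/dx = df/dq * dq/dx = z * 1
dy = 1.0 * dq                 # df/dy = df/dq * dq/dy = z * 1

print((dx, dy, dz))           # gradient of f: (-4.0, -4.0, 3.0)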
Backpropagation
Backpropagation can intuitively be thought of as a procedure of communication between the gates of a neural network circuit, which tells each gate how strongly (in magnitude and direction) it needs to tweak its inputs in order to make the final output value higher. This communication happens through a gradient signal, computed at the output layer and passed back to the input layer by applying the chain rule recursively backwards over the intermediate values stored during the forward pass. To continue the recurrence and chain the gradient, the backpropagated gradient is multiplied with the local gradients of the interim gates, until it finally reaches all the input features, which can then be adjusted or tweaked as per the intent signalled from the output layer.
The learning in a neural network is essentially the adjustment of weights and biases, which happens during backpropagation. The network does not change (or learn) during the feedforward pass.
Also, gradients are calculated only while backpropagating. In the forward pass, we only compute the output from the input values.
It should be noted that, to compute the gradient of a given NN circuit equation, we should not try to symbolically differentiate the complete function w.r.t x or y; we would end up with very large and complex expressions.
So, to avoid this, backpropagation learns the weights and biases by iterating through every data point:
1. Feed forward the neural network circuit equation till the output layer
2. Define a loss function, based on the *gap* between the desired output and the feedforward output.
3. Update the weights and biases of each layer by computing the local gradients (the gradients of that layer) and chaining them with the gradient flowing back into that layer from the layer after it, so that the network circuit gets tweaked towards the direction of the desired output (a rough sketch of this loop follows below).
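As a rough sketch of this loop for a single sigmoid neuron with a squared-error loss (the data point, initial weight, bias and learning rate below are purely illustrative, not part of the worked example that follows):

import math

# one sigmoid neuron trained on a single data point (illustrative values)
x, target = 1.5, 1.0              # input and desired output
w, b = 0.1, 0.0                   # weight and bias to be learned
lr = 0.5                          # learning rate

for step in range(100):
    # 1. feed forward
    z = w * x + b
    out = 1.0 / (1.0 + math.exp(-z))       # sigmoid activation

    # 2. loss based on the gap between desired and feedforward output
    loss = 0.5 * (out - target) ** 2

    # 3. backpropagate: chain the local gradients back to w and b, then update
    dout = out - target                    # dloss/dout
    dz = dout * out * (1.0 - out)          # dloss/dz (sigmoid's local gradient)
    dw = dz * x                            # dloss/dw
    db = dz * 1.0                          # dloss/db
    w -= lr * dw
    b -= lr * db

print(w, b, loss)                          # loss shrinks as w and b are adjusted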
For example, take the following equation,
$$ \boxed{f(x,y)=\frac{x+\sigma(y)}{\sigma(x)+(x+y)^2}} \ \text{ where } \sigma \to \text{sigmoid fn} \Rightarrow \sigma(x) =\frac{1}{1+e^{-x}} $$
Backpropagation for it can be expressed in code like this:
import math

# assuming some random data input
# data point dp = (3, -4)
x = 3
y = -4

# forward pass
# sigmoid(z) = 1/(1 + e^(-z))
# sigmoid in the numerator
sigmoid_y = 1 / (1 + math.exp(-y))      #(1)
numerator = x + sigmoid_y               #(2)
# sigmoid in the denominator
sigmoid_x = 1 / (1 + math.exp(-x))      #(3)
xplusy = x + y                          #(4)
xplusy_sqr = xplusy ** 2                #(5)
denominator = sigmoid_x + xplusy_sqr    #denominator (6)
inv_denominator = 1 / denominator       #(7)
f = numerator * inv_denominator         #(8)


# now calculating backpropagation for every variable (eqn)
# along the way in the forward pass

# starting from the end, last variable;
# the gradient of f with respect to itself is 1, so it is left implicit
# backprop f = numerator * inv_denominator
dinv_denominator = numerator            # from eqn (8)
dnumerator = inv_denominator            # from eqn (8)

# backprop inv_denominator = 1/denominator by solving the local gradient first
# and then chaining:
# d(f)/d(denominator) = d(f)/d(inv_denominator) * d(inv_denominator)/d(denominator)
ddenominator = (-1.0 / denominator**2) * dinv_denominator   # from eqn (7)

# backprop denominator = sigmoid_x + xplusy_sqr
dsigmoid_x = 1 * ddenominator           # from eqn (6)
dxplusy_sqr = 1 * ddenominator          # from eqn (6)

# backprop xplusy_sqr = xplusy ** 2
dxplusy = (2 * xplusy) * dxplusy_sqr    # from eqn (5)

# backprop xplusy = x + y
dx = 1 * dxplusy                        # from eqn (4)
dy = 1 * dxplusy                        # from eqn (4)

# backprop sigmoid_x = 1/(1 + math.exp(-x))
# we need to accumulate the gradients, not replace them
# so, += and not =
dx += ((1 - sigmoid_x) * sigmoid_x) * dsigmoid_x   # from eqn (3)

# backprop numerator = x + sigmoid_y
dx += 1 * dnumerator                    # from eqn (2)
dsigmoid_y = 1 * dnumerator             # from eqn (2)

# backprop sigmoid_y = 1/(1 + math.exp(-y))
dy += ((1 - sigmoid_y) * sigmoid_y) * dsigmoid_y   # from eqn (1)

# done
Breaking the forward pass into stages (i.e. storing the forward-pass value of each stage into an intermediate variable) that can easily be backpropagated through, as done above, is called staged backpropagation.
Conclusion
I have tried to explain how backpropagation works and what role derivatives and gradients play in it. However, this vanilla form of backpropagation is computationally expensive, since every update processes all n data points, and it also suffers from the exploding/vanishing gradients problem as the number of layers grows.
So, to remove some of these limitations, the stochastic gradient descent technique was introduced: it performs backpropagation on a mini-batch of data points at one go, and in doing so it brings up the term epochs (one epoch is a full pass over all the mini-batches).
Stochastic gradient descent is a topic of its own, and I have purposely included only a short introduction to it in this article in order to keep the article concise.