
Backpropagation


Backprop with scalars

Computational Graphs

To make the process of computing gradients more efficient, we can use computational graphs, a way of representing the computation of a function as a graph of operations.

Computational Graph

In a computational graph, we compute the gradient of the function by backpropagation, applying the chain rule at each node.

There are two passes:

  • Forward pass: Compute the value of the function.
  • Backward pass: Compute the gradient of the function.
Computing Gradients
$$ \frac{\partial z}{\partial x} = \frac{\partial y}{\partial x} \frac{\partial z}{\partial y} $$

where $\frac{\partial z}{\partial y}$ is called the upstream gradient, $\frac{\partial y}{\partial x}$ is the local gradient, and $\frac{\partial z}{\partial x}$ is the downstream gradient.

For common functions such as sigmoid, ReLU, etc., we can compute the local gradient easily and pack them into the computational graph as single nodes instead of computing them step by step.

Computational Graph
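For instance, the sigmoid node can be packed like this. A minimal sketch, assuming (as the code later on this page does) that sigmoid_grad takes the sigmoid output rather than its input:

import math

def sigmoid(x):
    # forward pass of the packed sigmoid node
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(y):
    # local gradient written in terms of the output y = sigmoid(x):
    # d(sigmoid)/dx = sigmoid(x) * (1 - sigmoid(x)) = y * (1 - y)
    return y * (1.0 - y)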

Patterns in Gradient flow

Add gate

An add gate is a gradient distribution gate.

Add Gate

For example, for $f(x,y)=x+y$, the gradients of $f$ with respect to $x$ and $y$ are $\frac{\partial f}{\partial x}=1$ and $\frac{\partial f}{\partial y}=1$. So whatever the upstream gradient is, it is passed on unchanged to both inputs.
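For instance, with an illustrative upstream gradient of $\frac{\partial L}{\partial f}=5$ (a made-up number, not one from the figure):

$$ \frac{\partial L}{\partial x} = 1 \cdot \frac{\partial L}{\partial f} = 5, \qquad \frac{\partial L}{\partial y} = 1 \cdot \frac{\partial L}{\partial f} = 5 $$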

Copy gate

A copy gate is a gradient adder, in some sense the dual of the add gate.

Copy Gate

For example, if we have a function $z=f(x,y)$ and $x=y=t$,

then

$$ \frac{\partial z}{\partial t} = \frac{\partial x}{\partial t} \frac{\partial z}{\partial x} + \frac{\partial y}{\partial t} \frac{\partial z}{\partial y} $$

where the local gradients $\frac{\partial x}{\partial t}$ and $\frac{\partial y}{\partial t}$ are both $1$.
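For instance, with illustrative upstream gradients $\frac{\partial z}{\partial x}=3$ and $\frac{\partial z}{\partial y}=4$ (made-up numbers), the copy gate simply adds them: $\frac{\partial z}{\partial t}=1 \cdot 3 + 1 \cdot 4 = 7$.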

Multiply gate

A multiply gate is a gradient swap gate.

Multiply Gate

For example, if we have a function $z=xy$,

we know that $\frac{\partial z}{\partial x}=y$ and $\frac{\partial z}{\partial y}=x$,

so if we have an upstream gradient $\frac{\partial L}{\partial z}=2$,

then the downstream gradients of $x$ and $y$ are $\frac{\partial L}{\partial x}=2y$ and $\frac{\partial L}{\partial y}=2x$.
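These three patterns can be written as tiny standalone backward rules. A minimal illustrative sketch (not code from the original notes):

def add_backward(upstream):
    # add gate: distribute the upstream gradient unchanged to both inputs
    return upstream, upstream

def copy_backward(upstream_a, upstream_b):
    # copy gate: add the upstream gradients coming from the two branches
    return upstream_a + upstream_b

def mul_backward(upstream, x, y):
    # multiply gate: scale the upstream gradient by the *other* input
    return upstream * y, upstream * x

# e.g. z = x * y with x = 3, y = 4 and upstream dL/dz = 2
print(mul_backward(2.0, 3.0, 4.0))   # (8.0, 6.0)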

Coding format

When we write the code, we follow these steps:

  • Store every intermediate value in the forward pass.
  • Compute the gradient of the loss with respect to every variable in the backward pass.

For example, if we have a function like the following:

Sample

the code would be written as:

def f(w0, x0, w1, x1, w2):
    # forward pass: store every intermediate value
    s0 = w0 * x0
    s1 = w1 * x1
    s2 = s0 + s1
    s3 = s2 + w2
    L = sigmoid(s3)           # sigmoid / sigmoid_grad as defined above

    # backward pass: walk the graph in reverse order
    grad_L = 1.0
    grad_s3 = sigmoid_grad(L) * grad_L
    grad_w2 = grad_s3         # add gate distributes the gradient
    grad_s2 = grad_s3
    grad_s0 = grad_s2
    grad_s1 = grad_s2
    grad_w1 = grad_s1 * x1    # multiply gate swaps the gradient
    grad_x1 = grad_s1 * w1
    grad_w0 = grad_s0 * x0
    grad_x0 = grad_s0 * w0

    return L, grad_w0, grad_x0, grad_w1, grad_x1, grad_w2
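As a quick sanity check (with illustrative input values, not necessarily those in the figure), the analytic gradients can be compared against numeric finite differences:

def numeric_grad(fn, args, i, eps=1e-6):
    # central finite difference with respect to the i-th argument
    args_hi = list(args); args_hi[i] += eps
    args_lo = list(args); args_lo[i] -= eps
    return (fn(*args_hi)[0] - fn(*args_lo)[0]) / (2 * eps)

args = (2.0, -1.0, -3.0, -2.0, -3.0)   # illustrative values for w0, x0, w1, x1, w2
L, *grads = f(*args)
for i, g in enumerate(grads):
    print(g, numeric_grad(f, args, i))  # each pair should match closely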

Backprop with vectors and matrices

Backprop with vectors

  • $x \in \mathbb{R},\ y \in \mathbb{R}$: $\dfrac{\partial y}{\partial x} \in \mathbb{R}$
  • $x \in \mathbb{R}^n,\ y \in \mathbb{R}$: $\dfrac{\partial y}{\partial x} \in \mathbb{R}^n$
  • $x \in \mathbb{R}^n,\ y \in \mathbb{R}^m$: $\dfrac{\partial y}{\partial x} \in \mathbb{R}^{n \times m}$

For example, if $\mathbf{x}=(x_1,x_2,x_3)$ and $\mathbf{y}=(y_1,y_2)$,

the Jacobian is a $3 \times 2$ matrix:

$$ \frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1}\\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2}\\ \frac{\partial y_1}{\partial x_3} & \frac{\partial y_2}{\partial x_3} \end{bmatrix} $$

Note

Mathematically, if $y$ is $m$-dimensional and $x$ is $n$-dimensional, the Jacobian is conventionally written with shape $\frac{\partial y}{\partial x} \in \mathbb{R}^{m \times n}$. Here, however, the chain rule for backpropagation is written as

$$ \frac{\partial L}{\partial x} = \frac{\partial y}{\partial x} \frac{\partial L}{\partial y} $$

rather than the conventional

$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x} $$

In one dimension the two are equivalent because multiplication commutes, but in higher dimensions reversing the order requires transposing the matrices.

So here we write the Jacobian with shape $\frac{\partial y}{\partial x} \in \mathbb{R}^{n \times m}$.

The Jacobian's shape can be remembered from the form of the chain rule: degenerating the first form gives $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial x}\frac{\partial L}{\partial L}$, so the shape of $L$ comes last,

while degenerating the second form gives $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial x}\frac{\partial x}{\partial x}$, so the shape of $x$ comes last.
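As a shape check under this convention (treating the gradient of the scalar loss with respect to a vector as a column vector, as in the ReLU example below):

$$ \underbrace{\frac{\partial L}{\partial x}}_{n \times 1} = \underbrace{\frac{\partial y}{\partial x}}_{n \times m} \underbrace{\frac{\partial L}{\partial y}}_{m \times 1} $$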

Sketch

Eg

ReLU

For the ReLU function, if we have a 4D input $x=(1,-2,3,-1)$,

the output is $y=(1,0,3,0)$.

the Jacobian matrix is

$$ \frac{\partial y}{\partial x} = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix} $$

the upstream gradient is $\frac{\partial L}{\partial y}=(4,-1,5,9)$,

so the downstream gradient is

$$ \text{Jacobian} \times \text{upstream gradient} = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 4\\ -1\\ 5\\ 9 \end{bmatrix} = \begin{bmatrix} 4\\ 0\\ 5\\ 0 \end{bmatrix} $$

We can see that the Jacobian matrix is sparse, which is convenient when computing by hand but not for a computer: for high-dimensional inputs, the explicit Jacobian quickly becomes too large to handle.

So it is important to come up with tricks that make the computation more efficient.

For the ReLU function, we can use the following trick:

$$ \left(\frac{\partial L}{\partial x}\right)_{i} = \begin{cases} \left(\frac{\partial L}{\partial y}\right)_{i} & x_{i} > 0\\ 0 & x_{i} \leq 0 \end{cases} $$

and we can get the downstream gradient directly from the upstream gradient, without computing the Jacobian matrix.
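A minimal sketch of this trick in code (using NumPy, which is assumed here and not part of the original notes):

import numpy as np

def relu_backward(upstream, x):
    # keep the upstream gradient where x > 0, zero it elsewhere;
    # no explicit Jacobian is ever built
    return upstream * (x > 0)

x = np.array([1.0, -2.0, 3.0, -1.0])
upstream = np.array([4.0, -1.0, 5.0, 9.0])
print(relu_backward(upstream, x))   # [4. 0. 5. 0.]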

Backprop with matrices

or tensors

Matrix

When we compute backprop with matrices, the Jacobian is a 4D tensor, which is very hard to handle, so we again need some tricks to make the computation more efficient.

One simple way is to compute the gradient element-wise.

example
$$ \frac{d L}{d x_{11}} = \frac{d y}{d x_{11}} \frac{d L}{d y} $$

$\frac{d y}{d x_{11}}$ is a 2D matrix,

$$ \frac{d y}{d x_{11}} = \begin{bmatrix} \frac{d y_{11}}{d x_{11}} & \frac{d y_{12}}{d x_{11}} & \frac{d y_{13}}{d x_{11}} & \frac{d y_{14}}{d x_{11}} \\ \frac{d y_{21}}{d x_{11}} & \frac{d y_{22}}{d x_{11}} & \frac{d y_{23}}{d x_{11}} & \frac{d y_{24}}{d x_{11}} \end{bmatrix} = \begin{bmatrix} W_{1,:}\\ \mathbf{0} \end{bmatrix} $$

So

$$ \frac{d L}{d x_{11}} = \frac{d y}{d x_{11}} \cdot \frac{d L}{d y} = 3 $$

Note

The product here is an inner product, not a matrix product: multiply corresponding elements of the two matrices and sum the results. This is analogous to ordinary 2D matrix multiplication, where the element in row $i$, column $j$ of the result is the inner product of the $i$-th row of the first matrix and the $j$-th column of the second.

So when a tensor of shape $[(D_x \times D_y) \times (D_z \times M_z)]$ is multiplied with a matrix of shape $[D_z \times M_z]$, the result has shape $[D_x \times D_y]$, and its element at row $i$, column $j$ is the product, in this inner-product sense, of the $(i,j)$-th $[D_z \times M_z]$ slice of the tensor with that matrix.

Similarly

$\frac{d y}{d x_{ij}}$ is a 2D matrix, where the $i$-th row is the $j$-th row of $W$ and all other rows are zero.

And we can get the downstream gradient directly from the upstream gradient and the matrix $W$ with:

$$ \frac{d L}{d x} = \frac{d L}{d y} W^\mathtt{T} $$

This may look a little confusing at first, but check the shapes:

$$ \frac{d L}{d x} \in \mathbb{R}^{N \times D}, \qquad \frac{d L}{d y} \in \mathbb{R}^{N \times M}, \qquad W \in \mathbb{R}^{D \times M} $$

so this arrangement is the only way to make the matrix multiplication produce a result of the right shape.

Warning

This is not a different form of backpropagation; it is a corollary used to make the computation practical.
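A minimal numerical check of this corollary, assuming (as the shapes above suggest) a layer of the form $y = xW$ with $x \in \mathbb{R}^{N \times D}$, written with NumPy and illustrative sizes:

import numpy as np

N, D, M = 2, 3, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((N, D))
W = rng.standard_normal((D, M))
grad_y = rng.standard_normal((N, M))      # pretend upstream gradient dL/dy

# shortcut formula: dL/dx = dL/dy @ W^T, shape (N, D)
grad_x = grad_y @ W.T

# element-wise check against the definition dL/dx_ij = sum(dy/dx_ij * dL/dy)
i, j = 0, 1
dy_dxij = np.zeros((N, M))
dy_dxij[i, :] = W[j, :]                   # the i-th row is the j-th row of W
print(grad_x[i, j], np.sum(dy_dxij * grad_y))   # the two values should match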

We can also use backprop to compute higher-order derivatives.
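For example, a sketch with PyTorch (used purely as an illustration, not part of the original notes): keeping the first backward pass in the graph lets us backprop through it a second time.

import torch

x = torch.tensor(2.0, requires_grad=True)
f = x ** 3

# first backprop: df/dx = 3 * x^2, kept in the graph for further backprop
(g,) = torch.autograd.grad(f, x, create_graph=True)

# second backprop through the first: d^2 f / dx^2 = 6 * x
(h,) = torch.autograd.grad(g, x)
print(g.item(), h.item())   # 12.0 12.0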

example