Backpropagation
Backprop with scalars
Computational Graphs
To make the process of computing gradients more efficient, we can use a computational graph: a way to represent the computation of a function as a graph of simple operations.
(Figure: Computational Graph)
In a computational graph, we compute the gradient of the function by backpropagation, using the chain rule. There are two passes:

- Forward pass: compute the value of the function.
- Backward pass: compute the gradient of the function.
Computing Gradients
$$
\frac{\partial z}{\partial x} = \frac{\partial y}{\partial x} \frac{\partial z}{\partial y}
$$
where $\frac{\partial z}{\partial y}$ is called the upstream gradient, $\frac{\partial y}{\partial x}$ is the local gradient, and $\frac{\partial z}{\partial x}$ is the downstream gradient.
For common functions such as sigmoid, ReLU, etc., the local gradient is easy to compute, so we can pack them into single nodes of the computational graph instead of computing them step by step.
(Figure: Computational Graph)
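The sigmoid function is a good example of why this helps: its local gradient can be written directly in terms of its own output, so the whole sigmoid can be treated as one node (this is also what the `sigmoid_grad` helper in the code below relies on):

$$
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\frac{d\sigma}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)\bigl(1 - \sigma(x)\bigr)
$$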
Patterns in Gradient flow
Add gate
An add gate is a gradient distribution gate.
(Figure: Add Gate)
For example, for $f(x, y) = x + y$, the gradients of $f$ with respect to $x$ and $y$ are $\frac{\partial f}{\partial x} = 1$ and $\frac{\partial f}{\partial y} = 1$. So whatever the upstream gradient is, it is distributed to both inputs.
Copy gate
A copy gate is a gradient addition gate, in some sense the dual of the add gate.
(Figure: Copy Gate)
For example, if we have a function $z = f(x, y)$ with $x = y = t$, then

$$
\frac{\partial z}{\partial t} = \frac{\partial x}{\partial t} \frac{\partial z}{\partial x} + \frac{\partial y}{\partial t} \frac{\partial z}{\partial y}
$$

where the local gradients $\frac{\partial x}{\partial t}$ and $\frac{\partial y}{\partial t}$ are both $1$, so the two upstream gradients are added.
Multiply gate
A multiply gate is a gradient swap gate.
(Figure: Multiply Gate)
For example, if we have a function $z = xy$, we know that $\frac{\partial z}{\partial x} = y$ and $\frac{\partial z}{\partial y} = x$, so if we have an upstream gradient $\frac{\partial L}{\partial z} = 2$, then the downstream gradients of $x$ and $y$ are $\frac{\partial L}{\partial x} = 2y$ and $\frac{\partial L}{\partial y} = 2x$.
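These three patterns can be written as a minimal sketch in code (the function names below are ours, chosen only for illustration):

```python
def add_gate_backward(upstream):
    # add gate: the upstream gradient is distributed to both inputs
    return upstream, upstream

def copy_gate_backward(upstream_a, upstream_b):
    # copy gate: the gradients arriving along both copies are added
    return upstream_a + upstream_b

def multiply_gate_backward(upstream, x, y):
    # multiply gate: swap the inputs and scale by the upstream gradient
    return upstream * y, upstream * x

# with an upstream gradient of 2, the multiply gate gives (2*y, 2*x)
print(multiply_gate_backward(2.0, 3.0, 4.0))  # (8.0, 6.0)
```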
When we write the code, we follow two steps:

- Store every intermediate value in the forward pass.
- Compute the gradient of the loss with respect to every variable in the backward pass.
For example, if we have a function as in the figure below:
(Figure: Sample)
The code would be written as:
```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(s):
    # derivative of the sigmoid expressed via its output s = sigmoid(x)
    return s * (1.0 - s)

def f(w0, x0, w1, x1, w2):
    # forward pass: store every intermediate value
    s0 = w0 * x0
    s1 = w1 * x1
    s2 = s0 + s1
    s3 = s2 + w2
    L = sigmoid(s3)

    # backward pass: compute the gradient of L w.r.t. every variable
    grad_L = 1.0
    grad_s3 = sigmoid_grad(L) * grad_L
    grad_w2 = grad_s3        # add gate distributes the gradient
    grad_s2 = grad_s3
    grad_s0 = grad_s2
    grad_s1 = grad_s2
    grad_w1 = grad_s1 * x1   # multiply gate swaps the gradient
    grad_x1 = grad_s1 * w1
    grad_w0 = grad_s0 * x0
    grad_x0 = grad_s0 * w0
    return L, grad_w0, grad_x0, grad_w1, grad_x1, grad_w2
```
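A quick sanity check of the code above (a sketch; the input values below are arbitrary) compares the analytic gradient with a finite-difference estimate:

```python
# example inputs, chosen arbitrarily for illustration
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

L, grad_w0, *_ = f(w0, x0, w1, x1, w2)

# finite-difference check of grad_w0
eps = 1e-5
L_plus, *_ = f(w0 + eps, x0, w1, x1, w2)
numeric_grad_w0 = (L_plus - L) / eps

print(grad_w0, numeric_grad_w0)  # the two values should be close
```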
Backprop with vectors and matrices
Backprop with vectors
- $x \in \mathbb{R},\ y \in \mathbb{R}$: $\dfrac{\partial y}{\partial x} \in \mathbb{R}$ (ordinary derivative)
- $x \in \mathbb{R}^n,\ y \in \mathbb{R}$: $\dfrac{\partial y}{\partial x} \in \mathbb{R}^n$ (gradient)
- $x \in \mathbb{R}^n,\ y \in \mathbb{R}^m$: $\dfrac{\partial y}{\partial x} \in \mathbb{R}^{n \times m}$ (Jacobian matrix)
For example, if $\mathbf{x} = (x_1, x_2, x_3)$ and $\mathbf{y} = (y_1, y_2)$, the Jacobian matrix is a $3 \times 2$ matrix:

$$
\frac{\partial y}{\partial x} = \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1}\\
\frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2}\\
\frac{\partial y_1}{\partial x_3} & \frac{\partial y_2}{\partial x_3}
\end{bmatrix}
$$
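As a small sketch (NumPy; the map `g` and the helper `numerical_jacobian` are made up here for illustration), a finite-difference Jacobian laid out with this $n \times m$ convention indeed comes out $3 \times 2$:

```python
import numpy as np

def g(x):
    # an arbitrary map from R^3 to R^2, chosen only for illustration
    return np.array([x[0] * x[1], x[1] + x[2]])

def numerical_jacobian(g, x, eps=1e-6):
    # J[i, j] = dy_j / dx_i, matching the n x m convention used here
    y = g(x)
    J = np.zeros((x.size, y.size))
    for i in range(x.size):
        x_pert = x.copy()
        x_pert[i] += eps
        J[i] = (g(x_pert) - y) / eps
    return J

x = np.array([1.0, 2.0, 3.0])
print(numerical_jacobian(g, x).shape)  # (3, 2)
```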
Note
Actually, in mathematics, if $y$ is $m$-dimensional and $x$ is $n$-dimensional, the conventional Jacobian is usually written with shape $\frac{\partial y}{\partial x} \in \mathbb{R}^{m \times n}$. But here, the chain rule for backpropagation is written as

$$
\frac{\partial L}{\partial x} = \frac{\partial y}{\partial x} \frac{\partial L}{\partial y}
$$

instead of the conventional

$$
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x}
$$

In the one-dimensional case the two forms are equivalent, since multiplication commutes. In higher dimensions, reversing the order requires adding transposes to the matrices. That is why we write the Jacobian with shape $\frac{\partial y}{\partial x} \in \mathbb{R}^{n \times m}$.

To remember the Jacobian shape, follow the way the chain rule is written: in the first form the factor involving $L$ comes last, so the shape of $L$ sits at the end; in the second form the factor involving $x$ comes last, so the shape of $x$ sits at the end.
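Writing the shapes under each factor shows why this ordering works without any transposes (treating the gradient of the scalar $L$ with respect to a vector as a column vector):

$$
\underbrace{\frac{\partial L}{\partial x}}_{n \times 1}
= \underbrace{\frac{\partial y}{\partial x}}_{n \times m}
\underbrace{\frac{\partial L}{\partial y}}_{m \times 1}
$$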
(Figure: Sketch)
Example
ReLU
For the ReLU function, if we have a 4-dimensional input $x = (1, -2, 3, -1)$, the output is $y = (1, 0, 3, 0)$. The Jacobian matrix is

$$
\frac{\partial y}{\partial x} = \begin{bmatrix}
1 & 0 & 0 & 0\\
0 & 0 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 0 & 0
\end{bmatrix}
$$
The upstream gradient is $\frac{\partial L}{\partial y} = (4, -1, 5, 9)$, so the downstream gradient is

$$
\text{Jacobian} \times \text{upstream gradient} = \begin{bmatrix}
1 & 0 & 0 & 0\\
0 & 0 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 0 & 0
\end{bmatrix} \begin{bmatrix}
4\\ -1\\ 5\\ 9
\end{bmatrix} = \begin{bmatrix}
4\\ 0\\ 5\\ 0
\end{bmatrix}
$$
We can see that the Jacobian matrix is sparse. That is a nice property when computing by hand, but not for a computer: for high-dimensional inputs, the explicit Jacobian becomes far too large to build and multiply. So it is important to come up with tricks that make the computation more efficient. For the ReLU function, we can use the following rule:
$$
\left(\frac{\partial L}{\partial x}\right)_{i} = \begin{cases}
\left(\frac{\partial L}{\partial y}\right)_{i} & x_{i} > 0\\
0 & x_{i} \leq 0
\end{cases}
$$
and we can get the downstream gradient directly from the upstream gradient, without computing the Jacobian matrix.
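In code, this rule is just an element-wise mask. A minimal NumPy sketch (the name `relu_backward` is ours, not from any library), which agrees with the explicit Jacobian product from the example above:

```python
import numpy as np

def relu_backward(upstream_grad, x):
    # downstream gradient of ReLU: pass the upstream gradient where x > 0
    return np.where(x > 0, upstream_grad, 0.0)

x = np.array([1.0, -2.0, 3.0, -1.0])
upstream = np.array([4.0, -1.0, 5.0, 9.0])

# explicit (sparse) Jacobian, as in the example above
J = np.diag((x > 0).astype(float))

print(relu_backward(upstream, x))  # [4. 0. 5. 0.]
print(J @ upstream)                # same result, but builds a 4x4 matrix
```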
Backprop with matrices
or tensors
(Figure: Matrix)
When we compute backprop with matrices, the Jacobian is a 4-D tensor, which is very hard to handle, so we need some tricks to make the computation more efficient.
One simple way is to compute the gradient element-wise.
Example
$$
\frac{dL}{dx_{11}} = \frac{dy}{dx_{11}} \frac{dL}{dy}
$$
$\frac{dy}{dx_{11}}$ is a 2-D matrix:

$$
\frac{dy}{dx_{11}} = \begin{bmatrix}
\frac{dy_{11}}{dx_{11}} & \frac{dy_{12}}{dx_{11}} & \frac{dy_{13}}{dx_{11}} & \frac{dy_{14}}{dx_{11}}\\
\frac{dy_{21}}{dx_{11}} & \frac{dy_{22}}{dx_{11}} & \frac{dy_{23}}{dx_{11}} & \frac{dy_{24}}{dx_{11}}
\end{bmatrix} = \begin{bmatrix}
W_1\\
\mathbf{0}
\end{bmatrix}
$$
So

$$
\frac{dL}{dx_{11}} = \frac{dy}{dx_{11}} \cdot \frac{dL}{dy} = 3
$$

where the value $3$ comes from the concrete numbers in the figure above.
Note
Here the product is an inner product (multiply corresponding elements and sum), not a matrix multiplication. You can compare this with ordinary 2-D matrix multiplication, where the element in row $i$, column $j$ of the result is the inner product of the $i$-th row of the first matrix with the $j$-th column of the second.

Likewise, when a tensor of shape $[(D_x \times D_y) \times (D_z \times M_z)]$ is multiplied with a matrix of shape $[D_z \times M_z]$, the result has shape $[D_x \times D_y]$, and its element at row $i$, column $j$ is the inner product of the $(i, j)$-th matrix slice of the first tensor with the second matrix.
Similarly, $\frac{dy}{dx_{ij}}$ is a 2-D matrix whose $i$-th row is the $j$-th row of $W$ and whose other rows are all zero.
And we can get the downstream gradient directly from the upstream gradient and the matrix $W$ with:

$$
\frac{dL}{dx} = \frac{dL}{dy} W^\mathtt{T}
$$
It may seem a little confusing at first, but the shapes force it:

$$
\frac{dL}{dx} \in \mathbb{R}^{N \times D}, \qquad
\frac{dL}{dy} \in \mathbb{R}^{N \times M}, \qquad
W \in \mathbb{R}^{D \times M},
$$

so this arrangement is the only way to make the matrix multiplication produce the right shape.
Warning
This is not a separate form of backpropagation; it is a corollary we use to carry out the computation.
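A small NumPy sketch (shapes and values are arbitrary choices for illustration) confirms both the shape argument and the formula $\frac{dL}{dx} = \frac{dL}{dy} W^\mathtt{T}$, using $L = \sum_{ij} \left(\frac{dL}{dy}\right)_{ij} y_{ij}$ as a stand-in loss so the gradient is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 2, 3, 4

x = rng.standard_normal((N, D))
W = rng.standard_normal((D, M))
y = x @ W                               # forward: y has shape (N, M)

upstream = rng.standard_normal((N, M))  # stand-in for dL/dy from above
downstream = upstream @ W.T             # dL/dx = dL/dy @ W^T

print(downstream.shape)                 # (2, 3), the same shape as x

# finite-difference check of one element, with L = sum(upstream * y)
eps = 1e-6
x_pert = x.copy()
x_pert[0, 0] += eps
numeric = (np.sum(upstream * (x_pert @ W)) - np.sum(upstream * y)) / eps
print(downstream[0, 0], numeric)        # the two values should be close
```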
We can also use backprop to compute higher-order derivatives.
Example
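As one possible illustration of this idea (a sketch using PyTorch autograd; not necessarily the example intended here), differentiating the graph that computes the first derivative gives the second derivative:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# first derivative: dy/dx = 3 x^2 = 12 at x = 2
(g,) = torch.autograd.grad(y, x, create_graph=True)

# second derivative: d^2y/dx^2 = 6 x = 12 at x = 2,
# obtained by backpropagating through the graph that computed g
(h,) = torch.autograd.grad(g, x)

print(g.item(), h.item())  # 12.0 12.0
```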