小编典典

Why do we need to call zero_grad() in PyTorch?

Why does zero_grad() need to be called during training?

|  zero_grad(self)
|      Sets gradients of all model parameters to zero.

2022-05-23

1 answer

小编典典

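In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e. before computing the gradients used to update the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behavior is convenient when training RNNs or when we want the gradient of the loss summed over multiple mini-batches. For that reason, the default action is to accumulate (i.e. sum) the gradients on every loss.backward() call.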
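The mini-batch use case can be checked directly. The following is a minimal sketch, assuming synthetic random data and a hypothetical helper loss_on (neither is from the original answer): calling backward() on two mini-batches without zeroing in between should leave the same gradient in W.grad as a single backward() on the loss summed over all samples.

```python
import torch

torch.manual_seed(0)

# Synthetic data for illustration only (assumption, not from the answer)
data = torch.randn(8, 4)
targets = torch.randn(8, 3)

W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

def loss_on(x, y):
    # Hypothetical helper: summed squared error of a linear model
    return ((torch.matmul(x, W) + b - y) ** 2).sum()

# Two mini-batches of four samples; gradients are NOT zeroed in between,
# so the second backward() adds to what the first one stored
for start in (0, 4):
    loss_on(data[start:start + 4], targets[start:start + 4]).backward()
accumulated = W.grad.clone()

# Reset, then compute the gradient of the loss over all eight samples at once
W.grad.zero_()
b.grad.zero_()
loss_on(data, targets).backward()
full_batch = W.grad.clone()

# True up to floating-point rounding
matches = torch.allclose(accumulated, full_batch, rtol=1e-4, atol=1e-4)
```

Because the loss here is a plain sum over samples, the two results agree up to rounding. With a mean-reduced loss, each mini-batch loss would need to be rescaled for the same equivalence to hold.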

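Because of this, at the start of each iteration of your training loop you should zero out the gradients so that the parameter update is done correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly computed gradient. It would therefore point in some direction other than the intended one, towards the minimum (or the maximum, in the case of a maximization objective).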
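To make the "wrong direction" point concrete, here is a small sketch under assumed values: a hypothetical two-parameter quadratic loss, a starting point of [1, 1], and a learning rate of 0.1, all chosen for illustration. It takes one gradient step and then computes the gradient for the second step, once with zeroing and once without.

```python
import torch

# Hypothetical loss: an axis-aligned quadratic bowl with its minimum at the origin
scale = torch.tensor([1.0, 10.0])

def loss_fn(w):
    return (scale * w ** 2).sum()  # gradient is 2 * scale * w

def second_step_gradient(zero_between_steps):
    w = torch.tensor([1.0, 1.0], requires_grad=True)
    learning_rate = 0.1  # illustrative value

    loss_fn(w).backward()              # w.grad is [2, 20]
    with torch.no_grad():
        w -= learning_rate * w.grad    # w becomes approximately [0.8, -1.0]

    if zero_between_steps:
        w.grad.zero_()

    loss_fn(w).backward()              # fresh gradient at the new w is about [1.6, -20]
    return w.grad.clone()

fresh = second_step_gradient(True)     # approximately [1.6, -20]
stale = second_step_gradient(False)    # approximately [3.6, 0]: old [2, 20] plus fresh
```

In this setup the second parameter overshoots the minimum, so the correct gradient for it is strongly negative. Without zeroing, the leftover gradient from the first step cancels it almost exactly, and the second update would barely move that parameter.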

Here is a simple example:

import torch
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

# Variable is deprecated since PyTorch 0.4; plain tensors track gradients directly
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all tensors
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    # reduce to a scalar so that backward() needs no explicit gradient argument
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()

Alternatively, if you are doing vanilla gradient descent, then:

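W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
learning_rate = 0.01  # illustrative value, not from the original answer

for sample, target in zip(data, targets):
    # clear out the gradients of W and b; .grad is None until
    # the first backward() call, so guard against that case
    if W.grad is not None:
        W.grad.zero_()
        b.grad.zero_()

    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()

    # update in place without recording the update in the autograd graph
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad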

Note:

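The accumulation (i.e. sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeros. The docs claim that this setting reduces memory requirements and slightly improves performance, but that it might be error-prone if not handled carefully.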
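The difference between the two reset modes, and one way the None variant can trip up code, can be sketched as follows. The tensor values and the SGD optimizer are arbitrary choices for illustration; set_to_none is passed explicitly in both calls so the sketch does not depend on which default a given PyTorch version uses.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

# Mode 1: keep the gradient tensor and fill it with zeros
(w ** 2).sum().backward()
optimizer.zero_grad(set_to_none=False)
grad_is_zero_tensor = torch.equal(w.grad, torch.zeros(2))

# Mode 2: drop the gradient tensor entirely
(w ** 2).sum().backward()
optimizer.zero_grad(set_to_none=True)
grad_is_none = w.grad is None

# Code that reads .grad after the reset has to handle None,
# otherwise a call such as w.grad.norm() raises AttributeError
grad_norm = 0.0 if w.grad is None else w.grad.norm().item()
```

Any logging or gradient-clipping code that touches .grad between the reset and the next backward() call is where the "error-prone" caveat is most likely to show up.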

2022-05-23