Learning Rate Scheduler

Learning Rate

The learning rate is the step size used when the model parameters are updated along the gradient.

The learning rate affects performance, and a badly chosen value can prevent the model from learning at all. Choosing it well is therefore very important.
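To make the effect of the step size concrete, here is a minimal sketch in plain Python. The function f(x) = x**2, the helper name gradient_descent, and the two learning rates are illustrative assumptions, not taken from any particular model.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x      # derivative of x**2
        x = x - lr * grad   # the learning rate scales each step
    return x

small = gradient_descent(lr=0.1)  # moves steadily toward the minimum at 0
large = gradient_descent(lr=1.1)  # overshoots the minimum and grows every step
print(abs(small), abs(large))
```

With the small learning rate the iterate shrinks toward 0, while with the large one every update lands further from the minimum than the previous point, so training on this toy problem never converges.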

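Finding an appropriate learning rate is difficult. Learning rate schedules were introduced to address this problem.

You can use the same learning rate from start to finish, but using a learning rate scheduler during training can give better performance.

Training is known to work well when it starts with a large learning rate to optimize quickly and then reduces the learning rate for fine adjustment as it approaches the optimum.

Besides decaying the learning rate, some research reports that repeatedly lowering and raising the learning rate helps performance even more.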

Using a Learning Rate Scheduler

Define the optimizer and the scheduler first. Then, during training, call optimizer.step() once per batch and scheduler.step() once per epoch.
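The call order can be sketched as follows. This is only an illustration: the toy data, the nn.Linear model, and the choice of StepLR (used here because its output is easy to check by hand) are assumptions, not part of the original post.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Toy regression data and model; all names here are illustrative.
inputs = torch.randn(64, 3)
targets = inputs.sum(dim=1, keepdim=True)
model = nn.Linear(3, 1)

optimizer = optim.SGD(model.parameters(), lr=0.1)
# StepLR halves the learning rate every 2 epochs (gamma=0.5).
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

lr_history = []
for epoch in range(6):
    for start in range(0, 64, 16):  # four mini-batches per epoch
        x = inputs[start:start + 16]
        y = targets[start:start + 16]
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()   # once per batch
    lr_history.append(optimizer.param_groups[0]["lr"])
    scheduler.step()       # once per epoch, after the optimizer steps

print(lr_history)
```

The learning rate used in each epoch is read from optimizer.param_groups, which is where every PyTorch scheduler writes its updated value.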

PyTorch provides a variety of learning rate schedulers out of the box.

Among them, this post looks at the widely used CosineAnnealingWarmRestarts.

CosineAnnealingWarmRestarts

https://paperswithcode.com/method/cosine-annealing

Cosine annealing is a type of learning rate schedule. It starts with a large learning rate, decreases it relatively quickly to a minimum value, and then rapidly increases it again.

Resetting the learning rate in this way is called a "warm restart".

CosineAnnealingWarmRestarts in PyTorch

T_0 : number of epochs in the first cycle
T_mult : factor by which the cycle length is multiplied after each restart
eta_min : minimum learning rate
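How these three parameters shape the schedule can be sketched in plain Python. This is a hand-written illustration of the cosine formula, not PyTorch's implementation. The function name cosine_warm_restart_lr and the argument eta_max are hypothetical; in PyTorch the peak value is the optimizer's initial learning rate.

```python
import math

def cosine_warm_restart_lr(epoch, eta_max, T_0, T_mult=1, eta_min=0.0):
    """Learning rate at an integer epoch under cosine annealing with warm restarts."""
    T_i, T_cur = T_0, epoch
    # Walk through completed cycles; each new cycle is T_mult times longer.
    while T_cur >= T_i:
        T_cur -= T_i
        T_i *= T_mult
    # Cosine curve from eta_max (T_cur = 0) down toward eta_min (T_cur -> T_i).
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * T_cur / T_i)) / 2

schedule = [
    cosine_warm_restart_lr(e, eta_max=0.1, T_0=10, T_mult=2, eta_min=0.001)
    for e in range(40)
]
# With T_0=10 and T_mult=2 the cycles last 10, 20, 40, ... epochs,
# so the learning rate jumps back to eta_max at epochs 10 and 30.
print(schedule[0], schedule[9], schedule[10], schedule[30])
```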

According to the reference, PyTorch's CosineAnnealingWarmRestarts implements neither a warm-up start nor a decay of the maximum learning rate from cycle to cycle, which makes it somewhat lacking.

The reference therefore recommends using a custom CosineAnnealingWarmRestarts.

It is PyTorch's CosineAnnealingWarmRestarts code with a warm-up start and a decaying maximum value added.

Code source: https://github.com/gaussian37/pytorch_deep_learning_models/blob/master/cosine_annealing_with_warmup/cosine_annealing_with_warmup.py

import math
from torch.optim.lr_scheduler import _LRScheduler

class CosineAnnealingWarmUpRestarts(_LRScheduler):
    def __init__(self, optimizer, T_0, T_mult=1, eta_max=0.1, T_up=0, gamma=1., last_epoch=-1):
        if T_0 <= 0 or not isinstance(T_0, int):
            raise ValueError("Expected positive integer T_0, but got {}".format(T_0))
        if T_mult < 1 or not isinstance(T_mult, int):
            raise ValueError("Expected integer T_mult >= 1, but got {}".format(T_mult))
        if T_up < 0 or not isinstance(T_up, int):
            raise ValueError("Expected non-negative integer T_up, but got {}".format(T_up))
        self.T_0 = T_0
        self.T_mult = T_mult
        self.base_eta_max = eta_max
        self.eta_max = eta_max
        self.T_up = T_up
        self.T_i = T_0
        self.gamma = gamma
        self.cycle = 0
        self.T_cur = last_epoch
        super(CosineAnnealingWarmUpRestarts, self).__init__(optimizer, last_epoch)
    
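    def get_lr(self):
        if self.T_cur == -1:
            # Before the first step: keep the optimizer's initial learning rates.
            return self.base_lrs
        elif self.T_cur < self.T_up:
            # Warm-up: increase linearly from base_lr to the current eta_max.
            return [(self.eta_max - base_lr)*self.T_cur / self.T_up + base_lr for base_lr in self.base_lrs]
        else:
            # Cosine annealing: decay from eta_max back toward base_lr over the rest of the cycle.
            return [base_lr + (self.eta_max - base_lr) * (1 + math.cos(math.pi * (self.T_cur-self.T_up) / (self.T_i - self.T_up))) / 2
                    for base_lr in self.base_lrs]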

    def step(self, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
            self.T_cur = self.T_cur + 1
            if self.T_cur >= self.T_i:
                self.cycle += 1
                self.T_cur = self.T_cur - self.T_i
                self.T_i = (self.T_i - self.T_up) * self.T_mult + self.T_up
        else:
            if epoch >= self.T_0:
                if self.T_mult == 1:
                    self.T_cur = epoch % self.T_0
                    self.cycle = epoch // self.T_0
                else:
                    n = int(math.log((epoch / self.T_0 * (self.T_mult - 1) + 1), self.T_mult))
                    self.cycle = n
                    self.T_cur = epoch - self.T_0 * (self.T_mult ** n - 1) / (self.T_mult - 1)
                    self.T_i = self.T_0 * self.T_mult ** (n)
            else:
                self.T_i = self.T_0
                self.T_cur = epoch
                
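        # Shrink the peak learning rate by a factor of gamma for every completed cycle.
        self.eta_max = self.base_eta_max * (self.gamma**self.cycle)
        self.last_epoch = math.floor(epoch)
        # Write the new learning rate into every parameter group of the optimizer.
        for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()):
            param_group['lr'] = lr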

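import torch.optim as optim

# `model` is assumed to be an existing torch.nn.Module.
# lr=0 is the base learning rate: warm-up starts from it and each cycle decays back toward it.
optimizer = optim.Adam(model.parameters(), lr=0)
scheduler = CosineAnnealingWarmUpRestarts(optimizer, T_0=150, T_mult=1, eta_max=0.1, T_up=10, gamma=0.5)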
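The learning rate these settings produce can be traced without PyTorch. The sketch below is a plain-Python restatement of the warm-up and cosine formulas for the T_mult=1 case only. The helper name warmup_cosine_lr is hypothetical, and it assumes T_up < T_0 and integer epochs.

```python
import math

def warmup_cosine_lr(epoch, base_lr, eta_max, T_0, T_up, gamma):
    """Learning rate at an integer epoch for fixed-length cycles (T_mult = 1)."""
    cycle, T_cur = divmod(epoch, T_0)
    peak = eta_max * gamma ** cycle  # the maximum shrinks by gamma each cycle
    if T_cur < T_up:
        # Linear warm-up from base_lr to the current peak.
        return base_lr + (peak - base_lr) * T_cur / T_up
    # Cosine decay from the peak back toward base_lr.
    progress = (T_cur - T_up) / (T_0 - T_up)
    return base_lr + (peak - base_lr) * (1 + math.cos(math.pi * progress)) / 2

lrs = [
    warmup_cosine_lr(e, base_lr=0.0, eta_max=0.1, T_0=150, T_up=10, gamma=0.5)
    for e in range(300)
]
# Start of cycle 1, middle of warm-up, first peak, start of cycle 2, second peak.
print(lrs[0], lrs[5], lrs[10], lrs[150], lrs[160])
```

The first cycle warms up to 0.1 over 10 epochs, and the second cycle peaks at half that value because gamma=0.5.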