NeuZephyr
Simple DL Framework
nz::opt::Adam Class Reference

Adam optimizer for deep learning models. More...


Public Member Functions

 Adam (Tensor::value_type learning_rate, Tensor::value_type beta1, Tensor::value_type beta2)
 Constructs an Adam optimizer with the specified hyperparameters.
 
void step (Node *input) override
 Performs a single optimization step using the Adam algorithm.
 
- Public Member Functions inherited from nz::opt::Optimizer
 Optimizer ()=default
 Default constructor for the Optimizer class.
 
virtual ~Optimizer ()=default
 Default destructor for the Optimizer class.
 

Detailed Description

Adam optimizer for deep learning models.

The Adam class implements the Adam optimization algorithm, which is an adaptive learning rate optimization method designed for training deep learning models. Adam combines the advantages of two popular optimization techniques: Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp). It uses estimates of first and second moments of gradients to adaptively adjust the learning rate for each parameter.

This class extends the Optimizer base class and provides a concrete implementation of the step method, which updates the model's parameters (represented as Node objects) using the Adam algorithm.

  • The optimizer maintains two moment estimates for each parameter (Node):
    • \( m_t \): the first moment estimate, an exponentially decaying average of past gradients.
    • \( v_t \): the second moment estimate, an exponentially decaying average of past squared gradients.
  • The moment estimates are updated with \[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \] where \( g_t \) is the current gradient and \( \beta_1 \), \( \beta_2 \) are the decay rates for the first and second moments.
  • The model parameters are then updated using the bias-corrected moment estimates \[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \] where \( \eta \) is the learning rate and \( \epsilon \) is a small constant that prevents division by zero. A minimal reference sketch of this update follows this list.
  • The optimizer uses GPU-accelerated computations through CUDA to efficiently update parameters, making it suitable for large-scale models.
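
For clarity, here is a minimal host-side C++ sketch of the per-element update described above. It is illustrative only: NeuZephyr performs this computation in a CUDA kernel, and the function and variable names used here (adam_step_reference, param, grad, m, v) are placeholders rather than library identifiers.

#include <cmath>
#include <cstddef>

// Illustrative CPU reference of a single Adam step over a flat parameter buffer.
// param/grad/m/v are caller-owned arrays of length n; t is the 1-based iteration
// count. None of these names come from the NeuZephyr API.
void adam_step_reference(float* param, const float* grad, float* m, float* v,
                         std::size_t n, int t, float lr,
                         float beta1, float beta2, float eps = 1e-8f) {
    // Bias-correction denominators depend only on the iteration count.
    const float bc1 = 1.0f - std::pow(beta1, static_cast<float>(t));
    const float bc2 = 1.0f - std::pow(beta2, static_cast<float>(t));
    for (std::size_t i = 0; i < n; ++i) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];            // first moment
        v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];  // second moment
        const float m_hat = m[i] / bc1;                            // bias correction
        const float v_hat = v[i] / bc2;
        param[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);         // parameter update
    }
}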
Note
  • The optimizer assumes that the model parameters are represented by Node objects, and each node must have associated gradients.
  • The first and second moment estimates (m and v) are stored per Node object. If a Node does not have existing moments, they are initialized to zero tensors.
  • The optimizer utilizes GPU memory for moment storage and gradient computation, requiring CUDA support.
  • Ensure that the model parameters have been properly initialized and that gradients have been computed before calling step.

Usage Example:

Adam optimizer(0.001, 0.9, 0.999);
graph.update(&optimizer); // Suppose "graph" is a computation graph whose parameters are awaiting a gradient update.
See also
Optimizer for the base class that defines the interface for all optimizers.
Nodes::Node for the class representing model parameters.
Author
Mgepahmge (https://github.com/Mgepahmge)
Date
2024/12/07

Definition at line 707 of file Optimizer.cuh.

Constructor & Destructor Documentation

◆ Adam()

nz::opt::Adam::Adam(Tensor::value_type learning_rate,
                    Tensor::value_type beta1,
                    Tensor::value_type beta2)
explicit

Constructs an Adam optimizer with the specified hyperparameters.

The Adam constructor initializes an instance of the Adam optimizer with the given learning rate, beta1, and beta2 values. These hyperparameters control the behavior of the Adam optimization algorithm:

  • The learning rate determines the step size for parameter updates.
  • Beta1 controls the decay rate for the first moment estimate (moving average of gradients).
  • Beta2 controls the decay rate for the second moment estimate (moving average of squared gradients).

The constructor also initializes the internal iteration counter (it) to zero, which is used for bias correction during the parameter updates.
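
For illustration, a construction call with the commonly used hyperparameter values mentioned in the note below might look like this (the literals are example values, not built-in defaults):

// Typical starting point: learning rate 1e-3, beta1 = 0.9, beta2 = 0.999.
Adam optimizer(0.001f, 0.9f, 0.999f);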

Parameters
    learning_rate  The learning rate (\( \eta \)) used for parameter updates; it controls the step size.
    beta1          The exponential decay rate for the first moment estimate (\( \beta_1 \)). Typical values lie in the range [0.9, 0.99].
    beta2          The exponential decay rate for the second moment estimate (\( \beta_2 \)). Typical values lie in the range [0.99, 0.999].
Note
  • The learning rate, beta1, and beta2 values should be chosen carefully based on the specific task and dataset.
  • Values of 0.9 for beta1 and 0.999 for beta2 are commonly used in practice.
See also
Adam::step for the method that performs parameter updates using the Adam optimizer.
Author
Mgepahmge (https://github.com/Mgepahmge)
Date
2024/12/07

Definition at line 79 of file Optimizer.cu.

Member Function Documentation

◆ step()

void nz::opt::Adam::step(Node *input)
override, virtual

Performs a single optimization step using the Adam algorithm.

The step method updates the model parameters based on the gradients computed during the backward pass. It applies the Adam optimization algorithm, which uses moving averages of the gradients and of their squared values to adaptively adjust the learning rate for each parameter, yielding stable and efficient parameter updates.

The method performs the following steps:

  1. Increments the internal iteration counter (it), which is used for bias correction.
  2. Checks if the first moment estimate (m) and second moment estimate (v) for the given input node exist. If not, it initializes them to zero tensors with the same shape as the node's output.
  3. Launches a CUDA kernel to compute the Adam updates for the parameters, using the current gradient, the moving averages of the gradients (m), and their squared values (v), along with the specified hyperparameters (learning rate, beta1, beta2, epsilon).

This method is designed to be used with a model parameter represented as a Node object and assumes that the node has an associated output tensor and gradient.
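
As a usage sketch, one Adam update could be applied to each trainable parameter as follows. The params container below is hypothetical and not part of the NeuZephyr API; in the documented workflow, graph.update(&optimizer) is expected to invoke step on every parameter node.

// Hedged sketch: manually stepping each trainable parameter node.
// "params" is an assumed std::vector<Node*> collected from the model; every
// node's gradient must already be populated by a preceding backward pass.
Adam optimizer(0.001f, 0.9f, 0.999f);
for (Node* param : params) {
    optimizer.step(param); // updates the node's output tensor in place on the GPU
}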

Parameters
    input  A pointer to the Node object representing the model parameter to be updated. The node must already have an output tensor and a computed gradient.
Note
  • This method operates on the GPU using CUDA to accelerate the parameter update process.
  • It assumes that the input node has a valid gradient stored in its output object.
  • The first moment estimate (m) and second moment estimate (v) are maintained for each node individually.
  • The epsilon value is used to prevent division by zero during the parameter update.
See also
Adam for the class definition and constructor.
Author
Mgepahmge (https://github.com/Mgepahmge)
Date
2024/12/07

Implements nz::opt::Optimizer.

Definition at line 86 of file Optimizer.cu.


The documentation for this class was generated from the following files: Optimizer.cuh and Optimizer.cu.