NeuZephyr
Simple DL Framework
nz::opt::AdaGrad Class Reference

AdaGrad optimizer for deep learning models.


Public Member Functions

 AdaGrad (Tensor::value_type learning_rate)
 Constructs an AdaGrad optimizer with the specified learning rate.
 
void step (Node *input) override
 Performs a single optimization step using the AdaGrad algorithm.
 
- Public Member Functions inherited from nz::opt::Optimizer
 Optimizer ()=default
 Default constructor for the Optimizer class.
 
virtual ~Optimizer ()=default
 Default destructor for the Optimizer class.
 

Detailed Description

AdaGrad optimizer for deep learning models.

The AdaGrad class implements the Adaptive Gradient algorithm, which is a popular optimization method that adapts the learning rate for each parameter based on the historical gradients. AdaGrad is known for its ability to handle sparse gradients and adjust learning rates during training.

This class extends the Optimizer base class and provides a concrete implementation of the step method, which updates the model's parameters using the AdaGrad algorithm.

  • The main idea of AdaGrad is to maintain a separate effective learning rate for each parameter by scaling its gradient with the accumulated sum of squares of past gradients. This shrinks the step size for frequently updated parameters while keeping it relatively larger for rarely updated ones (see the sketch after this list).
  • AdaGrad can significantly improve training performance for problems with sparse data or parameters that have widely varying scales.
  • This optimizer is effective for tasks such as natural language processing or training deep learning models with sparse gradients.
  • The optimizer uses parallel GPU processing with CUDA to speed up parameter updates, especially when dealing with large models.
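As a rough illustration of the idea above, the following minimal C++ sketch applies the AdaGrad rule to a single scalar parameter. The names (gss, lr, eps) simply mirror the terms used in this documentation; the function itself is illustrative and is not part of the framework.

#include <cmath>

// Illustrative only: AdaGrad rule for one scalar parameter.
// gss accumulates the sum of squared gradients; eps guards against division by zero.
float adagrad_update(float param, float grad, float& gss,
                     float lr = 0.01f, float eps = 1e-6f) {
    gss += grad * grad;                                // accumulate squared gradient
    return param - lr * grad / (std::sqrt(gss) + eps); // per-parameter scaled step
}

Because gss only grows, the effective step size of a parameter shrinks over time, and it shrinks fastest for parameters that receive large or frequent gradients.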
Note
  • The optimizer assumes that the model parameters are represented by Node objects, and these nodes must have associated gradients for the optimizer to function correctly.
  • The gss map stores the sum of squared gradients for each parameter, which is used to adjust the learning rate.
  • The epsilon term ensures numerical stability when dividing by the sum of squared gradients.

Usage Example:

AdaGrad optimizer(0.01);
graph.update(&optimizer); // Suppose "graph" is a computation graph awaiting gradient updates
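The optimizer can also be driven through step directly. The sketch below is illustrative only: "params" is a hypothetical collection of Node pointers standing in for a model's trainable parameters, each with its gradient already computed.

AdaGrad optimizer(0.01);
for (Node* node : params) { // "params" is hypothetical; each Node holds a parameter and its gradient
    optimizer.step(node);   // apply one AdaGrad update to this parameter
}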
See also
Optimizer for the base class that defines the interface for all optimizers.
Author
Mgepahmge (https://github.com/Mgepahmge)
Date
2024/12/07

Definition at line 458 of file Optimizer.cuh.

Constructor & Destructor Documentation

◆ AdaGrad()

nz::opt::AdaGrad::AdaGrad ( Tensor::value_type learning_rate)
explicit

Constructs an AdaGrad optimizer with the specified learning rate.

This constructor initializes the AdaGrad optimizer with the given learning rate, which is used to control the magnitude of the updates during training. The learning rate determines how much to adjust the model's parameters in response to the computed gradients.

Parameters
learning_rate - The learning rate to be used for parameter updates. It is a scalar value that controls the size of the steps taken during the optimization process. A smaller value makes the updates more conservative, while a larger value can speed up convergence but may cause instability.
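For instance, assuming Tensor::value_type is a floating-point type, two optimizers constructed with different rates trade stability for speed (illustrative values only):

AdaGrad conservative(0.001f); // smaller steps: slower but more stable updates
AdaGrad aggressive(0.1f);     // larger steps: faster convergence, higher risk of instability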
Note
  • The epsilon value used in the AdaGrad algorithm is set to a default of 1e-6 for numerical stability during updates and is not modified by this constructor.
  • The optimizer assumes that the model parameters are represented by Node objects, and the gradients for these nodes will be updated during the step method.
See also
AdaGrad for the full class definition.
Author
Mgepahmge (https://github.com/Mgepahmge)
Date
2024/12/07

Definition at line 46 of file Optimizer.cu.

Member Function Documentation

◆ step()

void nz::opt::AdaGrad::step ( Node * input)
override virtual

Performs a single optimization step using the AdaGrad algorithm.

The step function updates the model parameters represented by the Node object using the AdaGrad optimization algorithm. AdaGrad adapts the learning rate for each parameter by considering the history of gradients, providing faster convergence for sparse gradients.

This method performs the following steps (a minimal CUDA sketch follows the list):

  • Initializes the sum of squared gradients (GSS) for the parameter (Node) if it has not been initialized.
  • Allocates memory on the GPU for storing intermediate results and computes the AdaGrad update for the model parameters.
  • Uses the sum of squared gradients to scale the gradient and update the model parameters.
  • Frees the temporary memory allocated for computations after the update.
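A minimal CUDA sketch of the per-element update these steps describe is shown below. The kernel name, the host driver, and the raw float buffers are illustrative assumptions; they are not the framework's actual implementation in Optimizer.cu.

#include <cuda_runtime.h>
#include <cstdio>

// Illustrative only: one AdaGrad step over n parameter elements.
__global__ void adagrad_step_kernel(float* param, const float* grad, float* gss,
                                    float lr, float eps, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        gss[i] += grad[i] * grad[i];                       // accumulate squared gradient
        param[i] -= lr * grad[i] / (sqrtf(gss[i]) + eps);  // scaled parameter update
    }
}

int main() {
    const int n = 4;
    float h_param[n] = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_grad[n]  = {0.1f, -0.2f, 0.0f, 0.4f};

    float *d_param, *d_grad, *d_gss;
    cudaMalloc(&d_param, n * sizeof(float));
    cudaMalloc(&d_grad,  n * sizeof(float));
    cudaMalloc(&d_gss,   n * sizeof(float));
    cudaMemcpy(d_param, h_param, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_grad,  h_grad,  n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_gss, 0, n * sizeof(float));               // GSS starts at zero

    adagrad_step_kernel<<<1, 256>>>(d_param, d_grad, d_gss, 0.01f, 1e-6f, n);
    cudaMemcpy(h_param, d_param, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) printf("param[%d] = %f\n", i, h_param[i]);
    cudaFree(d_param); cudaFree(d_grad); cudaFree(d_gss);
    return 0;
}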
Parameters
input - A pointer to the Node object representing the model parameters. This object should have gradients stored in its output attribute, which will be used to update the parameters.
Note
  • The Node object is assumed to have a valid output tensor with its gradients already computed.
  • The gss map stores the sum of squared gradients for each parameter, ensuring that the learning rate adapts to the frequency of gradient updates.
  • The epsilon term is used to avoid division by zero and ensure numerical stability when updating the parameters.
  • The method leverages CUDA for parallel computation, which speeds up the update process, especially for large models.
See also
AdaGrad for the class definition and constructor.
Author
Mgepahmge (https://github.com/Mgepahmge)
Date
2024/12/07

Implements nz::opt::Optimizer.

Definition at line 50 of file Optimizer.cu.


The documentation for this class was generated from the following files: