NeuZephyr
Simple DL Framework
nz::opt::AdaDelta Class Reference

AdaDelta optimizer for deep learning models. More...

Inheritance diagram for nz::opt::AdaDelta:
Collaboration diagram for nz::opt::AdaDelta:

Public Member Functions

 AdaDelta (Tensor::value_type rho)
 Constructs an AdaDelta optimizer with a specified decay rate.
 
void step (Node *input) override
 Performs a single optimization step using the AdaDelta algorithm.
 
- Public Member Functions inherited from nz::opt::Optimizer
 Optimizer ()=default
 Default constructor for the Optimizer class.
 
virtual ~Optimizer ()=default
 Default destructor for the Optimizer class.
 

Detailed Description

AdaDelta optimizer for deep learning models.

The AdaDelta class implements the AdaDelta optimization algorithm, which is a variant of the Adagrad optimizer that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulation to a fixed size, allowing for more robust updates and addressing the diminishing learning rate problem.

This class extends the Optimizer base class and provides a concrete implementation of the step method, which updates the model's parameters (represented as Node objects) using the AdaDelta algorithm.

  • The optimizer maintains two accumulators for each parameter (Node):
    • \( E[g^2]_t \): the exponentially decaying average of past squared gradients.
    • \( E[\Delta x^2]_t \): the exponentially decaying average of past squared parameter updates.
  • The accumulators and the update are computed with the following formulas (a host-side reference sketch follows this list):
    \[ E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2 \]
    \[ \Delta x_t = -\frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t \]
    \[ E[\Delta x^2]_t = \rho \, E[\Delta x^2]_{t-1} + (1 - \rho) \, \Delta x_t^2 \]
    where \( g_t \) is the current gradient, \( \rho \) is the decay rate, and \( \epsilon \) is a small constant that prevents division by zero.
  • The model parameters are updated by applying \( \Delta x_t \), i.e. \( x_{t+1} = x_t + \Delta x_t \), which adapts to the ratio of the two accumulators.
  • The optimizer uses GPU-accelerated computations through CUDA to efficiently update parameters, making it suitable for large-scale models.
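The per-element arithmetic behind these formulas can be written out directly. The following is a minimal host-side C++ reference of the update rule, not the framework's CUDA implementation; the function name, the raw-pointer interface, and the default epsilon value are illustrative assumptions.

#include <cmath>
#include <cstddef>

// Host-side reference of the AdaDelta update for one parameter tensor of n
// elements. acc_grad holds E[g^2] and acc_delta holds E[Delta x^2]; both are
// expected to start at zero. Illustrative sketch only, not the framework's kernel.
void adadelta_update_reference(float* param, const float* grad,
                               float* acc_grad, float* acc_delta,
                               std::size_t n, float rho, float epsilon = 1e-6f) {
    for (std::size_t i = 0; i < n; ++i) {
        const float g = grad[i];
        // E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
        acc_grad[i] = rho * acc_grad[i] + (1.0f - rho) * g * g;
        // Delta x_t = -sqrt(E[Delta x^2]_{t-1} + eps) / sqrt(E[g^2]_t + eps) * g_t
        const float dx = -std::sqrt(acc_delta[i] + epsilon) /
                          std::sqrt(acc_grad[i] + epsilon) * g;
        // E[Delta x^2]_t = rho * E[Delta x^2]_{t-1} + (1 - rho) * Delta x_t^2
        acc_delta[i] = rho * acc_delta[i] + (1.0f - rho) * dx * dx;
        // x_{t+1} = x_t + Delta x_t
        param[i] += dx;
    }
}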
Note
  • The optimizer assumes that the model parameters are represented by Node objects, and each node must have associated gradients.
  • The accumulators (acc_grad and acc_delta) are stored per Node object. If a Node does not have existing accumulators, they are initialized to zero tensors.
  • The optimizer utilizes GPU memory for accumulator storage and gradient computation, requiring CUDA support.
  • Ensure that the model parameters have been properly initialized, and gradients are computed before calling this method.

Usage Example:

AdaDelta optimizer(0.95); // rho = 0.95
graph.update(&optimizer); // Assume "graph" is a computation graph whose gradients have been computed and are ready to be applied.
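For finer-grained control, step() can also be called on individual parameter nodes. The loop below is a minimal sketch: only the AdaDelta constructor and step(Node*) come from this class's interface, while the params container, the apply_adadelta helper, and the unqualified Node type are assumptions made for illustration.

#include <vector>

// Minimal sketch of a manual update loop. "params" is assumed to hold the
// trainable nodes of a graph (see Nodes::Node), with gradients already
// computed by a backward pass.
void apply_adadelta(std::vector<Node*>& params) {
    nz::opt::AdaDelta optimizer(0.95f);  // rho = 0.95
    for (Node* param : params) {
        optimizer.step(param);           // applies the AdaDelta rule to this node
    }
}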
See also
Optimizer for the base class that defines the interface for all optimizers.
Nodes::Node for the class representing model parameters.
Author
Mgepahmge (https://github.com/Mgepahmge)
Date
2024/12/07

Definition at line 987 of file Optimizer.cuh.

Constructor & Destructor Documentation

◆ AdaDelta()

nz::opt::AdaDelta::AdaDelta ( Tensor::value_type rho)
explicit

Constructs an AdaDelta optimizer with a specified decay rate.

Initializes the AdaDelta optimization algorithm with a given decay rate (rho). AdaDelta is an adaptive learning rate method that automatically adjusts the learning rate for each parameter, addressing some limitations of traditional stochastic gradient descent methods.

Unlike other adaptive optimization algorithms, AdaDelta does not require an explicit learning rate. Instead, it uses a running average of squared gradients and squared parameter updates to scale the optimization step dynamically.

Parameters
rho: The decay rate that controls the moving window for accumulating gradient statistics. This parameter determines how quickly the algorithm forgets past gradient information. Typically set between 0.9 and 0.999.
Note
  • The rho parameter is analogous to the momentum decay rates in other adaptive optimization algorithms.
  • A value closer to 1 results in a longer memory of past gradients, while a value closer to 0 makes the algorithm more responsive to recent gradients.
  • Default recommended value is often around 0.95; see the illustrative constructions below.
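As a rough rule of thumb, an exponential moving average with decay rate rho averages over a window of about 1/(1 - rho) recent steps. The constructions below are illustrative only; the single-argument constructor is the only part taken from this class, and the window-size comments follow that rule of thumb rather than anything framework-specific.

// Illustrative choices of the decay rate and their approximate effective memory.
nz::opt::AdaDelta reactive(0.90f);   // ~10-step window: responds quickly to recent gradients
nz::opt::AdaDelta standard(0.95f);   // ~20-step window: the commonly recommended default
nz::opt::AdaDelta smooth(0.999f);    // ~1000-step window: very long memory, slow to adapt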
See also
RMSprop and Adam for alternative adaptive optimization algorithms.
AdaDelta::step for the method that applies the AdaDelta update rule.
Author
Mgepahmge (https://github.com/Mgepahmge)
Date
2024/12/07

Definition at line 136 of file Optimizer.cu.

Member Function Documentation

◆ step()

void nz::opt::AdaDelta::step ( Node * input)
override virtual

Performs a single optimization step using the AdaDelta algorithm.

This method updates the model parameters for a given input node using the AdaDelta optimization algorithm. It manages adaptive learning rates by maintaining running accumulators for both gradient and parameter update magnitudes.

The method performs several key operations:

  1. Lazily initializes accumulators for parameter updates and gradients if they don't exist
  2. Prepares CUDA grid and block configurations for parallel parameter updates
  3. Invokes a CUDA kernel to apply the AdaDelta update rule

The lazy initialization of accumulators ensures that each parameter has its own adaptive learning rate, allowing for more flexible and efficient optimization across different model parameters.
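Conceptually, step 3 launches a one-thread-per-element kernel that applies the formulas from the class description. The sketch below is not the framework's krnl::AdaDelta kernel; its name, parameter order, launch configuration, and epsilon handling are assumptions chosen purely to illustrate the shape of the computation.

#include <cstddef>

// Illustrative one-thread-per-element AdaDelta kernel. acc_grad holds E[g^2]
// and acc_delta holds E[Delta x^2], matching the accumulators described above.
__global__ void adadelta_kernel_sketch(float* param, const float* grad,
                                       float* acc_grad, float* acc_delta,
                                       float rho, float epsilon, std::size_t n) {
    const std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i >= n) return;
    const float g = grad[i];
    acc_grad[i] = rho * acc_grad[i] + (1.0f - rho) * g * g;            // E[g^2]_t
    const float dx = -sqrtf(acc_delta[i] + epsilon) /
                      sqrtf(acc_grad[i] + epsilon) * g;                // Delta x_t
    acc_delta[i] = rho * acc_delta[i] + (1.0f - rho) * dx * dx;        // E[Delta x^2]_t
    param[i] += dx;                                                    // apply the update
}

// Launch configuration in the spirit of step 2: a 1D grid covering n elements.
inline void launch_adadelta_sketch(float* param, const float* grad,
                                   float* acc_grad, float* acc_delta,
                                   float rho, float epsilon, std::size_t n) {
    const unsigned block = 256;
    const unsigned grid = static_cast<unsigned>((n + block - 1) / block);
    adadelta_kernel_sketch<<<grid, block>>>(param, grad, acc_grad, acc_delta,
                                            rho, epsilon, n);
}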

Parameters
input: A pointer to the Node object representing the model parameter to be updated. The node must have a valid output tensor and its gradient already computed.
Note
  • This method assumes the input node has a valid gradient stored in its output object.
  • Accumulators for parameter updates and gradients are created on-demand for each unique input node.
  • The method uses CUDA for parallel computation of parameter updates.
  • The algorithm adapts the learning rate based on the historical gradient information.
See also
AdaDelta::AdaDelta() for the constructor that initializes the optimizer's parameters.
krnl::AdaDelta for the CUDA kernel that implements the AdaDelta update rule.
Author
Mgepahmge (https://github.com/Mgepahmge)
Date
2024/12/07

Implements nz::opt::Optimizer.

Definition at line 140 of file Optimizer.cu.


The documentation for this class was generated from the following files:
  • Optimizer.cuh
  • Optimizer.cu