A class for representing and manipulating multidimensional arrays (tensors) in GPU memory.

Public Member Functions
Constructors and Destructors
	Tensor ()
	Default constructor for Tensor.

	Tensor (const shape_type &shape, bool requires_grad=false)
	Constructor that initializes a Tensor with the specified shape.

	Tensor (const shape_type &shape, value_type *data, bool requires_grad=false, bool host=true)
	Constructs a Tensor object with specified shape, data, gradient requirement, and data location.

	Tensor (const shape_type &shape, const std::initializer_list< value_type > &data, bool requires_grad=false)
	Constructs a Tensor object with a specified shape, initializer list data, and gradient requirement.

	Tensor (const Tensor &other)
	Copy constructor for Tensor.

	Tensor (Tensor &&other) noexcept(false)
	Move constructor for Tensor.

Tensor &	operator= (const Tensor &other)
	Assignment operator for Tensor.

Tensor &	operator= (Tensor &&other) noexcept(false)
	Move assignment operator for Tensor.

	~Tensor () noexcept(false)
	Destructor for Tensor.

Getters and Setters
bool	requiresGrad () const noexcept
	Checks whether the tensor requires gradient computation.

shape_type	shape () const noexcept
	Retrieves the shape of the tensor.

size_type	size () const noexcept
	Retrieves the total number of elements in the tensor.

void	setRequiresGrad (bool requires_grad)
	Sets whether the tensor requires gradient computation.

value_type *	data () const noexcept
	Retrieves a pointer to the tensor's data stored in GPU memory.

std::vector< value_type >	hostData () const noexcept
	Retrieves the tensor data from the device to the host and returns it as a std::vector.

value_type *	grad () const
	Retrieves a pointer to the gradient data stored in GPU memory.

std::vector< value_type >	hostGrad () const
	Retrieves the gradient data of the tensor from the device to the host and returns it as a std::vector.

void	dataInject (value_type *data, bool grad=false) const
	Injects data or gradient data into the tensor.

template<typename Iterator >
void	dataInject (Iterator begin, Iterator end, const bool grad=false) const
	Injects data or gradient data into the tensor using iterators.

void	dataInject (const std::initializer_list< value_type > &data, bool grad=false) const
	Injects data or gradient data into the tensor using a std::initializer_list.

Modifiers
void	zeroGrad () const
	Resets the gradient data to zero.

void	randomize (unsigned long long seed=0) const
	Randomizes the tensor's data with a uniform distribution.

void	clear () const
	Clears the tensor's data by setting all elements to zero.

void	fill (value_type value, bool isGrad=false) const
	Fills the tensor's data with a specified value.

void	fillMatrix (value_type value, size_type batch, size_type channels, bool isGrad=false)
	Fill a specific matrix slice within the Tensor with a given value.

void	reshape (const shape_type &shape)
	Reshapes the tensor to the specified shape.

void	transpose ()
	Transposes the tensor by swapping its dimensions and rearranging the data.

void	setData (const shape_type &position, value_type value, bool isGrad=false) const
	Sets the value of an element in the tensor or its gradient at a specified position.

Math
Tensor	operator+ (const Tensor &other) const
	Adds two tensors element-wise and returns the result.

Tensor	operator- (const Tensor &other) const
	Subtracts one tensor from another element-wise and returns the result.

Tensor	operator* (const Tensor &other) const
	Performs matrix multiplication of two tensors (matrices) and returns the result.

Tensor	operator/ (const Tensor &other) const
	Performs element-wise division between two Tensors.

Tensor	operator- () const
	Negates all elements of the tensor and returns the result.

bool	operator== (const Tensor &other) const
	Checks if two Tensor objects are equal.

bool	operator!= (const Tensor &other) const
	Checks if two Tensor objects are not equal.

void	recip () const
	Computes the reciprocal (1/x) of each element in the tensor and updates the tensor in-place.

value_type	sum () const
	Compute the sum of all elements in the Tensor.

value_type	sum (size_type batch, size_type channel) const
	Computes the sum of elements in a specific batch and channel of a Tensor.

value_type	max () const
	Finds the maximum value in the tensor.

value_type	max (size_type batch, size_type channel) const
	Finds the maximum value in a specific batch and channel of the tensor.

value_type	min () const
	Finds the minimum value in the entire tensor.

value_type	min (size_type batch, size_type channel) const
	Finds the minimum value in a specific batch and channel of the tensor.

shape_type	find (value_type value) const
	Finds the first occurrence of a given value in the entire tensor and returns its shape indices.

shape_type	find (value_type value, size_type batch, size_type channel) const
	Finds the first occurrence of a given value in a specific batch and channel of the tensor and returns its shape indices.

value_type	expSum () const
	Compute the sum of the exponential values of all elements in the Tensor.

value_type	expSum (size_t batch, size_t channel) const
	Computes the sum of exponential values of elements in a specific batch and channel of a Tensor.

void	syncData () const
	Synchronize the tensor data by waiting for all CUDA stream write operations to complete.

void	syncGrad () const
	Synchronize the gradient data of the tensor if gradient computation is required.

void	sync () const
	Synchronize both the tensor data and its gradient data.

Printer
std::ostream &	printGrad (std::ostream &os) const
	Prints the gradient values of the tensor to an output stream.

std::ostream &	print (std::ostream &os) const
	Prints the tensor data to an output stream.

Friends
DL_API std::ostream &	operator<< (std::ostream &os, const Tensor &tensor)
	Overloads the `<<` operator to print the tensor's data to an output stream.

DL_API std::istream &	operator>> (std::istream &is, const Tensor &tensor)
	Overloads the `>>` operator to read a tensor's data from an input stream.

Detailed Description

A class for representing and manipulating multidimensional arrays (tensors) in GPU memory.

The Tensor class is designed for high-performance numerical computations in GPU-based environments. It provides a wide range of functionalities, including tensor creation, mathematical operations, memory management, and gradient computation for deep learning tasks.

Type Definitions:

size_type: An alias for unsigned long long, used to represent the size of the tensor. Supports large tensors with up to 64-bit indices.
value_type: An alias for float, representing the data type of the tensor elements. Suitable for most machine learning computations.
shape_type: An alias for std::vector<int>, representing the shape of the tensor (e.g., {2, 3} for a 2x3 matrix).
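
As a rough illustration of how these aliases relate, the sketch below computes the element count implied by a shape. The helper name `element_count` is hypothetical and not part of the library, and treating an empty shape as zero elements is an assumption.

```cpp
#include <cassert>
#include <vector>

// Aliases mirroring the type definitions listed above.
using size_type = unsigned long long;
using value_type = float;
using shape_type = std::vector<int>;

// Hypothetical helper: the element count is the product of all dimensions.
// An empty shape is treated as zero elements (an assumption).
inline size_type element_count(const shape_type& shape) {
    if (shape.empty()) return 0;
    size_type n = 1;
    for (int d : shape) {
        assert(d >= 0 && "dimensions must be non-negative");
        n *= static_cast<size_type>(d);
    }
    return n;
}
```

Under these assumptions a {2, 3} shape yields 6 elements, which is the number of values the later examples inject into a 2x3 tensor.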

Key Features:

Memory Management: Handles GPU memory allocation and deallocation using CUDA.
Flexible Initialization: Supports initialization via shapes, data pointers, initializer lists, and iterators.
Mathematical Operations: Includes overloaded operators (+, -, *, /) and activation functions (ReLU, Sigmoid, Tanh, etc.).
Gradient Support: Tracks gradients for tensors that require gradient computation (requires_grad) to facilitate backpropagation in neural networks.
Shape Transformation: Supports reshaping and transposing tensors.

Usage Example:

using namespace nz::data;
 
// Create a tensor that requires gradient with shape 2x3
Tensor tensor({2, 3}, true);
tensor.fill(1.0f);     // Fill the tensor with value 1.0
 
// Apply element-wise ReLU activation
Tensor result = ReLU(tensor);
std::cout << "ReLU activated tensor:" << std::endl;
std::cout << result << std::endl;        // Print the result of ReLU activation
 
// Perform matrix multiplication (2x3 * 3x2 = 2x2)
Tensor tensor3({3, 2}, true);
tensor3.dataInject({1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f}); // Fill tensor3 with values
Tensor multiplied_result = tensor * tensor3;  // Multiply tensor (2x3) by tensor3 (3x2)
std::cout << "Multiplication result (2x3 * 3x2 = 2x2):" << std::endl;
std::cout << multiplied_result << std::endl;  // Print the result of matrix multiplication
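
To check the multiplication in the example by hand, a host-only reference can be useful. The sketch below is not the library's GPU implementation; `matmul_host` is a hypothetical helper, and it assumes the injected values are stored in row-major order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only reference: C (m x n) = A (m x k) * B (k x n), all row-major.
inline std::vector<float> matmul_host(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < k; ++p)
                c[i * n + j] += a[i * k + p] * b[p * n + j];
    return c;
}
```

For the operands above (a 2x3 matrix of ones and a 3x2 matrix holding 1 through 6), this reference yields {9, 12, 9, 12}, that is, two identical rows [9, 12].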

Note

Rely on RAII for cleanup: the destructor releases GPU memory when a Tensor goes out of scope, so it does not need to be called explicitly.
Tensor size and shape must match during operations to prevent runtime errors.
Requires CUDA-compatible hardware and a properly configured environment.
Most of the methods in this class involve CUDA operations and may throw the nz::CudaException in certain cases.

Author: Mgepahmge (https://github.com/Mgepahmge)

Date: 2024/11/29

Definition at line 134 of file Tensor.cuh.

Constructor & Destructor Documentation

◆ Tensor() [1/6]

nz::data::Tensor::Tensor ( )

Default constructor for Tensor.

Initializes an empty Tensor with no data or shape. This constructor is primarily used as a placeholder or for initializing variables before assigning a valid tensor.

Definition at line 88 of file Tensor.cu.

◆ Tensor() [2/6]

nz::data::Tensor::Tensor	(	const shape_type &	shape,
		bool	requires_grad = false )

explicit

Constructor that initializes a Tensor with the specified shape.

Parameters

shape	A vector representing the dimensions of the tensor.
requires_grad	A boolean indicating whether the tensor requires gradient computation.

This constructor allocates GPU memory for the tensor based on the specified shape. If requires_grad is set to true, additional memory is allocated for storing gradients.

Definition at line 92 of file Tensor.cu.

◆ Tensor() [3/6]

nz::data::Tensor::Tensor	(	const shape_type &	shape,
		value_type *	data,
		bool	requires_grad = false,
		bool	host = true )

explicit

Constructs a Tensor object with specified shape, data, gradient requirement, and data location.

Parameters

shape	A reference to the shape of the tensor (host-to-device). The shape determines the size of the tensor.
data	A pointer to the initial data of the tensor. The data can be either on the host or device depending on the `host` parameter.
requires_grad	A boolean indicating whether the tensor requires gradient computation.
host	A boolean indicating whether the data pointed to by `data` is on the host or device. If true, data is on the host; otherwise, it is on the device.

Returns: None. This is a constructor.

This constructor initializes a Tensor object. It first calculates the total size of the tensor based on the provided shape. Then, it allocates device memory for the tensor's data using cudaMalloc.

Depending on the value of the host parameter, it copies the data from either the host or another device memory location to the newly allocated device memory using cudaMemcpy.

If the requires_grad parameter is true, it also allocates device memory for the gradient data of the tensor. Otherwise, it sets the gradient pointer _grad to nullptr.

For memory management, the constructor allocates device memory for the tensor's data and gradient (if required). The responsibility of freeing this memory lies with the destructor of the Tensor class.

In terms of exception handling, this constructor does not explicitly catch any CUDA errors. If a CUDA operation fails (e.g., cudaMalloc or cudaMemcpy), it will likely lead to undefined behavior in subsequent operations. It is the caller's responsibility to check for CUDA errors using cudaGetLastError or other appropriate methods.

This constructor is a fundamental part of the Tensor class as it initializes the object's internal state.

Exceptions

None	explicitly, but CUDA operations may fail and return an error code.

Note

Ensure that the data pointer is valid and points to enough data to fill the tensor according to the specified shape.
The CUDA runtime environment should be properly initialized before calling this constructor.
This constructor has a time complexity of O(1) for memory allocation and O(n) for data copying, where n is the total number of elements in the tensor (_size).

```cpp
shape_type shape = {2, 3};
value_type data[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
Tensor tensor(shape, data, true, true);
```

Definition at line 104 of file Tensor.cu.

◆ Tensor() [4/6]

nz::data::Tensor::Tensor	(	const shape_type &	shape,
		const std::initializer_list< value_type > &	data,
		bool	requires_grad = false )

explicit

Constructs a Tensor object with a specified shape, initializer list data, and gradient requirement.

Parameters

shape	A reference to the shape of the tensor (host-to-device). The shape determines the dimensions and total size of the tensor.
data	A std::initializer_list containing the initial data for the tensor (host-to-device).
requires_grad	A boolean indicating whether the tensor requires gradient computation.

Returns: None. This is a constructor.

This constructor initializes a Tensor object. First, it calculates the total size of the tensor based on the provided shape. It then checks if the size of the std::initializer_list is sufficient to fill the tensor. If not, it throws a std::invalid_argument exception.

For memory management, it allocates device memory for the tensor's data using cudaMalloc. If the tensor requires gradient computation, it also allocates device memory for the gradient data; otherwise, it sets the gradient pointer to nullptr.

A temporary host buffer is created to hold the data from the std::initializer_list. The data is copied from the initializer list to the host buffer and then transferred from the host buffer to the device memory using cudaMemcpy. After the transfer, the temporary host buffer is deleted to prevent memory leaks.

Regarding exception handling, it throws a std::invalid_argument if the initializer list size is insufficient. Any CUDA errors during memory allocation or data transfer are not explicitly caught here, and it's the caller's responsibility to check for CUDA errors.

This constructor is an important part of the Tensor class as it provides a convenient way to initialize a tensor with an initializer list.

Exceptions

std::invalid_argument If the size of the std::initializer_list is less than the size of the tensor.

Note

Ensure that the std::initializer_list contains enough elements to fill the tensor according to the specified shape.
The CUDA runtime environment should be properly initialized before calling this constructor.
The time complexity of this constructor is O(n), where n is the total number of elements in the tensor, due to the loop that copies data from the initializer list to the host buffer.

```cpp
#include <iostream>
#include <stdexcept>
#include <vector>
 
shape_type shape = {2, 3};
try {
    Tensor tensor(shape, {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f}, true);
} catch (const std::invalid_argument& e) {
    std::cerr << e.what() << std::endl;
}
```
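
The size check and host staging step described above can be sketched without CUDA. This is an illustration only: `stage_initializer_list` is a hypothetical name, the final cudaMemcpy to the device is omitted, and ignoring surplus elements is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <stdexcept>
#include <vector>

// Hypothetical staging step: validate the list, then copy it into a host buffer.
inline std::vector<float> stage_initializer_list(const std::vector<int>& shape,
                                                 std::initializer_list<float> data) {
    std::size_t size = 1;
    for (int d : shape) size *= static_cast<std::size_t>(d);
    if (data.size() < size) {
        throw std::invalid_argument("initializer list is smaller than the tensor");
    }
    // Copy only the first `size` elements; extras are ignored (assumption).
    std::vector<float> staged(data.begin(), data.begin() + size);
    assert(staged.size() == size);
    return staged;  // a real implementation would now copy this to the device
}
```

Using std::vector for the staging buffer means it is released automatically, whereas the constructor described above deletes its temporary buffer manually.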

Definition at line 123 of file Tensor.cu.

◆ Tensor() [5/6]

nz::data::Tensor::Tensor ( const Tensor & other )

Copy constructor for Tensor.

Parameters

other The Tensor object to copy from.

Performs a deep copy of the tensor, including its shape, data, and gradient (if applicable).

Definition at line 146 of file Tensor.cu.

◆ Tensor() [6/6]

nz::data::Tensor::Tensor ( Tensor && other )

Move constructor for Tensor.

Parameters

other The Tensor object to move from.

Moves the tensor data and ownership of the GPU memory to the new object.
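
The difference between the deep copy and the ownership transfer can be shown with a host-only stand-in. `HostBuffer` below is hypothetical and owns ordinary heap memory rather than GPU memory. It also marks its move operations noexcept, whereas the Tensor move constructor is declared noexcept(false).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical host-side stand-in for a buffer that owns its storage.
class HostBuffer {
public:
    explicit HostBuffer(std::size_t n)
        : size_(n), data_(n ? new float[n]() : nullptr) {}

    // Deep copy: allocate fresh storage and duplicate every element.
    HostBuffer(const HostBuffer& other)
        : size_(other.size_),
          data_(other.size_ ? new float[other.size_] : nullptr) {
        std::copy(other.data_, other.data_ + other.size_, data_);
    }

    // Move: take over the source's storage and leave the source empty.
    HostBuffer(HostBuffer&& other) noexcept
        : size_(other.size_), data_(other.data_) {
        other.size_ = 0;
        other.data_ = nullptr;
    }

    // Copy-and-swap covers both copy and move assignment.
    HostBuffer& operator=(HostBuffer other) noexcept {
        std::swap(size_, other.size_);
        std::swap(data_, other.data_);
        return *this;
    }

    ~HostBuffer() { delete[] data_; }

    float* data() const noexcept { return data_; }
    std::size_t size() const noexcept { return size_; }

private:
    std::size_t size_;
    float* data_;
};
```

After a copy the two objects hold distinct storage with equal contents. After a move the destination holds the original pointer and the source is left empty.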

Definition at line 161 of file Tensor.cu.

◆ ~Tensor()

nz::data::Tensor::~Tensor ( )

Destructor for Tensor.

Releases all GPU memory allocated for the tensor's data and gradient. Ensures that no memory leaks occur during the lifetime of the Tensor object.

Definition at line 210 of file Tensor.cu.

Member Function Documentation

◆ clear()

void nz::data::Tensor::clear ( ) const

Clears the tensor's data by setting all elements to zero.

This function resets the tensor's data to zero by filling the memory allocated for the tensor's data with zero values. It uses the cudaMemset function to set all the values in the tensor's GPU memory to zero. This is commonly used to clear or reset the tensor before using it for new computations.

Note

This function does not deallocate the memory; it only sets the values in the tensor's data to zero.
The tensor's data memory is assumed to be allocated before calling this function. This is automatically managed when the tensor is created, so no additional memory allocation is needed.

```cpp
Tensor tensor({2, 3});  // Create a tensor with shape 2x3
tensor.clear();         // Clear the tensor's data by setting all elements to zero
```

Definition at line 302 of file Tensor.cu.

◆ data()

Tensor::value_type * nz::data::Tensor::data ( ) const

nodiscard noexcept

Retrieves a pointer to the tensor's data stored in GPU memory.

Returns: A value_type* (pointer to float) pointing to the tensor's data in GPU memory.

This function provides direct access to the raw data of the tensor stored in GPU memory. It is useful for low-level operations or when interfacing with other libraries that require access to the tensor's memory.

Note

The returned pointer points to GPU memory, so it cannot be directly dereferenced in CPU code.
Ensure that CUDA synchronization is handled properly before using this pointer in GPU operations.

```cpp
Tensor tensor({2, 3});
const float* gpu_data = tensor.data(); // Access raw data
// Use gpu_data in CUDA kernels or other GPU-based operations
```

Definition at line 432 of file Tensor.cu.

◆ dataInject() [1/3]

void nz::data::Tensor::dataInject	(	const std::initializer_list< value_type > &	data,
		bool	grad = false ) const

Injects data or gradient data into the tensor using a std::initializer_list.

Parameters

data	A std::initializer_list containing the data to be injected (host-to-device).
grad	A boolean indicating whether to inject gradient data.

Returns: void

This overload is a thin wrapper: it forwards the begin and end iterators of the provided std::initializer_list to the iterator-based dataInject overload, which performs the actual injection. It manages no memory of its own and propagates any exception thrown by that overload without additional handling.

Exceptions

std::runtime_error	If the length of the input array is less than the size of the tensor.
nz::CudaException	If the CUDA memory copy fails or if the tensor does not require gradients when trying to inject gradient data.

Note

The std::initializer_list should contain enough elements to fill the tensor.
This function has a time complexity of O(1) for the wrapper itself, but the overall complexity depends on the underlying dataInject function which is O(n) where n is the size of the tensor.

```cpp
Tensor tensor({1,3});
try {
    tensor.dataInject({1.0f, 2.0f, 3.0f}, false);
} catch (const std::runtime_error& e) {
    std::cerr << e.what() << std::endl;
}
```

Definition at line 241 of file Tensor.cu.

◆ dataInject() [2/3]

template<typename Iterator >

void nz::data::Tensor::dataInject	(	Iterator	begin,
		Iterator	end,
		const bool	grad = false ) const

inline

Injects data or gradient data into the tensor using iterators.

Parameters

begin	An iterator pointing to the beginning of the input data range (host-to-device).
end	An iterator pointing to the end of the input data range (host-to-device).
grad	A boolean indicating whether to inject gradient data. Defaults to false.

Returns: void

This function injects data or gradient data into the tensor using the provided iterator range. First, it checks if the length of the input range (determined by std::distance(begin, end)) is at least as large as the size of the tensor (_size). If not, it throws a std::runtime_error.

For memory management, it allocates a temporary host array host_data of size _size to store the data from the iterator range. The data is then copied from the iterator range to this temporary array. After that, it calls the dataInject function with the temporary array and the grad flag.

In case of an exception during the call to the dataInject function, the temporary array is deleted to prevent memory leaks. Finally, the temporary array is deleted after the call to dataInject returns successfully.

The exception handling mechanism catches any nz::CudaException or std::runtime_error thrown by the dataInject function and rethrows it after cleaning up the temporary memory.

This function is closely related to the Tensor class and the other dataInject function as it uses the other dataInject function to perform the actual data injection.

Exceptions

std::runtime_error	If the length of the input array is less than the size of the tensor.
nz::CudaException	If the CUDA memory copy fails or if the tensor does not require gradients when trying to inject gradient data.

Note

The iterators begin and end should be valid and form a proper range.
The input data should be convertible to the value_type of the tensor.
The time complexity of this function is O(n), where n is the size of the tensor (_size), due to the loop that copies data from the iterator range to the temporary array.

```cpp
#include <iostream>
#include <stdexcept>
#include <vector>
Tensor tensor({1,3});
std::vector<float> data = {1.0f, 2.0f, 3.0f};
try {
    tensor.dataInject(data.begin(), data.end(), false);
} catch (const std::runtime_error& e) {
    std::cerr << e.what() << std::endl;
}
```
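
The range check and staging logic described above can be sketched on the host. This is not the library's code: `inject_raw` and `inject_range` are hypothetical names, the destination is a std::vector standing in for device memory, and a std::vector staging buffer replaces the manual new[]/delete[] with catch-and-rethrow, so cleanup happens on every exit path.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the pointer-based overload: copies dst.size()
// floats into dst, where the real function would issue a cudaMemcpy.
inline void inject_raw(const float* src, std::vector<float>& dst) {
    for (std::size_t i = 0; i < dst.size(); ++i) dst[i] = src[i];
}

template <typename Iterator>
void inject_range(Iterator begin, Iterator end, std::vector<float>& dst) {
    const auto available = std::distance(begin, end);
    if (available < 0 || static_cast<std::size_t>(available) < dst.size()) {
        throw std::runtime_error("input range is smaller than the tensor");
    }
    // The vector owns the staging memory, so it is freed even if inject_raw throws.
    std::vector<float> staging;
    staging.reserve(dst.size());
    for (std::size_t i = 0; i < dst.size(); ++i, ++begin) {
        staging.push_back(static_cast<float>(*begin));  // convert to value_type
    }
    assert(staging.size() == dst.size());
    inject_raw(staging.data(), dst);
}
```

Surplus input elements are ignored in this sketch, which matches the "at least as large" check described above.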

Definition at line 567 of file Tensor.cuh.

◆ dataInject() [3/3]

void nz::data::Tensor::dataInject	(	value_type *	data,
		bool	grad = false ) const

Injects data or gradient data into the tensor.

Parameters

data	A pointer to the data to be injected (host-to-device).
grad	A boolean indicating whether to inject gradient data.

Returns: void

This function is responsible for injecting data or gradient data into the tensor. For memory management, it uses cudaMemcpy to copy data from the host to the device. If the grad parameter is true, it tries to copy data to the gradient buffer (_grad). If the tensor does not require gradients (_requires_grad is false), it throws an exception. If the grad parameter is false, it copies data to the main data buffer (_data).

The exception handling mechanism is in place to catch any CUDA memory copy errors. If the cudaMemcpy operation fails, it throws a nz::CudaException with an appropriate error message.

This function is closely related to the Tensor class as it modifies the internal data of the tensor.

Exceptions

nz::CudaException If the CUDA memory copy fails or if the tensor does not require gradients when trying to inject gradient data.

Note

The input data pointer data should point to a valid memory location with enough data to fill the tensor.
Ensure that the CUDA environment is properly initialized before calling this function.

Warning: This function is not safe. If the length of the input array pointed to by data is less than the size of the tensor, it will lead to undefined behavior and potentially cause unknown issues in the program.

```cpp
Tensor tensor({1, 3});
float data[] = {1.0f, 2.0f, 3.0f};
try {
    tensor.dataInject(data, false);
} catch (const std::runtime_error& e) {
    std::cerr << e.what() << std::endl;
}
```

Definition at line 282 of file Tensor.cu.

◆ expSum() [1/2]

Tensor::value_type nz::data::Tensor::expSum ( ) const

nodiscard

Compute the sum of the exponential values of all elements in the Tensor.

Returns: The sum of the exponential values of all elements in the Tensor as a value of type Tensor::value_type.

This function calculates the sum of the exponential values of all elements in the Tensor. It first configures the CUDA block and grid dimensions. Then, it allocates device memory for intermediate results and host memory to hold the copied results from the device. The krnl::SummationExp CUDA kernel is launched to compute the partial sums of the exponential values on the device. After the kernel execution, the partial sums are transferred from the device to the host using cudaMemcpy. Finally, the partial sums on the host are added together to obtain the total sum, and the allocated host and device memory are freed.

Memory management:

Host memory is allocated for hData using new[] and freed using delete[].
Device memory is allocated for dData using cudaMalloc and freed using cudaFree.

Exception handling:

The CHECK macro is used to handle CUDA API errors. If a CUDA API call fails, the CHECK macro will throw an exception, and the function will terminate.

Relationship with other components:

This function depends on the krnl::SummationExp CUDA kernel to perform the partial sums of exponential values on the device.
It also depends on the CHECK macro to handle CUDA API errors.

Exceptions

[Exception type thrown by CHECK macro] If there are CUDA API errors during memory allocation, kernel execution, or memory copying.

Note

The time complexity of this function is approximately O(n), where n is the number of elements in the Tensor (_size). The CUDA kernel parallelizes the partial sum calculation of exponential values, and the final sum on the host is a linear operation over the number of grid blocks.
Ensure that the CUDA device is properly initialized before calling this function.

```cpp
nz::data::Tensor tensor({2, 3}, true);
// Assume tensor is filled with some values
nz::data::Tensor::value_type exp_sum_result = tensor.expSum();
```
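
The two-stage structure described above (per-block partial sums, then a linear pass on the host) can be emulated in plain C++. This is a host-only sketch, not the krnl::SummationExp kernel: `exp_sum_two_stage` is a hypothetical name, and `block_size` merely plays the role of the CUDA block.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host emulation of a two-stage reduction over exp(x).
inline float exp_sum_two_stage(const std::vector<float>& data, std::size_t block_size) {
    assert(block_size > 0);
    const std::size_t blocks = (data.size() + block_size - 1) / block_size;
    std::vector<float> partial(blocks, 0.0f);  // stands in for the per-block results

    // Stage 1: each "block" accumulates exp(x) over its own slice.
    for (std::size_t b = 0; b < blocks; ++b) {
        const std::size_t first = b * block_size;
        const std::size_t last = std::min(first + block_size, data.size());
        for (std::size_t i = first; i < last; ++i) partial[b] += std::exp(data[i]);
    }

    // Stage 2: linear pass over the partial sums, one entry per block.
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

The second stage touches one value per block, which is why the note above describes the host-side work as linear in the number of grid blocks.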

Definition at line 694 of file Tensor.cu.

◆ expSum() [2/2]

Tensor::value_type nz::data::Tensor::expSum	(	size_t	batch,
		size_t	channel ) const

nodiscard

Computes the sum of exponential values of elements in a specific batch and channel of a Tensor.

Parameters

batch	The batch index. Memory flow: host-to-device (used for index calculation on the host side).
channel	The channel index. Memory flow: host-to-device (used for index calculation on the host side).

Returns: The sum of exponential values of elements in the specified batch and channel of the Tensor.

This function calculates the sum of the exponential values of elements within a particular batch and channel of a Tensor. First, it validates the provided batch and channel indices. If they are out of the valid range of the Tensor's shape, it throws a std::invalid_argument exception.

After validation, it computes the size of the region to be processed based on the Tensor's shape. It then allocates device memory for intermediate results (dData) and host memory (hData) to receive the computed values from the device. The offset in the Tensor's data is determined according to the batch and channel indices.

The krnl::SummationExp kernel is launched to compute the exponential of each element and perform partial summation on the device. The intermediate results are then copied from the device to the host. Finally, the function sums up all the intermediate results on the host, frees the allocated host and device memory, and returns the final sum.

Memory Management Strategy:

On the host side, an array hData of size grid.x is dynamically allocated using new[] and later freed using delete[].
On the device side, memory for dData is allocated using cuStrm::StreamManager<value_type>::Instance().malloc and freed using cuStrm::StreamManager<value_type>::Instance().free.

Exception Handling Mechanism:

Throws a std::invalid_argument exception if the provided batch or channel indices are out of the valid range of the Tensor's shape.
The CUDA memory allocation, copying, and kernel launch operations may return error codes indicating failures. It is assumed that the calling code or the CUDA runtime will handle these errors appropriately.

Relationship with Other Components:

Depends on the _shape member of the Tensor class to get the shape information and strides.
Uses the krnl::SummationExp kernel to perform the exponential calculation and partial summation on the device.
Relies on cuStrm::StreamManager<value_type>::Instance() for CUDA memory management (malloc, memcpy, free) operations.

Exceptions

std::invalid_argument If the provided batch or channel indices are out of the valid range of the Tensor's shape.

Note

Ensure that the provided batch and channel indices are within the valid range of the Tensor's shape to avoid exceptions.
Be aware of potential CUDA errors during memory allocation, copying, and kernel launch operations and handle them appropriately in the calling code.

```cpp
Tensor tensor; // Assume Tensor is properly initialized
size_t batch = 0;
size_t channel = 1;
Tensor::value_type expSumResult = tensor.expSum(batch, channel);
```
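
The index validation and offset calculation can be sketched as follows. This assumes a contiguous 4-D (batch, channel, height, width) layout, in line with the index order used elsewhere on this page. `SliceRange` and `slice_range` are hypothetical names, not library API.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Start offset and element count of one (batch, channel) matrix slice.
struct SliceRange {
    std::size_t offset;
    std::size_t count;
};

// Assumes a contiguous {batch, channel, height, width} layout.
inline SliceRange slice_range(const std::vector<int>& shape,
                              std::size_t batch, std::size_t channel) {
    if (shape.size() != 4) {
        throw std::invalid_argument("expected a 4-D shape");
    }
    assert(shape[2] >= 0 && shape[3] >= 0);
    const std::size_t channels = static_cast<std::size_t>(shape[1]);
    if (batch >= static_cast<std::size_t>(shape[0]) || channel >= channels) {
        throw std::invalid_argument("batch or channel index out of range");
    }
    const std::size_t plane =
        static_cast<std::size_t>(shape[2]) * static_cast<std::size_t>(shape[3]);
    return {(batch * channels + channel) * plane, plane};
}
```

For a {2, 3, 4, 5} shape, batch 1 and channel 2 map to offset 100 with 20 elements under this layout.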

Definition at line 713 of file Tensor.cu.

◆ fill()

void nz::data::Tensor::fill	(	value_type	value,
		bool	isGrad = false ) const

Fills the tensor's data with a specified value.

This function sets all elements in the tensor's data to the specified value. It uses the cudaMemset function to fill the GPU memory allocated for the tensor with the provided value. This is commonly used to initialize a tensor with a constant value.

Parameters

value	The value to which all elements of the tensor will be set. This value is copied to every element in the tensor's data.
isGrad	A boolean flag indicating whether to fill the gradients or the data. If true, gradients are filled; otherwise, data is filled (host-to-device).

Note

This function does not deallocate the memory; it only sets the values in the tensor's data to the specified value.
The tensor's data memory is assumed to be allocated before calling this function. This is automatically managed when the tensor is created, so no additional memory allocation is needed.

```cpp
Tensor tensor({2, 3});  // Create a tensor with shape 2x3
tensor.fill(5.0f);      // Fill the tensor's data with the value 5.0f
```

Definition at line 306 of file Tensor.cu.

◆ fillMatrix()

void nz::data::Tensor::fillMatrix	(	value_type	value,
		size_type	batch,
		size_type	channels,
		bool	isGrad = false )

Fill a specific matrix slice within the Tensor with a given value.

This method allows users to populate a particular matrix slice of the Tensor (specified by batch and channels) with a provided value. It also supports filling the gradient matrix if the Tensor requires gradient computation and the isGrad flag is set.

Parameters

value	The value used to fill the matrix slice. Memory flow: host-to-function, passed from the calling code.
batch	The index of the batch. Memory flow: host-to-function, passed from the calling code.
channels	The index of the channels. Memory flow: host-to-function, passed from the calling code.
isGrad	A boolean indicating whether to fill the gradient matrix. Memory flow: host-to-function, passed from the calling code.

Returns: None

Memory Management Strategy:

This function does not allocate or free any additional memory. It operates on the existing _data or _grad buffer of the Tensor.

Exception Handling Mechanism:

Throws std::invalid_argument if the provided batch or channels indices are out of bounds.
Throws std::invalid_argument if isGrad is true but the Tensor does not require gradient computation.

Relationship with Other Components:

Depends on the _shape object to access tensor shape information and calculate offsets.
Relies on the krnl::Fill CUDA kernel to perform the actual filling operation.

Exceptions

std::invalid_argument When batch or channels is out of bounds, or when trying to fill the gradients of a Tensor that does not require gradients.

Note

The time complexity of this function is O(n), where n is the number of elements in the matrix slice (_shape[2] * _shape[3]).
Ensure that the krnl::Fill CUDA kernel is properly implemented and the CUDA environment is set up correctly.
Verify that the _shape object provides accurate shape and stride information.

Warning

Incorrect CUDA kernel usage may lead to runtime errors or undefined behavior.

```cpp
Tensor tensor;  // Assume tensor is properly initialized with a 4D shape
Tensor::value_type fillValue = 2.0f;
Tensor::size_type batchIndex = 0;
Tensor::size_type channelIndex = 1;
bool fillGrad = false;
try {
    tensor.fillMatrix(fillValue, batchIndex, channelIndex, fillGrad);
} catch (const std::invalid_argument& e) {
    std::cerr << e.what() << std::endl;
}
```
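The slice selected by batch and channels can be pictured on the host as a contiguous run of _shape[2] * _shape[3] elements. The following is a minimal CPU-only sketch, assuming a contiguous row-major (batch, channel, height, width) layout. The name `fillSlice` and its parameters are hypothetical, and `std::fill_n` merely stands in for the krnl::Fill kernel; this is not the library's implementation.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical host-side analogue of fillMatrix (illustration only).
// shape = {batches, channels, height, width}; data is assumed to be
// contiguous, row-major, and large enough to hold the whole tensor.
void fillSlice(std::vector<float>& data, const std::array<std::size_t, 4>& shape,
               float value, std::size_t batch, std::size_t channel) {
    if (batch >= shape[0] || channel >= shape[1]) {
        throw std::invalid_argument("batch or channel index out of bounds");
    }
    const std::size_t sliceSize = shape[2] * shape[3];                    // elements per matrix
    const std::size_t offset = (batch * shape[1] + channel) * sliceSize;  // start of the slice
    std::fill_n(data.data() + offset, sliceSize, value);
}
```

Only the addressed slice is touched, which is why the cost is proportional to _shape[2] * _shape[3] rather than to the full tensor size.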

Definition at line 316 of file Tensor.cu.

◆ find() [1/2]

Tensor::shape_type nz::data::Tensor::find ( value_type value ) const

nodiscard

Finds the first occurrence of a given value in the entire tensor and returns its shape indices.

This function retrieves the tensor data from the device to the host, then iterates through the data to find the first element equal to the given value. Once found, it calculates the corresponding shape indices (batch, channel, height, width) and returns them.

Parameters

value The value to search for in the tensor. Memory location: host-to-device (used for comparison).

Returns: A Tensor::shape_type object representing the shape indices (batch, channel, height, width) of the first occurrence of the given value in the tensor. Memory flow: device-to-host.

Note

The time complexity of this function is O(n), where n is the number of elements in the tensor (_size), due to the linear traversal of the tensor data.
Ensure that the CUDA runtime environment is properly initialized and the device memory is valid before calling this function, as it depends on hostData().

```cpp
Tensor tensor;  // Assume tensor is properly initialized
Tensor::value_type targetValue = 5.0;
try {
    Tensor::shape_type indices = tensor.find(targetValue);
    std::cout << "The first occurrence of " << targetValue << " is at indices: ("
              << indices[0] << ", " << indices[1] << ", " << indices[2] << ", " << indices[3] << ")" << std::endl;
} catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
}
```
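The conversion from a flat position in the host copy to (batch, channel, height, width) indices can be sketched as follows. This is a minimal illustration that assumes a contiguous row-major layout; `unravelIndex` is a hypothetical helper name, not part of the Tensor API.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: converts a flat offset into (batch, channel, height, width)
// indices, assuming a contiguous row-major layout with shape = {N, C, H, W}.
std::array<std::size_t, 4> unravelIndex(std::size_t flat,
                                        const std::array<std::size_t, 4>& shape) {
    std::array<std::size_t, 4> idx{};
    // Peel off the fastest-varying dimension (width) first, then work outward.
    for (std::size_t d = 4; d-- > 0;) {
        idx[d] = flat % shape[d];
        flat /= shape[d];
    }
    return idx;
}
```

For a shape of {2, 3, 4, 5}, flat offset 119 corresponds to the last element, so the helper yields (1, 2, 3, 4).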

Definition at line 660 of file Tensor.cu.

◆ find() [2/2]

Tensor::shape_type nz::data::Tensor::find	(	value_type	value,
		size_type	batch,
		size_type	channel ) const

nodiscard

Finds the first occurrence of a given value in a specific batch and channel of the tensor and returns its shape indices.

This function first calculates the offset in the tensor data based on the provided batch and channel indices. It then retrieves the tensor data from the device to the host and iterates through the subset of data in the specified batch and channel to find the first element equal to the given value. Once found, it calculates the height and width indices and returns the complete shape indices (batch, channel, height, width).

Parameters

value	The value to search for in the tensor. Memory location: host-to-device (used for comparison).
batch	The batch index. Memory location: host-to-device (used for index calculation).
channel	The channel index. Memory location: host-to-device (used for index calculation).

Returns: A Tensor::shape_type object representing the shape indices (batch, channel, height, width) of the first occurrence of the given value in the specified batch and channel of the tensor. Memory flow: device-to-host.

Exceptions

std::invalid_argument When the batch or channel index is out of bounds.

Note

The time complexity of this function is O(m), where m is the number of elements in the specified batch and channel (_shape[2] * _shape[3]), due to the linear traversal of the subset of the tensor data.
Ensure that the CUDA runtime environment is properly initialized and the device memory is valid before calling this function, as it depends on hostData().
Ensure that the batch and channel indices are within the valid range of the tensor's shape to avoid unexpected behavior.

```cpp
Tensor tensor;  // Assume tensor is properly initialized
Tensor::value_type targetValue = 5.0;
Tensor::size_type batch = 0;
Tensor::size_type channel = 0;
try {
    Tensor::shape_type indices = tensor.find(targetValue, batch, channel);
    std::cout << "The first occurrence of " << targetValue << " in batch " << batch << " and channel " << channel
              << " is at indices: (" << indices[0] << ", " << indices[1] << ", " << indices[2] << ", " << indices[3] << ")" << std::endl;
} catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
}
```

Definition at line 676 of file Tensor.cu.

◆ grad()

Tensor::value_type * nz::data::Tensor::grad ( ) const

nodiscard

Retrieves a pointer to the gradient data stored in GPU memory.

Returns: A value_type* (pointer to float) pointing to the tensor's gradient data in GPU memory.

This function provides access to the gradient data of the tensor, stored in GPU memory. If the tensor does not require gradient computation (requires_grad is false), the function throws a std::runtime_error.

Exceptions

std::runtime_error If the tensor does not require gradient computation.

Note

The returned pointer points to GPU memory and cannot be directly dereferenced in CPU code.
Ensure that CUDA synchronization is handled properly before using this pointer in GPU operations.

```cpp
Tensor tensor({2, 3}, true); // Create a tensor that requires gradients
try {
    const float* grad_data = tensor.grad(); // Access raw gradient data
    // Use grad_data in CUDA kernels or other GPU-based operations
} catch (const std::runtime_error& e) {
    std::cerr << e.what() << std::endl; // Handle error if tensor does not require gradients
}
```

Definition at line 445 of file Tensor.cu.

◆ hostData()

std::vector< Tensor::value_type > nz::data::Tensor::hostData ( ) const

nodiscard noexcept

Retrieves the tensor data from the device to the host and returns it as a std::vector.

This member function is used to transfer the tensor data from the device memory to the host memory. It returns a std::vector containing the tensor data.

Parameters

None

Returns: A std::vector of Tensor::value_type containing the tensor data. Memory flow: device-to-host.

Memory Management Strategy:

A temporary array temp of size _size is dynamically allocated on the host using new.
After the data is copied from the device to the host using cudaMemcpy, a std::vector is constructed from the temporary array.
The temporary array temp is then deleted using delete[] to avoid memory leaks.

Exception Handling Mechanism:

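This function is marked noexcept, so it never propagates an exception to the caller; if an exception were raised inside it (for example, std::bad_alloc from new), std::terminate would be called. The result of cudaMemcpy is not checked, so a failed copy may lead to undefined behavior.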

Relationship with Other Components:

Depends on the syncData() function to synchronize the data before the transfer.
Relies on cudaMemcpy to transfer data from the device to the host.
The internal member variables _data and _size are used to access the device data and its size.

Note

The time complexity of this function is O(n), where n is the number of elements in the tensor (_size).
Ensure that the CUDA runtime environment is properly initialized and the device memory is valid before calling this function.

Warning

If cudaMemcpy fails, the behavior of this function is undefined. Error checking for cudaMemcpy is not performed in this function.

```cpp
Tensor tensor;  // Assume tensor is properly initialized
std::vector<Tensor::value_type> hostData = tensor.hostData();
```
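The temporary-array strategy described under Memory Management Strategy can be sketched on the host as shown below, next to an alternative that copies straight into the vector's storage. This is an illustration only: `copyViaTemp` and `copyDirect` are hypothetical names, and `std::memcpy` stands in for `cudaMemcpy` with `cudaMemcpyDeviceToHost` so that the sketch runs without a GPU.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Mirrors the documented steps: temporary array, copy, vector construction, delete[].
// src stands in for the device buffer; std::memcpy stands in for cudaMemcpy.
std::vector<float> copyViaTemp(const float* src, std::size_t n) {
    float* temp = new float[n];                 // temporary host array
    std::memcpy(temp, src, n * sizeof(float));  // stand-in for the device-to-host copy
    std::vector<float> result(temp, temp + n);  // build the vector from the temporary
    delete[] temp;                              // release the temporary array
    return result;
}

// Alternative: copy directly into the vector's storage, which avoids the second
// allocation and the manual delete[].
std::vector<float> copyDirect(const float* src, std::size_t n) {
    std::vector<float> result(n);
    std::memcpy(result.data(), src, n * sizeof(float));
    return result;
}
```

Both variants return the same data. The direct variant has no raw allocation to leak if constructing the vector throws.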

Definition at line 436 of file Tensor.cu.

◆ hostGrad()

std::vector< Tensor::value_type > nz::data::Tensor::hostGrad ( ) const

nodiscard

Retrieves the gradient data of the tensor from the device to the host and returns it as a std::vector.

This member function transfers the gradient data of the tensor from the device memory to the host memory. It returns a std::vector containing the gradient data of the tensor.

Parameters

None

Returns: A std::vector of Tensor::value_type containing the gradient data. Memory flow: device-to-host.

Memory Management Strategy:

A temporary array temp of size _size is dynamically allocated on the host using new.
After the gradient data is copied from the device to the host using cudaMemcpy, a std::vector is constructed from the temporary array.
The temporary array temp is then deleted using delete[] to avoid memory leaks.

Exception Handling Mechanism:

Throws std::runtime_error if the tensor does not require gradients (_requires_grad is false).
If cudaMemcpy fails, it may lead to undefined behavior, as error checking for cudaMemcpy is not performed in this function.

Relationship with Other Components:

Depends on the syncGrad() function to synchronize the gradient data before the transfer.
Relies on cudaMemcpy to transfer the gradient data from the device to the host.
The internal member variables _grad and _size are used to access the device gradient data and its size.

Exceptions

std::runtime_error When the tensor does not require gradients.

Note

The time complexity of this function is O(n), where n is the number of elements in the tensor (_size).
Ensure that the CUDA runtime environment is properly initialized and the device memory is valid before calling this function.
Ensure that the tensor requires gradients before calling this function to avoid exceptions.

Warning

If cudaMemcpy fails, the behavior of this function is undefined.

```cpp
Tensor tensor;  // Assume tensor is properly initialized
try {
    std::vector<Tensor::value_type> hostGrad = tensor.hostGrad();
} catch (const std::runtime_error& e) {
    std::cerr << e.what() << std::endl;
}
```

Definition at line 452 of file Tensor.cu.

◆ max() [1/2]

Tensor::value_type nz::data::Tensor::max ( ) const

nodiscard

Finds the maximum value in the tensor.

This member function retrieves the tensor data from the device to the host and then iterates through it to find the maximum value.

Parameters

None

Returns: The maximum value of type Tensor::value_type in the tensor. Memory flow: device-to-host.

Note

The time complexity of this function is O(n), where n is the number of elements in the tensor (_size), due to the linear traversal of the tensor data.
Ensure that the CUDA runtime environment is properly initialized and the device memory is valid before calling this function, as it depends on hostData().

```cpp
Tensor tensor;  // Assume tensor is properly initialized
try {
    Tensor::value_type maxVal = tensor.max();
    std::cout << "The maximum value in the tensor is: " << maxVal << std::endl;
} catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
}
```

Definition at line 608 of file Tensor.cu.

◆ max() [2/2]

Tensor::value_type nz::data::Tensor::max	(	size_type	batch,
		size_type	channel ) const

nodiscard

Finds the maximum value in a specific batch and channel of the tensor.

This member function first validates the provided batch and channel indices. If they are valid, it calculates the offset in the tensor data and then finds the maximum value within that subset of the tensor.

Parameters

batch	The batch index. Memory location: host-to-device (used for index calculation).
channel	The channel index. Memory location: host-to-device (used for index calculation).

Returns: The maximum value of type Tensor::value_type in the specified batch and channel of the tensor. Memory flow: device-to-host.

Exceptions

std::invalid_argument When the batch or channel index is out of bounds.

Note

The time complexity of this function is O(m), where m is the number of elements in the specified batch and channel (_shape[2] * _shape[3]), due to the linear traversal of the subset of the tensor data.
Ensure that the CUDA runtime environment is properly initialized and the device memory is valid before calling this function, as it depends on hostData().
Ensure that the batch and channel indices are within the valid range of the tensor's shape to avoid exceptions.

```cpp
Tensor tensor;  // Assume tensor is properly initialized
Tensor::size_type batch = 0;
Tensor::size_type channel = 0;
try {
    Tensor::value_type maxVal = tensor.max(batch, channel);
    std::cout << "The maximum value in batch " << batch << " and channel " << channel << " is: " << maxVal << std::endl;
} catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
}
```
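The validate-then-scan pattern shared by the per-slice max() and min() overloads can be sketched on the host as follows. It assumes the data has already been copied to the host in a contiguous row-major (batch, channel, height, width) layout; `sliceMinMax` is a hypothetical name and returns both extremes in one pass for brevity.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical host-side sketch: returns {min, max} of one (batch, channel) slice.
// host is assumed to hold the full tensor, contiguous and row-major.
std::pair<float, float> sliceMinMax(const std::vector<float>& host,
                                    const std::array<std::size_t, 4>& shape,
                                    std::size_t batch, std::size_t channel) {
    if (batch >= shape[0] || channel >= shape[1]) {
        throw std::invalid_argument("batch or channel index out of bounds");
    }
    const std::size_t sliceSize = shape[2] * shape[3];
    if (sliceSize == 0) {
        throw std::invalid_argument("slice is empty");
    }
    const float* first = host.data() + (batch * shape[1] + channel) * sliceSize;
    const auto extremes = std::minmax_element(first, first + sliceSize);  // linear scan
    return {*extremes.first, *extremes.second};
}
```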

Definition at line 619 of file Tensor.cu.

◆ min() [1/2]

Tensor::value_type nz::data::Tensor::min ( ) const

nodiscard

Finds the minimum value in the entire tensor.

This function retrieves the tensor data from the device to the host and iterates through it to determine the minimum value.

Parameters

None

Returns: The minimum value of type Tensor::value_type in the tensor. Memory flow: device-to-host.

Note

The time complexity of this function is O(n), where n is the number of elements in the tensor (_size), due to the linear traversal of the tensor data.
Ensure that the CUDA runtime environment is properly initialized and the device memory is valid before calling this function, as it depends on hostData().

```cpp
Tensor tensor;  // Assume tensor is properly initialized
try {
    Tensor::value_type minVal = tensor.min();
    std::cout << "The minimum value in the tensor is: " << minVal << std::endl;
} catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
}
```

Definition at line 634 of file Tensor.cu.

◆ min() [2/2]

Tensor::value_type nz::data::Tensor::min	(	size_type	batch,
		size_type	channel ) const

nodiscard

Finds the minimum value in a specific batch and channel of the tensor.

This function first validates the provided batch and channel indices. If they are valid, it calculates the offset in the tensor data and then finds the minimum value within that subset of the tensor.

Parameters

batch	The batch index. Memory location: host-to-device (used for index calculation).
channel	The channel index. Memory location: host-to-device (used for index calculation).

Returns: The minimum value of type Tensor::value_type in the specified batch and channel of the tensor. Memory flow: device-to-host.

Exceptions

std::invalid_argument When the batch or channel index is out of bounds.

Note

The time complexity of this function is O(m), where m is the number of elements in the specified batch and channel (_shape[2] * _shape[3]), due to the linear traversal of the subset of the tensor data.
Ensure that the CUDA runtime environment is properly initialized and the device memory is valid before calling this function, as it depends on hostData().
Ensure that the batch and channel indices are within the valid range of the tensor's shape to avoid exceptions.

```cpp
Tensor tensor;  // Assume tensor is properly initialized
Tensor::size_type batch = 0;
Tensor::size_type channel = 0;
try {
    Tensor::value_type minVal = tensor.min(batch, channel);
    std::cout << "The minimum value in batch " << batch << " and channel " << channel << " is: " << minVal << std::endl;
} catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
}
```

Definition at line 645 of file Tensor.cu.

◆ operator!=()

bool nz::data::Tensor::operator!= ( const Tensor & other ) const

Checks if two Tensor objects are not equal.

Parameters

other The other Tensor object to compare with. Memory flow: device-to-host (the comparison in the operator== function may involve data transfer from device to host).

Returns: Returns true if the two Tensor objects are not equal, false otherwise.

This function checks the inequality of two Tensor objects. It simply negates the result of the operator== function. So, it relies on the implementation of the operator== to determine the equality of the two Tensors.

Memory Management Strategy:

All memory management related to the comparison is handled by the operator== function. This function itself does not allocate or free any memory.

Exception Handling Mechanism:

Any exceptions that may occur during the comparison are handled by the operator== function. This function does not have its own exception handling mechanism.

Relationship with Other Components:

Depends entirely on the operator== function of the Tensor class.

Note

The time complexity of this function is the same as that of the operator== function, which is O(n) where n is the number of elements in the Tensor.

```cpp
Tensor tensor1; // Assume tensor1 is properly initialized
Tensor tensor2; // Assume tensor2 is properly initialized
bool isNotEqual = tensor1 != tensor2;
```

Definition at line 550 of file Tensor.cu.

◆ operator*()

Tensor nz::data::Tensor::operator* ( const Tensor & other ) const

Performs matrix multiplication of two tensors (matrices) and returns the result.

This operator performs matrix multiplication between two tensors (2D matrices) and returns a new tensor containing the result of the multiplication. The number of columns in the first tensor must match the number of rows in the second tensor for matrix multiplication to be valid.

Parameters

other The tensor (matrix) to multiply with the current tensor.

Returns: A new tensor containing the result of the matrix multiplication.

This function checks if the dimensions of the two tensors are compatible for matrix multiplication. If the number of columns in the current tensor does not match the number of rows in the other tensor, it throws an std::invalid_argument exception. It then creates a new tensor to hold the result of the multiplication and uses a CUDA kernel (GeneralMatrixMul) to perform the matrix multiplication in parallel on the GPU.

Exceptions

std::invalid_argument If the matrix dimensions are incompatible for multiplication.

Note

The number of columns in the current tensor (_shape[1]) must match the number of rows in the other tensor (other._shape[0]) for the multiplication to be valid.
This operator uses a CUDA kernel to perform matrix multiplication, and the result is stored in a new tensor, which is returned.
The original tensors are not modified.

```cpp
Tensor tensor1({2, 3});  // Create a 2x3 matrix
Tensor tensor2({3, 2});  // Create a 3x2 matrix
Tensor result = tensor1 * tensor2;  // Multiply the matrices (result will be 2x2)
```
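For reference, the dimension check and the product that the GPU kernel is described as computing can be written as a plain CPU loop. This is a minimal sketch under the assumption of row-major storage; `matmulReference` and its parameters are hypothetical and are not the GeneralMatrixMul kernel.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical CPU reference for C = A * B with row-major storage.
// A is rowsA x colsA, B is rowsB x colsB; the result is rowsA x colsB.
std::vector<float> matmulReference(const std::vector<float>& a, std::size_t rowsA, std::size_t colsA,
                                   const std::vector<float>& b, std::size_t rowsB, std::size_t colsB) {
    if (colsA != rowsB) {
        throw std::invalid_argument("matrix dimensions are incompatible for multiplication");
    }
    std::vector<float> c(rowsA * colsB, 0.0f);
    for (std::size_t i = 0; i < rowsA; ++i) {
        for (std::size_t k = 0; k < colsA; ++k) {
            for (std::size_t j = 0; j < colsB; ++j) {
                c[i * colsB + j] += a[i * colsA + k] * b[k * colsB + j];
            }
        }
    }
    return c;
}
```

A reference of this kind can be useful for checking GPU results on small inputs, for example a 2x3 matrix times a 3x2 matrix producing a 2x2 result.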

Definition at line 343 of file Tensor.cu.

◆ operator+()

Tensor nz::data::Tensor::operator+ ( const Tensor & other ) const

Adds two tensors element-wise and returns the result.

This operator performs element-wise addition of two tensors and returns a new tensor containing the sum of the corresponding elements from the two input tensors.

Parameters

other The tensor to be added to the current tensor.

Returns: A new tensor containing the element-wise sum of the two tensors.

This function checks if the shapes of the two tensors match. If they do not, it throws an std::invalid_argument exception. The function then creates a new tensor to hold the result of the addition and uses a CUDA kernel (MatrixAddKernel) to compute the sum of the tensors' elements in parallel on the GPU.

Exceptions

std::invalid_argument If the shapes of the two tensors do not match.

Note

The tensors must have the same shape. If they do not, an exception is thrown.
This operator uses a CUDA kernel to perform the element-wise addition, and the result is stored in a new tensor, which is returned.
The original tensors are not modified.

```cpp
Tensor tensor1({2, 3});
Tensor tensor2({2, 3});
Tensor result = tensor1 + tensor2;  // Adds the two tensors element-wise
```

Definition at line 331 of file Tensor.cu.

◆ operator-() [1/2]

Tensor nz::data::Tensor::operator- ( ) const

Negates all elements of the tensor and returns the result.

This operator performs element-wise negation of the tensor, returning a new tensor that contains the negated values of the current tensor. Each element in the tensor is multiplied by -1 to compute its negation.

Returns: A new tensor containing the element-wise negation of the current tensor.

This function uses a CUDA kernel (Negation) to perform the negation of each element in the tensor in parallel on the GPU. The result is stored in a new tensor, which is returned.

Note

This operator does not modify the original tensor; it returns a new tensor with the negated values.
The operation is performed element-wise, meaning each individual element is negated.
The operation utilizes GPU parallelization for efficiency.

```cpp
Tensor tensor({2, 3});
Tensor result = -tensor;  // Negates all elements of the tensor
```

Definition at line 498 of file Tensor.cu.

◆ operator-() [2/2]

Tensor nz::data::Tensor::operator- ( const Tensor & other ) const

Subtracts one tensor from another element-wise and returns the result.

This operator performs element-wise subtraction of two tensors and returns a new tensor containing the result of subtracting the corresponding elements of the two input tensors.

Parameters

other The tensor to be subtracted from the current tensor.

Returns: A new tensor containing the element-wise difference of the two tensors.

This function checks if the shapes of the two tensors match. If they do not, it throws an std::invalid_argument exception. The function then creates a new tensor to hold the result of the subtraction and uses a CUDA kernel (MatrixSub) to compute the element-wise subtraction in parallel on the GPU.

Exceptions

std::invalid_argument If the shapes of the two tensors do not match.

Note

The tensors must have the same shape. If they do not, an exception is thrown.
This operator uses a CUDA kernel to perform the element-wise subtraction, and the result is stored in a new tensor, which is returned.
The original tensors are not modified.

```cpp
Tensor tensor1({2, 3});
Tensor tensor2({2, 3});
Tensor result = tensor1 - tensor2;  // Subtracts tensor2 from tensor1 element-wise
```

Definition at line 337 of file Tensor.cu.

◆ operator/()

Tensor nz::data::Tensor::operator/ ( const Tensor & other ) const

Performs element-wise division between two Tensors.

This function overloads the division operator to perform element-wise division between the current Tensor and another Tensor. It broadcasts the shapes of the two Tensors if necessary and creates a new Tensor to store the result.

Parameters

other The Tensor to divide the current Tensor by. Memory flow: host-to-function, as the object is passed from the calling code to the function.

Returns: A new Tensor containing the result of the element-wise division. Memory flow: function-to-host, as the result is returned from the function to the calling code.

Memory Management Strategy:

A new Tensor object result is created within the function to store the result of the division. The memory for this Tensor is managed automatically by its constructor and destructor.

Exception Handling Mechanism:

This function does not explicitly throw exceptions. However, the _shape.Broadcast method or the tensorElementwiseDivide function may throw exceptions if there are issues with shape broadcasting or the division operation.

Relationship with Other Components:

This function depends on the _shape.Broadcast method to handle shape broadcasting between the two Tensors.
It also relies on the tensorElementwiseDivide function to perform the actual element-wise division operation.

Note

The time complexity of this function is O(n), where n is the number of elements in the resulting Tensor after broadcasting. This is because the tensorElementwiseDivide function needs to process each element.
Ensure that the _shape.Broadcast method and the tensorElementwiseDivide function are correctly implemented.

Warning

Division by zero may occur if the other Tensor contains elements equal to zero, which can lead to undefined behavior.

```cpp
Tensor tensor1;  // Assume tensor1 is properly initialized
Tensor tensor2;  // Assume tensor2 is properly initialized
Tensor result = tensor1 / tensor2;
```
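The source does not state which broadcasting rule _shape.Broadcast applies. As an illustration only, the sketch below assumes the common NumPy-style rule: shapes are right-aligned, and two dimensions are compatible when they are equal or one of them is 1. `broadcastShape` is a hypothetical function, not part of the library.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical sketch of NumPy-style shape broadcasting (an assumption; the
// library's _shape.Broadcast may follow a different rule).
std::vector<std::size_t> broadcastShape(std::vector<std::size_t> a, std::vector<std::size_t> b) {
    const std::size_t rank = std::max(a.size(), b.size());
    a.insert(a.begin(), rank - a.size(), std::size_t{1});  // left-pad the shorter shape with 1s
    b.insert(b.begin(), rank - b.size(), std::size_t{1});
    std::vector<std::size_t> out(rank);
    for (std::size_t i = 0; i < rank; ++i) {
        if (a[i] != b[i] && a[i] != 1 && b[i] != 1) {
            throw std::invalid_argument("shapes cannot be broadcast together");
        }
        out[i] = (a[i] == 1) ? b[i] : a[i];  // a dimension of 1 stretches to match the other
    }
    return out;
}
```

Under this assumed rule, shapes {2, 1, 4, 1} and {3, 1, 5} broadcast to {2, 3, 4, 5}, and the element-wise division would then process every element of that result shape.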

Definition at line 352 of file Tensor.cu.

◆ operator=() [1/2]

Tensor & nz::data::Tensor::operator= ( const Tensor & other )

Assignment operator for Tensor.

Parameters

other The Tensor object to assign from.

Performs a deep copy of the tensor, including its shape, data, and gradient (if applicable).

Returns: A reference to the assigned Tensor object.

Definition at line 173 of file Tensor.cu.

◆ operator=() [2/2]

Tensor & nz::data::Tensor::operator= ( Tensor && other )

Move assignment operator for Tensor.

Parameters

other The Tensor object to move from.

Moves the tensor data and ownership of the GPU memory to the new object.

Returns: A reference to the assigned Tensor object.

Definition at line 192 of file Tensor.cu.

◆ operator==()

bool nz::data::Tensor::operator== ( const Tensor & other ) const

Checks if two Tensor objects are equal.

Parameters

other The other Tensor object to compare with. Memory flow: device-to-host (data is copied from device to host for comparison).

Returns: Returns true if the two Tensor objects are equal, false otherwise.

This function compares two Tensor objects for equality. First, it checks if the _requires_grad flags of the two Tensors are the same. If they differ, the function immediately returns false. Then, it compares the shapes of the two Tensors. If the shapes are not equal, the function also returns false.

After that, it allocates host memory for temporary storage of the data from the device memory of both Tensors. It copies the data from the device to the host and compares each element one by one. If any element in the data differs, it frees the allocated host memory and returns false.

If the _requires_grad flag is set to true, it repeats the same process for the gradients of the Tensors. If any element in the gradients differs, it frees the allocated host memory and returns false.

Finally, if all comparisons pass, it frees the allocated host memory and returns true.

Memory Management Strategy:

Two arrays temp and temp_other of size _size are dynamically allocated on the host using new[]. They are freed using delete[] either when a difference is found or at the end of the function.

Exception Handling Mechanism:

The CUDA memory copy operations (cudaMemcpy) may return error codes indicating failures. It is assumed that the calling code or the CUDA runtime will handle these errors appropriately.

Relationship with Other Components:

Depends on the _requires_grad, _shape, _size, _data, and _grad members of the Tensor class.
Uses CUDA memory copy operations (cudaMemcpy) to transfer data from device to host.

Note

Be aware of potential CUDA errors during memory copy operations and handle them appropriately in the calling code.
The function has a time complexity of O(n), where n is the number of elements in the Tensor, due to the element-by-element comparison.

```cpp
Tensor tensor1; // Assume tensor1 is properly initialized
Tensor tensor2; // Assume tensor2 is properly initialized
bool isEqual = tensor1 == tensor2;
```
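The comparison order described above (gradient flag, then shape, then data, then gradients) can be sketched with a host-side stand-in. `HostTensor` is a hypothetical struct whose vectors play the role of the data already copied from the device. Using std::vector instead of new[] and delete[] means no cleanup is needed on the early returns. The sketch also shows operator!= as a plain negation of operator==.

```cpp
#include <vector>

// Hypothetical host-side mirror of a tensor (illustration only).
struct HostTensor {
    bool requiresGrad = false;
    std::vector<int> shape;
    std::vector<float> data;  // stands in for data copied from device memory
    std::vector<float> grad;  // only compared when requiresGrad is true
};

bool operator==(const HostTensor& a, const HostTensor& b) {
    if (a.requiresGrad != b.requiresGrad) return false;  // 1. gradient flags
    if (a.shape != b.shape) return false;                // 2. shapes
    if (a.data != b.data) return false;                  // 3. element-by-element data
    if (a.requiresGrad && a.grad != b.grad) return false;  // 4. gradients, if tracked
    return true;
}

// Inequality relies entirely on operator==.
bool operator!=(const HostTensor& a, const HostTensor& b) { return !(a == b); }
```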

Definition at line 506 of file Tensor.cu.

◆ print()

std::ostream & nz::data::Tensor::print ( std::ostream & os ) const

Prints the tensor data to an output stream.

Parameters

os	The output stream to which the tensor data will be written (host-to-host).

Returns: The output stream after the tensor data has been written.

This function allocates a host buffer using malloc, copies the tensor data from device memory into it using cudaMemcpy, prints the data to the output stream, and then frees the buffer using free. The function does not throw any exceptions under normal circumstances. If cudaMemcpy fails, the behavior depends on the CHECK macro, which is assumed to handle errors appropriately.

Note

The time complexity of this function is O(n), where n is the total number of elements in the tensor.
Ensure that the CUDA environment is properly initialized before calling this function.

```cpp
Tensor tensor;  // Assume tensor is properly initialized
std::ostringstream oss;
tensor.print(oss);
std::cout << oss.str();
```

Definition at line 252 of file Tensor.cu.

◆ printGrad()

std::ostream & nz::data::Tensor::printGrad ( std::ostream & os ) const

Prints the gradient values of the tensor to an output stream.

This function prints the gradient of the tensor (_grad) to the provided output stream (os). The gradient data is first copied from GPU memory to host memory and then printed row by row in a 2D matrix format, with the elements in each row separated by a space.

Parameters

os	The output stream to which the gradient will be printed.

Returns: The same output stream (os), allowing for chaining of stream operations.

This function performs the following steps:

It allocates memory on the host and copies the gradient data from the device to the host.
It uses std::copy to print the gradient values in a matrix format (row by row).
The function prints each row of the gradient, with each value separated by a space.

Note

This function assumes that the gradient data has already been allocated and is valid.
The gradient is copied from device (GPU) memory to host (CPU) memory for printing, which can be inefficient for large tensors.
The function prints each row of the gradient tensor, enclosed in square brackets, with the elements separated by spaces.

```cpp
Tensor tensor({2, 3}, true);  // Create a tensor with gradient support
std::cout << "Gradient: " << std::endl;
tensor.printGrad(std::cout);  // Print the gradient of the tensor
```
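The row-by-row printing with std::copy can be sketched on the host as follows. The exact output format of printGrad is not shown in this section, so the bracket and newline placement here is an assumption; `printRows` is a hypothetical name, and the vector stands in for gradient data already copied from the device.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical sketch: prints a rows x cols row-major buffer, one bracketed row per line.
// The precise formatting is an assumption, not the library's documented output.
std::ostream& printRows(std::ostream& os, const std::vector<float>& host,
                        std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        const float* rowBegin = host.data() + r * cols;
        os << "[";
        // std::ostream_iterator writes each element followed by the delimiter " ".
        std::copy(rowBegin, rowBegin + cols, std::ostream_iterator<float>(os, " "));
        os << "]\n";
    }
    return os;  // returning the stream allows chaining
}
```

Passing a std::ostringstream instead of std::cout captures the text for logging or testing. Note that the delimiter also follows the last element of each row.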

Definition at line 464 of file Tensor.cu.

◆ randomize()

void nz::data::Tensor::randomize ( unsigned long long seed = 0 ) const

Randomizes the tensor's data with a uniform distribution.

This function fills the tensor's data with random values sampled from a uniform distribution in the range [0, 1). The random number generator is initialized using the specified seed to ensure reproducibility. The function uses the curand library to generate random numbers on the GPU.

Parameters

seed	An `unsigned long long` value used to initialize the random number generator. The same seed will produce the same sequence of random numbers, ensuring reproducibility.

This function performs the following steps:

It creates a random number generator using curandCreateGenerator.
It sets the seed for the random number generator using curandSetPseudoRandomGeneratorSeed.
It generates uniform random numbers in the range [0, 1) and fills the tensor's data with these values.

Note

The generated random numbers are uniformly distributed in the range [0, 1).

```cpp
Tensor tensor({2, 3});  // Create a tensor with shape 2x3
tensor.randomize(12345);  // Randomize tensor's data with a seed of 12345
```
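The reproducibility guarantee (same seed, same sequence) can be illustrated without a GPU. The sketch below is a CPU stand-in only: `std::mt19937_64` and `std::uniform_real_distribution` replace the cuRAND generator, so the actual values differ from what randomize() produces, and `randomUniform` is a hypothetical name.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical CPU stand-in for seeded uniform initialization (not cuRAND).
// Mirrors the documented steps: create a generator, seed it, fill the buffer.
std::vector<float> randomUniform(std::size_t n, unsigned long long seed) {
    std::mt19937_64 engine(seed);                             // create and seed the generator
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);   // nominally samples [0, 1)
    std::vector<float> out(n);
    for (float& v : out) {
        v = dist(engine);
    }
    return out;
}
```

Calling the function twice with the same seed yields identical vectors, which is the property that makes seeded initialization useful for repeatable experiments.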

Definition at line 298 of file Tensor.cu.

◆ recip()

void nz::data::Tensor::recip ( ) const

Computes the reciprocal (1/x) of each element in the tensor and updates the tensor in-place.

This function computes the reciprocal (1/x) of each element in the tensor and stores the results back into the original tensor. The operation is performed element-wise, where each element of the tensor is replaced by its reciprocal.

The function utilizes a temporary buffer allocated on the GPU to store the intermediate reciprocal values. After the computation, the updated data is copied back to the original tensor in GPU memory.

Note

This operation is performed element-wise on the tensor's data.
The original tensor is updated with the computed reciprocal values.
The function uses GPU memory for efficient parallel computation.

```cpp
Tensor tensor({2, 3});
tensor.recip();  // Computes the reciprocal of each element in the tensor
```
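
The temporary-buffer pattern described above can be sketched on the host. This is an illustrative analogue rather than the library's kernel; `recipInPlace` is a hypothetical name and a `std::vector` stands in for GPU memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side analogue of Tensor::recip: compute 1/x into a
// temporary buffer, then copy the results back over the original data.
void recipInPlace(std::vector<float>& data) {
    std::vector<float> tmp(data.size());  // temporary buffer
    for (std::size_t i = 0; i < data.size(); ++i) {
        tmp[i] = 1.0f / data[i];  // a zero element yields +/-inf under IEEE 754
    }
    assert(tmp.size() == data.size());
    data = tmp;  // copy the results back
}
```

Because a zero element produces an infinity, it is worth filling or validating the tensor before calling `recip`.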

Definition at line 554 of file Tensor.cu.

◆ requiresGrad()

bool nz::data::Tensor::requiresGrad ( ) const

nodiscard noexcept

Checks whether the tensor requires gradient computation.

Returns: true if the tensor requires gradient computation, false otherwise.

This function allows you to query whether the tensor is marked for gradient tracking, which is essential for backpropagation in neural networks. By default, tensors do not require gradients unless explicitly specified during construction or via setRequiresGrad.

Definition at line 224 of file Tensor.cu.

◆ reshape()

void nz::data::Tensor::reshape ( const shape_type & shape )

Reshapes the tensor to the specified shape.

This function changes the shape of the tensor, adjusting the layout of the data in memory. If the new shape has more elements than the current shape, the extra elements will be initialized to zero. If the new shape has fewer elements, the excess elements will be discarded.

Parameters

shape A shape_type (alias for std::vector<int>) representing the new dimensions of the tensor. The total number of elements in the new shape can be larger or smaller than the current shape.

This function performs the following steps:

It updates the tensor's shape to the new dimensions.
If the new shape requires more elements than the original shape, the new elements are initialized to zero.
If the new shape requires fewer elements, the excess data is discarded.

Note

This function does not reallocate memory. It simply adjusts how the existing data is interpreted based on the new shape.
If the new shape has more elements than the current tensor, the excess elements will be initialized to zero.
If the new shape has fewer elements, data beyond the new size will be discarded.

```cpp
Tensor tensor({2, 3});  // Create a tensor with shape 2x3
tensor.reshape(std::vector<int>({3, 2}));  // Reshape to 3x2; the element count stays at 6, so nothing is padded or discarded
```
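
The pad-with-zeros and discard rules can be sketched on the host. This is a minimal illustration, assuming a flat row-major buffer; `HostTensor` and `elementCount` are hypothetical names and the code is not taken from Tensor.cu.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of elements implied by a shape (product of its dimensions).
std::size_t elementCount(const std::vector<int>& shape) {
    std::size_t n = 1;
    for (int d : shape) {
        assert(d > 0);  // this sketch only handles positive dimensions
        n *= static_cast<std::size_t>(d);
    }
    return n;
}

// Hypothetical host-side analogue of a tensor with a reshape operation.
struct HostTensor {
    std::vector<int> shape;
    std::vector<float> data;

    void reshape(const std::vector<int>& newShape) {
        // resize() keeps the leading elements, zero-initializes any new
        // ones, and drops elements beyond the new size.
        data.resize(elementCount(newShape));
        shape = newShape;
    }
};
```

Reshaping {2, 3} to {3, 2} leaves all six values intact, growing to {2, 4} appends two zeros, and shrinking to {2, 2} keeps only the first four values.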

Definition at line 358 of file Tensor.cu.

◆ setData()

void nz::data::Tensor::setData	(	const shape_type &	position,
		value_type	value,
		bool	isGrad = false ) const

Sets the value of an element in the tensor or its gradient at a specified position.

This member function allows you to set the value of a specific element in the tensor or its gradient. It first validates the position and the gradient setting based on the tensor's requirements.

Parameters

position	The position in the tensor where the value will be set. Memory location: host-to-device.
value	The value to be set at the specified position. Memory location: host-to-device.
isGrad	A boolean indicating whether to set the value in the gradient or the tensor data. Memory location: host-to-device.

Returns: None

Memory Management Strategy:

A temporary array data of size _size is allocated on the host using malloc.
The data from the device (either tensor data or gradient) is copied to the host using cuStrm::StreamManager<value_type>::Instance().memcpy.
After the value is set at the specified position in the host-side data, the updated data is copied back to the device.
The temporary array data is freed using free to avoid memory leaks.

Exception Handling Mechanism:

Throws std::invalid_argument if the position is out of bounds of the tensor's shape.
Throws std::invalid_argument if isGrad is true but the tensor does not require gradients.
If any of the cuStrm::StreamManager operations fail, the behavior is undefined, because this function does not explicitly check for errors.

Relationship with Other Components:

Depends on cuStrm::StreamManager<value_type>::Instance() for memory copying and data synchronization operations.
Relies on the _shape member variable to validate the position and calculate the index in the data array.
Uses the _data and _grad member variables to access the tensor data and its gradient.

Exceptions

std::invalid_argument When the position is out of bounds or when trying to set the gradient of a tensor that does not require gradients.

Note

The time complexity of this function is O(n) due to the memory copying operations, where n is the number of elements in the tensor (_size).
Ensure that the CUDA runtime environment is properly initialized and the device memory is valid before calling this function.
Ensure that the position is within the valid range of the tensor's shape to avoid exceptions.
If setting the gradient, ensure that the tensor requires gradients.

Warning

If any of the cuStrm::StreamManager operations fail, the behavior of this function is undefined.

```cpp
Tensor tensor({1, 1, 2, 3});  // 4-D shape, so the 4-element position below is in bounds
Tensor::shape_type position = {0, 0, 0, 0};
Tensor::value_type value = 1.0;
bool isGrad = false;
try {
    tensor.setData(position, value, isGrad);
} catch (const std::invalid_argument& e) {
    std::cerr << e.what() << std::endl;
}
```
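
The validation and the copy, modify, copy-back sequence can be sketched on the host. This is a sketch under assumptions: the row-major index formula and the names `flatIndex` and `setElement` are illustrative, and the source does not confirm how Tensor.cu computes the index.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: validate a position against a shape and convert it
// to a flat index, assuming row-major layout (last dimension fastest).
std::size_t flatIndex(const std::vector<int>& shape,
                      const std::vector<int>& position) {
    if (position.size() != shape.size()) {
        throw std::invalid_argument("position rank does not match shape rank");
    }
    std::size_t index = 0;
    for (std::size_t d = 0; d < shape.size(); ++d) {
        if (position[d] < 0 || position[d] >= shape[d]) {
            throw std::invalid_argument("position out of bounds");
        }
        index = index * static_cast<std::size_t>(shape[d]) +
                static_cast<std::size_t>(position[d]);
    }
    return index;
}

// Host-side analogue of the copy, modify, copy-back pattern; the vector
// deviceLike stands in for GPU memory.
void setElement(std::vector<float>& deviceLike, const std::vector<int>& shape,
                const std::vector<int>& position, float value) {
    const std::size_t idx = flatIndex(shape, position);  // throws before any copy
    std::vector<float> host(deviceLike);  // "device to host" copy, O(n)
    assert(idx < host.size());
    host[idx] = value;
    deviceLike = host;  // "host to device" copy, O(n)
}
```

The two full-buffer copies are what make a single-element update O(n), which is why bulk updates are better served by one larger transfer than by repeated single-element calls.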

Definition at line 410 of file Tensor.cu.

◆ setRequiresGrad()

void nz::data::Tensor::setRequiresGrad ( bool requires_grad )

Sets whether the tensor requires gradient computation.

Parameters

requires_grad A boolean indicating whether gradient computation is required.

This function allows you to enable or disable gradient tracking for the tensor. If gradient computation is enabled, additional memory may be allocated for storing gradients.

Note: Modifying this setting does not affect any existing gradient data stored in the tensor.

Definition at line 229 of file Tensor.cu.

◆ shape()

Tensor::shape_type nz::data::Tensor::shape ( ) const

nodiscard noexcept

Retrieves the shape of the tensor.

Returns: A shape_type (alias for std::vector<int>) representing the dimensions of the tensor.

The shape provides information about the size of each dimension in the tensor. For example, a tensor with shape {2, 3} represents a 2x3 matrix. The shape is defined during construction or reshaping of the tensor.

Definition at line 225 of file Tensor.cu.

◆ size()

Tensor::size_type nz::data::Tensor::size ( ) const

nodiscard noexcept

Retrieves the total number of elements in the tensor.

Returns: A size_type (alias for unsigned long long) representing the total number of elements.

This function calculates the product of the dimensions in the tensor's shape. For example, a tensor with shape {2, 3} will have a size of 6. This value is useful for memory allocation and tensor operations.

Definition at line 226 of file Tensor.cu.

◆ sum() [1/2]

Tensor::value_type nz::data::Tensor::sum ( ) const

nodiscard

Compute the sum of all elements in the Tensor.

Returns: The sum of all elements in the Tensor as a value of type Tensor::value_type.

This function calculates the sum of all elements in the Tensor using CUDA parallel processing. It first determines the block and grid dimensions for the CUDA kernel. Then, it allocates device memory for intermediate results and host memory to store the results copied from the device. The krnl::Summation CUDA kernel is launched to perform partial sums on the device. After the kernel execution, the partial sums are copied from the device to the host using cudaMemcpy. Finally, the partial sums on the host are added together to obtain the total sum, and the allocated host and device memory are freed.

Memory management:

Host memory is allocated for hData using new[] and freed using delete[].
Device memory is allocated for dData using cudaMalloc and freed using cudaFree.

Exception handling:

The CHECK macro is used to handle CUDA API errors. If a CUDA API call fails, the CHECK macro will throw an exception, and the function will terminate.

Relationship with other components:

This function depends on the krnl::Summation CUDA kernel to perform partial sums on the device.
It also depends on the CHECK macro to handle CUDA API errors.

Exceptions

[Exception type thrown by CHECK macro] If there are CUDA API errors during memory allocation, kernel execution, or memory copying.

Note

The time complexity of this function is approximately O(n), where n is the number of elements in the Tensor (_size). The CUDA kernel parallelizes the partial sum calculation, and the final sum on the host is a linear operation over the number of grid blocks.
Ensure that the CUDA device is properly initialized before calling this function.

```cpp
nz::data::Tensor tensor({2, 3}, true);
// Assume tensor is filled with some values
nz::data::Tensor::value_type sum_result = tensor.sum();
```
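
The two-stage reduction described above (per-block partial sums, then a final sum on the host) can be simulated without CUDA. This is a host-side sketch, not the krnl::Summation kernel; `twoStageSum` and its `blockSize` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side simulation of a two-stage reduction. blockSize stands in for the
// CUDA block dimension; each simulated block produces one partial sum.
float twoStageSum(const std::vector<float>& data, std::size_t blockSize) {
    assert(blockSize > 0);
    // Ceiling division: number of blocks needed to cover the data.
    const std::size_t gridSize = (data.size() + blockSize - 1) / blockSize;
    std::vector<float> partial(gridSize, 0.0f);  // one slot per block

    // Stage 1: on the GPU, the blocks would reduce their slices in parallel.
    for (std::size_t b = 0; b < gridSize; ++b) {
        const std::size_t begin = b * blockSize;
        const std::size_t end = std::min(begin + blockSize, data.size());
        for (std::size_t i = begin; i < end; ++i) {
            partial[b] += data[i];
        }
    }

    // Stage 2: the host adds the gridSize partial sums.
    float total = 0.0f;
    for (float p : partial) {
        total += p;
    }
    return total;
}
```

Only the second stage runs serially, and it touches one value per block, which is why the host-side cost scales with the grid size rather than with the tensor size.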

Definition at line 565 of file Tensor.cu.

◆ sum() [2/2]

Tensor::value_type nz::data::Tensor::sum	(	size_type	batch,
		size_type	channel ) const

nodiscard

Computes the sum of elements in a specific batch and channel of a Tensor.

Parameters

batch	The batch index. This value should be within the valid range of the Tensor's batch dimension. Memory flow: host-to-device (used for index calculation on the host side).
channel	The channel index. This value should be within the valid range of the Tensor's channel dimension. Memory flow: host-to-device (used for index calculation on the host side).

Returns: The sum of elements in the specified batch and channel of the Tensor.

This function calculates the sum of elements in a particular batch and channel of a Tensor. First, it checks if the provided batch and channel indices are valid. If not, it throws a std::invalid_argument exception. Then, it calculates the size of the region to be summed based on the Tensor's shape. It allocates device memory for intermediate results and host memory to receive the intermediate results from the device. It determines the offset in the Tensor's data based on the batch and channel indices. The krnl::Summation kernel is then launched to perform the partial summation on the device. After that, the intermediate results are copied from the device to the host. Finally, the function sums up all the intermediate results on the host, frees the allocated host and device memory, and returns the final sum.

Memory Management Strategy:

On the host side, an array hData of size grid.x is dynamically allocated using new[] and later freed using delete[].
On the device side, memory for dData is allocated using cuStrm::StreamManager<value_type>::Instance().malloc and freed using cuStrm::StreamManager<value_type>::Instance().free.

Exception Handling Mechanism:

Throws a std::invalid_argument exception if the provided batch or channel indices are out of the valid range of the Tensor's shape.
The CUDA memory allocation, copying, and kernel launch operations may return error codes indicating failures. It is assumed that the calling code or the CUDA runtime will handle these errors appropriately.

Relationship with Other Components:

Depends on the _shape member of the Tensor class to get the shape information and strides.
Uses the krnl::Summation kernel to perform the partial summation on the device.
Relies on cuStrm::StreamManager<value_type>::Instance() for CUDA memory management (malloc, memcpy, free) operations.

Exceptions

std::invalid_argument If the provided batch or channel indices are out of the valid range of the Tensor's shape.

Note

Ensure that the provided batch and channel indices are within the valid range of the Tensor's shape to avoid exceptions.
The CUDA operations such as memory allocation, copying, and kernel launch have their own error handling mechanisms. The calling code should be prepared to handle potential CUDA errors.

```cpp
Tensor tensor; // Assume Tensor is properly initialized
Tensor::size_type batch = 0;
Tensor::size_type channel = 1;
Tensor::value_type sumResult = tensor.sum(batch, channel);
```
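
The offset calculation can be sketched on the host. The sketch assumes a 4-D {batch, channel, height, width} shape stored in row-major order, which the source does not state explicitly; `sumBatchChannel` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sums one (batch, channel) slice of a flat buffer, assuming a 4-D
// {batch, channel, height, width} row-major layout.
float sumBatchChannel(const std::vector<float>& data,
                      const std::vector<int>& shape,
                      std::size_t batch, std::size_t channel) {
    assert(shape.size() == 4);
    if (batch >= static_cast<std::size_t>(shape[0]) ||
        channel >= static_cast<std::size_t>(shape[1])) {
        throw std::invalid_argument("batch or channel index out of range");
    }
    // Elements in one slice, and where that slice starts in the buffer.
    const std::size_t region = static_cast<std::size_t>(shape[2]) *
                               static_cast<std::size_t>(shape[3]);
    const std::size_t offset =
        (batch * static_cast<std::size_t>(shape[1]) + channel) * region;
    assert(offset + region <= data.size());

    float total = 0.0f;
    for (std::size_t i = 0; i < region; ++i) {
        total += data[offset + i];
    }
    return total;
}
```

For a {1, 2, 2, 3} tensor holding the values 0 through 11, channel 0 covers indices 0 to 5 and channel 1 covers indices 6 to 11.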

Definition at line 584 of file Tensor.cu.

◆ sync()

void nz::data::Tensor::sync ( ) const

Synchronize both the tensor data and its gradient data.

This function calls the syncData method to synchronize the tensor data and then calls the syncGrad method to synchronize the gradient data if gradient computation is required. It ensures that all CUDA stream write operations on the data and gradient (if applicable) are completed.

Parameters

None

Returns: None

Memory management for the data and gradient is assumed to be handled by the syncData and syncGrad methods respectively. There is no additional memory allocation or deallocation within this function. This function does not have an explicit exception-handling mechanism. It relies on the exception handling of the syncData and syncGrad methods to manage any errors that may occur during the synchronization process.

Note

The time complexity of this function depends on the time complexity of the syncData and syncGrad methods. In the worst-case scenario, if both operations involve long-running CUDA stream write operations, it may take a significant amount of time.

```cpp
// Assume Tensor is defined and an instance is created
Tensor tensor;
tensor.sync();
```

Definition at line 747 of file Tensor.cu.

◆ syncData()

void nz::data::Tensor::syncData ( ) const

Synchronize the tensor data by waiting for all CUDA stream write operations to complete.

This function accesses the singleton instance of cuStrm::StreamManager specialized for the value_type of the Tensor class. It then calls the syncData method of this instance, passing the _data member of the Tensor object. This operation blocks the host until all CUDA stream write operations on the _data are finished.

Parameters

None

Returns: None

Memory management for the _data is assumed to be handled elsewhere in the codebase. There is no memory allocation or deallocation within this function. This function does not have an explicit exception-handling mechanism. It relies on the syncData method of the cuStrm::StreamManager instance to manage any errors that may occur during the synchronization process.

Note

The time complexity of this function depends on the time required for the CUDA stream write operations on _data to complete. In the worst-case scenario, if there are long-running write operations, it may take a significant amount of time.

```cpp
// Assume Tensor is defined and an instance is created
Tensor tensor;
tensor.syncData();
```

Definition at line 737 of file Tensor.cu.

◆ syncGrad()

void nz::data::Tensor::syncGrad ( ) const

Synchronize the gradient data of the tensor if gradient computation is required.

This function first checks the _requires_grad flag of the Tensor object. If the flag is set to true, it accesses the singleton instance of cuStrm::StreamManager specialized for the value_type of the Tensor class. Then it calls the syncData method of this instance, passing the _grad member of the Tensor object. This operation blocks the host until all CUDA stream write operations on the _grad are completed.

Parameters

None

Returns: None

Memory management for the _grad is assumed to be handled elsewhere in the codebase. There is no memory allocation or deallocation within this function. This function does not have an explicit exception-handling mechanism. It relies on the syncData method of the cuStrm::StreamManager instance to manage any errors that may occur during the synchronization process.

Note

The time complexity of this function depends on whether the _requires_grad flag is true and the time required for the CUDA stream write operations on _grad to complete. If _requires_grad is false, the function has a constant time complexity O(1). Otherwise, in the worst-case scenario with long-running write operations, it may take a significant amount of time.

```cpp
// Assume Tensor is defined and an instance is created
Tensor tensor;
tensor.syncGrad();
```

Definition at line 741 of file Tensor.cu.

◆ transpose()

void nz::data::Tensor::transpose ( )

Transposes the tensor by swapping its dimensions and rearranging the data.

This function performs a transpose on the tensor by swapping its rows and columns. For a 2D tensor (matrix), it swaps the first and second dimensions, effectively turning the rows into columns and vice versa. The tensor's data is rearranged using a temporary buffer, and the shape is updated accordingly. The data is first copied to a temporary memory space, then a CUDA kernel is used to perform the transposition.

Note

This function involves memory allocation and data copying. It creates a temporary tensor in GPU memory to hold the transposed data.
After the transposition, the tensor's shape is updated, and the temporary buffer is freed.
As described above, the data is physically rearranged through the temporary buffer; the transpose is not a mere reinterpretation of the existing layout under a new shape.

```cpp
Tensor tensor({2, 3});  // Create a tensor with shape 2x3
tensor.transpose();     // Transpose the tensor to shape 3x2
```
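
The rearrangement through a temporary buffer can be sketched on the host for the 2-D case. This is an illustrative analogue assuming row-major storage, not the CUDA kernel used by the library; `HostMatrix` is a hypothetical type.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side matrix with a transpose that mirrors the
// copy-to-temporary, rearrange, update-shape sequence.
struct HostMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;  // row-major storage

    void transpose() {
        assert(data.size() == rows * cols);
        std::vector<float> tmp(data.size());  // temporary buffer
        for (std::size_t r = 0; r < rows; ++r) {
            for (std::size_t c = 0; c < cols; ++c) {
                // Element (r, c) moves to (c, r) in the cols x rows result.
                tmp[c * rows + r] = data[r * cols + c];
            }
        }
        data.swap(tmp);         // adopt the rearranged data
        std::swap(rows, cols);  // update the shape
    }
};
```

A 2x3 matrix holding 1 through 6 becomes a 3x2 matrix whose buffer reads 1, 4, 2, 5, 3, 6, so the stored order itself changes and not only the shape.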

Definition at line 385 of file Tensor.cu.

◆ zeroGrad()

void nz::data::Tensor::zeroGrad ( ) const

Resets the gradient data to zero.

This function sets the gradient data of the tensor to zero. It is typically used during training in neural networks to clear the gradients before the next backpropagation pass. The gradient memory will remain allocated, but its contents will be zeroed out.

Note

This function does not deallocate the gradient memory; it only resets the stored gradient values.
The tensor must have been created with requires_grad set to true; otherwise, the function does nothing.

```cpp
Tensor tensor({2, 3}, true);  // Create a tensor with gradient support
tensor.zeroGrad();  // Reset the gradients to zero
```
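
The reset-without-deallocation behavior can be sketched on the host. `HostParam` is a hypothetical stand-in for a tensor with a gradient buffer; the library itself operates on GPU memory.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical host-side analogue of a tensor's gradient storage.
struct HostParam {
    std::vector<float> grad;  // empty when gradients are not required
    bool requiresGrad;

    void zeroGrad() {
        if (!requiresGrad) {
            return;  // nothing to reset
        }
        const auto sizeBefore = grad.size();
        std::fill(grad.begin(), grad.end(), 0.0f);  // overwrite, keep the buffer
        assert(grad.size() == sizeBefore);          // the allocation is unchanged
    }
};
```

In a training loop, a call like this would typically run once per iteration before the backward pass, so that gradients from the previous step do not accumulate into the next one.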

Definition at line 246 of file Tensor.cu.

Friends And Related Symbol Documentation

◆ operator<<

DL_API std::ostream & operator<<	(	std::ostream &	os,
		const Tensor &	tensor )

friend

Overloads the << operator to print the tensor's data to an output stream.

This function is a friend of the Tensor class and provides an overloaded version of the output stream operator (<<) to print the contents of a tensor to the specified output stream (e.g., std::cout or a file stream).

The tensor's data is first copied from GPU memory to host memory for printing, and then the data is printed in a 2D matrix format. Each row of the tensor is printed on a new line, and each element in a row is separated by a space. Each row is enclosed in square brackets.

Parameters

os	The output stream to which the tensor will be printed.
tensor	The tensor whose contents will be printed.

Returns: The output stream (os) after the tensor has been printed, allowing for chaining of operations.

Note

This operator works by accessing the tensor's private data members (e.g., _data) directly.
The tensor's data is assumed to be in a valid state (i.e., properly allocated in GPU memory) before printing.
The function copies the tensor's data from device (GPU) memory to host (CPU) memory using cudaMemcpy, which may introduce performance overhead for large tensors.

```cpp
Tensor tensor({2, 3});
tensor.fill(1.0f);  // Fill the tensor with 1.0f
std::cout << tensor << std::endl;  // Prints the tensor to standard output in matrix format
```
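
The row-per-line layout can be sketched with a host buffer. The exact spacing and the trailing newline are assumptions, since the source only states that rows are bracketed and elements are space-separated; `printMatrix` is a hypothetical helper, not the library's operator.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical formatter for a 2-D row-major host buffer: one bracketed row
// per line, elements separated by a single space.
std::ostream& printMatrix(std::ostream& os, const std::vector<float>& host,
                          std::size_t rows, std::size_t cols) {
    assert(host.size() == rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        os << '[';
        for (std::size_t c = 0; c < cols; ++c) {
            if (c != 0) {
                os << ' ';
            }
            os << host[r * cols + c];
        }
        os << "]\n";
    }
    return os;  // returning the stream allows chaining
}
```

Returning the stream by reference is what lets a call be followed by further `<<` operations on the same stream.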

Definition at line 39 of file Tensor.cu.

◆ operator>>

DL_API std::istream & operator>>	(	std::istream &	is,
		const Tensor &	tensor )

friend

Overloads the >> operator to read a tensor's data from an input stream.

This function is a friend of the Tensor class and provides an overloaded version of the input stream operator (>>) to read the contents of a tensor from the specified input stream (e.g., std::cin or a file stream).

The function reads the tensor's data element by element from the input stream and stores the values in a temporary buffer. Once all the data has been read, it is copied from the host memory back into the tensor's GPU memory using cudaMemcpy.

Parameters

is	The input stream from which the tensor's data will be read.
tensor	The tensor into which the data will be read.

Returns: The input stream (is) after reading the tensor's data, allowing for chaining of operations.

Note

This operator works by reading data from the input stream and storing it in a temporary buffer on the host.
The function assumes that the input data matches the size of the tensor. If the data is malformed or does not match, the behavior may be undefined.
After reading, the data is copied from host memory back into the tensor's GPU memory.

```cpp
Tensor tensor({2, 3});
std::cin >> tensor;  // Reads the tensor's data from standard input
```
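
Because malformed or short input is described as potentially undefined, callers may want to validate input before streaming it into a tensor. The sketch below reads into a temporary host buffer and reports failure instead; `readValues` is a hypothetical helper and is not part of the Tensor API.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: read exactly n values into a host buffer. Returns
// false, leaving out untouched, if the stream runs short or a token is not
// a number.
bool readValues(std::istream& is, std::size_t n, std::vector<float>& out) {
    std::vector<float> tmp(n);  // temporary host buffer
    for (float& v : tmp) {
        if (!(is >> v)) {
            return false;
        }
    }
    out.swap(tmp);  // a host-to-device copy would follow at this point
    assert(out.size() == n);
    return true;
}
```

A caller could check the return value first and only then hand the validated buffer to the tensor.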

Definition at line 76 of file Tensor.cu.

The documentation for this class was generated from the following files:

D:/Users/Mgepahmge/Documents/C Program/NeuZephyr/include/NeuZephyr/Tensor.cuh
D:/Users/Mgepahmge/Documents/C Program/NeuZephyr/src/Tensor.cu
