Different Variants of Gradient Descent

Gradient descent is a fundamental optimization algorithm in machine learning, used to minimize functions by iteratively moving towards the minimum. It’s crucial for training models by fine-tuning parameters to reduce prediction errors.

This article explores the fundamentals of the different variants of gradient descent, along with their advantages and disadvantages.

Different variants of gradient descent, such as Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent, offer various advantages in terms of speed, efficiency, and handling of noisy data. Exploring these variants helps in selecting the best approach for specific optimization tasks.

Batch Gradient Descent

Batch Gradient Descent is a variant of the gradient descent algorithm where the entire dataset is used to compute the gradient of the loss function with respect to the parameters. In each iteration, the algorithm calculates the average gradient of the loss function for all the training examples and updates the model parameters accordingly.

The update rule for batch gradient descent is:

[Tex]\theta = \theta - \eta \nabla J(\theta)[/Tex]

where:

  • [Tex]\theta[/Tex] represents the parameters of the model,
  • [Tex]\eta[/Tex] is the learning rate,
  • [Tex]\nabla J(\theta)[/Tex] is the gradient of the loss function [Tex]J(\theta)[/Tex] with respect to [Tex]\theta[/Tex].
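
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent applied to a linear-regression (mean-squared-error) loss. The synthetic data, learning rate, and iteration count are illustrative assumptions, not part of the algorithm itself.

import numpy as np

# Synthetic linear-regression problem (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # 200 examples, 3 features
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=200)

theta = np.zeros(3)                              # model parameters
eta = 0.1                                        # learning rate
for _ in range(200):
    grad = X.T @ (X @ theta - y) / len(y)        # average gradient over the entire dataset
    theta = theta - eta * grad                   # theta = theta - eta * grad J(theta)
print(theta)                                     # converges close to true_theta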

Advantages of Batch Gradient Descent

  • Stable Convergence: Since the gradient is averaged over all training examples, the updates are less noisy and more stable.
  • Global View: It considers the entire dataset for each update, providing a global perspective of the loss landscape.

Disadvantages of Batch Gradient Descent

  • Computationally Expensive: Processing the entire dataset in each iteration can be slow and resource-intensive, especially for large datasets.
  • Memory Intensive: Requires storing and processing the entire dataset in memory, which can be impractical for very large datasets.

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm where the model parameters are updated using the gradient of the loss function with respect to a single training example at each iteration. Unlike batch gradient descent, which uses the entire dataset, SGD updates the parameters more frequently, leading to faster convergence.

The update rule for SGD is:

[Tex]\theta = \theta - \eta \nabla J(\theta; x^{(i)}, y^{(i)})[/Tex]

where:

  • [Tex]\theta[/Tex] represents the parameters of the model,
  • [Tex]\eta[/Tex] is the learning rate,
  • [Tex]\nabla J(\theta; x^{(i)}, y^{(i)})[/Tex] is the gradient of the loss function [Tex]J(\theta)[/Tex] with respect to [Tex]\theta[/Tex] for the i-th training example [Tex](x^{(i)}, y^{(i)})[/Tex].
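
Below is a minimal sketch of SGD on the same illustrative linear-regression setup, updating the parameters one training example at a time; the data, learning rate, and number of epochs are assumed values.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=200)

theta = np.zeros(3)
eta = 0.01
for epoch in range(20):
    for i in rng.permutation(len(y)):             # visit training examples in random order
        grad_i = (X[i] @ theta - y[i]) * X[i]     # gradient for the single example (x_i, y_i)
        theta = theta - eta * grad_i
print(theta)                                      # hovers near true_theta; updates are noisy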

Advantages of Stochastic Gradient Descent

  • Faster Convergence: Frequent updates can lead to faster convergence, especially in large datasets.
  • Less Memory Intensive: Since it processes one training example at a time, it requires less memory compared to batch gradient descent.
  • Better for Online Learning: Suitable for scenarios where data comes in a stream, allowing the model to be updated continuously.

Disadvantages of Stochastic Gradient Descent

  • Noisy Updates: Updates can be noisy, leading to a more erratic convergence path.
  • Potential for Overshooting: The frequent updates can cause the algorithm to overshoot the minimum, especially with a high learning rate.
  • Hyperparameter Sensitivity: Requires careful tuning of the learning rate to ensure stable and efficient convergence.

Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a compromise between Batch Gradient Descent and Stochastic Gradient Descent. Instead of using the entire dataset or a single training example, Mini-Batch Gradient Descent updates the model parameters using a small, random subset of the training data called a mini-batch. The update rule for Mini-Batch Gradient Descent is:

[Tex]\theta = \theta - \eta \nabla J(\theta; \{x^{(i)}, y^{(i)}\}_{i=1}^m)[/Tex]

where:

  • [Tex]\theta[/Tex] represents the parameters of the model,
  • [Tex]\eta[/Tex] is the learning rate,
  • [Tex]\{x^{(i)}, y^{(i)}\}_{i=1}^{m}[/Tex] represents a mini-batch of [Tex]m[/Tex] training examples,
  • [Tex]\nabla J(\theta; \{x^{(i)}, y^{(i)}\}_{i=1}^m)[/Tex] is the gradient of the loss function [Tex]J(\theta)[/Tex] with respect to [Tex]\theta[/Tex] for the mini-batch.
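
As a sketch, the mini-batch variant for the same illustrative linear-regression setup loops over random mini-batches of size m; the batch size, learning rate, and epoch count are assumptions chosen for demonstration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=200)

theta = np.zeros(3)
eta, m = 0.05, 16                                    # learning rate and mini-batch size
for epoch in range(100):
    order = rng.permutation(len(y))                  # reshuffle the data every epoch
    for start in range(0, len(y), m):
        batch = order[start:start + m]
        Xb, yb = X[batch], y[batch]
        grad = Xb.T @ (Xb @ theta - yb) / len(batch) # average gradient over the mini-batch
        theta = theta - eta * grad
print(theta)                                         # close to true_theta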

Advantages of Mini-Batch Gradient Descent

  • Faster Convergence: By using mini-batches, it achieves a balance between the noisy updates of SGD and the stable updates of Batch Gradient Descent, often leading to faster convergence.
  • Reduced Memory Usage: Requires less memory than Batch Gradient Descent as it only needs to store a mini-batch at a time.
  • Efficient Computation: Allows for efficient use of hardware optimizations and parallel processing, making it suitable for large datasets.

Disadvantages of Mini-Batch Gradient Descent

  • Complexity in Tuning: Requires careful tuning of the mini-batch size and learning rate to ensure optimal performance.
  • Less Stable than Batch GD: While more stable than SGD, it can still be less stable than Batch Gradient Descent, especially if the mini-batch size is too small.
  • Potential for Suboptimal Mini-Batch Sizes: Selecting an inappropriate mini-batch size can lead to suboptimal performance and convergence issues.

Momentum-Based Gradient Descent

Momentum-Based Gradient Descent is an enhancement of the standard gradient descent algorithm that aims to accelerate convergence, particularly in the presence of high curvature, small but consistent gradients, or noisy gradients. It introduces a velocity term that accumulates the gradient of the loss function over time, thereby smoothing the path taken by the parameters.

The update rule for Momentum-Based Gradient Descent is:

[Tex]v_t = \gamma v_{t-1} + \eta \nabla J(\theta_t)[/Tex]

[Tex]\theta_{t+1} = \theta_t - v_t[/Tex]

where:

  • [Tex]v_t[/Tex] is the velocity at iteration t,
  • [Tex]\gamma[/Tex] is the momentum term (typically between 0 and 1),
  • [Tex]\eta[/Tex] is the learning rate,
  • [Tex]\nabla J(\theta_t)[/Tex] is the gradient of the loss function [Tex]J(\theta)[/Tex] with respect to [Tex]\theta[/Tex] at iteration t.
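
The velocity update can be sketched in a few lines of NumPy. The quadratic loss below (steep in one direction, flat in the other) and the choices of eta and gamma are illustrative assumptions meant to show why momentum helps in ravine-like landscapes.

import numpy as np

A = np.diag([10.0, 1.0])                  # ill-conditioned quadratic: a simple "ravine"
grad = lambda th: A @ th                  # gradient of J(theta) = 0.5 * theta^T A theta

theta = np.array([1.0, 1.0])
v = np.zeros(2)                           # velocity
eta, gamma = 0.05, 0.9                    # learning rate and momentum term
for t in range(200):
    v = gamma * v + eta * grad(theta)     # v_t = gamma * v_{t-1} + eta * grad J(theta_t)
    theta = theta - v                     # theta_{t+1} = theta_t - v_t
print(theta)                              # approaches the minimum at the origin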

Advantages of Momentum-Based Gradient Descent

  • Accelerated Convergence: Helps in faster convergence, especially in scenarios with small but consistent gradients.
  • Smoother Updates: Reduces the oscillations in the gradient updates, leading to a smoother and more stable convergence path.
  • Effective in Ravines: Particularly effective in dealing with ravines or regions of steep curvature, common in deep learning loss landscapes.

Disadvantages of Momentum-Based Gradient Descent

  • Additional Hyperparameter: Introduces an additional hyperparameter (momentum term) that needs to be tuned.
  • Complex Implementation: Slightly more complex to implement compared to standard gradient descent.
  • Potential Overcorrection: If not properly tuned, the momentum can lead to overcorrection and instability in the updates.

Adaptive Learning Rate Methods

1. AdaGrad (Adaptive Gradient Algorithm)

AdaGrad is an adaptive learning rate algorithm that adjusts the learning rate for each parameter individually, scaling it in inverse proportion to the square root of the sum of all past squared gradients. This allows for larger updates for infrequent parameters and smaller updates for frequent ones.

The update rule for AdaGrad is:

[Tex]g_t = \nabla J(\theta_t)[/Tex]

[Tex]G_t = G_{t-1} + g_t^2[/Tex]

[Tex]\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}}g_t[/Tex]

where:

  • [Tex]g_t[/Tex]​ is the gradient at iteration t,
  • [Tex]G_t[/Tex]​ is the sum of the squares of past gradients,
  • [Tex]\eta[/Tex] is the learning rate,
  • [Tex]\epsilon[/Tex] is a small smoothing term to avoid division by zero.
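
A minimal sketch of the AdaGrad update on a small quadratic loss follows; the loss, learning rate, and iteration count are illustrative assumptions.

import numpy as np

A = np.array([[3.0, 0.0], [0.0, 0.5]])
b = np.array([3.0, -1.0])
grad = lambda th: A.T @ (A @ th - b)      # gradient of J(theta) = 0.5 * ||A theta - b||^2

theta = np.zeros(2)
G = np.zeros(2)                           # per-parameter sum of squared gradients
eta, eps = 0.5, 1e-8
for t in range(500):
    g = grad(theta)
    G = G + g ** 2                        # G_t = G_{t-1} + g_t^2
    theta = theta - eta / np.sqrt(G + eps) * g
print(theta)                              # approaches the minimizer [1, -2]

Because the first coordinate sees much larger gradients, it accumulates a larger G and therefore receives smaller effective steps; this is the per-parameter scaling described above.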

Advantages of AdaGrad

  • Per-Parameter Learning Rate: Adjusts the learning rate for each parameter based on its historical gradient, leading to more efficient convergence.
  • Effective for Sparse Data: Particularly useful for sparse data, where some parameters receive updates infrequently.

Disadvantages of AdaGrad

  • Learning Rate Decay: The cumulative sum of squared gradients can grow without bound, causing the learning rate to decay and potentially stopping learning prematurely.
  • Memory Intensive: Requires storing additional memory for each parameter’s accumulated gradient, which can be intensive for large models.

2. RMSProp (Root Mean Square Propagation)

RMSProp is an extension of AdaGrad that modifies the accumulation of squared gradients to use a moving average. This prevents the learning rate from decaying too quickly.

The update rule for RMSProp is:

[Tex]g_t = \nabla J(\theta_t)[/Tex]

[Tex]E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)g_{t}^{2}[/Tex]

[Tex]\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t[/Tex]

where:

  • [Tex]g_t[/Tex] is the gradient at iteration t,
  • [Tex]E[g^2]_t[/Tex] is the exponentially weighted moving average of squared gradients,
  • [Tex]\gamma[/Tex] is the decay rate (typically between 0.9 and 0.99),
  • [Tex]\eta[/Tex] is the learning rate,
  • [Tex]\epsilon[/Tex] is a small smoothing term to avoid division by zero.
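
The same illustrative quadratic loss can be reused to sketch RMSProp; the decay rate, learning rate, and iteration budget are assumed values. With a constant learning rate the iterates hover near, rather than settle exactly at, the minimizer.

import numpy as np

A = np.array([[3.0, 0.0], [0.0, 0.5]])
b = np.array([3.0, -1.0])
grad = lambda th: A.T @ (A @ th - b)

theta = np.zeros(2)
Eg2 = np.zeros(2)                         # E[g^2]_t, moving average of squared gradients
eta, gamma, eps = 0.05, 0.9, 1e-8
for t in range(1000):
    g = grad(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    theta = theta - eta / np.sqrt(Eg2 + eps) * g
print(theta)                              # hovers close to the minimizer [1, -2]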

Advantages of RMSProp

  • Controlled Learning Rate Decay: Uses a moving average to prevent the learning rate from decaying too quickly.
  • Adaptable to Non-Stationary Objectives: More suitable for non-stationary objectives due to the moving average mechanism.

Disadvantages of RMSProp

  • Hyperparameter Sensitivity: Requires careful tuning of the decay rate [Tex]\gamma[/Tex] and learning rate [Tex]\eta[/Tex].
  • Potential for Oscillation: If the decay rate is not properly tuned, it can lead to oscillations in the learning process.

3. Adam (Adaptive Moment Estimation)

Adam is an adaptive learning rate optimization algorithm that combines the benefits of both AdaGrad and RMSProp. It computes adaptive learning rates for each parameter by estimating the first and second moments of the gradients.

The update rule for Adam is:

[Tex]m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t[/Tex]

[Tex]v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2[/Tex]

[Tex]\hat{m}_t = \frac{m_t}{1 - \beta_1^t}[/Tex]

[Tex]\hat{v}_t = \frac{v_t}{1 - \beta_2^t}[/Tex]

[Tex]\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t[/Tex]

where:

  • [Tex]g_t[/Tex]​ is the gradient at iteration t,
  • [Tex]m_t[/Tex]​ is the exponentially weighted moving average of the gradient (first moment),
  • [Tex]v_t[/Tex]​ is the exponentially weighted moving average of the squared gradient (second moment),
  • [Tex]\hat{m}_t[/Tex]​ and [Tex]\hat{v}_t[/Tex]​ are bias-corrected estimates,
  • [Tex]\beta_1[/Tex]​ and [Tex]\beta_2[/Tex]​ are decay rates for the moment estimates (typically 0.9 and 0.999, respectively),
  • [Tex]\eta[/Tex] is the learning rate,
  • [Tex]\epsilon[/Tex] is a small smoothing term to avoid division by zero.
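
Putting the five equations together, a minimal NumPy sketch of Adam on the same illustrative quadratic loss looks like this; the hyperparameter values follow the common defaults mentioned above, and the loss itself is an assumption for demonstration.

import numpy as np

A = np.array([[3.0, 0.0], [0.0, 0.5]])
b = np.array([3.0, -1.0])
grad = lambda th: A.T @ (A @ th - b)

theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 2001):                  # t starts at 1 for the bias correction
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g       # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)                              # roughly the minimizer [1, -2]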

Advantages of Adam

  • Combines Benefits: Merges the benefits of AdaGrad and RMSProp, providing adaptive learning rates and preventing rapid decay.
  • Bias Correction: Uses bias-corrected estimates, which improve stability and convergence speed.
  • Widely Applicable: Effective in a wide range of deep learning applications, making it a popular choice.

Disadvantages of Adam

  • Hyperparameter Sensitivity: Requires careful tuning of multiple hyperparameters ([Tex]\beta_1[/Tex], [Tex]\beta_2[/Tex], and [Tex]\eta[/Tex]).
  • Complexity: More complex to implement compared to simpler gradient descent methods.

Hybrid and Advanced Gradient Descent Variants

1. AdamW (Adam with Weight Decay)

AdamW is a variant of the Adam optimization algorithm that decouples weight decay from the gradient-based update. In Adam with L2 regularization, the penalty's gradient is added to the loss gradient and is therefore rescaled by the adaptive terms; AdamW instead applies the weight decay term directly to the weights, which leads to more effective regularization.

The update rule for AdamW is:

[Tex]m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t[/Tex]

[Tex]v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2[/Tex]

[Tex]\hat{m}_t = \frac{m_t}{1 - \beta_1^t}[/Tex]

[Tex]\hat{v}_t = \frac{v_t}{1 - \beta_2^t}[/Tex]

[Tex]\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)[/Tex]

where:

  • [Tex]\lambda[/Tex] is the weight decay coefficient,
  • Other variables are as defined in the standard Adam algorithm.
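
The sketch below shows how the AdamW step differs from plain Adam: the decay term lambda * theta is applied inside the parameter update rather than being folded into the gradient. The quadratic loss and the hyperparameters (including lambda) are illustrative assumptions.

import numpy as np

A = np.array([[3.0, 0.0], [0.0, 0.5]])
b = np.array([3.0, -1.0])
grad = lambda th: A.T @ (A @ th - b)      # gradient of the unregularized loss

theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
eta, beta1, beta2, eps, lam = 0.1, 0.9, 0.999, 1e-8, 0.01
for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # decoupled weight decay: lambda * theta is applied directly to the parameters
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + lam * theta)
print(theta)                              # pulled slightly toward zero relative to plain Adam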

Advantages:

  • Decoupled Regularization: Separates weight decay from the gradient updates, leading to more effective regularization.
  • Improved Generalization: Often results in better generalization performance compared to traditional Adam.

Disadvantages:

  • Hyperparameter Tuning: Introduces an additional hyperparameter (weight decay coefficient) that needs to be tuned.
  • Complexity: Slightly more complex to implement and understand compared to standard Adam.

2. Nadam (Nesterov-accelerated Adaptive Moment Estimation)

Nadam combines the benefits of Adam and Nesterov-accelerated gradient (NAG) to provide an optimization algorithm with adaptive learning rates and momentum. It incorporates Nesterov momentum into the Adam update rule.

The update rule for Nadam is:

[Tex]m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t[/Tex]

[Tex]v_t = \beta_2 v_{t-1} + (1 - \beta_2)g^2_t[/Tex]

[Tex]\hat{m}_t = \frac{m_t}{1 - \beta_1^t}[/Tex]

[Tex]\hat{v}_t = \frac{v_t}{1 - \beta_2^t}[/Tex]

[Tex]\theta_{t+1} = \theta_t - \eta \left( \frac{\beta_1 \hat{m}_t + (1 - \beta_1) g_t}{\sqrt{\hat{v}_t} + \epsilon} \right)[/Tex]
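
A minimal sketch following the simplified update rule written above; the quadratic loss and hyperparameters are illustrative assumptions, and library implementations may apply slightly different bias corrections to the gradient term.

import numpy as np

A = np.array([[3.0, 0.0], [0.0, 0.5]])
b = np.array([3.0, -1.0])
grad = lambda th: A.T @ (A @ th - b)

theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov-style look-ahead: blend the bias-corrected momentum with the current gradient
    theta = theta - eta * (beta1 * m_hat + (1 - beta1) * g) / (np.sqrt(v_hat) + eps)
print(theta)                              # approaches the minimizer [1, -2]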

Advantages:

  • Faster Convergence: Combines the faster convergence properties of Nesterov momentum with the adaptive learning rates of Adam.
  • Stability: Often provides more stable convergence in practice.

Disadvantages:

  • Complexity: More complex to implement and understand compared to Adam and other simpler optimizers.
  • Hyperparameter Sensitivity: Requires careful tuning of multiple hyperparameters.

3. AMSGrad

AMSGrad is a variant of Adam that replaces the second-moment term in the update with the running maximum of all second-moment estimates so far, ensuring that the effective step size never increases. This addresses convergence issues that can arise in Adam.

The update rule for AMSGrad is:

[Tex]m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t[/Tex]

[Tex]v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2[/Tex]

[Tex]\hat{v}_t = \max(\hat{v}_{t-1}, v_t)[/Tex]

[Tex]\hat{m}_t = \frac{m_t}{1 - \beta_1^t}[/Tex]

[Tex]\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t[/Tex]

where:

  • [Tex]\hat{v}_t[/Tex] is the maximum of the previous [Tex]\hat{v}_{t-1}[/Tex]​ and the current [Tex]v_t[/Tex].
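
A minimal sketch of AMSGrad on the same illustrative quadratic loss; the only change from the Adam sketch is the running maximum of the second-moment estimate. The loss and hyperparameters are assumptions for demonstration.

import numpy as np

A = np.array([[3.0, 0.0], [0.0, 0.5]])
b = np.array([3.0, -1.0])
grad = lambda th: A.T @ (A @ th - b)

theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
v_hat = np.zeros(2)                       # running maximum of the second-moment estimate
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)          # v_hat_t = max(v_hat_{t-1}, v_t), never decreases
    m_hat = m / (1 - beta1 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)                              # approaches the minimizer [1, -2]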

Advantages:

  • Improved Convergence: Addresses convergence issues seen in Adam, leading to more reliable and consistent performance.
  • Better Performance: Often results in better performance on some challenging optimization problems.

Disadvantages:

  • Hyperparameter Tuning: Requires careful tuning of hyperparameters similar to Adam.
  • Increased Computational Cost: Slightly more computationally expensive due to the additional max operation.

Conclusion

Gradient descent and its various advanced variants play a crucial role in optimizing machine learning models by fine-tuning parameters to minimize prediction errors. Each variant, from Batch Gradient Descent to adaptive methods like AdaGrad, RMSProp, Adam, and hybrid approaches such as AdamW, Nadam, and AMSGrad, offers unique advantages and addresses specific challenges in the optimization process. Understanding these variants, along with their advantages and disadvantages, enables practitioners to select the most appropriate method for their specific tasks, thereby enhancing the efficiency and effectiveness of their machine learning models.



