Gradient descent is a fundamental optimization algorithm in machine learning, used to minimize a loss function by iteratively moving the parameters toward a minimum. It is central to training models, since it fine-tunes parameters to reduce prediction error. This article explores the main variants of gradient descent along with their advantages and disadvantages.
Different variants of gradient descent, such as Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent, trade off speed, computational cost, and robustness to noisy data in different ways. Understanding these trade-offs helps in selecting the best approach for a given optimization task.

Batch Gradient Descent

Batch Gradient Descent is a variant of the gradient descent algorithm in which the entire dataset is used to compute the gradient of the loss function with respect to the parameters. In each iteration, the algorithm calculates the average gradient of the loss over all training examples and updates the model parameters accordingly. The update rule for batch gradient descent is:

[Tex]\theta = \theta - \eta \nabla J(\theta)[/Tex]

where [Tex]\theta[/Tex] denotes the model parameters, [Tex]\eta[/Tex] is the learning rate, and [Tex]\nabla J(\theta)[/Tex] is the gradient of the loss function computed over the full training set.
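A minimal NumPy sketch of this full-batch update on a toy linear-regression problem is shown below; the synthetic data, learning rate, and iteration count are illustrative assumptions rather than values prescribed by the article:

```python
import numpy as np

# Toy data: y = 3x + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

theta = np.zeros(1)   # model parameters
eta = 0.1             # learning rate

for _ in range(200):
    # Gradient of the MSE loss J(theta) over the ENTIRE dataset
    grad = 2 * X.T @ (X @ theta - y) / len(y)
    theta = theta - eta * grad   # theta <- theta - eta * grad J(theta)

print(theta)  # close to 3
```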
Advantages of Batch Gradient Descent

- Produces a stable, low-variance gradient estimate, so the loss decreases smoothly.
- For convex losses with a suitable learning rate, it converges deterministically to the minimum.

Disadvantages of Batch Gradient Descent

- Each update requires a pass over the entire dataset, which is slow and memory-intensive for large datasets.
- Not suited to online learning, since all data must be available up front.
Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm in which the model parameters are updated using the gradient of the loss function with respect to a single training example at each iteration. Unlike batch gradient descent, which uses the entire dataset per update, SGD updates the parameters far more frequently, which often speeds up learning at the cost of noisier steps. The update rule for SGD is:

[Tex]\theta = \theta - \eta \nabla J(\theta; x^{(i)}, y^{(i)})[/Tex]

where [Tex]\theta[/Tex] denotes the model parameters, [Tex]\eta[/Tex] is the learning rate, and [Tex]\nabla J(\theta; x^{(i)}, y^{(i)})[/Tex] is the gradient of the loss computed on the single training example [Tex](x^{(i)}, y^{(i)})[/Tex].
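A minimal sketch of the per-example update on the same kind of toy regression problem; the data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

theta = np.zeros(1)
eta = 0.05

for epoch in range(20):
    for i in rng.permutation(len(y)):          # visit examples in random order
        x_i, y_i = X[i], y[i]
        grad = 2 * x_i * (x_i @ theta - y_i)   # gradient from a SINGLE example
        theta = theta - eta * grad

print(theta)  # close to 3, but the path is noisier than full-batch
```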
Advantages of Stochastic Gradient Descent

- Each update is very cheap, so learning starts quickly even on large datasets.
- The noise in the updates can help the optimizer escape shallow local minima and saddle points.
- Naturally supports online learning, since examples can be processed as they arrive.

Disadvantages of Stochastic Gradient Descent

- The loss fluctuates heavily from step to step because each gradient is a noisy estimate.
- Converging to the exact minimum is difficult without a decaying learning rate.
Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a compromise between Batch Gradient Descent and Stochastic Gradient Descent. Instead of using the entire dataset or a single training example, it updates the model parameters using a small, random subset of the training data called a mini-batch. The update rule for Mini-Batch Gradient Descent is:

[Tex]\theta = \theta - \eta \nabla J(\theta; \{x^{(i)}, y^{(i)}\}_{i=1}^m)[/Tex]

where [Tex]\theta[/Tex] denotes the model parameters, [Tex]\eta[/Tex] is the learning rate, [Tex]m[/Tex] is the mini-batch size, and the gradient is computed over the [Tex]m[/Tex] examples in the current mini-batch.
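A minimal sketch of mini-batch updates on the same toy problem; the batch size of 16 and the other hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

theta = np.zeros(1)
eta, batch_size = 0.1, 16

for epoch in range(50):
    idx = rng.permutation(len(y))              # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]      # indices of one mini-batch
        grad = 2 * X[b].T @ (X[b] @ theta - y[b]) / len(b)
        theta = theta - eta * grad

print(theta)
```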
Advantages of Mini-Batch Gradient Descent

- Balances the stability of batch gradient descent with the speed of SGD.
- Mini-batches can be processed with vectorized operations and GPU hardware, making each update efficient.
- The gradient estimate is less noisy than in pure SGD.

Disadvantages of Mini-Batch Gradient Descent

- Introduces an extra hyperparameter, the mini-batch size, that must be tuned.
- Updates are still noisy, so the loss does not decrease as smoothly as with full-batch updates.
Momentum-Based Gradient Descent

Momentum-Based Gradient Descent is an enhancement of the standard gradient descent algorithm that aims to accelerate convergence, particularly in the presence of high curvature, small but consistent gradients, or noisy gradients. It introduces a velocity term that accumulates gradients of the loss function over time, thereby smoothing the path taken by the parameters. The update rule for Momentum-Based Gradient Descent is:

[Tex]v_t = \gamma v_{t-1} + \eta \nabla J(\theta_t)[/Tex]

[Tex]\theta_{t+1} = \theta_t - v_t[/Tex]

where [Tex]v_t[/Tex] is the velocity (accumulated update) at step [Tex]t[/Tex], [Tex]\gamma[/Tex] is the momentum coefficient (typically around 0.9), [Tex]\eta[/Tex] is the learning rate, and [Tex]\nabla J(\theta_t)[/Tex] is the gradient of the loss at the current parameters.
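A minimal sketch of the momentum update on a toy one-dimensional quadratic loss; the loss, starting point, and hyperparameters are illustrative assumptions:

```python
def grad_J(theta):
    """Gradient of a toy loss J(theta) = theta**2 (illustrative only)."""
    return 2 * theta

theta = 5.0
v = 0.0
eta, gamma = 0.1, 0.9   # learning rate and momentum coefficient

for _ in range(100):
    v = gamma * v + eta * grad_J(theta)   # v_t = gamma * v_{t-1} + eta * grad
    theta = theta - v                     # theta_{t+1} = theta_t - v_t

print(theta)  # approaches the minimum at 0
```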
Advantages of Momentum-Based Gradient Descent

- Accelerates progress along directions with consistent gradients and dampens oscillations across steep, narrow valleys.
- Often converges faster than plain gradient descent, especially on ill-conditioned problems.

Disadvantages of Momentum-Based Gradient Descent

- Adds another hyperparameter, the momentum coefficient, that needs tuning.
- With too much momentum the parameters can overshoot the minimum and oscillate.
Adaptive Learning Rate Methods

1. AdaGrad (Adaptive Gradient Algorithm)

AdaGrad is an adaptive learning rate algorithm that adjusts the learning rate for each parameter individually, scaling it inversely proportional to the square root of the sum of all past squared gradients. This allows larger updates for infrequently updated parameters and smaller updates for frequently updated ones. The update rule for AdaGrad is:

[Tex]g_t = \nabla J(\theta_t)[/Tex]

[Tex]G_t = G_{t-1} + g_t^2[/Tex]

[Tex]\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t[/Tex]

where [Tex]g_t[/Tex] is the gradient at step [Tex]t[/Tex], [Tex]G_t[/Tex] is the element-wise accumulated sum of squared gradients, [Tex]\eta[/Tex] is the base learning rate, and [Tex]\epsilon[/Tex] is a small constant added for numerical stability.
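A minimal sketch of the AdaGrad update on a toy quadratic loss; the loss function and hyperparameter values are illustrative assumptions:

```python
import numpy as np

def grad_J(theta):
    """Gradient of a toy loss J(theta) = 0.5 * ||theta||^2 (illustrative only)."""
    return theta

theta = np.array([5.0, -3.0])
G = np.zeros_like(theta)       # running sum of squared gradients
eta, eps = 0.5, 1e-8

for _ in range(500):
    g = grad_J(theta)
    G += g ** 2                             # G_t = G_{t-1} + g_t^2
    theta -= eta / np.sqrt(G + eps) * g     # per-parameter scaled step

print(theta)  # moves toward the minimum at the origin
```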
Advantages of AdaGrad

- Automatically adapts the learning rate for each parameter, reducing manual tuning.
- Works well with sparse data, since rarely updated parameters receive proportionally larger steps.

Disadvantages of AdaGrad

- The accumulated sum of squared gradients grows without bound, so the effective learning rate shrinks toward zero and learning can stall.
2. RMSProp (Root Mean Square Propagation)

RMSProp is an extension of AdaGrad that replaces the unbounded accumulation of squared gradients with an exponentially decaying moving average. This prevents the effective learning rate from decaying too quickly. The update rule for RMSProp is:

[Tex]g_t = \nabla J(\theta_t)[/Tex]

[Tex]E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2[/Tex]

[Tex]\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t[/Tex]

where [Tex]E[g^2]_t[/Tex] is the moving average of squared gradients, [Tex]\gamma[/Tex] is the decay rate (typically around 0.9), [Tex]\eta[/Tex] is the learning rate, and [Tex]\epsilon[/Tex] is a small constant for numerical stability.
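A minimal sketch of the RMSProp update on the same toy quadratic loss; the loss and hyperparameters are illustrative assumptions:

```python
import numpy as np

def grad_J(theta):
    """Gradient of a toy loss J(theta) = 0.5 * ||theta||^2 (illustrative only)."""
    return theta

theta = np.array([5.0, -3.0])
Eg2 = np.zeros_like(theta)      # moving average of squared gradients
eta, gamma, eps = 0.01, 0.9, 1e-8

for _ in range(1000):
    g = grad_J(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2    # E[g^2]_t
    theta -= eta / np.sqrt(Eg2 + eps) * g       # per-parameter scaled step

print(theta)
```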
Advantages of RMSProp

- Keeps the effective learning rate from collapsing, so it works well on non-stationary problems and in deep learning.
- Adapts the step size per parameter with little manual tuning.

Disadvantages of RMSProp

- Still requires choosing the base learning rate and decay rate.
- Lacks bias correction, so early estimates of the squared gradients can be inaccurate.
3. Adam (Adaptive Moment Estimation)

Adam is an adaptive learning rate optimization algorithm that combines the benefits of both AdaGrad and RMSProp. It computes an adaptive learning rate for each parameter by estimating the first and second moments of the gradients and correcting them for initialization bias. The update rule for Adam is:

[Tex]m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t[/Tex]

[Tex]v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2[/Tex]

[Tex]\hat{m}_t = \frac{m_t}{1 - \beta_1^t}[/Tex]

[Tex]\hat{v}_t = \frac{v_t}{1 - \beta_2^t}[/Tex]

[Tex]\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t[/Tex]

where [Tex]m_t[/Tex] and [Tex]v_t[/Tex] are the first- and second-moment estimates of the gradient, [Tex]\beta_1[/Tex] and [Tex]\beta_2[/Tex] are their decay rates (commonly 0.9 and 0.999), [Tex]\hat{m}_t[/Tex] and [Tex]\hat{v}_t[/Tex] are the bias-corrected estimates, [Tex]\eta[/Tex] is the learning rate, and [Tex]\epsilon[/Tex] is a small constant for numerical stability.
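A minimal sketch of the Adam update on a toy quadratic loss; the loss, step count, and hyperparameters are illustrative assumptions:

```python
import numpy as np

def grad_J(theta):
    """Gradient of a toy loss J(theta) = 0.5 * ||theta||^2 (illustrative only)."""
    return theta

theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)   # first-moment estimate
v = np.zeros_like(theta)   # second-moment estimate
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)
```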
Advantages of Adam

- Combines momentum with per-parameter adaptive learning rates, so it converges quickly with little tuning.
- Bias correction makes the moment estimates reliable even in the first few steps.
- Works well across a wide range of deep learning problems and is a common default optimizer.

Disadvantages of Adam

- Maintains two extra moment vectors, increasing memory use.
- Can generalize worse than well-tuned SGD with momentum on some tasks, and may fail to converge in certain settings (one motivation for AMSGrad).
Hybrid and Advanced Gradient Descent Variants

1. AdamW (Adam with Weight Decay)

AdamW is a variant of the Adam optimization algorithm that decouples weight decay from the gradient-based update. Unlike Adam with L2 regularization, which folds the regularization term into the adaptive gradient update, AdamW subtracts the weight decay term from the weights directly, leading to more effective regularization. The update rule for AdamW is:

[Tex]m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t[/Tex]

[Tex]v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2[/Tex]

[Tex]\hat{m}_t = \frac{m_t}{1 - \beta_1^t}[/Tex]

[Tex]\hat{v}_t = \frac{v_t}{1 - \beta_2^t}[/Tex]

[Tex]\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)[/Tex]

where the symbols are as in Adam and [Tex]\lambda[/Tex] is the weight-decay coefficient.
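A minimal sketch of the decoupled weight-decay update on a toy quadratic loss; the loss and the hyperparameter values, including the decay coefficient, are illustrative assumptions:

```python
import numpy as np

def grad_J(theta):
    """Gradient of a toy loss J(theta) = 0.5 * ||theta||^2 (illustrative only)."""
    return theta

theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
eta, beta1, beta2, eps, lam = 0.05, 0.9, 0.999, 1e-8, 0.01   # lam = weight decay

for t in range(1, 2001):
    g = grad_J(theta)                     # gradient WITHOUT the decay term
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: subtracted directly, not mixed into the gradient
    theta -= eta * (m_hat / (np.sqrt(v_hat) + eps) + lam * theta)

print(theta)
```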
Advantages:

- Decoupling weight decay from the adaptive update gives more consistent regularization and often better generalization than Adam with L2 regularization.
- Widely used for training large models such as transformers.

Disadvantages:

- Adds the weight-decay coefficient as another hyperparameter to tune.
- Shares Adam's memory overhead for the moment estimates.
2. Nadam (Nesterov-accelerated Adaptive Moment Estimation)

Nadam combines the benefits of Adam and Nesterov-accelerated gradient (NAG), providing adaptive learning rates together with Nesterov-style momentum. It incorporates the Nesterov look-ahead into the Adam update rule:

[Tex]m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t[/Tex]

[Tex]v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2[/Tex]

[Tex]\hat{m}_t = \frac{m_t}{1 - \beta_1^t}[/Tex]

[Tex]\hat{v}_t = \frac{v_t}{1 - \beta_2^t}[/Tex]

[Tex]\theta_{t+1} = \theta_t - \eta \left( \frac{\beta_1 \hat{m}_t + (1 - \beta_1) g_t}{\sqrt{\hat{v}_t} + \epsilon} \right)[/Tex]

where the symbols are as in Adam. A short sketch of this update appears after the advantages and disadvantages below.

Advantages:

- Combines adaptive per-parameter learning rates with Nesterov momentum, which can give faster or smoother convergence than Adam.
Disadvantages:

- Slightly more complex update rule than Adam, with the same memory overhead and hyperparameters to tune.
- The practical improvement over Adam is often small and problem-dependent.
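A minimal sketch of the Nadam update rule given above, on a toy quadratic loss; the loss and hyperparameters are illustrative assumptions:

```python
import numpy as np

def grad_J(theta):
    """Gradient of a toy loss J(theta) = 0.5 * ||theta||^2 (illustrative only)."""
    return theta

theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov-style blend of the corrected momentum and the current gradient
    theta -= eta * (beta1 * m_hat + (1 - beta1) * g) / (np.sqrt(v_hat) + eps)

print(theta)
```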
3. AMSGrad

AMSGrad is a variant of Adam that modifies the second-moment estimate by taking the maximum of all past estimates, so the effective learning rate never increases when the exponential moving average of squared gradients shrinks. This addresses a convergence issue that can arise with Adam. The update rule for AMSGrad is:

[Tex]m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t[/Tex]

[Tex]v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2[/Tex]

[Tex]\hat{v}_t = \max(\hat{v}_{t-1}, v_t)[/Tex]

[Tex]\hat{m}_t = \frac{m_t}{1 - \beta_1^t}[/Tex]

[Tex]\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t[/Tex]

where [Tex]\hat{v}_t[/Tex] is the element-wise maximum of the second-moment estimates seen so far and the remaining symbols are as in Adam.
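A minimal sketch of the AMSGrad update on a toy quadratic loss; the loss and hyperparameters are illustrative assumptions:

```python
import numpy as np

def grad_J(theta):
    """Gradient of a toy loss J(theta) = 0.5 * ||theta||^2 (illustrative only)."""
    return theta

theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
v_hat = np.zeros_like(theta)   # running max of second-moment estimates
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)          # keep the largest v seen so far
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)
```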
Advantages:

- The non-decreasing second-moment estimate restores convergence guarantees that Adam can violate.
- Retains Adam's per-parameter adaptive learning rates.

Disadvantages:

- Requires storing the running maximum, slightly increasing memory use.
- In practice it does not consistently outperform Adam.
Conclusion

Gradient descent and its various advanced variants play a crucial role in optimizing machine learning models by fine-tuning parameters to minimize prediction errors. Each variant, from Batch Gradient Descent to adaptive methods like AdaGrad, RMSProp, and Adam, and hybrid approaches such as AdamW, Nadam, and AMSGrad, offers unique advantages and addresses specific challenges in the optimization process. Understanding these variants, along with their advantages and disadvantages, enables practitioners to select the most appropriate method for their specific tasks, thereby enhancing the efficiency and effectiveness of their machine learning models.