TL;DR
- What is gradient descent in machine learning?
Gradient descent is an optimization algorithm used in machine learning to train models by minimizing the error (cost function). It works by iteratively adjusting the model’s parameters in the direction that reduces the cost the most (the direction of the negative gradient).
- What are the different types of gradient descent?
The three primary types are: batch gradient descent, stochastic gradient descent (SGD), and mini-batch gradient descent. Batch uses the entire training dataset to compute gradients for each update (stable but can be slow on big data). Stochastic (SGD) uses only one random training example per update. Mini-batch uses a small batch of examples (like 32 or 64) per update, balancing speed and stability – this is the most commonly used approach in practice. All three follow the same core idea of moving down the slope of the cost function, just with different amounts of data to estimate the slope each time.
- What is stochastic gradient descent (SGD)?
Stochastic gradient descent is a variant of gradient descent in which the parameter update is performed one training example (or a very small batch of examples) at a time, rather than on the whole dataset. The term “stochastic” refers to the randomness in selecting individual examples for each step. An SGD update uses the gradient from one sample to tweak the parameters, then moves on to the next sample, and so on. This often leads to faster initial progress and an ability to jump out of local minima thanks to the noisy steps. However, because of the noise, the cost function value fluctuates from step to step rather than decreasing smoothly. It is essentially the same algorithm as batch gradient descent, just applied per sample, and it is widely used for large datasets and online learning scenarios. (In deep learning, when people mention using “SGD”, they often mean mini-batch SGD, which is a generalized form of this.)
- How do I choose the learning rate for gradient descent?
Choosing the learning rate is crucial. If the learning rate is too high, the algorithm can overshoot minima and even diverge (you might see the cost going up or wildly oscillating). If it’s too low, training will be very slow – it might take forever to converge or appear to stall. A good strategy is to start with a moderate value like 0.01 and observe the training loss. You want to see the loss decreasing reasonably steadily. If it’s bouncing or increasing, decrease the learning rate. If it’s decreasing very slowly, you can try increasing it slightly. You can also use techniques like learning rate decay or more advanced methods like Adam optimizer (which adapts the effective learning rate automatically). In practice, finding the right learning rate often involves some experimentation or using known good defaults for your problem domain. There are also tools like learning rate finder graphs to help pinpoint a good range.
- How is gradient descent used in deep learning?
In deep learning, gradient descent (usually in the form of mini-batch SGD or an optimizer like Adam) is the core algorithm for training neural networks. Neural networks have many layers of weights; to train them, we define a loss function (say, error in predictions), compute the gradient of the loss with respect to each weight via backpropagation, and then use gradient descent to update the weights in the direction that lowers the loss. This process repeats for many iterations (epochs) over the dataset. Essentially, gradient descent in deep learning is what allows a neural network to learn – it incrementally adjusts the millions of parameters to minimize the loss on training data. Without gradient descent (and backprop to efficiently compute gradients), training deep neural nets would not be feasible. Every major deep learning framework (TensorFlow, PyTorch) uses gradient descent under the hood to optimize networks.
Gradient descent is a cornerstone of machine learning and deep learning optimization. It’s the algorithmic equivalent of finding your way downhill in a foggy mountain landscape, taking one careful step at a time in the steepest downward direction.
Imagine standing on a hill, blindfolded, trying to reach the lowest valley. You would feel the slope under your feet and take a step in the direction that descends the most. Then you’d repeat this, taking smaller steps as the ground flattens out to avoid overshooting the valley floor. This hiking analogy is exactly how gradient descent operates on an error landscape – it iteratively steps down the slope of a cost function to find the minimum.
In simple terms, gradient descent is an optimization algorithm that gradually tweaks a model’s parameters to minimize error or “cost,” guiding the model toward better performance.
But what is gradient descent, precisely, and why is it so important in machine learning? In this guide, we will explain gradient descent in machine learning and the different types of gradient descent. We will also see a simple example of gradient descent in Python and discuss how gradient descent powers popular machine learning algorithms and is the backbone of training deep learning models.
What is Gradient Descent in Machine Learning?
Gradient descent is an optimization algorithm that iteratively adjusts the parameters of a model to minimize some cost function (error). In the context of machine learning, gradient descent is used to train models by refining their parameters (weights and biases) so that the model’s predictions become as close as possible to the true values.
It is called “gradient” descent because it utilizes the gradient (the vector of partial derivatives) of the cost function to decide which direction to move the parameters, analogous to measuring the steepness of a hill. And it’s called “descent” because the algorithm descends the slope of the cost function – it takes steps in the negative gradient direction (downhill) to reduce the error, aiming for the lowest possible value of the cost.

Gradient descent is one of the most widely used algorithms in machine learning and deep learning because of its simplicity and effectiveness. In fact, nearly all modern neural network training relies on some variant of it.
How Gradient Descent Works
Now that we know what gradient descent is, let’s break down how the gradient descent algorithm works step by step. The process can be summarized as follows:
- Initialize Parameters: Start with an initial guess for the model parameters (for example, random values). This is like picking a starting point on the error landscape.
- Compute the Cost: Calculate the cost function (also called loss function) for the current parameters. The cost function quantifies the error between the model’s predictions and the actual targets. For example, in linear regression, the cost is often the Mean Squared Error.
- Compute the Gradient: Compute the gradient of the cost function with respect to each parameter. The gradient is a vector of partial derivatives, indicating the direction and rate of the steepest increase in cost. (The negative gradient gives the direction of steepest decrease.)
- Update the Parameters: Adjust the parameters in the opposite direction of the gradient (i.e. downhill). In other words, we subtract a fraction of the gradient from the current parameters. This fraction is controlled by a small constant called the learning rate.
- Repeat: Recompute the cost with the new parameters, and iterate the process. Continue iterating until the cost function is as low as possible (or changes very little), indicating convergence to a minimum.
This iterative procedure is the essence of the gradient descent algorithm. You start with some initial values, and gradient descent uses calculus (derivatives) to iteratively adjust the values to minimize the cost function. With each iteration (often called an epoch when it goes through a full training set), the algorithm moves the parameters a bit closer to the optimum.
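To see these five steps in the smallest possible setting, here is a minimal sketch in plain Python, assuming a made-up one-dimensional cost function f(x) = (x − 3)² whose minimum we already know is at x = 3:

```python
# Minimal sketch of the gradient descent loop on f(x) = (x - 3)^2.
# The function and starting point are illustrative choices only.

def cost(x):
    return (x - 3) ** 2            # step 2: measure how bad the current guess is

def gradient(x):
    return 2 * (x - 3)             # step 3: derivative (slope) of the cost

x = 0.0                            # step 1: initial guess
learning_rate = 0.1

for step in range(50):             # step 5: repeat until (near) convergence
    x -= learning_rate * gradient(x)   # step 4: move against the gradient

print(x, cost(x))                  # x ends up very close to 3, cost close to 0
```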
Cost Function and Learning Rate
To understand this better, let’s clarify the key components involved in gradient descent:
- Cost Function (Loss Function): We want to minimize this function. It measures how far off our model’s predictions are from the actual targets. The gradient descent algorithm uses the cost function as a feedback signal – it tells us how good or bad the current model is. By reducing the cost, we improve the model. The cost function acts as a continuous surface over the parameter space; our goal is to find the lowest valley on this surface. In each step, gradient descent looks at the slope of this surface (the gradient) to decide how to move. (If the cost function were flat, the gradient would be zero and the algorithm knows it has reached an optimum or a saddle point.)
- Learning Rate (Step Size): The learning rate, often denoted by alpha, is a crucial hyperparameter that determines how big a step we take toward the negative gradient. It essentially scales the gradient before updating the parameters. If the learning rate is too high, gradient descent will take huge leaps down the slope and might overshoot the minimum – it could even diverge, bouncing around without ever settling down. If the learning rate is too low, the algorithm will make tiny baby steps. It will eventually reach the minimum, but it might take an unreasonably long time to get there. Ideally, we choose a learning rate that is “neither too low nor too high” so that the cost steadily decreases with each step towards the minimum.
Putting it together, each iteration of gradient descent updates the parameters theta as:

theta = theta − alpha * ∇J(theta)

This famous gradient descent formula (the update rule) says we take the current parameters and subtract the gradient of the cost (at those parameters) scaled by the learning rate alpha. For example, if w is a weight parameter and J(w) its cost, the update would be:

w = w − alpha * dJ(w)/dw

This simple formula is applied to each parameter simultaneously. Over many iterations, if all goes well, J(theta) will decrease and ideally approach the minimum possible value.
Gradient Descent in Python
To make this concrete, let’s walk through a simple example of gradient descent in Python. We will implement a basic linear regression using gradient descent from scratch. Suppose we have a tiny dataset of points that lie on the line y = 2x + 1. We want to find the line’s parameters (slope and intercept) using gradient descent, without using any machine learning library.
import numpy as np

# Sample data for the line y = 2x + 1
X = np.array([0, 1, 2, 3], dtype=float)
y = np.array([1, 3, 5, 7], dtype=float)  # 1, 3, 5, 7 correspond to 2*0+1, 2*1+1, etc.

# Initialize parameters
w = 0.0  # initial weight (slope)
b = 0.0  # initial bias (intercept)
learning_rate = 0.1

# Perform gradient descent for a fixed number of iterations
for epoch in range(1000):
    # 1. Compute predictions for the current parameters
    y_pred = w * X + b
    # 2. Compute the error (difference between predictions and actual values)
    error = y_pred - y
    # 3. Compute the cost (Mean Squared Error)
    cost = (error ** 2).mean()
    # 4. Compute gradients of the cost w.r.t. w and b
    dw = (2 / len(X)) * np.dot(error, X)  # derivative of MSE w.r.t. w
    db = (2 / len(X)) * error.sum()       # derivative of MSE w.r.t. b
    # 5. Update parameters: move against the gradient
    w -= learning_rate * dw
    b -= learning_rate * db

# After training, print the learned parameters and final cost
print("Learned parameters: w =", w, " b =", b)
print("Final cost:", cost)
The updates w -= learning_rate * dw and b -= learning_rate * db are the gradient descent step. If you run this code, you should find that w approaches 2.0 and b approaches 1.0 (and the cost drops to nearly 0), which matches the line y = 2x + 1 we expected. This little example illustrates how gradient descent in Python can find model parameters through iterative updates.
Note: In practice, you wouldn’t write your own loop for training large models – libraries like TensorFlow and PyTorch handle the gradient computations and updates for you (using highly optimized code, often on GPUs). However, it’s very useful to understand the above process, because under the hood those libraries are performing these same calculations! The example helps reinforce how gradients and learning rates work together in the parameter update step.
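For comparison, here is a hedged sketch of the same linear fit written with PyTorch (assuming it is installed): the framework’s autograd computes the very gradients we derived by hand, and torch.optim.SGD applies the same update rule.

```python
import torch

# Same toy data: points on the line y = 2x + 1
X = torch.tensor([0.0, 1.0, 2.0, 3.0])
y = torch.tensor([1.0, 3.0, 5.0, 7.0])

# Parameters with gradient tracking enabled
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w, b], lr=0.1)

for epoch in range(1000):
    y_pred = w * X + b                    # forward pass
    loss = ((y_pred - y) ** 2).mean()     # Mean Squared Error
    optimizer.zero_grad()                 # clear gradients from the previous step
    loss.backward()                       # autograd computes d(loss)/dw and d(loss)/db
    optimizer.step()                      # the gradient descent update

print(w.item(), b.item())                 # should approach 2.0 and 1.0
```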
Types of Gradient Descent Algorithms
Gradient descent comes in different forms. The variants primarily differ in how much data they use to compute the gradient at each step and in certain tweaks to the update rules. Choosing the right type can impact convergence speed and stability.
Here, we will discuss the three main types of gradient descent (based on batch size) and then discuss some popular variants of the gradient descent algorithm that modify the update rule for faster or more robust convergence.
Batch Gradient Descent
Batch gradient descent (also called “vanilla” gradient descent) uses the entire training dataset to compute the gradient and perform one update per epoch. You sum up or average the gradient over all examples before updating the parameters.
Because batch gradient descent uses all data, each update is slow if the dataset is large, but the direction of the update is very reliable (low variance). It typically yields a stable, smooth convergence since the error gradient is averaged over all data.
Some key points about batch GD:
- It is computationally efficient in terms of making full use of vectorized operations on all data at once, and each update moves the parameters in the overall best direction for the whole dataset.
- It requires the entire dataset to be in memory and accessible, which can be an issue for very large datasets.
- Convergence is stable but can be slow. It may take longer to start seeing improvements because it updates only after going through all examples. However, each step is a solid move downhill.
- One drawback is that it can sometimes get stuck in a suboptimal point. The gradient computed on the whole dataset might consistently point toward a local minimum that is not the global minimum. Once it reaches a certain point, if that point is a local minimum (or a flat plateau), batch GD will stop there with no random variation to help it escape.
- Usually, batch GD is deterministic – given the same data and initial conditions, you get the same path and outcome, since it doesn’t shuffle or randomize anything once the dataset is fixed.
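Note that the NumPy example from earlier is batch gradient descent: every update uses all four data points. A minimal sketch that makes the one-update-per-epoch structure explicit (the helper function name is just for illustration):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
w, b, learning_rate = 0.0, 0.0, 0.1

def full_batch_gradients(w, b, X, y):
    """Gradient of the MSE computed over the ENTIRE dataset."""
    error = (w * X + b) - y
    dw = (2 / len(X)) * np.dot(error, X)
    db = (2 / len(X)) * error.sum()
    return dw, db

for epoch in range(1000):                  # batch GD: exactly one update per epoch
    dw, db = full_batch_gradients(w, b, X, y)
    w -= learning_rate * dw
    b -= learning_rate * db
```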
Stochastic Gradient Descent (SGD)
Stochastic gradient descent (SGD) takes the opposite approach: it updates the parameters one training example at a time. “Stochastic” means randomly determined.
Typically, we shuffle the training data and then, for each example, compute the gradient of the cost just for that single example and update the parameters immediately. In one pass (epoch) through N examples, SGD will perform N tiny updates instead of one big update.
Key characteristics of SGD:
- It’s usually much faster per update because each update only deals with one example (so computing the gradient is quick). You can often cover more ground in parameter space in the same wall-clock time compared to batch GD, at least initially.
- Because it updates so frequently, the error can fluctuate a lot from step to step. The path it takes bounces around the true downhill direction. The cost function might jiggle up and down rather than smoothly decreasing each time. These are the noisy gradients inherent to SGD.
- The noisy updates can actually be a feature: they help SGD avoid getting stuck in local minima. If there’s a small local minimum, the randomness from single-example updates can kick the parameters out. This can potentially allow the algorithm to eventually find a lower global minimum.
- On the downside, the noise means convergence is less stable. It may oscillate around the minimum point and might never exactly settle. Often, one needs to gradually reduce the learning rate over time (learning rate decay) to get SGD to converge to a small region around the minimum.
- Memory-wise, SGD is great – you only need to load one (or a few) training example(s) at a time, so it can handle very large datasets.
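As a rough sketch, here is the same toy linear-regression fit rewritten with per-example SGD updates (the learning rate of 0.01 and the epoch count are assumed values for this example):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
w, b, learning_rate = 0.0, 0.0, 0.01
rng = np.random.default_rng(0)

for epoch in range(200):
    order = rng.permutation(len(X))        # shuffle the examples each epoch
    for i in order:
        error = (w * X[i] + b) - y[i]      # gradient from ONE example only
        dw = 2 * error * X[i]
        db = 2 * error
        w -= learning_rate * dw            # update immediately, then move on
        b -= learning_rate * db

print(w, b)   # a noisier path than batch GD, but it should end up near 2.0 and 1.0
```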
Mini-Batch Gradient Descent
Mini-batch gradient descent is the most commonly used form in practice. It’s a compromise between batch and stochastic methods: you split the training data into small batches (say 32 or 64 examples per batch). For each batch, you compute the gradient on just that subset and update the parameters. So each epoch consists of multiple mini-batch updates.
Why mini-batches? They offer the best of both worlds:
- Using more than one example per update makes the gradient estimate more accurate than a single example. This approach is less noisy than SGD, but still faster to compute than using the whole dataset.
- Mini-batches enable efficient use of vectorized operations and hardware like GPUs. We can parallelize the computations within a batch. Due to parallelism, using a batch of, say, 32 samples can be almost as fast as using 1 sample on a GPU, so we get a big win in terms of more information (32 samples’ worth) for almost the same cost as 1.
- The noise in the updates is reduced, so convergence is faster and more stable than pure SGD in many cases. It still has some randomness to help escape local minima, but not as chaotically random as true SGD.
- We can adjust the batch size to control the trade-off between stability and speed. Common mini-batch sizes range from 16 up to a few hundred (e.g. 32, 64, 128).
Mini-batch gradient descent is indeed the “go-to algorithm” for training neural networks. In modern deep learning frameworks, when you specify a batch size, you implicitly choose mini-batch gradient descent for optimization. For example, a batch size of 64 means each gradient descent step uses 64 samples.
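A hedged sketch of mini-batch gradient descent on synthetic data (the dataset size, noise level, and batch size of 32 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=1000)              # synthetic inputs
y = 2 * X + 1 + rng.normal(0, 0.1, size=1000)  # noisy points around y = 2x + 1

w, b = 0.0, 0.0
learning_rate, batch_size = 0.1, 32

for epoch in range(100):
    order = rng.permutation(len(X))            # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]        # one mini-batch of indices
        error = (w * X[idx] + b) - y[idx]
        dw = (2 / len(idx)) * np.dot(error, X[idx])  # gradient estimated from the batch
        db = (2 / len(idx)) * error.sum()
        w -= learning_rate * dw
        b -= learning_rate * db

print(w, b)   # should land close to 2.0 and 1.0
```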
To sum up the basic types:
- Batch GD: Uses all data each step – stable but potentially slow and memory-heavy.
- Stochastic GD: Uses 1 data point each step – fast per update and can escape local minima, but noisy convergence.
- Mini-Batch GD: Uses a small batch each step – balances speed and stability, and is the most prevalent in practice.

Adaptive Optimizers: RMSprop and Adam
Some methods adapt the learning rate during training, usually per parameter. These are often called adaptive gradient methods. Two very popular ones are RMSprop and Adam.
- RMSprop: Root Mean Square Propagation is an adaptive learning rate method that was never formally published (it was proposed in a lecture by Geoffrey Hinton). RMSprop tackles a problem that a predecessor algorithm, AdaGrad, faced: AdaGrad would monotonically decrease the learning rate for each parameter based on the sum of squared gradients, often decaying it too much. RMSprop fixes this by using a moving average of squared gradients. It keeps a running average of the squared gradient for each parameter and divides the gradient by the root of this average. In formula form, for each parameter (with gradient g_t at step t):

s_t = beta * s_(t-1) + (1 − beta) * g_t^2
theta_t = theta_(t-1) − alpha * g_t / (sqrt(s_t) + epsilon)

This means intuitively: if a parameter’s gradient has been large and volatile, s_t will be large, and thus the effective step (learning rate) for that parameter is reduced. Conversely, if a parameter’s gradient is small, the algorithm can afford a larger relative step for it. RMSprop adaptively tunes the learning rate for each parameter, which helps in situations like ill-conditioned surfaces where some parameters need smaller steps than others.
- Adam: Adam stands for Adaptive Moment Estimation. Adam simultaneously keeps an exponentially decaying average of past gradients (a momentum-like first moment) and of past squared gradients (a second moment, as in RMSprop). It uses the first to set the direction of the update and the second to scale the step size per parameter, with a bias correction for the early steps. Adam is one of the most popular default optimizers in deep learning.
All these methods still fundamentally perform gradient descent – they simply modify the update rule or the step sizes. So they are sometimes called variants of gradient descent or gradient descent optimizers.
You might choose one of these optimizers when training a model, depending on the problem. For instance, one might say “we trained the neural network using gradient descent (Adam optimizer) with a learning rate of 0.001.”
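To make the adaptive idea concrete, here is a minimal NumPy sketch of a single Adam-style update on a parameter vector, following the standard published formulas (the hyperparameter values shown are the common defaults, used here purely for illustration):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: decaying averages of the gradient (m) and of its square (v)."""
    m = beta1 * m + (1 - beta1) * grad         # first moment: smoothed gradient
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment: smoothed squared gradient
    m_hat = m / (1 - beta1 ** t)               # bias correction for the first few steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter scaled step
    return theta, m, v

# Usage: carry m, v, and the step counter t across iterations
theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
grad = np.array([0.5, -2.0, 0.1])              # an example gradient
theta, m, v = adam_step(theta, grad, m, v, t=1)
```

RMSprop is essentially the same idea minus the first-moment average and the bias correction.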
Applications of Gradient Descent
Gradient descent is not just a theoretical concept; it’s the workhorse behind training many machine learning models. Let’s look at a few real-world applications and algorithms where gradient descent plays a key role:
- Linear Regression: When fitting a linear regression model (finding the best-fit line or hyperplane through points), one approach is to minimize the Mean Squared Error between predictions and actual values. This minimization can be done analytically for simple linear regression, but we use gradient descent to find the weights for many features or large datasets. Gradient descent minimizes the MSE loss to find the optimal weights and bias of the linear model. Because the MSE cost surface for linear regression is convex (bowl-shaped), gradient descent is guaranteed to converge to the global minimum (given a reasonably chosen learning rate). This approach scales well to high-dimensional data (where closed-form solutions would be costly).
- Logistic Regression: Logistic regression is used for binary classification (yes/no outcomes). It uses a logistic (sigmoid) function and optimizes a cost function called log loss (or cross-entropy loss). Gradient descent adjusts the model’s coefficients to minimize the log loss, thus improving classification accuracy. The process is similar to linear regression’s training, except the cost function and gradients are a bit different (involving the sigmoid function). Logistic regression often relies on gradient-based optimization because, like linear regression, it’s fast and convex (one global minimum). After enough iterations, gradient descent will find the parameter values that maximize the training labels’ likelihood (equivalent to minimizing log loss).
- Support Vector Machines (SVMs): SVMs in their primal form can be trained by minimizing the hinge loss with regularization. While classic SVM training often uses quadratic programming, one can also use gradient or sub-gradient descent on the hinge loss. Many modern large-scale SVM solvers use stochastic gradient descent. For SVMs, gradient descent optimizes the hinge loss to find the maximum-margin hyperplane.
- Neural Networks (Deep Learning): Perhaps the most famous use of gradient descent is in training neural networks. A neural network has potentially millions of parameters (weights and biases across many layers). We cannot solve for those analytically, so we use gradient descent combined with backpropagation. Backpropagation is the technique to compute the gradient of the network’s loss function with respect to each weight, essentially applying the chain rule of calculus through the layers. Once those gradients are computed, gradient descent (usually in the form of mini-batch updates with an optimizer like SGD/Adam) is used to update the weights. This repeats for many iterations (epochs) until the network’s loss is minimized. Without gradient descent, deep learning as we know it would not be possible – it’s literally how the network “learns” from data by gradually improving the weights to reduce error. For example, in a CNN recognizing images or an RNN for language, the training loop involves three main steps: first, compute the outputs and the loss on a batch; second, backpropagate to obtain the gradients; third, apply the gradient descent update. This entire process occurs thousands or millions of times (see the sketch below).
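As a hedged sketch of those three steps in PyTorch (a tiny fully connected network on random tensors, with layer sizes and batch size chosen only for illustration):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

inputs = torch.randn(32, 20)             # one mini-batch of 32 fake examples
targets = torch.randint(0, 2, (32,))     # fake class labels

# 1. Forward pass: compute outputs and the loss on the batch
loss = loss_fn(model(inputs), targets)

# 2. Backpropagation: compute gradients of the loss w.r.t. every weight
optimizer.zero_grad()
loss.backward()

# 3. Gradient descent update: adjust the weights to reduce the loss
optimizer.step()
```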
Beyond these, many other algorithms rely on gradient descent:
- Matrix factorization in recommender systems (to learn latent factors) uses gradient descent to minimize error in predicted ratings.
- Word2Vec embeddings are trained with stochastic gradient descent (typically combined with a trick called negative sampling that makes the objective cheap enough to optimize at scale).
- Reinforcement learning policies are often optimized via gradient-based methods (policy gradients, etc.).
- Any time you see the term “optimizer” in machine learning libraries, it’s typically some variant of gradient descent under the hood, adjusting parameters to minimize a loss.
Gradient descent’s ubiquity comes from its generality: as long as you can compute a gradient of your objective, you can plug it into gradient descent to improve the objective. This is why it’s the default tool for so many tasks.
From simple regressions to complex deep learning models, understanding gradient descent gives insight into how these models learn.
Challenges in Gradient Descent and How to Overcome Them
While gradient descent is powerful, it is not without challenges. When training models, you often encounter a few common issues:
Local Minima and Saddle Points
If the cost function is not convex (which is the case for most deep learning problems), there may be multiple local minima – points where the cost is lower than neighboring points but not the lowest overall.
Similarly, there can be saddle points, where the gradient is zero but the point is not a minimum. Gradient descent can get “stuck” at local minima or plateaus (saddle points) instead of finding the global minimum.
This is not an issue for convex problems (like linear regression) – only one minimum, and gradient descent will find it.
However, for neural networks and other complex models, the loss surface is highly nonconvex, with many pits and flat regions. Training may stall if gradient descent finds a spot where the gradients are zero (or very tiny) but which is not the best solution.
- Local minima: Think of these as false valleys. The algorithm might settle in a shallow valley when a deeper one exists elsewhere.
- Saddle points: The gradient can be zero here because the slope ascends in one direction and descends in another, flattening out at the saddle. It’s not a stable minimum, but gradient descent has no slope signal to move away.
How to overcome local minima and saddle points?
Fortunately, in very high-dimensional problems like neural networks, strict local minima are considered less of an issue than saddle points, which frequently cause long plateaus. Here are some tips:
- Using stochasticity (SGD or mini-batches) can help. The noise in SGD can jostle the parameters out of a local minimum. If one batch yields zero gradient, the next batch (different data) might not, giving a push.
- Random restarts: For some problems, you can try starting the training from different initial weights. If one run gets stuck badly, another might find a better path.
- In practice, algorithms like Adam also have some implicit ability to navigate such issues better than plain SGD, due to adaptive step sizes.
- It’s also common to simply run for more epochs or adjust learning rates if you suspect a saddle (e.g., a long plateau then eventually a drop in loss might indicate you were stuck on a saddle for a while).
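As a toy illustration of random restarts, here is a sketch on a made-up one-dimensional function f(x) = x⁴ − 3x² + x, chosen only because it has two valleys of different depth:

```python
import numpy as np

def f(x):
    return x ** 4 - 3 * x ** 2 + x       # two local minima; the left one is deeper

def grad(x):
    return 4 * x ** 3 - 6 * x + 1

rng = np.random.default_rng(0)
best_x, best_f = None, float("inf")

for restart in range(5):                  # several runs from random starting points
    x = rng.uniform(-2, 2)
    for step in range(500):
        x -= 0.01 * grad(x)               # plain gradient descent
    if f(x) < best_f:                     # keep the best solution found so far
        best_x, best_f = x, f(x)

print(best_x, best_f)                     # ideally the deeper valley near x ≈ -1.3
```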
Vanishing and Exploding Gradients
These problems are specific to deep neural networks, especially those with many layers or recurrent networks with long sequences:
- Vanishing Gradients: This occurs when gradients become extremely small (tending to zero) in earlier layers of a deep network. During backpropagation, the gradient is propagated backwards from the output layer to the input layer. If the gradient diminishes at each layer (for example, due to certain activation functions like sigmoid, which have a derivative < 1), it may be nearly zero by the time it reaches the first layer. This means the early layers learn very slowly or not at all, since essentially no signal is getting through to update their weights. In the worst case, the weights stop changing (the model “freezes”).
- Exploding Gradients: The opposite scenario – gradients that grow exponentially large as they propagate backward. This can happen if the weights in earlier layers are large or the chain of derivatives multiplies by a large number. Exploding gradients can cause numerical instability, e.g., weights becoming NaN (not-a-number) because the update is so large it overshoots to infinity. The model can diverge completely when gradients explode.
Both issues concern the numerical stability of gradient descent in deep networks. They are not problems with the concept of gradient descent per se, but with making it work in practice on deep architectures.
Solutions for vanishing gradients:
- Activation functions: In deep layers, use ReLU (Rectified Linear Unit) or variants instead of sigmoid/tanh. ReLUs have gradients of 0 or 1, which helps mitigate vanishing gradients.
- Initialization: Proper weight initialization (e.g., Xavier/He initialization) can keep initial gradients in a reasonable range.
- Normalization: Batch Normalization layers can help maintain stable gradients by normalizing inputs to each layer.
- Architecture: Use residual connections (as in ResNets) that provide alternate gradient paths that bypass several layers. This prevents the gradient from vanishing when it reaches earlier layers. In recurrent networks, architectures like LSTM/GRU were designed to preserve gradients over long sequences.
- Modern neural network design has largely addressed vanishing gradients through these techniques, allowing very deep networks to train.
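A quick back-of-the-envelope sketch of why sigmoids cause the problem these fixes address: the sigmoid’s derivative is at most 0.25, so even in the best case the backpropagated signal shrinks geometrically with depth (this ignores the weights; the 20-layer depth is an arbitrary illustrative number):

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)                    # maximum value 0.25, reached at z = 0

per_layer = sigmoid_derivative(0.0)       # 0.25, the best possible case
print(per_layer ** 20)                    # ~9e-13: product of 20 such factors
```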
Solutions for exploding gradients:
- Gradient Clipping: This is a direct solution – if the gradient’s norm gets above a certain threshold, rescale the gradient so that its norm is capped at that threshold. By capping the gradient, you prevent any single update from blowing up the weights. This is commonly used in RNN training. For example, one might clip the gradient norm to (say) 5: if the gradients try to explode, they are scaled back down to a manageable size.
- Lower learning rate: Sometimes, exploding gradients are a sign that the learning rate is too high during that training phase. Reducing the learning rate can help stability.
- Rescaling inputs or using normalization can also help ensure gradients don’t suddenly spike due to distribution issues.
If you encounter NaNs or wildly oscillating loss, exploding gradients are a likely culprit, and gradient clipping is a quick fix.
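In PyTorch, for instance, clipping is a single extra line in the training step (a hedged sketch; the linear model, random data, and max_norm of 5.0 are assumptions for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                           # stand-in for any network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs, targets = torch.randn(8, 10), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(inputs), targets)

optimizer.zero_grad()
loss.backward()
# Rescale all gradients if their combined norm exceeds 5.0, then update as usual
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```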
Tuning the Learning Rate
Choosing the right learning rate is often the trickiest part of using gradient descent effectively. A poor choice can make training either very slow or unstable. Many instability problems (a diverging or wildly oscillating loss) come down to the learning rate being too high; conversely, slowness or getting seemingly stuck often means the learning rate is too low.
Some tips for learning rate tuning:
- Start with a moderate value: For many problems, a common starting point is alpha = 0.01, which you can adjust if needed. For some algorithms or data, 0.1 or 0.001 might be more appropriate; it’s problem-dependent.
- Observe the training loss curve: If the loss is not decreasing at all or is even increasing, your learning rate might be too high (the algorithm is diverging or bouncing around). If the loss is decreasing but very slowly, you might increase the rate.
- Use learning rate schedules: It’s common to reduce the learning rate as training progresses. For example, start at 0.1 and reduce to 0.01 after a certain number of epochs, etc. Many training setups use an initial higher learning rate for quick progress, then a lower rate for fine-tuning.
- Adaptive methods (Adam, etc.): Using an optimizer like Adam can relieve some of the burden of picking a perfect learning rate, because these methods adjust the effective learning rate during training. That said, even Adam has a base learning rate that usually needs tuning.
- Learning rate finder: There are techniques where you can run a few epochs while gradually increasing the learning rate to see at what value the loss starts exploding. This can help bracket a good learning rate.
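As a minimal sketch of a step-decay schedule (the starting rate, decay factor, and interval below are arbitrary illustrative choices):

```python
initial_lr = 0.1

def step_decay(epoch, drop=0.1, epochs_per_drop=30):
    """Divide the learning rate by 10 every 30 epochs (illustrative values)."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

for epoch in [0, 29, 30, 60, 90]:
    print(epoch, step_decay(epoch))   # approximately 0.1, 0.1, 0.01, 0.001, 0.0001
```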
Conclusion
In closing, mastering gradient descent means you have mastered the fundamental process by which machines learn from data. It provides a unifying understanding from the simplest linear regression to the deepest neural network. With this guide, you should be able to answer “what is gradient descent in machine learning?” and appreciate how to use it effectively and troubleshoot it. As a next step, you might apply these concepts: try implementing gradient descent for a model of your choice, or experiment with different variants on a dataset to see their impact. Gradient descent is a rich topic, and the best way to solidify your understanding is through practice and experimentation.