Generalization refers to the ability of a model to make accurate predictions on unseen data (also called the test set). Although the test set and the training set are typically drawn from the same data distribution, the model may perform very well on the training set yet poorly on the test set, which is referred to as the overfitting problem. Alternatively, the model may perform poorly on both the training and the test set, which is referred to as the underfitting problem.
In order to select a model with high generalizability, we need to find a balance between bias and variance. Underfitting, or high bias, occurs when the model is not complex enough to capture the underlying structure of the data. In the opposite case, the model is too complex and learns the noise and fluctuations in the data, making it perform very well on the training set but poorly on the test set. This is where the model has high variance, meaning it is overly sensitive to the training data: a small change in the data would change the fitted curve, which effectively makes the model not generalizable. Note that if the dataset is sufficiently large, a complex model is typically able to find the underlying pattern of the data and ignore the outliers. The goal is to tune the model complexity, using the bias-variance tradeoff, to find a balance where both variance and bias are as low as possible.
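As a minimal, hypothetical illustration of this tradeoff, the following Python sketch (assuming NumPy and scikit-learn; the degrees, sample size, and noise level are arbitrary illustrative choices, not values from this text) fits polynomials of increasing degree to noisy data and compares training and test errors: a low-degree fit tends to underfit (high bias), while a very high-degree fit tends to overfit (high variance).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a smooth underlying function plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in [1, 4, 15]:  # low, moderate, and high model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # Degree 1 typically underfits (both errors high); degree 15 typically
    # overfits (low training error, noticeably higher test error).
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```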
A complementary principle to the bias-variance tradeoff is the double descent phenomenon (schematic view in the figure below), which concerns model complexity measured by the number of parameters. It demonstrates that increasing the number of parameters in a model can lead to a second descent in the test error curve, where the error decreases again after initially increasing. This phenomenon challenges the conventional belief that, as model complexity gradually increases, the test error follows an approximately U-shaped curve as a result of the bias-variance tradeoff. It has also been demonstrated that the observed peak of the test error in the double descent phenomenon can be mitigated by a tuned regularization technique, meaning that it is possible to improve the model's test performance even in the overparameterized regime, where the number of parameters exceeds the number of data points.
Learning theory is the branch of machine learning that studies the assumptions, principles, and limitations of learning algorithms. It provides a framework for analyzing the performance of these algorithms and establishing mathematical foundations. Learning theory helps in comparing and evaluating different machine learning algorithms and guides the development of new ones. It aims to bridge the gap between empirical observations and rigorous mathematical principles to deepen our understanding of the learning process.
The performance of a machine learning (ML) algorithm can be measured by its training error
(how well it performs on the training data) and generalization error
(how well it performs on unseen data). Alongside this, there is a bias error
that represents the inherent limitations of an ML algorithm in capturing the underlying structure of the data. To better understand these aspects, several important questions arise: How can we quantify the tradeoff between bias and variance? How can we assess an algorithm's performance on unseen data? And what conditions ensure that an algorithm works well?
To address these questions, we can rely on mathematical tools such as the union bound and the Hoeffding inequality. The union bound states that the probability that at least one of several events occurs is at most the sum of their individual probabilities; in learning theory, it lets us control the gap between training and generalization error simultaneously for every hypothesis under consideration. The Hoeffding inequality, in turn, bounds the probability that the training error of a single hypothesis deviates from its generalization error by more than a given margin, and shows how this probability shrinks as the size of the training set grows. Combining the two yields a sample complexity bound, indicating the minimum amount of training data required to ensure that the training error reliably reflects the algorithm's performance on unseen data.
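As a brief worked statement of these bounds (following the standard finite-hypothesis-class argument; the symbols below, such as the hypothesis class size k, the tolerance γ, and the failure probability δ, are introduced here for illustration and are not notation used elsewhere in this text):

```latex
% Hoeffding inequality for a single hypothesis h trained on m i.i.d. examples:
% the training error \hat{\varepsilon}(h) concentrates around the
% generalization error \varepsilon(h).
P\left( |\hat{\varepsilon}(h) - \varepsilon(h)| > \gamma \right) \le 2\exp(-2\gamma^{2} m)

% Union bound over a finite hypothesis class \mathcal{H} with k = |\mathcal{H}|:
P\left( \exists\, h \in \mathcal{H} : |\hat{\varepsilon}(h) - \varepsilon(h)| > \gamma \right) \le 2k\exp(-2\gamma^{2} m)

% Setting the right-hand side to \delta and solving for m gives a sample
% complexity bound: with probability at least 1 - \delta, every hypothesis has
% training error within \gamma of its generalization error provided
m \ge \frac{1}{2\gamma^{2}} \log\frac{2k}{\delta}
```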
Regularization is a technique employed to reduce model complexity and overcome overfitting. This is achieved by modifying the loss function to incorporate a regularization term (also called a regularizer), whose strength is controlled by a regularization parameter.
L2 regularization, also known as L2-norm regularization, is the most commonly used regularization technique. It encourages the optimizer to find parameter values that result in a smaller L2 norm. The L2 norm is calculated as the square root of the sum of the squared values of the weight parameters. By adding the L2 norm (in practice, usually its square) multiplied by the regularization parameter λ to the loss function, L2 regularization penalizes larger parameter values and promotes smaller ones.
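As a compact illustration (the symbols below, such as the original loss L and the weight vector w, are generic placeholders rather than notation defined in this text), the L2-regularized objective in its common squared form can be written as:

```latex
% L2 norm of the weight vector w = (w_1, ..., w_d):
\lVert \mathbf{w} \rVert_{2} = \sqrt{\textstyle\sum_{j=1}^{d} w_{j}^{2}}

% Regularized objective: the original loss plus the squared L2 norm,
% weighted by the regularization parameter \lambda \ge 0.
L_{\text{reg}}(\mathbf{w}) = L(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_{2}^{2}
```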
Note that regularization can also occur implicitly during the optimization process, without explicitly incorporating a regularizer into the loss function. The choice of optimizer can introduce additional structure or bias into the model, leading to improved performance on unseen data. This implicit regularization effect is particularly observed in deep neural networks, and it depends on the specific optimizer used during model training.
Cross-validation
is a widely used technique for model selection, ML algorithm selection, hyperparameter tuning, and assessing the generalization performance of a selected model. The classic form of cross-validation involves dividing the training set into k folds (also called k-fold cross-validation) and iteratively training the model on k-1 folds while evaluating its performance on the remaining fold. After k iterations, the model with the highest average performance, based on the chosen metrics, is selected. It's important to note that after the model selection process is completed, the selected model should also be evaluated on a separate test set (also known as the holdout set) to assess its performance on unseen data. This ensures a fair evaluation of the model's generalization capability. Therefore, the dataset is divided into training and test sets, cross-validation is performed on the training set to select the best model, and finally, the performance of the selected model is assessed on the test set.
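As a minimal sketch of this workflow (assuming scikit-learn; the ridge model, the grid of α values, and the R² scoring metric are arbitrary illustrative choices, not recommendations from this text), the following Python snippet holds out a test set, uses 5-fold cross-validation on the training set to pick a hyperparameter, and only then evaluates the chosen model on the test set:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import Ridge

# Synthetic regression data (illustrative only).
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# Hold out a test set; cross-validation is performed on the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:  # candidate regularization strengths
    # Average validation score over the 5 folds for this hyperparameter.
    scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=cv, scoring="r2")
    if scores.mean() > best_score:
        best_alpha, best_score = alpha, scores.mean()

# Refit the selected model on the full training set and evaluate it once on the held-out test set.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print(f"best alpha={best_alpha}, cross-val R^2={best_score:.3f}, "
      f"test R^2={final_model.score(X_test, y_test):.3f}")
```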
Let's introduce Bayesian statistics, which offers an alternative approach to parameter estimation that is particularly useful for mitigating overfitting and incorporating prior knowledge. In Bayesian statistics, the parameters themselves are treated as random variables, and prior knowledge about them is incorporated through probability distributions, such as the Gaussian distribution.
To estimate the parameters, we define a likelihood function that describes the probability of observing the data given the model parameters. The likelihood function quantifies the fit between the model and the observed data.
The prior distribution and the likelihood function are combined using Bayes' theorem
, resulting in the posterior distribution. The posterior distribution represents the updated beliefs about the parameter values after incorporating information from the observed data.
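In symbols (using generic notation introduced here for illustration, with θ for the parameters and D for the observed data), Bayes' theorem combines the prior and the likelihood as follows:

```latex
% Posterior = likelihood x prior, normalized by the evidence (marginal likelihood).
p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}
\propto p(\mathcal{D} \mid \theta)\, p(\theta)
```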
To find the parameter values, the maximum a posteriori
(MAP
) point estimate technique is often used. This technique aims to find the parameter values that maximize the probability density function
(PDF
) of the posterior distribution; it can be viewed as a maximum likelihood estimate that additionally takes the prior information into account.
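In the same illustrative notation as above, the MAP estimate maximizes the posterior, and since the evidence p(D) does not depend on θ, this is equivalent to maximizing the likelihood times the prior:

```latex
\hat{\theta}_{\text{MAP}}
= \arg\max_{\theta}\; p(\theta \mid \mathcal{D})
= \arg\max_{\theta}\; p(\mathcal{D} \mid \theta)\, p(\theta)

% Note: with a flat (uniform) prior p(\theta) \propto 1, the MAP estimate
% reduces to the maximum likelihood estimate.
```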
It's important to note that each time new data is introduced, the posterior distribution is updated, allowing for incremental learning and updating of parameter estimates based on new evidence.