ML model
What is a model ?¶
- A model is simply a system for mapping inputs to outputs.
- A model learns relationships between the inputs, called features, and outputs, called labels, from a training dataset.
- Models are useful because we can use them to predict the values of outputs for new data points given the inputs (test data).
What is training a model ?¶
- A model learns relationships between the inputs, called features, and outputs, called labels, from a training dataset.
- During training, the model is given both the features and the labels and learns how to map the former to the latter.
How to evaluate a model ?¶
- A trained model is evaluated on a testing dataset, where we only give it the features and it makes predictions.
- We compare the predictions with the known labels for the testing set to calculate accuracy.
What are the two prediction errors for a model ?¶
- Performance
- Accuracy
What is Model Validation ?¶
- We need some sort of pre-test to use for model optimization and evaluate. This pre-test is known as a validation dataset.
- A basic approach would be to use a validation set in addition to the training and testing set. This presents a few problems though: we could just end up overfitting to the validation set, and we would have less training data. A smarter implementation of the validation concept is k-fold cross-validation.
- What is k-fold cross-validation ?
- The idea is straightforward: rather than using a separate validation set, we split the training set into a number of subsets, called folds.
-
Example:
- Let’s use five folds
- We perform a series of train and evaluate cycles where each time we train on 4 of the folds and test on the 5th, called the hold-out dataset.
- We repeat this cycle 5 times, each time using a different fold for evaluation.
- At the end, we average the scores for each of the folds to determine the overall performance of a given model.
- This allows us to optimize the model before deployment without having to use additional data.
