Introduction
- Machine learning borrows heavily from fields such as statistics and computer science.
- In machine learning, models learn rules from data.
- In supervised learning, the target in our training data is labelled.
- In everyday usage, 'A.I.' has become a near-synonym for machine learning, although A.I. is strictly the broader field.
- A.G.I. is the loftier goal of achieving human-like intelligence.
Data preparation
- Data pre-processing is arguably the most important task in machine learning.
- SQL is the tool that we use to extract data from database systems.
- Data is typically partitioned into training and test sets.
- Setting random states helps to promote reproducibility.
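A minimal sketch of the points above, using scikit-learn's `train_test_split` on toy arrays (the data here is illustrative, not the course dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)                 # 10 targets

# random_state fixes the shuffle, so the partition is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```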
Learning
- Loss functions quantify how far a model's predictions fall from the truth, giving us a way to define a 'good' model.
- \(y\) is a known target. \(\hat{y}\) ('y-hat') is a prediction.
- Mean squared error is an example of a loss function.
- After defining a loss function, we search for the optimal solution in a process known as ‘training’.
- Optimisation is at the heart of machine learning.
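The mean squared error bullet can be made concrete with a few lines of NumPy; the targets and predictions below are made up for illustration:

```python
import numpy as np

y = np.array([3.0, 5.0, 2.5])      # known targets
y_hat = np.array([2.5, 5.0, 4.0])  # model predictions

# mean squared error: the average of the squared residuals
mse = np.mean((y - y_hat) ** 2)
print(mse)
```

Training then amounts to searching for the model parameters that minimise this quantity.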
Modelling
- Example: fit a model to a single feature (the APACHE score), compute the fitted values, and plot the result; note that the reshape step is only needed when fitting a single input variable.
- Linear regression is a popular model for regression tasks.
- Logistic regression is a popular model for classification tasks.
- Logistic regression outputs probabilities that can be mapped to a predicted class.
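The modelling steps outlined in the notes (single APACHE-score feature, reshape, fit, predict, plot) can be sketched as follows; the scores and length-of-stay target here are synthetic stand-ins for the course data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
apache_score = rng.integers(0, 40, size=50).astype(float)  # synthetic scores
los = 2.0 + 0.5 * apache_score + rng.normal(0, 1, size=50)  # hypothetical target

# use a single feature (apache score)
# note: remove the reshape if fitting to >1 input variable
X = apache_score.reshape(-1, 1)

# fit the model to our data
model = LinearRegression()
model.fit(X, los)

# get the y values
y_hat = model.predict(X)

# plot (commented out to keep the sketch non-interactive)
# import matplotlib.pyplot as plt
# plt.scatter(apache_score, los)
# plt.plot(apache_score, y_hat)
# plt.show()
print(round(model.coef_[0], 2))
```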
Validation
- Validation sets are used during model development, allowing models to be tested prior to testing on a held-out set.
- Cross-validation is a resampling technique that creates multiple validation sets.
- Cross-validation can help to avoid overfitting.
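A short sketch of k-fold cross-validation with scikit-learn, on a synthetic classification problem (the data is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# 5-fold cross-validation: five validation sets, five scores
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())
```

Each fold serves once as a validation set while the other four are used for training, so no single held-out split dominates model selection.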
Evaluation
- Confusion matrices are the basis for many popular performance metrics.
- AUROC is the area under the receiver operating characteristic curve; 0.5 indicates performance no better than chance, while 1.0 is perfect.
- TP (True Positive) means a positive prediction that matched a positive label.
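The metrics above can be computed directly from a small hand-made example; the labels and probabilities below are illustrative only:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.6, 0.8, 0.9, 0.4, 0.1])

# map probabilities to classes at a 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)

# confusion matrix counts: true/false positives and negatives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# AUROC is threshold-free: computed from the probabilities themselves
auroc = roc_auc_score(y_true, y_prob)
print(tp, fp, fn, tn, round(auroc, 3))
```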
Bootstrapping
- Bootstrapping is a resampling technique, sometimes confused with cross-validation.
- Bootstrapping allows us to generate a distribution of estimates, rather than a single point estimate.
- Bootstrapping allows us to estimate uncertainty, allowing computation of confidence intervals.
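A minimal bootstrap sketch in NumPy: resample with replacement many times, collect the estimate from each resample, and read a confidence interval off the resulting distribution (the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=5.0, scale=2.0, size=200)  # synthetic observations

# resample with replacement to build a distribution of mean estimates
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

# 95% confidence interval from the percentiles of the bootstrap distribution
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(round(lo, 2), round(hi, 2))
```

Note the contrast with cross-validation: bootstrapping resamples *with* replacement to quantify uncertainty, rather than partitioning the data into folds.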
Data leakage
- Leakage occurs when training data is contaminated with information that is not available at prediction time.
- Leakage leads to over-optimistic expectations of performance.
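One common leakage pattern is fitting a scaler on the full dataset before splitting, which lets test-set statistics reach the training step. A sketch of the safe approach, using a scikit-learn `Pipeline` so the scaler is fit on training data only (synthetic data for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the scaler inside the pipeline is fit only on X_train,
# so no test-set information leaks into training
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
acc = pipe.score(X_test, y_test)
print(acc)
```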