Introduction


  • Machine learning borrows heavily from fields such as statistics and computer science.
  • In machine learning, models learn rules from data.
  • In supervised learning, the target in our training data is labelled.
  • In popular usage, A.I. has become a synonym for machine learning.
  • A.G.I. is the loftier goal of achieving human-like intelligence.

Data preparation


  • Data pre-processing is arguably the most important task in machine learning.
  • SQL is the tool that we use to extract data from database systems.
  • Data is typically partitioned into training and test sets (as sketched below).
  • Setting random states helps to promote reproducibility.
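
A minimal sketch of the split with scikit-learn, using made-up data and hypothetical column names (apache_score, died):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for rows extracted from a database.
df = pd.DataFrame({
    "apache_score": [5, 12, 9, 21, 7, 15, 3, 18],
    "died":         [0,  1, 0,  1, 0,  1, 0,  1],
})

X = df[["apache_score"]]
y = df["died"]

# Hold out 25% of rows for testing; a fixed random_state makes the
# partition reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```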

Learning


  • Loss functions allow us to define what makes a good model by quantifying prediction error.
  • \(y\) is a known target; \(\hat{y}\) ('y hat') is a prediction.
  • Mean squared error is an example of a loss function (see the sketch after this list).
  • After defining a loss function, we search for the optimal solution in a process known as ‘training’.
  • Optimisation is at the heart of machine learning.
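
As a concrete example, mean squared error averages the squared differences between targets and predictions: \(\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\). A minimal sketch with made-up values:

```python
import numpy as np

# Known targets (y) and model predictions (y_hat); values are made up.
y = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# Mean squared error: average squared distance between target and prediction.
mse = np.mean((y - y_hat) ** 2)
print(mse)  # 0.875
```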

Modelling


  • Linear regression is a popular model for regression tasks.
  • Logistic regression is a popular model for classification tasks.
  • Logistic regression outputs probabilities that can be mapped to a prediction class (as sketched below).
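
A minimal sketch of both models, fitting to a single made-up APACHE-score feature (all values, variable names, and outcomes here are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: APACHE scores with two outcomes of interest.
apache_score = np.array([3, 5, 7, 9, 12, 15, 18, 21])
los_days = np.array([2, 3, 3, 5, 6, 8, 9, 12])  # length of stay (regression)
died = np.array([0, 0, 0, 0, 1, 1, 1, 1])       # mortality (classification)

# Use a single feature (apache score).
# Note: remove the reshape if fitting to >1 input variable.
X = apache_score.reshape(-1, 1)

# Fit the linear regression model to our data, get the y values, and plot.
reg = LinearRegression().fit(X, los_days)
plt.scatter(apache_score, los_days)
plt.plot(apache_score, reg.predict(X))
plt.xlabel("APACHE score")
plt.ylabel("Length of stay (days)")
plt.show()

# Logistic regression outputs probabilities, mapped here to a class
# with the conventional 0.5 threshold.
clf = LogisticRegression().fit(X, died)
probs = clf.predict_proba(X)[:, 1]        # probability of the positive class
predicted_class = (probs >= 0.5).astype(int)
```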

Validation


  • Validation sets are used during model development, allowing models to be tested prior to testing on a held-out set.
  • Cross-validation is a resampling technique that creates multiple validation sets (sketched below).
  • Cross-validation can help to avoid overfitting.
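
A minimal sketch of 5-fold cross-validation with scikit-learn, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data standing in for a real training set.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# 5-fold cross-validation: the data is split into 5 folds, and each fold
# takes a turn as the validation set while the model trains on the rest.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)         # one score per fold
print(scores.mean())  # average performance across folds
```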

Evaluation


  • Confusion matrices are the basis for many popular performance metrics.
  • AUROC is the area under the receiver operating characteristic curve. 0.5 is no better than chance!
  • TP is short for True Positive: we predicted the positive class and the prediction was correct (see the sketch below).
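
A minimal sketch of both ideas, using made-up labels and probabilities:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical true labels, predicted labels, and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_prob = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))

# AUROC from predicted probabilities: 0.5 is no better than chance,
# 1.0 is perfect discrimination.
print(roc_auc_score(y_true, y_prob))
```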

Bootstrapping


  • Bootstrapping is a resampling technique, sometimes confused with cross-validation.
  • Bootstrapping allows us to generate a distribution of estimates, rather than a single point estimate.
  • Bootstrapping lets us estimate uncertainty, enabling computation of confidence intervals (see the sketch below).
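
A minimal sketch, bootstrapping the mean of a made-up sample to obtain a 95% confidence interval:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample of model scores (or any statistic of interest).
sample = np.array([0.71, 0.74, 0.68, 0.80, 0.77, 0.73, 0.69, 0.76])

# Resample with replacement many times to build a distribution of the mean,
# rather than a single point estimate.
boot_means = [
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
]

# A 95% confidence interval from the 2.5th and 97.5th percentiles.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={np.mean(sample):.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```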

Data leakage


  • Leakage occurs when training data is contaminated with information that is not available at prediction time.
  • Leakage leads to over-optimistic expectations of performance (see the sketch below).
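
As one common example (my choice here, not the only form of leakage): fitting a scaler on the full dataset before splitting leaks test-set statistics into training. A minimal sketch of the safe pattern, using synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Leaky: scaling the *whole* dataset lets test-set statistics contaminate
# the training data.
# X_scaled = StandardScaler().fit_transform(X)  # don't do this before splitting

# Safer: split first, then fit preprocessing on the training data only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)          # scaler is fitted on training data only
print(model.score(X_test, y_test))   # evaluated on untouched test data
```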