PUBH5006: Assessment 2: Machine learning challenge
Introduction:
In this assessment, imagine you have taken on an analyst position in a hospital and have been tasked with building a prediction model for AIE (adverse infective event). You have been asked to find the best performing model out of a regularised regression, gradient boosted trees (GBT) and a neural network. After you select your best model, you will then need to assess its calibration, apply a re-calibration method, and explain the strengths and limitations of your model.
The best model:
The best model will be the one with the highest test set AUC which is also not over-fit (i.e., its train set AUC is no more than 3 points higher than its test set AUC).
There is no single best model for these data; it is up to you to experiment with different hyperparameters to find the best model.
Assessment sections:
Section 1: build a regularised regression (using the glmnet package in caret), assess discrimination (i.e., AUC) and over-fit, and make some effort to interpret the model (e.g., discuss the ten strongest predictors).
Section 2: build a gradient boosted tree (using the XGBoost package in caret), assess discrimination (i.e., AUC) and over-fit, and make some effort to interpret the model (e.g., discuss the ten strongest predictors according to gain).
Section 3: build an ANN (using the ANN2 package) which improves upon the baseline, and assess discrimination (i.e., AUC) and over-fit (do not try to interpret the model or comment on predictor importance; we have not learnt this yet for neural networks!).
Section 4: Select the best performing model from those above, assess its calibration, re-calibrate if necessary, then discuss some strengths/limitations of the model (these could be taken from class or the external literature).
Important details:
The assessment is worth 30 marks toward your overall grade.
Marks are allocated as shown above the boxes in which you include your answers.
Additional marks will be allocated for originality, such as skilfully experimenting with hyperparameters not used in class and discussing strengths/limitations of models based on knowledge gained outside of class.
This assessment was made available at 12 pm Friday 22nd September and is due in two weeks, by 4 pm Friday 6th October.
Setting up the data (all these steps are demonstrated in the week 8 exercise notes!):
Recode the nominal categorical variables to dummy variables.
Remove variables with near zero variance.
Rescale the data (if needed by that algorithm).
Randomly partition the data, stratified by the outcome, using the createDataPartition function with a 70/30 train/test split [use set.seed(748) before partitioning].
Prepare the train and test sets for caret or ANN2 (remember to remove the patient ID, which is the first column, and all outcomes from the train.x and test.x matrices).
Remember that for caret the outcome must be a factor, and for ANN2 it must be a matrix (an illustrative sketch of these steps follows this list).
When applying re-calibration, divide the test set into a calibration train and a calibration test set.
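To illustrate, here is a minimal sketch of the set-up steps, assuming the raw data sit in a data frame called dat with the patient ID in column 1 and the outcome in a column named AIE (both names are placeholders; substitute your own):

library(caret)

# Separate the outcome and drop the patient ID (first column)
y <- factor(dat$AIE)                      # caret needs a factor outcome
X <- dat[, -1]                            # drop patient ID
X$AIE <- NULL                             # keep the outcome out of the predictors

# Recode nominal categorical variables to dummy variables
dmy <- dummyVars(~ ., data = X)
X   <- data.frame(predict(dmy, newdata = X))

# Remove variables with near zero variance
nzv <- nearZeroVar(X)
if (length(nzv) > 0) X <- X[, -nzv]

# Stratified 70/30 train/test split
set.seed(748)
idx     <- createDataPartition(y, p = 0.7, list = FALSE)
train.x <- X[idx, ];  train.y <- y[idx]
test.x  <- X[-idx, ]; test.y  <- y[-idx]

# ANN2 needs numeric matrices rather than data frames/factors
train.x.m <- as.matrix(train.x); train.y.m <- as.matrix(as.numeric(train.y) - 1)
test.x.m  <- as.matrix(test.x);  test.y.m  <- as.matrix(as.numeric(test.y) - 1)

# Rescaling can be applied where the algorithm needs it, e.g. via
# preProcess = c("center", "scale") in caret::train, or ANN2's
# default standardize = TRUE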
Section 1: Regularised regression (9 marks)
In the box below include all code and output from your final regularised regression (you don't need to include code from all your experiments) plus your reasoning related to the regularised regression, with marks attributed for the following (an illustrative code sketch follows this list):
2 marks: correct data pre-processing steps.
3 marks: construct a good performing model, assess model discrimination (i.e., AUC) and comment on over-fit.
2 marks: originality in approach (if you tried something original but it didn't work and therefore did not feature in your final model, explain somewhere what you tried).
2 marks: list the 10 top influential predictors.
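As a starting point only, here is a hedged sketch of one possible glmnet fit; the tuning grid is illustrative (you should experiment beyond it), and train.x/train.y are the objects built in the set-up sketch above.

library(caret)
library(pROC)

# caret needs syntactically valid class labels when classProbs = TRUE
levels(train.y) <- make.names(levels(train.y))
levels(test.y)  <- make.names(levels(test.y))

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Elastic-net grid: alpha mixes ridge (0) and lasso (1); lambda sets strength
grid <- expand.grid(alpha  = c(0, 0.5, 1),
                    lambda = 10^seq(-4, 0, length.out = 20))

fit.glmnet <- train(x = train.x, y = train.y, method = "glmnet",
                    metric = "ROC", trControl = ctrl, tuneGrid = grid)

# Compare train and test AUC to check for over-fit
p.train <- predict(fit.glmnet, train.x, type = "prob")[, 2]
p.test  <- predict(fit.glmnet, test.x,  type = "prob")[, 2]
auc(roc(train.y, p.train))
auc(roc(test.y,  p.test))

# Ten strongest predictors by scaled absolute coefficient size
imp <- varImp(fit.glmnet)$importance
head(imp[order(-imp$Overall), , drop = FALSE], 10)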
Section 2: Gradient boosted trees (7 marks)
In the box below include all code and output from your GBT (you don't need to include code from all your experiments) plus your reasoning related to XGBoost, with marks attributed for the following (an illustrative code sketch follows this list):
0 marks: correct data pre-processing steps (same as for section 1).
3 marks: construct a good performing model, assess model discrimination (i.e., AUC) and comment on over-fit.
2 marks: originality in approach (if you tried something original but it didn't work and therefore did not feature in your final model, explain somewhere what you tried).
2 marks: list the 10 top influential predictors.
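Again as a starting point only, a hedged sketch of one possible xgbTree fit is below; the grid values are illustrative, and ctrl, train.x and train.y are reused from the earlier sketches.

library(caret)
library(xgboost)
library(pROC)

# Illustrative grid; caret's xgbTree method expects exactly these columns
grid <- expand.grid(nrounds          = c(100, 300),
                    max_depth        = c(2, 4),
                    eta              = c(0.05, 0.1),
                    gamma            = 0,
                    colsample_bytree = 0.8,
                    min_child_weight = 1,
                    subsample        = 0.8)

fit.gbt <- train(x = train.x, y = train.y, method = "xgbTree",
                 metric = "ROC", trControl = ctrl, tuneGrid = grid)

# Compare train and test AUC to check for over-fit
auc(roc(train.y, predict(fit.gbt, train.x, type = "prob")[, 2]))
auc(roc(test.y,  predict(fit.gbt, test.x,  type = "prob")[, 2]))

# Ten strongest predictors ranked by gain (xgb.importance sorts by Gain)
imp <- xgb.importance(feature_names = colnames(train.x),
                      model = fit.gbt$finalModel)
head(imp, 10)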
Section 3: Neural networks (7 marks)
In the box below include all code and output from your neural network (you don't need to include code from all your experiments) plus your reasoning related to the neural network, with marks attributed for the following (an illustrative code sketch follows this list):
2 marks: correct data pre-processing steps.
3 marks: construct a good performing model, assess model discrimination (i.e., AUC) and comment on over-fit.
2 marks: originality in approach (if you tried something original but it didn't work and therefore did not feature in your final model, explain somewhere what you tried).
0 marks: list the 10 top influential predictors (do not do this for the neural network).
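A hedged sketch of one possible ANN2 network is below; the hyperparameters shown are starting points to experiment with, not a recommended final configuration, and the matrices come from the set-up sketch above.

library(ANN2)
library(pROC)

# ANN2 expects numeric matrices; the outcome is coded 0/1 here
nn <- neuralnetwork(X = train.x.m, y = train.y.m,
                    hidden.layers   = c(16, 8),   # two small hidden layers
                    activ.functions = "relu",
                    optim.type      = "adam",
                    learn.rates     = 1e-3,
                    L2              = 1e-4,       # weight decay against over-fit
                    n.epochs        = 50,
                    batch.size      = 32,
                    val.prop        = 0.1,        # hold-out for the loss curves
                    random.seed     = 748)

# Compare train and test AUC to check for over-fit
p.train <- predict(nn, train.x.m)$probabilities[, 2]
p.test  <- predict(nn, test.x.m)$probabilities[, 2]
auc(roc(as.vector(train.y.m), p.train))
auc(roc(as.vector(test.y.m),  p.test))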
Section 4: Calibration and summary (7 marks)
In the box below include all code and output from the calibration and re-calibration steps, with marks attributed for the following (an illustrative code sketch follows this list):
2 marks: assess moderate calibration and comment (e.g., whether risk is over- or under-predicted, and in which region of the predicted probabilities).
2 marks: Apply a re-calibration method and comment on whether it was necessary and beneficial, and why.
3 marks: explain why you selected this model as the best and discuss its strengths and limitations (this can include knowledge gained from class but also outside of class).
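A hedged sketch of one re-calibration approach (logistic re-calibration of the predicted probabilities, i.e. Platt-style scaling) is below; p.test and test.y are assumed to come from your best model's sketch above, and other methods seen in class are equally acceptable.

library(caret)
library(pROC)

# Split the test set into calibration-train and calibration-test halves
set.seed(748)
cal.idx <- createDataPartition(test.y, p = 0.5, list = FALSE)
y01     <- as.numeric(test.y) - 1                  # outcome coded 0/1

# Logistic re-calibration: refit intercept and slope on the logit scale
lp    <- qlogis(pmin(pmax(p.test, 1e-6), 1 - 1e-6))  # clip to avoid +/-Inf
recal <- glm(y01[cal.idx] ~ lp[cal.idx], family = binomial)

# Apply the re-calibration to the calibration-test half
p.recal <- plogis(coef(recal)[1] + coef(recal)[2] * lp[-cal.idx])

# Moderate calibration: observed vs predicted event rates by decile of risk
dec <- cut(p.recal, breaks = quantile(p.recal, probs = seq(0, 1, 0.1)),
           include.lowest = TRUE)
cbind(predicted = tapply(p.recal, dec, mean),
      observed  = tapply(y01[-cal.idx], dec, mean))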