
PUBH5006 Assessment 1: Data analysis (assessing weeks 1-4)


Information about this assessment:

This assessment is worth 30% of your overall grade. There are a total of 30 marks available, so each mark corresponds to 1% of your overall grade. The number of marks each question is worth is shown. The questions are broken up into 3 self-contained sections, each worth 10 marks. Each of these 3 sections involves building and assessing a prediction model, and covers material from the first 4 weeks of class. Part marks are available if you are unable to produce the correct answer but can demonstrate that you understood what was being asked and did some things correctly, so it is worth attempting every question.

The assessment is structured like an analysis, in which some sections are done for you (and you simply cut and paste the code into your own R code file), while other sections are for you to complete. Please submit this completed document as a Word file (a Turnitin link will be made available to you), and please refer to the unit outline for information regarding the assessment due date and late penalty policy.

IMPORTANT: For each question, a space is provided in which you must include ALL your R code used to answer the question, ALL output requested (including plots) and ALL interpretation requested which you must type yourself (IN THAT ORDER).

The use of generative AI (e.g., ChatGPT) in this assessment:

The use of these technologies is allowed, as they will play a central role in programming in your employment, and to some degree they simply replace the methods we used to use to find code (i.e., internet forums such as Stack Exchange).

It is intended that ChatGPT may play an assistive role (i.e., helping you build the code needed to answer questions).

You should NOT use ChatGPT (or similar technologies) in a way which completes the assessment for you (e.g., pasting the question into ChatGPT and pasting the result into your assessment). You will lose marks for answers which appear to be verbatim ChatGPT output.

Although I can't stop you using ChatGPT, if you are hoping to continue on in the data specialization, you should be focused on learning the R code and concepts for yourself. Thus, I would encourage you not to use ChatGPT, or to only use it to confirm your solution once you have solved it yourself.

Part one: Predicting Adverse medication event with logistic regression

Let's pretend you're an analyst at a major metropolitan hospital, and you've been asked to construct a model which predicts the risk (i.e., probability) that a patient will experience an adverse medication event using logistic regression. Start by setting your working directory, importing the data and loading the packages shown below:

data <- read.csv("Hospital_adverse_events_data.csv", na.strings = "")

library(glm2)

library(ggplot2)

library(pROC)

Next, let's make a smaller data set called data2 which includes the outcome in the first column and a subset of 21 risk factors that a panel of clinicians at the hospital believe are likely to be predictors of the outcome:

data2<-data[,c(218,9:18,26,41,42,45,46,80,85,115:117,121)]

Question 1 (2 marks)

Now you need to divide this subset into a train and test set, of approximately 70% and 30% of the data respectively. Do this similarly to the way we did it in the week 4 exercise, by generating a random number for each row in data2 and saving this in a new variable called rand. Then you'll need to create two new data sets called train and test, where the train set should include all those with a random number below 0.7 and the test set includes those with a number of 0.7 or above. Make sure that just before generating the random number, you set the seed with set.seed(785). Also, once you generate the new train and test sets, make sure you remove the rand variable from them (it will be column 23):

set.seed(785)

data2$rand <- runif(nrow(data2))

train <-data2[which(data2$rand <.7),]

test <- data2[which(data2$rand>=.7),]

train <- train[,-23]

test <- test[,-23]

Question 2 (2 marks)

Now, run the full model (i.e., including all 21 predictors) in the train data, save the results in an object called full.mod, calculate the predicted probabilities in the train sample according to this model and save them in an object called pred.full, and use this to calculate the AUC.

# the outcome (original column 218) is AME_outcome, the first column of train

full.mod <- glm(AME_outcome ~ ., data = train, family = binomial(link = logit))

pred.full <- predict(full.mod, type = "response")

auc(train$AME_outcome, pred.full)

Question 3 (2 marks)

Now make an alternative model, saved in an object called alt.mod, once again using the train set. In this alternative model you need to include only those predictors from full.mod which were significantly associated with the outcome (p<0.05, indicated by one or more *) in the full model.

In addition, calculate the predicted probabilities in the train sample according to this alt.mod and save them in an object called pred.alt, and again use this to calculate the AUC in the training set. How does the AUC of the alternative model compare with that of the full model in the train set? Are these results expected, and why?

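A minimal sketch of one possible approach is given below. The predictor names predictor1, predictor2 and predictor3 are hypothetical placeholders only; substitute the variables that showed p<0.05 in your own summary(full.mod) output:

alt.mod <- glm(AME_outcome ~ predictor1 + predictor2 + predictor3, data = train, family = binomial(link = logit)) # hypothetical predictor names

summary(alt.mod)

pred.alt <- predict(alt.mod, type = "response")

auc(train$AME_outcome, pred.alt)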

Question 4 (2 marks)

Now, use the full.mod and alt.mod to calculate the predicted probabilities among those in the test set, saving these in objects called test.pred.full and test.pred.alt. Then use these to calculate the AUC for both models in the test set and compare them. Which model would you select and why?
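
One possible sketch, assuming full.mod and alt.mod have been fitted on the train set as above and that the outcome column is AME_outcome:

test.pred.full <- predict(full.mod, newdata = test, type = "response")

test.pred.alt <- predict(alt.mod, newdata = test, type = "response")

auc(test$AME_outcome, test.pred.full)

auc(test$AME_outcome, test.pred.alt)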

Question 5 (2 marks)

We now need to consider whether our model is under-fitted (i.e., not as complex as it could be) by exploring whether we could add any interaction or quadratic terms. Using only one or two of the predictors included in the data2 object, experiment in the train set with interactions and quadratic terms.

Once you identify a significant interaction or quadratic effect, save this new model in an object called alt.mod2 and use this to calculate the AUC in the train and test sets. Is the complex term over-fit?
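
A rough sketch of the kind of experimentation intended is shown below; predictor1 and predictor2 are hypothetical placeholders for any one or two predictors from data2:

alt.mod2 <- glm(AME_outcome ~ predictor1 * predictor2 + I(predictor1^2), data = train, family = binomial(link = logit)) # hypothetical terms: keep only those that turn out significant

summary(alt.mod2)

auc(train$AME_outcome, predict(alt.mod2, type = "response"))

auc(test$AME_outcome, predict(alt.mod2, newdata = test, type = "response"))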

Part two: Variable selection with regularisation

This section follows on from the last, but this time we will start with fresh data. Run the code below to remove any existing objects, load the data again and load the necessary libraries.

rm(list=ls())

data <- read.csv("Hospital_adverse_events_data.csv", na.strings = "")

library(glmnet)

library(caret)

library(pROC)

In class so far the only machine learning method we have introduced is regularization, which, as we discussed, has the major benefit of finding the bias-variance trade-off automatically (by comparing different values of lambda with cross-validation).

However, regularisation offers another major advantage over normal regression, and that is its ability to automate variable selection. In this section we are going to try to improve upon the model in the previous section by using regularisation. However, we are not going to apply shrinkage to the model we developed above (i.e., the alternative model from Q.3), because as we've already seen, this model is not over-fit. Thus, shrinking the coefficients of this model towards 0 would be introducing bias without the need to reduce variability.

Instead, we will apply regularisation to all of the predictors in the data (except the nominal categorical variables, to avoid having to reformat the data).

Let's start by setting up the data objects we need to use with the caret package; this is done for you in the code below. The code creates the train and test data, separating the predictors and outcomes and formatting them appropriately for caret (note the data is randomly split the same way as above so we can compare this model with the one from the previous section):

data2 <- data[,c(218,2,7:215)]

set.seed(785)

rand <- runif(nrow(data2))

test <- data2[which(rand>=.7),]

test.y <- factor(test$AME_outcome, labels=c("no","yes"))

test.x <- as.matrix(test[,-1])

train <- data2[which(rand<.7),]

train.y <- factor(train$AME_outcome, labels=c("no","yes"))

train.x <- as.matrix(train[,-1])

Question 1 (5 marks)

For this question, you need to fill in the requested values in the correct place in the code provided below and then run the code successfully. At the end of the question you will be required to include some output (explained near the end of the question):

Firstly, you need to set the lambda values in the grid search object tune_grid as 0.0001, 0.001, 0.01, 0.1 and 1:

tune_grid <- expand.grid(alpha = 1, lambda = c(0.0001, 0.001, 0.01, 0.1, 1))

Next, in this section all you need to do is specify the method in the train control object train_control as using 5-fold cross-validation (don't change the other settings, and don't forget each line needs to end in a comma except the final line):

train_control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary)

Lastly, in this section we run the model, and all you need to specify is the correct data objects (i.e., the training data objects) and the method we want to use (don't change the other settings, and don't forget each line needs to end in a comma):

set.seed(7353)

reg_mod <- caret::train(
  x = train.x,
  y = train.y,
  trControl = train_control,
  preProcess = c("center", "scale"),
  tuneGrid = tune_grid,
  method = "glmnet")

Once you have successfully run the above model, the results will be stored in the reg_mod object. For our purpose, we are not interested in the performance of the regularisation model; instead, we simply want to know which coefficients among all the predictors were most strongly related to the outcome.

The code below extracts these coefficients and orders them from strongest to weakest (remember they may be given in scientific notation, and we are measuring strength by the magnitude of the association, which can be positive or negative):

reg_mod_beta <- data.frame(
  variables = dimnames(coef(reg_mod$finalModel, reg_mod$finalModel$lambdaOpt))[[1]],
  betas = as.numeric(coef(reg_mod$finalModel, reg_mod$finalModel$lambdaOpt)))

reg_mod_beta <- reg_mod_beta[order(-abs(reg_mod_beta$betas)),]

Print only the top 10 predictors and their coefficients from the reg_mod_beta object and put the results in the box below (there is a trick here: the intercept does not count as a predictor!):

head(reg_mod_beta, n=11)

variables betas

1 (Intercept) -3.80040803

2 age 1.549717834

Body.Weight 1.027273083

Body.Height -0.45441929

191 Documentation.of.current.medications 0.12470527

112 Acetaminophen.325.MG...oxyCODONE.Hydrochloride.5.MG.Oral.Tablet 0.10527738

188 Colonoscopy -0.09059045

68 Osteoporosis..disorder. 0.08967747

113 Acetaminophen.325.MG...oxyCODONE.Hydrochloride.5.MG..Percocet. 0.08789075

5 Calcium -0.07863440

48 Fracture.of.forearm -0.07729209

Question 2 (5 marks)

Now, create a final logistic regression model (i.e., normal logistic regression without regularisation) predicting AME_outcome which includes only the top 10 predictors according to the model from the previous question. Run this model in the train data and save it as final.model, then calculate the AUC. Next, use this model to calculate the AUC in the test data. Is this model better than the alternative model from Q.3 (section 1)? If so, why? And what does this lead you to conclude about the use of regularisation for variable selection?
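
A minimal sketch of one way to do this, assuming the train and test data frames created for caret above are still in memory (their first column is AME_outcome) and that reg_mod_beta is ordered as in Q.1:

top10 <- as.character(reg_mod_beta$variables[reg_mod_beta$variables != "(Intercept)"])[1:10] # top 10 predictors, skipping the intercept

final.model <- glm(reformulate(top10, response = "AME_outcome"), data = train, family = binomial(link = logit))

auc(train$AME_outcome, predict(final.model, type = "response"))

auc(test$AME_outcome, predict(final.model, newdata = test, type = "response"))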

Part three: Exploring validity via simulation

preamble

For this section, we will start by using simulated data to generate an entire hypothetical population of 100,000 patients, so that we can compare the results in our sample with the actual or true results in the population (which of course in real life we would never be able to access). Start by running the code below, which generates a simulated data set called population with 20 continuous predictor variables labelled X1 to X20, and a binary outcome Y. Importantly, all code must be run in the order given with the set.seed() function, or you will get the wrong data! You should also clear all other data from your R environment so you do not get confused.

rm(list=ls())

library(pROC)

library(glm2)

set.seed(36)

population <- data.frame(matrix(rnorm(2000000), nrow=100000, ncol=20))

z <- -4 + population$X1*0.5 + population$X5*0.7 + population$X10*-0.5 + population$X17*1.5

pr <- 1/(1+exp(-z))

population$Y <- rbinom(100000, 1, pr)

What we have here is a hypothetical disease outcome Y with a prevalence of 5.8% (confirm this with prop.table(table(population$Y))) in the total population of 100,000 patients. Further, of the 20 predictors, only 4 of them are actually associated with the outcome (i.e., the logits are set to X1=0.5, X5=0.7, X10=-0.5 and X17=1.5). We can confirm that the simulations above have resulted in the desired associations by running the logistic regression model below in this entire population:

summary(glm(Y~.,population,family=binomial(link=logit)))

Make sure you understand that this represents the entire population, and that we do not actually have access to it and can't ever know the true relationships revealed by it.

the exercise

You are a researcher with an interest in building a prediction model for this outcome Y, and have received funding which will allow you to obtain a sample of roughly 1000 patients from this population along with all 20 predictors X1 to X20. This is done for you in the code below, which randomly subdivides the population into sub-samples and draws one of these at random to be your sample which includes 999 patients.

set.seed(734)

population$sample.no <- as.numeric(as.factor(cut(runif(100000), 100)))

sample <- population[which(population$sample.no==6),-22]

Question 1 (2 marks)

Just as was done above in the entire population, run a logistic regression model including all 20 predictors in your sample (include your model code and output in the box below). Do the results from this model, based on the sample, represent the true model in the entire population? If not, in what ways do they differ?
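
A minimal sketch, mirroring the population model above but run in the sample (the object name sample.mod is just a placeholder; the question does not require a particular name):

sample.mod <- glm(Y ~ ., data = sample, family = binomial(link = logit))

summary(sample.mod)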

Question 2 (2 marks)

Now, re-run the model in the sample, but this time only include the variables which are significantly (p<0.05 - indicated by one or more *) associated with the outcome in your sample and save this model in an object called new.mod. Then use the predict function to calculate the predicted probability of the outcome given by the new.mod object for the observations in the sample, and save these into a new object called pred.new. Lastly, use this to calculate the AUC among the patients in your sample:
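
A sketch of what this could look like, assuming the significant predictors in your sample turn out to be X1, X4, X5, X10, X14 and X17 (this matches the new model used in the cross-validation code in Q.4 below; confirm the set against your own summary output):

new.mod <- glm(Y ~ X1 + X4 + X5 + X10 + X14 + X17, data = sample, family = binomial(link = logit))

pred.new <- predict(new.mod, type = "response")

auc(sample$Y, pred.new)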

Your new model appears to be doing a pretty good job, and you are quite excited because you seem to have discovered two new predictors (X4 and X14). A previous prediction model constructed by a research team at another university had found that the only predictors which were associated with the outcome were X1, X5, X10 and X17. However, because the sample size used by this other university was much larger than yours (around 10,000 patients versus your 1,000) you decide you had better test this existing model in your data to ensure it is not actually better.

Question 3 (2 marks)

Run this previously discovered model in your sample, which includes only the four predictors found important by this previous study (i.e., remove X4 and X14). Save this old (i.e., previously discovered) model as old.mod and the new predicted probabilities as pred.old. Calculate the AUC and compare it with your new model. What do you conclude at this point?
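
A minimal sketch, using only the four predictors named above:

old.mod <- glm(Y ~ X1 + X5 + X10 + X17, data = sample, family = binomial(link = logit))

pred.old <- predict(old.mod, type = "response")

auc(sample$Y, pred.old)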

At this point you're almost ready to submit your findings for publication; however, just before you do, a senior researcher points out that you didn't use any form of cross-validation. Because your sample is small and you don't want to lose any information to a testing set, s/he suggests you simply apply 5-fold cross-validation in your sample, which you should use to validate and compare both your new model and the previous model. As you don't know how to implement this in R, s/he provides you with the following code:

Question 4 (2 marks)

Examine the code below, which carries out 5-fold cross-validation in your sample for both models: your new model (new.mod) and the previous research group's model (old.mod). In 4 dot points, briefly explain how the code works and what the resulting output placed in the pred_compare object is. Even if you don't fully understand the code, do your best:

set.seed(263)

cv_folds <- sample(c(1:5), size=nrow(sample), replace=TRUE)

pred_compare <- data.frame(pred_new=rep(NA,nrow(sample)), pred_old=rep(NA,nrow(sample)))

for (i in 1:5) {
  new.mod <- glm(Y~X1+X4+X5+X10+X14+X17, sample[cv_folds!=i,], family=binomial(link=logit))
  old.mod <- glm(Y~X1+X5+X10+X17, sample[cv_folds!=i,], family=binomial(link=logit))
  pred_compare[cv_folds==i,1] <- predict(new.mod, newdata = sample[cv_folds==i,], type="response")
  pred_compare[cv_folds==i,2] <- predict(old.mod, newdata = sample[cv_folds==i,], type="response")
}

Question 5 (2 marks)

Lastly, run the cross-validation code given above and use its output to calculate the cross-validated AUC for both models in your sample. What do you now conclude about your new model, and how does it compare with the previous model?
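
A minimal sketch, assuming the cross-validation code above has been run so that pred_compare is populated:

auc(sample$Y, pred_compare$pred_new) # cross-validated AUC for your new model

auc(sample$Y, pred_compare$pred_old) # cross-validated AUC for the previous model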
