PUBH5006 Data Analysis Assessment

  • Subject Code : PUBH5006
  • Country : Australia

Part one: Predicting adverse medication events with logistic regression

Let’s pretend you’re an analyst at a major metropolitan hospital, and you’ve been asked to construct a model which predicts the risk (i.e., probability) that a patient will experience an adverse medication event using logistic regression. Start by setting your working directory, importing the data and loading the packages shown below:

data<-read.csv("Hospital_adverse_events_data.csv", na.strings = "")
library(glm2)
library(ggplot2)
library(pROC)

Next, let’s make a smaller data set called data2 which includes the outcome in the first column and a subset of 21 risk factors that a panel of clinicians at the hospital believe are likely to be predictors of the outcome:

data2<-data[,c(218,9:18,26,41,42,45,46,80,85,115:117,121)]

Question 1 (2 marks)

Now you need to divide this subset into a train and a test set of approximately 70% and 30% of the data, respectively. Do this in the same way as in the week 4 exercise: generate a random number for each row in data2 and save it in a new variable called rand. Then create two new data sets called train and test, where the train set includes all rows with a random number below 0.7 and the test set includes those with a number of 0.7 or above. Make sure that just before generating the random numbers you set the seed with set.seed(785). Also, once you have generated the new train and test sets, make sure you remove the rand variable from them (it will be column 23):

# Set the seed for reproducibility
set.seed(785)
# Add a random number column to data2
data2$rand <- runif(nrow(data2))
# Split data into train and test based on the random number
train <- data2[data2$rand < 0.7, ]
test <- data2[data2$rand >= 0.7, ]

# Remove the rand variable from train and test sets
train$rand <- NULL
test$rand <- NULL
# Print dimensions of train and test datasets
cat("Dimensions of train dataset:", dim(train), "\n")
cat("Dimensions of test dataset:", dim(test), "\n")

Question 2 (2 marks)

Now, run the full model (i.e., including all 21 predictors) in the train data, save the results in an object called full.mod, calculate the predicted probabilities in the train sample according to this model and save them in an object called pred.full, and use this to calculate the AUC.

# Fit the full logistic regression model (all 21 predictors) on the train data
full.mod <- glm(AME_outcome ~ ., data = train, family = "binomial")
# Calculate predicted probabilities in the train sample
pred.full <- predict(full.mod, type = 'response')
# Calculate the AUC in the train set
auc_score <- roc(train$AME_outcome, pred.full)
cat("AUC:", auc(auc_score), "\n")
# Plot the ROC curve
plot(auc_score, main = "ROC Curve - Full Model", print.auc = TRUE)

Question 3 (2 marks)

Now make an alternative model, saved in an object called alt.mod, once again using the train set. In this alternative model you need to include only those predictors from full.mod which were significantly associated with the outcome (p<0.05).

In addition, calculate the predicted probabilities in the train sample according to this alt.mod and save them in an object called pred.alt, and again use this to calculate the AUC in the training set. How does the AUC of the alternative model compare with that of the full model in the train set? Are these results expected, and why?

# Fit the full logistic regression model on the train data
full.mod <- glm(AME_outcome ~ ., data = train, family = "binomial")
# Get the summary of the full model to identify significant predictors
full.mod_summary <- summary(full.mod)
# Extract significant predictors (p < 0.05), excluding the intercept
significant_predictors <- rownames(full.mod_summary$coefficients)[full.mod_summary$coefficients[, 4] < 0.05]
significant_predictors <- setdiff(significant_predictors, "(Intercept)")

# Create an alternative model using significant predictors
alt.mod <- glm(AME_outcome ~ ., data = train[, c("AME_outcome",
significant_predictors)], family = "binomial")

# Calculate predicted probabilities in the train sample for the alternative model
pred.alt <- predict(alt.mod, type = 'response')

# Calculate AUC for the alternative model
auc_score_alt <- roc(train$AME_outcome, pred.alt)
cat("AUC (Alternative Model):", auc(auc_score_alt), "\n")

# Plot the ROC curve for the alternative model
plot(auc_score_alt, main = "ROC Curve - Alternative Model", print.auc = TRUE)

Question 4 (2 marks)

Now, use the full.mod and alt.mod to calculate the predicted probabilities among those in the test set, saving these in objects called test.pred.full and test.pred.alt. Then use these to calculate the AUC for both models in the test set and compare them. Which model would you select and why?

# Calculate predicted probabilities for the full model on the test set
test.pred.full <- predict(full.mod, newdata = test, type = 'response')

# Calculate predicted probabilities for the alternative model on the test set
test.pred.alt <- predict(alt.mod, newdata = test, type = 'response')

# Calculate AUC for the full model on the test set
auc_full <- auc(test$AME_outcome, test.pred.full)

# Calculate AUC for the alternative model on the test set
auc_alt <- auc(test$AME_outcome, test.pred.alt)

# Print AUC values
cat("AUC (Full Model - Test Set):", auc_full, "\n")
cat("AUC (Alternative Model - Test Set):", auc_alt, "\n")

Question 5 (2 marks)

We now need to consider that our model is under-fitted (i.e., is not as complex as it could be) by exploring if we could add any interaction or quadratic terms. Using only one or two of the predictors included in the data2 object, experiment in the train set with interactions and quadratic terms.

Once you identify a significant interaction or quadratic effect, save this new model in an object called alt.mod2 and use this to calculate the AUC in the train and test sets. Is the complex term over-fit?

# Add an interaction term and quadratic terms for a few selected predictors
train$interaction_x1x2 <- train$Urea.Nitrogen * train$Potassium
train$quadratic_x1 <- train$Potassium^2
train$quadratic_x2 <- train$Prediabetes^2

# Fit the model with the added terms
alt.mod2 <- glm(AME_outcome ~ ., data = train[, c("AME_outcome", "Urea.Nitrogen", "Potassium", "interaction_x1x2", "quadratic_x1", "quadratic_x2")], family = "binomial")

# Calculate predicted probabilities in the train sample according to alt.mod2
pred.alt2 <- predict(alt.mod2, type = 'response')

# Calculate AUC in the train set
auc_train_alt2 <- auc(train$AME_outcome, pred.alt2)

# Calculate predicted probabilities in the test sample according to alt.mod2
# (the test-set terms must be built from the same variables as in the train set)
test$interaction_x1x2 <- test$Urea.Nitrogen * test$Potassium
test$quadratic_x1 <- test$Potassium^2
test$quadratic_x2 <- test$Prediabetes^2
test.pred.alt2 <- predict(alt.mod2, newdata = test, type = 'response')

# Calculate AUC in the test set
auc_test_alt2 <- auc(test$AME_outcome, test.pred.alt2)

# Print AUC values
cat("AUC (Alternative Model 2 - Train Set):", auc_train_alt2, "\n")
cat("AUC (Alternative Model 2 - Test Set):", auc_test_alt2, "\n")

Part two: Variable selection with regularization

This section follows on from the last, but this time we will start with fresh data. Run the code below to remove any existing objects, load the data again and load the necessary libraries.

rm(list=ls())
data<-read.csv("Hospital_adverse_events_data.csv", na.strings = "")
library(glmnet)
library(caret)
library(pROC)

In class so far the only machine learning method we have introduced is regularization, which, as we discussed, has the major benefit of finding the bias-variance trade-off automatically (by comparing different values of lambda with cross-validation).
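
For illustration only (the assignment code below uses caret instead), glmnet's own cv.glmnet() does exactly this: it fits the model across a grid of lambda values and selects the one with the best cross-validated performance. A minimal sketch, assuming a numeric predictor matrix x and a binary outcome y:

# Illustration only: lambda selection by cross-validation with glmnet
# (x and y are assumed objects: a numeric predictor matrix and a binary outcome)
cv_fit <- cv.glmnet(x, y, family = "binomial", type.measure = "auc", nfolds = 5)
cv_fit$lambda.min # lambda with the best cross-validated AUC
coef(cv_fit, s = "lambda.min") # coefficients at that lambda (lasso shrinks some to exactly 0)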

However, regularisation offers another major advantage over normal regression, and that is in its ability to automate variable selection. In this section we are going to try to improve upon the model in the previous section by using regularisation. However, we are not going to apply shrinkage to the model we developed above (i.e., the alternative model from Q.3), because as we’ve already seen, this model is not over-fit. Thus, shrinking the coefficients of this model towards 0 would be introducing bias without the need to reduce variability.

Instead, we will apply regularisation to all of the predictors in the data (except the nominal categorical variables, to avoid having to reformat the data).

Let's start by setting up the data objects we need to use with the caret package; this is done for you in the code below. The code creates the train and test data, separating the predictors and outcomes and formatting them appropriately for caret (note the data is randomly split in the same way as above so we can compare this model with the one from the previous section):

data2<-data[,c(218,2,7:215)]
set.seed(785)
rand<-runif(nrow(data2))

test<-data2[which(rand>=.7),]
test.y<-factor(test$AME_outcome, labels=c("no","yes"))
test.x<- as.matrix(test[,-1])

train<-data2[which(rand<.7),]
train.y<-factor(train$AME_outcome, labels=c("no","yes"))
train.x<- as.matrix(train[,-1])

Question 1 (5 marks)

For this question, you need to fill in the requested values in the correct place in the code provided below and then run the code successfully. At the end of the question you will be required to include some output (explained near the end of the question):

Firstly, you need to set the lambda values in the grid search object grid as 0.0001, 0.001, 0.01, 0.1 and 1:

tune_grid <- expand.grid(
alpha=1,
lambda= c(0.0001, 0.001,0.01,0.1,1))

Next, in this section all you need to do is specify the method in the train control object train_control as 5-fold cross-validation (don't change the other settings, and don't forget each line needs to end in a comma except the final line):

train_control <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary)

Lastly, in this section we run the model, and all you need to specify is the correct data objects (i.e., the training data objects), and the method we want to use (don’t change the other settings, and don’t forget each line needs to end in a comma):

set.seed(7353)
reg_mod <- caret::train(
x = train.x,
y= train.y,
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = tune_grid,
method="glmnet")

Once you have successfully run the above model, the results will be stored in the reg_mod object. For our purposes, we are not interested in the performance of the regularisation model itself; instead, we simply want to know which coefficients among all the predictors were most strongly related to the outcome.

The code below extracts these coefficients and orders them from strongest to weakest (remember they are given in scientific notation, and strength is measured by the magnitude of the association, which can be positive or negative):

reg_mod_beta<-data.frame(variables=dimnames(coef(reg_mod$finalModel, reg_mod$finalModel$lambdaOpt))[[1]],

betas=as.numeric(coef(reg_mod$finalModel, reg_mod$finalModel$lambdaOpt)))

reg_mod_beta<-reg_mod_beta[order(-abs(reg_mod_beta$betas)),]

Print only the top 10 predictors and their coefficients from the reg_mod_beta object and put the results in the box below (there is a trick here: the intercept does not count as a predictor!):

# Load necessary libraries
library("glmnet")
library("caret")
library("pROC")

# Load the data
data <- read.csv("Hospital_adverse_events_data.csv", na.strings = "")

# Prepare data objects
data2 <- data[, c(218, 2, 7:215)]
set.seed(785)
rand <- runif(nrow(data2))

test <- data2[which(rand >= 0.7), ]
test.y <- factor(test$AME_outcome, labels = c("no", "yes"))
test.x <- as.matrix(test[, -1])

train <- data2[which(rand < 0.7), ]
train.y <- factor(train$AME_outcome, labels = c("no", "yes"))
train.x <- as.matrix(train[, -1])

# Set lambda values in the grid search object
tune_grid <- expand.grid(
alpha = 1,
lambda = c(0.0001, 0.001, 0.01, 0.1, 1)
)

# Define train control settings
train_control <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary
)

# Run the regularisation model
set.seed(7353)
reg_mod <- caret::train(
x = train.x,
y = train.y,
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = tune_grid,
method = "glmnet"
)

# Extract and order coefficients
reg_mod_beta <- data.frame(
variables = dimnames(coef(reg_mod$finalModel,
reg_mod$finalModel$lambdaOpt))[[1]],
betas = as.numeric(coef(reg_mod$finalModel, reg_mod$finalModel$lambdaOpt)) )
reg_mod_beta <- reg_mod_beta[order(-abs(reg_mod_beta$betas)),]

# Print only the top 10 predictors and their coefficients
# (n = 11 because the intercept is also listed but does not count as a predictor)
head(reg_mod_beta, n = 11)

Question 2 (5 marks)

Now, create a final logistic regression model (i.e., normal logistic regression without regularisation) predicting AME_outcome which includes only these top 10 predictors according to the model from the previous question. Run this model in the train data and save it as final.model, then calculate the AUC. Next, use this model to calculate the AUC in the test data. Is this model better than the alternative model from Q.3 (section 1)? If so, why? And what does this lead you to conclude about the use of regularisation for variable selection?

# Select the top 10 predictors from the reg_mod_beta object (excluding the intercept)
selected_predictors <- reg_mod_beta$variables[reg_mod_beta$variables != "(Intercept)"][1:10]

# Create the formula for the model
formula <- paste("AME_outcome ~", paste(selected_predictors, collapse = " + "))

# Fit the final logistic regression model
final_model <- glm(as.formula(formula), data = train, family = binomial(link = "logit"))

# Summary of the final model
summary(final_model)

# Calculate predicted probabilities using the final model
train$pred_prob <- predict(final_model, newdata = train, type = "response")

# Calculate AUC for train data
train_auc <- pROC::roc(train$AME_outcome, train$pred_prob)$auc
print("Train AUC:")
print(train_auc)

# Calculate predicted probabilities using the final model on test data
test$pred_prob <- predict(final_model, newdata = test, type = "response")

# Calculate AUC for test data
test_auc <- pROC::roc(test$AME_outcome, test$pred_prob)$auc
print("Test AUC:")
print(test_auc)

# Print the AUC of the alternative model from Q.3 (section 1)
# Note: the environment was cleared at the start of Part two, so first assign the
# test-set AUC recorded for the Q.3 alternative model to an object called auc_alt_model
print("Alternative Model AUC (Q.3):")
print(auc_alt_model)

# Compare AUC values
if (train_auc > auc_alt_model) {
print("The final model in train data is better than the alternative model.")
} else {
print("The final model in train data is not better than the alternative model.")
}
if (test_auc > auc_alt_model) {
print("The final model in test data is better than the alternative model.")
} else {
print("The final model in test data is not better than the alternative model.")
}

Part three: Exploring validity via simulation

preamble

For this section, we will start by using simulated data to generate an entire hypothetical population of 100,000 patients, so that we can compare the results in our sample with the “actual” or “true” results in the population (which of course in real life we would never be able to access). Start by running the code below, which generates a simulated data set called population with 20 continuous predictor variables labelled X1 to X20 and a binary outcome Y. Importantly, all code must be run in the order given with the set.seed() function, or you will get the wrong data! You should also clear all other data from your R environment so you don't get confused.

rm(list=ls())
library(pROC)
library(glm2)

set.seed(36)
population<-data.frame(matrix(rnorm(2000000),nrow=100000,ncol=20))
z <- -4+population$X1*0.5+population$X5*0.7+population$X10*-0.5+population$X17*1.5
pr <- 1/(1+exp(-z))
population$Y <- rbinom(100000,1,pr)

What we have here is a hypothetical disease outcome Y with a prevalence of 5.8% in the total population of 100,000 patients (confirm this with prop.table(table(population$Y))). Further, of the 20 predictors, only 4 of them are actually associated with the outcome (i.e., the logits are set to X1=0.5, X5=0.7, X10=-0.5 and X17=1.5). We can confirm that the simulations above have resulted in the desired associations by running the logistic regression model below in this entire population:
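
For reference, the prevalence check mentioned above is simply:

# Confirm the outcome prevalence in the simulated population (roughly 5.8%)
prop.table(table(population$Y))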

summary(glm(Y~.,population,family=binomial(link=logit)))

Make sure you understand that this population represents the entire population, and we do not actually have access to it and can’t ever know the true relationship revealed by it.

the exercise

 

You are a researcher with an interest in building a prediction model for this outcome Y, and have received funding which will allow you to obtain a sample of roughly 1000 patients from this population along with all 20 predictors X1 to X20. This is done for you in the code below, which randomly subdivides the population into sub-samples and draws one of these at random to be your sample which includes 999 patients.

set.seed(734)
population$sample.no<-as.numeric(as.factor(cut(runif(100000), 100)))
sample<-population[which(population$sample.no==6),-22]

Question 1 (2 marks)

Just as was done above in the entire population, run a logistic regression model including all 20 predictors in your sample (include your model code and output in the box below). Do the results from this model, based on the sample, represent the true model in the entire population? If not, in which ways do they differ?

sample_model <- glm(Y ~ ., data = sample, family = binomial(link = logit))
summary(sample_model)

Question 2 (2 marks)

Now, re-run the model in the sample, but this time only include the variables which are significantly (p<0.05) associated with the outcome. Save this new model as new.mod and the predicted probabilities as pred.new, then calculate the AUC.

# Fit a new model with only the predictors that were significant (p < 0.05) in the
# Question 1 model; consistent with Question 3, these are X1, X4, X5, X10, X14 and X17
new.mod <- glm(Y ~ X1 + X4 + X5 + X10 + X14 + X17, data = sample, family = binomial(link = logit))

# Predict probabilities using the new model
pred.new <- predict(new.mod, newdata = sample, type = "response")

# Calculate the AUC
auc_new <- roc(sample$Y, pred.new)$auc
auc_new

Question 3 (2 marks)

Run this previously discovered model in your sample, which includes only the four predictors found important by this previous study (i.e., remove X4 and X14). Save this old (i.e., previously discovered) model as old.mod and its predicted probabilities as pred.old. Calculate the AUC and compare it with your new model. What do you conclude at this point?

# Fit the previously discovered (old) model with only the four predictors
# identified by the previous study
old.mod <- glm(Y ~ X1 + X5 + X10 + X17, data = sample, family = binomial(link = logit))

# Predicted probabilities from the old model
pred.old <- predict(old.mod, newdata = sample, type = "response")

# Calculate the AUC and compare it with the new model
auc_old <- roc(sample$Y, pred.old)$auc
auc_old
auc_new

Question 4 (2 marks)

Examine the code below, which carries out 5-fold cross-validation in your sample for both models: your new model new.mod and the previous research group's model old.mod. In 4 dot points, briefly explain how the code works and what the resulting output placed in the pred_compare object is. Even if you don't fully understand the code, do your best:

set.seed(263)
cv_folds<-sample(c(1:5),size=nrow(sample),replace=TRUE)
pred_compare<-data.frame(pred_new=rep(NA,nrow(sample)),pred_old=rep(NA,nrow(sample)))

for (i in 1:5) {
new.mod<-glm(Y~X1+X4+X5+X10+X14+X17,sample[cv_folds!=i,],family=binomial(link=logit))
old.mod<-glm(Y~X1+X5+X10+X17,sample[cv_folds!=i,],family=binomial(link=logit))
pred_compare[cv_folds==i,1]<-predict(new.mod,newdata = sample[cv_folds==i,],type="response")
pred_compare[cv_folds==i,2]<-predict(old.mod,newdata = sample[cv_folds==i,],type="response")
}

# Set random seed for reproducibility
set.seed(263)

# Generate cross-validation fold assignments
cv_folds <- sample(c(1:5), size = nrow(sample), replace = TRUE)

# Initialize data frame to store predicted probabilities
pred_compare <- data.frame(pred_new = rep(NA, nrow(sample)), pred_old = rep(NA, nrow(sample)))

# Perform cross-validation
for (i in 1:5) {

# Fit new model using selected predictors
new.mod <- glm(Y ~ X1 + X4 + X5 + X10 + X14 + X17, sample[cv_folds != i,], family = binomial(link = logit))

# Fit old model using selected predictors
old.mod <- glm(Y ~ X1 + X5 + X10 + X17, sample[cv_folds != i,], family = binomial(link = logit))

# Predict probabilities for new model
pred_compare[cv_folds == i, 1] <- predict(new.mod, newdata = sample[cv_folds == i,], type = "response")

# Predict probabilities for old model
pred_compare[cv_folds == i, 2] <- predict(old.mod, newdata = sample[cv_folds == i,], type = "response")
}

Before executing this code, make sure the sample data frame contains the required columns (X1, X4, X5, X10, X14, X17 and Y). In brief, the code works as follows:

  • set.seed(263) makes the random fold assignment reproducible, and cv_folds then randomly allocates each row of sample to one of 5 cross-validation folds.
  • pred_compare is an empty data frame with one row per patient and two columns, one for each model's predicted probabilities.
  • In each of the 5 loop iterations, both the new model (X1, X4, X5, X10, X14, X17) and the old model (X1, X5, X10, X17) are fitted on the four folds not equal to i and then used to predict the held-out fold i.
  • After the loop, pred_compare therefore contains an out-of-sample (cross-validated) predicted probability for every patient under both models, which can be used to calculate cross-validated AUCs.

Question 5 (2 marks)

Lastly, run the cross-validation code given above and use its output to calculate the cross-validated AUC for both models in your sample. What do you now conclude about your new model, and how does it compare with the previous model?

# Load the pROC package
library(pROC)

# Calculate cross-validated AUC for the new model
cv_auc_new <- roc(response = sample$Y, predictor = pred_compare$pred_new)$auc

# Calculate cross-validated AUC for the old model
cv_auc_old <- roc(response = sample$Y, predictor = pred_compare$pred_old)$auc

# Print the cross-validated AUC values
cat("Cross-validated AUC for new model:", cv_auc_new, "\n")
cat("Cross-validated AUC for old model:", cv_auc_old, "\n")
