Analyse the data in heart.train.ass3.2022.csv.
Question 2 (18 marks)
In this question we will analyse the data in heart.train.ass3.2022.csv. In this dataset, each observation represents a patient at a hospital who reported showing signs of possible heart disease. The outcome is the presence or absence of heart disease (HD), so this is a classification problem. The predictors are summarised in Table 2. We are interested in learning a model that can predict heart disease from these measurements. When answering this question, you must use the rpart package that we used in Studio 9. The wrapper function for learning a tree using cross-validation that we used in Studio 9 is contained in the file wrappers.R. Don't forget to source this file to get access to the function.
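As a rough illustration of the kind of model this question concerns, the sketch below fits a classification tree with rpart and lists the class probabilities in each leaf. It runs on simulated stand-in data: the column names AGE, CHOL and CP and the simulated risk model are assumptions made for the example, not the contents of Table 2. It also calls rpart() directly rather than the cross-validation wrapper in wrappers.R, whose interface is not reproduced here.

```r
library(rpart)  # a recommended package included with standard R installations

set.seed(1)
n <- 300

# Simulated stand-in data; these column names are hypothetical
toy <- data.frame(
  AGE  = round(runif(n, 30, 75)),
  CHOL = round(rnorm(n, 240, 40)),
  CP   = factor(sample(c("Typical", "Atypical", "None"), n, replace = TRUE))
)

# Simulated outcome: risk rises with age and is highest when CP == "None"
p <- plogis(-6 + 0.08 * toy$AGE + 1.5 * (toy$CP == "None"))
toy$HD <- factor(ifelse(runif(n) < p, "Y", "N"))

# With the real files, read.csv("heart.train.ass3.2022.csv", stringsAsFactors = TRUE)
# and source("wrappers.R") would replace the simulation above.

# Fit a classification tree (method = "class")
fit <- rpart(HD ~ ., data = toy, method = "class")

# Textual representation of the tree on the console
print(fit)

# fit$where gives the leaf (row of fit$frame) each observation falls into;
# pairing it with the predicted class probabilities gives one row per leaf
leaf.probs <- unique(cbind(leaf = fit$where, predict(fit, type = "prob")))
leaf.probs
```

Each row of leaf.probs pairs a leaf with the estimated probabilities of the classes "N" and "Y" for observations that land in it.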
1. Using the techniques you learned in Studio 9, fit a decision tree to the data using the rpart package. Use cross-validation with 10 folds and 5,000 repetitions to select an appropriately sized tree. Which variables have been used in the best tree? How many leaves (terminal nodes) does the best tree have? [2 marks]
2. Plot the tree found by CV. Clearly describe in plain English what conditions are required for the tree to predict that someone has heart disease. (Hint: use the text(cv$best.tree, pretty=12) function to add appropriate labels to the tree.) [3 marks]
3. For classification problems, the rpart package only labels the leaves with the most likely class. However, if you examine the tree structure in its textual representation on the console, you can determine the probability of having heart disease in each leaf (terminal node); see Question 2.3 from Studio 9 as a guide. Take a screen capture of the plot of the tree (don't forget to use the "Zoom" button to get a larger image) or save it as an image using the "Export" button in RStudio. Then, using the information from the textual representation of the tree available at the console, annotate the tree in your favourite image editing software: next to each leaf in the tree, add text giving the probability of having heart disease. Include this annotated image in your report file. [1 mark]
4. According to your tree, which predictor combination results in the lowest probability of having heart disease? [1 mark]
5. We will also fit a logistic regression model to the data. Use the glm() function to fit a logistic regression model to the heart data, and use stepwise selection with the KIC score (using direction="both") to prune the model. Which variables does the final model include, and how do they compare with the variables used by the tree estimated by CV? Which predictor is the most important in the logistic regression? [3 marks]
6. Write down the regression equation for the logistic regression model you found using stepwise selection. [1 mark]
7. Describe the effect that the variable CA has on heart disease according to this logistic regression model. [1 mark]
8. The file heart.test.ass3.2022.csv contains the data on a further n' = 92 individuals. Using the my.pred.stats() function contained in the file my.prediction.stats.R, compute the prediction statistics for both the tree and the stepwise logistic regression model on this test data. Compare and contrast the two models in terms of the various prediction statistics. Does one seem better than the other? Justify your answer. [2 marks]
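The logistic regression steps in items 5 to 8 can be sketched roughly as follows, again on simulated stand-in data. The helper make.toy(), the columns AGE and NOISE, and the simulated effect of CA are all assumptions made for illustration. The penalty passed to step() via k is a placeholder: k = 2 gives AIC and k = log(n) gives BIC, and the value corresponding to the KIC score should be taken from the course material. The prediction statistics are computed by hand here only to show what such quantities measure; the assignment itself requires my.pred.stats().

```r
set.seed(2)

# Hypothetical helper that simulates stand-in data; column names are illustrative
make.toy <- function(n) {
  d <- data.frame(AGE   = round(runif(n, 30, 75)),
                  CA    = sample(0:3, n, replace = TRUE),
                  NOISE = rnorm(n))
  p <- plogis(-4 + 0.05 * d$AGE + 0.9 * d$CA)
  d$HD <- factor(ifelse(runif(n) < p, "Y", "N"))
  d
}
train <- make.toy(400)
test  <- make.toy(92)

# Full logistic regression, then stepwise selection in both directions.
# k is the penalty per parameter; log(n) (BIC) is used purely as a placeholder.
full <- glm(HD ~ ., data = train, family = binomial)
sel  <- step(full, direction = "both", k = log(nrow(train)), trace = 0)

coef(sel)       # coefficients on the log-odds scale
exp(coef(sel))  # multiplicative change in the odds per unit increase

# glm() models the probability of the second factor level, here "Y"
prob <- predict(sel, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, "Y", "N"), levels = levels(test$HD))

# Hand-computed prediction statistics on the test data
conf.mat    <- table(predicted = pred, actual = test$HD)
accuracy    <- mean(pred == test$HD)
sensitivity <- conf.mat["Y", "Y"] / sum(conf.mat[, "Y"])
specificity <- conf.mat["N", "N"] / sum(conf.mat[, "N"])
# Negative log-likelihood of the test labels under the predicted probabilities
neg.log.lik <- -sum(ifelse(test$HD == "Y", log(prob), log(1 - prob)))
```

Because the fitted coefficients are on the log-odds scale, exponentiating a coefficient such as the one for CA gives the factor by which the odds of heart disease change for each one-unit increase in that predictor, holding the others fixed.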