M348 Applied statistical modelling Assignment
You should create a Jupyter notebook for your solutions to the TMA questions. This will need to be submitted as a zip file via the University’s online TMA/EMA service. Before starting your work, please read the Specific guidance for M348 assignments and other guidance provided on the Assessment page of the module website.
This TMA is marked out of 50. Your overall score for this TMA will be the sum of your marks for each question. The marks allocated to each part of each question are indicated in brackets in the margin.
If you have a disability that makes it difficult for you to attempt any of these questions, then please contact your Student Support Team or your tutor for advice.
Question 1 – 20 marks
Note: Your solution should be contained in a Jupyter notebook. See the module website for guidance.
In a clinical trial on the effectiveness of thrombolytic therapy for acute ischemic stroke, 312 patients who had suffered ischemic stroke were given intravenous recombinant tissue plasminogen activator (t-PA) and a further 312 stroke patients received a placebo. For each patient it was recorded if the t-PA or placebo began within 90 minutes of the onset of stroke, or if it took between 90 and 180 minutes before the patient was treated. A ‘favourable outcome’ is defined as one (or both) of the following happening within 24 hours of the onset of stroke:
• an improvement of four points over baseline values in the score of the National Institutes of Health Stroke Scale (NIHSS)
• the return to normal neurological function.
The results of the trial are summarised in the table below.
(a) Create a data frame in R containing the data above and using the following variable names and levels:
• treat: treatment indicator, taking the values 0 (for placebo) and 1 (for t-PA)
• time: time of treatment after onset of stroke, taking the values 0 (for within 90 minutes) and 1 (for between 90 and 180 minutes)
• fav: favourable outcome, taking the values 0 (for no) and 1 (for yes). This is the response variable.
Use the table() command to check that the data you have created correspond to those given in the table above.
(Hint: The rep() function in R with option times may be useful here. For example, the command rep(c(0,1,0),times=c(5,6,3)) will create a vector of five 0’s, six 1’s and three 0’s. Also remember that in Notebook activities 1.10 and 1.19 you saw how to create data frames by combining vectors.)
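As a minimal sketch of the technique in the hint, the following builds a small data frame from rep() vectors and cross-tabulates it. The variable names g, y and toy and all of the counts are hypothetical, chosen only for illustration; they are not the trial data.

```r
# Hypothetical counts for illustration only; these are NOT the trial data.
# Two binary variables for 20 made-up subjects: g (group) and y (outcome).
g <- rep(c(0, 1), times = c(12, 8))             # twelve 0's, then eight 1's
y <- rep(c(0, 1, 0, 1), times = c(7, 5, 3, 5))  # outcomes listed group by group
toy <- data.frame(g = g, y = y)

# Cross-tabulate to confirm the counts are the ones intended.
table(toy$g, toy$y)
```

The important point is that the vectors must be built in a consistent order, so that the i-th entry of each vector refers to the same subject.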
(b) Find the best (in terms of the Akaike information criterion (AIC)) logistic model, with response variable fav. Explain your approach. 
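One possible way to compare candidate logistic models by AIC is sketched below. The data are simulated, and the names toy, a, b, y and candidates are assumptions made for this illustration; the sketch shows the mechanics only and is not a model for the trial data.

```r
# Simulated toy data; NOT the trial data.
set.seed(1)
n <- 200
toy <- data.frame(a = rbinom(n, 1, 0.5), b = rbinom(n, 1, 0.5))
toy$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * toy$a))

# All candidate models for two binary explanatory variables.
candidates <- list(
  null  = y ~ 1,
  a     = y ~ a,
  b     = y ~ b,
  a_b   = y ~ a + b,
  a_x_b = y ~ a * b
)

# Fit each model and extract its AIC; smaller values are preferred.
aics <- sapply(candidates, function(f) {
  AIC(glm(f, family = binomial, data = toy))
})
sort(aics)
```

With only two explanatory variables it is feasible to fit every candidate model; with more variables, a stepwise search would be a common alternative.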
(c) Use a statistical test to check whether the best model in part (b) can be improved by adding an extra term. 
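For nested logistic models, one commonly used test is the likelihood ratio (deviance difference) test, which anova() provides for glm fits. The sketch below uses simulated data, and the names toy, small and large are hypothetical; it is not claimed to be the test the module expects.

```r
# Simulated toy data; NOT the trial data.
set.seed(2)
n <- 200
toy <- data.frame(a = rbinom(n, 1, 0.5), b = rbinom(n, 1, 0.5))
toy$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * toy$a))

small <- glm(y ~ a, family = binomial, data = toy)      # current model
large <- glm(y ~ a + b, family = binomial, data = toy)  # one extra term

# Compare the deviance difference with a chi-squared distribution.
lrt <- anova(small, large, test = "Chisq")
lrt
```

A small p-value would suggest that the extra term improves the fit; a large one gives no evidence that it is needed.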
(d) Write down the fitted linear predictor of the best model you found in part (b). For this model, interpret the estimated parameters; in particular:
• Will treating the patient within 90 minutes of the onset of stroke increase or decrease the odds of a favourable outcome compared with starting treatment between 90 and 180 minutes?
• Will treating the patient with t-PA increase or decrease the odds of a favourable outcome compared with a placebo?
Then find the estimated probabilities of a favourable outcome for patients in the four groups (defined by the four combinations of time and treat), rounded to two decimal places. Which combination gives the highest estimated probability of a favourable outcome?
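To illustrate how a fitted linear predictor is turned into probabilities, the sketch below applies the inverse logit, p = exp(eta) / (1 + exp(eta)), to each combination of two binary variables. The coefficients b0, b1 and b2 and the names x1, x2 and groups are hypothetical; they are not estimates from the trial.

```r
# Hypothetical coefficients for illustration only; NOT fitted values.
b0 <- -1.0   # intercept
b1 <- 0.5    # coefficient of x1
b2 <- -0.3   # coefficient of x2

# The four combinations of two binary explanatory variables.
groups <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))

# Linear predictor, then inverse logit via plogis().
eta <- b0 + b1 * groups$x1 + b2 * groups$x2
groups$p <- round(plogis(eta), 2)
groups
```

With a fitted glm object, predict() with type = "response" applied to a data frame of the four combinations gives the same kind of result directly.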
(e) Check if your model from part (b) satisfies the assumptions of a logistic regression model. 
Question 2 – 30 marks
Note: Your solution should be contained in the same Jupyter notebook you used for Question 1. See the module website for guidance, and to download the required data file.
The aim of a study carried out in 2004 at the Royal Berkshire Hospital, Reading, was to investigate the incidence of sore throats in patients who had undergone surgery. Of particular interest was whether the occurrence of a sore throat was affected by the method used to deliver anaesthetic gas, as one of three types of airway device was used on each patient. The response variable was binary and indicated whether or not a patient experienced a sore throat in the 24-hour period following the operation. Several other explanatory variables were also recorded.
The data are given in the data frame soreThroat and stored in the file soreThroat.RData. The variables are as follows:
• age: the patient’s age (in years)
• gender: the gender the patient identifies with, taking the values 0 (for male) and 1 (for female)
• duration: the duration of the operation (in minutes)
• method: the type of airway device used:
– LMA: laryngeal mask airway
– ETT: endo-tracheal tube
– FM: traditional face mask
• lubricated: whether lubrication was used when inserting the airway device, taking the values 0 (for no), 1 (for yes) and 9 (for missing data)
• sore: occurrence of sore throat after surgery, taking the values 0 (for no) and 1 (for yes). This is the response variable.
(a) Preliminary analysis:
(i) In the data frame soreThroat, all explanatory variables have been saved as numeric vectors in R. Identify which of the variables are factors, and make sure that R is treating them as factors.
(Hint: If you need a reminder about how to do this look back at Notebook activities 1.7 and 3.1.) 
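As a reminder of the general idea, the sketch below converts a numerically coded column to a factor. The data frame toy and its columns x and grp are hypothetical, and this may differ in detail from the approach shown in the Notebook activities.

```r
# Toy data frame with hypothetical columns; NOT the soreThroat data.
toy <- data.frame(x = c(23, 41, 35, 58), grp = c(0, 1, 1, 0))

is.factor(toy$grp)             # the column starts out as a numeric vector
toy$grp <- as.factor(toy$grp)  # tell R to treat the codes as categories
levels(toy$grp)                # the distinct codes become the factor levels
```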
(ii) Provide appropriate visual summaries for the variables age, gender, duration, method and lubricated, and comment on these summaries. Where appropriate, provide suggestions for modelling the data, such as using transformations, and point out if there is anything you would have done differently if you had been planning the study.
(iii) Investigate if having missing data in the variable lubricated may impact on your ability to compare the three types of airway devices. For subsequent analysis, replace the missing values in lubricated with NA.
(Hint: The ifelse() function in R may be useful here. For example, the command ifelse(x==1,2,x) replaces all entries in an object x that are equal to 1 with the value 2 and leaves the other entries as they are.)
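The hint's example, and the same pattern applied to a missing-value code, can be sketched as follows. The vector x and its values are hypothetical and are not taken from the data file.

```r
# Hypothetical numeric vector in which 9 is a code for "missing".
x <- c(0, 1, 9, 1, 9, 0)

ifelse(x == 1, 2, x)           # the hint's example: every 1 becomes 2

x_na <- ifelse(x == 9, NA, x)  # same pattern: every 9 becomes NA
x_na
sum(is.na(x_na))               # number of entries now recorded as missing
```

Note that if ifelse() is applied to a variable that has already been converted to a factor, it returns the underlying integer codes rather than the level labels, so it may be simpler to recode the numeric version first, or to re-create the factor afterwards.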
(b) Fit a logistic model with all explanatory variables and comment on the output. Use your results from part (a)(iii) to explain any surprising features. 
(c) Speaking with a consultant reveals that for the traditional face mask (FM) method, no lubrication is necessary as no part of the face mask is inserted into the throat. So the missing data arise naturally from the nature of the study.
The consultant also tells you that there are two research questions (RQ1 and RQ2) to be answered from this study:
• RQ1: Are there differences in occurrences of sore throats between the three types of airway device?
• RQ2: For the two types of airway device where lubrication has been studied, is the application of lubricant associated with lower odds of getting a sore throat? Is this the same for both types of device?
(i) Find the best model (in terms of AIC), which may contain two-factor interactions, that will allow you to answer RQ1. Explain your modelling approach, incorporating the suggestion(s) you made in part (a)(ii), and answer RQ1 using your results. Which type of airway device would you recommend, and why?
(ii) Create a new data frame called soreThroat2, which is a subset of soreThroat consisting of the data when method is either ETT or LMA. Find the best model (in terms of AIC), which may contain two-factor interactions, that will allow you to answer RQ2. Explain your modelling approach, and answer RQ2 using your results.
(Hint: As the variables in the new and old data frames are likely to have the same names, it is recommended that you add the argument data=soreThroat2 to your glm() command in R to avoid confusion.)
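A minimal sketch of subsetting a data frame and then fitting a model with an explicit data argument is given below. The data frame toy, its columns m, z and y, and the level labels "A", "B" and "C" are all hypothetical; in particular, the coding of method in the real data file may differ.

```r
# Toy data with hypothetical values; NOT the study data.
toy <- data.frame(
  m = factor(c("A", "B", "C", "A", "B", "C", "A", "B", "A", "B", "A", "B")),
  z = c(0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0),
  y = c(0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1)
)

# Keep only the rows where m is "A" or "B".
toy2 <- subset(toy, m %in% c("A", "B"))
toy2$m <- droplevels(toy2$m)  # remove the now-unused level "C"

# Naming the data frame explicitly avoids picking up same-named
# variables from the original data frame or the workspace.
fit <- glm(y ~ m + z, family = binomial, data = toy2)
AIC(fit)
```

Dropping the unused factor level is a design choice for tidiness: it keeps tables and model summaries free of an empty category.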
(d) Check if the model you selected in part (c)(i) satisfies the model assumptions of a logistic regression model.