401077 Introduction to Biostatistics, Spring 2023
Assignment 1 (Due Sunday August 20, 2023)
Please answer all questions. Record your answers in the MS Word template provided and submit via Turnitin before 11:59pm on the due date. The marks allocated to each question are shown in the assignment. A total of 30 marks are available and this assignment is worth 30% of your overall grade.
All of the questions require you to analyse the unique assignment data set which I have created for you. This is labelled dataforxxxxxxxx.RData, where xxxxxxxx represents your Student ID number. The description of this data set is provided in the file Description of your data set.docx. You can find your data set and its description in the Assessment 1 folder in vUWS.
Note: Each student will get different answers as the data sets differ.
Question 1 (4 marks)
Identify all of the categorical variables in the assignment data file allocated to you. Explain why these variables are categorical.
Question 2 (4 marks)
a) Using the assignment data file allocated to you and R Commander, graph the relationship between the C-Reactive protein level (variable 'crp') and the language the respondent used when completing the survey (variable 'lang') in your sample of adult respondents to the NHANES survey. Prepare the figure in R Commander with appropriate axis labels, then copy and paste it into your assignment answers and give it an appropriate title. (1 mark)
b) Use appropriate statistics to compare the centres of the distributions of the respondents' estimated total grams of fats consumed on the day prior to interview (variable 'fats') between categories of the language the respondent used when completing the survey (variable 'lang') in your sample of adult respondents to the NHANES survey. R Commander output alone is insufficient and should be avoided. (3 marks)
Question 3 (6 marks)
a) Using the assignment data file allocated to you and R Commander, create a table showing the relationship between respondents' highest education level (variable 'educ') and the language they used when completing the survey (variable 'lang') in your sample of adult respondents to the NHANES survey. Please include row or column percentages. Use the results from R Commander to create a table in Word with an appropriate title and row and column headings (the output from R Commander is poorly labelled). (3 marks)
b) Using the row or column percentages from part a), describe the relationship between highest education level and language used to complete the survey in your sample of adult respondents to the NHANES survey. (2 marks)
c) Consider the sub-set of respondents with <9th grade education in your sample of adult respondents to the NHANES survey. If you select one person at random from this sub-set, what is the probability that this person used English when responding to the NHANES survey? (1 mark)
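The arithmetic behind row percentages and the conditional probability in part c) can be sketched as follows. This is an illustration in Python rather than R Commander, and every count in it is hypothetical; it only shows how a cell count relates to its row total, not what your own data set will produce.

```python
# Hypothetical counts only: rows are education levels, columns are survey languages.
# Replace these with the counts from your own cross-tabulation.
counts = {
    "<9th grade": {"English": 12, "Spanish": 28},
    "College graduate or above": {"English": 95, "Spanish": 5},
}


def row_percentages(table):
    """Return each row's cell counts as percentages of that row's total."""
    result = {}
    for row_label, cells in table.items():
        row_total = sum(cells.values())
        result[row_label] = {col: 100 * n / row_total for col, n in cells.items()}
    return result


row_pct = row_percentages(counts)

# Conditional probability P(English | <9th grade) = cell count / row total.
low_educ_row = counts["<9th grade"]
p_english_given_low_educ = low_educ_row["English"] / sum(low_educ_row.values())

print(row_pct)
print(round(p_english_given_low_educ, 3))
```

Restricting attention to one row of the table is what makes this a conditional probability: the denominator is the size of the sub-set, not the whole sample.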
Question 4 (5 marks)
a) Using R Commander and the sample of adult respondents to the NHANES survey assigned to you, what proportion of respondents in your data set had a highest education level of 'College graduate or above'? R Commander output alone is insufficient and should be avoided. (1 mark)
b) The Binomial probability distribution requires that i) the number of observations n is fixed, ii) each observation is independent, iii) each observation represents one of two outcomes (such as 'success' or 'failure'), and iv) the probability of 'success' p is the same for each observation. Explain how each of these 4 conditions is met when estimating the probability that 2 or more of 6 people selected at random from your data set of adult respondents to NHANES have a highest education level of 'College graduate or above'. (2 marks)
c) If you selected 6 people at random from your data set of adult respondents to NHANES, what is the probability that two or more (≥ 2) of these six people had a highest education level of 'College graduate or above'? (2 marks)
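A minimal sketch of the Binomial calculation in part c), written in Python rather than R Commander. The value of p below is hypothetical; in the assignment it would be the proportion you obtain in part a). The sketch uses the complement rule, P(X ≥ 2) = 1 − P(X = 0) − P(X = 1).

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 6      # number of people selected, as stated in the question
p = 0.25   # HYPOTHETICAL proportion of 'College graduate or above'; use your own

# Complement rule: subtract the probabilities of 0 and 1 successes from 1.
p_at_least_two = 1 - binomial_pmf(0, n, p) - binomial_pmf(1, n, p)

print(round(p_at_least_two, 4))
```

Working from the complement needs only two terms, whereas summing P(X = 2) through P(X = 6) directly needs five and gives the same result.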
Question 5 (11 marks)
Assume the sample of US adults who responded to the NHANES study assigned to you is representative of all US adults.
a) The distribution of C-Reactive protein has a strong positive skew, but the logarithm of C-Reactive protein can be assumed to be Normally distributed. Using R Commander, calculate the mean and standard deviation of the logarithm of C-Reactive protein in the data set of adult respondents to NHANES assigned to you. (R Commander output alone is insufficient and should be avoided.) (1 mark)
b) Suppose Ms Taylor Swift (a well-known American) just had her C-Reactive protein measured as 0.25 mg/dL. This means the natural logarithm of her C-Reactive protein is -1.3863. Use the results of part a) to estimate the z-score for the logarithm of Ms Swift's C-Reactive protein reading. Show any working. (2 marks)
c) Using the Normal probability model and R Commander, estimate the proportion of Americans who would have a C-Reactive protein level higher than that recorded by Ms Swift. (2 marks)
d) Suppose Mr Donald Trump (a well-known American) claimed that his C-Reactive protein level was within the lowest 1% of all values recorded in America. Using the results in part a) and R Commander (and accepting that Mr Trump's statement is accurate), estimate the maximum possible value for the logarithm of Mr Trump's C-Reactive protein. (2 marks)
e) There are more than 4,000 institutions classified as universities in the US. Suppose each of a random sample of 180 of these institutions took a random sample of 100 of its students and measured the logarithm of their C-Reactive protein levels. Each of these universities then reported its sample mean of the logarithm of C-Reactive protein to a central processing point. Why would the Central Limit Theorem be applicable when predicting the distribution of these 180 sample means? (2 marks)
f) Using the Central Limit Theorem and your findings in part a), estimate the overall mean and standard error of the distribution of the sample means described in part e). (2 marks)
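The Normal-model steps in Question 5 can be sketched as follows. This is an illustration in Python's standard library, not R Commander, and the mean and standard deviation below are hypothetical placeholders for the values you calculate in part a). It shows the z-score, an upper-tail proportion, a lower 1% cut-off, and the standard error of a mean of 100 observations.

```python
from math import log, sqrt
from statistics import NormalDist

# HYPOTHETICAL summary statistics for log(CRP); substitute your own from part a).
mean_log_crp = -1.0
sd_log_crp = 1.2
log_crp_model = NormalDist(mu=mean_log_crp, sigma=sd_log_crp)

# The natural logarithm of 0.25 mg/dL reproduces the -1.3863 quoted in the question.
observed_log = log(0.25)

# z-score: distance from the mean in standard deviation units.
z = (observed_log - mean_log_crp) / sd_log_crp

# Upper-tail proportion: share of values above the observed reading.
p_higher = 1 - log_crp_model.cdf(observed_log)

# Lowest 1%: the quantile below which 1% of values fall.
cutoff_lowest_1pct = log_crp_model.inv_cdf(0.01)

# Central Limit Theorem: sample means of size n have standard error sd / sqrt(n).
n_per_sample = 100
standard_error = sd_log_crp / sqrt(n_per_sample)

print(round(observed_log, 4), round(z, 3), round(p_higher, 3))
print(round(cutoff_lowest_1pct, 3), round(standard_error, 3))
```

Note that the standard error depends on the size of each sample (100 students), not on the number of universities reporting (180).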