Bayesian analysis project: Vinho Verde
Perform a logistic regression on a real dataset. The wine dataset CSV file is attached.
This dataset is related to red variants of the Portuguese "Vinho Verde" wine. The dataset is described in the publication by Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties.
The input variables (based on physicochemical tests) are:
fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol
The output variable (based on sensory data) is quality (a score between 0 and 10).
Analysis:
[2 points] Submit R code that runs without errors; this is marked independently of its correctness. Additionally, you must provide snippets of code within the report itself when answering each of the questions below.
[2 points] Read the dataset into R. Check if there are missing values (NA) and, in case there are, remove them.
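A minimal sketch of reading the file and handling missing values. The name of the attached CSV is not given here, so the example writes a tiny stand-in file to a temporary path; `csv_path` and the three-column layout are assumptions for illustration only.

```r
# Stand-in for the attached file: a tiny CSV written to a temporary path.
# Replace `csv_path` with the location of the real wine CSV.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("alcohol,pH,quality",
             "9.4,3.51,5",
             "9.8,NA,5",
             "10.5,3.26,7"), csv_path)

wine <- read.csv(csv_path)

# Count missing values per column, then keep only the complete rows.
na_per_column <- colSums(is.na(wine))
print(na_per_column)
wine_clean <- na.omit(wine)
```

If `na_per_column` is all zeros for the real file, `na.omit()` leaves the data frame unchanged.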
[2 points] We want to implement a logistic regression, so we need a response variable that takes the value either 0 or 1. Suppose we consider a wine "good" if its quality is 6.5 or above.
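One way to build the binary response is a simple threshold on the quality score. The vector `quality` below holds made-up scores, and the name `good` is a hypothetical choice.

```r
# Hypothetical quality scores standing in for the `quality` column.
quality <- c(3, 5, 6, 7, 8)

# 1 if the wine counts as "good" (quality of 6.5 or above), 0 otherwise.
good <- as.integer(quality >= 6.5)
print(table(good))
```

Because the scores are integers, the threshold 6.5 is equivalent to requiring a quality of 7 or more.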
[4 points] Run a frequentist analysis of the logistic model using the glm() function. Which coefficients are significant?
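The sketch below shows the general shape of such a fit on simulated stand-in data with two covariates; the real analysis would use all the input variables. The simulated coefficients and the 0.05 significance level are assumptions made for this example.

```r
set.seed(1)

# Simulated stand-in data; the true coefficients below are arbitrary.
n <- 500
wine <- data.frame(alcohol = rnorm(n, 10.4, 1.1),
                   volatile.acidity = rnorm(n, 0.53, 0.18))
eta <- -10 + 1.0 * wine$alcohol - 3 * wine$volatile.acidity
wine$good <- rbinom(n, 1, plogis(eta))

# Logistic regression of `good` on every other column.
fit <- glm(good ~ ., data = wine, family = binomial(link = "logit"))
coefs <- summary(fit)$coefficients

# Terms whose Wald test p-value falls below the assumed 0.05 level.
significant <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
print(coefs)
print(significant)
```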
[5 points] Estimate the probabilities of a "success": fix every other covariate at its mean level, compute the probability that a wine scores "good" as total.sulfur.dioxide varies, and plot the results.
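A possible approach is to predict over a grid of total.sulfur.dioxide values while holding the remaining covariates at their sample means. The data here are simulated with only two covariates, and the grid size of 100 is an arbitrary choice.

```r
set.seed(2)

# Simulated stand-in data with arbitrary true coefficients.
n <- 400
wine <- data.frame(total.sulfur.dioxide = runif(n, 6, 290),
                   alcohol = rnorm(n, 10.4, 1.1))
eta <- -9 + 0.8 * wine$alcohol - 0.01 * wine$total.sulfur.dioxide
wine$good <- rbinom(n, 1, plogis(eta))
fit <- glm(good ~ ., data = wine, family = binomial)

# Grid over the covariate of interest; the other covariate sits at its mean.
grid <- data.frame(
  total.sulfur.dioxide = seq(min(wine$total.sulfur.dioxide),
                             max(wine$total.sulfur.dioxide),
                             length.out = 100),
  alcohol = mean(wine$alcohol)
)

# type = "response" returns probabilities rather than log-odds.
grid$p_good <- predict(fit, newdata = grid, type = "response")

plot(grid$total.sulfur.dioxide, grid$p_good, type = "l",
     xlab = "total.sulfur.dioxide", ylab = "P(good)")
```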
[15 points] Perform a Bayesian analysis of the logistic model for the dataset, i.e. approximate the posterior distributions of the regression coefficients, following these steps:
Write an R function for the log posterior distribution.
Fix the number of simulations at 10^4.
Choose 4 different initialisations for the coefficients.
For each initialisation, run a Metropolis-Hastings algorithm.
Plot the chains for each coefficient (the 4 chains on the same plot) and comment.
(Question 6 HINT: Generate a separate plot for every coefficient (not just the significant ones); each plot should show 4 chains, one per initialisation. When choosing the starting points, think about the purpose of running multiple chains to help you decide how to pick the 4 initialisations. There is no single correct way to do this, and you are free to choose your own approach so long as it serves that purpose.)
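The steps above can be sketched as follows on simulated data with a single covariate. The brief does not specify a prior, so the independent N(0, 10^2) prior is an assumption, as are the random-walk Gaussian proposal, the step size of 0.15, the starting points, and the helper names `log_posterior` and `run_mh`.

```r
set.seed(3)

# Simulated stand-in data: intercept plus one standardised covariate.
n <- 300
x <- rnorm(n)
X <- cbind(1, x)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))

# Log posterior up to an additive constant: Bernoulli-logit log likelihood
# plus an ASSUMED independent N(0, prior_sd^2) prior on each coefficient.
log_posterior <- function(beta, X, y, prior_sd = 10) {
  eta <- drop(X %*% beta)
  loglik <- sum(y * eta - log1p(exp(eta)))
  logprior <- sum(dnorm(beta, 0, prior_sd, log = TRUE))
  loglik + logprior
}

# Random-walk Metropolis-Hastings with a symmetric Gaussian proposal,
# so the acceptance ratio reduces to a ratio of posterior densities.
run_mh <- function(init, n_sim, step, X, y) {
  chain <- matrix(NA_real_, nrow = n_sim, ncol = length(init))
  current <- init
  current_lp <- log_posterior(current, X, y)
  for (s in seq_len(n_sim)) {
    proposal <- current + rnorm(length(init), 0, step)
    proposal_lp <- log_posterior(proposal, X, y)
    if (log(runif(1)) < proposal_lp - current_lp) {
      current <- proposal
      current_lp <- proposal_lp
    }
    chain[s, ] <- current
  }
  chain
}

n_sim <- 1e4
# Deliberately dispersed starting points (an arbitrary choice).
inits <- list(c(0, 0), c(-5, 5), c(5, -5), c(3, 3))
chains <- lapply(inits, function(b0) run_mh(b0, n_sim, 0.15, X, y))

# One trace plot per coefficient, with the four chains overlaid.
for (j in seq_len(ncol(X))) {
  traces <- sapply(chains, function(ch) ch[, j])
  matplot(traces, type = "l", lty = 1,
          xlab = "Iteration", ylab = paste0("beta[", j - 1, "]"))
}
```

Starting the chains far apart makes it easier to see whether they all settle into the same region, which is the usual informal check for convergence. With covariates on very different scales, as in the real data, a single step size may mix poorly, so standardising the covariates or tuning the proposal per coefficient may be needed.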
[5 points] Approximate the posterior predictive distribution of an unobserved variable characterized by
fixed acidity: 7.5
volatile acidity: 0.6
citric acid: 0.0
residual sugar: 1.70
chlorides: 0.085
free sulfur dioxide: 5
total sulfur dioxide: 45
density: 0.9965
pH: 3.40
sulphates: 0.63
alcohol: 12
Plot the approximate posterior predictive distribution.
(Question 7 HINT: when plotting the distribution of the response variable, is the output what you would expect given what you know about its properties (a discrete random variable, so a probability mass function rather than a probability density function)?)
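A sketch of the simulation idea: for each posterior draw of the coefficients, compute the success probability for the new wine and then draw a 0/1 outcome from it. The matrix `draws` below is a made-up stand-in for post-burn-in MCMC output with only an intercept and an alcohol coefficient; its values are not results from the wine data.

```r
set.seed(4)

# Made-up stand-in for posterior draws (rows = draws;
# columns = intercept, alcohol coefficient).
draws <- cbind(rnorm(5000, -11, 0.8), rnorm(5000, 0.9, 0.07))

# New observation: 1 for the intercept, then alcohol = 12.
x_new <- c(1, 12)

# Success probability implied by each posterior draw.
p_new <- plogis(drop(draws %*% x_new))

# One Bernoulli outcome per draw approximates the posterior predictive.
y_new <- rbinom(length(p_new), 1, p_new)

# The response is binary, so summarise it as a probability mass function.
pmf <- table(factor(y_new, levels = c(0, 1))) / length(y_new)
barplot(pmf, names.arg = c("0 (not good)", "1 (good)"),
        ylab = "Posterior predictive probability")
```

With the full model, `x_new` would hold a 1 followed by all eleven covariate values listed above, in the same order as the columns of the design matrix.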
[5 points] Use the metrop() function available in the mcmc package to perform the same analysis on the posterior distribution you approximated in Question 6. Again choose 10^4 simulations and compare the results with those obtained with your own code. (A visual comparison of the chains is enough to get full marks.)
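A minimal sketch of the metrop() call on simulated data, assuming the mcmc package is installed. The N(0, 10^2) prior, the starting point, and `scale = 0.15` are assumptions; `log_posterior` is a hypothetical helper name.

```r
set.seed(5)

# Simulated stand-in data: intercept plus one standardised covariate.
n <- 300
x <- rnorm(n)
X <- cbind(1, x)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))

# Log unnormalised posterior with an ASSUMED N(0, prior_sd^2) prior.
log_posterior <- function(beta, X, y, prior_sd = 10) {
  eta <- drop(X %*% beta)
  sum(y * eta - log1p(exp(eta))) + sum(dnorm(beta, 0, prior_sd, log = TRUE))
}

if (requireNamespace("mcmc", quietly = TRUE)) {
  # Extra named arguments (X, y) are passed on to log_posterior.
  out <- mcmc::metrop(log_posterior, initial = c(0, 0), nbatch = 1e4,
                      scale = 0.15, X = X, y = y)
  print(out$accept)  # acceptance rate of the random-walk proposals

  # out$batch has one row per iteration and one column per coefficient.
  matplot(out$batch, type = "l", lty = 1,
          xlab = "Iteration", ylab = "Coefficient value")
}
```

Overlaying these traces on the chains from a hand-written sampler, for the same coefficient, gives the visual comparison the question asks for.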