Logistic multilevel model in Voting Assignment
QUESTION B: Estimating Constituency-Level Results from the EU Referendum [25 points]
In the 2016 UK referendum on leaving the EU, the results of the vote were not released for individual electoral constituencies. However, many scholars would like to know why people voted to leave the EU, and how support for leaving differed across constituencies. One previous study has already estimated constituency-level support for ‘leave’ in an authoritative way. Your tasks in this question are (i) to produce estimates of the percentage of voters that voted ‘leave’ in every constituency using multilevel modeling and post-stratification that are as close as possible to this existing set of estimates, as measured by the Mean Absolute Error (MAE), and (ii) to use your results to explain why people voted to leave.
You need to:
- Estimate an appropriate logistic multilevel model explaining voting for leave, using the predictors in the dataset.
- Present the multilevel model results and interpret how the variables affect voting to leave the EU (Note: you do not need to discuss statistical significance).
- Produce post-stratified estimates of the percentage of people who voted ‘leave’ in all 631 constituencies in England, Scotland and Wales.
- Compare your results to the existing estimates using the Mean Absolute Error.
You should present and explain your approach and results in a brief report, explaining why your estimates do or do not perform well compared to the existing estimates. Note: if you cannot get very close to the existing results, do not worry. Your grade depends on the quality of your analysis, presentation and interpretation, not on how close your results are to the existing estimates.
The survey data is called “e” and is in the file “eusurvey.Rda”. It comes from the 2017 British Election Study and contains the following variables:
Variable name    Variable description
cname            constituency name
ccode            constituency code
leave            dependent variable: =1 if respondent voted to leave the EU, 0 if respondent voted to remain in the EU
votecon          =1 if respondent voted Conservative in the 2015 election, 0 otherwise
voteukip         =1 if respondent voted UKIP in the 2015 election, 0 otherwise [note: UKIP is the United Kingdom Independence Party, which campaigns in favour of the UK leaving the EU]
female           =1 if female, 0 otherwise
age              age in years
highed           =1 if respondent is educated to degree level or higher, 0 otherwise
lowed            =1 if respondent has no educational qualifications, 0 otherwise
c_con15          percent vote for Conservative party in the constituency, 2015 election
c_ukip15         percent vote for UKIP in the constituency, 2015 election
c_unemployed     constituency unemployment rate, percent
c_whitebritish   percent of constituency population who are white British
c_deprived       percent of constituency population living in poverty
Note: as in the practical exercise, use the option “nAGQ=0” when estimating the model to avoid estimation errors.
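The model-fitting step can be sketched as follows. This is a minimal illustration on simulated stand-in data, not the real survey: the data frame `toy`, its coefficients and the choice of predictors are assumptions made so the example runs on its own. With the real data, the same `glmer()` call would use `data = e` and whichever individual- and constituency-level predictors you decide to include.

```r
library(lme4)

set.seed(1)

# Simulated stand-in for the survey data "e" (NOT the real BES data)
n_const <- 30
n_resp  <- 1500
toy <- data.frame(
  ccode  = factor(sample(seq_len(n_const), n_resp, replace = TRUE)),
  female = rbinom(n_resp, 1, 0.5),
  age    = round(runif(n_resp, 18, 90)),
  highed = rbinom(n_resp, 1, 0.3)
)

# Simulated constituency intercepts and outcome (arbitrary coefficients)
u   <- rnorm(n_const, sd = 0.5)
eta <- -0.5 + 0.02 * (toy$age - 50) - 0.8 * toy$highed + u[as.integer(toy$ccode)]
toy$leave <- rbinom(n_resp, 1, plogis(eta))

# Random-intercept logistic model; nAGQ = 0 as the brief recommends
m <- glmer(leave ~ female + age + highed + (1 | ccode),
           data = toy, family = binomial(link = "logit"), nAGQ = 0)

# Fixed effects are on the log-odds scale
print(fixef(m))

# Predicted probability of voting leave for one hypothetical demographic cell
cell <- data.frame(ccode = factor(1, levels = levels(toy$ccode)),
                   female = 1, age = 40, highed = 1)
p_cell <- predict(m, newdata = cell, type = "response", allow.new.levels = TRUE)
print(p_cell)
```

Because the coefficients are log-odds, exponentiating them gives odds ratios, which can be easier to interpret in the report.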
Post-stratification data for the 631 constituencies is called “post” and is contained in the file “eupoststrat.Rda”. Each row contains one particular demographic group in one constituency. In addition to the variables in “e”, it also contains these variables:
Variable name    Variable description
c_count          number of people in the demographic group
c_total          number of people in the constituency
percent          percent of constituency represented by the demographic group
Finally, the comparison data containing the existing estimates by constituency is called “est” and is in the file “existing_estimates.Rda”. In addition to the constituency name and code, it contains the existing estimate of the leave vote share for each constituency (called estimate).
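The post-stratification and comparison steps amount to a weighted average followed by a mean of absolute differences. The sketch below uses made-up numbers: `post_toy`, `est_toy` and the column `pred` are hypothetical. With the real data, `pred` would typically come from calling `predict()` on “post” with the fitted model, and you should check whether `estimate` is stored as a percentage or a proportion before comparing.

```r
# Toy post-stratification frame (made-up numbers): one row per demographic cell
post_toy <- data.frame(
  ccode   = c("A", "A", "B", "B"),
  c_count = c(600, 400, 300, 700),
  pred    = c(0.40, 0.70, 0.30, 0.60)  # hypothetical predicted P(leave) per cell
)

# Post-stratified estimate: cell predictions weighted by cell population
num <- tapply(post_toy$pred * post_toy$c_count, post_toy$ccode, sum)
den <- tapply(post_toy$c_count, post_toy$ccode, sum)
ps  <- data.frame(ccode = names(num), leave_pct = 100 * as.numeric(num / den))

# Toy "existing" estimates, assumed here to be on a 0-100 scale
est_toy <- data.frame(ccode = c("A", "B"), estimate = c(50, 55))

# Match on constituency code, then compute the Mean Absolute Error
cmp <- merge(ps, est_toy, by = "ccode")
mae <- mean(abs(cmp$leave_pct - cmp$estimate))
print(cmp)
print(mae)
```

Weighting by `c_count` within a constituency is equivalent to weighting by `percent`, provided the cells in each constituency cover its whole population.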
This part of the final essay contains one question. It is worth 40 points. Again, 5 points are reserved for clarity of presentation, especially tables and figures. See Q+A session 5 for guidelines on presentation.
The question requires you to write a brief report. It is up to you how you structure the report, but it is advisable to keep introductory material to a minimum, given the word limit. Your report should discuss your methods, your results and the conclusions that you draw from them.
QUESTION C: Describing and Classifying Tweets [40 points]
Many companies monitor social media posts in order to gauge how customers feel about their company and their competitors. For this question, imagine that you have been hired as a consultant by one of the major American airline companies to analyse tweets about airlines. They want to find out how people talk about airlines on Twitter, and then build a predictive tool that can classify tweets in future into ‘negative’ or ‘positive’ sentiment toward airlines, to help them respond better to their customers in real time. They have provided you with a dataset of 11,541 tweets about airlines that have been labelled as ‘negative’ or ‘positive’ by their staff. The dataset also identifies which airline each tweet is talking about.
Your task is to prepare a brief report that describes the tweets, and recommends a classification method for future tweets. You need to:
i) Use appropriate tools to describe the tweets. In particular, what words are associated with negative or positive sentiment? How does word usage differ across the different airlines?
ii) Use your analysis from i) to build a short dictionary of negative and positive words describing airlines, then use it to classify tweets as ‘negative’ if they contain more negative than positive language, and ‘positive’ otherwise [code for creating your own dictionary is provided below].
iii) Use the lasso logit method to classify the tweets into ‘negative’ and ‘positive’.
iv) Compare the performance of your classifiers from ii) and iii), and use this analysis to decide which one would be the better classifier for the company to use for future tweets.
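One way to compare two classifiers is to compute the same summary metrics for each against the labelled sentiment. The sketch below is only illustrative: the vectors `truth` and `pred_a` are made-up toy values and `classifier_metrics()` is a hypothetical helper, not part of any package. It treats 1 (negative) as the class of interest, matching the coding of `sentiment`.

```r
# Hypothetical helper: accuracy, precision and recall for a 0/1 classifier,
# treating 1 (= negative sentiment) as the class of interest
classifier_metrics <- function(truth, pred) {
  tp <- sum(pred == 1 & truth == 1)  # negative tweets correctly flagged
  fp <- sum(pred == 1 & truth == 0)  # positive tweets wrongly flagged
  fn <- sum(pred == 0 & truth == 1)  # negative tweets missed
  tn <- sum(pred == 0 & truth == 0)  # positive tweets correctly passed
  c(accuracy  = (tp + tn) / length(truth),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

# Toy labels and predictions (made up for illustration)
truth  <- c(1, 1, 1, 0, 0, 1, 0, 1)
pred_a <- c(1, 0, 1, 0, 1, 1, 0, 1)

metrics_a <- classifier_metrics(truth, pred_a)
print(metrics_a)

# Confusion matrix for the same predictions
print(table(predicted = pred_a, actual = truth))
```

For a fair comparison, the lasso logit should be evaluated on tweets held out from its training data, since accuracy measured on the training tweets will overstate how well it classifies future tweets.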
The dataset for this question is called “tweets” and is contained in the file “tweets.Rda”. It contains the following variables:
Variable name    Variable description
text             The text of each tweet
sentiment        Labeled sentiment of each tweet: 1=negative, 0=positive
airline          The airline company featured in the tweet: United, JetBlue, American Airlines, US Airways, Virgin America or Southwest
You should first create a corpus of tweets using the following code:
tweetCorpus <- corpus(tweets$text, docvars = tweets)
Here is some advice for part ii):
- Your dictionary should contain a minimum of 5 words and a maximum of 15 words in each category
- You are not expected to exhaustively compare the performance of different dictionaries. Instead, simply choose one dictionary based on your analysis from i), explaining how you chose the words.
Code for creating a dictionary:
You can create a dictionary called “mydict” in R that contains two categories (‘negative’ and ‘positive’) using the following code:
neg.words <- c()
pos.words <- c()
mydict <- dictionary(list(negative = neg.words,
                          positive = pos.words))
You need to insert your chosen sets of negative and positive words in ‘neg.words’ and ‘pos.words’. This dictionary can then be used with quanteda in exactly the same way as any of the existing built-in dictionaries.
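As a minimal sketch of how such a dictionary could be applied, the example below classifies three made-up tweets. The tweets and the word lists are placeholders chosen only for illustration, not recommended dictionary entries; your own words should come from your descriptive analysis. It assumes a recent version of quanteda, in which text is tokenised before a document-feature matrix is built.

```r
library(quanteda)

# Made-up tweets and placeholder word lists (for illustration only)
toy_text  <- c("flight delayed again, worst service",
               "thanks for the great crew",
               "lost my bag")
neg.words <- c("delayed", "worst", "lost", "cancelled", "rude")
pos.words <- c("thanks", "great", "love", "awesome", "best")
mydict <- dictionary(list(negative = neg.words,
                          positive = pos.words))

# Tokenise, build a document-feature matrix, then count dictionary matches
toks   <- tokens(corpus(toy_text), remove_punct = TRUE)
counts <- as.matrix(dfm_lookup(dfm(toks), dictionary = mydict))

# 'negative' (coded 1) if more negative than positive words, 'positive' (0) otherwise
pred <- as.integer(counts[, "negative"] > counts[, "positive"])
print(pred)
```

Under this rule, tweets with equal counts, including tweets that match no dictionary word at all, are classified as ‘positive’, which is worth bearing in mind when interpreting the dictionary classifier's errors.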