ECO520 Midterm Project
ECO520 Midterm Project
Topic: COVID19 Data in Wisconsin by Census Tract (25 points total)
Use SAS code to answer the following questions using MYCOVID19 data.
Submit the SAS code as a txt file and a pdf or doc file for the report. Only include necessary tables or graphs from the SAS output in the report.
DO NOT INCLUDE ALL SAS OUTPUT IN THE REPORT
Wisconsin COVID-19 data by census tract boundary
All data are laboratory-confirmed cases of COVID-19 that we freeze once a day to verify and ensure that we are reporting accurate information. The number of people with positive/negative test results includes only Wisconsin residents who had their results reported electronically to DHS. Here are descriptions of the variables in the data
Variable Name Variable Description
GEOID Geographic ID
State State
CENSUS_TRACT Census Tract Number
COUNTY County Name
DATE Last Date of Report
POSITIVE Number of Positive on COVID19 Test
NEGATIVE Number of Negative on COVID19 Test
DEATHS Number of Deaths by COVID19
HOSP_YES Number of Hospitalized by COVID19
HOSP_NO Number of Not Hospitalized by COVID19
AREA_LAND Land Area Size
AREA_WATER Water Area Size
POPULATION Total Population
POP_LT18 Percent of Population that is Less Than 18 Years
POP_65P Percent of Population that is 65 Years and Over
HOUS_NO_VEH Percent of households with no vehicle available
ADULT_LIMITED_ENGLISH Percent of adults 18 years and over who have limited English ability
ADULT_SPANISH_LENG Percent of adults 18 years and over who speak Spanish and have limited English ability
POP_BELOWPOV Percent of Population whose income in the past 12 months is below poverty level
POP_DISABILITY Percent of Population with a Disability
POP_MEDICAD Percent of Population with Medicaid/Means-Tested Public Coverage
POP_MEDICARE Percent of Population with Medicare Coverage
POP_HEALTHINS Percent of Population with No Health Insurance Coverage
HOUS_NOSMARTPHN Percent of Households that Have No Smartphone
HOUS_NOINTERNET Percent of Households with No Internet Access
Here is the SAS code to load the data into your SAS program
filename webdat url "https://bigblue.depaul.edu/jlee141/econdata/eco520/COVID19_WI22.csv" ;
proc import datafile=webdat out = COVID19 DBMS = csv replace ; run ;
run ;
/*Select 500 randomly selected census tracts in WI using YourDePaulID */
proc surveyselect data= COVID19 method=srs seed= YourDePaulID N=500 out= MYCOVID19 ;
run;
proc contents data=MYCOVID19 ;
run ;
Data Steps (2 points)
Create the following variables (the recommended variable names in parentheses).
The percent of POSITIVE by POPULATION:
Pct_POSITIVE = 100*POSTIVE/POPULATION
The percent of DEATH by POPULATION:
Pct_DEATH = 100*DEATHS/POPULATION
The percent of HOSP_YES by POSITIVE:
Pct_HOSP_POSITIVE= 100*HOSP_YES/POSITIVE
The percent of POSITIVE results by TOTAL TESTS:
Pct_POSTIVE_TEST = 100*POSITIVE / (NEGATIVE+POSITIVE)
A category variable for the size of census tract by POPULATION (SIZE_CLASS). Lets define the size of the census tract:
if population is less than 2800, then SIZE_CLASS =1 ,
if 2800 <= POPULATION < 5000 then SIZE_CLASS = 2,
and SIZE_CLASS = 3 if the POPULATION >= 5000.
Remove any observations that have missing or zero for GEOID or POPULATION.
Descriptive Analytics Questions. ( 6 points)
Use descriptive statistics and plots to answer the following questions
(Confirmed cases mean the test results show positive from the COVID19 test.)
Find the average percentage of confirmed cases, hospitalized, and deaths rates by County, and list the five counties that have the highest probability of positive/ death/ hospitalized.
Use descriptive statistics and graphs to show the facts on Pct_POSITIVE, pct_DEATH, Pct_HOSP_POSITIVE by the size of the census tract (SIZE_CLASS).
Find the correlation coefficients related to the confirmed case percentage (Pct_POSITIVE) with demographic, social status, and healthcare service-related variables. Then, select the most related variable to the rate of confirmed cases and show the relationship using scatter plots.
Suppose you are working as a consultant for the state of Wisconsin and make some recommendations in terms of infection, death, and hospitalized rates. According to the descriptive statistics, where are the most critical areas that the govt agent need to spend their resources? Explain clearly.
Hierarchical and Non-hierarchical Clustering Analysis on ( 7 points)
Hierarchical Clustering Analysis for the communication and transportation: HOUS_NO_VEH, HOUS_NOSMARTPHN, HOUS_NOINTERNET
Non-Hierarchical Clustering Analysis for the communication and transportation: HOUS_NO_VEH, HOUS_NOSMARTPHN, HOUS_NOINTERNET
Compare the results 1) and 2), and define or name the clusters using descriptive statistics and plots.
Test if the clusters are significant to the rate of confirmed cases.
Test if the clusters are significant to the rate of hospitalization.
Predictive Analytics using Regression Model (10 points)
Use only TRAIN data (70%) to estimate the models and use the TEST data (30%) to perform the out-of-sample prediction. Please note that the grade will be given by the SAS code and the performance of the models.
Estimate regression models to predict the rate of COVID19 confirmed cases. Please note none of the variables related to COVID19 can be used to predict the COVID19 related dependent variables, but only use variables start with POP_, HOUS_, ADULT, or AREA_, and SIZE_CLASS as independent variables.
Find the best regression models to explain the variation of the ratio of deaths by population (Pct_DEATH). You can use any variables such as nonlinear and cluster variables to make the best models.
Model 1: Simple Regression Model with your choice of one independent variable
Model 2: Your own choices of variables
Model 3: Including all possible independent variables
Model 4: Stepwise,
Model 5: adjusted R square
Perform the out-of-sample prediction using the observations not used in the estimation (test dataset). Then, find the following statistics and compare the results. Which model is the best performing model in terms of the following statistics?
MSE (mean square error)
RMSE (root mean square error)
MPE (mean percentage error)
MAE (mean absolute error)
Find the best regression models to explain the variation of the percent of positive results (Pct_POSTIVE_TEST). You can use any variables such as nonlinear variables and cluster variables to make the best models.
Model 1: Simple Regression Model with your choice of an independent variable
Model 2: Your own choices of variables 2
Model 3: Including all possible independent variables
Model 4: Stepwise
Model 5: adjusted R square
Perform the out-of-sample prediction using the observations not used in the estimation (test dataset). Then, find the following statistics and compare the results. Which model is the best performing model in terms of the following statistics?
MSE (mean square error)
RMSE (root mean square error)
MPE (mean percentage error)
MAE (mean absolute error)