ECO520 Midterm Project

Topic: COVID19 Data in Wisconsin by Census Tract (25 points total)

Use SAS code to answer the following questions using MYCOVID19 data.

Submit the SAS code as a txt file and a pdf or doc file for the report. Only include necessary tables or graphs from the SAS output in the report.

DO NOT INCLUDE ALL SAS OUTPUT IN THE REPORT

Wisconsin COVID-19 data by census tract boundary

All data are laboratory-confirmed cases of COVID-19 that we freeze once a day to verify and ensure that we are reporting accurate information. The number of people with positive/negative test results includes only Wisconsin residents who had their results reported electronically to DHS. Here are descriptions of the variables in the data

Variable Name Variable Description

GEOID Geographic ID

State State

CENSUS_TRACT Census Tract Number

COUNTY County Name

DATE Last Date of Report

POSITIVE Number of Positive on COVID19 Test

NEGATIVE Number of Negative on COVID19 Test

DEATHS Number of Deaths by COVID19

HOSP_YES Number of Hospitalized by COVID19

HOSP_NO Number of Not Hospitalized by COVID19

AREA_LAND Land Area Size

AREA_WATER Water Area Size

POPULATION Total Population

POP_LT18 Percent of Population that is Less Than 18 Years

POP_65P Percent of Population that is 65 Years and Over

HOUS_NO_VEH Percent of households with no vehicle available

ADULT_LIMITED_ENGLISH Percent of adults 18 years and over who have limited English ability

ADULT_SPANISH_LENG Percent of adults 18 years and over who speak Spanish and have limited English ability

POP_BELOWPOV Percent of Population whose income in the past 12 months is below poverty level

POP_DISABILITY Percent of Population with a Disability

POP_MEDICAD Percent of Population with Medicaid/Means-Tested Public Coverage

POP_MEDICARE Percent of Population with Medicare Coverage

POP_HEALTHINS Percent of Population with No Health Insurance Coverage

HOUS_NOSMARTPHN Percent of Households that Have No Smartphone

HOUS_NOINTERNET Percent of Households with No Internet Access

Here is the SAS code to load the data into your SAS program

filename webdat url "https://bigblue.depaul.edu/jlee141/econdata/eco520/COVID19_WI22.csv" ;

proc import datafile=webdat out = COVID19 DBMS = csv replace ; run ;

run ;

/*Select 500 randomly selected census tracts in WI using YourDePaulID */

proc surveyselect data= COVID19 method=srs seed= YourDePaulID N=500 out= MYCOVID19 ;

run;

proc contents data=MYCOVID19 ;

run ;

Data Steps (2 points)

Create the following variables (the recommended variable names in parentheses).

The percent of POSITIVE by POPULATION:

Pct_POSITIVE = 100*POSTIVE/POPULATION

The percent of DEATH by POPULATION:

Pct_DEATH = 100*DEATHS/POPULATION

The percent of HOSP_YES by POSITIVE:

Pct_HOSP_POSITIVE= 100*HOSP_YES/POSITIVE

The percent of POSITIVE results by TOTAL TESTS:

Pct_POSTIVE_TEST = 100*POSITIVE / (NEGATIVE+POSITIVE)

A category variable for the size of census tract by POPULATION (SIZE_CLASS). Lets define the size of the census tract:

if population is less than 2800, then SIZE_CLASS =1 ,

if 2800 <= POPULATION < 5000 then SIZE_CLASS = 2,

and SIZE_CLASS = 3 if the POPULATION >= 5000.

Remove any observations that have missing or zero for GEOID or POPULATION.

Descriptive Analytics Questions. ( 6 points)

Use descriptive statistics and plots to answer the following questions

(Confirmed cases mean the test results show positive from the COVID19 test.)

Find the average percentage of confirmed cases, hospitalized, and deaths rates by County, and list the five counties that have the highest probability of positive/ death/ hospitalized.

Use descriptive statistics and graphs to show the facts on Pct_POSITIVE, pct_DEATH, Pct_HOSP_POSITIVE by the size of the census tract (SIZE_CLASS).

Find the correlation coefficients related to the confirmed case percentage (Pct_POSITIVE) with demographic, social status, and healthcare service-related variables. Then, select the most related variable to the rate of confirmed cases and show the relationship using scatter plots.

Suppose you are working as a consultant for the state of Wisconsin and make some recommendations in terms of infection, death, and hospitalized rates. According to the descriptive statistics, where are the most critical areas that the govt agent need to spend their resources? Explain clearly.

Hierarchical and Non-hierarchical Clustering Analysis on ( 7 points)

Hierarchical Clustering Analysis for the communication and transportation: HOUS_NO_VEH, HOUS_NOSMARTPHN, HOUS_NOINTERNET

Non-Hierarchical Clustering Analysis for the communication and transportation: HOUS_NO_VEH, HOUS_NOSMARTPHN, HOUS_NOINTERNET

Compare the results 1) and 2), and define or name the clusters using descriptive statistics and plots.

Test if the clusters are significant to the rate of confirmed cases.

Test if the clusters are significant to the rate of hospitalization.

Predictive Analytics using Regression Model (10 points)

Use only TRAIN data (70%) to estimate the models and use the TEST data (30%) to perform the out-of-sample prediction. Please note that the grade will be given by the SAS code and the performance of the models.

Estimate regression models to predict the rate of COVID19 confirmed cases. Please note none of the variables related to COVID19 can be used to predict the COVID19 related dependent variables, but only use variables start with POP_, HOUS_, ADULT, or AREA_, and SIZE_CLASS as independent variables.

Find the best regression models to explain the variation of the ratio of deaths by population (Pct_DEATH). You can use any variables such as nonlinear and cluster variables to make the best models.

Model 1: Simple Regression Model with your choice of one independent variable

Model 2: Your own choices of variables

Model 3: Including all possible independent variables

Model 4: Stepwise,

Model 5: adjusted R square

Perform the out-of-sample prediction using the observations not used in the estimation (test dataset). Then, find the following statistics and compare the results. Which model is the best performing model in terms of the following statistics?

MSE (mean square error)

RMSE (root mean square error)

MPE (mean percentage error)

MAE (mean absolute error)

Find the best regression models to explain the variation of the percent of positive results (Pct_POSTIVE_TEST). You can use any variables such as nonlinear variables and cluster variables to make the best models.

Model 1: Simple Regression Model with your choice of an independent variable

Model 2: Your own choices of variables 2

Model 3: Including all possible independent variables

Model 4: Stepwise

Model 5: adjusted R square

MSE (mean square error)

RMSE (root mean square error)

MPE (mean percentage error)

MAE (mean absolute error)

Download Solution Now

Uploaded By : Pooja Dhaka
Posted on : December 23rd, 2024
Downloads : 0
Views : 192

ECO520 Midterm Project

Download Solution Now

Download Solution Now

Choose a Plan

Premium

Gold

Silver

ECO520 Midterm Project

Download Solution Now

Download Solution Now

Choose a Plan

Premium

Gold

Silver

Request a Call Back