Heart Disease Dataset Analysis: Clustering and Classification Techniques

Data Set- Heart disease

The dataset used in this assignment is the heart disease dataset, available as heart-c.csv on Blackboard. It describes 13 risk factors for heart disease. The attribute num is the (binary) class attribute: class >50_1 indicates an increased level of heart disease. The following questions allow you to demonstrate your knowledge of unsupervised learning and your data exploration skills.

Data Dictionary:

age = age in years

sex = sex (1 = male; 0 = female)

cp = chest pain type

Value 1: typical angina

Value 2: atypical angina

Value 3: non-anginal pain

Value 4: asymptomatic

trestbps = resting blood pressure (in mm Hg on admission to the hospital)

chol = serum cholesterol in mg/dl

fbs = (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

restecg = resting electrocardiographic results :

Value 0: normal

Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)

Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

thalach = maximum heart rate achieved

exang = exercise induced angina (1 = yes; 0 = no)

oldpeak = ST depression induced by exercise relative to rest

slope = the slope of the peak exercise ST segment :

Value 1: upsloping

Value 2: flat

Value 3: downsloping

ca = number of major vessels (0-3) colored by fluoroscopy

thal = 3 = normal; 6 = fixed defect; 7 = reversible defect

num = diagnosis of heart disease (angiographic disease status) :

Value 0: < 50%

Value 1: > 50% (increased level of heart disease)

1. Run K-means clustering on the above heart disease dataset and answer the following questions:

Why should the attribute "class" in heart-c.csv ("num") not be included for clustering?
Run the K-means algorithm and provide reasoning for the optimum value of K.
Which features would you expect to be less useful when using K-means and why?
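
One common way to reason about the value of K is to compute the within-cluster sum of squares (WCSS) for several values of K and look for an "elbow". The following is a minimal sketch of that idea in plain Python, run on synthetic two-feature data rather than on heart-c.csv. The helper names (standardize, kmeans, wcss_by_k) and the data are assumptions made for this example. Features are standardised first because K-means relies on Euclidean distance, so a feature measured on a large numeric scale would otherwise dominate.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def standardize(rows):
    """Rescale every column to zero mean and unit (population) variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def kmeans(rows, k, n_iter=50, seed=0):
    """Lloyd's algorithm; returns (centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)  # initialise from k distinct rows
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for row in rows:
            nearest = min(range(k), key=lambda j: squared_distance(row, centroids[j]))
            clusters[nearest].append(row)
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stopped changing
            break
        centroids = new_centroids
    wcss = sum(min(squared_distance(row, c) for c in centroids) for row in rows)
    return centroids, wcss

# Synthetic data: three made-up groups with features on very different scales.
rng = random.Random(42)
centres = [(40, 180), (55, 250), (70, 320)]
data = [[rng.gauss(cx, 2), rng.gauss(cy, 10)] for cx, cy in centres for _ in range(30)]
scaled = standardize(data)

# Keep the best of a few random restarts for each K, then inspect the curve.
wcss_by_k = {k: min(kmeans(scaled, k, seed=s)[1] for s in range(5)) for k in range(1, 7)}
for k, w in wcss_by_k.items():
    print(k, round(w, 2))
```

A sharp drop in WCSS followed by a flat tail is usually read as the elbow. The same distance argument suggests why nominal codes such as cp or thal need care with K-means: the numeric gaps between their codes carry no meaning.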

2. Run hierarchical clustering on the above heart disease dataset and answer the following questions:

Show the clustering results in a tree structure and provide reasoning for the optimal number of clusters.
Describe the linkage method you used.
What are the strengths and limitations of this linkage method in hierarchical clustering?
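
To make the role of the linkage method concrete, the sketch below implements a naive agglomerative clustering in plain Python and records the height of each merge, which is the information a dendrogram displays. It is an illustration only: the function agglomerate, the LINKAGES table and the six toy points are assumptions made for this example, and a real analysis would normally use a dedicated library.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Each linkage turns the list of pairwise distances between two clusters into one number.
LINKAGES = {
    "single": min,                            # closest pair of members
    "complete": max,                          # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),  # mean over all pairs
}

def agglomerate(points, linkage="average"):
    """Return the merge history as a list of (members_a, members_b, height)."""
    combine = LINKAGES[linkage]
    clusters = [(i,) for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            pair_dists = [euclidean(points[i], points[j]) for i in a for j in b]
            height = combine(pair_dists)
            if best is None or height < best[2]:
                best = (a, b, height)
        a, b, height = best
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges

# Toy data: two tight groups that are far apart.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
results = {method: agglomerate(points, method) for method in LINKAGES}
for method, merges in results.items():
    print(method, [round(h, 2) for _, _, h in merges])
```

A large jump between consecutive merge heights is one common argument for cutting the tree just below that jump. Comparing the printed heights across the three linkages shows how the choice changes the tree even when the final grouping is the same.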

Data Set- Heart disease

Using the heart disease dataset (heart_modified.csv, found on the Blackboard assignment page), show your understanding of classification techniques. The dataset has been modified to include additional noise: features have been added that are either filled with random values or sampled from other features. Please choose two classifiers that have been covered in the module. The target feature for classification is called class, with a particular interest in classifying people who do develop heart disease (class = 1). Remember to train on one sub-sample of your data and hold out a separate sub-sample for testing.

What will be assessed in this question is not which classifier achieves the best accuracy, but the justification of your choices and the interpretation of the results. As long as you show an understanding of what your classifiers are doing, achieving the highest accuracy is not the point of interest.
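
The requirement to train and test on separate sub-samples can be met in several ways. The sketch below shows one of them, a stratified hold-out split, which keeps the proportion of class = 1 cases roughly the same in both parts. The function stratified_split, the 30% test fraction and the made-up label list are assumptions for this example and do not reflect the real class balance in heart_modified.csv.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels: 40 positive and 60 negative cases.
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(labels, test_fraction=0.3, seed=1)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "positives in the test set")
```

Fixing the seed makes the split repeatable, which helps when the effect of a different sampling scheme is compared later.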

Data dictionary:

Patient ID : Unique patient identifier (anonymised)
sex = sex (1 = male; 0 = female)
age = age in years
pacemaker = presence of pacemaker
cp = chest pain type
Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain
Value 4: asymptomatic
trestbps = resting blood pressure (in mm Hg on admission to the hospital)
chol = serum cholesterol in mg/dl
fbs = (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
restecg = resting electrocardiographic results :
Value 0: normal
Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
thalach = maximum heart rate achieved
exang = exercise induced angina (1 = yes; 0 = no)
oldpeak = ST depression induced by exercise relative to rest
perfusion: results of perfusion scan
slope = the slope of the peak exercise ST segment :
Value 1: upsloping
Value 2: flat
Value 3: downsloping
ca = number of major vessels (0-3) colored by fluoroscopy
thal = 3 = normal; 6 = fixed defect; 7 = reversible defect
DRUG: any prescribed cardiovascular drugs
troponin: level of protein released into blood during heart attack.
Fam hist: family history of heart disease
class = diagnosis of heart disease (angiographic disease status) :
Value 0: no disease
Value 1: heart disease

Q1. For each classifier, please answer the following:

1. Did you undertake any preprocessing? If so, why?
2. Run the classifier with default parameters.

a. How accurately can the classifier predict those that develop heart disease? What is in the output that signifies this?
b. How many people are misclassified as developing heart disease? Where is this answer found in the output?

3. Plot and submit ROC curves for the class that develops heart disease. What is another measure of accuracy commonly used?
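
As a reminder of where the figures asked for in items 2 and 3 come from, the sketch below derives confusion-matrix counts, precision and recall for the positive class, and the points of an ROC curve with its area under the curve (AUC). It is a plain-Python illustration on ten made-up predictions, not output from any classifier on heart_modified.csv. The function names, the scores and the 0.5 decision threshold are assumptions for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with respect to the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def roc_points(y_true, scores, positive=1):
    """(FPR, TPR) pairs obtained by lowering the threshold over each distinct score."""
    n_pos = sum(t == positive for t in y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and t == positive for t, s in zip(y_true, scores))
        fp = sum(s >= threshold and t != positive for t, s in zip(y_true, scores))
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up true classes and predicted probabilities of class = 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.35, 0.75]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)        # share of true class = 1 cases that were found
precision = tp / (tp + fp)     # share of predicted class = 1 cases that were right
roc = roc_points(y_true, scores)
area = auc(roc)

print("tp, fp, fn, tn =", tp, fp, fn, tn)
print("recall =", recall, "precision =", precision)
print("AUC =", round(area, 3))
```

Here fp counts the people wrongly classified as developing heart disease, while recall describes how well those who do develop it are detected. Plotting the roc list with false positive rate on the x-axis and true positive rate on the y-axis gives the ROC curve for class = 1.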

Q2. Now choose one classifier to further optimize.

1. Why did you choose this classifier over the other?
2. Explain how this classifier works from a theoretical point of view. Try to include as much detail as possible.
3. Try to optimize the classifier to achieve a higher accuracy (no matter how small) than first found. Remember that we have a particular focus on predicting those that develop heart disease.

a. Were there any features that could be removed? Please print the output that helped you make this decision.
b. Did changing the way data is sampled during training/testing affect the accuracy?
c. What about some of the internal parameters specific to the classifier? Please explain how one of these parameters can affect accuracy.

4. In general, a classifier is only as good as the data it is trained on. Please comment on what is needed from training data to train a good classifier. How can utilizing classifiers help feed back into healthcare settings with regards to data collection?
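
For the feature-removal question in item 3a, one simple kind of output that can support a decision is a ranking of features by the strength of their association with the class. The sketch below ranks three synthetic features by absolute Pearson correlation with a binary target. It is a hedged illustration: the feature names (informative, pure_noise, constant_id) are hypothetical and are not columns of heart_modified.csv, and correlation is only one of several possible screening criteria.

```python
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Synthetic binary target and three hypothetical features.
rng = random.Random(0)
target = [rng.randint(0, 1) for _ in range(200)]
features = {
    "informative": [t * 2.0 + rng.gauss(0, 1) for t in target],  # depends on the class
    "pure_noise": [rng.gauss(0, 1) for _ in target],             # random values
    "constant_id": [1.0] * len(target),                          # carries no information
}

# Strongest association first.
ranking = sorted(
    ((abs(pearson(col, target)), name) for name, col in features.items()),
    reverse=True,
)
for strength, name in ranking:
    print(f"{name:12s} {strength:.3f}")
```

Features near the bottom of such a ranking are candidates for removal, although a feature that looks weak on its own can still be useful in combination with others, so the effect on the classifier should be checked after each removal.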

Uploaded By : Charles
Posted on : November 14th, 2024