diff_months: 20

G 11374/11517 Exploratory Data Analysis & Visualisation Assignment

Download Solution Now
Added on: 2023-04-20 12:37:18
Order Code: clt267986
Question Task Id: 0
  • Subject Code :

    G-11374-11517

  • Country :

    Australia

Important information

1. This project is due by Sunday 7 May 2023, 9pm.

2. This project is based on the content from the entire semester and is worth 50% of your final grade. This project is marked out of 100 marks. A rough marking guide will be available separately on Canvas.

3. This project is to be done individually.

4. Your report can be written in Word, R Markdown (Rmd), or Sweave.

- If you use Word, please submit both your R file and your Word report.

- If you use Rmd or Sweave, you get 5 bonus marks! Please submit both your code file and your generated output file (i.e. your report).

5. The page upper-limit for your report is 12 pages of text, excluding plots. There is no page limit for your R code. Note the markers will only look at your R code if they need to understand what you have done.

6. Your report and R/code file must be submitted to Canvas, by Sunday 7 May 2023, 9pm.

7. There are different requirements for undergraduate and graduate students.

Please read the project details carefully to make sure you address the criteria relevant to you.

Task description:

This project aims to bring all the skills you have learnt over the semester together, and give you an opportunity to apply them to a dataset. Such a dataset can be chosen by yourself, which can be relevant to financial, insurance, medical or health areas etc. We provide the Ames Housing dataset (an American housing dataset) if you would prefer it. A brief description of the variables in the dataset is at the end of this file.

In this project, you will need to write a full report on your analysis of your chosen dataset, from the beginning of the data science methodology, where you will need to establish your problems of interest/exploration to the end of the Further Preprocessing stage. You will then train a simple linear model on the training dataset and predict values for the test dataset. Finally, you will evaluate your model using the metric RMSE against the test dataset and plot the residuals (similar to that shown in Week 10), and draw your final conclusions.

Your report is to obtain insights and aim to make impacts, and should have a structure which follows the data science methodology. The main purpose of this project is to conduct your analysis using EDA and visualisation.

For example, on the housing dataset (if chosen by you), when writing the report, put yourself into the shoes of a real estate analyst wanting to obtain insights from this dataset to predict house prices. The dataset already has a lot of reports written on it find them here. Be inspired by them for EDA, but do not focus too much on their modelling.

Download the datasets labelled train and test from Canvas. As you are a real estate analyst, your target variable is SalePrice. Note for most of the report, you will only use the train dataset. This includes preprocessing, EDA, and everything else up to and including the creation of a linear model.

The linear model will then be trained on the train dataset. You will then predict a set of SalePrice values based on the variable information in the test dataset. You can then compare your predicted values to the real values in the test dataset.

Therefore the test dataset is only needed for the Evaluation section of the report.

Report structure

Your report needs to include the following sections. In each section you will need to give a very brief explanation as to what the section is about, what the purpose of the section is and/or describe the key pieces of information in your general approach. For example, in the Data preprocessing section, you would explain what exactly data preprocessing is, why you need to clean the data, and describe the key ideas in your approach e.g. fill in missing values with median based of external controls.

0. Title and abstract:

On the first page, you should have:

- A suitable title for your report

- Your student ID

- An abstract/executive summary outlining your problems, analysis and findings.

You should conduct some background research into the Ames Housing dataset and:

- Give some information on the dataset.

- Gather and list points of domain expertise to help you make better decisions and shape your report (e.g. you should identify creating a variable similar to Week 7/9s SeasonSold would require you to know which seasons correspond to which months as the dataset is American)

- Seek to understand the variables here.

Problem identification and understanding is crucial in any data science project.

You should:

- Think about (after gaining domain expertise) a few questions of interest, which you will then translate into data science problems to solve within your report (if you get stuck look at a few examples from the Melbourne dataset slides with problems of interest).

- Provide a list of these data science problems. You will need to address and interpret your corresponding findings later on in the body of your report.

Note that examples of problems for you to find and solve can be:

- Identify which suburb/location had the biggest growth in SalePrice by plotting and examining the sale prices cross different suburbs;

- Analyse a possible pattern of SalePrice vs YrSold/MoSold, LotArea and/or some other variables which can reasonably be included;

- Use predictions from your final model to compare suburbs which have shown varying growth. Or, to identify which suburbs have been growing the most over the last few years.

UG students (unit 11374):

Generate and address at least five problems.

G students (unit 11517):

Generate and address at least seven problems, including the last problem listed above which uses predictions from your final model, e.g. find a way to compare the predictions (maybe median?) between suburbs (could be the top 5 suburbs) which have shown varying growth from your time series plots of growth over time.

2. Data preprocessing: In this section you should:

- Preprocess your code, treat missing values etc.

- Note at least one key observation, e.g. identified possible missing values or outliers for a particular area/suburb or year e.g. 2016 is significantly higher. Or perhaps one column is missing more than 50% of its values.

3. EDA:

In this section you should:

- Include tasks such as determining which variables are significant, which observations may be outliers etc., and other EDA goals.

- Find as much insight as possible to support your modelling decisions later on.

- Use data visualisation techniques taught in the unit to answer your chosen problems of interest.

4. Further preprocessing: In this section you should:

- Select the final variables for your model based off your EDA (basically remove the non-significant variables).

- Create any new variables which you think may help based on your EDA in this section.

- Justify your decisions and provide EDA evidence as to how a variable is insignificant (e.g. no observable relationship to target variable in scatter plot).

5. Modelling:

In this section you should:

- Fit and evaluate a linear model to describe the relationship between your target variable and a number of selected significant predictors.

- Use your model to predict the prices of properties described by your test dataset. Alternatively, you may use another, more advanced model of your choice. If you do use a linear model, remember its likings such as a normalised distribution in the target variable.

6. Evaluation:

You should:

- Evaluate your model against the metric RMSE given the actual values in the test dataset

- Plot the residuals similar to that shown in the Week 10 slides. Pick a suitable cut off value for the red dots.

The data science methodology is an iterative process. Try to minimise your RMSE, so always go back and think about what improvements can be made, then fit another model, and find your second RMSE, and so on, noting what works and what does not. Compare at least two different models you considered, noting their differences.

Variables in the Ames Housing dataset:

Below, please find a brief description of the variables within the dataset. For more detail, look inside the data_description.txt file.

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.

MSSubClass: The building class

MSZoning: The general zoning classification

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access

Alley: Type of alley access

LotShape: General shape of property

LandContour: Flatness of the property

Utilities: Type of utilities available

LotConfig: Lot configuration

LandSlope: Slope of property

Neighborhood: Physical locations within Ames city limits

Condition1: Proximity to main road or railroad

Condition2: Proximity to main road or railroad (if a second is present)

BldgType: Type of dwelling

HouseStyle: Style of dwelling

OverallQual: Overall material and finish quality

OverallCond: Overall condition rating

YearBuilt: Original construction date

YearRemodAdd: Remodel date

RoofStyle: Type of roof

RoofMatl: Roof material

Exterior1st: Exterior covering on house

Exterior2nd: Exterior covering on house (if more than one material)

MasVnrType: Masonry veneer type

MasVnrArea: Masonry veneer area in square feet

ExterQual: Exterior material quality

ExterCond: Present condition of the material on the exterior

Foundation: Type of foundation

BsmtQual: Height of the basement

BsmtCond: General condition of the basement

BsmtExposure: Walkout or garden level basement walls

BsmtFinType1: Quality of basement finished area

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Quality of second finished area (if present)

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

HeatingQC: Heating quality and condition

CentralAir: Central air conditioning

Electrical: Electrical system

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

Bedroom: Number of bedrooms above basement level

Kitchen: Number of kitchens

KitchenQual: Kitchen quality

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality rating

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

GarageType: Garage location

GarageYrBlt: Year garage was built

GarageFinish: Interior finish of the garage

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

GarageQual: Garage quality

GarageCond: Garage condition

PavedDrive: Paved driveway

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

PoolQC: Pool quality

Fence: Fence quality

MiscFeature: Miscellaneous feature not covered in other categories

MiscVal: $Value of miscellaneous feature

MoSold: Month Sold

YrSold: Year Sold

SaleType: Type of sale

SaleCondition: Condition of sale

  • Uploaded By : Katthy Wills
  • Posted on : April 20th, 2023
  • Downloads : 0
  • Views : 210

Download Solution Now

Can't find what you're looking for?

Whatsapp Tap to ChatGet instant assistance

Choose a Plan

Premium

80 USD
  • All in Gold, plus:
  • 30-minute live one-to-one session with an expert
    • Understanding Marking Rubric
    • Understanding task requirements
    • Structuring & Formatting
    • Referencing & Citing
Most
Popular

Gold

30 50 USD
  • Get the Full Used Solution
    (Solution is already submitted and 100% plagiarised.
    Can only be used for reference purposes)
Save 33%

Silver

20 USD
  • Journals
  • Peer-Reviewed Articles
  • Books
  • Various other Data Sources – ProQuest, Informit, Scopus, Academic Search Complete, EBSCO, Exerpta Medica Database, and more