BUS5PA Predictive Analytics: Building and Evaluating Predictive Models Assessment
Subject Code: BUS5PA
An online used car sales company in the USA is updating its car price assessment method and wants to apply a data-driven technique. The trial dataset consists of 25 variables describing 39984 car sales from 1994 to 2021. The management is very keen to apply predictive modelling to this task, and the trial dataset is to be used to build and evaluate predictive models to ascertain the feasibility of such an approach. The company has outsourced the task to you.
Part A (10%) Problem Formulation
The objective of this section (Part A) is to introduce students to the domain understanding and familiarisation phase that data analysts go through prior to the actual analytics. Since you may have to carry out analytics projects in different domains in the future, where you may not have sufficient domain knowledge, it is important to develop this skill.
1. Carry out an exploratory study to identify the background and relevant features of used cars that influence their value in the USA, and the methods used for price evaluation and assessment of used cars.
2. Identify the data sources that would contain information useful for the value assessment of used cars. What is the possible format of such information? Will you face any problems accessing this data?
3. What variables would be useful to build a predictive model to assess used cars? How do you identify such variables?
Part B (40%) Data Exploration and Cleaning
Use the provided dataset to answer this section. You are given access to 24 variables that are directly related to used car sales from the above-mentioned dataset. Most of these variables are similar to the type of information that an assessor will use to evaluate and assess the price of a used car (e.g. When was it made? What are the length and width of the car? What type of wheel system does it have? How many seats?).
You need to answer the following questions with evidence and justifications.
PART 1
a. Which variables are continuous/discrete? Which are ordinal? Which are nominal?
b. What are the methods for transforming categorical variables?
c. Carry out and demonstrate data transformation where necessary.
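As a rough illustration of the kind of transformation Part 1 refers to, the sketch below shows two common encodings in Python with pandas: one-hot encoding for a nominal variable and integer coding for an ordinal one. The column names (`wheel_system`, `condition`), their levels, and the toy values are hypothetical and are not taken from the provided dataset; other tools and other encodings may be equally appropriate.

```python
import pandas as pd

# Toy data with hypothetical column names and levels (not the real dataset).
df = pd.DataFrame({
    "wheel_system": ["FWD", "AWD", "RWD", "FWD"],          # nominal
    "condition": ["fair", "good", "excellent", "good"],    # ordinal
})

# One-hot encoding for a nominal variable: one 0/1 indicator column per level.
one_hot = pd.get_dummies(df["wheel_system"], prefix="wheel_system", dtype=int)

# Ordinal encoding: map levels to integers that respect an explicit order.
order = ["fair", "good", "excellent"]
df["condition_code"] = pd.Categorical(
    df["condition"], categories=order, ordered=True
).codes

print(one_hot)
print(df)
```

One-hot encoding avoids implying an order among levels, whereas integer codes are only meaningful when the levels genuinely have a rank.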
PART 2
a. Calculate the following summary statistics: mean, median, max, min and standard deviation for each of the continuous variables, and count for each categorical variable.
b. Is there any evidence of extreme values using the boxplot? Briefly discuss.
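The calculations in Part 2 could be sketched as follows in Python with pandas. The column names (`price`, `wheel_system`) and the five toy rows are assumptions made for illustration only. The 1.5 × IQR fences used here are the conventional whisker rule behind most boxplots, offered as one possible way to list the points a boxplot would draw as extreme.

```python
import pandas as pd

# Toy data with hypothetical column names (not the real dataset).
df = pd.DataFrame({
    "price": [5000, 7000, 8000, 9000, 60000],
    "wheel_system": ["FWD", "AWD", "FWD", "RWD", "FWD"],
})

# Summary statistics for a continuous variable.
summary = df["price"].agg(["mean", "median", "max", "min", "std"])

# Count of each level for a categorical variable.
counts = df["wheel_system"].value_counts()

# Boxplot-style fences: points beyond 1.5 * IQR from the quartiles.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["price"] < lower) | (df["price"] > upper)]

print(summary)
print(counts)
print(outliers)
```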
PART 3
Plot histograms for each of the continuous variables and create summary statistics. Based on the histogram and summary statistics, answer the following and provide brief explanations:
a. Which variables have the largest variability?
b. Which variables seem skewed?
c. Are there any values that seem extreme?
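One possible way to support the answers in Part 3 numerically is sketched below in Python. The columns `price` and `seats` and their values are hypothetical. The coefficient of variation (standard deviation divided by mean) is an assumption introduced here to compare variability across variables measured in different units; the brief itself does not prescribe it. `numpy.histogram` returns the bin counts that a plotted histogram would display.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names (not the real dataset).
df = pd.DataFrame({
    "price": [5000, 7000, 8000, 9000, 60000],
    "seats": [4, 5, 5, 5, 7],
})

# Per-variable spread and asymmetry.
stats = pd.DataFrame({
    "std": df.std(),
    "cv": df.std() / df.mean(),   # unit-free measure of variability
    "skew": df.skew(),            # > 0 suggests a long right tail
})

# Bin counts and bin edges behind a five-bin histogram of price.
hist_counts, edges = np.histogram(df["price"], bins=5)

print(stats)
print(hist_counts, edges)
```

A single far-right bin holding one observation, together with a large positive skew, is the numeric counterpart of the long tail one would look for in the plotted histogram.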
PART 4
a. Which, if any, of the variables have missing values?
b. What are the methods of handling missing values?
c. Apply the 3 methods of handling missing values and demonstrate the output (summary statistics and transformation plot) for each method listed in 4-b. (Hint: the objective is to identify the impact of each of the methods you mentioned in 4-b on the summary statistics output above.) Which method of handling missing values is most suitable for this data set? Discuss briefly, referring to the data set.
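The comparison asked for in Part 4 might look like the following Python sketch. The brief does not name the three methods, so the choice here (dropping rows, mean imputation, median imputation) is an assumption, as are the `price` column name and the toy values.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value; the name and values are hypothetical.
s = pd.Series([5000, 7000, np.nan, 9000, 60000], name="price")

# Number of missing values in the variable.
n_missing = s.isna().sum()

# Three candidate treatments (an assumed selection, not prescribed by the brief).
methods = {
    "drop": s.dropna(),
    "mean": s.fillna(s.mean()),
    "median": s.fillna(s.median()),
}

# Summary statistics side by side, one column per treatment.
comparison = pd.DataFrame(
    {name: v.agg(["count", "mean", "median", "std"]) for name, v in methods.items()}
)

print(n_missing)
print(comparison)
```

On skewed data such as this toy example, mean imputation inserts a value pulled towards the extreme observation, while median imputation lowers the mean of the completed series; differences of this kind are what the side-by-side table is meant to expose.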