BDA602 Data Mining Techniques for Predictive Modelling: A Case Study Approach
- Subject Code: BDA602
- University: University of Lincoln
- Country: Australia
Assessment 2- (Case Study Report and Presentation)
Introduction
Data analysis is now practised around the world, and most businesses use it primarily to improve their bottom line (Cuenca et al., 2021). This report uses three data sets to examine classification, regression, and clustering techniques, with a separate data set for each technique so that the general approach to data analysis can be examined. The absenteeism data set is used for classification, the gender-by-name data set for clustering, and the air quality data set for regression.
Overview of a data set
The absenteeism at work data set was obtained from the UCI repository (Archive, 2022). It describes employees who were absent from work through a wide range of attributes. The data set contains 21 variables and is also supplied as an ARFF file.
The air quality data set contains the responses of gas multisensor devices placed in the field in Italian cities. It records the devices' responses as hourly averages.
The gender-by-name data set lists names together with the gender associated with each name. It draws on data from the UK, Canada, and Australia.
Data preprocessing
Data cleaning
Data cleaning removes unnecessary data from a data set (Xu et al., 2015). The absenteeism file needed no cleaning. In the air quality file, the time and hour columns caused problems when the data set was imported into Weka, so these columns, which were not relevant to the analysis, were deleted. Removing them allowed the results to be calculated correctly. In the gender names data set, the name D'Arcy was incorrect, so it was changed to Darcy.
Data transformation
Data transformation here means converting numeric attributes into nominal ones. All of the numeric attributes in the absenteeism data set were converted to nominal attributes so that the support vector machine's results are calculated accurately (Lee, 2020).
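The conversion described above can be sketched in a few lines. This is a minimal illustration, not the Weka filter used in the study: the function name `numeric_to_nominal` and the sample values are hypothetical, and the sketch assumes that every distinct numeric value simply becomes its own nominal label.

```python
def numeric_to_nominal(values):
    """Turn a numeric column into nominal labels, one label per distinct value."""
    distinct = sorted(set(values))              # numeric order of the distinct values
    mapping = {v: str(v) for v in distinct}     # each value becomes a string label
    nominal = [mapping[v] for v in values]      # relabelled column
    categories = [mapping[v] for v in distinct] # the nominal attribute's value set
    return nominal, categories


# Hypothetical absenteeism values, for illustration only.
hours = [8, 2, 8, 4, 120, 2]
nominal, categories = numeric_to_nominal(hours)
print(nominal)
print(categories)
```

After this step the values are treated as unordered categories, so a label such as "120" is no longer "larger" than "8" as far as the learning algorithm is concerned.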
Data reduction
For a clustering approach to work efficiently, it is often important to reduce the number of dimensions in the data set. This is the purpose of data reduction. It is especially helpful for large categorical data sets, which often contain a great many features and can be computationally difficult to manage. Common approaches in feature-based clustering include feature selection and feature extraction. Feature selection is the process of choosing from the data set the features that will be most useful for the clustering analysis; methods such as the chi-squared test, correlation analysis, and mutual information can be used to identify the most significant ones. Feature extraction is the process of reducing a large number of features to a smaller, more manageable set while preserving the most important information (Jacques, 2014). By reducing the dimensionality of the data set, these strategies allow the clustering algorithm to run more efficiently and to produce results that are easier to interpret. Because some information is lost when the data are reduced, it is vital to strike a balance between reducing dimensionality and retaining information.
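As an illustration of feature selection with the chi-squared test, the sketch below scores two categorical features against a target and ranks them. It is only a sketch under stated assumptions: the toy data, the feature names, and the helper `chi_squared` are hypothetical and are not taken from the data sets used in this report.

```python
from collections import Counter


def chi_squared(feature, target):
    """Pearson chi-squared statistic between two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))   # joint counts
    feature_totals = Counter(feature)          # marginal counts of the feature
    target_totals = Counter(target)            # marginal counts of the target
    stat = 0.0
    for f, f_count in feature_totals.items():
        for t, t_count in target_totals.items():
            expected = f_count * t_count / n   # count expected under independence
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat


# Hypothetical toy data: one feature tracks the target, the other barely does.
target = ["M", "M", "F", "F", "M", "F"]
features = {
    "informative": ["a", "a", "b", "b", "a", "b"],
    "noise": ["x", "y", "x", "y", "x", "y"],
}

scores = {name: chi_squared(column, target) for name, column in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first
print(ranked, scores)
```

A higher statistic suggests a stronger association with the target, so the top-ranked features are the natural candidates to keep.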
Selection and justification of appropriate data mining tasks and methods
Classification data mining task
Classification is a data mining task that uses a model to determine, from a set of features, which class or category a given data point belongs to. Its purpose is to predict accurately the class of new, unseen data points using patterns or relationships learned from a training data set. Examples of classification algorithms include the following:
- Decision trees: A common classification algorithm that uses a tree-like structure to describe a set of alternatives and the probable outcomes associated with each one.
- Random forests: An ensemble of decision trees, used to improve on the predictive accuracy and consistency of a single decision tree.
- Support Vector Machines (SVMs): Linear classifiers that separate the input into categories using a boundary in the form of a hyperplane.
- Naive Bayes: A straightforward probabilistic classifier that uses Bayes' Theorem to determine how likely it is that a particular data point belongs to a particular category.
- Neural networks: One of the most powerful approaches to machine learning, structured as layered collections of interconnected neurons or nodes. Classification is one of their many applications.
Classifiers can be evaluated using several criteria, including accuracy, precision, recall, and F1 score. These measures show how effectively the model predicts the class of new data. Whatever the algorithm, classification assigns a data item to a category based on a set of input features (Kesavaraj, 2013, July). Besides decision trees, random forests, support vector machines, naive Bayes, and neural networks, the available techniques include k-nearest neighbors, to name just a few. Evaluating the model's performance with reliable measures is essential for generating accurate predictions.
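The evaluation measures named above follow directly from the counts of true and false positives and negatives. The sketch below shows the standard formulas for a two-class problem; the function name `binary_metrics` and the counts passed to it are hypothetical and do not come from the models in this report.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from two-class confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # share of predicted positives that are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # share of actual positives that are found
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

For a multi-class problem such as absenteeism, the same formulas are usually applied per class and then averaged.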
Modeling
Selection of modeling techniques
Support vector machine
The classification technique applied in Weka is the support vector machine. A support vector machine finds hyperplanes that separate the data into classes. It was chosen as the absenteeism classification technique because these separating hyperplanes can offer helpful information about the groups underlying the actual pattern of absenteeism (Noble, 2006).
K-means algorithm
The k-means algorithm is used for clustering. It groups individuals by name, and each cluster indicates whether a person is male or female based on the name. The analysis shows that cluster 0 is male-dominated while cluster 1 is female-dominated, and that cluster 1 contains more of the data than cluster 0.
Regression analysis
Regression analysis is a predictive technique that models the relationship between a numeric target and one or more predictor variables; it can also be used descriptively. Here, regression analysis is carried out on the air quality data (Sarstedt & Mooi, 2014).
Build models
Support vector machine
The accompanying figure shows the model built by the support vector machine. The support vector machine provides the hyperplane used to separate the data set into classes. The model is defined by the values obtained for each attribute, each of which is either positive or negative, and together these positive and negative values determine the hyperplane that divides the data into groups. All of the data are normalised to a common scale so that the variables are easier to study when plotted in three dimensions.
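The role of the positive and negative values can be illustrated with the decision rule of a linear support vector machine, which classifies a point by the side of the hyperplane it falls on. This is a minimal sketch: the weights, bias, and input points are hypothetical and are not the values learned from the absenteeism data.

```python
def decision_value(weights, bias, x):
    """Signed score w.x + b: positive on one side of the hyperplane, negative on the other."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def predict(weights, bias, x):
    """Assign class +1 or -1 according to the sign of the decision value."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1


# Hypothetical weights for two normalised attributes.
weights = [0.8, -0.5]   # a positive weight pushes towards +1, a negative one towards -1
bias = -0.1

first = predict(weights, bias, [0.9, 0.2])
second = predict(weights, bias, [0.1, 0.9])
print(first, second)
```

With more than two classes, several such two-class decisions are typically combined, one for each pair of classes.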
K-means algorithm
The figure above shows the k-means clustering model (Kumar et al., 2013). It presents the gender labels, male and female, associated with the names. Gender is determined from each individual's name, and the k-means algorithm has produced two distinct clusters on that basis. According to the figure, the majority of persons in the data set are male, because both clusters are dominated by males.
Regression analysis
The figure above shows the model built for the air quality data. It presents air quality in terms of several parameters, including atmospheric pollutants such as carbon monoxide, benzene (C6H6), and nitrogen oxides. Together, these determine the overall quality of the air in the surrounding area. Calculating these values makes it possible to predict future air quality.
Output parameter settings
Support vector machine
The figure shows the parameter settings used for the support vector machine on the absenteeism data set. Building calibration models is set to false, and the batch size is 100. The complexity constant C is 1, and logistic regression is selected as the calibrator. The time-consuming checks are turned off to shorten processing, and the output is much the same whether they are on or off. Because all inputs to the support vector machine should be on a normalised scale, the filter type is set to normalise the training data. The number of decimal places is 2. Because the data set has a large number of attributes, the polynomial kernel (PolyKernel) is used as the support vector machine's kernel. The random seed is 1.
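Normalising the training data can be pictured as rescaling each attribute to a common range. The sketch below assumes min-max scaling to the interval [0, 1]; the function name `min_max_normalise` and the sample column are hypothetical, and the exact behaviour of the filter used in Weka should be checked against its documentation.

```python
def min_max_normalise(column):
    """Rescale a numeric column linearly so its minimum maps to 0 and its maximum to 1."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        # A constant column carries no information; map every value to 0.
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]


# Hypothetical attribute values, for illustration only.
ages = [27, 33, 40, 58]
scaled = min_max_normalise(ages)
print(scaled)
```

Scaling matters for a support vector machine because attributes with large numeric ranges would otherwise dominate the position of the hyperplane.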
K-means algorithm
The image above shows the parameter settings for the k-means clustering analysis. At most 100 canopies are held in memory, the minimum canopy density is 2, and the periodic pruning rate for canopies is 10,000. The canopy distance parameters T1 and T2 are left at their default values of -1.25 and -1, respectively, and the number of clusters is set to two. Debug output and the display of standard deviations are both set to false, as neither is needed for this analysis. The Euclidean distance is used as the distance function. Missing-value handling has no effect on the results because the data set contains no missing values. Fast distance calculation is not used because of the inaccuracy it can introduce (Hosseini et al., 2014). The initial centroids are chosen at random. K-means has a maximum of 500 iterations; the algorithm stops when it reaches this limit even if it has not converged.
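The settings above (two clusters, Euclidean distance, random initial centroids, at most 500 iterations) correspond to the basic k-means loop sketched below. This is an illustration rather than the Weka implementation: it assumes numeric points, whereas the names data set is nominal, and the function `k_means`, the seed value, and the sample points are all hypothetical.

```python
import math
import random


def k_means(points, k=2, max_iterations=500, seed=1):
    """Basic k-means with Euclidean distance and randomly chosen initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initialisation from the data
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:     # nothing moved, so the run has converged
            break
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                        # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical two-dimensional points, for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, assignments = k_means(points)
print(centroids, assignments)
```

Because the starting centroids are random, different seeds can give different cluster numbering and, on harder data, different final clusters.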
Regression analysis
The figure above shows the parameter settings for the regression analysis of the air quality data. Attribute selection is carried out with the M5 method. The attributes it selects here include numeric and discretised data; numeric data that are to be discretised are first converted to nominal form. The batch size is 100. Collinear attributes are eliminated (Dormann et al., 2013). The minimal option is left disabled, the option to output additional statistics is set to false, and results are reported to a maximum of 4 decimal places. All of the results are calculated using these parameter settings.
Output model description
Support vector machine
The figure above shows the output model description for the support vector machine. It reports the numbers of correctly and incorrectly classified instances. It also gives the kappa statistic, which is only 0.34 (34%). This means that, on the absenteeism data set, the agreement between the classes predicted by the support vector machine and the actual classes is only about 34% of the way from chance-level agreement to perfect agreement. Along with the total number of instances, the summary also reports the mean absolute error, root mean squared error, relative absolute error, and root relative squared error.
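The kappa statistic compares the observed agreement between predicted and actual classes with the agreement expected by chance. The sketch below computes Cohen's kappa from a confusion matrix; the function name `cohens_kappa` and the matrix are hypothetical and are not the figures produced by the model in this report.

```python
def cohens_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows: actual, columns: predicted)."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    column_totals = [sum(column) for column in zip(*matrix)]
    # Chance agreement: what the marginal totals alone would produce.
    expected = sum(r * c for r, c in zip(row_totals, column_totals)) / total ** 2
    # Assumes expected < 1, i.e. the data are not all in a single class.
    return (observed - expected) / (1 - expected)


# Hypothetical two-class confusion matrix, for illustration only.
confusion = [
    [20, 5],
    [10, 15],
]
kappa = cohens_kappa(confusion)
print(round(kappa, 2))
```

A value of 0 means agreement no better than chance and 1 means perfect agreement, so a value near 0.34 indicates only modest agreement.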
K-means algorithm
The image above shows the final cluster centroids for the gender analysis based on names. There are two clusters, cluster 0 and cluster 1, and their centroid values differ. The centroid of cluster 1 has the name Darcy, the gender female, a count of more than 3800, and a probability of 0. The centroid of cluster 0 has the name James, the gender male, a count of more than 5300, and a probability of 0. The centroid of cluster 1 is comparable to the centroid of the entire data set. It can therefore be concluded that the data set is weighted towards female names.
Regression analysis
The figure above shows the linear regression model for the air quality data, with a coefficient for each variable. Some of the coefficients are negative, which means that as the value of that variable rises, the predicted air quality value falls. Positive coefficients work the other way: the predicted value rises as the variable increases.
Evaluate the models
Support vector machine
The figure above shows the detailed accuracy by class, including the true positive rate, false positive rate, precision, recall, F-measure, MCC, ROC area, and PRC area. The class variable takes many different values, and the weighted average across the classes is also displayed. The class values are numeric and record the amount of absenteeism. The table shows that the maximum absenteeism value is 120.
The confusion matrix in the image above shows how the instances are distributed across the absenteeism classes. Most instances fall in the class with an absenteeism value of 8, which indicates that 8 is the most typical length of absence.
K-means algorithm
The image above shows the clustered instances for the gender analysis based on names. The second cluster contains the majority of the data, which shows that most of the names are female.
Regression analysis
The figure above shows the evaluation of the regression analysis for air quality. The linear regression has a correlation coefficient of 0.99. Because this value is so close to 1, the values predicted by the model are strongly correlated with the observed air quality values. With a beta value of 1, the air quality changes whenever the air quality index changes.
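The correlation coefficient and the slope of a regression line can both be computed from the same sums of squares. The sketch below does this for a single predictor; the helper names `pearson_r` and `fit_line` and the sample readings are hypothetical, and the study itself used a multiple linear regression in Weka rather than this one-variable form.

```python
import math


def _centred_sums(xs, ys):
    """Return the means, covariance sum and variance sums for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, cov, var_x, var_y


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    _, _, cov, var_x, var_y = _centred_sums(xs, ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y, cov, var_x, _ = _centred_sums(xs, ys)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical sensor readings (x) and pollutant values (y), for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
r = pearson_r(xs, ys)
print(slope, intercept, r)
```

A positive slope means the predicted value rises with the predictor, and a correlation close to 1 means the points lie close to the fitted line.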
Conclusion
In conclusion, the three data sets were analysed using the classification, clustering, and regression techniques required by the assessment. The absenteeism data set was used for classification with a support vector machine, the air quality data set was used for regression with linear regression analysis, and the gender names data set was used for clustering with k-means.