Data Analysis and Visualization
(a)
Categorical feature Category N %
1 AlertCategory Alert 57 7.1%
2 AlertCategory Informational 53 6.6%
3 AlertCategory Warning 690 86.2%
4 AlertCategory Missing 0 0%
5 NetworkEventType NormalOperation 61 7.6%
6 NetworkEventType PolicyViolation 476 59.5%
7 NetworkEventType Policy_Violation 232 29.0%
8 NetworkEventType ThreatDetected 31 3.9%
9 NetworkEventType Missing 0 0%
10 NetworkInteractionType Anomalous 383 47.9%
11 NetworkInteractionType Critical 23 2.9%
12 NetworkInteractionType Elevated 27 3.4%
13 NetworkInteractionType Suspicious 366 45.8%
14 NetworkInteractionType Unknown 1 0.1%
15 NetworkInteractionType Missing 0 0%
Continuous Feature N (%) missing Mean Median Min Max Skewness
1 DataTransferVolume_IN 0(0%) 136262728.5 132288945.0 49797961.0 235929921.0 0.6
2 DataTransferVolume_OUT 0(0%) 120441648.1 114859654.0 46475929.0 237804068.0 0.6
3 TransactionsPerSession 0(0%) 29397.0 28487.0 13008.0 52001.0 0.7
4 NetworkAccessFrequency 0(0%) 31815.6 33151.5 -1.0 53930.0 -1.6
5 UserActivityLevel 0(0%) 5.4 5.0 1.0 9.0 0.0
6 SystemAccessRate 58(7.2%) 5.2 5.0 2.0 8.0 0.0
7 SecurityRiskLevel 0(0%) 152628275.1 152149408.0 57932792.0 275423200.0 0.0
8 ResponseTime 0(0%) 7902.2 30.1 7.1 99999.0 3.1
(b) After inspection, there do not appear to be any erroneous categories in the categorical variables. No category is marked as "Missing" with a non-zero count, and each categorical feature contains distinct categories with corresponding counts and percentages. As such, the categorical data appears to be accurately recorded and free of erroneous entries.
For DataTransferVolume_IN and DataTransferVolume_OUT, the maximum values are substantially higher than the mean and median, so these extreme values may be outliers. The maximum of TransactionsPerSession is also noticeably greater than the rest of the distribution, which can indicate the presence of anomalies. NetworkAccessFrequency has a negative minimum (-1), which is impossible for a frequency and could point to a data-entry error or invalid value. Finally, the maximum ResponseTime (99999.0) is far above the other values and, together with the strong positive skew (3.1), suggests the presence of outliers.
(c)
- Prior to performing Principal Component Analysis (PCA), the data must be scaled (standardized). PCA is sensitive to the scale of the variables, so features measured on larger scales (e.g. DataTransferVolume_IN, in the hundreds of millions) would have a disproportionate impact on the principal components compared with features such as UserActivityLevel (single digits). Scaling ensures every variable contributes equally to the analysis, avoiding bias towards large-scale variables. It also improves interpretability: with all variables on a common scale, their loadings can be compared directly. In short, scaling guarantees a fair and accurate PCA representation of the underlying structure of the data.
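The standardization described above is simply z-scoring each feature. The report's analysis was done in R (e.g. via scale() or prcomp(..., scale. = TRUE)); the sketch below is a minimal pure-Python illustration using two features on the very different scales seen in the summary table (the sample values are illustrative, not the full columns):

```python
import math

def standardise(values):
    """z-score: centre to mean 0 and scale to (population) standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

# Two features on wildly different scales, as in the summary table above
# (DataTransferVolume_IN is of order 1e8, UserActivityLevel is single digits):
volume = [136262728.5, 132288945.0, 49797961.0, 235929921.0]
activity = [5.4, 5.0, 1.0, 9.0]
scaled = [standardise(volume), standardise(activity)]
```

After this step both features have mean 0 and standard deviation 1, so neither can dominate the principal components by scale alone.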
-
Component Individual Variance Cumulative Variance
PC1 0.321 0.321
PC2 0.236 0.557
PC3 0.171 0.728
- The first two principal components together account for 55.7% of the cumulative variance, which already exceeds the 50% threshold, so there is no need to add further components. The third principal component contributes only a further 17.1%, taking the cumulative total to 72.8%. As a result, just two principal components (PCs) are required to account for a minimum of 50% of the data variability.
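The component count can be read off mechanically from the cumulative proportions. A small check (PC3's individual proportion here is inferred from the cumulative column of the table):

```python
from itertools import accumulate

# Individual proportions of variance explained (PC1 and PC2 from the table;
# PC3 inferred from the cumulative column):
individual = [0.321, 0.236, 0.171]
cumulative = list(accumulate(individual))
# Smallest number of components whose cumulative proportion reaches 50%:
needed = next(i + 1 for i, total in enumerate(cumulative) if total >= 0.50)
```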
-
Feature PC1 PC2 PC3 Key Drivers for PC
DataTransferVolume_IN 0.342 0.432 0.179 PC2
DataTransferVolume_OUT 0.398 -0.362 0.228 PC1
TransactionsPerSession 0.213 0.235 -0.344 PC3
NetworkAccessFrequency 0.288 -0.303 0.412 PC3
UserActivityLevel 0.367 -0.274 -0.378 PC3
SystemAccessRate 0.336 0.416 -0.262 PC2
SecurityRiskLevel 0.282 -0.242 -0.433 PC3
ResponseTime 0.393 -0.334 0.328 PC1
(d)
The PCA biplots provide a graphical depiction of feature contributions and the distribution of the data. The PCA score plot gives a reduced-dimension overview of the data points, grouped by event classification (Malicious vs. Normal). The loadings plot, in turn, highlights the features that drive the variance of each principal component. By combining the two plots, we can pinpoint features that have a large influence on event classification, such as AlertCategory and NetworkEventType. These features are important in differentiating Normal from Malicious events through their contributions to the principal components.
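The "Key Drivers for PC" column in the loadings table follows a simple rule: each feature is attributed to the component on which its absolute loading is largest. A quick cross-check of the tabulated values (pure Python, loadings copied from the table above):

```python
# Loadings copied from the table; the "key driver" of each feature is the
# component with the largest absolute loading.
loadings = {
    "DataTransferVolume_IN":  (0.342, 0.432, 0.179),
    "DataTransferVolume_OUT": (0.398, -0.362, 0.228),
    "TransactionsPerSession": (0.213, 0.235, -0.344),
    "NetworkAccessFrequency": (0.288, -0.303, 0.412),
    "UserActivityLevel":      (0.367, -0.274, -0.378),
    "SystemAccessRate":       (0.336, 0.416, -0.262),
    "SecurityRiskLevel":      (0.282, -0.242, -0.433),
    "ResponseTime":           (0.393, -0.334, 0.328),
}
key_driver = {
    feature: "PC{}".format(max(range(3), key=lambda i: abs(pcs[i])) + 1)
    for feature, pcs in loadings.items()
}
```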
(e) In the PCA plot, the Normal events are clustered more towards the left along PC1, where they partially overlap with the Malicious events. Based on the findings from parts (iii) to (iv), PC2 appears to be the dimension more useful for identifying malicious occurrences. When all points are projected onto the PC1 axis, there is some, but not much, separation between Normal and Malicious events. By contrast, when projected onto the PC2 axis, the division between Normal and Malicious events is more pronounced. This clearer separation on PC2 suggests that it is the more useful dimension for differentiating between the two types of events.
Criterion (contribution to assignment mark)

Correct implementation of descriptive analysis, data cleaning and PCA in R (11/20%)
- Working code
- Masking of invalid values/outliers is done correctly
- External sources referenced in APA 7 referencing style (if applicable)
- Good documentation/commentary
Note: At least 80% of the code (excluding that provided to you above) must align with unit content; otherwise a mark of zero will be awarded for this component.

Correct identification of missing and/or invalid observations in the data, with justifications/explanations (4/10%)

Accurate specification and interpretation of the contribution of principal components and their loading coefficients (9/15%)
- Explain why you have or have not scaled the observations when running PCA.
- Outline the individual and cumulative proportion of variance explained, and comment on the number of components required to explain at least 50% of the variance.
- Outline the loadings (to the specified decimal place) and comment on their contribution to the respective PC.
- Tabulation of results, no screenshots.

Accurate biplot, with appropriate interpretation presented (3/25%)
- 2-d with clear labels.
- Interpretation of each biplot.
- PCA plot: clustering? separation?
- Loadings plot: vectors (features) and their relation to each dimension, as well as to each other.
- PCA + loadings plot: do any of the features appear able to assist with the classification of Normal and Malicious events, and how?

Appropriate selection of dimension for the identification of Malicious events, with justification (3/10%)
- Choose a dimension, i.e. PC1 or PC2, and justify why it is the best for classifying Malicious events.

Presentation and communication skills (8/20%)
- Tables (no screenshots) and figures are well presented, appropriately captioned, and referenced in text.
- Report, analysis and overall narrative is well articulated and communicated.
- Minimum font size of 11.
- All figures and tables should be labelled/captioned appropriately and referenced in text.
- The labels in the plots should be clear.
- Solutions should be in the order that the questions were posed in the assignment.
- Spelling and grammatical errors should be kept to a minimum.
- Overall narrative: all interpretation should be in the context of the study.

Total: 38/100%
Part 1: Data Preparation and Cleaning
The integrity and reliability of machine-learning models depend significantly on the quality of the input data. Therefore, thorough data preparation and cleaning are indispensable steps in the analysis of the ******* dataset, aimed at detecting ***************.
Data Cleaning Steps
Invalid and empty values were addressed to maintain data accuracy. Records with ************************ were removed because *****************. Entries with ************* were considered invalid and thus excluded. These corrections are essential for *******************************, especially when ********************************************************************************. The dataset was filtered to ************************************************************************. This binary classification is central to the supervised learning approach, which focuses on the analysis of the crucial task of incident detection.
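The specific cleaning rules above are redacted, but masking invalid observations is, in general, a row filter over validity predicates. The sketch below is illustrative only (hypothetical records and field names), using the impossible -1 minimum of NetworkAccessFrequency flagged in the earlier descriptive table as an example of an invalid value:

```python
# Hypothetical records and field names -- the actual cleaning rules are redacted.
records = [
    {"NetworkAccessFrequency": 33151, "ResponseTime": 30.1},
    {"NetworkAccessFrequency": -1,    "ResponseTime": 12.4},
    {"NetworkAccessFrequency": 28487, "ResponseTime": 25.0},
]

# A frequency cannot be negative (cf. the -1 minimum seen earlier), so such
# rows are masked out as invalid:
clean = [row for row in records if row["NetworkAccessFrequency"] >= 0]
```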
Category Simplification
The dataset was streamlined by **********************************************************************************************. Specifically: *************************************************** were consolidated ***********************, whereas *******************************************************************************************************************************************************. This step aims to reduce **********************, aiding the learning process of the model *********************************************************. *****************************************************************************************. This simplification potentially improves the model efficiency by ***************************************************************************************************************************.
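The consolidation itself is redacted, but category simplification is typically a value-remapping step. The sketch below is illustrative only, using the near-duplicate "PolicyViolation"/"Policy_Violation" levels of NetworkEventType seen in the earlier frequency table as an example of the kind of categories such a mapping merges:

```python
# Illustrative consolidation map -- the report's actual rules are redacted.
merge_map = {"Policy_Violation": "PolicyViolation"}

events = ["PolicyViolation", "Policy_Violation", "ThreatDetected", "NormalOperation"]
merged = [merge_map.get(e, e) for e in events]
```

Collapsing spelling variants like this reduces the number of levels the model must learn without discarding any events.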
Part 2: Model Training and Hyperparameter Tuning
Hyperparameter Tuning/Search Strategy for Logistic Ridge Regression
Logistic Ridge Regression models, applied to both balanced and unbalanced datasets, underwent a rigorous process of hyperparameter tuning to ascertain the optimal configuration for detecting malicious incidents. This endeavor was crucial for **************************************************************************************. For the balanced dataset, the tuning focused on ******************************************************************************************************************************. The optimal ************* value is ******************. This precise calibration of ************ significantly bolstered the model's accuracy, achieving a notable accuracy rate of ************** and a kappa statistic of *********, indicating the robustness of the model in differentiating between **************************** events. In contrast, the unbalanced dataset underwent an exhaustive hyperparameter tuning process, revealing an optimal ******************. This fine-tuning resulted in an even higher accuracy of ************* and a kappa statistic of *************, showing the model's ***************************** to the presence of malicious activities.
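The lambda grid and optimal values above are redacted, and the report's models were fitted in R. The sketch below is a minimal, self-contained Python stand-in showing the shape of the tuning loop: fit an L2-penalised (ridge) logistic regression by gradient descent for each candidate lambda and keep the one with the best hold-out accuracy. The synthetic 1-D data, the lambda grid, and the learning-rate/epoch settings are all illustrative assumptions, not the report's configuration:

```python
import math
import random

random.seed(0)

# Synthetic 1-D data: class 0 ("Normal") around 0, class 1 ("Malicious") around 2.
xs = [random.gauss(0.0, 1.0) for _ in range(100)] + [random.gauss(2.0, 1.0) for _ in range(100)]
ys = [0] * 100 + [1] * 100

def fit_ridge_logistic(x, y, lam, epochs=200, lr=0.1):
    """Gradient descent on the L2-penalised logistic loss (intercept unpenalised)."""
    w = b = 0.0
    n = len(x)
    for _ in range(epochs):
        gw = gb = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
            gw += (p - yi) * xi
            gb += p - yi
        w -= lr * (gw / n + lam * w)   # ridge penalty shrinks w towards 0
        b -= lr * gb / n
    return w, b

def accuracy(w, b, x, y):
    return sum((1 if w * xi + b > 0 else 0) == yi for xi, yi in zip(x, y)) / len(x)

# Hold-out tuning over a small candidate grid (the report's grid is redacted):
train_x, train_y = xs[::2], ys[::2]
test_x, test_y = xs[1::2], ys[1::2]
results = {lam: accuracy(*fit_ridge_logistic(train_x, train_y, lam), test_x, test_y)
           for lam in (0.001, 0.01, 0.1, 1.0)}
best_lam = max(results, key=results.get)
```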
Prediction Results from Balanced Training Model
The model demonstrated a high capability in identifying *************, correctly classifying ****** of such cases. However, it shows a vulnerability in detecting **********, mislabelling ***** of them as ********. The false positive rate was ********, which indicates that a relatively small number of *************** were incorrectly identified as ************. A false negative rate of ****** points to a small proportion of *************************. Notably, the precision of the model was **************, reflecting strong accuracy in predicting ****************. With a recall of ******, most malicious activities were successfully **********, and an F-score of ***** indicated a well balanced model. A balanced accuracy rate of ********* underscores the overall efficacy of the model.
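The individual rates reported above are redacted, but each is a fixed function of the confusion-matrix counts. The sketch below computes them from a hypothetical matrix (the counts are invented for illustration, not the report's actual figures):

```python
# Hypothetical confusion-matrix counts (the report's actual figures are redacted):
tp, fn = 90, 10   # Malicious events classified correctly / missed
tn, fp = 80, 20   # Normal events classified correctly / flagged in error

recall = tp / (tp + fn)                         # a.k.a. sensitivity
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)
f_score = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2
```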
Prediction Results from Unbalanced Training Model
For the unbalanced dataset, the model classified ******************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************.
Hyperparameter Tuning/Search Strategy for Random Forest Models
Tuning Methodology
A structured exploration of hyperparameters such as mtry, splitrule, and min.node.size was performed. For the balanced dataset, the optimal performance was obtained with mtry=3, using the Gini split rule, and setting the min.node.size to 5. In contrast, the unbalanced dataset showed optimal results with mtry=12, the same split rule, and node size, thus enhancing the model's detection capabilities for malicious incidents (Reference A; Reference B).
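A structured search of this kind enumerates every combination of the candidate hyperparameter values, so the grid can be pictured as a Cartesian product. In the sketch below only mtry = 3 / 12, the "gini" split rule, and min.node.size = 5 come from the reported optima; the other candidate values are illustrative assumptions, as the actual grid is not stated:

```python
from itertools import product

# Candidate values per hyperparameter (illustrative except where noted):
mtry = [3, 6, 12]               # features sampled at each split
splitrule = ["gini", "extratrees"]
min_node_size = [1, 5, 10]

grid = list(product(mtry, splitrule, min_node_size))  # 3 * 2 * 3 = 18 candidates

best_balanced = (3, "gini", 5)     # reported optimum on the balanced data
best_unbalanced = (12, "gini", 5)  # reported optimum on the unbalanced data
```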
Performance Evaluation
The balanced dataset model achieved ******** accuracy, demonstrating ***********************. The unbalanced model surpassed this, achieving ******** with exceptional ************** and specificity ***************, underscoring its robustness in *************************.
Prediction Results from Balanced Training Model
The Random Forest model trained on the balanced dataset demonstrated ******************************************. It successfully identified *************************************************************************************************************************. The sensitivity of the model was ***************************************************************************************************. Precision is ************************************************************************. Coupled with *********************************************, the model reliably captured ********************************. The F-Score of *******************************************. The balanced accuracy rate of ***************** underscores the overall effectiveness of the model in correctly classifying both classes of events.
Prediction Results from Unbalanced Training Model
For the unbalanced dataset model, the Random Forest model shows **************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
Recommended Model and Conclusion
The chosen model for incident detection was the *************** model trained on the balanced dataset, primarily for its high ********, ************, and low ***********. Its F-score of ************ indicates a superior balance between recall and precision. Despite the comparative complexity of ***************, its performance and generalizability make it a pragmatic choice *******************, particularly in scenarios where missing a malicious event is highly detrimental. The trade-off in ***************** is deemed acceptable given the significant gain in predictive accuracy.
References
**********************************************************************************
**********************************************************************************