
Assessment 2: Report on the analysis and modelling of a dataset (Instructions and report submission point)


Description: Test and evaluate two machine learning algorithms to determine which supervised learning model is best for the case study provided.

1) General data preparation and cleaning

a) Import the dataset from the case study into R Studio. This version is the same as in Assessment 1.

(Please refer to the accompanying document, which contains the case study details.)
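For example, a minimal import sketch, assuming the raw file MLData2023.csv (see the data description at the end of this brief) is in your working directory:

MLData2023 <- read.csv("MLData2023.csv", stringsAsFactors = TRUE)
str(MLData2023) # check variable types and factor levels before cleaning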

b) Write the appropriate code in R Studio to prepare and clean the dataset as follows:

(i) Clean the whole dataset based on what you suggested / the feedback from Assessment 1.

(ii) Filter the data to only include cases labelled with Class = 0 or 1.

(iii) For the feature Operating.System, merge the three Windows categories together to form a new category, say Windows_All. Furthermore, merge iOS, Linux (Unknown), and Other to form the new category named Others. Hint: use the forcats::fct_collapse(.) function.

(iv) Similarly, for the feature Connection.State, merge INVALID, NEW and RELATED to form the new category named Others.

(v) Select only the complete cases using the na.omit(.) function, and name the dataset MLData2023_cleaned.

Briefly outline the preparation and cleaning process in your report and explain why you believe the above steps were necessary.
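A minimal sketch of steps (ii)-(v), assuming the imported data frame is called MLData2023; the three Windows level names below are placeholders only, so check levels(MLData2023$Operating.System) for the exact spellings in your data:

library(dplyr)
library(forcats)
MLData2023_cleaned <- MLData2023 %>%
  filter(Class %in% c(0, 1)) %>%                       # step (ii)
  mutate(
    Operating.System = fct_collapse(Operating.System,  # step (iii)
      Windows_All = c("Windows 10", "Windows 7", "Windows Server"),  # placeholder level names
      Others = c("iOS", "Linux (Unknown)", "Other")),
    Connection.State = fct_collapse(Connection.State,  # step (iv)
      Others = c("INVALID", "NEW", "RELATED"))
  ) %>%
  na.omit()                                            # step (v): complete cases only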

c) Use the code below to generate two training datasets (one unbalanced, mydata.ub.train, and one balanced, mydata.b.train) along with the testing set (mydata.test). Make sure you enter your student ID into the command set.seed(.).

# Separate samples of non-malicious and malicious events
dat.class0 <- MLData2023_cleaned %>% filter(Class == 0) # non-malicious
dat.class1 <- MLData2023_cleaned %>% filter(Class == 1) # malicious

# Randomly select 19800 non-malicious and 200 malicious samples, then combine them to form the training samples
set.seed(Enter your student ID)
rows.train0 <- sample(1:nrow(dat.class0), size = 19800, replace = FALSE)
rows.train1 <- sample(1:nrow(dat.class1), size = 200, replace = FALSE)

# Your 20000 unbalanced training samples
train.class0 <- dat.class0[rows.train0,] # Non-malicious samples
train.class1 <- dat.class1[rows.train1,] # Malicious samples
mydata.ub.train <- rbind(train.class0, train.class1)
mydata.ub.train <- mydata.ub.train %>%
  mutate(Class = factor(Class, labels = c("NonMal","Mal")))

# Your 39600 balanced training samples, i.e. 19800 non-malicious and malicious samples each.
set.seed(123)
train.class1_2 <- train.class1[sample(1:nrow(train.class1), size = 19800,
                                      replace = TRUE),]
mydata.b.train <- rbind(train.class0, train.class1_2)
mydata.b.train <- mydata.b.train %>%
  mutate(Class = factor(Class, labels = c("NonMal","Mal")))

# Your testing samples
test.class0 <- dat.class0[-rows.train0,]
test.class1 <- dat.class1[-rows.train1,]
mydata.test <- rbind(test.class0, test.class1)
mydata.test <- mydata.test %>%
  mutate(Class = factor(Class, labels = c("NonMal","Mal")))

Note that in the master dataset, the percentage of malicious events is less than 1%. This distribution is roughly reflected in the unbalanced training data. The balanced training data are generated by up-sampling the minority class using bootstrapping. The idea here is to ensure the trained model is not biased towards the majority class, i.e. non-malicious events.
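As a quick sanity check (a minimal sketch, assuming the objects created above), you can confirm the class proportions in each training set:

prop.table(table(mydata.ub.train$Class)) # roughly 99% NonMal vs 1% Mal
prop.table(table(mydata.b.train$Class))  # 50% NonMal vs 50% Mal after up-sampling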

2) Compare the performances of different ML algorithms

a) Randomly select two supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your two ML approaches are given by myModels.

set.seed(Enter your student ID)
models.list1 <- c("Logistic Ridge Regression",
                  "Logistic LASSO Regression",
                  "Logistic Elastic-Net Regression")
models.list2 <- c("Classification Tree",
                  "Bagging Tree",
                  "Random Forest")
myModels <- c(sample(models.list1, size = 1),
              sample(models.list2, size = 1))
myModels %>% data.frame

For each of your two ML modelling approaches, you will need to:

b) Run the ML algorithm in R on the two training sets, with Class as the outcome variable.

c) Perform hyperparameter tuning to optimise the model:

Outline your hyperparameter tuning/searching strategy for each of the ML modelling approaches. Report on the search range(s) for hyperparameter tuning, which k-fold CV was used, the number of repeated CVs (if applicable), and the final optimal tuning parameter values and relevant CV statistics (i.e. CV results, tables and plots), where appropriate. If you are using repeated CVs, a minimum of 2 repeats is required.

If your selected tree model is Bagging, you must tune the nbagg, cp and minsplit hyperparameters, with at least 3 values for each.

If your selected tree model is Random Forest, you must tune the num.trees and mtry hyperparameters, with at least 3 values for each. Be sure to set the randomisation seed using your student ID.
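For illustration only, a minimal sketch of a grid search for the Random Forest case, assuming the ranger package and the balanced training set; the grid values below are placeholders, and the out-of-bag (OOB) error is used here as a simple stand-in for the k-fold CV required by the assessment:

library(ranger)
rf.grid <- expand.grid(num.trees = c(200, 400, 600),  # placeholder values: at least 3 each
                       mtry = c(2, 4, 6))
rf.grid$OOB.error <- apply(rf.grid, 1, function(p) {
  fit <- ranger(Class ~ ., data = mydata.b.train,
                num.trees = p["num.trees"], mtry = p["mtry"],
                seed = 12345678)                       # replace with your student ID
  fit$prediction.error                                 # OOB misclassification error
})
rf.grid[which.min(rf.grid$OOB.error), ]                # best combination under OOB error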

d) Evaluate the predictive performance of your two ML models, derived from the balanced and unbalanced training sets, on the testing set. Provide the confusion matrices, and report and describe the following measures in the context of the project:

False positive rate

False negative rate

Overall accuracy

Make sure you define each of the above metrics in the context of the case study.
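A minimal sketch of how these measures can be computed on the testing set, assuming fit is one of your trained models and that predict(.) returns class labels (some model types need an extra argument, e.g. type = "class" for rpart, or extracting the $predictions element for ranger):

pred <- predict(fit, newdata = mydata.test)
cm <- table(Predicted = pred, Actual = mydata.test$Class)
cm                                                # confusion matrix
FPR <- cm["Mal", "NonMal"] / sum(cm[, "NonMal"])  # non-malicious events flagged as malicious
FNR <- cm["NonMal", "Mal"] / sum(cm[, "Mal"])     # malicious events that were missed
accuracy <- sum(diag(cm)) / sum(cm)               # overall accuracy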

e) Provide a brief statement on your final recommended model and why you chose it. Parsimony, and to a lesser extent interpretability, may be taken into account if the decision is close. You may outline your penalised model if it helps with your argument.

Submit the following:

Your report (.pdf or .docx file): Submit on this page (click "Assignment" at page end).

Three CSV files (the two training sets plus the testing set) and your R code (.R file).
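A minimal sketch of exporting the three required datasets, assuming the objects created in part 1 c); the file names are suggestions only:

write.csv(mydata.ub.train, "mydata.ub.train.csv", row.names = FALSE)
write.csv(mydata.b.train, "mydata.b.train.csv", row.names = FALSE)
write.csv(mydata.test, "mydata.test.csv", row.names = FALSE)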

About the dataset

Variable descriptions

Assembled Payload Size (continuous) The total size of the inbound suspicious payload. Note: This would contain the data sent by the attacker in the TCP conversation up until the event was triggered.

DYNRiskA Score (continuous) An untested in-built risk score assigned by a new SIEM plug-in.

IPV6 Traffic (binary) A flag indicating whether the triggering packet was using IPV6 or IPV4 protocols (True = IPV6).

Response Size (continuous) The total size of the reply data in the TCP conversation prior to the triggering packet.

Source Ping Time (ms) (continuous) The ping time to the IP address which triggered the event record. This is affected by network structure, number of hops and even physical distances. E.g.:

< 1 ms is typically local to the device

1-5 ms is usually located in the local network

5-50 ms is often geographically local to a country

~100-250 ms is trans-continental to servers

250+ ms may be trans-continental to a small network.

Note, these are estimates only and many factors can influence ping times.

Operating System (Categorical) A limited guess as to the operating system that generated the inbound suspicious connection. This is not accurate, but it should be somewhat consistent for each connection.

Connection State (Categorical) An indication of the TCP connection state at the time the packet was triggered.

Connection Rate (continuous) The number of connections per second by the inbound suspicious connection made prior to the event record creation.

Ingress Router (Binary) DCE has two main network connections to the world. This field indicates which connection the events arrived through.

Server Response Packet Time (ms) (continuous) An estimation of the time from when the payload was sent to when the reply packet was generated. This may indicate server processing time/load for the event.

Packet Size (continuous) The size of the triggering packet.

Packet TTL (continuous) The time-to-live (TTL) of the previous inbound packet. TTL can be a measure of how many hops (routers) a packet has traversed before arriving at our network.

Source IP Concurrent Connection (Continuous) How many concurrent connections were open from the source IP at the time the event was triggered.

Class (Binary) Indicates if the event was confirmed malicious, i.e., 0 = Non-malicious, 1 = Malicious.

Data description

Each event record is a snapshot triggered by an individual network packet. The exact triggering conditions for the snapshot are unknown, but it is known that multiple packets are exchanged in a TCP conversation between the source and the target before an event is triggered and a record created. It is also known that each event record is unusual in some way (the SIEM logs many events that may be suspicious). A very small proportion of the data are known to be corrupted by their source systems, and some data are incomplete or incorrectly tagged. The incident response team indicated this is likely to affect fewer than a few hundred records. A list of the relevant features in the data is given above. The raw data for these variables are contained in the MLData2023.csv file provided at the start of this case study.
