INTRUSION DETECTION USING NAÏVE BAYES WITH KERNEL DENSITY ESTIMATION
A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of Master of Technology
YUMNAM SURAJKANTA
Registration no: 13-25-107
Under the Supervision of
Mr. UMAKANTA MAJHI & Mr. AMIT TRIVEDI
Department of Computer Science and Engineering
NATIONAL INSTITUTE OF TECHNOLOGY SILCHAR 2015
NATIONAL INSTITUTE OF TECHNOLOGY SILCHAR, 2015 ALL RIGHTS RESERVED
DECLARATION
Thesis Title: Intrusion Detection using Naïve Bayes with Kernel Density Estimation
Degree for which the Thesis is submitted: Master of Technology
I declare that the presented thesis represents largely my own ideas and work, in my own words. Where others' ideas or words have been included, I have adequately cited and listed them in the reference materials. The thesis has been prepared without resorting to plagiarism. I have adhered to all principles of academic honesty and integrity. No falsified or fabricated data have been presented in the thesis. I understand that any violation of the above will be cause for disciplinary action by the Institute, including revocation of the conferred degree, if conferred, and can also evoke penal action from sources which have not been properly cited or from whom proper permission has not been taken.
(Signature)
Name of the student: Yumnam Surajkanta
Registration no: 13-25-107
Date: / /
CERTIFICATE
It is certified that the work contained in this thesis, entitled Intrusion Detection using Naïve Bayes with Kernel Density Estimation, submitted by Yumnam Surajkanta, Registration no: 13-25-107, for the award of M.Tech, is absolutely based on his own work carried out under my supervision, and that this work/thesis has not been submitted elsewhere for any degree/diploma.
(Umakanta Majhi)
Assistant Professor
Dept. of Computer Science and Engg.
NIT Silchar, Assam
Date: ---/---/------

(Amit Trivedi)
Assistant Professor
Dept. of Computer Science and Engg.
NIT Silchar, Assam
Date: ---/---/------
ACKNOWLEDGEMENT
I take this opportunity to express my sincere thanks to:
Umakanta Majhi, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology (NIT), Silchar, for his valuable supervision, guidance, encouragement and constant advice throughout the course of this investigation; it would not have been possible for me to carry out this research without his co-operation and support.
Amit Trivedi, Assistant Professor, Department of Computer Science and Engineering, NIT Silchar, for his valuable advice and guidance during the course of the investigation.
Prof. Nidul Sinha, Head of the Department of Computer Science and Engineering, for his encouragement and for providing the necessary facilities for carrying out this research work.
I would also like to acknowledge all the staff for their help and co-operation during the course of my research work. I am thankful to all the research scholars, friends and family members who helped me in times of need.
Finally, and not the least, a special thanks to the Almighty God for my well-being throughout this time.
Yumnam Surajkanta
ABSTRACT
As technology advances, information systems have begun to play an important role in our lives. The importance of information systems and the danger posed by intrusions have inspired many researchers to study network intrusions and how to prevent such attacks.
In this thesis we propose a method which provides better accuracy than a baseline Intrusion Detection System that uses only a machine learning algorithm. This improvement in accuracy is brought about by transforming the dataset and selecting the most contributing components. We use Principal Component Analysis to transform the dataset and to identify the vectors or components along which the variance of the data is highest; Principal Component Analysis redefines the data in terms of these orthogonal principal components. The KDD 99 dataset contains a wide variety of intrusion types, and each intrusion targets a different aspect of the system. An attribute which plays an important role in identifying a particular attack type may not be important in detecting other attack types. To obtain optimum accuracy, we group the records in the dataset into groups with similar properties and try to identify the most contributing attributes for each group. K-means clustering is applied to group the records into similar groups. After clustering, a Genetic Algorithm is applied to select an optimum set of attributes for each cluster. We use the relatively simple Naive Bayes classifier to train on and classify the records in the dataset. Traditionally, Naïve Bayes uses a Gaussian distribution to approximate the probability distribution of attribute values. However, many attributes derived from real-life problems seldom follow a normal distribution and cannot be modeled accurately by a simple Gaussian. We improve the accuracy of Naïve Bayes by modeling the probability distribution of the attributes more accurately using a kernel density estimate.
CONTENTS

Certificate
Acknowledgement
Abstract
List of Contents
List of Figures
Chapter 5: Preprocessing of Data
Chapter 6: Principal Component Analysis
References
Publication
List of Figures

Figure 1.1 Growth of the Internet
Figure 1.2 Intrusion Detection System in a network
Figure 3.1 Flow chart of the proposed model
Figure 4.1 Lincoln Lab 1998 experimentation setup
Figure 4.2 Lincoln Lab 1999 experimentation setup showing differences from the 1998 setup
Figure 6.1 Principal components example
Figure 6.2 Angle made by principal components
Figure 8.2 Crossover between a pair of chromosomes
Figure 8.3 Mutation
Figure 9.1 Bayesian network and Naïve Bayes
Figure 9.2 Gaussian estimation of probability distribution
Figure 10.1 Principal component 1
Figure 10.2 Principal component 25
Figure 10.3 Principal component 49
Figure 10.4 Probability distribution of the first component estimated using histogram, normal distribution and kernel
Figure 10.5 Probability distribution of the second component estimated using histogram, normal distribution and kernel
Figure 10.6 Probability distribution of the third component estimated using histogram, normal distribution and kernel
List of Tables

Table 2.1 Intrusion Detection Systems and their methods
Table 4.1 Basic features of individual TCP connections
Table 4.2 Traffic features computed using a two-second time window
Table 4.3 Content features within a connection suggested by domain knowledge
Table 4.4 Host-based traffic features
Table 5.1 Discrete attributes and the number of attributes each generates in preprocessing
Table 10.1 K-means clustering results
Table 10.2 Selected attributes for each cluster
Table 10.3 Accuracy of different methods
Acronym
IDS: Intrusion Detection System
HMM: Hidden Markov Model
DoS: Denial of Service
BMSL: Behavioural Monitoring Specification Language
RIPPER: Repeated Incremental Pruning to Produce Error Reduction
SVM: Support Vector Machine
GA: Genetic Algorithm
DARPA: Defence Advanced Research Projects Agency
R2L: Remote to Local
U2R: User to Root
KDD: Knowledge Discovery and Data Mining
PCA: Principal Component Analysis
Chapter 1 Introduction
Security attacks through the Internet have increased tremendously in recent years; information security is therefore an issue of very serious global concern at the present time. This chapter discusses the growth of the Internet and the security threat posed by its increased complexity, accessibility and openness. The need for network security, and in particular for Intrusion Detection Systems (IDS), is brought out. This chapter also includes the motivation for the work reported in this thesis.
1.1 Intrusion Detection Systems: Background
1.1.1 Growth of the Internet
The Internet has changed the world fundamentally, just as the invention of the telegraph or the incandescent bulb did. It can be considered one of the greatest inventions of mankind, and it has brought mankind closer to the Utopian concept known as the Global Village. In a matter of very few years, the Internet has consolidated itself as a very powerful platform that has changed the way we do business and the way we communicate.
The Internet came into existence in the late seventies as an outgrowth of the ARPANET, a DOD project. The exponential growth of the Internet can be seen in Figure 1.1 from the statistics over this period. At the end of 1995 there were about 16 million Internet users, roughly 0.4% of the total population. By the end of 2005, a decade later, this had grown to 1,018 million users, about 15.7% of the total population, and as of June 10, 2008, 1.596 billion people use the Internet according to Internet World Stats. The Information Technology revolution has touched every aspect of our lives. Once-standalone home appliances such as washing machines, ovens and TVs are increasingly becoming IT-enabled and can exchange information through the Internet, ushering us into another revolution known as the Internet of Things.
1.1.2 Growth of Internet attacks
Information Technology has become the technology enabler for everyone: it has revolutionized the way we communicate, made data exchange much easier, and become the largest knowledge repository of mankind. However, the very ideas that make the Internet what it is, such as open access and interconnectivity, also make it vulnerable to security attacks. The factors which drive Internet growth, such as the falling price of digital computers and larger bandwidth, also make it easier for hackers to attack the Internet.
There are multiple penetration points at which intrusions can take place in a network system. For example, at the network level carefully crafted malicious IP packets can crash a victim host; at the host level, vulnerabilities in system software can be exploited to yield an illegal root shell. Security threats have exploited all kinds of networks, ranging from traditional computers to point-to-point and distributed networks. They have also exploited vulnerable protocols and operating systems, extending attacks from operating systems to various kinds of applications, such as database and web servers. The most popular operating systems regularly publish updates, but the combination of poorly administered machines, uninformed users, a vast number of targets, and ever-present software bugs has allowed exploits to remain ahead of patches.
1.1.3 Financial risks in corporate networks
The threats on the Internet can translate to substantial losses resulting from business disruption, loss of time and money, and damage to reputation. The financial impact of application downtime and lost productivity caused by the increasing number of application level vulnerabilities and frequency of attacks is substantial.
According to a census conducted in the US in 2007, the volume of business-to-business commerce increased from US $38 billion in 1997 to US $990 billion in 2006. The total e-commerce revenue of the US, about US $20 billion in 1999, had likewise risen to US $990 billion by 2006. It was predicted that by 2008 online retail sales would account for 10 percent of total US retail sales. There may have been a boom in the usage of the Internet and online businesses, but one main issue is the security of the online environment, which affects users and businesses alike.
According to a US report, cybercrime costs US businesses about US $67.2 billion a year. Over the past two years, US consumers have lost US $8 billion to online fraud schemes. Online fraudsters are not only cheating online businesses; they are also increasing the perception of fear among consumers. The 2005 annual computer crime and security survey [46], jointly conducted by the Computer Security Institute and the FBI, indicated that the financial losses incurred by the respondent companies due to network attacks were US $130 million.
1.1.4 Need for Intrusion Detection Systems
Intrusions refer to network attacks against vulnerable services, data-driven attacks on applications, host-based attacks like privilege escalation, unauthorized logins and access to sensitive files, or malware like viruses, worms and Trojan horses. These actions attempt to compromise the integrity, confidentiality or availability of a resource. Intrusions result in services being denied, systems failing to respond, or data being stolen or lost. Intrusion detection means detecting unauthorized use of, or attacks on, a system or network. Intrusion Detection Systems are implemented in software or hardware to detect these activities.
The existing network security solutions, including firewalls, were not designed to handle network and application layer attacks such as Denial of Service and Distributed Denial of Service attacks, worms, viruses, and Trojans. Along with the drastic growth of the Internet, the high prevalence of the threats over the Internet has been the reason for the security personnel to think of IDSs.
Unauthorized activities on the Internet come not only from external attackers but also from internal sources, such as fraudulent employees or people abusing their privileges for personal gain or revenge. These internal activities cannot be prevented by a firewall, which usually stops external traffic from entering the internal network. Firewalls are made to stop unwanted network traffic into or out of a network. Packet filtering firewalls typically scan a packet for layer 3 and layer 4 protocol information, and most firewalls have few dynamic defensive abilities: traffic approaching the firewall either matches an applied rule and is allowed through, or it is stopped and the firewall logs the blocked traffic. The IDS, along with the firewall, forms the fundamental technology for network security.
IDSs can be categorized into two classes: anomaly based IDSs and misuse based IDSs. Anomaly based IDSs look for deviations from normal usage behavior to identify abnormal behavior. Misuse based IDSs, on the other hand, recognize patterns of attack. Anomaly detection techniques rely on models of the normal behavior of a computer system. These models may focus on the users, the applications, or the network. Behavior profiles are built by performing statistical analysis on historical data or by using rule based approaches to specify behavior patterns. A basic assumption of anomaly detection is that attacks differ from normal behavior in type and amount. By defining what is normal, any violation can be identified, whether or not it is part of the threat model. However, the advantage of detecting previously unknown attacks is paid for in terms of high false-positive rates in anomaly detection systems. It is also difficult to train an anomaly detection system in highly dynamic environments. Anomaly detection systems are intrinsically complex, and there is also some difficulty in determining which specific event triggered an alarm.
On the other hand, misuse detection systems essentially contain attack descriptions or signatures and match them against the audit data stream, looking for evidence of known attacks. The main advantage of misuse detection systems is that they focus analysis on the audit data and typically produce few false positives. The main disadvantage of misuse detection systems is that they can detect only known attacks for which they have a defined signature. As new attacks are discovered, developers must model and add them to the signature database. In addition, signature-based IDSs are more vulnerable to attacks aimed at triggering a high volume of detection alerts by injecting traffic that has been specifically crafted to match the signatures used in the analysis process. This type of attack can be used to exhaust the resources on the IDS computing platform and to hide attacks within the large number of alerts produced.
In contrast to firewalls, a misuse based IDS will scan all packets at layers 3 and 4 as well as the application level protocols, looking for back door Trojans, Denial of Service attacks, worms, buffer overflow attacks, scans against the network, etc. An IDS provides much greater visibility for detecting signs of attacks and compromised hosts. There is still the need for a firewall to block traffic before it enters the network, but an IDS is also needed to make sure that the traffic that gets past the firewall is monitored.
1.2 Motivation
The motivation behind our work is the important role that information systems play in our lives and how vulnerable these systems are to attack. The vulnerability of Information Technology was highlighted by the Y2K scare, which was caused by a small bug in the way dates were represented in computers. Everything in our lives, ranging from the microwave oven to mission-critical flight control systems in airliners, is controlled by information systems in some way or other. We can only imagine the catastrophe that could be caused by a software bug or an intrusion attack in the control system of a system where failure is not an option, such as a nuclear reactor or a manned space mission.
Though there has been tremendous development in Intrusion Detection Systems, even the most advanced cannot claim to detect all attacks without error. Intrusion Detection Systems seem to be playing catch-up with intrusions: hackers always seem to find a loophole in the security of our systems, no matter how well designed the security system is. The current policy of dealing with security loopholes by releasing security patches is not helping the situation, as most users don't keep their systems up to date.
The current trend in network design is towards higher speed and larger bandwidth. Gigabit networks are becoming common even in home networks. However, most present Intrusion Detection Systems, both in research works and among those commercially available, are not designed with high-speed networks in mind: they show high resource consumption and drop many packets on high-speed networks.
Most available Intrusion Detection Systems use string matching or pattern recognition methods in their detection engines. Though these have advantages, such as a low false alarm rate and simpler implementation, they have major disadvantages, like requiring an attack pattern before an attack can be detected and the inability to detect novel attacks. Researchers have proposed Intrusion Detection Systems which can detect novel attacks and do not require frequent updates of an attack database. In spite of the advantages provided by these new methods, they have not yet been adopted in the available Intrusion Detection Systems. There seems to be a mismatch between the methods adopted by the available Intrusion Detection Systems and those favored by researchers.
Most work on Intrusion Detection Systems uses a common set of features for identifying all types of attack. Since different intrusions focus on different aspects of the system, it is quite possible that different intrusions have different characteristics and that not all attributes play an equal role. Some attributes may act as noise and affect the accuracy of the Intrusion Detection System. Also, most work on Intrusion Detection Systems employs advanced classifiers with the idea that they can give better performance. However, such complex classifiers are resource intensive, cannot be scaled easily, and spend a huge amount of time in training.
1.3 Problem Statement
It is very challenging to design a foolproof Intrusion Detection System which can detect all intrusions. Most of the proposed Intrusion Detection Systems with relatively high detection rates use complex and complicated methods, such as ensembles of classifiers or hybrids of different methods, which make them impractical for use in the real world.
In this thesis we propose an Intrusion Detection System with a relatively simple classifier which can approach the performance of advanced Intrusion Detection Systems with complicated detection methods. By learning and exploiting the properties associated with each intrusion, it is possible to improve the accuracy of the Intrusion Detection System instead of using complicated methods and defining a broad rule for detecting all attack types. The things we want to accomplish in our work are summarized in the steps below.
- To identify and select the most contributing attributes from the dataset
- To group the dataset into similar groups and identify a common set of rules for each
- Improve the accuracy of the classifier by building better model of the data
1.4 Outline of the Thesis
CHAPTER 1: This chapter gives a brief outline of the growth of the Internet and the danger posed by intrusions to global business, together with the motivation for the work and the problem statement. Finally, the outline of the individual chapters of the thesis is given.
CHAPTER 2: This chapter presents a brief literature survey.
CHAPTER 3: This chapter describes the proposed model and gives an overview of each step of the method.
CHAPTER 4: This chapter explains the DARPA dataset and its derivative, the KDD 99 dataset, along with a brief overview of the problems in the KDD 99 dataset.
CHAPTER 5: This chapter explains the steps carried out during preprocessing of the dataset.
CHAPTER 6: This chapter explains the need for transforming the original dataset and why Principal Component Analysis is used for the transformation. The steps involved in Principal Component Analysis are also outlined.
CHAPTER 7: This chapter outlines the clustering technique we used for grouping the records in the dataset into groups with similar properties.
CHAPTER 8: This chapter explains the need for selecting the correct set of attributes for each cluster and how this can be done. It also explains the general steps involved in the Genetic Algorithm and how it can be applied to our problem.
CHAPTER 9: This chapter explains the Naïve Bayes classifier and how it can be trained to learn the behavior of intrusions. It also highlights the importance of correctly modeling the probability distribution of the data.
CHAPTER 10: This chapter presents the results of the methods employed in our work and compares the results of the different methods.
CHAPTER 11: This chapter gives the conclusion of our work and outlines future work.
Chapter 2 Literature Survey
Intrusion Detection Systems can be classified in a number of ways. In this chapter we have organized Intrusion Detection Systems according to their detection capability and their data source, further subdivided by the methods used.
2.1. Data acquisition
Depending on its data source, an IDS can be classified as either a Host Based IDS or a Network Based IDS.
2.1.1. Host Based IDS
A Host Based IDS acquires its data from the single host on which it is installed. Since it is concerned with just one system, it has access to high-level data like application logs, system logs and security logs, as well as network data. It can identify attacks which are difficult to detect from network data alone, such as attacks from inside the network [21]. However, a host based IDS consumes more computational resources, since it has to process more data, and it generates alerts only after the intrusion has already affected the system.
Article [21] proposed a host based IDS which used a Hidden Markov Model (HMM) to detect intrusions. The authors built normal behavioral models of the programs sendmail, inetd and lpr based on the system calls generated by them, and used these models successfully to detect the syslogd intrusion in sendmail, the iprcp intrusion in lpr, and a DoS attack against inetd.
2.1.2 Network Based IDS
Network based IDS inspects all network packets which are addressed to the particular network which it is monitoring. It usually runs on a strategic location like gateway or switch where it can have access to maximum network packets addressed to hosts inside the network. It only uses features generated from network packets and does not use data from audit files in identifying network intrusion.
In practice, most IDSs are not purely host based or network based; most are hybrids of both techniques. The paper [13] proposed a centralized network based IDS for detecting intrusion attacks such as Denial of Service (DoS), probes and network worms. It consists of sniffers placed at different locations inside the network. The sniffers collect network data, preprocess it by aggregating network packets within two-second time slots and extracting features which define the state of the network, such as the counts of different TCP flags, the number of distinct ports, and the number of packets, and send the result to a central classifier. The central classifier uses information from multiple sniffers to determine whether there is an intrusion in the network.
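To make the aggregation step concrete, here is a minimal sketch, not the implementation from [13], of how sniffed packets could be grouped into two-second slots and reduced to state features; the packet field names (ts, tcp_flags, dst_port) are hypothetical:

```python
from collections import Counter

def window_features(packets, window=2.0):
    """Group packets into fixed time slots and extract simple state features."""
    packets = sorted(packets, key=lambda p: p["ts"])
    if not packets:
        return []
    start = packets[0]["ts"]
    slots = {}
    for p in packets:
        slots.setdefault(int((p["ts"] - start) // window), []).append(p)
    rows = []
    for _, group in sorted(slots.items()):
        flags = Counter(p["tcp_flags"] for p in group)
        rows.append({
            "n_packets": len(group),                                # traffic volume
            "n_distinct_ports": len({p["dst_port"] for p in group}),
            "n_syn": flags["S"],                                    # per-flag counts
            "n_rst": flags["R"],
        })
    return rows
```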
Traditional network based IDSs have scalability issues in high-speed networks, where bandwidth can be measured in gigabits per second. Traditionally a network based IDS inspects every packet in the network traffic; packet inspection is not suited for networks with large bandwidth, as the IDS cannot cope with the huge volume of packets. Popular IDSs like snort [4] exhibit high resource consumption when confronted with a high-speed network. Researchers are looking into flow based IDSs as a way to deal with high-speed networks: instead of examining individual packets, a flow based IDS examines communication patterns. Network traffic is aggregated into flows based on source IP, destination IP, source port and destination port. A flow contains aggregated information such as the amount of data exchanged, the start time, the end time and the number of packets. A flow module can easily be added to network devices like routers to generate flow information, and by examining the flow patterns network attacks can be identified [39].
2.2 Detection capability
Intrusion Detection Systems have historically been classified into misuse based IDSs and anomaly based IDSs according to their detection capability. A misuse based IDS can identify not only the intrusion traffic but also the attack category to which it belongs. It can identify attacks which are present in the training data, but cannot identify novel attacks which are not. An anomaly based IDS can identify whether a packet is anomalous or not, but cannot identify the attack type to which an anomalous packet belongs. Anomaly based IDSs train on normal data which are free from any attack.
2.3 Misuse Based IDS
Misuse based IDSs are trained on labeled data which contain both normal data and attack data. They learn the behavior of both the normal data patterns and the attack data patterns, then use this trained model to classify network traffic into normal traffic and attack traffic. A misuse based IDS has a lower error rate in identifying network intrusions compared to an anomaly based IDS. Based on their detection method, misuse based IDSs can be divided into the following categories.
2.3.1 Pattern Recognition Based
Many popular IDSs use pattern matching in their detection engines. Once the signature of an intrusion attack is known, a pattern describing the attack can be hand-coded by a network administrator or generated by tools. A pattern can be anything describing the intrusion, like the number of failed connections, a particular sequence of system calls, an IP address or port, the size of the payload, or a particular pattern in the payload. The detection engine searches for this attack pattern in network traffic and alerts the network administrator if a match is found. In snort [4], attack patterns are described in the form of snort rules. Snort can search for the attack pattern both in the packet header and in the payload. Internally, rules are stored in the form of two-dimensional linked lists called Chain Headers and Chain Options: all the common attributes are stored in the Chain Headers and the detection modifiers are stored in the Chain Options. These rule chains are searched recursively for every packet. Researchers have proposed many methods to improve the performance of the pattern matching technique used in snort's detection engine. Paper [1] described a technique in which attack patterns are described in the same way, using snort rules, but the pattern matching is done on a graphics processor instead of the CPU. They claimed that using a graphics processor improved performance in terms of speed, scalability, flexibility and price.
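The two-level organization can be pictured with a toy sketch in which shared header fields act like Chain Headers and per-rule payload patterns like Chain Options. This is only an illustration of the idea, not snort's actual data structures or rule syntax:

```python
# Rules grouped by common header attributes (protocol, destination port);
# each group carries its own list of detection options (payload patterns).
rules = {
    ("tcp", 80): [
        {"content": b"/etc/passwd", "msg": "path traversal attempt"},
        {"content": b"cmd.exe",     "msg": "IIS command execution"},
    ],
}

def match(packet):
    """Check the header group first, then scan each option's pattern."""
    options = rules.get((packet["proto"], packet["dst_port"]), [])
    return [r["msg"] for r in options if r["content"] in packet["payload"]]

print(match({"proto": "tcp", "dst_port": 80, "payload": b"GET /etc/passwd"}))
# ['path traversal attempt']
```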
2.3.2 Rule Based
A rule based IDS detects intrusions by observing events in the system and applying a set of rules that lead to a decision on whether a given pattern of activity is suspicious. These rules can describe anything from the normal behavior of programs and users to intrusion patterns, and are usually expressed as production rules. A rule based IDS processes the events describing the system state, leading to the assertion of new derived facts which can be used for further derivation. The derivation continues as long as facts match the production rules, leading to a conclusion about the state of the system, and stops if no production rule matches. Intrusion knowledge is usually encoded in the form of IF-ELSE conditions in a program, and alerts are fired if network traffic meets any of the conditions. The paper [5] described a forward-chaining rule based system for detecting intrusions. It used the Production-Based Expert System Toolset (P-BEST) as the rule translator: rules are written in the P-BEST production rule specification language and then translated into a C language expert system program.
2.3.3 Specification based
A specification based IDS monitors system events, such as the arrival of network packets or system calls, and checks whether they follow certain specifications. These specifications can be based on a protocol, or developed by experts based on system or program behavior. The IDS monitors network traffic and sounds an alarm if system behavior does not follow these specifications. The article [9] described a system which detects intrusions by looking for events which do not follow the specifications stored in its detection engine. The authors developed a language called the behavioral monitoring specification language (BMSL), which allowed them to write concise specifications of event-based, security-relevant properties. In paper [7], the authors stressed that strict adherence to the protocol is not desirable: it would require much more effort than a protocol abstraction, and protocol implementations differ slightly across systems, which may lead to normal traffic from other systems being classified as invalid. They modeled TCP connections using extended finite state automata and claimed that the system could detect all the probing and denial of service attacks in the 1999 Lincoln Lab intrusion evaluation data with a low false alarm rate.
2.3.4 Data mining Based
Some researchers have proposed Intrusion Detection Systems based on data mining techniques; they see intrusion detection as a data analysis problem. Instead of using experts' knowledge to describe intrusion attack patterns, data mining techniques can be used to discover the patterns of intrusion attacks. Data mining can find relationships among data which are not apparent on casual observation, and can discover both spatial and temporal relations. Paper [14] proposed a framework for building intrusion detection on data mining techniques. It used data mining algorithms to find association rules and frequent episodes. These relationships are expressed in terms of support and confidence: support tells how frequently the features occur together, and confidence tells how frequently the pattern occurs when the features are together. The authors then found patterns which describe intrusion attacks and used them to create a model using RIPPER rules. They suggest using multiple detection models instead of a single monolithic classifier: a classifier which combines features from too many sensors usually becomes a performance bottleneck and can itself become the target of a network attack. They use meta-learning to combine the evidence from multiple classifiers to detect intrusion attacks.
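As a worked illustration of support and confidence (toy records and feature names, not the actual framework of [14]): for a rule X → Y, support is the fraction of records in which X and Y occur together, and confidence is the fraction of records containing X that also contain Y.

```python
# Toy connection records; the rule tested is flag == "S0" -> label == "neptune".
records = [
    {"service": "http", "flag": "S0", "label": "neptune"},
    {"service": "http", "flag": "SF", "label": "normal"},
    {"service": "http", "flag": "S0", "label": "neptune"},
    {"service": "ftp",  "flag": "SF", "label": "normal"},
]
x = lambda r: r["flag"] == "S0"            # antecedent X
y = lambda r: r["label"] == "neptune"      # consequent Y
n_x  = sum(1 for r in records if x(r))
n_xy = sum(1 for r in records if x(r) and y(r))
print("support:", n_xy / len(records))     # 0.5 -> X and Y co-occur in half the records
print("confidence:", n_xy / n_x)           # 1.0 -> whenever X holds, Y holds
```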
2.3.5 Genetic Algorithm Based
There is a class of Intrusion Detection Systems inspired by biological processes; these can be broadly classified into immune system inspired IDSs and Genetic Algorithm based IDSs. From a set of basic rules, a genetic algorithm can generate optimal rules through evolutionary computation. A GA is an iterative process which starts with basic or randomly generated rules and, through each iteration, tries to find the fittest population (rules). It mimics biological evolution by generating new rules created by mixing properties of parent rules, and by allowing mutation to occur in the process. Through a fitness function, some percentage of the fittest individuals are selected and allowed to propagate to the next iteration. Paper [15] proposed using Genetic Algorithms to select the optimal feature set and parameters for an SVM classifier. Features and parameters are encoded in chromosomes, where they can be processed by the Genetic Algorithm. The GA builds new chromosomes and searches for the optimal detection model based on the fitness values obtained from the SVM classifier; at the end of the learning stage, the optimal detection model consists of a set of features and parameters.
2.4 Anomaly Based IDS
An anomaly based Intrusion Detection System works on the principle that network intrusion behavior is quite different from normal behavior. Unlike a misuse based IDS, it does not need known attack signatures or labeled attack data to train its model. It builds a model of normal network behavior and classifies anything which differs significantly from that model as an intrusion. The advantage of this approach over misuse based IDS is that it can detect both known and unknown attacks; however, it tends to have a higher error rate.
2.4.1 Statistical Based
Statistical anomaly based IDSs are based on the idea that network intrusion behavior is quite different from normal traffic and that a significant portion of the network traffic is normal. Various statistical methods are used to build a normal traffic model of the network, and any traffic deviating from the model is classified as an anomaly. They can be further categorized into supervised and unsupervised methods, based on whether the training data needs to be labeled. Network traffic is highly dynamic in nature; its behavior can change from day to day and from hour to hour. Network traffic models which can adapt to changing network behavior are called non-stationary models. Paper [18] proposed an intrusion detection system based on chi-square statistics: it builds a model of the normal usage profile, and then uses the chi-square statistic to measure the difference between the normal profile and the observed data.
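A minimal sketch of the chi-square idea (illustrative numbers, not the profile construction of [18]): build a vector of expected event frequencies from attack-free data, then score each observation window by its chi-square distance from that profile.

```python
import numpy as np

def chi_square_score(observed, expected):
    """Chi-square distance between observed counts and the normal profile."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return np.sum((observed - expected) ** 2 / expected)

normal_profile = np.array([50.0, 30.0, 15.0, 5.0])   # expected event counts
window_counts  = np.array([20.0, 25.0, 40.0, 15.0])  # counts observed in one window
# The alarm threshold would be calibrated on attack-free data; 30.0 is illustrative.
if chi_square_score(window_counts, normal_profile) > 30.0:
    print("anomaly")
```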
2.4.2 Signal Processing Based
Applications of signal processing methods in the field of intrusion detection are relatively recent compared to other methods. Signal processing involves converting a time domain signal into the frequency domain. Some features of the network traffic, such as packet counts, can be taken as a signal: normal network traffic has a certain frequency distribution, and intrusion traffic deviates from this normal profile and can be identified. Paper [30] described an Intrusion Detection System for detecting DoS attacks using spectral analysis. A TCP data flow involves sending data packets followed by acknowledgement packets, so the authors theorized that TCP data exchanges are periodic in nature, while DoS packets are not periodic and can therefore be identified. Paper [31] describes an anomaly detection system based on wavelet analysis. The authors extracted fifteen features from network traffic and used them as the input signal. A prediction model of the traffic is constructed in which the wavelet coefficients play an important role; this prediction model outputs the difference between normal and anomalous traffic.
2.4.3 Immune System Based
There are many similarities between the human immune system and computer security systems: both use the concept of self, where entities which do not belong to the system are identified as intruders and action is taken against them. Paper [28] describes an Intrusion Detection System which consists of immune system inspired, cooperating mobile agents which monitor a computer's activities. These agents are capable of learning and of adapting to a changing environment, and they can detect both known and unknown types of intrusion attack.
2.4.4 Heuristic Based
Heuristic techniques are used in both anomaly based and misuse based IDSs: a misuse based IDS learns the patterns of intrusion attacks, whereas an anomaly based IDS learns the behavior of the normal usage profile. A normal usage profile can describe a user's network usage, the pattern of system calls made by a process, a user's website browsing patterns, etc. The IDS compares the current usage pattern to the model pattern and classifies it as an anomaly if the deviation is significant. Paper [35] proposed a method for detecting anomalies by tracing system calls. The authors built behavior profiles of several programs by capturing their system calls, and compared the performance of direct matching and a neural network in detecting anomalies.
2.5 Hybrid Approach
Network security researchers are looking into hybrid approaches as a way to deal with the disadvantages of the individual approaches. For instance, a misuse based IDS can detect known attack types with a low false rate but cannot detect novel attacks; an anomaly based IDS can detect both known and unknown attack types but has a high error rate. By combining misuse based and anomaly based techniques, we can design an IDS which detects both known and unknown attacks and still has a low error rate. Similarly, by combining the features of network and host based IDSs, we can design an IDS which detects intrusions by analyzing audit files as well as network information. Paper [42] described an IDS which is a hybrid of an anomaly based IDS and a misuse based IDS. The authors used the popular open source IDS snort as the misuse based IDS and main detection engine, then extended the snort architecture by adding a Packet Header Anomaly Detector [43] and a Network Traffic Anomaly Detector [44] as preprocessors. They evaluated the IDS with the DARPA [45] dataset and reported that the hybrid system detected intrusions which were previously undetectable by snort alone.
Table 2.1 Intrusion Detection Systems and their methods

Methods | Authors | Data Used | Detection Capability | Detection Methods
Pattern Matching | Payam [1], Hyun Jin Kim [2], Roesch [4], Liu [3] | Network packets | Probe, DoS, R2L | Users have to supply intrusion signatures.
Rule Based | Ulf Lindqvist [5] | Network packets, audit data | DoS, buffer overflow | Users have to supply rules.
Specification Based | R. Sekar [7] | Network packets | Probe, DoS | Network protocol based.
Specification Based | Prem Uppuluri [9] | Audit data | R2L, U2R | Specification developed from program behavior.
Heuristic Based | Shi-Jinn Horng [10], Levent Koc [11] | Network packets, audit data | Probe, DoS, R2L, U2R | Learns from labeled data.
Heuristic Based | Omar Al-Jarrah [12] | Network packets | Probe | Learns from labeled data.
Data Mining | Wenke Lee [14] | Network packets, audit data | Probe, DoS, R2L, U2R | Rules developed using data mining methods.
Genetic Algorithm Based | Dong Seong Kim [15] | Network packets, audit data | Probe, DoS, R2L, U2R | Learns from labeled data.
Genetic Algorithm Based | Wei Li [16] | Network data | Probe, DoS | Rules generated by GA.
Statistics Based | Nong Ye [29] | Audit data | Probe, DoS, R2L, U2R | Uses the Hotelling test to detect intrusions.
Statistics Based | Qiang Chen [18] | Audit data | Probe, DoS, R2L, U2R | Uses the chi-square test to detect intrusions.
Artificial Immune System Based | Azzedine Boukerche [24] | Audit data | Probe, DoS, R2L, U2R | Used Logcheck to detect anomalous events and produced an immune-inspired response.
Artificial Immune System Based | Farhoud [27] | Network data | Probe, DoS | Used multiple distributed detectors to detect anomalies.
Artificial Immune System Based | Dasgupta [28] | Audit data | R2L, U2R | Used multi-level monitoring to detect attacks.
Signal Processing Based | Chen-Mou [30], Alberto Dainotti [32], Wei Lu [31] | Network data | DoS | Used spectral analysis and wavelet transforms to detect anomalies.
Chapter 3 Proposed Model
A number of works have been done on the detection of network intrusions using classifiers. This involves training a machine learning algorithm with labeled data and using the trained model to classify traffic. Most researchers focus on the learning ability of the classifier and often neglect the structure embedded in the input data. In the quest to improve the accuracy of the Intrusion Detection System, they employ complicated methods such as ensembles of classifiers, or ever more complex and resource-consuming classifiers. In doing so, they have made intrusion detection more complex, taking it further away from practical implementation. In our work we focus on the structure inherent in the dataset, namely the KDD 99 dataset. We believe that by properly identifying this structure and forming appropriate rules according to it, even a relatively simple classifier like Naïve Bayes can give performance which challenges the most sophisticated classifiers.
3.1 Method Description
We use a modified KDD 99 dataset known as NSL-KDD, which rectifies some of the problems identified in the KDD 99 dataset, to evaluate our proposed method. Each record in the KDD 99 dataset has 41 attributes, comprising both continuous and discrete values. All the discrete attributes need to be converted to numeric ones to make the data uniform and more amenable to the application of intrusion detection methods. In the preprocessing step we remove all non-contributing attributes, i.e. those which have a constant value for all records in the dataset. We also remove the disparity in the ranges of the different attributes by normalizing all attributes so that their values lie between 0 and 1. Once the data has been preprocessed, we begin the task of identifying the most important attributes. Here we adopt an unsupervised attribute selection method, assuming that the attributes with the most variance are the most important.
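A minimal preprocessing sketch follows, assuming the records are loaded into a pandas DataFrame with a label column and the three discrete KDD attributes protocol_type, service and flag; this illustrates the steps, not our exact implementation:

```python
import pandas as pd

def preprocess(df, discrete_cols=("protocol_type", "service", "flag")):
    """One-hot encode discrete attributes, drop constant columns,
    and min-max normalize every attribute into [0, 1]."""
    labels = df["label"]
    x = df.drop(columns=["label"])
    x = pd.get_dummies(x, columns=list(discrete_cols), dtype=float)  # discrete -> numeric
    x = x.loc[:, x.nunique() > 1]                # remove non-contributing attributes
    x = (x - x.min()) / (x.max() - x.min())      # scale each attribute to [0, 1]
    return x, labels

# Tiny illustrative frame (the real NSL-KDD records have 41 attributes plus a label):
df = pd.DataFrame({
    "duration": [0, 2, 0], "protocol_type": ["tcp", "udp", "tcp"],
    "service": ["http", "domain_u", "http"], "flag": ["SF", "SF", "S0"],
    "num_outbound_cmds": [0, 0, 0],              # constant, so it gets dropped
    "label": ["normal", "normal", "neptune"],
})
X, y = preprocess(df)
```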
Using Principal Component Analysis, we identify the principal vectors or components in the attribute space along which the data vary the most, and the data are projected onto these vectors to obtain the transformed dataset. Different intrusion attacks target different aspects of the system, and their characteristics can be very different from one another. It cannot be taken for granted that intrusion detection rules which work well in detecting one type of attack will work equally well in detecting other kinds, and similarly, a set of attributes which plays an important role in identifying one type of attack may not play a similar role in identifying others. We therefore feel it is important to group the records in the dataset into similar groups and to select a common set of attributes and intrusion detection rules for each group to obtain the optimum result. We employ k-means clustering to group the data into similar groups. Since intrusion attacks are generally classified into four types, we take the number of clusters as five: four attack classes and one normal class.
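The transformation and grouping can be sketched with scikit-learn (an assumed toolchain; the thesis does not prescribe a library), with X standing in for the preprocessed feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(1000, 40)          # stand-in for the preprocessed NSL-KDD matrix

pca = PCA()                           # keep all components; selection happens later
Z = pca.fit_transform(X)              # project records onto the principal components

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)  # 4 attack classes + normal
cluster_ids = kmeans.fit_predict(Z)   # cluster label for every record
```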
Once the records in the dataset have been clustered into homogeneous groups, we begin the task of identifying the most contributing attributes for each group. For selecting and identifying important attributes, we employ a Genetic Algorithm. It begins with randomly selected sets of attributes and, through the biological evolutionary paradigm, tries to find the optimum selection of attributes. Each iteration, or generation, contains a set of attribute selections, called the population in Genetic Algorithm terminology; each selection is called an individual or chromosome. Using a fitness function which calculates the suitability of an individual for the task, the GA identifies a group of the fittest individuals in each generation. This group of fittest individuals is then propagated to the next generation by exchanging genetic material among themselves, mimicking reproduction in nature. The process continues until a preset maximum number of iterations is reached or the solutions converge.
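A compact sketch of such a Genetic Algorithm for attribute selection is given below. The chromosome is a binary mask over attributes, and the fitness is the cross-validated accuracy of a Naive Bayes classifier on the selected attributes; the population size, generation count and mutation rate here are illustrative, not the values used in the thesis.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated accuracy of a classifier on the selected attributes."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask.astype(bool)], y, cv=3).mean()

def select_attributes(X, y, pop_size=20, generations=30, p_mut=0.02):
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))           # random initial selections
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        elite = pop[np.argsort(scores)[-pop_size // 2:]]   # keep the fittest half
        children = []
        while len(children) < pop_size - len(elite):
            a, b = elite[rng.integers(len(elite), size=2)] # pick two parents
            cut = rng.integers(1, n)                       # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < p_mut                   # mutation: flip random bits
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.vstack([elite] + children)
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[scores.argmax()].astype(bool)               # best attribute mask

# Run once per cluster, e.g. select_attributes(X[cluster_ids == c], y[cluster_ids == c]).
```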
Compared to other machine learning algorithms, we use the relatively simple Naïve Bayes as the classifier in our intrusion detection engine. Generally, Naïve Bayes uses the normal distribution to model the distribution of numeric attribute values. However, we feel this is an oversimplification: distributions arising in real-world problems seldom follow the normal distribution, leading to inaccuracy in the classifier. The accuracy of Naïve Bayes can be improved considerably if we can model the distribution of the attribute values accurately. Since most real-world distributions are too complex to be modeled by a parametric equation, we use a non-parametric technique, kernel density estimation with a Gaussian kernel, which can approximate any arbitrary probability distribution, to model the distribution of attribute values. Each step of the flow chart is explained in detail in the subsequent chapters.
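To illustrate this last step, here is a minimal kernel-density Naive Bayes, assuming continuous attributes that actually vary within each class (scipy's gaussian_kde fails on degenerate, constant columns); per-class, per-attribute kernel densities replace the single fitted Gaussian:

```python
import numpy as np
from scipy.stats import gaussian_kde

class KDENaiveBayes:
    """Naive Bayes where each per-class, per-attribute likelihood is a
    Gaussian-kernel density estimate instead of a single fitted Gaussian."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.kdes_ = {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)
            # One univariate KDE per attribute: the "naive" independence assumption.
            self.kdes_[c] = [gaussian_kde(Xc[:, j]) for j in range(X.shape[1])]
        return self

    def predict(self, X):
        scores = np.empty((len(X), len(self.classes_)))
        for i, c in enumerate(self.classes_):
            log_lik = sum(np.log(kde(X[:, j]) + 1e-12)      # avoid log(0)
                          for j, kde in enumerate(self.kdes_[c]))
            scores[:, i] = np.log(self.priors_[c]) + log_lik
        return self.classes_[scores.argmax(axis=1)]

# Demo on synthetic data where one class is clearly non-Gaussian:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.exponential(1.0, (200, 3))])
y = np.array([0] * 200 + [1] * 200)
model = KDENaiveBayes().fit(X, y)
print((model.predict(X) == y).mean())                       # training accuracy
```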
Chapter 4 Dataset
The Defense Advanced Research Projects Agency (DARPA) is an organization under the U.S. Department of Defense responsible for developing advanced technology for the military. Many projects developed by DARPA for the military have found usage in non-military domains and have had a major effect on the world; many technologies which we identify with the modern age began as DARPA projects. The Internet, which revolutionized how information is exchanged, resulted from a DARPA effort to design a robust distributed computer network for the military which could still function even when parts of it had been damaged. Likewise, many innovations, like the computer screen cursor and hypertext documents, can trace their origin to DARPA laboratories.
In order to compare the wide variety of methods proposed by researchers, we need to standardize the results by evaluating them on a standard dataset. It is particularly difficult to obtain standard datasets in the network security field because of the fear of putting unintended private information into the public domain. This has resulted in many early works on intrusion detection using their own proprietary datasets to evaluate their proposed methods, making comparison of different methods difficult.
To address this problem, the Cyber Security and Information Sciences Group (formerly the Information Systems Technology Group) of MIT Lincoln Laboratory, under Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory sponsorship, collected and distributed the first standard corpora for the evaluation of computer network intrusion detection systems. They also conducted the first formal, repeatable, and statistically significant evaluations of different Intrusion Detection Systems. These evaluations measured the probability of detection and the probability of false alarm for each system under test.
The experiment simulated the normal operation of an Air Force base with hundreds of hosts, consisting of both real and simulated systems. Normal office network traffic, consisting of mail services, web browsing, etc., was generated to simulate the day-to-day operation of the base, and intrusions exploiting weaknesses of various protocols were simulated against the systems inside the simulated Air Force base.
The physical network used for the 1998 simulation included an inside and an outside component separated by a router. The outside included two workstations which simulated gateways to a virtual outside Internet. One workstation simulated many workstations using custom software modifications of the Linux kernel provided by the Air Force ESC group. One gateway led to roughly 100 workstations and the other to thousands of web sites with actual content, updated daily. The inside included victim machines of many types (e.g. Linux, Solaris, SunOS) and a gateway to many other inside workstations. Data was collected from the inside victim running Solaris and from an outside sniffer.
New features added in the 1999 simulation:
- NT Workstation
- Insider Attacks
- NT Audit Data
- File System Probe
The 1999 intrusion simulation was similar to the 1998 setup, with a few additions. The scope of the attacks was expanded by including intrusions from inside the network. NT workstations were added to represent a more diverse environment, and NT audit data was collected. Changes in the file system made by intrusions were taken into account by taking snapshots of the file system.
In the DARPA IDS evaluation dataset, all the network traffic, including the entire payload of each packet, was recorded in tcpdump format and provided for evaluation. The data took the form of sniffed network traffic, Solaris BSM audit data, Windows NT audit data (in the case of DARPA 1999) and file system snapshots, and participants tried to identify the intrusions that had been carried out against a test network during the data-collection period. The test network consisted of a mix of real and simulated machines; background traffic was artificially generated by the real and simulated machines, while the attacks were carried out against the real machines. The dataset consists of weeks one, two and three of training data and weeks four and five of test data. In the training data, weeks one and three consist of normal traffic and week two consists of labeled attacks.
For each day of the simulation, the following files were generated:
- Outside sniffing data
- Inside sniffing data
- BSM audit data
- NT audit data
- Long listings of directory trees
- Dumps of selected directories
- A report of file system inode information
4.2 ATTACK TYPES
There was a total of 52 attacks listed in the DARPA documentation, including the new attacks used in the 1999 simulation against the NT systems. These attacks are shown below, grouped into four main classes, as mentioned in Kendall's thesis.
Denial of Service Attacks
- Apache2
- arppoison (New in 1999 test)
- Back
- Crashiis
- dosnuke (New in 1999 test)
- Land
- Mailbomb
- SYN Flood (Neptune)
- Ping of Death (POD)
- Process Table
- selfping (New in 1999 test)
- Smurf
- sshprocesstable (New in 1999 test)
- Syslogd
- tcpreset (New in 1999 test)
- Teardrop
- Udpstorm
User to Root Attacks
- anypw (New in 1999 test)
- casesen (New in 1999 test)
- Eject
- Ffbconfig
- Fdformat
- Loadmodule
- ntfsdos (New in 1999 test)
- Perl
- Ps
- sechole (New in 1999 test)
- Xterm
- yaga (New in 1999 test)
Remote to Local Attacks
- Dictionary
- Ftpwrite
- Guest
- Httptunnel
- Imap
- Named
- ncftp (New in 1999 test)
- netbus (New in 1999 test)
- netcat (New in 1999 test)
- Phf
- ppmacro (New in 1999 test)
- Sendmail
- sshtrojan (New in 1999 test)
- Xlock
- Xsnoop
Probes
- insidesniffer (New in 1999 test)
- Ipsweep
- ls_domain (New in 1999 test)
- Mscan
- NTinfoscan
- Nmap
- queso (New in 1999 test)
- resetscan (New in 1999 test)
- Saint
- Satan
4.3 KDD 99 DATASET
The DARPA dataset provided the first standard dataset with which Intrusion Detection Systems could be evaluated and compared. However, it cannot be used readily: the DARPA dataset is a collection of a large number of files in different formats, such as tcpdump files, file system snapshots, audit data, etc., and needs to be processed extensively, with the attendant possibility of introducing errors during processing. Different researchers may choose to use only parts of the DARPA dataset when evaluating their proposed methods, making the task of comparing different Intrusion Detection Systems difficult.
The KDD 99 dataset is derived from the DARPA data and makes the task of comparing Intrusion Detection Systems easier. It was originally developed for use in the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining.
The raw training data from the DARPA dataset was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.
A connection is a sequence of TCP packets starting and ending at some well-defined times, between which data flows to and from a source IP address and a target IP address under some well-defined protocol. Each connection is labeled either as normal or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.
It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data. This makes the task more realistic. Some intrusion experts believe that most novel attacks are variants of known attacks and the "signature" of known attacks can be sufficient to catch novel variants. The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only.
4.4 KDD 99 FEATURES
Each record contains 41 features derived from the various audit files in the DARPA dataset. These attributes can be classified as continuous or discrete. A complete listing of the set of features defined for the connection records is given in the tables below.
Table 4.1 Basic features of individual TCP connections.

Feature name | Description | Type
duration | length (number of seconds) of the connection | continuous
protocol_type | type of the protocol, e.g. tcp, udp, etc. | discrete
service | network service on the destination, e.g. http, telnet, etc. | discrete
src_bytes | number of data bytes from source to destination | continuous
dst_bytes | number of data bytes from destination to source | continuous
flag | normal or error status of the connection | discrete
land | 1 if connection is from/to the same host/port; 0 otherwise | discrete
wrong_fragment | number of "wrong" fragments | continuous
urgent | number of urgent packets | continuous
Table 4.2 Traffic features computed using a two-second time window.

Feature name | Description | Type
count | number of connections to the same host as the current connection in the past two seconds | continuous
serror_rate | % of connections that have "SYN" errors | continuous
rerror_rate | % of connections that have "REJ" errors | continuous
same_srv_rate | % of connections to the same service | continuous
diff_srv_rate | % of connections to different services | continuous
srv_count | number of connections to the same service as the current connection in the past two seconds | continuous
srv_serror_rate | % of connections that have "SYN" errors | continuous
srv_rerror_rate | % of connections that have "REJ" errors | continuous
srv_diff_host_rate | % of connections to different hosts | continuous
Table 4.3 Content features within a connection suggested by domain knowledge.

Feature name | Description | Type
hot | number of "hot" indicators | continuous
num_failed_logins | number of failed login attempts | continuous
logged_in | 1 if successfully logged in; 0 otherwise | discrete
num_compromised | number of "compromised" conditions | continuous
root_shell | 1 if root shell is obtained; 0 otherwise | discrete
su_attempted | 1 if "su root" command attempted; 0 otherwise | discrete
num_root | number of "root" accesses | continuous
num_file_creations | number of file creation operations | continuous
num_shells | number of shell prompts | continuous
num_access_files | number of operations on access control files | continuous
num_outbound_cmds | number of outbound commands in an ftp session | continuous
is_hot_login | 1 if the login belongs to the "hot" list; 0 otherwise | discrete
is_guest_login | 1 if the login is a "guest" login; 0 otherwise | discrete
Table 4.4 Host-based traffic features.

Feature name | Description | Type
dst_host_count | count of connections having the same destination host | continuous
dst_host_srv_count | count of connections having the same destination host and using the same service | continuous
dst_host_same_srv_rate | % of connections having the same destination host and using the same service | continuous
dst_host_diff_srv_rate | % of different services on the current host | continuous
dst_host_same_src_port_rate | % of connections to the current host having the same src port | continuous
dst_host_srv_diff_host_rate | % of connections to the same service coming from different hosts | continuous
dst_host_serror_rate | % of connections to the current host that have an S0 error | continuous
dst_host_srv_serror_rate | % of connections to the current host and specified service that have an S0 error | continuous
dst_host_rerror_rate | % of connections to the current host that have an RST error | continuous
dst_host_srv_rerror_rate | % of connections to the current host and specified service that have an RST error | continuous
4.5 Inconsistencies in KDD 99 dataset
Most of the available Intrusion Detection Systems are based on signature detection, in spite of the theoretical advantages of anomaly-based systems. Most research papers favor anomaly-based IDS because of its potential to detect novel attacks and because it does not require frequent updates of a signature database as new attacks are discovered. The traditional arguments against anomaly-based IDS are its relatively high false alarm rate and low accuracy. However, there are papers on anomaly-based IDS reporting accuracy as high as 98% and false alarm rates below 1%, which shows that anomaly-based IDS can compete with signature-based IDS in terms of accuracy and error rate. The authors of [47] conducted a study to understand the mismatch between the methods favored in the research community and industrial practice. They focused their study on the KDD 99 dataset, which is used as the de facto standard in evaluating IDS.
They identified a number of problems inherent in the KDD 99 dataset, along with practices used by researchers, which can distort the results of IDS evaluation. They proposed a number of remedies and published a dataset which addresses some of the problems.
Some of the problems identified in the KDD 99 dataset are:
- For the sake of privacy, the experiments chose to synthesize both the background and the attack data, and the data is claimed to be similar to that observed during several months of sampling from a number of Air Force bases. However, neither analytical nor experimental validation of the data's false alarm characteristics was undertaken. Furthermore, the workload of the synthesized data does not appear to be similar to the traffic in real networks.
- Traffic collectors such as TCPDump, which was used in DARPA98, are very likely to become overloaded and drop packets under heavy traffic load. However, no examination was done to check for the possibility of dropped packets.
- There is no exact definition of the attacks. For example, probing is not necessarily an attack unless the number of iterations exceeds a specific threshold. Similarly, a packet that causes a buffer overflow is not always representative of an attack. Under such conditions, there should be an agreement on the definitions between the evaluator and the evaluated. In DARPA98, however, there are no specific definitions of the network attacks.
- One of the most important deficiencies in the KDD data set is the huge number of redundant records, which causes the learning algorithms to be biased towards the frequent records, and thus prevents them from learning infrequent records, which are usually more harmful to networks, such as U2R and R2L attacks. In addition, the existence of these repeated records in the test set causes the evaluation results to be biased towards methods with better detection rates on the frequent records.
They published a dataset based on the KDD 99 dataset, named NSL-KDD, addressing some of the problems identified in their report, and made it freely available on their website [45].
Chapter 5 Preprocessing of Data
There are a total of 41 features in each record of the KDD 99 dataset. Not all of them are numeric; some are symbolic or discrete. To make calculation easier we need to convert all the symbolic attributes to numeric ones. For instance, Principal Component Analysis cannot be applied to records having symbolic attributes, and clustering is also easier to apply if all attributes are of the same data type. We convert the symbolic attributes to numeric attributes by a process called binaritizing, which adds one attribute for each possible value of the symbolic attribute. For example, protocol_type is a symbolic attribute with three possible values: icmp, tcp and udp. Three attributes representing the three possible values are added in place of protocol_type; the attribute corresponding to the value of protocol_type in a particular record is set to one and the other two are set to zero. A short sketch of this step is given after Table 5.1.
Table 5.1 Discrete attributes and the number of attributes each generates in preprocessing

| Discrete attribute | Number of attributes spawned after binaritization |
| protocol_type      | 3 attributes                                      |
| service            | 66 attributes                                     |
| flag               | 11 attributes                                     |
| land               | 2 attributes                                      |
| logged_in          | 2 attributes                                      |
| is_host_login      | 2 attributes                                      |
| is_guest_login     | 2 attributes                                      |
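For illustration, a minimal sketch of this binaritizing step using one-hot encoding in pandas; the DataFrame contents and variable names here are illustrative, not taken from the thesis implementation:

```python
import pandas as pd

# Toy slice of KDD 99-style records; only protocol_type is symbolic here.
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "src_bytes":     [181, 105, 520, 239],
})

# Binaritize: one new 0/1 attribute per possible value of the symbolic attribute.
binarized = pd.get_dummies(df, columns=["protocol_type"], dtype=int)
print(binarized.columns.tolist())
# ['src_bytes', 'protocol_type_icmp', 'protocol_type_tcp', 'protocol_type_udp']
```

Each record then has exactly one of the three protocol_type_* attributes set to 1 and the others set to 0, as described above.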
The binaritizing process adds many attributes whose value is constant for all records in the dataset. Such attributes contribute no information when classifying the records, yet they complicate the computation of the covariance matrix, and a large number of attributes makes classification more computationally intensive. Therefore all attributes whose standard deviation is zero are removed. After binaritizing and removing the constant attributes, there are a total of 117 attributes in each record. The ranges of values taken by the KDD 99 attributes differ greatly from one attribute to another and are highly non-uniform. While some attributes such as serror_rate range between 0 and 1, other attributes like count can take any arbitrary integer value. This can cause a problem when calculating principal components: since covariance is computed from products of the deviations of the variables from their means, it is highly sensitive to the range of the variables.
$$\mathrm{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$
Chapter 6 Principal Component Analysis
A large number of attributes adversely affects the speed and efficiency of machine learning classifiers. Non-essential attributes not only slow down the algorithm; they can degrade the accuracy of the Intrusion Detection System by acting as noise in the data. For optimum accuracy and efficiency, it is important to keep the number of attributes to a minimum.
Researchers have proposed a number of methods to identify the most important attributes and reduce the dimension of the source dataset. Methods ranging from entropy to information gain have been applied to the KDD 99 dataset to remove redundant attributes and select the most discriminating ones. For our proposed method, dimensionality reduction is achieved by applying Principal Component Analysis, which transforms the attributes of the source dataset into orthogonal components.
Principal Component Analysis is a feature reduction procedure. It is useful when we have obtained data on a number of variables (possibly a large number) and suspect that there is some redundancy among them. In this context, redundancy means that some of the variables are correlated with one another, possibly because they measure the same construct. Because of this redundancy, it should be possible to reduce the observed variables to a smaller number of principal components (transformed attributes) that account for most of the variance in the observed variables. Principal Component Analysis does not just transform the dataset; it also sorts the resulting components by the amount of variance of the data along each component. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
Principal Component Analysis, or PCA, is a linear transformation which takes a set of possibly correlated variables and transforms them into a set of uncorrelated variables which are orthogonal to each other. PCA finds the directions (vectors) which maximize the variance of the data, and the source data are projected onto these vectors. To perform PCA, the sample covariance matrix of the dataset is calculated first; then its eigenvectors are found along with their corresponding eigenvalues. The eigenvectors are sorted by their eigenvalues, since the variance of the data along an eigenvector is directly proportional to its eigenvalue. The source dataset is projected onto the eigenvectors to obtain the transformed dataset.
6.2 Principal Component Geometrical Interpretation
For a geometric interpretation of principal components, suppose we have two variables, X1 and X2, centered at their respective means (i.e., the means of the scores on X1 and X2 are zero). The scatter diagram of the sample points then forms an ellipse. The first principal component is a line through the widest part of the ellipse; the second principal component is the line at right angles to the first, through the next widest part, and orthogonal to the rest; and so on. Equivalently, we take our original frame of reference and perform a rigid rotation around the origin to obtain a new set of axes, where the origin is given by the sample means (of zero) on the two variables X1 and X2. (To make the same geometric points, we could have used a constant density contour of a bivariate normal pair of random variables X1 and X2 with zero mean vector.)
As an example of how to place the components geometrically, suppose we have the two variables X1 and X2 with a given variance-covariance matrix.
6.3 Why Eigenvectors are Chosen
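In brief, by the standard derivation: the unit vector $w$ maximizing the variance of the projected data, $w^{\top}\Sigma w$ subject to $\|w\| = 1$, is found via the Lagrangian

$$L(w, \lambda) = w^{\top}\Sigma w - \lambda\,(w^{\top}w - 1), \qquad \nabla_w L = 2\Sigma w - 2\lambda w = 0 \;\Rightarrow\; \Sigma w = \lambda w.$$

Thus the variance-maximizing directions are exactly the eigenvectors of the covariance matrix $\Sigma$, and the variance attained along each direction equals its eigenvalue $\lambda$, which is why the components are ranked by eigenvalue.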
6.4 PCA Algorithm
To apply Principal Component Analysis to our dataset we first need to preprocess the data. Any attribute with a constant value across all records must be removed, as it can cause a division by zero during covariance matrix calculation and contributes nothing to determining the class of a record. Since Principal Component Analysis is based on the covariance matrix, it is sensitive to the variance of the data. Attributes with a large range generally have large variance and can distort the result of PCA. To prevent such distortion we normalize the attributes of the data, i.e. scale each attribute's range to between 0 and 1.
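Putting the normalization and the eigendecomposition together, a compact sketch of the procedure, assuming the preprocessed records form the rows of a NumPy array X (function and variable names are ours, not from the thesis implementation):

```python
import numpy as np

def pca_transform(X, n_components=50):
    """Scale attributes to [0, 1], then project the records onto the top
    eigenvectors of the sample covariance matrix (the principal components)."""
    # Min-max normalization; constant attributes are assumed already removed,
    # so every range is non-zero.
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X = X - X.mean(axis=0)                      # center on the mean
    cov = np.cov(X, rowvar=False)               # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric-matrix eigendecomposition
    order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
    W = eigvecs[:, order[:n_components]]        # top-variance directions
    return X @ W                                # project records onto them
```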
Chapter 7 Clustering
After transforming the dataset with Principal Component Analysis and selecting the components with the most variance, the dataset has 50 transformed attributes. Although these components have relatively large variance, they may not contribute much information to the classification of the records. It is also possible that an attribute which plays an important role in identifying a particular attack acts as noise when identifying other attacks; hence the need for a different set of attributes for each type of attack. It is not practical to select a different set of attributes for each individual attack type, but attacks with similar properties can be grouped together and common attributes selected for each group. This grouping of similar records can be accomplished using a clustering algorithm.
Clustering is a process which organizes records into groups such that the records in each group are more similar to each other than to the records in other groups. It is frequently used in data mining and is a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval and bioinformatics. The word similarity is quite ambiguous in data analysis and can be interpreted in a number of ways, so there are many clustering algorithms, each working with a different notion of cluster and a different way of finding similarity efficiently. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions.
Since clustering is the grouping of similar instances/objects, some measure that can determine whether two objects are similar or dissimilar is required. There are two main types of measures used to estimate this relation: distance measures and similarity measures. Many clustering methods use distance measures to determine the similarity or dissimilarity between any pair of objects.
7.1 Distance Measure
7.2 Similarity Functions
7.2.1 Cosine Measure
7.2.2 Pearson correlation measure
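These subsections rely on the standard definitions; for two records x and y with attribute values x_i and y_i:

$$d(x, y) = \sqrt{\sum_{i}(x_i - y_i)^2} \quad \text{(Euclidean distance)}$$

$$\mathrm{sim}_{\cos}(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}} \quad \text{(cosine measure)}$$

$$r(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} \quad \text{(Pearson correlation)}$$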
7.3 Basic Clustering Techniques
Since what constitutes a cluster is not clearly defined, there are many clustering algorithms, each based on a different principle. They can be divided into two main groups: hierarchical and partitioning methods.
Hierarchical methods construct the clusters by recursively partitioning the instances in either a top-down or bottom-up fashion. They can be subdivided as follows:
- Agglomerative hierarchical clustering - each object initially represents a cluster of its own. Clusters are then successively merged until the desired cluster structure is obtained.
- Divisive hierarchical clustering All objects initially belong to one cluster. Then the cluster is divided into sub-clusters, which are successively divided into their own sub-clusters. This process continues until the desired cluster structure is obtained.
The merging or division of clusters is performed according to some similarity measure, chosen so as to optimize some criterion (such as a sum of squares). Hierarchical clustering methods can be further divided according to the manner in which the similarity measure is calculated:
- Single-link clustering (also called the connectedness, minimum, or nearest-neighbor method) - methods that take the distance between two clusters to be the shortest distance from any member of one cluster to any member of the other cluster.
- Complete-link clustering (also called the diameter, maximum, or furthest-neighbor method) - methods that take the distance between two clusters to be the longest distance from any member of one cluster to any member of the other cluster.
- Average-link clustering (also called the minimum variance method) - methods that take the distance between two clusters to be the average distance from any member of one cluster to any member of the other cluster.
7.4 Partitioning Methods
Partitioning methods relocate instances by moving them from one cluster to another, starting from an initial partitioning. Such methods typically require that the number of clusters be pre-set by the user. To achieve global optimality in partition-based clustering, an exhaustive enumeration of all possible partitions would be required. Because this is not feasible, greedy heuristics are used in the form of iterative optimization: a relocation method iteratively moves points between the k clusters.
- Error Minimization Algorithms. These algorithms, which tend to work well with isolated and compact clusters, are the most intuitive and frequently used methods. The basic idea is to find a clustering structure that minimizes a certain error criterion measuring the distance of each instance to its representative value. The most well-known criterion is the Sum of Squared Error (SSE), which measures the total squared Euclidean distance of instances to their representative values.
- Graph-Theoretic Clustering. Graph-theoretic methods produce clusters via graphs: the edges of the graph connect the instances, which are represented as nodes. Inconsistent edges are edges whose weight (in the case of clustering, length) is significantly larger than the average of nearby edge lengths.
- Density-based Methods: Density-based methods assume that the points that belong to each cluster are drawn from a specific probability distribution. The overall distribution of the data is assumed to be a mixture of several distributions. The aim of these methods is to identify the clusters and their distribution parameters. These methods are designed for discovering clusters of arbitrary shape which are not necessarily convex.
7.5 K-mean clustering
After transforming the data with PCA, we have a dataset composed of 50 independent (orthogonal) components. Although Principal Component Analysis gives us the components with the most variance, it tells us nothing about the importance of each component's role in identifying each intrusion type. It is also possible that not all components (or attributes) are equally important in identifying all the intrusion types. Therefore we propose to group the dataset into similar groups before training the classifier. This grouping can be achieved by applying any clustering algorithm; each algorithm yields different groups according to its notion of a cluster.
Some clustering algorithms, such as k-mean clustering, can cluster the dataset very efficiently in a short time if the number of clusters is known a priori. K-mean clustering is one of the simplest yet most efficient clustering algorithms. It belongs to the partitioning family of clustering algorithms and uses Euclidean distance to measure similarity between records. It starts with k instances initialized as cluster means. The cluster heads can be chosen in many ways: instances can be chosen randomly, or chosen so that each cluster head lies farthest from the others, with the idea that this will capture the underlying structure better. The k-mean algorithm then iterates through the dataset, assigning each record to the nearest cluster and updating the cluster position after each new record has been assigned to it. The algorithm iterates through the dataset repeatedly until the clusters stop moving and no records are reassigned from one cluster to another.
7.6 K-mean Clustering Algorithm
- Initialize the first k records as the centroids of the k clusters.
- Calculate the Euclidean distance of each record x from the centroid of each of the k clusters and assign the record to the nearest cluster.
- Set the position of each cluster centroid to the mean of all the records assigned to that cluster.
- Repeat steps 2 and 3 until the centroid of each cluster stops moving.
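A minimal NumPy rendering of these four steps (a sketch: initialization from the first k records as in step 1, with an added guard for clusters that end up empty; all names are ours):

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    centroids = X[:k].copy()                    # step 1: first k records as centroids
    for _ in range(max_iter):
        # step 2: Euclidean distance to each centroid, assign nearest cluster
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned records
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```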
Chapter 8 Genetic Algorithm
Principal Component Analysis helps us find the directions of maximum variance in the data, but it does not tell us which principal components will be most useful in determining the class of the records. Though a component may have maximum variance, it may not provide much information about the class of a record, and the importance of each component may vary with the attack type. Redundant attributes increase computational time and may harm the accuracy of the classifier, so it is important to select the minimum set of features that gives maximum accuracy while keeping the CPU load low. There are 2^n possible subsets of n attributes; for the 50 principal components we selected, there are 2^50 combinations of components. Trying out every combination is not feasible, as it would take far too much time.
Different researchers have used different approaches for identifying and selecting the most important attributes that give optimum accuracy without burdening the CPU. In Intrusion Detection Systems, Information Gain is often employed to identify the attributes which give the most information about the class of a record. To calculate the Information Gain of an attribute, we first measure the randomness, or entropy, of the dataset; we then partition the dataset into groups according to the value of the attribute and measure the entropy of the partitioned dataset. The difference between the two measures is the Information Gain of the attribute (see the formulas below). Genetic Algorithms are also applied to select attributes. A Genetic Algorithm is modeled on the biological evolutionary process and mimics mechanisms such as natural selection, crossover and mutation. It first generates a random initial population, then subjects the population to evolutionary pressure, selecting the best individuals in each generation and allowing them to propagate to the next. The process continues until the solution converges or the number of generations reaches the maximum allowed. Particle Swarm Optimization is also based on an iterative evolutionary process and is related to Genetic Algorithms; it is modeled after the natural swarm behavior of organisms such as flocks of birds or swarms of insects. For our proposed method we use a Genetic Algorithm to search the solution space for the most contributing features for each cluster.
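For reference, the standard definitions behind the Information Gain computation just described are

$$H(D) = -\sum_{c} p_c \log_2 p_c, \qquad IG(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v),$$

where D is the dataset, p_c is the proportion of records in class c, and D_v is the subset of records for which attribute A takes value v.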
Genetic Algorithms (GAs) are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics. As such, they represent an intelligent exploitation of random search for solving optimization problems. The basic techniques of GAs are designed to simulate processes in natural systems necessary for evolution, especially those that follow the principle first laid down by Charles Darwin of "survival of the fittest".
8.1 Solution Representation
To use a solution in a Genetic Algorithm, we need to represent it in a manner that mimics the information-carrying mechanism of chromosomes, so that the evolutionary operations of selection, mutation and crossover can be applied to it easily. Generally, solutions are represented as vectors over some alphabet, usually binary (0, 1). Each letter represents a gene, and the vector represents an individual, or chromosome, in GA terms.
In our work, individuals in the Genetic Algorithm represent different selections of attributes. Each solution is represented by a binary string of length 50, with each binary letter denoting an attribute. A 1 in a particular position of the string represents inclusion of the corresponding principal component in the solution, and similarly a 0 represents its exclusion from the particular selection of attributes represented by that solution.
Figure 8.1 shows an example of a binary string of length 6 which can be used to encode a selection of attributes. This particular string encodes a selection of attributes 1, 2 and 6.

| 1 | 1 | 0 | 0 | 0 | 1 |

Figure 8.1 Binary string
8.2 Initialization of population:
A Genetic Algorithm needs an initial set of individuals to start with. The initial population can be generated randomly or seeded in particular areas where solutions are likely to be found. In our implementation we use a random generator to create an initial population of 35 individuals.
8.3 Selection and Fitness Function:
During each successive generation, a proportion of the existing population is selected to breed a new generation. Individual solutions are selected through a fitness-based process, where fitter solutions (as measured by a fitness function) are typically more likely to be selected. Certain selection methods rate the fitness of every solution and preferentially select the best ones; other methods rate only a random sample of the population, as the former can be very time-consuming.
The fitness function is defined over the genetic representation and measures the quality of the represented solution. It is always problem-dependent; for classification problems it can be a function measuring the accuracy of the solution. In our implementation we use Naïve Bayes with kernel density estimation as the fitness function: the accuracy of Naïve Bayes on the attribute subset encoded by a solution is its fitness value.
We use a fitness-proportionate algorithm called roulette selection to choose the individuals that generate the next generation, in which the chance of selecting an individual is directly proportional to its fitness value. We cannot simply select only the best solutions from each generation, because that would reduce the genetic diversity of the population and can lead to premature convergence on a local optimum. The selection algorithm must ensure that there is enough diversity in the population by giving every individual, including non-optimum solutions, a chance of selection.
8.4 Roulette Selection Algorithm:
- for all members of population
- sum += fitness of this individual
- end for
- for all members of population
- probability = fitness / sum
- end for
- loop until new population is full
- number = random value between 0 and 1
- cumulative probability = 0
- for all members of population
- cumulative probability += probability of this individual
- if cumulative probability > number then select this individual and stop the inner loop
- end for
- end loop
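The pseudocode above translates directly into, for example, the following sketch (the population is assumed to be a list of individuals with one precomputed fitness value each; all names are ours):

```python
import random

def roulette_select(population, fitnesses):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    r = random.uniform(0, total)
    cumulative = 0.0                 # reset for every draw
    for individual, fitness in zip(population, fitnesses):
        cumulative += fitness
        if cumulative >= r:
            return individual
    return population[-1]            # guard against floating-point round-off
```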
8.5 Crossover
Crossover is akin to sexual reproduction in nature, which sometimes produces offspring of better quality than both parents through inheriting the better genes of each. This is replicated in a Genetic Algorithm through the crossover operation, which exchanges genes between selected individuals in the hope of generating better solutions. Like offspring in nature, which combine genetic material from both parents, a next-generation individual is produced by crossing the genes of a pair of selected individuals.
The degree of crossover in a Genetic Algorithm is controlled by the crossover rate p_c: with probability p_c the selected parents are crossed over to produce new individuals, and with probability (1 - p_c) the selected individuals are passed to the next generation without crossover.
Once the individuals are selected, a crossover point is chosen at random; the binary strings representing the solutions are split at that point and cross-combined to form two new individuals.
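A sketch of this single-point crossover on the 50-bit attribute strings (individuals are represented as lists of 0/1 values; the default crossover rate of 0.7 is purely illustrative, as the thesis does not state the rate it used):

```python
import random

def crossover(parent1, parent2, crossover_rate=0.7):
    """Split both parents at a random point and swap the tails.
    With probability (1 - crossover_rate) the parents pass through unchanged.
    The 0.7 default is an assumption, not a value from the thesis."""
    if random.random() >= crossover_rate:
        return parent1[:], parent2[:]
    point = random.randint(1, len(parent1) - 1)   # random crossover point
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2
```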
8.6 Mutation
Mutation is a genetic operator used to maintain genetic diversity from one generation of a population of Genetic Algorithm chromosomes to the next. It is analogous to biological mutation. Mutation alters one or more gene values in a chromosome from its initial state, so the solution may change entirely from the previous solution; hence a GA can reach better solutions by using mutation. Mutation occurs during evolution according to a user-definable mutation probability. This probability should be set low: if it is set too high, the search turns into a primitive random search.
In our implementation, mutation is achieved by single-point mutation, i.e. by flipping a randomly chosen single bit, as illustrated in Figure 8.3.
| 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |

mutate (fourth bit flipped)

| 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |

Figure 8.3 Mutation
Mutation Algorithm
- Generate a random number r between 0 and 1
- If r < mutation rate
- Flip a randomly chosen bit of the solution
- Else
- Pass the individual unchanged to the next generation
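The same steps as a small function (the default mutation rate of 0.01 is illustrative only; the thesis does not state its value):

```python
import random

def mutate(individual, mutation_rate=0.01):
    """Single-point mutation: with probability mutation_rate, flip one
    randomly chosen bit; otherwise pass the individual through unchanged.
    The 0.01 default is an assumption, not a value from the thesis."""
    if random.random() < mutation_rate:
        i = random.randrange(len(individual))
        individual = individual[:i] + [1 - individual[i]] + individual[i + 1:]
    return individual
```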
8.7 Steps for GA
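Combining the pieces above (random initialization of 35 individuals, roulette selection, single-point crossover and mutation), the overall loop can be sketched as follows, reusing the roulette_select, crossover and mutate functions from the previous sections. Here fitness_fn, which would score a bitstring by the Naïve Bayes accuracy of the encoded attribute subset, is assumed to be supplied by the caller:

```python
import random

def genetic_algorithm(fitness_fn, n_bits=50, pop_size=35, generations=100):
    # Random initial population of bitstrings.
    population = [[random.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness_fn)
    for _ in range(generations):
        fitnesses = [fitness_fn(ind) for ind in population]
        next_gen = []
        while len(next_gen) < pop_size:
            # Fitness-proportionate selection of two parents.
            p1 = roulette_select(population, fitnesses)
            p2 = roulette_select(population, fitnesses)
            # Crossover and mutation produce the next generation.
            c1, c2 = crossover(p1, p2)
            next_gen += [mutate(c1), mutate(c2)]
        population = next_gen[:pop_size]
        best = max(population + [best], key=fitness_fn)
    return best
```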
Chapter 9 Naïve Bayes
Naïve Bayes is a probabilistic machine learning algorithm based on Bayesian learning with a strong independence assumption among its attributes. In spite of this simplistic assumption it gives good accuracy, and its performance is comparable with more advanced classifiers such as SVM. It is quite efficient, training does not require multiple iterations as in many other machine learning algorithms, and it scales well with a large number of attributes.
9.1 Bayes Theorem:
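In the notation used below, the theorem reads

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)},$$

so the class posterior can be computed from the class-conditional likelihood P(X|Y) and the prior P(Y).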
9.2 Learning in Bayes Classifier
If we are going to train a Bayes classifier by estimating P(X|Y) and P(Y), it is reasonable to ask how much training data will be required to obtain reliable estimates of these distributions. Let us assume training examples are generated by drawing instances at random from an unknown underlying distribution P(X).
A hundred independently drawn training examples will usually suffice to obtain a maximum likelihood estimate of P(Y) that is within a few percent of its correct value when Y is a Boolean variable. However, accurately estimating P(X|Y) typically requires many more examples. To see why, consider the number of parameters we must estimate when Y is Boolean and X is a vector of n Boolean attributes. In this case, we need to estimate a set of parameters
$$\theta_{ij} = P(X = x_i \mid Y = y_j) \qquad (9.2)$$
where the index i takes on $2^n$ possible values (one for each possible vector value of X) and j takes on 2 possible values. Therefore, we will need to estimate approximately $2^{n+1}$ parameters. To calculate the exact number of required parameters, note that for any fixed j, the sum over i of $\theta_{ij}$ must be one. Therefore, for any particular value $y_j$ and the $2^n$ possible values $x_i$, we need compute only $2^n - 1$ independent parameters. Given the two possible values of Y, we must estimate a total of $2(2^n - 1)$ such $\theta_{ij}$ parameters. Unfortunately, this corresponds to two distinct parameters for each distinct instance in the instance space for X. Worse yet, to obtain reliable estimates of each of these parameters, we would need to observe each of these distinct instances multiple times, which is clearly unrealistic in most practical learning domains. For example, if X is a vector containing 30 Boolean features, then we would need to estimate $2(2^{30} - 1)$, i.e. more than two billion, parameters.
9.3 Learning in Naïve Bayes
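The naïve conditional-independence assumption factorizes the class-conditional distribution as

$$P(X_1, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y), \qquad \hat{y} = \arg\max_{y}\; P(Y = y)\prod_{i=1}^{n} P(X_i = x_i \mid Y = y),$$

which reduces the $2(2^n - 1)$ parameters above to just $2n$ for n Boolean attributes.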
9.4 Naïve Bayes with Kernel Density Estimation
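As a concrete illustration, here is a minimal sketch of a per-attribute Gaussian-kernel density estimate that can replace the single Gaussian in the Naïve Bayes likelihood. The bandwidth choice (Silverman's rule of thumb) and all names are our assumptions; the thesis does not specify these details.

```python
import numpy as np

def kde_likelihood(x, samples, bandwidth=None):
    """Kernel density estimate of p(x), built from the training values
    of one attribute within one class, using a Gaussian kernel."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    if bandwidth is None:
        # Silverman's rule-of-thumb bandwidth (an assumed choice);
        # the small floor guards against constant-valued attributes.
        bandwidth = max(1.06 * samples.std() * n ** (-0.2), 1e-6)
    u = (x - samples) / bandwidth
    # Average of Gaussian kernels centered at each training value.
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).sum() / (n * bandwidth)
```

In classification, the product over attributes of kde_likelihood evaluated at each attribute value, per class, would replace the product of Gaussian densities in the decision rule above.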
Chapter 10 Results and Discussion
Principal Component Analysis finds the directions, called principal components, along which the data are most spread out; the data are then projected onto these components. We examined three components, components 1, 25 and 49, to illustrate the amount of variance in the data along the principal components: the variance gets progressively smaller as we move further down the list of components.
10.1.2 K-mean Clustering results
Table 10.1 K-mean clustering results

| Cluster   | Probe | DOS  | R2L | U2R | Normal | Total |
| Cluster 1 | 1027  | 1382 | 10  | 10  | 551    | 2970  |
| Cluster 2 | 1142  | 755  | 0   | 1   | 2768   | 4666  |
| Cluster 3 | 89    | 6900 | 2   | 0   | 22     | 7013  |
| Cluster 4 | 0     | 196  | 2   | 0   | 7068   | 7266  |
| Cluster 5 | 31    | 1    | 195 | 10  | 3040   | 3277  |
The results of applying k-mean clustering with k = 5 are shown in Table 10.1. The first two clusters predominantly contain Probe and DOS records, whereas the last two clusters contain mainly normal records. The majority of DOS records fall into the third cluster.
10.1.3 Genetic Algorithm and its selection of PCA components
Table 10.2 Selected attributes for each cluster

| Cluster   | Attributes |
| Cluster 1 | 1, 2, 4, 6, 9, 10, 11, 12, 13, 17, 18, 19, 20, 23, 25, 26, 28, 30, 31, 34, 37, 39, 49, 50 |
| Cluster 2 | 1, 3, 4, 6, 7, 14, 16, 17, 18, 19, 22, 23, 32, 33, 38, 42, 43, 47, 50 |
| Cluster 3 | 2, 6, 10, 11, 17, 19, 22, 23, 26, 31, 32, 39, 40, 45, 49 |
| Cluster 4 | 1, 2, 10, 11, 16, 17, 18, 24, 29, 33, 37, 38, 39, 40, 42, 43, 45, 49, 50 |
| Cluster 5 | 1, 2, 5, 6, 7, 8, 9, 10, 14, 17, 18, 25, 33, 35, 44, 46, 50 |
Table 10.2 shows the optimum combination of attributes for each cluster as found using the Genetic Algorithm. This selection not only improves accuracy; it also improves efficiency and resource consumption, as the classifier has to deal with fewer attributes. Each cluster has its own unique selection of attributes, different from the others: the first cluster selects 24 components out of 50, the second 19, the third 15, the fourth 19, and the last 17.
10.1.4 Naïve Bayes with kernel estimation
Figures 10.4, 10.5 and 10.6 show the probability distribution of different PCA components as estimated by three methods: normal distribution, histogram, and kernel estimate. Traditionally, Naïve Bayes uses the normal distribution to estimate the probability distribution of numeric attributes; however, it cannot match many distributions encountered in real-life problems, as is evident in Figure 10.4. Therefore we used a kernel estimate to model the probability distribution of attributes, as it can model any arbitrary distribution. The kernel estimates are in close agreement with the histograms of the data.
Table 10.3 Accuracy of different methods

| Method                                                                                              | Accuracy |
| Naïve Bayes with Gaussian estimate                                                                  | 89.59%   |
| Naïve Bayes with Gaussian estimate and clustering                                                   | 92.04%   |
| Naïve Bayes with kernel estimate and PCA top 20 components                                          | 95.53%   |
| Naïve Bayes with kernel estimate, PCA transformation and Genetic Algorithm selection of components  | 98.05%   |
Table 10.3 shows the accuracy of the different methods we evaluated. Naïve Bayes with a kernel estimate of the probability distribution, PCA transformation of the dataset, and Genetic Algorithm selection of attributes gives the best accuracy, while traditional Naïve Bayes with a Gaussian estimate gives the worst.
Chapter 11 Conclusion and Future Works
Network researchers have taken an active interest in Intrusion Detection for the last two decades because of the danger posed by security breaches in information systems. This danger has increased many-fold with the exponential growth of the internet and the proliferation of network-enabled smart devices in the last few years. Though researchers have proposed many methods for detecting and protecting networks from intrusion, most of these methods are yet to be seen in presently available Intrusion Detection Systems. There is a mismatch between the methods proposed by the research community and the methods practiced in industry. We feel this mismatch is due to researchers' one-dimensional approach and their tendency to employ ever more complex and resource-hogging methods with the sole aim of improving detection accuracy, without much thought for the practicality of the methods. We have shown in our work that the performance of an Intrusion Detection System can be improved significantly by exploiting the structure inherent in the dataset. In spite of the simplicity of the classifier employed, we obtain results comparable to methods that are far more complex and resource-intensive. We have shown that records in the KDD 99 dataset can be grouped into clusters with similar properties, and that each cluster can have its own unique selection of attributes which gives the optimum result. The probability distributions of attributes derived from real-life problems seldom follow the normal distribution generally used to model them, and they can be modelled better by kernel estimation.
In future we plan to better capture the structure of the data by employing more advanced clustering algorithms such as SOM and density-based clustering methods. A Genetic Algorithm can search the solution space for the optimum solution, but its main drawback is the relatively long time spent searching. We plan to reduce the search time by employing more time-efficient search methods such as Particle Swarm Optimization.