Risk Factors to Retrieve Anomaly Intrusion Information and Profile User Behavior

Yun Wang (Yale University, Yale-New Haven Health System & Qualidigm, USA) and Lee Seidman (Qualidigm, USA)
DOI: 10.4018/978-1-60566-148-3.ch017


The use of network traffic audit data for retrieving anomaly intrusion information and profiling user behavior has been studied previously, but the risk factors associated with attacks remain unclear. This study aimed to identify a set of robust risk factors by applying bootstrap resampling and logistic regression modeling to the KDD-cup 1999 data. Of the 46 variables examined, 16 were identified as robust risk factors. The resulting classification achieved sensitivity, specificity, and a correctly classified rate similar to those of the KDD-cup 1999 winning entry, which used a rule-based decision tree algorithm with all variables. The study emphasizes that bootstrap simulation and logistic regression modeling offer a novel approach to understanding and identifying risk factors for better information protection in network security.
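The three performance measures named above can be illustrated with a minimal sketch. It assumes each connection has been coded as 1 (attack) or 0 (normal); the function name and the coding are illustrative assumptions, not taken from the chapter.

```python
def classification_summary(actual, predicted):
    """Summarize binary predictions; 1 = attack, 0 = normal (assumed coding)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # attacks flagged as attacks
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # normal kept as normal
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # normal flagged as attacks
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # attacks missed
    total = tp + tn + fp + fn
    return {
        # share of true attacks that were detected
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # share of normal connections that were left unflagged
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        # share of all connections classified correctly
        "correctly_classified": (tp + tn) / total if total else float("nan"),
    }
```

Calling `classification_summary(actual, predicted)` with two equal-length sequences of 0/1 values returns the three rates as a dictionary.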
Chapter Preview


Data Source

The study sample was drawn from the Third International Knowledge Discovery and Data Mining Tools Competition 1999 data (KDD-cup, 1999), which was created from the 1998 Defense Advanced Research Projects Agency (DARPA) Intrusion Detection Evaluation offline database developed by the Lincoln Laboratory at Massachusetts Institute of Technology (Cunningham, Lippmann, Fried, Garfinkle, Graf, Kendall, et al., 1999). The full KDD-cup data were generated on a network that simulated 1,000 Unix hosts and 100 users (Lippmann & Cunningham, 2000). They comprise 7 weeks of TCP dump network traffic used as training data, which were processed into about 5 million connection records, plus 2 weeks of testing data, and they cover 34 different attack types. The test data do not have the same probability distribution as the training data, and they include additional specific attack types that were not in the training data. The data unit is a connection, which consists of about 100 bytes of information and represents a sequence of TCP packets starting and ending within a fixed time window, during which data flow between a source IP address and a destination IP address under pre-defined protocols. Each connection record is labeled as either normal or a specific attack type. This study used 10% of the training data as a derivation dataset and the full test data as a validation dataset to identify and examine the risk factors.
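As a rough illustration of how labeled connection records might feed a bootstrap procedure, the sketch below recodes each label as a binary outcome and draws replicates with replacement. The record layout (a pair of feature dictionary and label), the function names, and the collapsing of all attack types into a single class are assumptions made for illustration. The chapter's actual modeling step, fitting a logistic regression to each replicate, is not shown.

```python
import random


def to_binary_outcome(label):
    """Code a connection label as 0 (normal) or 1 (any attack type).

    Assumes labels such as "normal." or "smurf."; a trailing period is ignored.
    """
    return 0 if label.strip().rstrip(".").lower() == "normal" else 1


def bootstrap_sample(records, rng):
    """Draw len(records) records with replacement (one bootstrap replicate)."""
    return [rng.choice(records) for _ in records]


def bootstrap_attack_rates(records, n_replicates=100, seed=0):
    """Return the attack proportion observed in each bootstrap replicate.

    In a fuller analysis, a model would be fit to each replicate here instead.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    rates = []
    for _ in range(n_replicates):
        sample = bootstrap_sample(records, rng)
        outcomes = [to_binary_outcome(label) for _, label in sample]
        rates.append(sum(outcomes) / len(outcomes))
    return rates
```

A variable could then be judged robust by how consistently it is retained across replicates, although the specific retention criterion used in the chapter is not stated in this excerpt.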
