.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. Because the initial goal was to produce a large training set for
supervised learning algorithms, a large proportion (80.1%) of the data is
abnormal, which is unrealistic in the real world and inappropriate for
unsupervised anomaly detection, which aims at detecting 'abnormal' data, i.e.
data that is:

* qualitatively different from normal data
* in a small minority among the observations.

We thus transform the KDD dataset into two different datasets, SA and SF:

* SA is obtained by selecting all the normal data and a small
  proportion of the abnormal data, to give an anomaly proportion of 1%.

* SF is obtained as in [3]_
  by keeping only the data whose ``logged_in`` attribute is positive, thus
  focusing on intrusion attacks, which gives an attack proportion of 0.3%.

* http and smtp are two subsets of SF whose third feature is
  equal to 'http' and 'smtp', respectively.
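
The subset definitions above can be illustrated on toy records. The following
is a minimal sketch, not scikit-learn's actual implementation: the helper names
``make_sa``, ``make_sf`` and ``by_service``, the dictionary keys, and the toy
data are all assumptions made for the example.

```python
import random


def make_sa(records, anomaly_fraction=0.01, seed=0):
    """Keep every normal record plus a random sample of abnormal ones."""
    normal = [r for r in records if r["label"] == "normal."]
    attacks = [r for r in records if r["label"] != "normal."]
    # Solve k / (len(normal) + k) = anomaly_fraction for k.
    k = round(anomaly_fraction * len(normal) / (1.0 - anomaly_fraction))
    k = min(k, len(attacks))
    rng = random.Random(seed)
    return normal + rng.sample(attacks, k)


def make_sf(records):
    """Keep only the records whose logged_in attribute is positive."""
    return [r for r in records if r["logged_in"] > 0]


def by_service(records, service):
    """Keep only the records with the given service, e.g. 'http' or 'smtp'."""
    return [r for r in records if r["service"] == service]


# Hypothetical toy data: 990 normal connections and 500 attacks.
toy = [
    {"service": "http" if i % 3 else "smtp", "logged_in": i % 2, "label": "normal."}
    for i in range(990)
] + [
    {"service": "ecr_i", "logged_in": 0, "label": "smurf."} for _ in range(500)
]

sa = make_sa(toy)
sf = make_sf(toy)
http = by_service(sf, "http")
smtp = by_service(sf, "smtp")
```

Taking the anomaly proportion as a parameter keeps the sketch independent of
the exact sample counts used for the real subsets.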

General KDD structure:

================      ==========================================
Samples total         4898431
Dimensionality        41
Features              discrete (int) or continuous (float)
Targets               str, 'normal.' or name of the anomaly type
================      ==========================================

SA structure:

================      ==========================================
Samples total         976158
Dimensionality        41
Features              discrete (int) or continuous (float)
Targets               str, 'normal.' or name of the anomaly type
================      ==========================================

SF structure:

================      ==========================================
Samples total         699691
Dimensionality        4
Features              discrete (int) or continuous (float)
Targets               str, 'normal.' or name of the anomaly type
================      ==========================================

http structure:

================      ==========================================
Samples total         619052
Dimensionality        3
Features              discrete (int) or continuous (float)
Targets               str, 'normal.' or name of the anomaly type
================      ==========================================

smtp structure:

================      ==========================================
Samples total         95373
Dimensionality        3
Features              discrete (int) or continuous (float)
Targets               str, 'normal.' or name of the anomaly type
================      ==========================================

:func:`sklearn.datasets.fetch_kddcup99` loads the kddcup99 dataset; it
returns a dictionary-like object with the feature matrix in the ``data`` member
and the target values in ``target``. The optional ``as_frame`` argument converts
``data`` into a pandas DataFrame and ``target`` into a pandas Series. The
dataset is downloaded from the web if necessary.
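
As a minimal sketch of how the loader might be used, the snippet below wraps
:func:`sklearn.datasets.fetch_kddcup99` and computes the share of abnormal
targets. The helper names ``load_kdd_subset`` and ``anomaly_fraction`` are
hypothetical, and the handling of byte-string labels is a precaution that
assumes targets may be returned as bytes in some scikit-learn versions.

```python
def anomaly_fraction(labels):
    """Return the share of labels that differ from 'normal.'."""
    total = 0
    abnormal = 0
    for label in labels:
        # Assumption: labels may be bytes rather than str, so decode first.
        if isinstance(label, bytes):
            label = label.decode("utf-8")
        total += 1
        abnormal += label != "normal."
    return abnormal / total if total else 0.0


def load_kdd_subset(subset="SA", as_frame=False):
    """Fetch one of the subsets: None, 'SA', 'SF', 'http' or 'smtp'."""
    # Imported lazily so anomaly_fraction works without scikit-learn.
    from sklearn.datasets import fetch_kddcup99

    # The data is downloaded on first use and cached afterwards.
    return fetch_kddcup99(subset=subset, as_frame=as_frame)


# Example call (needs scikit-learn, and network access on first use):
# bunch = load_kdd_subset("SA")
# print(bunch.data.shape, anomaly_fraction(bunch.target))
```

The loading helper is kept separate from the counting helper so that the
latter can be applied to any sequence of labels.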

.. topic:: References

    .. [2] Analysis and Results of the 1999 DARPA Off-Line Intrusion
           Detection Evaluation, Richard Lippmann, Joshua W. Haines,
           David J. Fried, Jonathan Korba, Kumar Das.

    .. [3] K. Yamanishi, J.-I. Takeuchi, G. Williams, and P. Milne. Online
           unsupervised outlier detection using finite mixtures with
           discounting learning algorithms. In Proceedings of the sixth
           ACM SIGKDD international conference on Knowledge discovery
           and data mining, pages 320-324. ACM Press, 2000.