\documentclass{amsart}
\usepackage{graphicx}
\usepackage{array}
\usepackage{moreverb}
\title{DRAFT: User Documentation for the STIDE software package}
\author{Julie Rehmeyer}
\date{\today}
\begin{document}
\maketitle

\section{Software Purpose} \label{sec:intro}
STIDE stands for Sequence Time-Delay Embedding, and it implements the
time-delay embedding method of anomaly detection. Its primary
function is to accept as input a time series (or a set of time
series), divide it into a set of fixed-length sequences, compare that
set of sequences with an existing database of fixed-length sequences,
and report on the consistency of the time series with the existing
database. It can also be used to create a database of fixed-length
sequences from scratch, or to add to a pre-existing database.

The STIDE software was originally developed by Steve Hofmeyr, a
graduate student in the Computer Science Department at the University
of New Mexico, as part of a research program that applies ideas
from immunology to problems in computer security. In particular,
STIDE was written to assist in detecting intrusions by identifying the
unusual sequences of system calls that may be created during an
attempted intrusion \cite{lightweight, ci, principles, self}. In this
context, the time series being considered consists of the system calls
made by a single process. We first record the system calls made by a
process exhibiting normal behavior (i.e., in non-exploited situations),
and then use STIDE to divide that continuous stream of system calls
into sequences of a given length and store them in a database.
Subsequently, when we want to know if another instance of the same
program has been attacked, we record the system calls the process has
generated and use STIDE to compare the resulting sequences of system
calls with the database of normal sequences. A large number of
sequences created by the potentially attacked process that were not
created by the uncompromised processes suggests that the process may
have been exploited.

In practice, because of limitations in available system call tracing
mechanisms, it is far easier for us to record simultaneously the system
calls generated by several processes that are running at the same
time. STIDE is designed to handle this situation. It can process
multiple interleaved time series simultaneously by requiring
that each element in the input stream be preceded by an identifier that
specifies which series it comes from. In our work, that identifier is
the process ID.

The simplest way that STIDE can report on the consistency of new data
with an existing database is to give the number of anomalous
sequences, i.e., the number of sequences in the input that do not
exist in the database.

It can also report the minimum Hamming distance \cite{lightweight}.
Given a sequence from the data stream and a sequence from the
database, the Hamming distance between them is the number of positions
at which their entries differ. The minimum of the Hamming distances
between the input sequence and all of the sequences in the database is
the minimum Hamming distance for the input sequence.
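
As a small worked illustration (the notation and the sequences below are
ours, chosen only for this example, and are not taken from the STIDE
source), write $D$ for the database and $k$ for the sequence length. For
sequences $x = (x_1, \ldots, x_k)$ and $y = (y_1, \ldots, y_k)$,
\[
  d(x, y) = \bigl|\{\, i : x_i \neq y_i \,\}\bigr| ,
  \qquad
  d_{\min}(x) = \min_{y \in D} d(x, y) .
\]
Suppose $k = 3$, $D = \{(24, 13, 5),\ (13, 5, 81)\}$ and the input
sequence is $x = (24, 13, 81)$. Then
\[
  d\bigl(x, (24, 13, 5)\bigr) = 1 , \qquad
  d\bigl(x, (13, 5, 81)\bigr) = 2 , \qquad
  d_{\min}(x) = \min(1, 2) = 1 .
\]
A sequence that is present in the database has $d_{\min} = 0$, so a
nonzero minimum Hamming distance marks an anomalous sequence, and larger
values mark sequences that are further from anything seen before.
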

Finally, STIDE can report a ``locality frame count''
\cite{ci}. When a process is exploited, there may be a short period
of time (a locality) when the percentage of anomalous sequences is
much higher than usual. Although ten anomalies over the course of a long
run may not be cause for concern, ten anomalies within thirty
sequences might be. Thus it can be useful to observe how many
anomalies there are {\it locally}. The number of sequences that are
considered to be ``local'' to one another is called the size of the
locality frame. In this mode, STIDE reports the largest number of
anomalies it finds within any locality frame.
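
One way to make this precise is sketched below. It rests on an
assumption of ours that the text above does not state: that the frame
for a given sequence consists of that sequence and the ones immediately
preceding it in the same stream. Let $a_t = 1$ if the $t$-th sequence is
anomalous and $a_t = 0$ otherwise, and let $F$ be the size of the
locality frame. Then
\[
  c_t = \sum_{j = \max(1,\, t - F + 1)}^{t} a_j ,
  \qquad
  \mbox{reported count} = \max_t c_t .
\]
For instance, with $F = 4$ and anomaly indicators
$a = (0, 1, 0, 0, 1, 1, 0, 1)$, the counts are
$c = (0, 1, 1, 1, 2, 2, 2, 3)$, so the reported locality frame count
would be 3. With $F = 1$ the sum reduces to $c_t = a_t$, and the count
carries no information beyond the anomaly flag itself.
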

An additional advantage of calculating locality frame counts is that
they provide an ``on-line'' measure. Ultimately, we are interested in a
system that detects intrusions while the system is running.
Because locality frame counts are calculated locally, one can
be notified immediately when an intrusion may be occurring.

\section{Input Data Format} \label{sec:input}
The input data consists of the time series to be analyzed. It is read
from standard input. It is expected to be a series of pairs of
positive integers, one pair per line, where the first integer
identifies the data stream and the second integer is the element of
the data stream. The end of the data stream can be designated either
by the end of the file or by an occurrence of the number $-1$ as a
stream identifier. In our work, the stream identifier is the process
identification number (PID), and the elements of the data stream are
system call numbers.

The following is a small example of an input file, tracking three
processes with PIDs 744, 1069 and 9.

\vspace{.15in}
\begin{tabular}{l l}
744 & 24 \\
744 & 13 \\
1069 & 4 \\
1069 & 24 \\
1069 & 4 \\
744 & 5 \\
9 & 24 \\
1069 & 13 \\
744 & 81 \\
9 & 13 \\
9 & 2 \\
1069 & 5 \\
1069 & 18 \\
-1
\end{tabular}
\vspace{.15in}
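
To illustrate how the interleaved input is separated, the sample above
can be regrouped by stream identifier. The sequence length of 3 used in
the last column is our choice for the illustration, not a STIDE default:

\vspace{.15in}
\begin{tabular}{l|l|l}
Stream & Elements in order of arrival & Sequences of length 3 \\
\hline
744 & 24, 13, 5, 81 & 24, 13, 5; \ 13, 5, 81 \\
1069 & 4, 24, 4, 13, 5, 18 & 4, 24, 4; \ 24, 4, 13; \ 4, 13, 5; \ 13, 5, 18 \\
9 & 24, 13, 2 & 24, 13, 2 \\
\end{tabular}
\vspace{.15in}

Each stream is divided into sequences independently, so no sequence
mixes elements from two different processes.
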

If the number $-1$ occurs as a data element, STIDE interprets that as
a missing data element. It does not form any sequences that include
that data element. Instead, it discards the partial sequence and starts
from scratch with the next element.

For example, suppose that the sequence length is 3 and the input is as
follows:
\nopagebreak
\vspace{5pt}
\begin{tabular}{l l}
220 & 14 \\
220 & 185 \\
220 & 20 \\
220 & -1 \\
220 & 2 \\
220 & 20 \\
220 & 3 \\
220 & 2 \\
-1
\end{tabular}
\vspace{.15in}

STIDE would derive three sequences from this input: 14, 185, 20; 2,
20, 3; and 20, 3, 2.
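
As a check on this example (the symbols $n$ and $k$ are ours), a stretch
of $n$ consecutive elements with no missing entries contains $n - k + 1$
sequences of length $k$, for $n \geq k$. The missing element splits
stream 220 into one stretch of 3 elements and one of 4, so with $k = 3$
the total is
\[
  (3 - 3 + 1) + (4 - 3 + 1) = 1 + 2 = 3 ,
\]
in agreement with the three sequences listed above. Without the
missing-element marker, the same 7 elements would form a single stretch
and yield $7 - 3 + 1 = 5$ sequences, two of which would span the gap.
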

\section{Configuration Options}
There are a number of options that affect STIDE's behavior. Every
option has a default value. The values may be changed through
command-line arguments or through a configuration file. Values set in
the configuration file override the defaults, and values set on the
command line override both. The following options are available:

\vspace{.2in}
\setlength{\extrarowheight}{3pt}

\begin{tabular}{l|l|l|l}

\vspace{-3pt}
Short &&& \\
Name & Long Name & Legitimate Values & Default Value \\
\hline

{\tt a} & {\tt add\_to\_db} & on or off & off \\
{\tt c} & {\tt config\_name} & filenames & stide.config \\
{\tt d} & {\tt db\_name} & filenames & default.db \\
{\tt f} & {\tt lf\_size} & 1 -- 999 & 1 \\
{\tt g} & {\tt output\_graph} & on or off & off \\
{\tt l} & {\tt seq\_len} & 1 -- 199 & 6 \\
{\tt p} & {\tt pair\_offset} & integers & 0 \\
{\tt s} & {\tt write\_db\_stats} & on or off & off \\
{\tt v} & {\tt verbose} & on or off & off \\
{\tt V} & {\tt very\_verbose} & on or off & off \\
{\tt hd} & {\tt compute\_hdist} & on or off & off \\
{\tt me} & {\tt max\_elements} & 1 -- 999 & 500 \\
{\tt ms} & {\tt max\_streams} & 1 -- 999 & 100 \\
{\tt aof} & {\tt add\_output\_format} & see below & see below \\
{\tt cof} & {\tt compare\_output\_format} & see below & see below \\

\end{tabular}

\vspace{.2in}

\subsection{Descriptions of Options}

\subsubsection{Option {\tt add\_to\_db}}

This flag indicates that you want the input data to be added to the
database. If there is no pre-existing database, it indicates that you
want to create a new database from the input data. Note that you
cannot simultaneously compare data and add it to the database. If
this switch is off, STIDE compares the input data with the database
without adding it.

\subsubsection{Option {\tt config\_name}}
This is the name of the configuration file to be used. See
Section~\ref{subsec:config} for more information about the
configuration file.

\subsubsection{Option {\tt db\_name}}
This is the name of an existing database or the name under which to
store a new database that will be created from the input data.

\subsubsection{Option {\tt lf\_size}}
This is the size of the locality frame (see Section~\ref{sec:intro}
for an explanation of locality frame count). The value 1 effectively
turns off locality frames.

\subsubsection{Option {\tt output\_graph}}
This causes STIDE to create a file {\tt db\_name.dot} containing a
graph of the entire database forest formatted as input for the program
Dot. Running Dot on the file translates it into PostScript format.
The result is a graphical image of the database.

\subsubsection{Option {\tt seq\_len}}
A database stores trees of sequences of a set length. When building a
new database, the length of the sequences to be stored is set with
{\tt seq\_len}. When adding to or comparing with an existing
database, one must use the same sequence length that was used when the
database was generated. In those situations, STIDE will automatically
figure out the correct sequence length and use it regardless of the
user specification or the default.\footnote{STIDE can do this for
revision 1 databases only. STIDE can still process old-style
databases, but cannot implement this feature. STIDE recognizes
revision 1 databases by their initial line: {\tt \#DBrev: 1 } and the
following line: {\tt \#DBseq\_len: } followed by an integer giving
the sequence length. When STIDE processes an old-style database, it
converts it to a revision 1 database if it is in {\tt add\_to\_db}
mode.}

\subsubsection{Option {\tt pair\_offset}} \label{subsubsec:po}
In {\tt verbose} or {\tt very\_verbose} modes, STIDE reports on
particular sequences of interest (see Sections~\ref{subsubsec:verbose}
and \ref{subsubsec:very-verbose}). One of the pieces of information
one might be interested in is where a particular sequence occurs in
the input. Recall that the input data is a stream of pairs (stream
number, element), and each element in the sequence being
considered came from one of those input pairs. STIDE reports where
the sequence occurred in the input by giving the pair number of the
last element of the sequence.

These numbers may be offset by a fixed amount by setting {\tt
pair\_offset}.
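
As an illustration, consider the first sample input in
Section~\ref{sec:input} with sequence length 3. The numbers below assume
that pairs are counted starting from 1 and that the offset is simply
added to the pair number; neither detail is stated above, so this is a
sketch rather than a specification. The sequence 24, 13, 5 in stream 744
ends with the pair ``744 5'', which is the 6th pair in the whole input
and the 3rd pair in stream 744. Under those assumptions, setting
{\tt pair\_offset} to 100 would change the reported position in the
whole input to
\[
  6 + 100 = 106 .
\]
Whether the position within the individual stream is shifted in the same
way is not specified here.
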

\subsubsection{Option {\tt write\_db\_stats}}
This flag causes STIDE to print out statistics on the database. The
statistics it will print are the number of nodes in the database, the
number of unique sequences, the number of branches, and the average
database branch factor. See Section~\ref{sec:output} for more
information.

\subsubsection{Option {\tt verbose}} \label{subsubsec:verbose}
When adding to the database in {\tt verbose} mode, STIDE will print
information about each new sequence being added to the database, where
the precise information is specified by the {\tt add\_output\_format}
parameter (see Section~\ref{subsubsec:aof}). When comparing the input
data with an existing database in {\tt verbose} mode, it will print
information about each sequence that is itself a miss or whose
locality frame contains a miss, where the precise information is
specified by the {\tt compare\_output\_format} parameter (see
Section~\ref{subsubsec:cof}). In either case, when adding or
comparing, STIDE will first print out a header with a list of the names
of the variables being printed.

\subsubsection{Option {\tt very\_verbose}} \label{subsubsec:very-verbose}
In {\tt very\_verbose} mode, STIDE will print out the information specified
by {\tt add\_output\_format} or {\tt compare\_output\_format} for each sequence
encountered in the input data, regardless of whether the sequence is
new. As in {\tt verbose} mode, STIDE will first print out a header
with a list of the names of the variables being printed.

\subsubsection{Option {\tt compute\_hdist}}
This switch causes the Hamming distance \cite{lightweight} to be
computed (see Section~\ref{sec:intro} for an explanation of Hamming
distance).

\subsubsection{Option {\tt max\_elements}}
This is the maximum number of unique data elements that STIDE might
encounter in the input data.

\subsubsection{Option {\tt max\_streams}}
This is the maximum number of different data streams that STIDE might
encounter in the input data.

\subsubsection{Option {\tt add\_output\_format}} \label{subsubsec:aof}

When adding to the database in {\tt verbose} or {\tt very\_verbose}
modes, STIDE will print the {\tt add\_output\_format} string for every
sequence of interest (see Sections~\ref{subsubsec:verbose} and
\ref{subsubsec:very-verbose}). Substitutions are made for control
characters as follows:

\vspace{.15in}

\begin{tabular}{c|l}
\vspace{-4pt}
Control \\ Char & Meaning \\ \hline
\%s & Stream Identification Number \\
\%d & Database Size \\
\vspace{-4pt}
\%p & Pair number of last data element of \\
& sequence in the whole input stream \\
\vspace{-4pt}
\%i & Pair number of last data element of \\
& sequence in its particular data stream \\
\verb+\+t & Tab \\
\verb+\+n & Newline \\
\end{tabular}

\vspace{.15in}

See Section~\ref{subsubsec:po} for more information about the meaning
of the \%p and \%i control characters.

The default value of {\tt add\_output\_format} is:

\vspace{5pt}

\verb+"DB Size: %d\tStream: %s\tPair Number: %p\n"+

\subsubsection{Option {\tt compare\_output\_format}} \label{subsubsec:cof}
When comparing data in {\tt verbose} mode, STIDE will print the
{\tt compare\_output\_format} string for every sequence that is
itself an anomaly or whose locality frame contains an anomaly. In
{\tt very\_verbose} mode, STIDE will print the string indicated for
{\it every} sequence, regardless of whether it is an anomaly.
Substitutions are made for control characters as follows:

\vspace{.15in}

\begin{tabular}{c|l}
\vspace{-4pt}
Control \\ Char & Meaning \\ \hline
\%s & Stream Identification Number \\
\vspace{-4pt}
\%p & Pair number of last data element of \\
& sequence in the whole input stream \\
\vspace{-4pt}
\%i & Pair number of last data element of \\
& sequence in its particular data stream \\
\%a & 1 if this sequence is an anomaly, 0 otherwise \\
\%c & Locality frame count of this sequence \\
\%h & Hamming distance \\
\verb+\+t & Tab \\
\verb+\+n & Newline \\
\end{tabular}

\vspace{.15in}

See Section~\ref{subsubsec:po} for more information about the meaning
of the \%p and \%i control characters.

The default value of {\tt compare\_output\_format} is:

\vspace{5pt}

\verb+"Pair Number: %p\tStream Number: %s\n"+
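
For example, a comparison run that also reports the anomaly flag,
locality frame count and Hamming distance for each flagged sequence
could use a format string built from the control characters above. The
command below is a sketch assembled only from options documented in
this section; the file names are placeholders and the locality frame
size of 20 is arbitrary:

\begin{verbatim}
stide -d our_data.db -v -hd -f 20 -cof "%p\t%s\t%a\t%c\t%h\n" < input3.dat
\end{verbatim}

Here {\tt -hd} turns on {\tt compute\_hdist} and {\tt -f 20} sets
{\tt lf\_size}. We assume, without confirmation from the text above,
that \%h and \%c are only meaningful when those two options are enabled.
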

\subsection{Command-Line Arguments}
All parameters may be set using the command line, in one of two ways.
The short name may be used, preceded by a hyphen and followed by a
value (if appropriate). The long name may also be used, but it must
be preceded by {\it two} hyphens and followed by a value (if
appropriate). Values set by the command line override those set in
any other way.

Switches that are simply turned on or off need not be followed by a
value. Parameters may be set in any order. There must be a space
between the parameter name and the value. Flags may not be combined.

STIDE expects the input data to come from standard input.

\subsubsection{Examples}

To use STIDE to create a database called ``our\_data.db'' from the
input file ``input1.dat'' with sequences of length 10, using the
default configuration file name, in verbose mode, with output format
``\verb+%p\t%s\t%d\n+'', one could type:

\vspace{5pt}

\begin{verbatim}
stide -d our_data.db -a -l 10 -v -aof "%p\t%s\t%d\n" < input1.dat
\end{verbatim}

\vspace{5pt}

To add the data from the file ``input2.dat'' to that database, using
the same configuration file, not in verbose mode, and to create a
graph in Dot format, one could type:

\vspace{5pt}

\begin{verbatim}
stide -d our_data.db --output_graph --add_to_db -l 10 < input2.dat
\end{verbatim}

\vspace{5pt}

Then to compare the data in file ``input3.dat'' to the database and
have the results reported using locality frame counts with locality
frame size 20, using the configuration file ``run3.config'', one would
type:

\vspace{5pt}

\begin{verbatim}
stide -d our_data.db -f 20 -l 10 -c run3.config < input3.dat
\end{verbatim}

\vspace{5pt}

\subsection{Configuration File} \label{subsec:config}
All parameters may be set using a configuration file. The first line
of a configuration file must be:\footnote{Old-style configuration
files lack this line. STIDE will assume that configuration files
that lack this line are old-style, and will try to parse them
accordingly, issuing a warning to the user.}

\vspace{5pt}

\begin{verbatim}
#ConfigFileRev: 1
\end{verbatim}

\vspace{5pt}

After the first line, lines may be commented out using a ``\#'' sign.
Each parameter is set on its own line, using the long name followed by
a colon, followed by the value. Lines may be continued by putting a
backslash as the last character of the line. White space at the
beginning of lines will be ignored. Parameters that are simple
switches may be set with the value ``on'' or ``off'', or with no value
at all (which will turn them on).

Configuration file values override default values and are overridden
by command-line values.
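
For instance, suppose a configuration file named {\tt run.config} (a
hypothetical name used only for this illustration) contains:

\begin{verbatim}
#ConfigFileRev: 1
# lf_size is set here; every other option keeps its default
lf_size: 20
\end{verbatim}

and STIDE is invoked as:

\begin{verbatim}
stide -c run.config -f 30 -v < input3.dat
\end{verbatim}

Following the precedence rule above, this run would use an
{\tt lf\_size} of 30 (command line) rather than 20 (configuration file)
or 1 (default). The {\tt verbose} switch is on because of {\tt -v}, and
all remaining options keep their default values.
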

\subsubsection{Example}

The following is a sample configuration file:

\vspace{.15in}

\begin{boxedverbatim}

# ConfigFileRev: 1
# Sample STIDE configuration file containing default values.

db_name: default.db     # name of database
seq_len: 6              # length of sequences
max_elements: 1000      # maximum number of unique elements
                        # in input
max_streams: 500        # maximum number of unique streams
                        # in input
pair_offset: 0          # offset for pair number count
add_output_format: \
    "DB Size: %d\tStream: %s\tPair Number: %p\n"
compare_output_format: \
    "Pair Number: %p\tStream Number: %s\n"
lf_size: 1              # 1 causes locality frame counts not
                        # to be computed
add_to_db: off          # Add this data to the database, or,
                        # if there is no database, create a
                        # new one -- do not do comparisons
output_graph: off       # Outputs graphing information in Dot
                        # format
compute_hdist: off      # Compute Hamming distances
write_db_stats: off     # At end, print out statistics about
                        # database
verbose: off            # Verbose mode
very_verbose: off       # Very verbose mode

\end{boxedverbatim}

\section{Output Data} \label{sec:output}
For every run, STIDE will first output the final configuration data
assembled from the defaults, the configuration file and the
command-line arguments, in a format that can be used as a
configuration file. The subsequent output depends on whether STIDE was
adding to the database or making comparisons.

\subsection{Output Data About Comparisons}

If you have run the program to compare sequences, at the end STIDE
will print out the number of different streams in the input, the total
number of pairs read from the input, the total number of sequences
read from the input, the number of sequences that were anomalous, and
the percentage of sequences that were anomalous. If locality frame
counts were being computed, STIDE reports the maximum locality frame
count encountered in any stream, and if Hamming distances were being
computed, STIDE reports the largest minimum Hamming distance of any
sequence in any stream.

If the {\tt verbose} switch was on and the {\tt
compare\_output\_format} parameter is set appropriately, STIDE will
print out information about each sequence that is either itself an
anomaly or whose locality frame contains an anomaly (if locality
frames are being computed). If the {\tt very\_verbose} switch was on
and the {\tt compare\_output\_format} parameter is set appropriately,
STIDE will print out information about each sequence, regardless of
whether it is an anomaly. The precise information to be output is
specified by the user in {\tt compare\_output\_format}. See
Section~\ref{subsubsec:cof} for details on what information {\tt
compare\_output\_format} may request.

\subsection{Output Data About the Database}

If you are adding to the database, STIDE will not print out any
information automatically (beyond the configuration information).
However, one can get further information about the growth of the
database by turning on {\tt verbose} or {\tt very\_verbose} modes, and
one can get information about the shape and complexity of a database
using the {\tt write\_db\_stats} switch.

\subsubsection{Database Growth Information}

In {\tt verbose} mode, STIDE will print out information on each new
sequence that is added to the database. In {\tt very\_verbose} mode,
STIDE will print out information on each sequence read in, regardless
of whether it is new. The information that STIDE produces is
determined by the {\tt add\_output\_format} parameter. See
Section~\ref{subsubsec:aof} for details on what information may be
requested.

\subsubsection{Database Statistics}

The {\tt write\_db\_stats} switch causes STIDE to print out
information about the shape and complexity of the database. The {\tt
write\_db\_stats} switch may be used either when adding to the
database or when making comparisons.

The sequences are stored as forests (groups of trees). Each path down
each tree represents a sequence that STIDE has encountered. STIDE can
compute the number of nodes on the trees, the number of leaves (leaves
are the ends of the trees, i.e., the last element in a sequence), the
number of branches, and the average branch factor, which is the number
of branches divided by the difference between the number of nodes and
the number of sequences (that is, by the number of non-leaf nodes).

For example, consider the sequences derived from the first sample input file in
Section~\ref{sec:input}:
\nopagebreak
\vspace{5pt}

\begin{tabular}{c}
24, 13, 5 \\
13, 5, 81 \\
4, 24, 4 \\
24, 4, 13 \\
4, 13, 5 \\
13, 5, 18 \\
24, 13, 2 \\
\end{tabular}

\vspace{.15in}

We can represent those sequences by the forest:

\vspace{.15in}

\begin{picture}(350, 80)
\put(40,0){\includegraphics{graphic1.eps}}
\end{picture}

\vspace{.15in}

In this database, the number of nodes is 15, the number of leaves is
7, and the number of branches is 12. There are 7 unique sequences.
The average branch factor is $12 / (15 - 7) = 1.5$.
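
The totals above can be checked tree by tree. The breakdown below is
reconstructed by us from the list of sequences (one tree per distinct
first element, with shared prefixes merged), not read off STIDE output:

\vspace{.15in}
\begin{tabular}{l|l|c|c|c}
Root & Paths & Nodes & Leaves & Branches \\
\hline
24 & $24 \to 13 \to 5$, \ $24 \to 13 \to 2$, \ $24 \to 4 \to 13$ & 6 & 3 & 5 \\
13 & $13 \to 5 \to 81$, \ $13 \to 5 \to 18$ & 4 & 2 & 3 \\
4 & $4 \to 24 \to 4$, \ $4 \to 13 \to 5$ & 5 & 2 & 4 \\
\hline
Total & & 15 & 7 & 12 \\
\end{tabular}
\vspace{.15in}

A tree with $n$ nodes has $n - 1$ branches, so the three trees together
have $15 - 3 = 12$ branches, and the $15 - 7 = 8$ non-leaf nodes have on
average $12 / 8 = 1.5$ branches leaving them.
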

\begin{thebibliography}{99}

\bibitem{lightweight} S. Hofmeyr, S. Forrest, and A. Somayaji.
``Lightweight intrusion detection for networked operating systems.''
Submitted to {\em Journal of Computer Security} (July 1997).

\bibitem{ci} S. Forrest, S. Hofmeyr, and A. Somayaji. ``Computer
immunology.'' {\em Communications of the ACM}, Vol. 40, No. 10, pp.
88--96 (1997).

\bibitem{principles} A. Somayaji, S. Hofmeyr, and S. Forrest.
``Principles of a Computer Immune System.'' New Security Paradigms
Workshop (presented September 1997).

\bibitem{self} S. Forrest, S.~A. Hofmeyr, A. Somayaji, and T.~A.
Longstaff. ``A sense of self for Unix processes.'' In Proceedings of
the 1996 IEEE Symposium on Computer Security and Privacy, IEEE
Computer Society Press, Los Alamitos, CA, pp. 120--128 (1996).
\end{thebibliography}

\end{document}