\documentclass{amsart}
\usepackage{graphicx}
\usepackage{array}
\usepackage{moreverb}
\title{DRAFT: User Documentation for the STIDE software package}
\author{Julie Rehmeyer}
\date{\today}
\begin{document}
\maketitle 


\section{Software Purpose} \label{sec:intro}
STIDE stands for Sequence Time-Delay Embedding, and it implements the
time-delay embedding method of anomaly detection.  Its primary
function is to accept as input a time series (or a set of time
series), divide it into a set of fixed-length sequences, compare that
set of sequences with an existing database of fixed length sequences,
and report on the consistency of the time series with the existing
database.  It can also be used to created a database of fixed-length
sequences from scratch, or to add to a pre-existing database.

The STIDE software was originally developed by Steve Hofmeyr, a
graduate student in the Computer Science Department at the University
of New Mexico, as part of a research program that is applying ideas
from immunology to problems in computer security.  In particular,
STIDE was written to assist in detecting intrusions by identifying the
unusual sequences of system calls that may be created during an
attempted intrusion \cite{lightweight, ci, principles, self}. In this
context, the time series being considered consists of the system calls
made by a single process.  We first record the system calls made by a
process exibiting normal behavior (i.e., in non-exploited situations),
and then use STIDE to divide that continuous stream of system calls
into sequences of a given length and store them in a database.
Subsequently, when we want to know if another instance of the same
program has been attacked, we record the system calls the process has
generated and use STIDE to compare the resulting sequences of system
calls with the database of normal sequences.  A large number of
sequences created by the potentially attacked process that weren't
created by the uncompromised processes suggests that the process may
have been exploited.

In practice, because of limitations in available system call tracing
mechanisms, is far easier for us to record simultaneously the system
calls generated by several processes that are running at the same
time.  STIDE is designed to handle this sort of situation.  It can
simultaneously process multiple interwoven time series by requiring
that each element in the input stream be preceded by an identifier to
specify which series it comes from.  In our work, that identifier is
the process ID.

The simplest way that STIDE can analyze information about the
consistency of new data with an existing database is to report the
number of anomolous sequences, i.e., the number of sequences in the
input which do not exist in the database.

It can also report the minimum Hamming distance \cite{lightweight}.
Given a sequence from the data stream and a sequence from the
database, we can compute the number of entries that are different
between the two sequences and get the Hamming distance between those
two sequences.  The minimum of the Hamming distances between the input
sequence and all of the sequences from the database is the minimum
Hamming distance for the input sequence.

The final option is that it can report a ``locality frame count''
\cite{ci}.  When a process is exploited, there may be a short period
of time (a locality) when the percentage of anomolous sequences is
much higher.  Although ten anomalies over the course of a long
run may not be cause for concern, ten anomalies within thirty
sequences might be.  Thus it can be useful to observe how many
anomalies there are {\it locally}.  The number of sequences that are
considered to be ``local'' to one another is called the size of the
locality frame.  In this mode, STIDE reports the largest number of
anomalies it finds within any locality frame.

An additional advantage of calculating locality frame counts is that
it provides an ``on-line'' measure.  Ultimately, we are interested in a
system which would detect intrusions as the system is running.
Because locality frame counts are calculated locally, one can
immediately be notified when an intrusion may be occurring.

\section{Input Data Format} \label{sec:input}
The input data consists of the time series to be analyzed.  It is read
from standard input.  It is expected to be a series of pairs of
positive integers, one pair per line, where the first integer
identifies the data stream and the second integer is the element of
the data stream.  The end of the data stream can either be designated
by the end of the file or by an occurrence of the number $-1$ as a
stream identifier.  In our work, the stream identifier is the process
identification number (PID), and the elements of the data stream are
system call numbers.  

The following is a small example of an input file, tracking three
processes, with PID's 744, 1069 and 9.

\vspace{.15in}
\begin{tabular}{l l}
744  & 24   \\
744  & 13   \\
1069 & 4    \\
1069 & 24   \\
1069 & 4    \\
744  & 5    \\
9    & 24   \\
1069 & 13   \\
744  & 81   \\
9    & 13   \\
9    & 2    \\
1069 & 5    \\
1069  & 18  \\
-1 
\end{tabular}
\vspace{.15in}

If the number $-1$ occurs as a data element, STIDE interprets that as
a missing data element.  It does not form any sequences going through
that data element.  It clears the sequence and starts from scratch.

For example, suppose that the sequence length is 3 and the input is as
follows:
\nopagebreak
\vspace{5pt}
\begin{tabular}{l l}
220  &  14  \\
220  &  185 \\
220  &  20  \\
220  &  -1  \\
220  &  2   \\
220  &  20  \\
220  &  3   \\
220  &  2  \\
-1 
\end{tabular}
\vspace{.15in}

STIDE would  derive three sequences from this input: 14, 185, 20; 2,
20, 3; and 20, 3, 2.

\section{Configuration Options}
There are a number of options which affect STIDE's behavior.  Every
option has a default value.  The values may be changed through command
line arguments or through a configuration file.  Values set by the
configuration file override default values and values set by the
command line override those set by either the configuration file or
the defaults.  The following options are available:

\vspace{.2in}
\setlength{\extrarowheight}{3pt}

\begin{tabular}{l|l|l|l}

\vspace{-3pt}
Short &&& \\
Name & Long name & Legitimate Values & Default Value \\ 
\hline 

{\tt a}   & {\tt add\_to\_db}       & on or off         & off \\ 
{\tt c}   & {\tt config\_name}      & filenames         & stide.config \\ 
{\tt d}   & {\tt db\_name}          & filenames         & default.db \\ 
{\tt f}   & {\tt lf\_size}          & 1 -- 999          & 1     \\ 
{\tt g}   & {\tt output\_graph}     & on or off         & off   \\ 
{\tt l}   & {\tt seq\_len}          & 1 -- 199          & 6     \\ 
{\tt p}   & {\tt pair\_offset}      & integers          & 0     \\ 
{\tt s}   & {\tt write\_db\_stats}  & on or off         & off   \\ 
{\tt v}   & {\tt verbose}           & on or off         & off   \\ 
{\tt V}   & {\tt very\_verbose}     & on or off         & off   \\ 
{\tt hd}  & {\tt compute\_hdist}    & on or off         & off   \\ 
{\tt me}  & {\tt max\_elements}     & 1 -- 999          & 500   \\ 
{\tt ms}  & {\tt max\_streams}      & 1 -- 999          & 100   \\ 
{\tt aof} & {\tt add\_output\_format} & see below       & see below \\ 
{\tt cof} & {\tt compare\_output\_format} & see below   & see below \\ 

\end{tabular}

\vspace{.2in}

\subsection{Descriptions of Options}

\subsubsection{Option {\tt add\_to\_db} }
 
This flag indicates that you want the input data to be added to the
database.  If there is no pre-existing database, it indicates that you
want to create a new database from the input data.  Note that you
cannot simultaneously compare data and add it to the database.  If
this switch is off, STIDE compares the input data with the database
without adding it.

\subsubsection{Option {\tt{config\_name}}}
This is the name of the configuration file to be used.  See
Section~\ref{subsec:config} for more information about the
configuration file.

\subsubsection{Option {\tt db\_name}}
This is the name of an existing database or the name under which to
store a new database that will be created from the input data.

\subsubsection{Option {\tt lf\_size}}
This is the size of the locality frame (see Section~\ref{sec:intro}
for an explanation of locality frame count).  The value 1 effectively
turns off locality frames.

\subsubsection{Option {\tt output\_graph}}
This causes STIDE to create a file {\tt db\_name.dot} containing a
graph of the entire database forest formatted as input for the program
Dot.  Running Dot on the file translates it into PostScript format.
The result is a graphical image of the database.

\subsubsection{Option {\tt seq\_len}}
A database stores trees of sequences of a set length.  When building a
new database, the length of the sequences to be stored is set with
{\tt seq\_len}.  When adding to or comparing with an existing
database, one must use the same sequence length that was used when the
database was generated.  In those situations, STIDE will automatically
figure out the correct sequence length and use it regardless of the
user specification or the default.\footnote{STIDE can do this for
  revision 1 databases only.  STIDE can still process old-style
  databases, but cannot implement this feature.  STIDE recognizes
  revision 1 databases by their initial line: {\tt \#DBrev: 1 } and the
  following line: {\tt \#DBseq\_len: }  followed by an integer giving
  the sequence length.  When STIDE processes an old-style database, it
  converts it to a revision 1 database if it is in {\tt add\_to\_db}
  mode.} 

\subsubsection{Option {\tt pair\_offset}} \label{subsubsec:po}
In {\tt verbose} or {\tt very\_verbose} modes, STIDE reports on
particular sequences of interest (see Sections \ref{subsubsec:verbose}
and \ref{subsubsec:very-verbose}).  One of the pieces of information
one might be interested in is where a particular sequence occurs in
the input.  Recall that the input data is a stream of pairs (stream
number, element number), and each element in the sequence being
considered came from one of those input pairs.  STIDE reports on where
the sequence occurred in the input by reporting the pair number of the
last element of the sequence.

These numbers may be offset by a fixed amount by setting {\tt
pair\_offset}. 

\subsubsection{Option {\tt write\_db\_stats}}
This flag causes STIDE to print out statistics on the database.  The
statistics it will print are the number of nodes in the database, the
number of unique sequences, the number of branches, and the average
database branch factor.  See Section~\ref{sec:output} for more
information.

\subsubsection{Option {\tt verbose}} \label{subsubsec:verbose}
When adding to the database in {\tt verbose} mode, STIDE will print
information about each new sequence being added to the database, where
the precise information is specified by the {\tt add\_output\_format}
parameter (see Section~\ref{subsubsec:aof}).  When comparing the input
data with an existing database in {\tt verbose }mode, it will print
information about each sequence that is itself a miss or whose
locality frame contains a miss, where the precise information is
specified by the {\tt compare\_output\_format} parameter (see
Section~\ref{subsubsec:cof}).  In either case, when adding or
comparing, STIDE will first print out a header with a list of the names
of the variables being printed.

\subsubsection{Option {\tt very\_verbose}} \label{subsubsec:very-verbose}
In {\tt very\_verbose} mode, STIDE will print out the information specified
by {\tt add\_output\_format} or {\tt compare\_output\_format} for each sequence
encountered in the input data, regardless of whether the sequence is
new.  As in {\tt verbose} mode, STIDE will first print out a header
with a list of the names of the variables being printed.

\subsubsection{Option {\tt compute\_hdist}}
This switch causes the Hamming distance \cite{lightweight} to be
computed (see Section~\ref{sec:intro} for an explanation of Hamming
distance).
  
\subsubsection{Option {\tt max\_elements}}
This is the maximum number of unique data elements that STIDE might
encounter in the input data.

\subsubsection{Option {\tt max\_streams}}
This is the maximum number of different data streams that STIDE might
encounter in the input data.

\subsubsection{Option {\tt add\_output\_format}} \label{subsubsec:aof}

When adding to the database in {\tt verbose} or {\tt very\_verbose}
modes, STIDE will print the {\tt add\_output\_format} string for every
sequence of interest (see Sections \ref{subsubsec:verbose} and
\ref{subsubsec:very-verbose}).  Substitutions are made for control
characters as follows:

\vspace{.15in}

\begin{tabular}{c|l}
\vspace{-4pt}
Control \\ Char &  Meaning \\ \hline
       \%s  & Stream Identification Number \\
       \%d  & Database Size  \\
\vspace{-4pt}
       \%p  & Pair number of last data element of \\
            & sequence in the whole input stream  \\
\vspace{-4pt}
       \%i  & Pair number of last data element of    \\
            & sequence in its particular data stream \\
       \verb+\+t & Tab    \\
       \verb+\+n & Newline    \\
\end{tabular}

\vspace{.15in}

See section \ref{subsubsec:po} for more information about the meaning
of the \%p and \%i control characters.

The default value of {\tt add\_output\_format} is:

\vspace{5pt}

\verb+"DB Size: %d\tStream: %s\tPair Number: %p\n"+
  
\subsubsection{Option {\tt compare\_output\_format}} \label{subsubsec:cof}
When comparing data in {\tt verbose} mode, STIDE will print the 
{\tt compare\_output\_format} string for every sequence which is
itself an anomaly or whose locality frame conatins an anomaly.  In
{\tt very\_verbose} mode, STIDE will print the string indicated for
{\it every} sequence, regardless of whether it is an anomaly.
Substitutions are made for control characters as follows:

\vspace{.15in}

\begin{tabular}{c|l}
\vspace{-4pt}
Control \\ Char &  Meaning \\ \hline
       \%s  & Stream Identification Number \\
\vspace{-4pt}
       \%p  & Pair number of last data element of \\
            & sequence in the whole input stream \\
\vspace{-4pt}
       \%i  & Pair number of last data element of \\
            & sequence in its particular data stream  \\
       \%a  & 1 if this sequence is an anomaly, 0 otherwise  \\
       \%c  & locality frame count of this sequence  \\
       \%h  & Hamming distance \\
       \verb+\+t & Tab    \\
       \verb+\+n & Newline    \\
\end{tabular}

\vspace{.15in}

See section \ref{subsubsec:po} for more information about the meaning
of the \%p and \%i control characters.

The default value of {\tt compare\_output\_format} is:

\vspace{5pt}

\verb+"Pair Number: %p\tStream Number: %s\n"+

\subsection{Command-Line Arguments}
All parameters may be set using the command line, in one of two ways.
The short name may be used, preceeded by a hyphen and followed by a
value (if appropriate).  The long name may also be used, but it must
be preceeded by {\it two} hyphens and followed by a value (if
appropriate).  Values set by the command line override those set in
any other way.

Switches which are simply turned on or off need not be followed by a
value.  Parameters may be set in any order. There must be space
between the parameter name and the value.  Flags may not be combined.

STIDE expects the input data to come from standard input.

\subsubsection{Examples}

To use STIDE to create a database called ``our\_data.db'' from the
input file ``input1.dat'' with sequences of length 10, using the
default configuration file name, in verbose mode, with ouput format
``\verb+%p\t%s\t%d\n+'',  one could type:

\vspace{5pt}

\begin{verbatim}
stide -d our_data.db -a -l 10 -v -aof "%p\t%s\t%d\n" < input1.dat
\end{verbatim}

\vspace{5pt}

To add the data from the file ``input2.dat'' to that database, using
the same configuration file, not in verbose mode, and to create a
graph in dot format, one could type:

\vspace{5pt}

\begin{verbatim}
stide -d our_data.db --output_graph --add_to_db -l 10 < input2.dat
\end{verbatim}

\vspace{5pt}

Then to compare the data in file ``input3.dat'' to the database and
have the results reported using locality frame counts with locality
frame size 20, using the configuration file ``run3.config'', one would
type:

\vspace{5pt}

\begin{verbatim}
stide -d our_data.db -f 20 -l 10 -c run3.config < input3.dat
\end{verbatim}

\vspace{5pt}

\subsection{Configuration File} \label{subsec:config}
All parameters may be set using a configuration file.  The first line
of a configuration file must be:\footnote{Old-style configuration
  files lack this line.  STIDE will assume that configuration files
  that lack this line are old-style, and will try to parse them
  accordingly, issuing a warning to the user.}

\vspace{5pt}

\begin{verbatim}
#ConfigFileRev: 1
\end{verbatim}

\vspace{5pt}

After the first line, lines may be commented out using a ``\#'' sign.
Each parameter is set on its own line, using the long name followed by
a colon, followed by the value.  Lines may be continued by putting a
backslash as the last character of the line.  White space at the
beginning of lines will be ignored.  Parameters which are simple
switches may be set with the value ``on'' or ``off'', or with no value
at all (which will turn them on).

Configuration file values override default values and are overriden
by command-line values.

\subsubsection{Example}

The following is a  sample configuration file:

\vspace{.15in}

\begin{boxedverbatim}

# ConfigFileRev: 1
# Sample STIDE configuration file containing default values.

 db_name: default.db     # name of database
 seq_len: 6              # length of sequences
 max_elements: 1000      # maximum number of unique elements 
			 # in input 
 max_streams: 500        # maximum number of unique streams 
 			 # in input  
 pair_offset: 0          # offset for pair number count
 add_output_format:  \
         "DB Size: %d\tStream: %s\tPair Number: %p\n"
 compare_output_format:  \
         "Pair Number: %p\tStream Number: %s\n"
 lf_size: 1              # 1 causes locality frame counts not
                         # to be computed 
 add_to_db: off          # Add this data to the database, or, 
                         # if there is no database, create a 
                         # new one -- do not do comparisons
 output_graph: off       # Outputs graphing information in Dot
                         # format 
 compute_hdist: off      # Compute Hamming distances
 write_db_stats: off     # At end, print out statistics about 
 			 # database
 verbose: off            # Verbose mode
 very_verbose: off       # Very verbose mode

\end{boxedverbatim}

\section{Output Data} \label{sec:output}
For every run, STIDE will first output the final configuration data
assembled from the defaults, the configuration file and the
command-line arguments, in a format which could be used as a
configuration file. The subsequent output depends on whether STIDE was
adding to the database or making comparisons.

\subsection{Output Data About Comparisons}

If you have run the program to compare sequences, at the end STIDE
will print out the number of different streams in the input, the total
number of pairs read from the input, the total number of sequences
read from the input, the number of sequences that were anomalous, and
the percentage of sequences that were anomalous.  If locality frame
counts were being computed, STIDE reports the maximum locality frame
count encountered in any stream, and if Hamming distances were being
computed, STIDE reports the largest minimum Hamming distance of any
sequence in any stream.

If the {\tt verbose} switch was on and the {\tt
  compare\_output\_format} parameter is set appropriately, STIDE will
print out information about each sequence which is either itself an
anomaly or whose locality frame contains an anomaly (if locality
frames are being computed).  If the {\tt very\_verbose} switch was on
and the {\tt compare\_output\_format} parameter is set appropriately,
STIDE will print out information about each sequence, regardless of
whether it is an anomaly.  The precise information to be output is
specified by the user in {\tt compare\_output\_format}.  See Section
\ref{subsubsec:cof} for details on what information {\tt
  compare\_output\_format} may request.

\subsection{Output Data About The Database}

If you are adding to the database, STIDE will not print out any
information automatically (beyond the configuration information).
However, one can get further information about the growth of the
database by turning on {\tt verbose} or {\tt very\_verbose} modes, and
one can get information about the shape and complexity of a database
using the {\tt write\_db\_stats} switch.

\subsubsection{Database Growth Information}

In {\tt verbose} mode, STIDE will print out information on each new
sequence which is added to the database.  In {\tt very\_verbose} mode,
STIDE will print out information on each sequence read in, regardless
of whether it is new.  The information that STIDE produces is
determined by the {\tt add\_output\_format} parameter.  See Section
\ref{subsubsec:aof} for details on what information may be requested.

\subsubsection{Database Statistics}

The {\tt write\_db\_stats} switch causes STIDE to print out
information about the shape and complexity of the database.  The {\tt
write\_db\_stats} switch may be used either when adding to the
database or when making comparisons.

The sequences are stored as forests (groups of trees).  Each path down
each tree represents a sequence that STIDE has encountered.  STIDE can
compute the number of nodes on the trees, the number of leaves (leaves
are the ends of the trees, i.e., the last element in a sequence), the
number of branches, and the average branch factor, which is the number
of branches divided by the difference between the number of nodes and
the number of sequences.

For example, consider the sequences derived from the first sample input file in
Section~\ref{sec:input}:
\nopagebreak
\vspace{5pt}

\begin{tabular}{c}
24, 13, 5  \\
13, 5, 81  \\
4, 24, 4 \\
24, 4, 13 \\
4, 13, 5  \\
13, 5, 18 \\
24, 13, 2  \\
\end{tabular}

\vspace{.15in}

We can represent those sequences by the forest:

\vspace{.15in}

\begin{picture}(350, 80)
\put(40,0){\includegraphics{graphic1.eps}}
\end{picture}

\vspace{.15in}

In this database, the number of nodes is 15, the number of leaves is
7, and the number of branches is 12.  There are 7 unique sequences.
The average branch factor is $12 / (15 - 7) = 1.5$.

\begin{thebibliography}{99}
  
\bibitem{lightweight} S. Hofmeyr, S. Forrest, and A. Somayaji
  ``Lightweight intrusion detection for networked operating systems.''
  Submitted to {\em Journal of Computer Security} (July, 1997).
  
\bibitem{ci} S. Forrest, S. Hofmeyr, and A. Somayaji ``Computer
  immunology'' {\em Communications of the ACM} Vol. 40, No. 10, pp.
  88-96 (1997).
  
\bibitem{principles} A. Somayaji, S. Hofmeyr, and S. Forrest
  ``Principles of a Computer Immune System.''  New Security Paradigms
  Workshop (presented September, 1997).
  
\bibitem{self} S. Forrest, S.~A. Hofmeyr, A. Somayaji, and T.~A.
  Longstaff ``A sense of self for Unix processes.''  In Proceedings of
  the 1996 IEEE Symposium on Computer Security and Privacy, IEEE
  Computer Society Press, Los Alamitos, CA, pp. 120-128 (1996).
\end{thebibliography}

\end{document}