73 changed files with 3387152 additions and 10 deletions

11/11.Rproj Normal file

@ -0,0 +1,13 @@
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX

11/ Normal file

@ -0,0 +1,91 @@
# Characterization of the datasets and their historical status (task 1)
## 1. kddcup99
Prepared for the Fifth International Conference on Knowledge Discovery and Data Mining as part of a competition to select the best-designed predictive model for detecting potential attacks.
The dataset covers 4 attack categories (DOS, R2L, U2R, probing). The training data contains 24 attack types, and the test data adds 14 more.
The data was simulated in a military network environment. It is 4 GB of network traffic from 7 weeks (about 5 million connection records).
A connection is a sequence of TCP packets with a well-defined start and end (in time), labeled either as normal or with an attack code. Each connection record is about 100 bytes.
As for current usage, I found:
- A 2016 paper describing its applications in machine learning during 2010-2015 -
- A 2018 paper in which the author says this dataset is often used as a benchmark -
### I am not sure whether this means it is still in use (that was 3 years ago, after all) - I think so, and that is the assessment I am leaving :)
## 2. network
A dump of network traffic captured with tcpdump between a certain LAN and external networks.
Because the tcpdump capture was filtered, only TCP and UDP traffic was collected.
### Each TCP packet record consists of:
- Time stamp
- Source IP address
- Source port
- Destination IP address
- Destination port
- Flags (syn, fin, push, rst, or .)
- Data sequence number of this packet
- Data sequence number of the data expected in return
- Number of bytes of receive buffer space available
- Indication of whether or not the data is urgent
### Each UDP packet record consists of:
- Time stamp
- Source IP address
- Source port
- Destination IP address
- Destination port
- Length of the packet
All IP addresses were modified so as not to expose potentially sensitive data.
### The page for this dataset was last edited on April 4, 2001, and the most recent article listed on the site ( is from 2000. I found no mention of this data being used in newer work, so I am marking this dataset as historical.
## 3. system calls
The dataset contains system call traces of active system processes.
Each trace file (\*.int) contains a list of number pairs, in this order:
- the PID of the process
- a number representing the system call
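The pair format above can be read into one ordered call list per process. This is only a minimal sketch: it assumes the two numbers on each line are separated by whitespace, and the sample values are hypothetical rather than taken from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_pid(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Group 'PID syscall' pairs into one ordered call list per process."""
    traces: Dict[int, List[int]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        pid, call = int(fields[0]), int(fields[1])
        traces[pid].append(call)
    return dict(traces)


# Hypothetical sample: two interleaved processes.
sample = ["744 24", "744 13", "1069 4", "744 5"]
print(group_by_pid(sample))
```

Keeping the calls in arrival order per PID matters, because any sequence-based analysis depends on the order in which each process issued its calls.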
The mapping from numbers to system calls is included in the documentation in the `UserDoc` folder.
It can also be downloaded as PostScript at this address:
## 4. UNIX shell log
9 datasets of user activity (USER0 and USER1 are the same person on different machines) on a UNIX system.
The data has been cleaned of all network addresses, personal data, timestamps, etc.
The token representation of the data in this set is described really well here ( so I will not repeat it.
### I did not find any recent work using this dataset, and the UCI KDD site is archival since it was absorbed by UCI ML, so I assume the dataset is archival.
# Additional datasets (task 3)
## 1. UNSW-NB15:
- description:
- link:
## 2. NSL_KDD:
- description: I did not find the dataset itself, but I found a dump of it :)
- link:
## P.S.
I know this is not quite what was asked for, but if you want an interesting graphical interpretation, I recommend:

11/kddcup99/response.R Normal file

@ -0,0 +1,64 @@
# First of all, I apologize for analyzing only 10% of the dataset instead of all of it, but my technical limitations do not allow me
# to run this on the full data without frying my hardware. I hope this will be forgiven :)
headers <- c('back','buffer_overflow','ftp_write','guess_passwd','imap','ipsweep','land','loadmodule','multihop','neptune','nmap','normal','perl','phf','pod','portsweep','rootkit','satan','smurf','spy','teardrop','warezclient','warezmaster',
kddcup99 <- read.csv('kddcup99/Data/kddcup.data_10_percent_corrected', col.names = headers)
# I don't quite understand why the dataset has 42 values per record while 64 columns are listed...
# also, the columns don't quite match in meaning what would be in them, but looking at the data I pick out and show
# below the 2 columns that look most sensible to analyze
print('Most common imap in kddcup99:')
print('Most common ipsweep in kddcup99:')

11/network/response.R Normal file

@ -0,0 +1,72 @@
base <- read.csv('network/Data/base.csv')
net1 <- read.csv('network/Data/net1.csv')
net2 <- read.csv('network/Data/net2.csv')
net3 <- read.csv('network/Data/net3.csv')
net4 <- read.csv('network/Data/net4.csv')
# base
print('Most common src_port in base:')
print('Most common src_addr in base:')
print('Most common dest_port in base:')
print('Most common dest_addr in base:')
# net1
print('Most common src_port in net1:')
print('Most common src_addr in net1:')
print('Most common dest_port in net1:')
print('Most common dest_addr in net1:')
# net2
print('Most common src_port in net2:')
print('Most common src_addr in net2:')
print('Most common dest_port in net2:')
print('Most common dest_addr in net2:')
# net3
print('Most common src_port in net3:')
print('Most common src_addr in net3:')
print('Most common dest_port in net3:')
print('Most common dest_addr in net3:')
# net4
print('Most common src_port in net4:')
print('Most common src_addr in net4:')
print('Most common dest_port in net4:')
print('Most common dest_addr in net4:')


@ -0,0 +1,69 @@
This file contains 9 sets of sanitized user data drawn from the
command histories of 8 UNIX computer users at Purdue over the course
of up to 2 years (USER0 and USER1 were generated by the same person,
working on different platforms and different projects). The data is
drawn from tcsh(1) history files and has been parsed and sanitized to
remove filenames, user names, directory structures, web addresses,
host names, and other possibly identifying items. Command names,
flags, and shell metacharacters have been preserved. Additionally,
**SOF** and **EOF** tokens have been inserted at the start and end of
shell sessions, respectively. Sessions are concatenated by date order
and tokens appear in the order issued within the shell session, but no
timestamps are included in this data. For example, the two sessions:
# Start session 1
cd ~/private/docs
ls -laF | more
cat foo.txt bar.txt zorch.txt > somewhere
# End session 1
# Start session 2
cd ~/games/
xquake &
vi scores.txt
# End session 2
would be represented by the token stream
<1> # one "file name" argument
<3> # three "file" arguments
This data is made available under conditions of anonymity for the
contributing users and may be used for research purposes only.
Summaries and research results employing this data may be published,
but literal tokens or token sequences from the data may not be
published except with express consent of the originators of the data.
No portion of this data may be released with or included in a
commercial product, nor may any portion of this data be sold or
redistributed for profit or as part of a profit-making endeavor.
Please direct any questions regarding this data to Terran Lane:

11/unix/response.txt Normal file

@ -0,0 +1,124 @@
I do not know how to perform any other statistical analysis on these files, so I simply counted the most frequent
commands for each user. Below I include the command I used for the analysis as well as the results.
I included only the 10 most frequent entries so as not to overwhelm.
Note that tokens such as <1>, **EOF**, etc. will of course appear, but they should not
be taken into account in the statistical analysis.
Command: sort <file_name> | uniq -c | sort -rn | head -n 10
2147 <1>
803 ls
567 **SOF**
567 **EOF**
507 cd
485 finger
450 elm
442 exit
251 <2>
230 fg
6069 <1>
1951 cd
1929 ls
1733 vi
884 <2>
515 **SOF**
515 **EOF**
397 smake
350 ll
315 more
5432 <1>
1597 cd
1069 **SOF**
1069 **EOF**
989 a.out
932 <2>
816 ls
626 quota
612 xcc
497 rm
4382 <1>
1710 ls
988 cd
806 more
778 vi
704 elm
577 fg
511 lo
501 **SOF**
501 **EOF**
10699 <1>
4501 cd
2395 ll
1682 vi
1465 dir
1396 <2>
955 **SOF**
955 **EOF**
641 elm
559 logout
8987 <1>
2862 cd
2748 <2>
2144 ls
1279 less
1183 grep
973 make
887 ll
778 -
632 <
16298 <1>
8761 ls
5680 cd
3419 **SOF**
3419 **EOF**
2830 vi
2419 elm
2015 <2>
1457 rm
996 exit
3463 <1>
1522 **SOF**
1522 **EOF**
1133 ls
848 cd
741 z
615 <2>
595 m
514 clear
237 rm
14269 <1>
5108 ll
5016 cd
2188 <2>
1983 **SOF**
1983 **EOF**
1553 k
1259 m
1177 z
796 vi
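
The counting done by the shell pipeline above can also be sketched in Python with the placeholder tokens filtered out before ranking. This is an illustrative sketch, not the command that produced the results: it assumes one token per line, and it filters only the argument-count placeholders (<1>, <2>, ...) and the **SOF**/**EOF** session markers. Shell metacharacters such as `-` or `<` are left in, as in the listings.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Tokens inserted by the sanitizer rather than typed by the user.
_PLACEHOLDER = re.compile(r"^(<\d+>|\*\*SOF\*\*|\*\*EOF\*\*)$")


def top_commands(tokens: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent tokens, ignoring sanitizer placeholders."""
    counts = Counter(t.strip() for t in tokens)
    for tok in list(counts):
        if not tok or _PLACEHOLDER.match(tok):
            del counts[tok]
    return counts.most_common(n)


# Hypothetical session used only to exercise the function.
session = ["**SOF**", "cd", "<1>", "ls", "ls", "**EOF**"]
print(top_commands(session))
```

Passing an open file object works as well, since each line is stripped before counting.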

@ -0,0 +1,216 @@
%!PS-Adobe-2.0 EPSF-2.0
%%Title: graphic1.eps
%%Creator: fig2dev Version 3.2 Patchlevel 0-beta2
%%CreationDate: Tue Feb 24 15:25:25 1998
%%For: julie@snow (Julie Rehmeyer,,,)
%%Orientation: Portrait
%%BoundingBox: 0 0 223 79
%%Pages: 0
%%IncludeFeature: *PageSize Letter
%%Magnification: 0.70
/$F2psDict 200 dict def
$F2psDict begin
$F2psDict /mtrx matrix put
/col-1 {0 setgray} bind def
/col0 {0.000 0.000 0.000 srgb} bind def
/col1 {0.000 0.000 1.000 srgb} bind def
/col2 {0.000 1.000 0.000 srgb} bind def
/col3 {0.000 1.000 1.000 srgb} bind def
/col4 {1.000 0.000 0.000 srgb} bind def
/col5 {1.000 0.000 1.000 srgb} bind def
/col6 {1.000 1.000 0.000 srgb} bind def
/col7 {1.000 1.000 1.000 srgb} bind def
/col8 {0.000 0.000 0.560 srgb} bind def
/col9 {0.000 0.000 0.690 srgb} bind def
/col10 {0.000 0.000 0.820 srgb} bind def
/col11 {0.530 0.810 1.000 srgb} bind def
/col12 {0.000 0.560 0.000 srgb} bind def
/col13 {0.000 0.690 0.000 srgb} bind def
/col14 {0.000 0.820 0.000 srgb} bind def
/col15 {0.000 0.560 0.560 srgb} bind def
/col16 {0.000 0.690 0.690 srgb} bind def
/col17 {0.000 0.820 0.820 srgb} bind def
/col18 {0.560 0.000 0.000 srgb} bind def
/col19 {0.690 0.000 0.000 srgb} bind def
/col20 {0.820 0.000 0.000 srgb} bind def
/col21 {0.560 0.000 0.560 srgb} bind def
/col22 {0.690 0.000 0.690 srgb} bind def
/col23 {0.820 0.000 0.820 srgb} bind def
/col24 {0.500 0.190 0.000 srgb} bind def
/col25 {0.630 0.250 0.000 srgb} bind def
/col26 {0.750 0.380 0.000 srgb} bind def
/col27 {1.000 0.500 0.500 srgb} bind def
/col28 {1.000 0.630 0.630 srgb} bind def
/col29 {1.000 0.750 0.750 srgb} bind def
/col30 {1.000 0.880 0.880 srgb} bind def
/col31 {1.000 0.840 0.000 srgb} bind def
-47.0 140.0 translate
1 -1 scale
/cp {closepath} bind def
/ef {eofill} bind def
/gr {grestore} bind def
/gs {gsave} bind def
/sa {save} bind def
/rs {restore} bind def
/l {lineto} bind def
/m {moveto} bind def
/rm {rmoveto} bind def
/n {newpath} bind def
/s {stroke} bind def
/sh {show} bind def
/slc {setlinecap} bind def
/slj {setlinejoin} bind def
/slw {setlinewidth} bind def
/srgb {setrgbcolor} bind def
/rot {rotate} bind def
/sc {scale} bind def
/sd {setdash} bind def
/ff {findfont} bind def
/sf {setfont} bind def
/scf {scalefont} bind def
/sw {stringwidth} bind def
/tr {translate} bind def
/tnt {dup dup currentrgbcolor
4 -2 roll dup 1 exch sub 3 -1 roll mul add
4 -2 roll dup 1 exch sub 3 -1 roll mul add
4 -2 roll dup 1 exch sub 3 -1 roll mul add srgb}
bind def
/shd {dup dup currentrgbcolor 4 -2 roll mul 4 -2 roll mul
4 -2 roll mul srgb} bind def
/DrawEllipse {
/endangle exch def
/startangle exch def
/yrad exch def
/xrad exch def
/y exch def
/x exch def
/savematrix mtrx currentmatrix def
x y tr xrad yrad sc 0 0 1 startangle endangle arc
savematrix setmatrix
} def
/$F2psBegin {$F2psDict begin /$F2psEnteredState save def} def
/$F2psEnd {$F2psEnteredState restore end} def
10 setmiterlimit
n 0 3367 m 0 0 l 6449 0 l 6449 3367 l cp clip
0.04200 0.04200 sc
7.500 slw
% Ellipse
n 1800 1800 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 1500 2400 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 2078 2378 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 2378 2978 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 1778 2978 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 1246 2994 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 3900 1800 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 5700 1800 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 3900 2400 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 3600 3000 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 4200 3000 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 5400 2400 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 5400 3000 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 6000 2400 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Ellipse
n 6000 3000 75 75 0 360 DrawEllipse gs 0.00 setgray ef gr gs col0 s gr
% Polyline
n 1800 1800 m 2100 2400 l gs 0.00 setgray ef gr gs col0 s gr
% Polyline
n 1800 1800 m 1500 2400 l 1800 3000 l gs col0 s gr
% Polyline
n 1500 2400 m 1200 3000 l gs 0.00 setgray ef gr gs col0 s gr
% Polyline
n 2100 2400 m 2400 3000 l gs 0.00 setgray ef gr gs col0 s gr
% Polyline
n 3900 1800 m 3900 2400 l 4200 3000 l gs col0 s gr
% Polyline
n 3900 2400 m 3600 3000 l gs col0 s gr
% Polyline
n 5700 1800 m 6000 2400 l 6000 3000 l gs col0 s gr
% Polyline
n 5700 1800 m 5400 2400 l 5400 3000 l gs col0 s gr
/Times-Roman ff 180.00 scf sf
1725 1575 m
gs 1 -1 sc (24) col0 sh gr
/Times-Roman ff 180.00 scf sf
1125 2400 m
gs 1 -1 sc (13) col0 sh gr
/Times-Roman ff 180.00 scf sf
1200 3300 m
gs 1 -1 sc (5) col0 sh gr
/Times-Roman ff 180.00 scf sf
3825 1575 m
gs 1 -1 sc (13) col0 sh gr
/Times-Roman ff 180.00 scf sf
2325 2400 m
gs 1 -1 sc (4) col0 sh gr
/Times-Roman ff 180.00 scf sf
1725 3300 m
gs 1 -1 sc (2) col0 sh gr
/Times-Roman ff 180.00 scf sf
3525 3300 m
gs 1 -1 sc (81) col0 sh gr
/Times-Roman ff 180.00 scf sf
4125 3300 m
gs 1 -1 sc (18) col0 sh gr
/Times-Roman ff 180.00 scf sf
5625 1575 m
gs 1 -1 sc (4) col0 sh gr
/Times-Roman ff 180.00 scf sf
5325 3300 m
gs 1 -1 sc (4) col0 sh gr
/Times-Roman ff 180.00 scf sf
6000 3300 m
gs 1 -1 sc (5) col0 sh gr
/Times-Roman ff 180.00 scf sf
6225 2475 m
gs 1 -1 sc (13) col0 sh gr
/Times-Roman ff 180.00 scf sf
3600 2475 m
gs 1 -1 sc (5) col0 sh gr
/Times-Roman ff 180.00 scf sf
5025 2475 m
gs 1 -1 sc (24) col0 sh gr
/Times-Roman ff 180.00 scf sf
2325 3300 m
gs 1 -1 sc (13) col0 sh gr

File diff suppressed because it is too large


@ -0,0 +1,583 @@
\title{DRAFT: User Documentation for the STIDE software package}
\author{Julie Rehmeyer}
\section{Software Purpose} \label{sec:intro}
STIDE stands for Sequence Time-Delay Embedding, and it implements the
time-delay embedding method of anomaly detection. Its primary
function is to accept as input a time series (or a set of time
series), divide it into a set of fixed-length sequences, compare that
set of sequences with an existing database of fixed length sequences,
and report on the consistency of the time series with the existing
database. It can also be used to create a database of fixed-length
sequences from scratch, or to add to a pre-existing database.
The STIDE software was originally developed by Steve Hofmeyr, a
graduate student in the Computer Science Department at the University
of New Mexico, as part of a research program that is applying ideas
from immunology to problems in computer security. In particular,
STIDE was written to assist in detecting intrusions by identifying the
unusual sequences of system calls that may be created during an
attempted intrusion \cite{lightweight, ci, principles, self}. In this
context, the time series being considered consists of the system calls
made by a single process. We first record the system calls made by a
process exhibiting normal behavior (i.e., in non-exploited situations),
and then use STIDE to divide that continuous stream of system calls
into sequences of a given length and store them in a database.
Subsequently, when we want to know if another instance of the same
program has been attacked, we record the system calls the process has
generated and use STIDE to compare the resulting sequences of system
calls with the database of normal sequences. A large number of
sequences created by the potentially attacked process that weren't
created by the uncompromised processes suggests that the process may
have been exploited.
In practice, because of limitations in available system call tracing
mechanisms, it is far easier for us to record simultaneously the system
calls generated by several processes that are running at the same
time. STIDE is designed to handle this sort of situation. It can
simultaneously process multiple interwoven time series by requiring
that each element in the input stream be preceded by an identifier to
specify which series it comes from. In our work, that identifier is
the process ID.
The simplest way that STIDE can analyze information about the
consistency of new data with an existing database is to report the
number of anomalous sequences, i.e., the number of sequences in the
input which do not exist in the database.
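
The anomaly count described above can be sketched as follows. This is a minimal illustration, not STIDE's implementation: a Python set stands in for the database, the function names are hypothetical, and the example traces are invented.

```python
from typing import Iterable, List, Set, Tuple

Sequence = Tuple[int, ...]


def windows(trace: List[int], seq_len: int) -> List[Sequence]:
    """All contiguous fixed-length sequences of a single time series."""
    return [tuple(trace[i:i + seq_len]) for i in range(len(trace) - seq_len + 1)]


def build_database(traces: Iterable[List[int]], seq_len: int) -> Set[Sequence]:
    """Collect every sequence seen in the normal traces."""
    db: Set[Sequence] = set()
    for trace in traces:
        db.update(windows(trace, seq_len))
    return db


def count_anomalies(trace: List[int], db: Set[Sequence], seq_len: int) -> int:
    """Number of sequences in the trace that do not exist in the database."""
    return sum(1 for seq in windows(trace, seq_len) if seq not in db)


normal = [[24, 13, 5, 81], [24, 13, 2]]
db = build_database(normal, seq_len=3)
print(count_anomalies([24, 13, 5, 18], db, seq_len=3))  # prints 1
```

Here the trace yields two sequences, (24, 13, 5) and (13, 5, 18), and only the second one is missing from the database.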
It can also report the minimum Hamming distance \cite{lightweight}.
Given a sequence from the data stream and a sequence from the
database, we can compute the number of entries that are different
between the two sequences and get the Hamming distance between those
two sequences. The minimum of the Hamming distances between the input
sequence and all of the sequences from the database is the minimum
Hamming distance for the input sequence.
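
As a small worked sketch of this definition (the function names are illustrative and the database is modeled as a plain set of equal-length tuples, which is an assumption about representation only):

```python
from typing import Iterable, Tuple

Sequence = Tuple[int, ...]


def hamming(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_hamming(seq: Sequence, database: Iterable[Sequence]) -> int:
    """Smallest Hamming distance between seq and any database sequence."""
    return min(hamming(seq, known) for known in database)


db = {(24, 13, 5), (13, 5, 81)}
print(min_hamming((24, 13, 2), db))  # prints 1
```

A sequence that is already in the database has minimum Hamming distance 0, so a nonzero value always indicates an anomalous sequence. An empty database makes `min` raise `ValueError` in this sketch.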
The final option is that it can report a ``locality frame count''
\cite{ci}. When a process is exploited, there may be a short period
of time (a locality) when the percentage of anomalous sequences is
much higher. Although ten anomalies over the course of a long
run may not be cause for concern, ten anomalies within thirty
sequences might be. Thus it can be useful to observe how many
anomalies there are {\it locally}. The number of sequences that are
considered to be ``local'' to one another is called the size of the
locality frame. In this mode, STIDE reports the largest number of
anomalies it finds within any locality frame.
An additional advantage of calculating locality frame counts is that
it provides an ``on-line'' measure. Ultimately, we are interested in a
system which would detect intrusions as the system is running.
Because locality frame counts are calculated locally, one can
immediately be notified when an intrusion may be occurring.
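
One way to picture the locality frame count is as a sliding window over per-sequence anomaly flags. The sketch below is an assumption-laden illustration: it treats the frame as the most recent `lf_size` sequences of a single stream and counts partially filled frames at the start, which may differ from what STIDE actually does.

```python
from collections import deque
from typing import Iterable


def max_locality_frame_count(flags: Iterable[int], lf_size: int) -> int:
    """Largest number of anomalies (flag == 1) within any lf_size consecutive sequences."""
    frame = deque(maxlen=lf_size)
    running = 0  # anomalies currently inside the frame
    best = 0
    for flag in flags:
        if len(frame) == lf_size:
            running -= frame[0]  # the oldest flag is about to be evicted
        frame.append(flag)
        running += flag
        best = max(best, running)
    return best


flags = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
print(max_locality_frame_count(flags, lf_size=4))  # prints 3
```

Because the running count is updated as each flag arrives, the value is available immediately, which is the on-line property mentioned above.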
\section{Input Data Format} \label{sec:input}
The input data consists of the time series to be analyzed. It is read
from standard input. It is expected to be a series of pairs of
positive integers, one pair per line, where the first integer
identifies the data stream and the second integer is the element of
the data stream. The end of the data stream can either be designated
by the end of the file or by an occurrence of the number $-1$ as a
stream identifier. In our work, the stream identifier is the process
identification number (PID), and the elements of the data stream are
system call numbers.
The following is a small example of an input file, tracking three
processes, with PIDs 744, 1069 and 9.
\begin{tabular}{l l}
744 & 24 \\
744 & 13 \\
1069 & 4 \\
1069 & 24 \\
1069 & 4 \\
744 & 5 \\
9 & 24 \\
1069 & 13 \\
744 & 81 \\
9 & 13 \\
9 & 2 \\
1069 & 5 \\
1069 & 18 \\
\end{tabular}
If the number $-1$ occurs as a data element, STIDE interprets that as
a missing data element. It does not form any sequences going through
that data element. It clears the sequence and starts from scratch.
For example, suppose that the sequence length is 3 and the input is as
follows:
\begin{tabular}{l l}
220 & 14 \\
220 & 185 \\
220 & 20 \\
220 & -1 \\
220 & 2 \\
220 & 20 \\
220 & 3 \\
220 & 2 \\
\end{tabular}
\end{gr_placeholder_never_emitted}
STIDE would derive three sequences from this input: 14, 185, 20; 2,
20, 3; and 20, 3, 2.
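
The handling of interleaved streams and of $-1$ markers described in this section can be sketched as follows. It is an illustration of the stated rules, not STIDE's code: the function name is hypothetical, and input is assumed to be already parsed into (stream, element) integer pairs.

```python
from collections import defaultdict, deque
from typing import Iterable, List, Tuple


def sequences(pairs: Iterable[Tuple[int, int]], seq_len: int) -> List[Tuple[int, Tuple[int, ...]]]:
    """Derive fixed-length sequences per stream from interleaved (stream, element) pairs."""
    buffers = defaultdict(lambda: deque(maxlen=seq_len))
    out = []
    for stream, element in pairs:
        if stream == -1:
            break  # -1 as a stream identifier marks the end of the data
        buf = buffers[stream]
        if element == -1:
            buf.clear()  # missing element: no sequence may span it
            continue
        buf.append(element)
        if len(buf) == seq_len:
            out.append((stream, tuple(buf)))
    return out


pairs = [(220, 14), (220, 185), (220, 20), (220, -1),
         (220, 2), (220, 20), (220, 3), (220, 2)]
for stream, seq in sequences(pairs, seq_len=3):
    print(stream, seq)
# prints:
# 220 (14, 185, 20)
# 220 (2, 20, 3)
# 220 (20, 3, 2)
```

Keeping one buffer per stream identifier is what lets elements of different processes be interleaved in the input without mixing their sequences.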
\section{Configuration Options}
There are a number of options which affect STIDE's behavior. Every
option has a default value. The values may be changed through command
line arguments or through a configuration file. Values set by the
configuration file override default values and values set by the
command line override those set by either the configuration file or
the defaults. The following options are available:
\begin{tabular}{l l l l}
Short &&& \\
Name & Long name & Legitimate Values & Default Value \\ \hline
{\tt a} & {\tt add\_to\_db} & on or off & off \\
{\tt c} & {\tt config\_name} & filenames & stide.config \\
{\tt d} & {\tt db\_name} & filenames & default.db \\
{\tt f} & {\tt lf\_size} & 1 -- 999 & 1 \\
{\tt g} & {\tt output\_graph} & on or off & off \\
{\tt l} & {\tt seq\_len} & 1 -- 199 & 6 \\
{\tt p} & {\tt pair\_offset} & integers & 0 \\
{\tt s} & {\tt write\_db\_stats} & on or off & off \\
{\tt v} & {\tt verbose} & on or off & off \\
{\tt V} & {\tt very\_verbose} & on or off & off \\
{\tt hd} & {\tt compute\_hdist} & on or off & off \\
{\tt me} & {\tt max\_elements} & 1 -- 999 & 500 \\
{\tt ms} & {\tt max\_streams} & 1 -- 999 & 100 \\
{\tt aof} & {\tt add\_output\_format} & see below & see below \\
{\tt cof} & {\tt compare\_output\_format} & see below & see below \\
\end{tabular}
\subsection{Descriptions of Options}
\subsubsection{Option {\tt add\_to\_db} }
This flag indicates that you want the input data to be added to the
database. If there is no pre-existing database, it indicates that you
want to create a new database from the input data. Note that you
cannot simultaneously compare data and add it to the database. If
this switch is off, STIDE compares the input data with the database
without adding it.
\subsubsection{Option {\tt{config\_name}}}
This is the name of the configuration file to be used. See
Section~\ref{subsec:config} for more information about the
configuration file.
\subsubsection{Option {\tt db\_name}}
This is the name of an existing database or the name under which to
store a new database that will be created from the input data.
\subsubsection{Option {\tt lf\_size}}
This is the size of the locality frame (see Section~\ref{sec:intro}
for an explanation of locality frame count). The value 1 effectively
turns off locality frames.
\subsubsection{Option {\tt output\_graph}}
This causes STIDE to create a file {\tt db\} containing a
graph of the entire database forest formatted as input for the program
Dot. Running Dot on the file translates it into PostScript format.
The result is a graphical image of the database.
\subsubsection{Option {\tt seq\_len}}
A database stores trees of sequences of a set length. When building a
new database, the length of the sequences to be stored is set with
{\tt seq\_len}. When adding to or comparing with an existing
database, one must use the same sequence length that was used when the
database was generated. In those situations, STIDE will automatically
figure out the correct sequence length and use it regardless of the
user specification or the default.\footnote{STIDE can do this for
revision 1 databases only. STIDE can still process old-style
databases, but cannot implement this feature. STIDE recognizes
revision 1 databases by their initial line: {\tt \#DBrev: 1 } and the
following line: {\tt \#DBseq\_len: } followed by an integer giving
the sequence length. When STIDE processes an old-style database, it
converts it to a revision 1 database if it is in {\tt add\_to\_db}
mode.}
\subsubsection{Option {\tt pair\_offset}} \label{subsubsec:po}
In {\tt verbose} or {\tt very\_verbose} modes, STIDE reports on
particular sequences of interest (see Sections \ref{subsubsec:verbose}
and \ref{subsubsec:very-verbose}). One of the pieces of information
one might be interested in is where a particular sequence occurs in
the input. Recall that the input data is a stream of pairs (stream
number, element number), and each element in the sequence being
considered came from one of those input pairs. STIDE reports on where
the sequence occurred in the input by reporting the pair number of the
last element of the sequence.
These numbers may be offset by a fixed amount by setting {\tt
pair\_offset}.
\subsubsection{Option {\tt write\_db\_stats}}
This flag causes STIDE to print out statistics on the database. The
statistics it will print are the number of nodes in the database, the
number of unique sequences, the number of branches, and the average
database branch factor. See Section~\ref{sec:output} for more
information.
\subsubsection{Option {\tt verbose}} \label{subsubsec:verbose}
When adding to the database in {\tt verbose} mode, STIDE will print
information about each new sequence being added to the database, where
the precise information is specified by the {\tt add\_output\_format}
parameter (see Section~\ref{subsubsec:aof}). When comparing the input
data with an existing database in {\tt verbose} mode, it will print
information about each sequence that is itself a miss or whose
locality frame contains a miss, where the precise information is
specified by the {\tt compare\_output\_format} parameter (see
Section~\ref{subsubsec:cof}). In either case, when adding or
comparing, STIDE will first print out a header with a list of the names
of the variables being printed.
\subsubsection{Option {\tt very\_verbose}} \label{subsubsec:very-verbose}
In {\tt very\_verbose} mode, STIDE will print out the information specified
by {\tt add\_output\_format} or {\tt compare\_output\_format} for each sequence
encountered in the input data, regardless of whether the sequence is
new. As in {\tt verbose} mode, STIDE will first print out a header
with a list of the names of the variables being printed.
\subsubsection{Option {\tt compute\_hdist}}
This switch causes the Hamming distance \cite{lightweight} to be
computed (see Section~\ref{sec:intro} for an explanation of Hamming
distance).
\subsubsection{Option {\tt max\_elements}}
This is the maximum number of unique data elements that STIDE might
encounter in the input data.
\subsubsection{Option {\tt max\_streams}}
This is the maximum number of different data streams that STIDE might
encounter in the input data.
\subsubsection{Option {\tt add\_output\_format}} \label{subsubsec:aof}
When adding to the database in {\tt verbose} or {\tt very\_verbose}
modes, STIDE will print the {\tt add\_output\_format} string for every
sequence of interest (see Sections \ref{subsubsec:verbose} and
\ref{subsubsec:very-verbose}). Substitutions are made for control
characters as follows:
\begin{tabular}{l l}
Control \\
Char & Meaning \\ \hline
\%s & Stream Identification Number \\
\%d & Database Size \\
\%p & Pair number of last data element of \\
& sequence in the whole input stream \\
\%i & Pair number of last data element of \\
& sequence in its particular data stream \\
\verb+\+t & Tab \\
\verb+\+n & Newline \\
\end{tabular}
See section \ref{subsubsec:po} for more information about the meaning
of the \%p and \%i control characters.
The default value of {\tt add\_output\_format} is:
\verb+"DB Size: %d\tStream: %s\tPair Number: %p\n"+
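
To make the substitution rules concrete, here is a minimal sketch of how such a format string could be expanded. The function name and the way values are passed in are assumptions made for illustration; only the control characters and the default string come from this document.

```python
import re
from typing import Dict


def render(fmt: str, values: Dict[str, int]) -> str:
    """Expand %s, %d, %p, %i and the two-character escapes \\t and \\n in fmt."""
    def substitute(match: "re.Match") -> str:
        token = match.group(0)
        if token == r"\t":
            return "\t"
        if token == r"\n":
            return "\n"
        return str(values[token[1]])  # token is '%' plus one control character

    return re.sub(r"%[sdpi]|\\[tn]", substitute, fmt)


default_format = r"DB Size: %d\tStream: %s\tPair Number: %p\n"
print(render(default_format, {"d": 5, "s": 744, "p": 9}), end="")
```

The raw string keeps `\t` and `\n` as two literal characters, mirroring how they would appear on a command line or in a configuration file before substitution.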
\subsubsection{Option {\tt compare\_output\_format}} \label{subsubsec:cof}
When comparing data in {\tt verbose} mode, STIDE will print the
{\tt compare\_output\_format} string for every sequence which is
itself an anomaly or whose locality frame contains an anomaly. In
{\tt very\_verbose} mode, STIDE will print the string indicated for
{\it every} sequence, regardless of whether it is an anomaly.
Substitutions are made for control characters as follows:
\begin{tabular}{l l}
Control \\
Char & Meaning \\ \hline
\%s & Stream Identification Number \\
\%p & Pair number of last data element of \\
& sequence in the whole input stream \\
\%i & Pair number of last data element of \\
& sequence in its particular data stream \\
\%a & 1 if this sequence is an anomaly, 0 otherwise \\
\%c & locality frame count of this sequence \\
\%h & Hamming distance \\
\verb+\+t & Tab \\
\verb+\+n & Newline \\
\end{tabular}
See section \ref{subsubsec:po} for more information about the meaning
of the \%p and \%i control characters.
The default value of {\tt compare\_output\_format} is:
\verb+"Pair Number: %p\tStream Number: %s\n"+
\subsection{Command-Line Arguments}
All parameters may be set using the command line, in one of two ways.
The short name may be used, preceded by a hyphen and followed by a
value (if appropriate). The long name may also be used, but it must
be preceded by {\it two} hyphens and followed by a value (if
appropriate). Values set by the command line override those set in
any other way.
Switches which are simply turned on or off need not be followed by a
value. Parameters may be set in any order. There must be space
between the parameter name and the value. Flags may not be combined.
STIDE expects the input data to come from standard input.
To use STIDE to create a database called ``our\_data.db'' from the
input file ``input1.dat'' with sequences of length 10, using the
default configuration file name, in verbose mode, with output format
``\verb+%p\t%s\t%d\n+'', one could type:
\verb+stide -d our_data.db -a -l 10 -v -aof "%p\t%s\t%d\n" < input1.dat+
To add the data from the file ``input2.dat'' to that database, using
the same configuration file, not in verbose mode, and to create a
graph in dot format, one could type:
\verb+stide -d our_data.db --output_graph --add_to_db -l 10 < input2.dat+
Then to compare the data in file ``input3.dat'' to the database and
have the results reported using locality frame counts with locality
frame size 20, using the configuration file ``run3.config'', one would
type:
\verb+stide -d our_data.db -f 20 -l 10 -c run3.config < input3.dat+
\subsection{Configuration File} \label{subsec:config}
All parameters may be set using a configuration file. The first line
of a configuration file must be:\footnote{Old-style configuration
files lack this line. STIDE will assume that configuration files
that lack this line are old-style, and will try to parse them
accordingly, issuing a warning to the user.}
\verb+#ConfigFileRev: 1+
After the first line, lines may be commented out using a ``\#'' sign.
Each parameter is set on its own line, using the long name followed by
a colon, followed by the value. Lines may be continued by putting a
backslash as the last character of the line. White space at the
beginning of lines will be ignored. Parameters which are simple
switches may be set with the value ``on'' or ``off'', or with no value
at all (which will turn them on).
Configuration file values override default values and are overridden
by command-line values.
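
This precedence order can be illustrated with a layered lookup. The sketch is not how STIDE resolves its options internally; it only shows the rule, using a handful of defaults from the options table and hypothetical file and command-line values.

```python
from collections import ChainMap
from typing import Dict

# A few defaults taken from the options table; the rest are omitted for brevity.
DEFAULTS = {"db_name": "default.db", "seq_len": 6, "lf_size": 1, "verbose": "off"}


def effective_config(file_values: Dict[str, object], cli_values: Dict[str, object]) -> Dict[str, object]:
    """Command-line values win over configuration file values, which win over defaults."""
    return dict(ChainMap(cli_values, file_values, DEFAULTS))


# Hypothetical inputs: the file sets two options, the command line overrides one.
print(effective_config({"seq_len": 10, "lf_size": 20}, {"seq_len": 8}))
```

`ChainMap` searches its mappings from left to right, so listing the command-line values first reproduces the stated override order.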
The following is a sample configuration file:
\begin{verbatim}
# ConfigFileRev: 1
# Sample STIDE configuration file containing default values.
db_name: default.db # name of database
seq_len: 6 # length of sequences
max_elements: 1000 # maximum number of unique elements
# in input
max_streams: 500 # maximum number of unique streams
# in input
pair_offset: 0 # offset for pair number count
add_output_format: \
"DB Size: %d\tStream: %s\tPair Number: %p\n"
compare_output_format: \
"Pair Number: %p\tStream Number: %s\n"
lf_size: 1 # 1 causes locality frame counts not
# to be computed
add_to_db: off # Add this data to the database, or,
# if there is no database, create a
# new one -- do not do comparisons
output_graph: off # Outputs graphing information in Dot
# format
compute_hdist: off # Compute Hamming distances
write_db_stats: off # At end, print out statistics about
# database
verbose: off # Verbose mode
very_verbose: off # Very verbose mode
\end{verbatim}
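
The configuration file rules in this section (revision header, comments, backslash continuation, switches without a value) can be sketched as a small parser. This is a hedged illustration rather than STIDE's parser: the function names are hypothetical, the header check tolerates an optional space after the ``\#'' sign, a ``\#'' inside double quotes is assumed not to start a comment, and all values are returned as strings.

```python
from typing import Dict, List


def strip_comment(line: str) -> str:
    """Drop a trailing # comment, ignoring # characters inside double quotes."""
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "#" and not in_quotes:
            return line[:i]
    return line


def parse_config(text: str) -> Dict[str, str]:
    lines = text.splitlines()
    if not lines or lines[0].replace(" ", "") != "#ConfigFileRev:1":
        raise ValueError("missing ConfigFileRev header")
    logical: List[str] = []
    pending = ""
    for raw in lines[1:]:
        line = strip_comment(raw).strip()
        if line.endswith("\\"):  # continuation: join with the next line
            pending += line[:-1].strip() + " "
            continue
        logical.append((pending + line).strip())
        pending = ""
    if pending.strip():
        logical.append(pending.strip())
    options: Dict[str, str] = {}
    for line in logical:
        if not line:
            continue
        name, _, value = line.partition(":")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        options[name.strip()] = value or "on"  # a bare switch is turned on
    return options


SAMPLE = r'''#ConfigFileRev: 1
# Lines starting with # are comments.
seq_len: 6            # length of sequences
add_output_format: \
    "DB Size: %d\tStream: %s\n"
verbose
'''
print(parse_config(SAMPLE))
```

Splitting on the first colon only is deliberate, since format strings such as the default output formats contain colons of their own.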
\section{Output Data} \label{sec:output}
For every run, STIDE will first output the final configuration data
assembled from the defaults, the configuration file and the
command-line arguments, in a format which could be used as a
configuration file. The subsequent output depends on whether STIDE was
adding to the database or making comparisons.
\subsection{Output Data About Comparisons}
If you have run the program to compare sequences, at the end STIDE
will print out the number of different streams in the input, the total
number of pairs read from the input, the total number of sequences
read from the input, the number of sequences that were anomalous, and
the percentage of sequences that were anomalous. If locality frame
counts were being computed, STIDE reports the maximum locality frame
count encountered in any stream, and if Hamming distances were being
computed, STIDE reports the largest minimum Hamming distance of any
sequence in any stream.
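To illustrate the last statistic: the minimum Hamming distance of a sequence is the smallest number of positions in which it differs from any sequence in the database, and the figure reported is the largest such minimum seen. The C++ sketch below assumes equal-length sequences kept in a plain vector; the names {\tt HammingDistance} and {\tt MinHammingDistance} are hypothetical, and STIDE itself walks its tree structure rather than a flat list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Seq = std::vector<int>;

// Number of positions at which two equal-length sequences differ.
int HammingDistance(const Seq &a, const Seq &b) {
  assert(a.size() == b.size());
  int misses = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] != b[i]) ++misses;
  }
  return misses;
}

// Smallest Hamming distance between seq and any database sequence.
// With an empty database every position counts as a miss.
int MinHammingDistance(const Seq &seq, const std::vector<Seq> &db) {
  int best = static_cast<int>(seq.size());
  for (const Seq &known : db) {
    const int d = HammingDistance(seq, known);
    if (d < best) best = d;
  }
  return best;
}
```

A sequence that is already in the database has a minimum distance of zero, so only anomalous sequences contribute to the reported maximum.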
If the {\tt verbose} switch was on and the {\tt
compare\_output\_format} parameter is set appropriately, STIDE will
print out information about each sequence which is either itself an
anomaly or whose locality frame contains an anomaly (if locality
frames are being computed). If the {\tt very\_verbose} switch was on
and the {\tt compare\_output\_format} parameter is set appropriately,
STIDE will print out information about each sequence, regardless of
whether it is an anomaly. The precise information to be output is
specified by the user in {\tt compare\_output\_format}. See Section
\ref{subsubsec:cof} for details on what information {\tt
compare\_output\_format} may request.
\subsection{Output Data About The Database}
If you are adding to the database, STIDE will not print out any
information automatically (beyond the configuration information).
However, one can get further information about the growth of the
database by turning on {\tt verbose} or {\tt very\_verbose} modes, and
one can get information about the shape and complexity of a database
using the {\tt write\_db\_stats} switch.
\subsubsection{Database Growth Information}
In {\tt verbose} mode, STIDE will print out information on each new
sequence which is added to the database. In {\tt very\_verbose} mode,
STIDE will print out information on each sequence read in, regardless
of whether it is new. The information that STIDE produces is
determined by the {\tt add\_output\_format} parameter. See Section
\ref{subsubsec:aof} for details on what information may be requested.
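As a rough illustration of how such a format string could be expanded, the C++ sketch below substitutes the four codes listed in the sample configuration file ({\tt \%d} database size, {\tt \%i} pair number within the stream, {\tt \%p} pair number within the whole input, {\tt \%s} stream number). The names {\tt AddInfo} and {\tt ExpandAddFormat} are hypothetical and do not come from the STIDE source; copying unknown codes through unchanged is an assumption, not documented behaviour.

```cpp
#include <cstddef>
#include <string>

// Hypothetical record of the values a format string may request.
struct AddInfo {
  int db_size;      // %d
  int stream_pair;  // %i
  int input_pair;   // %p
  int stream;       // %s
};

// Replace each recognised %-code in fmt with the matching value.
std::string ExpandAddFormat(const std::string &fmt, const AddInfo &info) {
  std::string out;
  for (std::size_t i = 0; i < fmt.size(); ++i) {
    if (fmt[i] != '%' || i + 1 == fmt.size()) {
      out += fmt[i];  // ordinary character, or a trailing '%'
      continue;
    }
    const char code = fmt[++i];
    switch (code) {
      case 'd': out += std::to_string(info.db_size); break;
      case 'i': out += std::to_string(info.stream_pair); break;
      case 'p': out += std::to_string(info.input_pair); break;
      case 's': out += std::to_string(info.stream); break;
      default:  out += '%'; out += code; break;  // assumed: pass through
    }
  }
  return out;
}
```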
\subsubsection{Database Statistics}
The {\tt write\_db\_stats} switch causes STIDE to print out
information about the shape and complexity of the database. The {\tt
write\_db\_stats} switch may be used either when adding to the
database or when making comparisons.
The sequences are stored as forests (groups of trees). Each path down
each tree represents a sequence that STIDE has encountered. STIDE can
compute the number of nodes on the trees, the number of leaves (leaves
are the ends of the trees, i.e., the last element in a sequence), the
number of branches, and the average branch factor, which is the number
of branches divided by the difference between the number of nodes and
the number of sequences.
For example, consider the sequences derived from the first sample input file in
24, 13, 5 \\
13, 5, 81 \\
4, 24, 4 \\
24, 4, 13 \\
4, 13, 5 \\
13, 5, 18 \\
24, 13, 2 \\
We can represent those sequences by the forest:
[Figure: forest of three trees, rooted at 24, 13 and 4, with one root-to-leaf path per sequence.]
In this database, the number of nodes is 15, the number of leaves is
7, and the number of branches is 12. There are 7 unique sequences.
The average branch factor is $12 / (15 - 7) = 1.5$.
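The same figures can be reproduced with a small prefix-tree sketch in C++. It is not the STIDE implementation: the names {\tt Node}, {\tt Level}, {\tt Insert}, {\tt Count} and {\tt AverageBranchFactor} are hypothetical, and the number of sequences is taken to be the number of leaves, which holds when all sequences have the same length.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <vector>

struct Node;
// One level of the forest: element value -> subtree rooted at that element.
using Level = std::map<int, std::unique_ptr<Node>>;

struct Node {
  Level children;
};

struct ForestStats {
  int nodes = 0;
  int leaves = 0;
  int branches = 0;
};

// Insert one sequence, creating nodes along its path as needed.
void Insert(Level &forest, const std::vector<int> &seq) {
  Level *level = &forest;
  for (int element : seq) {
    std::unique_ptr<Node> &slot = (*level)[element];
    if (!slot) slot.reset(new Node());
    level = &slot->children;
  }
}

// Count nodes, leaves and branches (parent-to-child edges) recursively.
void Count(const Level &level, bool is_root_level, ForestStats &stats) {
  for (const auto &kv : level) {
    const Node &node = *kv.second;
    ++stats.nodes;
    if (!is_root_level) ++stats.branches;
    if (node.children.empty()) ++stats.leaves;
    Count(node.children, false, stats);
  }
}

// Branches divided by (nodes - sequences), with sequences == leaves.
double AverageBranchFactor(const ForestStats &s) {
  assert(s.nodes > s.leaves);
  return static_cast<double>(s.branches) / (s.nodes - s.leaves);
}
```

Inserting the seven sequences above and calling {\tt Count} on the forest with the root-level flag set to true gives 15 nodes, 7 leaves and 12 branches, hence an average branch factor of 1.5.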
\bibitem{lightweight} S. Hofmeyr, S. Forrest, and A. Somayaji
``Lightweight intrusion detection for networked operating systems.''
Submitted to {\em Journal of Computer Security} (July, 1997).
\bibitem{ci} S. Forrest, S. Hofmeyr, and A. Somayaji ``Computer
immunology'' {\em Communications of the ACM} Vol. 40, No. 10, pp.
88-96 (1997).
\bibitem{principles} A. Somayaji, S. Hofmeyr, and S. Forrest
``Principles of a Computer Immune System.'' New Security Paradigms
Workshop (presented September, 1997).
\bibitem{self} S. Forrest, S.~A. Hofmeyr, A. Somayaji, and T.~A.
Longstaff ``A sense of self for Unix processes.'' In Proceedings of
the 1996 IEEE Symposium on Computer Security and Privacy, IEEE
Computer Society Press, Los Alamitos, CA, pp. 120-128 (1996).


@ -0,0 +1,6 @@
all:
	(cd Seq-code; make; cp stide ..)

clean:
	@rm -f stide
	@(cd Seq-code; rm -f *.o stide)


@ -0,0 +1,11 @@
STIDE version 1.1
Copyright (C) 1996, 1998 The Regents of the University
of New Mexico. All rights reserved.
This code was written for GCC version 2.7.2, but should compile correctly
under other more recent versions of GCC.
For usage information invoke stide with the --help option. More detailed
documentation can be found in the UserDoc directory.

@ -0,0 +1,339 @@
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc.
675 Mass Ave, Cambridge, MA 02139, USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Library General Public License instead.) You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.
Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and
modification follow.
0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".
Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.
You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:
a) You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
c) If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.
3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.
If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.
It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.
10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
Appendix: How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) 19yy <name of author>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) 19yy name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
`Gnomovision' (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Library General
Public License instead of this License.

@ -0,0 +1,23 @@
STIDE_OBJECTS = stide.o seq_config.o seq_stream.o template.o flexitree.o
LIBES = -lm
#FLAGS = -O2
FLAGS = -g
stide : $(STIDE_OBJECTS)
g++ -fno-implicit-templates $(FLAGS) $(STIDE_OBJECTS) $(LIBES) -o stide
template.o : ../Utils/arrays.h ../Utils/tll.h ../Utils/hash.h ../Utils/ ../Utils/ ../Utils/ seq_stream.h flexitree.h
g++ -fno-implicit-templates $(FLAGS) -c
stide.o : ../Utils/arrays.h ../Utils/hash.h seq_stream.h seq_config.h flexitree.h
g++ -fno-implicit-templates $(FLAGS) -c
flexitree.o : ../Utils/arrays.h flexitree.h
g++ -fno-implicit-templates $(FLAGS) -c
seq_config.o : seq_config.h ../Utils/arrays.h
g++ -fno-implicit-templates $(FLAGS) -c
seq_stream.o : seq_stream.h seq_config.h flexitree.h ../Utils/arrays.h ../Utils/hash.h
g++ -fno-implicit-templates $(FLAGS) -c

@ -0,0 +1,449 @@
#include "flexitree.h"
extern int counter;
// data structures:
// node for a linked list
class FlexiTreeNode {
public:
  FlexiTree *tree;     // the element at this node
  FlexiTreeNode *next; // pointer to the next node
  FlexiTreeNode(int root) {tree = new FlexiTree(root); next = NULL;}
};
FlexiTree::FlexiTree(void) {
  children = NULL;
  root = -1;
  id = counter;
  counter++; // keep ids unique across trees
}

FlexiTree::FlexiTree(int d) {
  children = NULL;
  root = d;
  id = counter;
  counter++; // keep ids unique across trees
}
// Frees every node in the child list together with the subtree it owns.
FlexiTree::~FlexiTree(void) {
  if (children) {
    FlexiTreeNode *temp_ptr = children->next, *next_temp_ptr;
    if (children->tree) delete children->tree;
    delete children;
    while (temp_ptr) {
      next_temp_ptr = temp_ptr->next;
      if (temp_ptr->tree) delete temp_ptr->tree;
      delete temp_ptr;
      temp_ptr = next_temp_ptr;
    }
  }
}
// Counts this node plus every node in its subtrees.
int FlexiTree::NumNodes(void) {
  int size = 1;
  if (children) {
    FlexiTreeNode *temp_ptr = children;
    while (temp_ptr) {
      size += temp_ptr->tree->NumNodes();
      temp_ptr = temp_ptr->next;
    }
  }
  return size;
}

// Counts the leaves below this node; a node without children is one leaf.
int FlexiTree::NumLeaves(void) {
  int size;
  if (children) {
    size = 0;
    FlexiTreeNode *temp_ptr = children;
    while (temp_ptr) {
      size += temp_ptr->tree->NumLeaves();
      temp_ptr = temp_ptr->next;
    }
  } else size = 1;
  return size;
}

// Counts the branches (parent-to-child links) below this node.
int FlexiTree::NumBranches(void) {
  int branches = 0;
  if (children) {
    FlexiTreeNode *temp_ptr = children;
    while (temp_ptr) {
      branches += (temp_ptr->tree->NumBranches() + 1);
      temp_ptr = temp_ptr->next;
    }
  }
  return branches;
}
/*********************************************************************
 * InsertSeq()
 * Inserts a sequence in this tree and returns 1 if the sequence
 * begins with the root of this tree and the sequence isn't already
 * in this tree. It returns -1 if the sequence doesn't begin with
 * the root of this tree. It returns 0 if the sequence was already
 * in this tree. This function is recursive and only compares the
 * portion of the sequence lying between the argument first and the
 * argument last.
 *
 * Input: const Array<int> &seq   Current sequence
 *        int first               Index of the first element of the
 *                                sequence to consider
 *        int last                Index of the last element of the
 *                                sequence to consider
 *********************************************************************/
int FlexiTree::InsertSeq(const Array<int> &seq, int first, int last)
{
  // If the root of this tree isn't the same as the first element of
  // the sequence, return -1 to indicate that
  if (root != seq[first]) {
    return -1;
  }
  first++; // shift the seq forward
  // If we have reached the end of the sequence now, we haven't added
  // anything to the tree, so we return 0 to indicate that it was
  // already there
  if (first > last) {
    return 0;
  }
  // If there are no children, create some with the correct root,
  // insert the sequence and return 1.
  if (!children) {
    children = new FlexiTreeNode(seq[first]);
    children->tree->InsertSeq(seq, first, last);
    return 1;
  }
  // The root agrees, we're not at the end, and there are children.
  // Now we want to know if the sequence is already in the children,
  // and if not, we want to add it.
  FlexiTreeNode *temp_ptr = children;
  int flag;
  while (1) {
    flag = temp_ptr->tree->InsertSeq(seq, first, last);
    // If the sequence is new and gets added, return 1
    if (flag == 1) return 1;
    // If the sequence is old, return 0
    if (flag == 0) return 0;
    // Otherwise the new root of the sequence isn't the same as the
    // root of this child tree, so we will try the next one. But
    // first, if this is the last child, we know it isn't in here, so
    // we will add it in and return 1
    if (temp_ptr->next == NULL) {
      temp_ptr->next = new FlexiTreeNode(seq[first]);
      temp_ptr->next->tree->InsertSeq(seq, first, last);
      return 1;
    }
    temp_ptr = temp_ptr->next;
  }
}
/*********************************************************************
 * IsSeqInTree()
 * Returns 1 if the sequence has a match within this tree and
 * returns 0 otherwise. This function is recursive and only
 * compares the portion of the sequence lying between the argument
 * first and the argument last.
 *
 * Input: const Array<int> &seq   Current sequence
 *        int first               Index of the first element of the
 *                                sequence to consider
 *        int last                Index of the last element of the
 *                                sequence to consider
 *********************************************************************/
int FlexiTree::IsSeqInTree(const Array<int> &seq, int first, int last)
{
  // If the first element of the sequence isn't the same as the root
  // of this tree, then we know already that there isn't a match here,
  // so return 0.
  if (root != seq[first]) {
    return 0;
  }
  first++; // shift the seq forward
  // If we have reached the end of the sequence, then we have
  // found matches all the way along, so return 1 saying that this is
  // a match.
  if (first > last) {
    return 1;
  }
  // Now we want to find out if there is a match in any of the
  // subtrees below this tree. The subtrees are contained in the
  // linked list children->next->next->...
  FlexiTreeNode *next_node = children;
  while (next_node != NULL) {
    if (next_node->tree->IsSeqInTree(seq, first, last)) {
      return 1; // Found it!
    }
    next_node = next_node->next;
  }
  // Now we've been through all of the subtrees without finding a
  // match, so there aren't any matches.
  return 0;
}
/*********************************************************************
 * ComputeHDistForTree()
 * Reports the minimum number of mismatches with any sequence on
 * this tree. This is a highly compute-intensive method, because
 * every path down the tree is followed. This function is
 * recursive, and only compares the portion of the sequence lying
 * between the argument first and the argument last.
 *
 * Input: Array<int> &seq   Current sequence
 *        int first         Index of the first element of the
 *                          sequence to consider
 *        int last          Index of the last element of the
 *                          sequence to consider
 *********************************************************************/
int FlexiTree::ComputeHDistForTree(Array<int> &seq, int first, int last)
{
  int tot_misses = 0;
  // If the first element of the sequence isn't the same as the root
  // of this tree, then every sequence on this tree will disagree with
  // the sequence here, so we increment tot_misses
  if (root != seq[first]) {
    tot_misses++;
  }
  first++; // shift the seq forward
  if (first > last) { // reached the end of the seq
    return tot_misses; // nothing further to compare below this node
  }
  // Now we want to add to tot_misses the smallest number of
  // mismatches with any of this tree's subtrees. This tree's
  // subtrees are in the linked list children->next->next->
  FlexiTreeNode *next_node = children;
  // last is the index of the last element of the sequence, which is
  // one less than the number of elements in the sequence. The most
  // misses possible is the number of elements in the sequence.
  int min_misses = last + 1;
  int misses;
  while (next_node != NULL) {
    misses = next_node->tree->ComputeHDistForTree(seq, first, last);
    if (misses < min_misses) {
      min_misses = misses;
    }
    next_node = next_node->next;
  }
  return (tot_misses + min_misses);
}
// Format for writing out: we do it depth first; each path is terminated by
// a negative number, which is -(the required backtrack length)-1. depth
// should start out as 0. The tree written out will end with -1.
void FlexiTree::Write(ostream &s, int &depth) {
  s<<root<<" ";
  FlexiTreeNode *temp_ptr = children;
  while (temp_ptr) {
    depth = 0;
    temp_ptr->tree->Write(s, depth);
    temp_ptr = temp_ptr->next;
    if (temp_ptr) s<<"-"<<(depth + 1)<<" ";
  }
  depth++; // now incr the count
}

ostream &operator<<(ostream &s, FlexiTree &tree) {
  int depth = 0;
  tree.Write(s, depth);
  s<<" -1"; // we terminate with a -1
  return s;
}
// returns 0 if we have reached the end of the file, 1 otherwise
int FlexiTree::Read(istream &s, int &depth) {
  int next_num;
  if (s.eof()) return 0;
  s>>next_num;
  if (next_num == -1) return 0; // we have reached the end of the tree
  if (next_num >= 0) {
    // a non-negative number is the root of the first child
    children = new FlexiTreeNode(next_num);
    if (!children->tree->Read(s, depth)) return 0;
    FlexiTreeNode *temp_ptr = children;
    // depth == 0 means the backtrack ended here, so a sibling follows
    while (depth == 0) {
      if (s.eof()) return 0;
      s>>next_num;
      if (next_num == -1) return 0; // we have reached the end of the tree
      temp_ptr->next = new FlexiTreeNode(next_num);
      temp_ptr = temp_ptr->next;
      if (!temp_ptr->tree->Read(s, depth)) return 0;
    }
  } else depth = (-1 * next_num) - 1; // a negative number encodes the backtrack length
  if (depth) depth--;
  return 1;
}

istream &operator>>(istream &s, FlexiTree &tree) {
  int next_num, depth = 0;
  tree.Read(s, depth);
  return s;
}
// writes out in the format that dot uses for dags
int FlexiTree::OutputGraph(ostream &s) {
  // first write out the name of the tree
  s<<" "<<id<<" [label=\""<<root<<"\",shape=plaintext];"<<endl;
  FlexiTreeNode *temp_ptr = children;
  int childid;
  while (temp_ptr) {
    // write out each child, then the edge from this node to it
    childid = temp_ptr->tree->OutputGraph(s);
    s<<" "<<id<<" -> "<<childid<<";"<<endl;
    temp_ptr = temp_ptr->next;
  }
  return id;
}
/*********************************************************************
 * IsSeqInForest()
 * Searches through database forest to locate sequence. Returns 1
 * if it finds it, 0 otherwise
 *********************************************************************/
int SeqForest::IsSeqInForest(const Array<int> &seq, int seq_len) const
{
  // Have we ever seen a sequence starting with the same root?
  if (trees_found[seq[0]]) {
    // Have we seen this precise sequence?
    return trees[seq[0]].IsSeqInTree(seq, 0, seq_len-1);
  }
  return 0;
}
@ -0,0 +1,59 @@
#ifndef __FLEXITREE_H
#define __FLEXITREE_H
#include "../Utils/arrays.h"
class FlexiTreeNode;
class FlexiTree {
private:
  FlexiTreeNode *children;
  int root;
  int id;
public:
  void Write(ostream &s, int &depth);
  int Read(istream &s, int &depth);
  int OutputGraph(ostream &s);
  FlexiTree();
  FlexiTree(int d);
  ~FlexiTree();
  void SetRoot(int d) {root = d;}
  int InsertSeq(const Array<int> &seq, int first, int last);
  int IsSeqInTree(const Array<int> &seq, int first, int last);
  int ComputeHDistForTree(Array<int> &seq, int first, int last);
  friend ostream &operator<<(ostream &s, FlexiTree &tn);
  friend istream &operator>>(istream &s, FlexiTree &tn);
  int NumNodes();    // returns the number of nodes in the tree
  int NumLeaves();   // returns the number of leaves in the tree, i.e. number of distinct seqs
  int NumBranches(); // returns the total number of branches, over all nodes
};

class SeqForest {
  // this structure is an array of N tree nodes, i.e. a tree for each value
  // type
  Array<FlexiTree> trees;
  // this structure is to record what types of values actually occurred -
  // for efficiency, if there were actually fewer value types than
  // specified in the config
  Array<int> trees_found;
public:
  SeqForest(int max_trees)
    {trees.Allocate(max_trees); trees_found.Allocate(max_trees); trees_found.Set(0);}
  int IsSeqInForest(const Array<int> &seq, int seq_len) const;
};

#endif

@ -0,0 +1,34 @@
#ifndef __OPT_INFO_H
#define __OPT_INFO_H
#include <string>
#include "../Utils/arrays.h"
#define NUM_OPTS 16
#define SHORT_NAME 0
#define LONG_NAME 1
class OptInfo {
public:
  string long_name;  // Long name of this option; used in
                     // configuration file and with the -- marker
                     // on the command line
  string short_name; // Short name of this option; used with the -
                     // marker on the command line
  int set;           // Flag indicating if this option has already
                     // been set
  char type;         // type of value: legitimate values are f
                     // (flag, i.e., boolean), i (int), s (string)
                     // or h (help)
  union {            // pointer to actual value to be set
    int *flag_val;   // value if type = 'f'
    int *int_val;    // value if type = 'i'
    string *str_val; // value if type = 's'
  };
  OptInfo() {};
};

#endif

@ -0,0 +1,54 @@
#ConfigFileRev: 1
#Sample STIDE configuration file containing default values.
db_name: default.db # name of database
seq_len: 6 # length of sequences
max_elements: 500 # maximum number of unique elements in input
max_streams: 100 # maximum number of unique streams in input
pair_offset: 0 # offset for pair number count
add_output_format: \
"DB Size: %d\tStream: %s\tPair Number: %p\n"
# In verbose mode, STIDE will print
# this information for every new
# sequence added to the database. In
# very verbose mode, STIDE will print
# this information for every sequence
# considered. Possible data:
# %d Database Size
# %i Pair number of last data element of
# sequence in its particular
# data stream
# %p Pair number of last data element of
# sequence in the whole input
# stream
# %s Stream Number
compare_output_format: \
"Pair Number: %p\tStream Number: %s\n"
# In verbose mode, STIDE will print
# this information for every sequence
# which is itself an anomaly or whose
# locality frame contains an anomaly.
# In very verbose mode, STIDE will
# print this information for every
# sequence. Possible data:
# %a 1 if this sequence is an anomaly, 0
# otherwise
# %c locality frame count of this sequence
# %h Hamming distance
# %i Pair number of last data element of
# sequence in its particular data stream
# %p Pair number of last data element of
# sequence in the entire input
# %s Stream Number
lf_size: 1 # 1 causes locality frame counts not
# to be computed
add_to_db: off # Add this data to the database, or, if there
# is no database, create a new one -- do not
# do comparisons
output_graph: off # Outputs graphing information in Dot
# format
compute_hdist: off # Compute Hamming distances
write_db_stats: off # At end, print out statistics about database
verbose: off # See add_output_format and compare_output_format
very_verbose: off # See add_output_format and compare_output_format
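
Each setting in the sample file has the shape "name: value # comment", where the value may be quoted. The block below is a minimal sketch of splitting one such line. ParseConfigLine is a hypothetical helper, not part of STIDE, and it deliberately ignores the backslash line continuation used by the two output-format settings.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: splits "name: value  # comment" into name and value.
// Returns false for blank lines, comment-only lines, lines without a colon,
// and quoted values that are never closed.
bool ParseConfigLine(const std::string& line, std::string& name,
                     std::string& value) {
    std::string::size_type i = 0;
    while (i < line.size() &&
           std::isspace(static_cast<unsigned char>(line[i]))) {
        ++i;
    }
    if (i == line.size() || line[i] == '#') return false;

    std::string::size_type colon = line.find(':', i);
    if (colon == std::string::npos) return false;
    name = line.substr(i, colon - i);

    // Skip the colon and any white space before the value.
    i = colon + 1;
    while (i < line.size() &&
           std::isspace(static_cast<unsigned char>(line[i]))) {
        ++i;
    }

    if (i < line.size() && line[i] == '"') {
        // A quoted value runs up to the next double quote.
        std::string::size_type close = line.find('"', i + 1);
        if (close == std::string::npos) return false;
        value = line.substr(i + 1, close - i - 1);
    } else {
        // An unquoted value ends at white space, '#', or end of line.
        std::string::size_type end = i;
        while (end < line.size() &&
               !std::isspace(static_cast<unsigned char>(line[end])) &&
               line[end] != '#') {
            ++end;
        }
        value = line.substr(i, end - i);
    }
    return true;
}
```

Checking the index against line.size() before every character access keeps the scan inside the string even on malformed lines.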

@ -0,0 +1,797 @@
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h> // isspace(), used when parsing the configuration file
#include <fstream.h>
#include <string>
#include "seq_config.h"
#include "opt_info.h"
#define LF_LIM 999
#define SEQ_LEN_LIM 199
#define MAX_ELEM_LIM 999
#define MAX_STREAMS_LIM 9999
* Config() *
* Reads in configuration information from configuration file, from *
* the command line, and from preset defaults. *
* *
* Input: int argc: Number of arguments on command line *
* char *argv[]: Array of strings of actual arguments *
* *
* Output: Nothing *
Config::Config(const int argc, const char *argv[])
Array<OptInfo> opt_array;
ReadCommandLine(argc, argv, opt_array);
* InitOptArray() *
* Sets the values of opt_array so that opt_array contains all the *
* information needed about the parameters being set by the config *
* file and the command-line arguments. *
* *
* Input: Array<OptInfo> &opt_array: Array of information about *
* options for the program *
* *
* Output: Nothing *
void Config::InitOptArray(Array<OptInfo> &opt_array)
opt_array[0].long_name = "db_name";
opt_array[0].short_name = "d";
opt_array[0].set = 0;
opt_array[0].type = 's';
opt_array[0].str_val = &db_name;
opt_array[1].long_name = "seq_len";
opt_array[1].short_name = "l";
opt_array[1].set = 0;
opt_array[1].type = 'i';
opt_array[1].int_val = &seq_len;
opt_array[2].long_name = "max_elements";
opt_array[2].short_name = "me";
opt_array[2].set = 0;
opt_array[2].type = 'i';
opt_array[2].int_val = &max_elements;
opt_array[3].long_name = "max_streams";
opt_array[3].short_name = "ms";
opt_array[3].set = 0;
opt_array[3].type = 'i';
opt_array[3].int_val = &max_streams;
opt_array[4].long_name = "cfg_name";
opt_array[4].short_name = "c";
opt_array[4].set = 0;
opt_array[4].type = 's';
opt_array[4].str_val = &cfg_name;
opt_array[5].long_name = "pair_offset";
opt_array[5].short_name = "p";
opt_array[5].set = 0;
opt_array[5].type = 'i';
opt_array[5].int_val = &pair_offset;
opt_array[6].long_name = "add_output_format";
opt_array[6].short_name = "aof";
opt_array[6].set = 0;
opt_array[6].type = 's';
opt_array[6].str_val = &add_output_format;
opt_array[7].long_name = "compare_output_format";
opt_array[7].short_name = "cof";
opt_array[7].set = 0;
opt_array[7].type = 's';
opt_array[7].str_val = &compare_output_format;
opt_array[8].long_name = "add_to_db";
opt_array[8].short_name = "a";
opt_array[8].set = 0;
opt_array[8].type = 'f';
opt_array[8].int_val = &add_to_db;
opt_array[9].long_name = "output_graph";
opt_array[9].short_name = "g";
opt_array[9].set = 0;
opt_array[9].type = 'f';
opt_array[9].int_val = &output_graph;
opt_array[10].long_name = "compute_hdist";
opt_array[10].short_name = "hd";
opt_array[10].set = 0;
opt_array[10].type = 'f';
opt_array[10].int_val = &compute_hdist;
opt_array[11].long_name = "lf_size";
opt_array[11].short_name = "lf";
opt_array[11].set = 0;
opt_array[11].type = 'i';
opt_array[11].int_val = &lf_size;
opt_array[12].long_name = "write_db_stats";
opt_array[12].short_name = "s";
opt_array[12].set = 0;
opt_array[12].type = 'f';
opt_array[12].int_val = &write_db_stats;
opt_array[13].long_name = "verbose";
opt_array[13].short_name = "v";
opt_array[13].set = 0;
opt_array[13].type = 'f';
opt_array[13].int_val = &verbose;
opt_array[14].long_name = "very_verbose";
opt_array[14].short_name = "V";
opt_array[14].set = 0;
opt_array[14].type = 'f';
opt_array[14].int_val = &very_verbose;
opt_array[15].long_name = "help";
opt_array[15].short_name = "h";
opt_array[15].set = 0;
opt_array[15].type = 'h';
* SetDefaults() *
* Sets configuration variables to their default values *
* *
* Input: None *
* *
* Output: None *
void Config::SetDefaults()
cfg_name = "stide.config";
db_name = "default.db";
seq_len = 6;
max_elements = 500;
max_streams = 100;
pair_offset = 0;
add_output_format = "DB Size: %d\tStream: %s\tPair Number: %p\n";
compare_output_format = "Pair Number: %p\tStream Number: %s\n";
lf_size = 1;
add_to_db = 0;
output_graph = 0;
compute_hdist = 0;
write_db_stats = 0;
verbose = 0;
very_verbose = 0;
num_fvars = 0;
* ReadCommandLine() *
* Parses the command line. Updates configuration variables. *
* *
* Input: const int argc Number of arguments *
* const char *argv[] Array of arguments *
* Array<OptInfo> &opt_array Array of information about *
* the configuration variables *
* *
* Output: None *
void Config::ReadCommandLine(const int argc, const char *argv[],
Array<OptInfo> &opt_array)
string var_name; // Name of variable
string var_val; // Value of variable
int name_type; // LONG_NAME or SHORT_NAME
int argv_i = 1; // First index of argv
int argv_j = 0; // Second index of argv
while (argv_i < argc) {
if (argv[argv_i][argv_j] != '-') {
cerr<< "ERROR: Switches must be preceded by a dash: "<<argv[argv_i]
<< endl << " is illegal" << endl;
if (argv[argv_i][argv_j] == '-') { // Long name
name_type = LONG_NAME;
else {
name_type = SHORT_NAME;
// Read name into var_name
var_name = argv[argv_i]+argv_j;
// Now we want to read the value, if there is one.
argv_j = 0;
if (++argv_i < argc) {
if (argv[argv_i][argv_j] != '-') {
var_val = argv[argv_i];
// assign value to appropriate variable
AssignValToVar(opt_array, var_val, var_name, name_type);
// Blank var_name and var_val for next time around
* AssignValToVar() *
* Figures out which variable to assign a given value to and does *
* so. Updates opt_array, to say that that particular variable *
* has been set. *
* *
* Input: Array<OptInfo> &opt_array Option Information *
* const string &var_val Value to be assigned *
* const string &var_name Name of variable to be updated *
* const int name_type SHORT_NAME or LONG_NAME *
* *
* Output: None *
void Config::AssignValToVar(Array<OptInfo> &opt_array, const string
&var_val, const string &var_name, const
int name_type)
int opt_i;
for (opt_i = 0; opt_i < NUM_OPTS; opt_i++) {
if (((name_type == LONG_NAME) && (opt_array[opt_i].long_name ==
var_name)) ||
((name_type == SHORT_NAME) && (opt_array[opt_i].short_name ==
var_name))) {
// If we have already set this variable and shouldn't change it,
// don't
if (opt_array[opt_i].set == 1) {
switch (opt_array[opt_i].type) {
case 'f': // flag
if ((var_val.length() == 0) || (var_val == "On") ||
(var_val == "ON") || (var_val == "on")) {
*(opt_array[opt_i].flag_val) = 1;
opt_array[opt_i].set = 1;
else if ((var_val != "Off") && (var_val != "off") &&
(var_val != "OFF")) {
cerr << "ERROR: Illegal value for parameter " << var_name
<< ". This parameter is a simple flag," << endl
<< "and may be followed by \"on\", \"off\", or nothing "
<< "(which turns it on). The current value is "
<< var_val << ". Aborting...";
exit(-1);
case 'i':
// Assign only when a value was given; otherwise keep the default
if (var_val.length() != 0) {
*(opt_array[opt_i].int_val) = atoi(var_val.c_str());
opt_array[opt_i].set = 1;
case 's':
// Assign only when a string was given; otherwise keep the default
if (var_val.length() != 0) {
*(opt_array[opt_i].str_val) = var_val;
opt_array[opt_i].set = 1;
case 'h':
} // end of switch
return; // we've found it, so we're done
} // end of if (opt_array[opt_i]...
} // end of for (opt_i = 0; ...
* ReadConfigFile() *
* Parses the configuration file. Updates configuration *
* variables. *
* *
* Input: Array<OptInfo> &opt_array: Option information *
* *
* Output: None *
void Config::ReadConfigFile(Array<OptInfo> &opt_array)
string var_name;
string var_val;
// Set up stream for reading configuration
ifstream cfg_file(cfg_name.c_str());
string buff;
int buff_i = 0; // index for buff
int opt_i = 0; // index for opt_array
int rev_num; // revision number of configuration file
if (!cfg_file.is_open()) {
cerr<<"WARNING: Cannot open configuration file "<<cfg_name
<<". I will continue, using the" <<endl
<<"default values and the command line arguments." << endl
<<"If that isn't what you wanted, type Ctrl-C now to abort."
<< endl;
// First we need to determine if the configuration file is old-style
// or new-style, i.e., is there a #ConfigFileRev: in the first
// line. We can determine this just by checking the first
// character.
char c = cfg_file.peek();
// Config file is empty; just return
if (cfg_file.eof()) {
// If old-style
if (c != '#') {
cerr << "WARNING: The first line of the configuration file did "
<< "not contain the string" << endl
<< "\"#ConfigFileRev: " << CFREV << "\"." << endl
<< "I will assume that this is an old format configuration "
<< "file." << endl
<<"If that isn't what you wanted, type Ctrl-C now to abort."
<< endl << endl;
ReadOldConfigFile(cfg_file, opt_array);
// Look for "#ConfigFileRev:"
cfg_file >> buff;
if (buff != "#ConfigFileRev:") {
cerr << "ERROR: I expected the first line of the configuration "
<< "file to either be \"#ConfigFileRev: \" followed by the "
<< "revision number or the beginning of an old-style "
<< "configuration file, which does not have a comment in the "
<< "first line. I'm confused, so I will abort..."
<< endl << endl;
cfg_file >> rev_num;
if (rev_num > CFREV) {
cerr << "ERROR: This version of STIDE does not know how to deal "
<< "with configuration files" << endl
<< "more modern than revision " << CFREV << ". Aborting..."
<< endl;
if (rev_num < CFREV) {
cerr << "ERROR: Configuration files must be revision " << CFREV
<< " or later, " << "or an old-style" << endl
<< "configuration file without a revision number. "
<< "Aborting..." << endl;
// Now we know everything's as we expect, so we'll parse the file
while (!cfg_file.eof()) {
// Skip white space at the beginning of the line
while (isspace(buff[buff_i])) {
// If buff is empty, move on to next line
if (buff.length() <= buff_i) {
getline(cfg_file, buff);
buff_i = 0;
// If we start with a comment, move on to next line
if (buff[buff_i] == '#') {
getline(cfg_file, buff);
buff_i = 0;
// Read in variable name, up to the :
int start_place = buff_i; // the beginning place of the name
while (buff[buff_i] != ':' && (buff_i < buff.length())) {
if (buff_i == buff.length()) { // reached end of line without finding a colon
cerr << "ERROR: Variable names in the configuration file must "
<< "be followed by a colon. The line " << endl
<< buff << endl << "contains a variable name which is not "
<< "terminated by a colon. Aborting..." <<endl;
// This assigns the values in buff between start_place and buff_i
// to var_name
var_name.assign(buff, start_place, buff_i - start_place);
// Skip colon
// Skip white space
while (isspace(buff[buff_i])) { buff_i++; }
start_place = buff_i; // the starting place of the value
// Find last point in value. If it starts with a quote, it ends
// with a quote.
if ((buff[buff_i] == '\"') && (buff_i < buff.length())) {
while (buff[buff_i] != '\"') {
// Strip off first "
// Otherwise, it ends with a space, a # or the end of the line
else {
while ((buff_i < buff.length()) && (!isspace(buff[buff_i])) &&
(buff[buff_i] != '#')) {
var_val.assign(buff, start_place, buff_i - start_place);
// Now we want to check to see if the line was continued, in which
// case we haven't gotten the value of the variable in var_val, so
// we still need to do that.
if (buff[buff_i-1] == '\\') {
getline(cfg_file, buff);
buff_i = 0;
while (isspace(buff[buff_i])) { buff_i++; }
start_place = buff_i;
// Find last point in value. If it starts with a quote, it ends with a
// quote.
if (buff[buff_i] == '\"') {
while ((buff[buff_i] != '\"') && (buff_i < buff.length())) {
start_place++; // Strip off first "
// Otherwise, it ends with a space, a # or the end of the line
else {
while ((buff_i < buff.length()) && (!isspace(buff[buff_i])) &&
(buff[buff_i] != '#')) {
var_val.assign(buff, start_place, buff_i - start_place);
// assign value to appropriate variable
AssignValToVar(opt_array, var_val, var_name, LONG_NAME);
getline(cfg_file, buff);
buff_i = 0;
} //end of while (!cfg_file.eof())...
* ReadOldConfigFile() *
* Reads information from an old-style configuration file. *
* Updates configuration variables. *
* *
* Input: ifstream &cfg_file Configuration file (already opened) *
* Array<OptInfo> &opt_array: Option information *
* *
* Output: None *
void Config::ReadOldConfigFile(ifstream &cfg_file,
Array<OptInfo> &opt_array)
string buff;
string var_name;
string var_val;
var_name = "max_elements";
AssignValToVar(opt_array, var_val, var_name, LONG_NAME);
getline(cfg_file, buff);
var_name = "max_streams";
AssignValToVar(opt_array, var_val, var_name, LONG_NAME);
getline(cfg_file, buff);
// Next line is hash table size, but we are now figuring that out
// dynamically, so just throw it away.
getline(cfg_file, buff);
// Now read in the format string
getline(cfg_file, var_val);
// Put the format string in the appropriate place
if (add_to_db) {
var_name = "add_output_format";
AssignValToVar(opt_array, var_val, var_name, LONG_NAME);
else {
var_name = "compare_output_format";
AssignValToVar(opt_array, var_val, var_name, LONG_NAME);
* CheckValues() *
* Checks configuration values that have been read in to make *
* sure that they are within the limits. Flags are automatically *
* checked while being read in, the output formats are checked *
* in InitOutputFormat(), and filenames are checked when they are *
* opened, so all that is left is the integer values. *
* *
* Input: None *
* *
* Output: None *
void Config::CheckValues()
if ((lf_size < 1) || (lf_size > LF_LIM)) {
cerr << "ERROR: lf_size must be between 1 and " << LF_LIM
<< ". It has been set to " << lf_size << ". Aborting..." << endl;
if ((seq_len < 1) || (seq_len > SEQ_LEN_LIM)) {
cerr << "ERROR: seq_len must be between 1 and " << SEQ_LEN_LIM
<< ". It has been set to " << seq_len << ". Aborting..." << endl;
if ((max_elements < 1) || (max_elements > MAX_ELEM_LIM)) {
cerr << "ERROR: max_elements must be between 1 and " << MAX_ELEM_LIM
<< ". It has been set to " << max_elements
<< ". Aborting..." << endl;
if ((max_streams < 1) || (max_streams > MAX_STREAMS_LIM)) {
cerr << "ERROR: max_streams must be between 1 and " << MAX_STREAMS_LIM
<< ". It has been set to " << max_streams
<< ". Aborting..." << endl;
* InitOutputFormat() *
* Converts the string add_output_format or compare_output_format *
* to information filling fmt_str and num_fvars, which is more *
* convenient for output. *
* *
* Input: None *
* *
* Output: None *
void Config::InitOutputFormat()
// Now we analyze add_output_format or compare_output_format
int flag = 0;
int f_i = 0;
num_fvars = 0;
string *buff;
// If we're not in verbose or very_verbose modes, we're never going
// to use this information, so don't waste our time doing this
if (!(verbose || very_verbose)) {
if (add_to_db) {
buff = &add_output_format;
else {
buff = &compare_output_format;
for (int i = 0; i <(*buff).length(); i++) {
switch ((*buff)[i]) {
case '\\':
switch ((*buff)[++i]) { // look at the character after the backslash
case 't': fmt_str[num_fvars][f_i] = '\t'; break;
case 'n': fmt_str[num_fvars][f_i] = '\n'; break;
case '%':
fmt_str[num_fvars][f_i] = '%';
flag = 1;
fmt_str[num_fvars][f_i] = (*buff)[i];
if (flag) {
switch (fmt_str[num_fvars][f_i]) {
case 'd': // database size
case 'i': // number of last value of sequence in this
// data stream
case 'p': // number of last value of sequence in entire
// input
case 's': // external stream ID
case 'a': // flag for whether this sequence is anomalous
case 'c': // locality frame count of this sequence
case 'h': // Hamming distance for this sequence
// Record that we must write that val at that position
write_val[num_fvars] = fmt_str[num_fvars][f_i];
fmt_str[num_fvars][f_i] = 'd';
fmt_str[num_fvars][f_i + 1] = '\0';
f_i = -1;
flag = 0;
default: // Unknown flag
cerr << "ERROR: Illegal control character in output format."
<< " Type stide -h for help." << endl;
} // switch ((*buff)[i ...
fmt_str[num_fvars][f_i] = '\0';
* OutputConfigInfo() *
* Writes information about the final configuration to standard *
* output. Does so in a format that could be used as a *
* configuration file. Changes no values anywhere. *
* *
* Input: const Array<OptInfo> &opt_array Option Information *
* *
* Output: None *
void Config::OuputConfigInfo(const Array<OptInfo> &opt_array) const
cout<<"This run was configured using configuration file "
<< cfg_name << " and command" << endl
<< "line arguments. The configuration values were as "
<< "follows." << endl
<<"#ConfigFileRev: " << CFREV << endl;
for (int i = 0; i < NUM_OPTS; i++) {
if (opt_array[i].type == 'i') {
cout << opt_array[i].long_name << ": " << *(opt_array[i].int_val)
<< endl;
if ((opt_array[i].type == 's') &&
((add_to_db && (opt_array[i].short_name == "aof")) ||
(!add_to_db && (opt_array[i].short_name == "cof")))) {
cout << opt_array[i].long_name << ": \"" << *(opt_array[i].str_val)
<< "\"" << endl;
if (opt_array[i].type == 'f') {
if (*(opt_array[i].int_val) == 1) {
cout << opt_array[i].long_name << ": On" << endl;
if (*(opt_array[i].int_val) == 0) {
cout << opt_array[i].long_name << ": Off" << endl;
cout << endl << endl;
// Now print header for verbose modes
if (verbose || very_verbose) {
cout<<endl<<"Variables in output: "<<endl;
for (int j = 0; j < num_fvars; j++) {
switch (write_val[j]) {
case 's': cout<<"stream #, "; break;
case 'i': cout<<"index #, "; break;
case 'h': if (compute_hdist) {cout<<"hamming miss, "; } break;
case 'c': if (lf_size > 1) {cout<<"lfc, "; } break;
case 'p': cout<<"pair #, "; break;
case 'd': cout<<"db size, "; break;
case 'a': cout<<"is anomalous?, "; break;
* WriteHelpInfo() *
* Writes help information to standard output. Changes no values.*
* *
* Input: None *
* *
* Output: None *
* *
void Config::WriteHelpInfo() const
cout<<"STIDE accepts calls of the form:"<<endl
<<" stide -c cfg_name -d db_name -me max_num_elements"
<<" -lf lf_size -l seq_len"<<endl<<" -ms max_num_streams"
<<" -p pair_num_offset -aof add_out_format "
<< endl << " -cof comp_out_format -a -g -h -hd -s -v -V"
<< endl << endl;
cout<<"STIDE expects input to come through standard input in"
<<" the format of a pair"<<endl
<<"of integers per line, where the first integer is a"
<<" stream identifier"<<endl
<<"and the second is a data element. Command line"
<<" arguments override"<<endl
<<"specifications in the configuration file. All"
<<" parameters are optional"<<endl
<<"and can be specified in any order. Parameters"
<<" are always preceded by a"<<endl
<<"switch. The switches are:"<<endl<<endl;
cout<<"-a Add to database; defaults to off"<<endl;
cout<<"-c cfg_name The name of file containing the"
<<" configuration;"<<endl
<<" defaults to \"stide.config\""<<endl;
cout<<"-d db_name The name of the file containing"
<<" the database;"<<endl
<<" defaults to \"default.db\""<<endl;
cout<<"-lf lf_size The size of the locality frame;"
<<" defaults to 1"<<endl;
cout<<"-g Write graphing data in Dot format;"
<<" defaults to off"<<endl;
cout<<"-h Help; displays this information"<<endl;
cout<<"-l seq_len Length of sequence; defaults to 6"<<endl;
cout<<"-p pair_offset Offset for pair number count;"
<<" defaults to 0"<<endl;
cout<<"-s Display db stats; defaults to off"<<endl;
cout<<"-v Verbose mode on; defaults to off"<<endl;
cout<<"-V Very verbose mode on; defaults to off"<<endl;
cout<<"-hd Compute Hamming distance measures;"
<<" defaults to off"<<endl;
cout<<"-me max_elements Maximum number of different"
<<" elements"<<endl
<<" in the input stream; defaults to"
<<" 500" <<endl;
cout<<"-ms max_num_streams Maximum number of different"
<<" streams in input;"<<endl
<<" defaults to 100"<<endl;
cout<<"-aof add_out_format Format for output when adding to"
<<" database"<<endl
<<" in verbose or very_verbose"
<<" modes; defaults to"<<endl
<<" \"DB Size: %d\\tStream: "
<<"%s\\tPair Number: %p\\n\""<<endl;
cout<<"-cof compare_out_format Format for output when comparing"
<<" with database"<<endl
<<" in verbose or very_verbose modes;"
<<" defaults to"<<endl
<<" \"Pair Number: %p\\tStream"
<<" Number: %s\\n\""<<endl;
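
The header comment for InitOutputFormat() describes turning a format string such as "Pair Number: %p\tStream Number: %s\n" into something convenient for output. The block below is a minimal sketch of the same substitution using a different technique: it expands the string directly into a std::string instead of building printf segments in fmt_str. ExpandFormat and the values map are hypothetical and not part of STIDE.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: replaces %x codes with integers from `values` and
// turns the two-character escapes \t and \n into a tab and a newline.
// Throws std::invalid_argument for a %x code that is not in `values`.
std::string ExpandFormat(const std::string& fmt,
                         const std::map<char, int>& values) {
    std::ostringstream out;
    for (std::size_t i = 0; i < fmt.size(); ++i) {
        char c = fmt[i];
        if (c == '\\' && i + 1 < fmt.size()) {
            char next = fmt[++i];            // character after the backslash
            if (next == 't') {
                out << '\t';
            } else if (next == 'n') {
                out << '\n';
            } else {
                out << next;                 // unknown escape: keep the character
            }
        } else if (c == '%' && i + 1 < fmt.size()) {
            char code = fmt[++i];            // control character after '%'
            std::map<char, int>::const_iterator it = values.find(code);
            if (it == values.end()) {
                throw std::invalid_argument(
                    "illegal control character in output format");
            }
            out << it->second;
        } else {
            out << c;                        // ordinary text (or a trailing \ or %)
        }
    }
    return out.str();
}
```

Keeping the accepted codes in a map means the check for an illegal control character and the lookup of its value are one step.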

@ -0,0 +1,68 @@
#ifndef __SEQ_CONFIG_H
#define __SEQ_CONFIG_H
#define CFREV 1
#include <iostream.h>
#include <fstream.h>
#include <string>
#include "opt_info.h"
class Config {
Config(const int argc, const char *argv[]); // Constructor; reads
// configuration file and command
// line arguments
string cfg_name; // Name of configuration file
string db_name; // Name of database
int seq_len; // Sequence Length
int max_elements; // Maximum number of different
// data elements we may encounter
int max_streams; // Maximum number of different
// streams we may encounter
int pair_offset; // Number by which to offset
// num_pairs_read
string add_output_format; // Format for verbose-mode output
// when adding to database
string compare_output_format; // Format for verbose-mode output
// when comparing with an
// existing database
int lf_size; // Size of locality frames: 1
// effectively means don't
// compute locality frames
int add_to_db; // Flag indicating that we should
// add to the database rather
// than make comparisons
int output_graph; // Output graphing information in
// Dot format
int compute_hdist; // Compute Hamming distance
int write_db_stats; // Write statistics about the
// database
int verbose; // Output information about each
// anomaly or each new sequence
// added to the database
int very_verbose; // Output information about each
// sequence encountered
char fmt_str[10][50]; // String used for outputting
// information in verbose mode
char write_val[7]; // Do we write the value? used
// with fmt_str
int num_fvars; // Number of format variables
void InitOptArray(Array<OptInfo> &opt_array);
void SetDefaults();
void ReadCommandLine(const int argc, const char *argv[],
Array<OptInfo> &opt_array);
void AssignValToVar(Array<OptInfo> &opt_array, const
string &var_val, const string
&var_name, const int name_type);
void ReadConfigFile(Array<OptInfo> &opt_array);
void ReadOldConfigFile(ifstream &cfg_file,
Array<OptInfo> &opt_array);
void InitOutputFormat();
void CheckValues();
void OuputConfigInfo(const Array<OptInfo> &opt_array) const;
void WriteHelpInfo() const;

@ -0,0 +1,358 @@
#include <stdlib.h>
#include <string.h>
#include <iostream.h>
#include <fstream.h>
#include <stdio.h>
#include "../Utils/hash.h"
#include "seq_stream.h"
* Init() *
* Initializes an instance of Stream. *
* *
* Input: const Config &cfg Configuration information *
* const int intern_id internal stream identifier *
* const int extern_id external stream identifier *
* Output: none *
void Stream::Init(const Config &cfg,
const int intern_id, const int extern_id) {
// initialize all the arrays
current_seq.Set(-1); // initialize the array to be empty
num_in_seq = -1;
num_pairs_read = 0;
num_anoms = 0;
num_seqs_fnd = 0;
int_sid = intern_id;
ext_sid = extern_id;
max_hdist = 0;
seq_hdist = 0;
seq_lfc = 0;
max_lfc = 0;
ready = 0;
seq_len = cfg.seq_len;
* Append() *
* This function puts the integer given into the current_seq array *
* as the last element. It flags ready according to whether *
* current_seq is full. Updates num_in_seq, ready, current_seq, *
* num_seqs_fnd, and num_pairs_read. *
* *
* Input: const int new_value The next value to be put into the *
* current_seq array *
* Output: none *
void Stream::Append(const int new_value)
// missing system call - zero the current sequence
if (new_value == -1) {
num_in_seq = -1;
ready = 0;
else {
if (num_in_seq < seq_len - 1) { // window not yet full
current_seq[++num_in_seq] = new_value; // advance first: num_in_seq starts at -1
if (num_in_seq == seq_len - 1) {
ready = 1;
else {
// Roll over current_seq array
for (int k = 0; k < num_in_seq; k++) {
current_seq[k] = current_seq[k + 1];
current_seq[num_in_seq] = new_value;
* AddToDB() *
* *
* Adds current_seq to the database if it isn't already there; *
* Returns 0 if it is already there, 1 if it is new. Updates *
* normal and db_size. *
* *
* Input: SeqForest &normal Forest of normal sequences *
* int &db_size Number of unique sequences in the *
* database *
* const int total_pairs_read Number of pairs read from the *
* entire input stream *
* const Config &cfg Configuration Information *
* Output: 0 if sequence isn't new, 1 if it is *
int Stream::AddToDB(SeqForest &normal, int &db_size, const int
total_pairs_read, const Config &cfg) const
int is_new;
// If there is not a tree with the same root as this sequence has,
// make a new tree with that root and flag trees_found
if (!normal.trees_found[current_seq[0]]) {
normal.trees_found[current_seq[0]] = 1;
// Try to add the sequence. If it's already there, is_new will be
// set to 0, otherwise it will be set to 1.
is_new = normal.trees[current_seq[0]].InsertSeq(current_seq, 0,
db_size += is_new;
if ((is_new && cfg.verbose) || cfg.very_verbose) {
ReportNewSeq(cfg, total_pairs_read, db_size);
if (is_new)
return 1;
return 0;
* CompareSeq() *
* Compares the current sequence in this stream to the database, *
* in the manner indicated by the configuration file. Reports *
* on anomalies if told to by the configuration file. Updates *
* num_anoms, seq_hdist, max_hdist, seq_lfc, and max_lfc. *
* *
* Input: const Config &cfg: Information from configuration file *
* const SeqForest &normal: DB of normal sequences *
* const int total_pairs_read: Number of pairs read from *
* all of the streams *
* Output: none *
void Stream::CompareSeq(const Config &cfg, const SeqForest &normal,
const int total_pairs_read)
int is_anom; // flag to indicate whether current_seq is an anomaly
is_anom = ComputeMisses(normal);
if ((is_anom) && (cfg.compute_hdist)) {
if (cfg.lf_size > 1) {
ComputeLF(is_anom, cfg.lf_size);
// if we're in verbose mode and either current_seq is an anomaly or
// its locality frame contains an anomaly, report it
if ((cfg.very_verbose) || (cfg.verbose && (is_anom || seq_lfc))) {
ReportSeq(cfg, total_pairs_read, is_anom);
* ComputeMisses() *
* Compares the current sequence to the database sequences. If *
* there is an exact match, we return 0. Otherwise we return 1. *
* Updates num_anoms and seq_hdist. *
* *
* Input: const SeqForest &normal: DB of normal sequences *
* Output: 0 if there is an exact match *
* 1 if the sequence is anomalous *
int Stream::ComputeMisses(const SeqForest &normal)
if (normal.IsSeqInForest(current_seq, seq_len)) {
seq_hdist = 0;
// We have an anomaly
* ComputeHDist() *
* Compares the current sequence in this stream to each sequence *
* in the database in turn, adding up the number of mismatches *
* between the two sequences. The smallest difference between *
* the current sequence and the database sequences is the minimum *
* Hamming distance for the current sequence. If this minimum *
* Hamming distance is greater than the largest minimum Hamming *
* distance encountered so far, then the variable max_hdist is *
* updated. Updates seq_hdist and max_hdist. *
* *
* Input: const SeqForest &normal: DB of normal sequences *
* *
* Output: none *
void Stream::ComputeHDist(const SeqForest &normal)
int misses_on_this_seq; // the number of mismatches between
// current_seq and the sequence we're
// comparing it with at the moment
seq_hdist = seq_len; // start with seq_hdist as high as
// possible
// We compare current_seq with each sequence in our database tree
for (int i = 0; i < normal.trees.Size(); i++) {
// Have we seen any sequences starting with element i? If not, we
// can go on to consider sequences starting with element i+1.
if (normal.trees_found[i]) {
misses_on_this_seq =
normal.trees[i].ComputeHDistForTree(current_seq, 0,
if (misses_on_this_seq < seq_hdist) {
seq_hdist = misses_on_this_seq;
if (seq_hdist > max_hdist) {
max_hdist = seq_hdist;
* ComputeLF() *
* Computes the number of misses in current_seq's locality frame. *
* Updates lf, seq_lfc and max_lfc. *
* *
* Input: const int is_anom Flag to indicate whether *
* current_seq is an anomaly *
* const int lf_size Size of locality frame *
* Output: none *
void Stream::ComputeLF(const int is_anom, const int lf_size)
// While num_seqs_fnd is no greater than lf_size, the locality frame
// array is not yet full
if (num_seqs_fnd <= lf_size) {
lf[num_seqs_fnd-1] = is_anom;
seq_lfc += is_anom;
else {
// We're about to remove the first element of lf; since seq_lfc is
// the sum of the elements of lf, we should subtract lf[0] from
// seq_lfc to remove it from the sum.
seq_lfc -= lf[0];
// Now we add is_anom and seq_lfc is the sum of the new locality
// frame.
seq_lfc += is_anom;
// roll over the array
for (int i = 0; i < lf_size-1; i++) {
lf[i] = lf[i+1];
lf[lf_size-1] = is_anom;
if (seq_lfc > max_lfc) {
max_lfc = seq_lfc;
* ReportSeq() *
* This function reports data about a sequence. Specifically, it *
* can report the external stream id, a number indicating where *
* the first element of the current sequence occurs in the input, *
* a number indicating how many pairs from this particular data *
* stream have been read prior to the first element of the *
* sequence, the minimum Hamming distance for the current *
* sequence, the locality frame count, *
* and whether this particular sequence is itself an anomaly (it *
* could be that some other sequence in its locality frame is *
* anomalous). The configuration file determines which of those *
* possible data are reported and in what format. Updates no *
* values. *
* *
* Input: const Config &cfg Configuration information *
* const int total_pairs_read Total number of pairs read *
* from the input stream from any data *
* stream, not just this one *
* const int is_anom flag for whether the current *
* sequence is itself an anomaly *
* Output: none *
void Stream::ReportSeq(const Config &cfg, const int total_pairs_read,
const int is_anom) const
for (int i = 0; i < cfg.num_fvars; i++) {
switch (cfg.write_val[i]) {
case 'a':
printf(cfg.fmt_str[i], is_anom); break;
case 'c':
if (cfg.lf_size > 1) {
printf(cfg.fmt_str[i], seq_lfc);
case 'h':
if (cfg.compute_hdist) {
printf(cfg.fmt_str[i], seq_hdist);
case 'i':
printf(cfg.fmt_str[i], num_pairs_read); break;
case 'p':
printf(cfg.fmt_str[i], total_pairs_read); break;
case 's':
printf(cfg.fmt_str[i], ext_sid); break;
* ReportNewSeq() *
* This function reports on sequences which have been newly added *
* to the database. It can report the external stream *
* identifier, where the first element of the sequence occurs *
* both within the whole input stream and within its own data *
* stream, and the number of unique sequences in the database *
* after this sequence has been added. The configuration file *
* determines which of those possible data are reported and in *
* what format. Updates no values. *
* *
* Input: const Config &cfg Configuration information *
* const int total_pairs_read Total number of pairs read *
* from the input stream from any data *
* stream, not just this one *
* const int db_size Number of unique sequences *
* in the database *
* Output: none *
void Stream::ReportNewSeq(const Config &cfg, const int total_pairs_read,
const int db_size) const
for (int i = 0; i < cfg.num_fvars; i++) {
switch (cfg.write_val[i]) {
case 'd':
printf(cfg.fmt_str[i], db_size); break;
case 'i':
printf(cfg.fmt_str[i], num_pairs_read); break;
case 'p':
printf(cfg.fmt_str[i], total_pairs_read); break;
case 's':
printf(cfg.fmt_str[i], ext_sid); break;
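The 'h' field reported by ReportSeq() is the minimum Hamming distance between the current sequence and the sequences in the normal database. The following is a minimal sketch of that measure, assuming sequences are held in plain std::vector<int> rather than in the SeqForest used by this code. Seq, HammingDistance and MinHammingDistance are hypothetical names for illustration, not functions from this code base.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<int> Seq;  // hypothetical stand-in for one stored sequence

// Number of positions at which two sequences differ.
// Assumes equal lengths; extra trailing elements are ignored.
int HammingDistance(const Seq& a, const Seq& b) {
    int dist = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        if (a[i] != b[i]) ++dist;
    }
    return dist;
}

// Smallest Hamming distance from seq to any sequence in normal.
// Returns seq.size() when normal is empty (every position counts as a miss).
int MinHammingDistance(const Seq& seq, const std::vector<Seq>& normal) {
    int best = static_cast<int>(seq.size());
    for (std::size_t i = 0; i < normal.size(); ++i) {
        int dist = HammingDistance(seq, normal[i]);
        if (dist < best) best = dist;
    }
    return best;
}
```

A linear scan like this costs one comparison per stored sequence; the tree-based database in this code presumably exists to avoid that cost, so treat the sketch as a definition of the measure rather than as a replacement.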

#ifndef __STREAM_H
#define __STREAM_H
#include "../Utils/arrays.h"
#include "seq_config.h"
#include "flexitree.h"
class Stream {
Stream() {};
void Init(const Config &cfg, const int intern_id, const int
void Append(const int next_value);
int AddToDB(SeqForest &normal, int &db_size, int total_pairs_read,
const Config &cfg) const;
void CompareSeq(const Config &cfg, const SeqForest &normal, const
int total_pairs_read);
int GetMaxHDist(void) {return max_hdist;}
int GetMaxLFC(void) {return max_lfc;}
int Ready(void) {return ready;}
int GetNumAnoms(void) {return num_anoms;}
int GetNumPairsRead(void) {return num_pairs_read;}
int GetNumSeqsFnd(void) {return num_seqs_fnd;}
Array<int> current_seq; // current sequence being filled or
// processed
int num_in_seq; // current_seq is full up through
// num_in_seq
int num_pairs_read; // the number of input pairs belonging to
// this stream that have been read so far
int num_anoms; // the number of anomalies found so far
int num_seqs_fnd; // the number of (not necessarily unique)
// sequences belonging to this stream
// found so far
int ext_sid; // the external stream id
int int_sid; // the internal stream id
int max_hdist; // the largest minimum Hamming distance
// found in this stream
int seq_hdist; // the minimum Hamming distance for
// current_seq
Array<int> lf; // array for locality frame
int seq_lfc; // the locality frame count for this
// sequence
int max_lfc; // the largest locality frame count
// encountered so far
int ready; // a flag to indicate whether this stream
// has a full sequence ready to be
// processed. 0 = no, 1 = yes.
int seq_len; // sequence length
int ComputeMisses(const SeqForest &normal);
void ComputeHDist(const SeqForest &normal);
void ComputeLF(const int is_anom, const int lf_size);
void ReportSeq(const Config &cfg, const int total_pairs_read,
const int is_anom) const;
void ReportNewSeq(const Config &cfg, const int total_pairs_read,
const int db_size) const;
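The members above describe two pieces of per-stream state: a window holding the current sequence of seq_len values, and a locality frame that counts anomalies among the most recent sequences. The following sketch shows one way such state could be maintained, assuming standard containers instead of the Array<int> template. SlidingWindow and LocalityFrame are hypothetical names, and lf_size is assumed to be at least 1.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical helper: keeps the last seq_len values of one data stream.
class SlidingWindow {
public:
    explicit SlidingWindow(std::size_t seq_len) : seq_len_(seq_len) {}

    // Appends a value and returns true once a full sequence is available.
    bool Append(int value) {
        window_.push_back(value);
        if (window_.size() > seq_len_) window_.pop_front();
        return window_.size() == seq_len_;
    }

    // Copy of the current window contents, oldest value first.
    std::vector<int> Current() const {
        return std::vector<int>(window_.begin(), window_.end());
    }

private:
    std::size_t seq_len_;
    std::deque<int> window_;
};

// Hypothetical helper: counts anomalies among the last lf_size sequences.
class LocalityFrame {
public:
    explicit LocalityFrame(std::size_t lf_size)
        : flags_(lf_size, 0), next_(0), count_(0) {}

    // Records whether the newest sequence is anomalous and returns the
    // number of anomalous sequences currently inside the frame.
    int Record(bool is_anom) {
        count_ -= flags_[next_];           // drop the flag that falls out
        flags_[next_] = is_anom ? 1 : 0;   // store the newest flag
        count_ += flags_[next_];
        next_ = (next_ + 1) % flags_.size();
        return count_;
    }

private:
    std::vector<int> flags_;  // circular buffer of 0/1 anomaly flags
    std::size_t next_;        // slot that the next flag overwrites
    int count_;               // running sum of flags_
};
```

Keeping a running count makes each update constant time, at the cost of storing one flag per frame slot.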

* *
* STIDE: Sequence Time-Delay Embedding v1.1 *
* *
* Written by Steve Hofmeyr 7/21/96 *
* Revised by Julie Rehmeyer 3/98 *
* *
* Copyright (C) 1996, 1998 Regents of the University of New Mexico. *
* All Rights Reserved. *
* *
* This program is free software; you can redistribute it and/or *
* modify it under the terms of the GNU General Public License as *
* published by the Free Software Foundation; either version 2 of *
* the License, or (at your option) any later version. *
* *
* This program is distributed in the hope that it will be useful, *
* but WITHOUT ANY WARRANTY; without even the implied warranty of *
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the *
* GNU General Public License for more details. *
* *
* You should have received a copy of the GNU General Public *
* License along with this program; if not, write to the Free *
* Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, *
* USA. *
* *
#include <stdlib.h>
#include <string.h>
#include <iostream.h>
#include <fstream.h>
#include "../Utils/arrays.h"
#include "../Utils/hash.h"
#include "seq_config.h"
#include "seq_stream.h"
#include "flexitree.h"
#define DBREV 1
int counter = 0;
Stream *GetReadyStream(Array<Stream> &streams, HashTableInt
&sid_table, int &num_streams_fnd, int
&total_pairs_read, const Config &cfg);
int ReadDB(SeqForest &db_forest, const string &db_name,
int &seq_len);
void WriteDB(const SeqForest &db_forest, const string &db_name, const
int db_size, const int seq_len);
void FinalReport(const Config &cfg, const SeqForest &normal, const int
num_streams_fnd, const int num_seqs_added, const
Array<Stream> &streams, const int db_size);
void WriteDBStats(const SeqForest &db_forest, ostream &out_stream,
const int db_size);
void OutputGraph(const SeqForest &db_forest, string db_name);
int GetPrimeLargerThan(const int n);
* main() *
* Input: int argc: Number of command-line arguments *
* char *argv[]: array of strings containing *
* command-line arguments *
* Output: 0 if successful, -1 if unsuccessful *
int main(int argc, char *argv[])
Config cfg((const int) argc, (const char **) argv);
// Declare configuration object and do
// the configuration on the basis of the
// command line arguments and the
// configuration file
Stream *active_stream; // This will point to the stream that
// currently has a sequence to be worked
// on (either added to the database or
// compared).
HashTableInt sid_table(GetPrimeLargerThan(cfg.max_streams));
// Hash table relating external stream ids to
// internal sids; make size of table
// smallest prime larger than the number
// of streams
SeqForest normal(cfg.max_elements); // Uninitialized forest of
// normal sequences
Array<Stream> streams(cfg.max_streams); // Array of stream objects,
// one for each data stream
// in input, which are
// allocated as needed
int num_streams_fnd = 0; // Number of data streams
// encountered to date
int total_pairs_read = cfg.pair_offset; // Number of pairs read from
// input to date from all
// the data streams combined
// -- can be offset using
// the "-n" switch
int db_size; // Total number of unique
// sequences in the database
int init_db_size = 0; // Number of unique
// sequences in the
// pre-existing database
// Read database into normal, if database exists
db_size = init_db_size = ReadDB(normal, cfg.db_name, cfg.seq_len);
if (cfg.add_to_db) {
while ((active_stream =
GetReadyStream(streams, sid_table, num_streams_fnd,
total_pairs_read, cfg))
!= NULL) {
active_stream->AddToDB(normal, db_size, total_pairs_read, cfg);
WriteDB(normal, cfg.db_name, db_size, cfg.seq_len);
if (cfg.output_graph) {
else {
int i = 0;
while ((active_stream =
GetReadyStream(streams, sid_table, num_streams_fnd,
total_pairs_read, cfg))
!= NULL) {
active_stream->CompareSeq(cfg, normal, total_pairs_read);
FinalReport(cfg, normal, num_streams_fnd, db_size - init_db_size,
streams, db_size);
* GetReadyStream() *
* This function reads a pair from the input, appends the element *
* to the current sequence string in the appropriate data stream, *
* finds out if that data stream has a complete sequence to be *
* processed, continues until it has found such a data stream, and *
* returns a pointer to it. It updates num_streams_fnd, *
* total_pairs_read, sid_table, and streams. *
* *
* Input: Array<Stream> &streams: the array of streams that we have *
* found so far *
* HashTableInt &sid_table: hash table relating external sids *
* to internal sids *
* int &num_streams_fnd: the number of streams found so far; *
* int &total_pairs_read: the number of pairs read from the *
* input stream so far *
* const Config &cfg: configuration information *
* *
* Output: a pointer to the next stream that is ready for processing *
Stream *GetReadyStream(Array<Stream> &streams, HashTableInt
&sid_table, int &num_streams_fnd, int
&total_pairs_read, const Config &cfg)
Stream *ready_stream = NULL;
int ext_sid;
int int_sid;
int sval;
cin >> ext_sid;
while (!cin.eof()) {
if (ext_sid == -1) {
int_sid = sid_table.ExtToInt(ext_sid, num_streams_fnd);
cin >> sval;
// Update num_streams_fnd, if necessary
if (int_sid >= num_streams_fnd) {
if (int_sid >= cfg.max_streams) { // valid indices are 0 .. max_streams - 1
cerr<<"ERROR: Too many streams to follow, aborting..."<<endl;
// We need a new stream object
streams[num_streams_fnd].Init(cfg, int_sid, ext_sid);
num_streams_fnd = int_sid + 1;
if (streams[int_sid].Ready()) {
ready_stream = &streams[int_sid];
cin >> ext_sid;
return ready_stream;
* ReadDB() *
* Reads the database from a file and returns the number of unique *
* sequences in the database. Checks for appropriate revision *
* number. If it is a revision DBREV database, the second line *
* will be "#DBseq_len: " followed by the sequence length. The *
* next line will contain a single number, giving the root of the *
* first tree. The following lines will contain the tree itself. *
* The first seq_len numbers make up the first sequence (so the *
* first number of the second line will be the same as the number *
* on the first line). The next number will be a negative number *
* between -(seq_len-1) and -2, indicating how far to backtrack in *
* the first sequence, and the following positive numbers give the *
* rest of the second sequence. So, for example, -3 would mean *
* backtrack 3 numbers, take the previous numbers including the *
* one you're on, and append the next two numbers. So after the *
* -3 you would find two positive numbers, followed by a negative *
* number (which you would use the same way as you used the -3, on *
* the most recent sequence). Each tree is terminated by the *
* number -1. So the sample input file *
* 3 *
* 3 4 2 9 10 3 -4 3 9 8 -2 3 -3 4 9 -1 *
* 2 *
* 2 3 4 5 6 7 -3 2 9 -1 *
* yields the sequences: *
* 3 4 2 9 10 3 *
* 3 4 2 3 9 8 *
* 3 4 2 3 9 3 *
* 3 4 2 3 4 9 *
* 2 3 4 5 6 7 *
* 2 3 4 5 2 9 *
* *
* Input: SeqForest &db_forest Forest of sequences *
* const string &db_name Name of database *
* int &seq_len User-specified sequence length *
* *
* Output: the number of unique sequences in the database *
* *
int ReadDB(SeqForest &db_forest, const string &db_name,
int &seq_len)
ifstream in_db_file(db_name.c_str()); // file to read the database from
int db_size = 0; // size of the database
int root; // the first element of the sequences
// we are reading in at the moment;
// i.e., the root of this tree
string buff;
int db_seq_len;
int rev_num;
if (!in_db_file.is_open()) {
cerr<<"WARNING: Cannot open database file " << db_name
<< " for input"<<endl<<"Creating a new file"<<endl;
return 0;
// Check to see if the first line contains "#DBrev:"
if (buff == "#DBrev:") {
if (rev_num > DBREV) {
cerr << "ERROR: The revision number is greater than " << DBREV
<< ". This version of STIDE is only capable of dealing "
<< "with databases through DBrev " << DBREV
<< ". Aborting..."<<endl;
if (rev_num < DBREV) {
cerr << "ERROR: Revision number of database must be >= " << DBREV
<< endl;
// Now we know that it is revision DBREV. Check sequence length of
// database against user-indicated sequence length
// Now check to see if next line is "#DBseq_len: " followed by a
// number
if (buff != "#DBseq_len:") {
cerr << "ERROR: The second line of the database does not "
<< "contain the string \"#DBseq_len: \"" << endl
<< "followed by the sequence length of the database, as "
<< "required of revision " << DBREV
<< " databases. Aborting..."<< endl;
if (db_seq_len != seq_len) {
cerr << "WARNING: Database sequence length is " << db_seq_len
<< ", which does not match "
<< "sequence length specified" << endl
<< "by user (or by default if no specification was given), "
<< "which is " << seq_len << endl
<< "I will use the database sequence length. If that is "
<< "not what you intended, type Ctrl-C to abort." << endl;
seq_len = db_seq_len;
// Read next number into root
in_db_file >> root;
// Otherwise, we assume we have an old-style database, and let the
// user know that that's our assumption
else {
cerr << "WARNING: The string \"#DBrev:\" is not in the first "
<< "line of the database." << endl
<< "I'm assuming that it's an older style of database, and "
<< "will read it in" << endl
<< "based on that assumption. If that is not what you want "
<< "me to do, type CTRL-C" << endl << endl;
// we have just read the first root into buff -- put it in root
// instead
root = atoi(buff.c_str());
while (!in_db_file.eof()) {
if (root == -1) break;
db_size += db_forest.trees[root].NumLeaves();
return db_size;
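The backtracking format described in the ReadDB() header can be decoded with a few lines of code. The sketch below expands the token list of a single tree into its full sequences, assuming the tokens have already been read into a std::vector<int> and that a marker of -k means "keep the first seq_len - (k - 1) values of the most recent sequence and read k - 1 new ones", which is how the sample in the header works out. DecodeTree is a hypothetical name; the real code stores the result in a SeqForest instead.

```cpp
#include <cstddef>
#include <vector>

// Expands one tree of the database file into its sequences.
// tokens: the numbers of one tree line, ending with the terminator -1.
// Decoding stops early, keeping what was decoded so far, on malformed input.
std::vector<std::vector<int> > DecodeTree(const std::vector<int>& tokens,
                                          std::size_t seq_len) {
    std::vector<std::vector<int> > seqs;
    if (seq_len == 0 || tokens.size() < seq_len) return seqs;

    // The first seq_len tokens are the first sequence, written out in full.
    std::vector<int> cur(tokens.begin(), tokens.begin() + seq_len);
    seqs.push_back(cur);

    std::size_t pos = seq_len;
    while (pos < tokens.size() && tokens[pos] != -1) {
        int marker = tokens[pos++];
        if (marker > -2) break;  // a backtrack marker must be <= -2
        // A marker of -k introduces k - 1 new values.
        std::size_t fresh = static_cast<std::size_t>(-(marker + 1));
        if (fresh > seq_len || fresh > tokens.size() - pos) break;
        cur.resize(seq_len - fresh);  // keep the shared prefix
        for (std::size_t i = 0; i < fresh; ++i) cur.push_back(tokens[pos++]);
        seqs.push_back(cur);
    }
    return seqs;
}
```

Applied to the first sample tree from the header with seq_len 6, this yields the four sequences listed there that start with 3.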
* WriteDB() *
* Writes db_forest to the file db_name, with the format described *
* in the header of ReadDB(). Prints database statistics at the *
* end of the file. *
* *
* Input: const SeqForest &db_forest Forest of sequences in *
* database *
* const string &db_name Name of file in which to *
* put database. *
* const int db_size Number of unique sequences *
* in the database *
* const int seq_len Sequence length *
* *
* Output: none *
void WriteDB(const SeqForest &db_forest, const string &db_name, const
int db_size, const int seq_len)
ofstream out_db_file(db_name.c_str());
if (!out_db_file.is_open()) {
cerr << "ERROR: Cannot open database file " << db_name
<< " for output, aborting..." << endl ;
out_db_file << "#DBrev: " << DBREV << endl;
out_db_file << "#DBseq_len: " << seq_len << endl;
for (int i = 0; i < db_forest.trees.Size(); i++) {
if (db_forest.trees_found[i]) {
out_db_file<<" -1"<<endl;
// we can now write anything, so I will write the db stats
out_db_file<<"; DB STATS"<<endl;
WriteDBStats(db_forest, out_db_file, db_size);
* FinalReport() *
* Reports data at end of run. The number of streams, the number *
* of input pairs, and the number of sequences in the input are *
* always reported. If we have done a comparison run, we report *
* the number of anomalies, and the percentage of sequences that *
* were anomalous. Additionally, if asked for, the Hamming *
* distance or locality frame count is reported. If we have added *
* to the database, we report having done so and report the number *
* of sequences added. If database statistics are asked for, we *
* report the number of nodes, the number of unique sequences, the *
* number of branches, and the average database branch factor. *
* *
* Input: const Config &cfg: Configuration information *
* const SeqForest &normal: DB of normal sequences *
* const int num_streams_fnd: Total number of streams found*
* const int num_seqs_added: Number of unique sequences *
* added *
* const Array<Stream> &streams: Array of data streams *
* const int db_size: Number of unique sequences *
* in DB *
* *
* Output: none *
* *
void FinalReport(const Config &cfg, const SeqForest &normal, const int
num_streams_fnd, const int num_seqs_added, const
Array<Stream> &streams, const int db_size)
int total_pairs = 0;
int total_seqs = 0;
int total_anoms = 0;
int total_max_lfc = 0;
int total_max_hdist = 0;
int db_nodes = 0;
int db_seqs = 0;
int db_branches = 0;
int j;
// Sum up number of pairs input and number of seqs from all the streams
for (j = 0; j < num_streams_fnd; j++) {
total_seqs += streams[j].GetNumSeqsFnd();
total_pairs += streams[j].GetNumPairsRead();
cout << endl;
cout << "Number of different streams in input = "
<< num_streams_fnd << endl;
cout << "Total number of input pairs = "
<< total_pairs << endl;
cout << "Total number of sequences in input = "
<< total_seqs << endl;
if (cfg.add_to_db) {
cout << "File added to database" << endl;
cout << "Number of new sequences added to the database: "
<< num_seqs_added << endl;
else {
cout << "Scan completed" << endl;
// Sum up number of anomalies from all the streams
for (j = 0; j < num_streams_fnd; j++) {
total_anoms += streams[j].GetNumAnoms();
cout << "Number of anomalies = "
<< total_anoms << endl;
cout << "Percentage anomalous = "
<< ((float)total_anoms * 100.0)/total_seqs << endl;
// If asked for, compute Hamming distances across streams and report
if (cfg.compute_hdist) {
for (j = 0; j < num_streams_fnd; j++) {
if (streams[j].GetMaxHDist() > total_max_hdist) {
total_max_hdist = streams[j].GetMaxHDist();
cout << "Largest minimum Hamming distance = "
<< total_max_hdist << endl;
// If asked for, compute lfc across streams and report
if (cfg.lf_size > 1) {
for (j = 0; j < num_streams_fnd; j++) {
if (streams[j].GetMaxLFC() > total_max_lfc) {
total_max_lfc = streams[j].GetMaxLFC();
cout << "Maximum lfc = " << total_max_lfc << endl;
// If asked for, compute db stats and report
if (cfg.write_db_stats) {
WriteDBStats(normal, cout, db_size);
* WriteDBStats() *
* Computes and writes to out_stream the number of nodes in *
* the database, the number of unique sequences, the number of *
* branches, and the average database branch factor. *
* *
* Input: const SeqForest &db_forest Forest of sequences in *
* database *
* ostream &out_stream Where to write info *
* const int db_size Number of unique sequences in the *
* database *
* *
* Output: none *
void WriteDBStats(const SeqForest &db_forest, ostream &out_stream,
const int db_size)
int db_nodes = 0;
int db_branches = 0;
for (int i = 0; i < db_forest.trees.Size(); i++) {
if (db_forest.trees_found[i]) {
db_nodes += db_forest.trees[i].NumNodes();
db_branches += db_forest.trees[i].NumBranches();
out_stream << "Number of DB nodes = " << db_nodes << endl;
out_stream << "Number of unique sequences = "<<db_size << endl;
out_stream << "Number of branches (edges) = "<<db_branches << endl;
out_stream << "Average DB branch factor = "
<<((float)db_branches/(db_nodes - db_size))<<endl;
* OutputGraph() *
* Writes a file containing input for the program Dot. *
* Running Dot on that file produces a PostScript file *
* containing a picture of the whole database tree. *
* *
* Input: const SeqForest &db_forest Forest of sequences in *
* database *
* const string db_name Filename to use *
* *
* Output: none *
void OutputGraph(const SeqForest &db_forest, const string db_name)
char *dot_filename;
dot_filename = new char [strlen(db_name.c_str())+5]; // 4 for ".dot" plus 1 for the terminating NUL
strcpy(dot_filename, db_name.c_str());
ofstream output_file(strcat(dot_filename,".dot"));
output_file<<"digraph \""<<db_name<<"\" {"<<endl;
output_file<<" ratio=auto;"<<endl;
output_file<<" page=\"8.5,11\";"<<endl;
for (int i = 0; i < db_forest.trees.Size(); i++) {
if (db_forest.trees_found[i])
* GetPrimeLargerThan(int n) *
* Returns the smallest prime larger than the input integer. *
* Changes no values. *
* *
* Input: const int n *
* Output: smallest prime larger than n *
int GetPrimeLargerThan(const int n)
Array<int> primes(n + 2); // int primes[n] is not standard C++ and is too small when n < 2
int primes_fnd = 1;
int curr_num = 3;
int is_prime = 1;
primes[0] = 2;
while(1) {
for (int i = 0; i < primes_fnd; i++) {
if ((curr_num % primes[i]) == 0) {
is_prime = 0;
if (is_prime == 1) {
primes[primes_fnd++] = curr_num;
if (curr_num > n) {
curr_num = curr_num + 2;
is_prime = 1;
return curr_num;
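GetPrimeLargerThan() sizes the stream id hash table with the smallest prime above the number of streams. The following is a compact sketch of the same idea using trial division and no auxiliary array. IsPrime and NextPrimeAbove are hypothetical names, this is not the routine used above, and n is assumed to be far enough below INT_MAX that the search cannot overflow.

```cpp
// True if m is prime, by trial division up to sqrt(m).
bool IsPrime(int m) {
    if (m < 2) return false;
    for (int d = 2; d <= m / d; ++d) {  // d <= m / d avoids overflow of d * d
        if (m % d == 0) return false;
    }
    return true;
}

// Smallest prime strictly larger than n.
int NextPrimeAbove(int n) {
    int candidate = (n < 2) ? 2 : n + 1;
    while (!IsPrime(candidate)) ++candidate;
    return candidate;
}
```

Trial division is more than fast enough for table sizes in the hundreds or thousands, and it needs no storage for previously found primes.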

#include "../Utils/"
#include "../Utils/"
#include "../Utils/"
#include "seq_stream.h"
#include "flexitree.h"
#include "opt_info.h"
template class List<FlexiTree>;
template class LLNode<FlexiTree>;
template class LinkedList<FlexiTree>;
template class Array<FlexiTree>;
template class List<int>;
template class LinkedList<int>;
template class Array<int>;
template class Array<LinkedList<int> >;
template class List<HashItem>;
template class LLNode<HashItem>;
template class LinkedList<HashItem>;
template class Array<LinkedList<HashItem> >;
template class List<HashItemInt>;
template class LLNode<HashItemInt>;
template class LinkedList<HashItemInt>;
template class Array<LinkedList<HashItemInt> >;
template class Array<HashItemInt>;
template class Array<Stream>;
template class Array<char*>;
template class Array<OptInfo>;

// **********
// **********
#include <iostream.h>
#include <assert.h>
#include <stdlib.h> // exit()
#include "arrays.h"
template <class T> void Array<T>::Init(const Array<T> &t) {
assert(size == t.size);
for (int i = 0; i < size; i++)
data[i] = t.data[i];
template <class T> void Array<T>::Allocate(int as) {
// if previously allocated, delete old dynamic array
if (size) delete[] data;
size = as;
data = new T[size];
template <class T> Array<T>::~Array() {
delete[] data;
template <class T> T &Array<T>::operator[](int i) const {
if (i < 0) { cout<<"ERROR in []: "<<i<<"< 0"<<endl; exit(-1); }
if (i >= size) { cout<<"ERROR in []: "<<i<<" >= "<<size<<endl; exit(-1); }
assert(i >= 0);
assert(i < size);
return data[i];
template <class T> T &Array<T>::Data(int i) {
if (i < 0) { cout<<"ERROR in Data: "<<i<<"< 0"<<endl; exit(-1); }
if (i >= size) { cout<<"ERROR in Data: "<<i<<" >= "<<size<<endl; exit(-1); }
assert(i >= 0);
assert(i < size);
return data[i];
template <class T> Array<T> &Array<T>::operator = (const Array<T> &t) {
if (!size) Allocate(t.size); // if the object is not yet allocated, do it and then assign
assert(size == t.size);
for (int i = 0; i < size; i++)
data[i] = t.data[i];
return *this;
template <class T> int Array<T>::Size() const {
return size;
template <class T> ostream &operator<<(ostream &s, const Array<T> &t) {
for (int i =0; i < t.size; i++)
s<<t.data[i]<<" ";
return s;
template <class T> void Array<T>::Set(T t) {
for (int i =0; i < size; i++)
data[i] = t;
// HeapSort data[0..size-1] DESCENDING
template <class T> void SortableArray<T>::Sort() {
// build the heap
for (int i = Size()-1; i >= 0; i--)
Adjust(i, Size()-1);
for (int i = Size()-1; i >= 1; i--) {
// swap data
T temp1 = Data(0);
Data(0) = Data(i);
Data(i) = temp1;
Adjust(0, i-1);
template <class T> void SortableArray<T>::Adjust(int root, int last) {
if (2*root <= last) {
int child = 2*root;
if ((child+1) <= last) {
if (Data(child+1) < Data(child))
if (Data(child) < Data(root)) {
T temp = Data(root);
Data(root) = Data(child);
Data(child) = temp;
Adjust(child, last);
// HeapSort data[0..size-1] DESCENDING
template <class T, class C> void CompSortableArray<T, C>::Sort() {
// build the heap
int sz = Size();
for (int i = sz-1; i >= 0; i--) {
Adjust(i, sz-1);
for (int i = sz-1; i >= 1; i--) {
// do the swap
T temp1 = Data(0);
Data(0) = Data(i);
Data(i) = temp1;
Adjust(0, i-1);
template <class T, class C> void CompSortableArray<T, C>::Adjust(int root,
int last) {
if (2*root <= last) {
int child = 2*root;
if ((child+1) <= last) {
if (comp_ptr->Compare(Data(child+1), Data(child)) == -1) {
if (comp_ptr->Compare(Data(child), Data(root)) == -1) {
T temp = Data(root);
Data(root) = Data(child);
Data(child) = temp;
Adjust(child, last);
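The Sort() methods above sort in descending order by building a min-heap and repeatedly moving the smallest element to the end. The sketch below shows that technique on a std::vector, assuming the conventional 0-based child indices 2*i + 1 and 2*i + 2 rather than the indexing used in Adjust(). SiftDown and HeapSortDescending are hypothetical names, and the element type only needs operator<.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Restores the min-heap property for the subtree rooted at root,
// considering only data[0..last].
template <class T>
void SiftDown(std::vector<T>& data, std::size_t root, std::size_t last) {
    for (;;) {
        std::size_t child = 2 * root + 1;
        if (child > last) return;  // root is a leaf
        // Pick the smaller of the two children.
        if (child + 1 <= last && data[child + 1] < data[child]) ++child;
        if (!(data[child] < data[root])) return;  // heap property holds
        std::swap(data[root], data[child]);
        root = child;
    }
}

// Heap sort into DESCENDING order using a min-heap.
template <class T>
void HeapSortDescending(std::vector<T>& data) {
    if (data.size() < 2) return;
    std::size_t last = data.size() - 1;
    // Build the heap from the last internal node up to the root.
    for (std::size_t i = data.size() / 2; i-- > 0;) SiftDown(data, i, last);
    // Move the current minimum to the end, then shrink the heap.
    for (std::size_t i = last; i >= 1; --i) {
        std::swap(data[0], data[i]);
        SiftDown(data, 0, i - 1);
    }
}
```

An iterative sift-down is used here instead of recursion; the ordering it produces is the same.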

// ********
// ********
#ifndef ARRAYS_H
#define ARRAYS_H
//#define PC
#include <iostream.h>
#include "errors.h"
// this is a template for all classes which use an array of objects
// it is dynamically allocated
template <class T> class Array {
Array() {Init();}
Array(const Array<T> &t) {Init(t);} // the copy constructor
Array(int asize) {Init(); Allocate(asize);} // creates an array of size "asize"
~Array(); // the destructor deletes all internal
// objects
void Allocate(int asize); // allocates asize objects
// if data was already allocated,
// deletes and re-allocates
T &operator[](int i) const;
Array<T> &operator = (const Array<T> &t);
// copies one array to another,
// requires that the assignment
// operator be defined for array
// elements
void Set(T t); // sets all elements to t
int Size() const; // returns the size of the array
friend ostream &operator<<(ostream &s, const Array<T> &t);
T &Data(int i); // method derived class can use for
// accessing data
void Init() {size = 0; data = NULL;}// default initializer
void Init(const Array<T> &t); // implements copy constructor
int size; // the size of the array
T *data; // ptr to the array of objects
// this is a template for sortable arrays of objects, i.e. the objects provide
// a less than comparison operator, which is used in the Sort method to perform
// a heap sort
template <class T> class SortableArray : public Array<T> {
SortableArray() {Init();}
SortableArray(const SortableArray<T> &t) {Init(t);}
SortableArray(int asize) {Allocate(asize);}
void Sort(); // performs a heapsort on the data,
// using the < operator
void Adjust(int root, int last); // for the heap sort
// this is a template for sortable arrays of objects, but the comparison
// operator is provided by another class C
template <class T, class C> class CompSortableArray : public Array<T> {
CompSortableArray() {Init();}
CompSortableArray(int asize, C *c_ptr) {Allocate(asize); comp_ptr = c_ptr;}
void Sort(); // performs a heapsort on the data,
// using comp_ptr->Compare
C *comp_ptr; // a ptr to the object with the Compare
// method
void Adjust(int root, int last);
// this is a template for a multidimensional array of one type of object
// when declaring this one must specify the number of dimensions first,
// followed by the size for each array dimension
template <class T> class MultiArray {
MultiArray() {Init();}
MultiArray(const MultiArray<T> &t) {Init(t);} // the copy constructor
MultiArray(int dims, int x, ...); // a variable number of parameters
{Init(); Allocate(xsize, ysize);}
~Array2D(); // the destructor deletes all internal
// objects
void Allocate(int xsize, ...); // allocates x, y, ... size array
// if data was already allocated,
// deletes and re-allocates
T Data(int x, int y); // returns object in x,y location
Array2D<T> &operator = (const Array2D<T> &t);
// copies one array to another,
// requires that the assignment
// operator be defined for array
// elements
void Set(T t); // sets all elements to t
int XSize(); // returns the x size of the array
int YSize(); // returns the y size of the array
friend ostream &operator<<(ostream &s, const Array2D<T> &t);
T &Data(int i); // method derived class can use for
// accessing data
void Init() {size = 0; data = NULL;}// default initializer
void Init(const Array<T> &t); // implements copy constructor
Array<Array<T> > data;

@ -0,0 +1,16 @
This class implements strings. It is meant to offer all the functionality
of strings in C, so whenever a C function is needed that manipulates strings,
it must be coded into this.
class String {
String(void) {data = NULL; dsize = 0;}
String(char *init) {dsize = strlen(init); data = new char[dsize + 1]; strcpy(data, init);} // +1 for the terminating NUL
~String(void) {delete[] data;} // array form of delete; safe when data is NULL
char *data;
int dsize;

@ -0,0 +1,20 @
// **********
// **********
#include <stdio.h>
#include <stdlib.h>
#include <iostream.h>
#include "errors.h"
void Error(const char *msg, ...) {
char buffer[150];
va_list ap;
va_start(ap, msg);
vsnprintf(buffer, sizeof(buffer), msg, ap); // bounded: cannot write past buffer

@ -0,0 +1,16 @
// ********
// ********
#ifndef ERRORS_H
#define ERRORS_H
#include <stdarg.h>
#include <assert.h>
// this function takes a formatted character string and params like printf,
// prints a formatted message, and then aborts the program. It's used for
// trapping errors and halting execution.
void Error(const char *msg, ...);
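A printf-style error routine such as Error() has to format its variable arguments into a fixed buffer without overflowing it. The sketch below shows the formatting step only, assuming a C++11 standard library that provides std::vsnprintf. FormatError is a hypothetical helper that returns the message instead of printing it and aborting, so that the behaviour can be checked in isolation.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Formats like printf into a fixed 150-byte buffer. std::vsnprintf truncates
// the output instead of writing past the end of the buffer.
std::string FormatError(const char* fmt, ...) {
    char buffer[150];
    va_list ap;
    va_start(ap, fmt);
    std::vsnprintf(buffer, sizeof(buffer), fmt, ap);
    va_end(ap);  // every va_start needs a matching va_end
    return std::string(buffer);
}
```

A caller that wants the abort-on-error behaviour described above could print the returned string to the error stream and then call exit().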

@ -0,0 +1,182 @
// hash.cpp
#include "hash.h"
HashItem::HashItem(char *s, int v) {
if (strlen(s) >= STR_LEN) { // str holds at most STR_LEN - 1 characters plus the NUL
cout<<endl<<"Hash item string too long";
strcpy(str, s); value = v;
void HashItem::Set(char *s, int v) {
if (strlen(s) >= STR_LEN) { // str holds at most STR_LEN - 1 characters plus the NUL
cout<<endl<<"Hash item string too long";
strcpy(str, s); value = v;
void HashTable::Insert(HashItem &h_item) {
int hash_index = HashFunc(h_item.str);
#ifdef DBG
int HashTable::Retrieve(HashItem &h_item) {
int hash_index = HashFunc(h_item.str);
HashItem *temp_item_ptr = data[hash_index].Search(h_item);
if (!temp_item_ptr) return 0;
else {
h_item = *temp_item_ptr;
return 1;
unsigned HashTable::HashFunc(char *str) {
unsigned k = 0;
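for (int i = 0; i < (int)strlen(str); i++) {
// wrap the shift so it never reaches the width of unsigned, which would be undefined
k += (unsigned)(unsigned char)str[i] << ((i % sizeof(unsigned)) * 8);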
return (k % data.Size());
ostream &operator<<(ostream &s, HashTable &ht) {
for (int i = 0; i < ht.data.Size(); i++) {
if (!ht.data[i].Empty()) { ht.data[i].Write(s);
return s;
// for int hash tables
HashItemInt &HashItemInt::operator = (const HashItemInt &h_item) {
key = h_item.key;
value = h_item.value;
return *this;
void HashTableInt::Insert(HashItemInt &h_item) {
int hash_index = HashFunc(h_item.key);
int HashTableInt::Retrieve(HashItemInt &h_item) {
int hash_index = HashFunc(h_item.key);
HashItemInt *temp_item_ptr;
temp_item_ptr = data[hash_index].Search(h_item);
if (!temp_item_ptr) return 0;
else {
h_item = *temp_item_ptr;
return 1;
unsigned HashTableInt::HashFunc(int key) {
return ((unsigned)key % (unsigned)data.Size()); // unsigned arithmetic keeps the index in range for negative keys
ostream &operator<<(ostream &s, HashTableInt &ht) {
for (int i = 0; i < ht.data.Size(); i++) {
if (!ht.data[i].Empty()) { ht.data[i].Write(s);
return s;
void HashTableInt::PutInArray(Array<HashItemInt> &h_array, int &num_items) {
num_items = 0;
HashItemInt h_item;
for (int i = 0; i < data.Size(); i++) {
if (!data[i].Empty()) { // now iterate through the linked list
int start = 1;
while (data[i].GetNext(h_item, start)) {
h_array[num_items].Set(h_item.key, h_item.value);
start = 0;
int HashTableInt::ExtToInt(int key, int next_value)
HashItemInt h_item(key, 0);
// Check to see if we know this one. 0 matches any number. If we
// do know this one, h_item.value gets set to what we knew it to be.
if (!Retrieve(h_item)) {
h_item.Set(key, next_value);
return (h_item.value);
// to test out the hash table
#include <stdlib.h>
#include <string.h>
#include <fstream.h>
#define MAX_SYS_CALLS 255
int GetCalls(HashTable &ht) {
ifstream calls_file("calls.txt");
char buff[255];
int buff_len;
int num_sys_calls = 0;
HashItem h_item;
while (!calls_file.eof() && num_sys_calls < MAX_SYS_CALLS) {
calls_file.getline(buff, 254);
buff_len = strlen(buff);
if (buff_len) {
// cat on a parenth to make sure only calls are matched
strcat(buff, "(");
#ifdef DBG
cout<<endl<<buff; cout.flush();
h_item.Set(buff, num_sys_calls);
if (num_sys_calls == MAX_SYS_CALLS) return 0;
else return 1;
int main(void) {
HashTable hashtable(701);
HashItem h_item("unlink(", 0);
if (GetCalls(hashtable)) {
h_item.Set("unlink(", 0);
if (hashtable.Retrieve(h_item))
cout<<endl<<" unlink found, index = "<<h_item.value;
else cout<<endl<<" unlink not found";
h_item.Set("get_kernel_syms(", 0);
if (hashtable.Retrieve(h_item))
cout<<endl<<" get_kernel_syms found, index = "<<h_item.value;
else cout<<endl<<" get_kernel_syms not found";
h_item.Set("hello(", 0);
if (hashtable.Retrieve(h_item))
cout<<endl<<" hello found, index = "<<h_item.value;
else cout<<endl<<" hello not found";
h_item.Set("setsockopt(", 0);
if (hashtable.Retrieve(h_item))
cout<<endl<<" setsockopt found, index = "<<h_item.value;
else cout<<endl<<" setsockopt not found";
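HashTable::HashFunc() folds the characters of a key into an unsigned value and reduces it modulo the table size. The sketch below shows one way to write that fold so that the shift amount stays below the width of unsigned for keys of any length. HashString is a hypothetical name, 8-bit characters are assumed, and table_size must be greater than zero.

```cpp
#include <cstddef>

// Folds the bytes of str into an unsigned value, placing byte i at bit
// offset (i mod sizeof(unsigned)) * 8, then reduces modulo table_size.
unsigned HashString(const char* str, unsigned table_size) {
    unsigned k = 0;
    for (std::size_t i = 0; str[i] != '\0'; ++i) {
        unsigned shift = static_cast<unsigned>(i % sizeof(unsigned)) * 8;
        // Go through unsigned char so bytes >= 0x80 do not sign-extend.
        k += static_cast<unsigned>(static_cast<unsigned char>(str[i])) << shift;
    }
    return k % table_size;
}
```

With a prime table size such as the 701 used in the test program above, the final modulo spreads keys that differ only in their later characters across different buckets.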

@ -0,0 +1,71 @
#ifndef __HASH_H
#define __HASH_H
#define STR_LEN 100
#include <string.h>
#include "arrays.h"
#include "tll.h"
//#define DBG
class HashItem {
HashItem(void) {strcpy(str, ""); value = 0;}
HashItem(char *s, int v);
void Set(char *s, int v);
int operator == (const HashItem &h_item) {return !strcmp(str, h_item.str);}
friend ostream &operator<<(ostream &s, HashItem &h_item) {
s<<h_item.str<<":"<<h_item.value; return s;}
int value;
char str[STR_LEN];
class HashTable {
HashTable(int size) {data.Allocate(size);}
void Insert(HashItem &h_item); // we insert a complete item, i.e. the
// str and its assoc.
int Retrieve(HashItem &h_item); // returns 0 if item is not found
// we retrieve a complete item, the value
// of assoc is not specified beforehand,
// and is returned in h_item
friend ostream &operator<<(ostream &s, HashTable &ht);
Array<LinkedList<HashItem> > data;
unsigned HashFunc(char *str);
// these store ints, not strings
class HashItemInt {
HashItemInt(void) {key = 0; value = 0;}
HashItemInt(int k, int v) {key = k; value = v;}
// the copy constructor
HashItemInt(const HashItemInt &h_item) {key = h_item.key; value = h_item.value;}
void Set(int k, int v) {key = k; value = v;}
int operator == (const HashItemInt &h_item)
{return ((key == h_item.key) ? 1 : 0);}
HashItemInt &operator = (const HashItemInt &h_item);
friend ostream &operator<<(ostream &s, HashItemInt &h_item) {
s<<h_item.key<<":"<<h_item.value; return s;}
int value, key;
class HashTableInt {
HashTableInt(int size) {data.Allocate(size);}
void Insert(HashItemInt &h_item); // we insert a complete item, i.e. the
// str and its assoc.
int Retrieve(HashItemInt &h_item); // returns 0 if item is not found
// we retrieve a complete item, the value
// of assoc is not specified beforehand,
// and is returned in h_item
friend ostream &operator<<(ostream &s, HashTableInt &ht);
void PutInArray(Array<HashItemInt> &h_array, int &num_items); // puts it into a linear array
int ExtToInt(int key, int next_value);
Array<LinkedList<HashItemInt> > data;
unsigned HashFunc(int key);

@ -0,0 +1,87 @
#include <stdio.h>
#include <stdlib.h>
#include <iostream.h>
#include "krand.h"
#include <time.h>
#define MBIG 1000000000L
#define MSEED 161803398L
#define FAC (1.0 / MBIG)
static int inext;
static int inextp;
static long ma[56];
double knuth_random(void) {
long mj;
if (++inext == 56) inext = 1;
if (++inextp == 56) inextp = 1;
mj = ma[inext] - ma[inextp];
if (mj < 0) mj += MBIG;
ma[inext] = mj;
return mj * FAC;
long seed_random(long seed) {
long mj, mk;
register int i, k;
if (seed < 0) {
time_t tp;
seed = time(&tp);
if (seed >= MBIG) {
cerr<<"Seed value too big (>= "<<MBIG<<") in seed_random().";
ma[55] = mj = seed;
mk = 1;
for (i = 1; i <= 54; i++) {
register int ii = (21 * i) % 55;
ma[ii] = mk;
mk = mj - mk;
if (mk < 0) mk += MBIG;
mj = ma[ii];
for (k = 0; k < 4; k++) {
for (i = 1; i <= 55; i++) {
ma[i] -= ma[1 + (i + 30) % 55];
if (ma[i] < 0) ma[i] += MBIG;
inext = 0;
inextp = 31;
return seed;
int krandom(int max) {
int retval = (int)(knuth_random() * max);
if (retval < 0 || retval >= max) {
cout<<"ERROR: random num generator out of bounds!"<<endl;
return retval;
//for testing
int main(void) {
for (int i = 0; i < 24; i++) cout<<i<<" "<<knuth_random()<<" "<<krandom(100)<<endl;
for (int i = 0; i < 24; i++) cout<<i<<" "<<knuth_random()<<" "<<krandom(100)<<endl;
for (int i = 0; i < 24; i++) cout<<i<<" "<<knuth_random()<<" "<<krandom(100)<<endl;
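The file above implements Knuth's subtractive random number generator with file-level static state. The sketch below wraps the same recurrence in a small class so that the state is explicit and two generators can coexist. The class name `KnuthRandom`, the method names, and reducing the seed modulo `kBig` (instead of rejecting large seeds) are assumptions made for illustration.

```cpp
#include <cassert>

namespace {
const long kBig = 1000000000L;  // same modulus as MBIG above
}

// Sketch of a subtractive generator with a 55-element lag table.
class KnuthRandom {
public:
    explicit KnuthRandom(unsigned long seed) { Seed(seed); }

    // Fill the table from the seed, then warm it up four times.
    void Seed(unsigned long seed) {
        long mj = static_cast<long>(seed % kBig);
        long mk = 1;
        ma_[0] = 0;  // slot 0 is never read, as in the original layout
        ma_[55] = mj;
        for (int i = 1; i <= 54; i++) {
            int ii = (21 * i) % 55;
            ma_[ii] = mk;
            mk = mj - mk;
            if (mk < 0) mk += kBig;
            mj = ma_[ii];
        }
        for (int k = 0; k < 4; k++) {
            for (int i = 1; i <= 55; i++) {
                ma_[i] -= ma_[1 + (i + 30) % 55];
                if (ma_[i] < 0) ma_[i] += kBig;
            }
        }
        inext_ = 0;
        inextp_ = 31;
    }

    // Next value in [0, 1).
    double Next() {
        if (++inext_ == 56) inext_ = 1;
        if (++inextp_ == 56) inextp_ = 1;
        long mj = ma_[inext_] - ma_[inextp_];
        if (mj < 0) mj += kBig;
        ma_[inext_] = mj;
        return mj * (1.0 / kBig);
    }

    // Integer in [0, max), for max > 0.
    int Below(int max) { return static_cast<int>(Next() * max); }

private:
    long ma_[56];
    int inext_;
    int inextp_;
};
```

Two generators constructed with the same seed produce the same sequence, which is convenient when an experiment has to be repeated exactly.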

@ -0,0 +1,13 @
#ifndef __KRAND_H
#define __KRAND_H
// declarations for the Knuth subtractive random number generator
double knuth_random(void);
long seed_random(long);
int krandom(int);

@ -0,0 +1,132 @
// linklist.cpp
#include "linklist.h"
// data structures:
// node for a linked list
class LLNode {
int val; // the value at this node
LLNode *next; // pointer to the next node
LinkedList::LinkedList(void) {
root = NULL;
LinkedList::LinkedList(const LinkedList &llist) { // the copy constructor
root = NULL;
LLNode *temp_ptr = llist.root;
while (temp_ptr) {
temp_ptr = temp_ptr->next;
LinkedList &LinkedList::operator = (const LinkedList &llist) {
root = NULL;
LLNode *temp_ptr = llist.root;
while (temp_ptr) {
temp_ptr = temp_ptr->next;
return *this;
LinkedList::~LinkedList(void) {
if (root) {
LLNode *temp_ptr = root->next, *next_temp_ptr;
delete root;
while (temp_ptr) {
next_temp_ptr = temp_ptr->next;
delete temp_ptr;
temp_ptr = next_temp_ptr;
// returns 1 if there was no copy, 0 otherwise
int LinkedList::Insert(int val) {
if (!root) {
root = new LLNode;
root->val = val;
root->next = NULL;
return 1;
} else {
if (!Search(val)) { // only put in if it is not already in - this is ineff.
LLNode temp_node;
temp_node.val = root->val;
temp_node.next = root->next;
root->val = val;
root->next = new LLNode;
root->next->val = temp_node.val;
root->next->next = temp_node.next;
return 1;
return 0;
int LinkedList::Search(int val) {
LLNode *curr_ptr = root;
while (curr_ptr) {
if (curr_ptr->val == val) return 1;
else curr_ptr = curr_ptr->next;
return 0;
void LinkedList::Write(ostream &s) {
LLNode *curr_ptr = root;
while (curr_ptr) {
s<<curr_ptr->val<<" ";
curr_ptr = curr_ptr->next;
ostream &operator<<(ostream &s, LinkedList &ll) {
return s;
// this is for testing the linked list
// something similar should be done for all data structures
// using the Test func on the base class, we can test all descendants with
// those methods
#include "arrays.h"
void Test(LinkedList &list) {
void TestArray(Array<LinkedList> &larray) {
for (int i = 0; i < larray.Size(); i++)
#include <fstream.h>
int main(void) {
ifstream inf;
Array<LinkedList> ll(200);
ll[3] = ll[0];
cout<<endl<<ll[3]<<" = "<<ll[0];

@ -0,0 +1,22 @
// linklist.h
#include <iostream.h>
#include "list.h"
class LLNode;
class LinkedList : public List {
LLNode *root;
LinkedList(const LinkedList &llist); // the copy constructor
LinkedList &operator = (const LinkedList &llist);
int Insert(int val); // this does not insert if val already exists
// returns 1 if it could insert
// could later on also do frequency counts
int Search(int val);
void Write(ostream &s);
friend ostream &operator<<(ostream &s, LinkedList &ll);
int Empty(void) {return (root ? 0 : 1);}
//typedef LinkedList *LinkedListPtr;

@ -0,0 +1,12 @
// list.h
class List {
virtual int Insert(int val) {return 0;}
virtual int Search(int val) {return 0;}
virtual void Write(ostream &s) {;}
virtual int Empty(void) {return 0;}
//typedef List *ListPtr;

@ -0,0 +1,71 @
#include <string.h>
#include <math.h>
#include <stdlib.h>
#include "random.h"
#define PCRAND
#ifdef PCRAND
#define RANDOM_MAX (pow(2, 31) - 1)
/* this random function returns 1 if a random toss is within pfactor, 0<pfactor<1*/
int Probability(float pfactor)
return (pfactor>Random1());
/* if a random value from 0 to 1 is less than pfactor, return true*/
/* -----------------------------------------------------------------------------------------------------------------*/
unsigned Random(unsigned num) /* returns a random word between 0 and num-1*/
float ratio, temp;
#ifdef PCRAND
if (temp*ratio>num-1) return num-1;
else return (temp*ratio);
/* ------------------------------------------------------------------------------------------------------------*/
/* returns a value between 0 and 1*/
float Random1(void)
float ratio,temp;
#ifdef PCRAND
if (ratio<=0) ratio=0.0001;
if (ratio>=0.9999) ratio=0.9999;
return ratio;
/* -------------------------------------------------------------------------------------------------------------------*/
/* initializes random generator*/
void InitRandom(int seed)
#ifdef PCRAND
static char state[64];
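The helpers above (`Probability`, `Random`, `Random1`) depend on a platform random function selected by `PCRAND`. As a portable illustration, the same three operations can be sketched on top of the standard `<random>` header. The class name `RandomSource` is hypothetical, and the Mersenne Twister engine is a substitute chosen for this sketch, not the generator the original code uses.

```cpp
#include <cassert>
#include <random>

// Sketch: the three helpers expressed with <random>.
class RandomSource {
public:
    explicit RandomSource(unsigned seed) : engine_(seed) {}

    // Uniform value in [0, 1).
    double Random1() {
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        return dist(engine_);
    }

    // Uniform integer from 0 to num - 1; num must be greater than 0.
    unsigned Random(unsigned num) {
        std::uniform_int_distribution<unsigned> dist(0, num - 1);
        return dist(engine_);
    }

    // Returns 1 if a random toss falls below pfactor, 0 otherwise.
    int Probability(double pfactor) { return pfactor > Random1() ? 1 : 0; }

private:
    std::mt19937 engine_;
};
```

Keeping the engine inside an object replaces the static state buffer of `InitRandom` and makes seeding explicit at construction time.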

@ -0,0 +1,13 @
#ifndef __RANDOM_H
#define __RANDOM_H
/* these are routines to generate random numbers in commonly used formats. These routines all
use the random function and so are very random ! */
void InitRandom(int seed); /* initializes the random system to seed - uses internal state buffers*/
int Probability(float pfactor); /* returns 1 if a random toss is within pfactor, 0 otherwise*/
unsigned Random(unsigned num); /* returns an unsigned from 0 to num-1*/
float Random1(void); /* returns a random floating pt between 0 and 1, i.e over interval (0,1)*/

@ -0,0 +1,27 @
#ifndef __TLIST_H
#define __TLIST_H
// tlist.h
// this is a base template class for lists
// it is for a list of elements. An element can be of any class, but it must
// have the operators == and = defined, so that the list can be searched
// also the operator << must be defined for write
template <class Elem> class List {
virtual Elem *Insert(const Elem &elem) {return NULL;}
// insert elem into the list
// returns a ptr to elem if elem inserted, NULL if elem was already there
// i.e. doesn't put in duplicates and returns NULL for duplicates
virtual Elem *Search(const Elem &elem) {return NULL;}
// finds the element that matches elem and returns it. This allows assoc
// retrieval
// returns NULL if the elem is not found
virtual void Write(ostream &s) {;}
// writes out the list of elements. Requires that the element overload the
// stream output operator
virtual int Empty(void) {return 0;}
// returns true if the list is empty

@ -0,0 +1,136 @
// tll.cpp
#include "tll.h"
// data structures:
// node for a linked list
template <class Elem> class LLNode {
Elem elem; // the element at this node
LLNode<Elem> *next; // pointer to the next node
template <class Elem> LinkedList<Elem>::LinkedList(void) {
root = NULL;
length = 0;
template <class Elem> LinkedList<Elem>::LinkedList(const LinkedList<Elem> &llist) {
root = NULL;
LLNode<Elem> *temp_ptr = llist.root;
while (temp_ptr) {
temp_ptr = temp_ptr->next;
template <class Elem> LinkedList<Elem> &LinkedList<Elem>::operator = (
const LinkedList<Elem> &llist) {
root = NULL;
LLNode<Elem> *temp_ptr = llist.root;
while (temp_ptr) {
temp_ptr = temp_ptr->next;
return *this;
template <class Elem> LinkedList<Elem>::~LinkedList(void) {
if (root) {
LLNode<Elem> *temp_ptr = root->next, *next_temp_ptr;
delete root;
while (temp_ptr) {
next_temp_ptr = temp_ptr->next;
delete temp_ptr;
temp_ptr = next_temp_ptr;
template <class Elem> void LinkedList<Elem>::Clear(void) {
if (root) {
LLNode<Elem> *temp_ptr = root->next, *next_temp_ptr;
delete root;
while (temp_ptr) {
next_temp_ptr = temp_ptr->next;
delete temp_ptr;
temp_ptr = next_temp_ptr;
root = NULL;
length = 0;
template <class Elem> Elem *LinkedList<Elem>::Insert(const Elem &elem) {
if (!root) {
root = new LLNode<Elem>;
root->elem = elem;
root->next = NULL;
return &(root->elem);
} else {
if (!Search(elem)) { // only put in if it is not already in - this is ineff.
LLNode<Elem> temp_node;
temp_node.elem = root->elem;
temp_node.next = root->next;
root->elem = elem;
root->next = new LLNode<Elem>;
root->next->elem = temp_node.elem;
root->next->next = temp_node.next;
return &(root->elem);
} else { // put the elem back in the same place
root->elem = elem;
return &(root->elem);
return NULL;
template <class Elem> Elem *LinkedList<Elem>::Search(const Elem &elem) {
LLNode<Elem> *curr_ptr = root;
while (curr_ptr) {
if (curr_ptr->elem == elem)
return &(curr_ptr->elem);
// this is very important, because they may not be completely the same,
// since the comparison could be done on a key only
else curr_ptr = curr_ptr->next;
return NULL;
template <class Elem> void LinkedList<Elem>::Write(ostream &s) {
LLNode<Elem> *curr_ptr = root;
while (curr_ptr) {
s<<(curr_ptr->elem)<<" ";
curr_ptr = curr_ptr->next;
template <class Elem> ostream &operator<<(ostream &s, LinkedList<Elem> &ll) {
return s;
template <class Elem> int LinkedList<Elem>::DeleteNext(Elem &elem) {
if (!root) return 0;
elem = root->elem;
LLNode<Elem> *kill_ptr = root;
root = root->next;
delete kill_ptr;
return 1;
template <class Elem> int LinkedList<Elem>::GetNext(Elem &elem, int start) {
if (start) get_next_ptr = root;
if (get_next_ptr) {
elem = get_next_ptr->elem;
get_next_ptr = get_next_ptr->next;
return 1;
return 0;
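The template list above inserts new elements at the head, refuses duplicates, and can hand back its first element through `DeleteNext`. The sketch below shows the same contract on top of `std::forward_list`, so that node memory is managed by the container. The class name `UniqueList` is hypothetical, and the sketch follows the contract documented in tlist.h (null pointer for duplicates), which is an assumption about the intended behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>

// Sketch of a duplicate-free list with head insertion.
// Elem must provide operator== and be copyable.
template <class Elem>
class UniqueList {
public:
    UniqueList() : length_(0) {}

    // Returns a pointer to the stored copy if elem was inserted,
    // or a null pointer if an equal element was already present.
    Elem *Insert(const Elem &elem) {
        if (Search(elem)) return nullptr;
        items_.push_front(elem);
        ++length_;
        return &items_.front();
    }

    // Returns a pointer to the stored element equal to elem, or null.
    Elem *Search(const Elem &elem) {
        for (Elem &stored : items_) {
            if (stored == elem) return &stored;
        }
        return nullptr;
    }

    // Removes the first element and copies it into elem.
    // Returns 1 on success, 0 if the list was empty.
    int DeleteNext(Elem &elem) {
        if (items_.empty()) return 0;
        elem = items_.front();
        items_.pop_front();
        --length_;
        return 1;
    }

    std::size_t Size() const { return length_; }

private:
    std::forward_list<Elem> items_;
    std::size_t length_;
};
```

Because new elements go to the front, `DeleteNext` returns them in reverse insertion order, which matches the head-insertion scheme used in `Insert` above.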

@ -0,0 +1,42 @
#ifndef __TLL_H
#define __TLL_H
// tll.h
/* this implements a template class linklist, descended from tlist.h
one can create an assoc array out of this by creating an elem class in
which the comparison operator depends on the key alone. Then search will
return the full elem and one can check the value associated with the key */
#include <iostream.h>
#include "tlist.h"
template <class Elem> class LLNode;
template <class Elem> class LinkedList : public List<Elem> {
LLNode<Elem> *root;
LinkedList(const LinkedList<Elem> &llist); // the copy constructor
LinkedList &operator = (const LinkedList<Elem> &llist);
Elem *Insert(const Elem &elem); // this does not insert if val already exists
// returns ptr to elem in list if it could insert
void Clear(void);
int DeleteNext(Elem &elem); // deletes first elem in list and returns it
int GetNext(Elem &elem, int start); // returns the next element in the list, if start is set then returns
// the first one, returns 0 if the list is now empty
Elem *Search(const Elem &elem); // assumes the == operator defined on elem
void Write(ostream &s);
friend ostream &operator<<(ostream &s, LinkedList<Elem> &ll);
int Empty(void) {return (root ? 0 : 1);}
int Size(void) {return length;}
int length;
LLNode<Elem> *get_next_ptr; // because the next one is ongoing
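The header comment describes how to get associative lookup out of the list: give the element class an `operator==` that compares the key alone, then search with a probe that carries only the key. A minimal sketch of that trick follows, using `std::list` and `std::find` in place of `LinkedList` so that it is self-contained. The names `WordCount` and `FindByKey` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <string>

// Hypothetical element type: equality looks at the key (word) only,
// so two elements with different counts still compare equal.
struct WordCount {
    std::string word;
    int count;
    bool operator==(const WordCount &other) const { return word == other.word; }
};

// Returns a pointer to the stored element whose key matches probe,
// or a null pointer if there is none.
inline WordCount *FindByKey(std::list<WordCount> &items, const WordCount &probe) {
    std::list<WordCount>::iterator it = std::find(items.begin(), items.end(), probe);
    return it == items.end() ? nullptr : &*it;
}
```

The caller fills in the key of a probe, leaves the value unspecified, and reads the associated value from the element that the search returns.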


@ -0,0 +1,339 @
GNU GENERAL PUBLIC LICENSE
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc.
675 Mass Ave, Cambridge, MA 02139, USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Library General Public License instead.) You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.
Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and
modification follow.
0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".
Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.
You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:
a) You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
c) If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.
3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.
If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.
It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.
10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
Appendix: How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) 19yy <name of author>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) 19yy name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
`Gnomovision' (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Library General
Public License instead of this License.

@ -0,0 +1,21 @
STIDE_OBJECTS = config.o flexitree.o stide.o stream.o
STIDE_HEADERS = config.h flexitree.h opt_info.h stream.h
FLAGS = -g
stide: $(STIDE_OBJECTS)
	g++ $(FLAGS) $(STIDE_OBJECTS) -o stide
config.o: config.C config.h
	g++ -c $(FLAGS) config.C
flexitree.o: flexitree.C flexitree.h
	g++ -c $(FLAGS) flexitree.C
stream.o: stream.C stream.h
	g++ -c $(FLAGS) stream.C
stide.o: stide.C $(STIDE_HEADERS)
	g++ -c $(FLAGS) stide.C

@ -0,0 +1,13 @
STIDE version 1.2
Copyright (C) 1996, 1998 The Regents of the University of New Mexico.
Copyright (C) 2006 Hajime Inoue.
All rights reserved.
STIDE v1.2 should work identically to v1.1. Modern GCCs will not compile v1.1.
STIDE v1.2 was ported to STL and current C++ conventions. Please report
any bugs to
For usage information invoke stide with the --help option. More detailed
documentation can be found in the UserDoc directory.

@ -0,0 +1,803 @
#include <stdlib.h>
#include <stdio.h>
#include <fstream>
#include <string>
#include <vector>
#include "config.h"
#include "opt_info.h"
#define LF_LIM 999
#define SEQ_LEN_LIM 199
#define MAX_ELEM_LIM 999
#define MAX_STREAMS_LIM 9999
using std::vector;
using std::cout;
using std::cerr;
using std::endl;
/*
 * Config()
 * Reads in configuration information from configuration file, from
 * the command line, and from preset defaults.
 *
 * Input: int argc: Number of arguments on command line
 *        char *argv[]: Array of strings of actual arguments
 *
 * Output: Nothing
 */
Config::Config(const int argc, const char *argv[])
vector<OptInfo> opt_array(NUM_OPTS);
ReadCommandLine(argc, argv, opt_array);
/*
 * InitOptArray()
 * Sets the values of opt_array so that opt_array contains all the
 * information needed about the parameters being set by the config
 * file and the command-line arguments.
 *
 * Input: vector<OptInfo> &opt_array: Array of information about
 *        options for the program
 *
 * Output: Nothing
 */
void Config::InitOptArray(vector<OptInfo> &opt_array)
// opt_array.reserve(NUM_OPTS);
opt_array[0].long_name = "db_name";
opt_array[0].short_name = "d";
opt_array[0].set = 0;
opt_array[0].type = 's';
opt_array[0].str_val = &db_name;
opt_array[1].long_name = "seq_len";
opt_array[1].short_name = "l";
opt_array[1].set = 0;
opt_array[1].type = 'i';
opt_array[1].int_val = &seq_len;
opt_array[2].long_name = "max_elements";
opt_array[2].short_name = "me";
opt_array[2].set = 0;
opt_array[2].type = 'i';
opt_array[2].int_val = &max_elements;
opt_array[3].long_name = "max_streams";
opt_array[3].short_name = "ms";
opt_array[3].set = 0;
opt_array[3].type = 'i';
opt_array[3].int_val = &max_streams;
opt_array[4].long_name = "cfg_name";
opt_array[4].short_name = "c";
opt_array[4].set = 0;
opt_array[4].type = 's';
opt_array[4].str_val = &cfg_name;
opt_array[5].long_name = "pair_offset";
opt_array[5].short_name = "p";
opt_array[5].set = 0;
opt_array[5].type = 'i';
opt_array[5].int_val = &pair_offset;
opt_array[6].long_name = "add_output_format";
opt_array[6].short_name = "aof";
opt_array[6].set = 0;
opt_array[6].type = 's';
opt_array[6].str_val = &add_output_format;
opt_array[7].long_name = "compare_output_format";
opt_array[7].short_name = "cof";
opt_array[7].set = 0;
opt_array[7].type = 's';
opt_array[7].str_val = &compare_output_format;
opt_array[8].long_name = "add_to_db";
opt_array[8].short_name = "a";
opt_array[8].set = 0;
opt_array[8].type = 'f';
opt_array[8].int_val = &add_to_db;
opt_array[9].long_name = "output_graph";
opt_array[9].short_name = "g";
opt_array[9].set = 0;
opt_array[9].type = 'f';
opt_array[9].int_val = &output_graph;
opt_array[10].long_name = "compute_hdist";
opt_array[10].short_name = "hd";
opt_array[10].set = 0;
opt_array[10].type = 'f';
opt_array[10].int_val = &compute_hdist;
opt_array[11].long_name = "lf_size";
opt_array[11].short_name = "lf";
opt_array[11].set = 0;
opt_array[11].type = 'i';
opt_array[11].int_val = &lf_size;
opt_array[12].long_name = "write_db_stats";
opt_array[12].short_name = "s";
opt_array[12].set = 0;
opt_array[12].type = 'f';
opt_array[12].int_val = &write_db_stats;
opt_array[13].long_name = "verbose";
opt_array[13].short_name = "v";
opt_array[13].set = 0;
opt_array[13].type = 'f';
opt_array[13].int_val = &verbose;
opt_array[14].long_name = "very_verbose";
opt_array[14].short_name = "V";
opt_array[14].set = 0;
opt_array[14].type = 'f';
opt_array[14].int_val = &very_verbose;
opt_array[15].long_name = "help";
opt_array[15].short_name = "h";
opt_array[15].set = 0;
opt_array[15].type = 'h';
/*
 * SetDefaults()
 * Sets configuration variables to their default values
 *
 * Input: None
 *
 * Output: None
 */
void Config::SetDefaults()
cfg_name = "stide.config";
db_name = "default.db";
seq_len = 6;
max_elements = 500;
max_streams = 500;
pair_offset = 0;
add_output_format = "DB Size: %d\tStream: %s\tPair Number: %p\n";
compare_output_format = "Pair Number: %p\tStream Number: %s\n";
lf_size = 1;
add_to_db = 0;
output_graph = 0;
compute_hdist = 0;
write_db_stats = 0;
verbose = 0;
very_verbose = 0;
num_fvars = 0;
/*
 * ReadCommandLine()
 * Parses the command line. Updates configuration variables.
 *
 * const int argc              Number of arguments
 * const char *argv[],         Array of arguments
 * vector<OptInfo> &opt_array  Constant array of information about
 *                             the configuration variables
 */
void Config::ReadCommandLine(const int argc, const char *argv[],
vector<OptInfo> &opt_array)
string var_name; // Name of variable
string var_val; // Value of variable
int name_type; // LONG_NAME or SHORT_NAME
int argv_i = 1; // First index of argv
int argv_j = 0; // Second index of argv
while (argv_i < argc) {
if (argv[argv_i][argv_j] != '-') {
cerr<< "ERROR: Switches must be preceded by a dash: "<<argv[argv_i]
<< endl << " is illegal" << endl;
if (argv[argv_i][argv_j] == '-') { // Long name
name_type = LONG_NAME;
else {
name_type = SHORT_NAME;
// Read name into var_name
var_name = argv[argv_i]+argv_j;
// Now we want to read the value, if there is one.
argv_j = 0;
if (++argv_i < argc) {
if (argv[argv_i][argv_j] != '-') {
var_val = argv[argv_i];
// assign value to appropriate variable
AssignValToVar(opt_array, var_val, var_name, name_type);
// Blank var_name and var_val for next time around
* AssignValToVar() *
* Figures out which variable to assign a given value to and does *
* so. Updates opt_array, to say that that particular variable *
* has been set. *
* *
* Input: vector<OptInfo> &opt_array Option Information *
* const string &var_val Value to be assigned *
* const string &var_name Name of variable to be updated *
* const int name_type SHORT_NAME or LONG_NAME *
* *
* Output: None *
void Config::AssignValToVar(vector<OptInfo> &opt_array, const string
&var_val, const string &var_name, const
int name_type)
int opt_i;
for (opt_i = 0; opt_i < NUM_OPTS; opt_i++) {
if (((name_type == LONG_NAME) && (opt_array[opt_i].long_name ==
var_name)) ||
((name_type == SHORT_NAME) && (opt_array[opt_i].short_name ==
var_name))) {
// If we have already set this variable and shouldn't change it,
// don't
if (opt_array[opt_i].set == 1) {
switch (opt_array[opt_i].type) {
case 'f': // flag
if ((var_val.length() == 0) || (var_val == "On") ||
(var_val == "ON") || (var_val == "on")) {
*(opt_array[opt_i].flag_val) = 1;
opt_array[opt_i].set = 1;
else if ((var_val != "Off") && (var_val != "off") &&
(var_val != "OFF")) {
cerr << "ERROR: Illegal value for parameter " << var_name
<< ". This parameter is a simple flag," << endl
<< "and may be followed by \"on\", \"off\", or nothing "
<< "(which turns it on). The current value is "
<< var_val << ". Aborting..." << endl;
case 'i':
// If there isn't a value, just use the default
if (var_val.length() == 0) {
*(opt_array[opt_i].int_val) = atoi(var_val.c_str());
opt_array[opt_i].set = 1;
case 's':
// If there is no string given, just use the default
if (var_val.length() == 0) {
*(opt_array[opt_i].str_val) = var_val;
opt_array[opt_i].set = 1;
case 'h':
} // end of switch
return; // we've found it, so we're done
} // end of if (opt_array[opt_i]...
} // end of for (opt_i = 0; ...
* ReadConfigFile() *
* Parses the configuration file. Updates configuration *
* variables. *
* *
* Input: vector<OptInfo> &opt_array: Option information *
* *
* Output: None *
void Config::ReadConfigFile(vector<OptInfo> &opt_array)
string var_name;
string var_val;
// Set up stream for reading configuration
ifstream cfg_file(cfg_name.c_str());
string buff;
int buff_i = 0; // index for buff
int opt_i = 0; // index for opt_array
int rev_num; // revision number of configuration file
if (!cfg_file.is_open()) {
cerr<<"WARNING: Cannot open configuration file "<<cfg_name
<<". I will continue, using the" <<endl
<<"default values and the command line arguments." << endl
<<"If that isn't what you wanted, type Ctrl-C now to abort."
<< endl;
// First we need to determine if the configuration file is old-style
// or new-style, i.e., is there a #ConfigFileRev: in the first
// line. We can determine this just by checking the first
// character.
char c = cfg_file.peek();
// Config file is empty; just return
if (cfg_file.eof()) {
// If old-style
if (c != '#') {
cerr << "WARNING: The first line of the configuration file did "
<< "not contain the string" << endl
<< "\"#ConfigFileRev: " << CFREV << "\"." << endl
<< "I will assume that this is an old format configuration "
<< "file." << endl
<<"If that isn't what you wanted, type Ctrl-C now to abort."
<< endl << endl;
ReadOldConfigFile(cfg_file, opt_array);
// Look for "#ConfigFileRev:"
cfg_file >> buff;
if (buff != "#ConfigFileRev:") {
cerr << "ERROR: I expected the first line of the configuration "
<< "file to either be \"#ConfigFileRev: \" followed by the "
<< "revision number or the beginning of an old-style "
<< "configuration file, which does not have a comment in the "
<< "first line. I'm confused, so I will abort..."
<< endl << endl;
cfg_file >> rev_num;
if (rev_num > CFREV) {
cerr << "ERROR: This version of STIDE does not know how to deal "
<< "with configuration files" << endl
<< "more modern than revision " << CFREV << ". Aborting..." << endl;
if (rev_num < CFREV) {
cerr << "ERROR: Configuration files must be revision " << CFREV
<< " or later, " << "or an old-style" << endl
<< "configuration file without a revision number. "
<< "Aborting..." << endl;
// Now we know everything's as we expect, so we'll parse the file
while (!cfg_file.eof()) {
// Skip white space at the beginning of the line
while (isspace(buff[buff_i])) {
// If buff is empty, move on to next line
if (buff.length() <= buff_i) {
getline(cfg_file, buff);
buff_i = 0;
// If we start with a comment, move on to next line
if (buff[buff_i] == '#') {
getline(cfg_file, buff);
buff_i = 0;
// Read in variable name, up to the :
int start_place = buff_i; // the beginning place of the name
while ((buff_i < buff.length()) && (buff[buff_i] != ':')) {
if (buff_i == buff.length()) {
cerr << "ERROR: Variable names in the configuration file must "
<< "be followed by a colon. The line " << endl
<< buff << endl << "contains a variable name which is not "
<< "terminated by a colon. Aborting..." <<endl;
// This assigns the values in buff between start_place and buff_i
// to var_name
var_name.assign(buff, start_place, buff_i - start_place);
// Skip colon
// Skip white space
while (isspace(buff[buff_i])) { buff_i++; }
start_place = buff_i; // the starting place of the value
// Find last point in value. If it starts with a quote, it ends
// with a quote.
if ((buff_i < buff.length()) && (buff[buff_i] == '\"')) {
while ((buff_i < buff.length()) && (buff[buff_i] != '\"')) {
// Strip off first "
// Otherwise, it ends with a space, a # or the end of the line
else {
while ((buff_i < buff.length()) && (!isspace(buff[buff_i])) &&
(buff[buff_i] != '#')) {
var_val.assign(buff, start_place, buff_i - start_place);
// Now we want to check to see if the line was continued, in which
// case we haven't gotten the value of the variable in var_val, so
// we still need to do that.
if (buff[buff_i-1] == '\\') {
getline(cfg_file, buff);
buff_i = 0;
while (isspace(buff[buff_i])) { buff_i++; }
start_place = buff_i;
// Find last point in value. If it starts with a quote, it ends with a
// quote.
if (buff[buff_i] == '\"') {
while ((buff_i < buff.length()) && (buff[buff_i] != '\"')) {
start_place++; // Strip off first "
// Otherwise, it ends with a space, a # or the end of the line
else {
while ((buff_i < buff.length()) && (!isspace(buff[buff_i])) &&
(buff[buff_i] != '#')) {
var_val.assign(buff, start_place, buff_i - start_place);
// assign value to appropriate variable
AssignValToVar(opt_array, var_val, var_name, LONG_NAME);
getline(cfg_file, buff);
buff_i = 0;
} //end of while (!cfg_file.eof())...
* ReadOldConfigFile() *
* Reads information from an old-style configuration file. *
* Updates configuration variables. *
* *
* Input: ifstream &cfg_file Configuration file (already opened) *
* vector<OptInfo> &opt_array: Option information *
* *
* Output: None *
void Config::ReadOldConfigFile(ifstream &cfg_file,
vector<OptInfo> &opt_array)
string buff;
string var_name;
string var_val;
var_name = "max_elements";
AssignValToVar(opt_array, var_val, var_name, LONG_NAME);
getline(cfg_file, buff);
var_name = "max_streams";
AssignValToVar(opt_array, var_val, var_name, LONG_NAME);
getline(cfg_file, buff);
// Next line is hash table size, but we are now figuring that out
// dynamically, so just throw it away.
getline(cfg_file, buff);
// Now read in the format string
getline(cfg_file, var_val);
// Put the format string in the appropriate place
if (add_to_db) {
var_name = "add_output_format";
AssignValToVar(opt_array, var_val, var_name, LONG_NAME);
else {
var_name = "compare_output_format";
AssignValToVar(opt_array, var_val, var_name, LONG_NAME);
* CheckValues() *
* Checks configuration values that have been read in to make *
* sure that they are within the limits. Flags are automatically *
* checked while being read in, the output formats are checked *
* in InitOutputFormat(), and filenames are checked when they are *
* opened, so all that is left is the integer values. *
* *
* Input: None *
* *
* Output: None *
void Config::CheckValues()
if ((lf_size < 1) || (lf_size > LF_LIM)) {
cerr << "ERROR: lf_size must be between 1 and " << LF_LIM
<< ". It has been set to " << lf_size << ". Aborting..." << endl;
if ((seq_len < 1) || (seq_len > SEQ_LEN_LIM)) {
cerr << "ERROR: seq_len must be between 1 and " << SEQ_LEN_LIM
<< ". It has been set to " << seq_len << ". Aborting..." << endl;
if ((max_elements < 1) || (max_elements > MAX_ELEM_LIM)) {
cerr << "ERROR: max_elements must be between 1 and " << MAX_ELEM_LIM
<< ". It has been set to " << max_elements
<< ". Aborting..." << endl;
if ((max_streams < 1) || (max_streams > MAX_STREAMS_LIM)) {
cerr << "ERROR: max_streams must be between 1 and " << MAX_STREAMS_LIM
<< ". It has been set to " << max_streams
<< ". Aborting..." << endl;
* InitOutputFormat() *
* Converts the string add_output_format or compare_output_format *
* to information filling fmt_str and num_fvars, which is more *
* convenient for output. *
* *
* Input: None *
* *
* Output: None *
void Config::InitOutputFormat()
// Now we analyze add_output_format or compare_output_format
int flag = 0;
int f_i = 0;
num_fvars = 0;
string *buff;
// If we're not in verbose or very_verbose modes, we're never going
// to use this information, so don't waste our time doing this
if (!(verbose || very_verbose)) {
if (add_to_db) {
buff = &add_output_format;
else {
buff = &compare_output_format;
for (int i = 0; i <(*buff).length(); i++) {
switch ((*buff)[i]) {
case '\\':
switch ((*buff)[i]) {
case 't': fmt_str[num_fvars][f_i] = '\t'; break;
case 'n': fmt_str[num_fvars][f_i] = '\n'; break;
case '%':
fmt_str[num_fvars][f_i] = '%';
flag = 1;
fmt_str[num_fvars][f_i] = (*buff)[i];
if (flag) {
switch (fmt_str[num_fvars][f_i]) {
case 'd': // database size
case 'i': // number of last value of sequence in this
// data stream
case 'p': // number of last value of sequence in entire
// input
case 's': // external stream ID
case 'a': // flag for whether this sequence is anomalous
case 'c': // locality frame count of this sequence
case 'h': // Hamming distance for this sequence
// Record that we must write that val at that position
write_val[num_fvars] = fmt_str[num_fvars][f_i];
fmt_str[num_fvars][f_i] = 'd';
fmt_str[num_fvars][f_i + 1] = '\0';
f_i = -1;
flag = 0;
default: // Unknown flag
cerr << "ERROR: Illegal control character in output format."
<< " Type stide -h for help." << endl;
} // switch ((*buff)[i ...
fmt_str[num_fvars][f_i] = '\0';
* OutputConfigInfo() *
* Writes information about the final configuration to standard *
* output. Does so in a format that could be used as a *
* configuration file. Changes no values anywhere. *
* *
* Input: const vector<OptInfo> &opt_array Option Information *
* *
* Output: None *
void Config::OuputConfigInfo(const vector<OptInfo> &opt_array) const
cout<<"This run was configured using configuration file "
<< cfg_name << " and command" << endl
<< "line arguments. The configuration values were as "
<< "follows." << endl
<<"#ConfigFileRev: " << CFREV << endl;
for (int i = 0; i < NUM_OPTS; i++) {
if (opt_array[i].type == 'i') {
cout << opt_array[i].long_name << ": " << *(opt_array[i].int_val)
<< endl;
if ((opt_array[i].type == 's') &&
((add_to_db && (opt_array[i].short_name == "aof")) ||
(!add_to_db && (opt_array[i].short_name == "cof")))) {
cout << opt_array[i].long_name << ": \"" << *(opt_array[i].str_val)
<< "\"" << endl;
if (opt_array[i].type == 'f') {
if (*(opt_array[i].int_val) == 1) {
cout << opt_array[i].long_name << ": On" << endl;
if (*(opt_array[i].int_val) == 0) {
cout << opt_array[i].long_name << ": Off" << endl;
cout << endl << endl;
// Now print header for verbose modes
if (verbose || very_verbose) {
cout<<endl<<"Variables in output: "<<endl;
for (int j = 0; j < num_fvars; j++) {
switch (write_val[j]) {
case 's': cout<<"stream #, "; break;
case 'i': cout<<"index #, "; break;
case 'h': if (compute_hdist) {cout<<"hamming miss, "; } break;
case 'c': if (lf_size > 1) {cout<<"lfc, "; } break;
case 'p': cout<<"pair #, "; break;
case 'd': cout<<"db size, "; break;
case 'a': cout<<"is anomalous?, "; break;
* WriteHelpInfo() *
* Writes help information to standard output. Changes no values.*
* *
* Input: None *
* *
* Output: None *
* *
void Config::WriteHelpInfo() const
cout<<"STIDE accepts calls of the form:"<<endl
<<" stide -c cfg_name -d db_name -me max_elements"
<<" -lf lf_size -l seq_len"<<endl<<" -ms max_num_streams"
<<" -p pair_offset -aof add_out_format "
<< endl << " -cof compare_out_format -a -g -h -hd -s -v -V"
<< endl << endl;
cout<<"STIDE expects input to come through standard input in"
<<" the format of a pair"<<endl
<<"of integers per line, where the first integer is a"
<<" stream identifier"<<endl
<<"and the second is a data element. Command line"
<<" arguments override"<<endl
<<"specifications in the configuration file. All"
<<" parameters are optional"<<endl
<<"and can be specified in any order. Parameters"
<<" are always preceded by a"<<endl
<<"switch. The switches are:"<<endl<<endl;
cout<<"-a Add to database; defaults to off"<<endl;
cout<<"-c cfg_name The name of file containing the"
<<" configuration;"<<endl
<<" defaults to \"stide.config\""<<endl;
cout<<"-d db_name The name of the file containing"
<<" the database;"<<endl
<<" defaults to \"default.db\""<<endl;
cout<<"-lf lf_size The size of the locality frame;"
<<" defaults to 1"<<endl;
cout<<"-g Write graphing data in dot format to"
<<" defaults to off"<<endl;
cout<<"-h Help; displays this information"<<endl;
cout<<"-l seq_len Length of sequence; defaults to 6"<<endl;
cout<<"-p pair_offset Offset for pair number count;"
<<" defaults to 0"<<endl;
cout<<"-s Display db stats; defaults to off"<<endl;
cout<<"-v Verbose mode on; defaults to off"<<endl;
cout<<"-V Very verbose mode on; defaults to off"<<endl;
cout<<"-hd Compute Hamming distance measures;"
<<" defaults to off"<<endl;
cout<<"-me max_elements Maximum number of different"
<<" elements"<<endl
<<" in the input stream; defaults to"
<<" 500" <<endl;
cout<<"-ms max_num_streams Maximum number of different"
<<" streams in input;"<<endl
<<" defaults to 100"<<endl;
cout<<"-aof add_out_format Format for output when adding to"
<<" database"<<endl
<<" in verbose or very_verbose"
<<" modes; defaults to"<<endl
<<" \"DB Size: %d\\tStream: "
<<"%s\\tPair Number: %p\\n\""<<endl;
cout<<"-cof compare_out_format Format for output when comparing"
<<" with database"<<endl
<<" in verbose or very_verbose modes;"
<<" defaults to"<<endl
<<" \"Pair Number: %p\\tStream"
<<" Number: %s\\n\""<<endl;
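The output formats described in the help text (for example "Pair Number: %p\tStream Number: %s\n") mix two-character escapes with %-prefixed control characters. The following is a minimal sketch of how such a string could be expanded in one pass. `ExpandFormat` and the `values` map are hypothetical names introduced for illustration; STIDE itself pre-splits the format into `fmt_str` pieces in `InitOutputFormat()`.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of STIDE: expands an output format string
// by replacing each %<letter> with the matching integer from `values` and
// translating the two-character escapes \t and \n.
std::string ExpandFormat(const std::string &fmt,
                         const std::map<char, int> &values) {
  std::string out;
  for (std::string::size_type i = 0; i < fmt.length(); ++i) {
    char c = fmt[i];
    if ((c == '\\' || c == '%') && i + 1 < fmt.length()) {
      char next = fmt[++i];
      if (c == '\\') {
        if (next == 't') {
          out += '\t';
        } else if (next == 'n') {
          out += '\n';
        } else {  // unknown escape: keep both characters verbatim
          out += c;
          out += next;
        }
      } else {
        std::map<char, int>::const_iterator it = values.find(next);
        if (it == values.end()) {
          throw std::invalid_argument("illegal control character in format");
        }
        out += std::to_string(it->second);
      }
    } else {
      out += c;  // ordinary character, or a trailing '\' or '%'
    }
  }
  return out;
}
```

Throwing on an unknown control character mirrors the "Illegal control character in output format" error above, but the exception type is an assumption of this sketch.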

@ -0,0 +1,72 @
#ifndef __SEQ_CONFIG_H
#define __SEQ_CONFIG_H
#define CFREV 1
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "opt_info.h"
using std::vector;
using std::ifstream;
class Config {
Config(const int argc, const char *argv[]); // Constructor; reads
// configuration file and command
// line arguments
string cfg_name; // Name of configuration file
string db_name; // Name of database
int seq_len; // Sequence Length
int max_elements; // Maximum number of different
// data elements we may encounter
int max_streams; // Maximum number of different
// streams we may encounter
int pair_offset; // Number by which to offset
// num_pairs_read
string add_output_format; // Format for verbose-mode output
// when adding to database
string compare_output_format; // Format for verbose-mode output
// when comparing with an
// existing database
int lf_size; // Size of locality frames: 1
// effectively means don't
// compute locality frames
int add_to_db; // Flag indicating that we should
// add to the database rather
// than make comparisons
int output_graph; // Output graphing information in
// Dot format
int compute_hdist; // Compute Hamming distance
int write_db_stats; // Write statistics about the
// database
int verbose; // Output information about each
// anomaly or each new sequence
// added to the database
int very_verbose; // Output information about each
// sequence encountered
char fmt_str[10][50]; // String used for outputting
// information in verbose mode
char write_val[7]; // Do we write the value? used
// with fmt_str
int num_fvars; // Number of format variables
void InitOptArray(vector<OptInfo> &opt_array);
void SetDefaults();
void ReadCommandLine(const int argc, const char *argv[],
vector<OptInfo> &opt_array);
void AssignValToVar(vector<OptInfo> &opt_array, const
string &var_val, const string
&var_name, const int name_type);
void ReadConfigFile(vector<OptInfo> &opt_array);
void ReadOldConfigFile(ifstream &cfg_file,
vector<OptInfo> &opt_array);
void InitOutputFormat();
void CheckValues();
void OuputConfigInfo(const vector<OptInfo> &opt_array) const;
void WriteHelpInfo() const;

@ -0,0 +1,461 @
// flexitree.C
#include "flexitree.h"
extern int counter;
using std::endl;
using std::cerr;
// data structures:
// node for a linked list
class FlexiTreeNode {
FlexiTree *tree; // the element at this node
FlexiTreeNode *next; // pointer to the next node
FlexiTreeNode(int root) {tree = new FlexiTree(root); next = NULL;}
FlexiTree::FlexiTree(void) {
children = NULL;
root = -1;
id = counter;
FlexiTree::FlexiTree(int d) {
children = NULL;
root = d;
id = counter;
FlexiTree::~FlexiTree(void) {
if (children) {
FlexiTreeNode *temp_ptr = children->next, *next_temp_ptr;
if (children->tree) delete children->tree;
delete children;
while (temp_ptr) {
next_temp_ptr = temp_ptr->next;
if (temp_ptr->tree) delete temp_ptr->tree;
delete temp_ptr;
temp_ptr = next_temp_ptr;
int FlexiTree::NumNodes(void) const {
int size = 1;
if (children) {
FlexiTreeNode *temp_ptr = children;
while (temp_ptr) {
size += temp_ptr->tree->NumNodes();
temp_ptr = temp_ptr->next;
return size;
int FlexiTree::NumLeaves(void) const {
int size;
if (children) {
size = 0;
FlexiTreeNode *temp_ptr = children;
while (temp_ptr) {
size += temp_ptr->tree->NumLeaves();
temp_ptr = temp_ptr->next;
} else size = 1;
return size;
int FlexiTree::NumBranches(void) const {
int branches = 0;
if (children) {
FlexiTreeNode *temp_ptr = children;
while (temp_ptr) {
branches += (temp_ptr->tree->NumBranches() + 1);
temp_ptr = temp_ptr->next;
return branches;
* InsertSeq() *
* Inserts a sequence in this tree and returns 1 if the sequence *
* begins with the root of this tree and the sequence isn't already *
* in this tree. It returns -1 if the sequence doesn't begin with *
* the root of this tree. It returns 0 if the sequence was already *
* in this tree. This function is recursive and only compares the *
* portion of the sequence lying between the argument first and the *
* argument last. *
* *
* *
* Input: const vector<int> &seq Current sequence *
* int first Index of the first element of the *
* sequence to consider *
* int last Index of the last element of the *
* sequence to consider *
int FlexiTree::InsertSeq(const vector<int> &seq, int first, int last)
// If the root of this tree isn't the same as the first element of
// the sequence, return -1 to indicate that
if (root != seq[first]) {
return -1;
first++; // shift the seq forward
// If we have reached the end of the sequence now, we haven't added
// anything to the tree, so we return 0 to indicate that it was
// already there
if (first > last) {
return 0;
// If there are no children, create some with the correct root,
// insert the sequence and return 1.
if (!children) {
children = new FlexiTreeNode(seq[first]);
children->tree->InsertSeq(seq, first, last);
return 1;
// The root agrees, we're not at the end, and there are children.
// Now we want to know if the sequence is already in the children,
// and if not, we want to find out and add it.
FlexiTreeNode *temp_ptr = children;
int flag;
while (1) {
flag = temp_ptr->tree->InsertSeq(seq, first, last);
// If the sequence is new and gets added, return 1
if (flag == 1) return 1;
// If the sequence is old, return 0
if (flag == 0) return 0;
// Otherwise the new root of the sequence isn't the same as the
// root of this child tree, so we will try the next one. But
// first, if this is the last child, we know it isn't in here, so
// we will add it in and return 1
if (temp_ptr->next == NULL) {
temp_ptr->next = new FlexiTreeNode(seq[first]);
temp_ptr->next->tree->InsertSeq(seq, first, last);
return 1;
temp_ptr = temp_ptr->next;
* IsSeqInTree() *
* Returns 1 if the sequence has a match within this tree and *
* returns 0 otherwise. This function is recursive and only *
* compares the portion of the sequence lying between the argument *
* first and the argument last. *
* *
* *
* Input: vector<int> &seq Current sequence *
* int first Index of the first element of the *
* sequence to consider *
* int last Index of the last element of the *
* sequence to consider *
int FlexiTree::IsSeqInTree(const vector<int> &seq, int first, int last) const
// If the first element of the sequence isn't the same as the root
// of this tree, then we know already that there isn't a match here,
// so return 0.
if (root != seq[first]) {
return 0;
first++; // shift the seq forward
// If we have reached the end of the sequence, then we have
// found matches all the way along, so return 1 saying that this is
// a match.
if (first > last) {
return 1;
// Now we want to find out if there is a match in any of the
// subtrees below this tree. The subtrees are contained in the
// linked list children->next->next->...
FlexiTreeNode *next_node = children;
while (next_node != NULL) {
if (next_node->tree->IsSeqInTree(seq, first, last)) {
return 1; //Found it!
next_node = next_node->next;
// Now we've been through all of the subtrees without finding a
// match, so there aren't any matches.
return 0;
* ComputeHDistForTree() *
* Reports the minimum number of mismatches with any sequence on *
* this tree. This is a highly compute-intensive method, because *
* every path down the tree is followed. This function is *
* recursive, and only compares the portion of the sequence lying *
* between the argument first and the argument last. *
* *
* *
* Input: vector<int> &seq Current sequence *
* int first Index of the first element of the *
* sequence to consider *
* int last Index of the last element of the *
* sequence to consider *
int FlexiTree::ComputeHDistForTree(vector<int> &seq, int first, int
last) const
int tot_misses = 0;
// If the first element of the sequence isn't the same as the root
// of this tree, then every sequence on this tree will disagree with
// the sequence here, so we increment tot_misses
if (root != seq[first]) {
first++; // shift the seq forward
if (first > last) { // reached the end of the seq
return tot_misses; // nothing left to compare below this node
// Now we want to add to tot_misses the smallest number of
// mismatches with any of this tree's subtrees. This tree's
// subtrees are in the linked list children->next->next->
FlexiTreeNode *next_node = children;
// last is the last element of the sequence, which is one less than
// the number of elements in the sequence. The most misses possible
// is the number of elements in the sequence.
int min_misses = last + 1;
int misses;
while (next_node != NULL) {
misses = next_node->tree->ComputeHDistForTree(seq, first, last);
if (misses < min_misses) {
min_misses = misses;
next_node = next_node->next;
return (tot_misses + min_misses);
// Format for writing out: the tree is written depth-first. Each path is
// terminated by a negative number, which is -(the required backtrack length) - 1.
// depth should start out as 0. The written tree ends with -1.
void FlexiTree::Write(ostream &s, int &depth) const {
s<<root<<" ";
FlexiTreeNode *temp_ptr = children;
while (temp_ptr) {
depth = 0;
temp_ptr->tree->Write(s, depth);
temp_ptr = temp_ptr->next;
if (temp_ptr) s<<"-"<<(depth + 1)<<" ";
depth++; // now incr the count
ostream &operator<<(ostream &s, const FlexiTree &tree) {
int depth = 0;
tree.Write(s, depth);
s<<" -1"; // we terminate with a -1
return s;
// returns 0 if we have reached the end of the file, 1 otherwise
int FlexiTree::Read(istream &s, int &depth) {
int next_num;
if (s.eof()) return 0;
if (next_num == -1) return 0; // we have reached the end of the tree
if (next_num >= 0) {
children = new FlexiTreeNode(next_num);
if (!children->tree->Read(s, depth)) return 0;
FlexiTreeNode *temp_ptr = children;
while (depth == 0) {
if (s.eof()) return 0;
if (next_num == -1) return 0; // we have reached the end of the tree
temp_ptr->next = new FlexiTreeNode(next_num);
temp_ptr = temp_ptr->next;
if (!temp_ptr->tree->Read(s, depth)) return 0;
} else depth = (-1 * next_num) - 1;
if (depth) depth--;
return 1;
istream &operator>>(istream &s, FlexiTree &tree) {
int depth = 0;
tree.Read(s, depth);
return s;
// writes out in the format that dot uses for dags
int FlexiTree::OutputGraph(ostream &s) const {
// first write out the name of the tree
s<<" "<<id<<" [label=\""<<root<<"\",shape=plaintext];"<<endl;
FlexiTreeNode *temp_ptr = children;
int childid;
while (temp_ptr) {
childid = temp_ptr->tree->OutputGraph(s);
s<<" "<<id<<" -> "<<childid<<";"<<endl;
temp_ptr = temp_ptr->next;
return id;
* IsSeqInForest() *
* Searches through database forest to locate sequence. Returns 1 *
* if it finds it, 0 otherwise *
int SeqForest::IsSeqInForest(const vector<int> &seq, int seq_len) const
// Have we ever seen a sequence starting with the same root?
if (trees_found[seq[0]]) {
// Have we seen this precise sequence?
return trees[seq[0]].IsSeqInTree(seq, 0, seq_len-1);
return 0;
SeqForest::SeqForest(int max_trees)
trees = vector<FlexiTree>(max_trees);
trees_found = vector<int>(max_trees, 0);
#include <fstream>
// for test purposes
int main(void) {
FlexiTree tree(1);
vector<int> seq(10);
// try out insert and write
seq[0] = 1; seq[1] = 1; seq[2] = 2; seq[3] = 3;
tree.SeqInsert(seq, 0, 3);
seq[0] = 1; seq[1] = 1; seq[2] = 3; seq[3] = 5;
tree.SeqInsert(seq, 0, 3);
seq[0] = 1; seq[1] = 2; seq[2] = 2; seq[3] = 3;
tree.SeqInsert(seq, 0, 3);
seq[0] = 1; seq[1] = 2; seq[2] = 3; seq[3] = 3;
tree.SeqInsert(seq, 0, 3);
seq[0] = 1; seq[1] = 2; seq[2] = 3; seq[3] = 4;
tree.SeqInsert(seq, 0, 3);
seq[0] = 1; seq[1] = 2; seq[2] = 3; seq[3] = 4;
tree.SeqInsert(seq, 0, 3);
seq[0] = 1; seq[1] = 2; seq[2] = 1; seq[3] = 4;
tree.SeqInsert(seq, 0, 3);
// now try out search
seq[0] = 1; seq[1] = 2; seq[2] = 1; seq[3] = 4;
if (tree.SeqSearch(seq, 0, 3)) cout<<"found 1214"<<endl;
else cout<<"could not find 1214"<<endl;
seq[0] = 1; seq[1] = 2; seq[2] = 2; seq[3] = 4;
if (tree.SeqSearch(seq, 0, 3)) cout<<"found 1224"<<endl;
else cout<<"could not find 1224"<<endl;
seq[0] = 1; seq[1] = 2; seq[2] = 4; seq[3] = 4;
if (tree.SeqSearch(seq, 0, 3)) cout<<"found 1244"<<endl;
else cout<<"could not find 1244"<<endl;
seq[0] = 1; seq[1] = 1; seq[2] = 3; seq[3] = 5;
if (tree.SeqSearch(seq, 0, 3)) cout<<"found 1135"<<endl;
else cout<<"could not find 1135"<<endl;
// try out insert and write with shorter and longer sequences
seq[0] = 1; seq[1] = 3;
tree.SeqInsert(seq, 0, 1);
seq[0] = 1; seq[1] = 1; seq[2] = 4;
tree.SeqInsert(seq, 0, 2);
seq[0] = 1; seq[1] = 2; seq[2] = 3; seq[3] = 1; seq[4] = 1; seq[5] = 2; seq[6] = 1; seq[7] = 4;
tree.SeqInsert(seq, 0, 7);
if (tree.SeqSearch(seq, 0, 7)) cout<<"found 12311214"<<endl;
else cout<<"could not find 12311214"<<endl;
seq[0] = 1; seq[1] = 1; seq[2] = 5;
if (tree.SeqSearch(seq, 0, 2)) cout<<"found 115"<<endl;
else cout<<"could not find 115"<<endl;
if (tree.SeqSearch(seq, 0, 1)) cout<<"found 11"<<endl;
else cout<<"could not find 11"<<endl;
ofstream outf("test.out");
//counter = 0;
FlexiTree intree;
ifstream inf("test.out");
seq[0] = 1; seq[1] = 2; seq[2] = 3; seq[3] = 1; seq[4] = 1; seq[5] = 2; seq[6] = 1; seq[7] = 4;
if (intree.SeqSearch(seq, 0, 7)) cout<<"found 12311214"<<endl;
else cout<<"could not find 12311214"<<endl;
seq[0] = 1; seq[1] = 1; seq[2] = 5;
if (intree.SeqSearch(seq, 0, 2)) cout<<"found 115"<<endl;
else cout<<"could not find 115"<<endl;
if (intree.SeqSearch(seq, 0, 1)) cout<<"found 11"<<endl;
else cout<<"could not find 11"<<endl;
int FlexiTree::Read(istream &s, int &depth) {
int next_num, depth_decr = 0;
if (next_num == -1) return 0; // we have reached the end of the tree
if (next_num >= 0) {
children = new FlexiTreeNode(next_num);
if (!children->tree->Read(s, depth)) return 0;
if (depth) {
depth_decr = 1;
FlexiTreeNode *temp_ptr = children;
while (depth == 0) {
depth_decr = 0;
if (next_num == -1) return 0; // we have reached the end of the tree
temp_ptr->next = new FlexiTreeNode(next_num);
temp_ptr = temp_ptr->next;
if (!temp_ptr->tree->Read(s, depth)) return 0;
if (depth) {
depth_decr = 1;
if (!depth_decr && depth) depth--;
} else
depth = (-1 * next_num) - 1;
return 1;
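The insert and lookup logic of `InsertSeq()` and `IsSeqInTree()` can be summarised with a compact prefix tree. This is a sketch only: `SeqTrie` is a hypothetical stand-in that keeps children in a `std::map` rather than FlexiTree's linked list, and it uses an empty sentinel root in place of the per-value trees held by `SeqForest`. As in the original, a prefix of a stored sequence counts as present.

```cpp
#include <map>
#include <memory>
#include <vector>

// Hypothetical stand-in for FlexiTree: children are kept in a std::map
// keyed by element value instead of a hand-rolled linked list.
struct SeqTrie {
  std::map<int, std::unique_ptr<SeqTrie>> children;

  // Returns true if at least one node had to be created, i.e. the
  // sequence was new; returns false if it was already present.
  bool Insert(const std::vector<int> &seq) {
    SeqTrie *node = this;
    bool added = false;
    for (int value : seq) {
      std::unique_ptr<SeqTrie> &child = node->children[value];
      if (!child) {
        child.reset(new SeqTrie());
        added = true;
      }
      node = child.get();
    }
    return added;
  }

  // Returns true if every element of seq matches along one path from
  // the root, false as soon as an element has no matching child.
  bool Contains(const std::vector<int> &seq) const {
    const SeqTrie *node = this;
    for (int value : seq) {
      auto it = node->children.find(value);
      if (it == node->children.end()) return false;
      node = it->second.get();
    }
    return true;
  }
};
```

The map gives logarithmic child lookup per level, whereas the linked list above scans siblings linearly; which is faster in practice depends on how many distinct elements follow each prefix.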

@ -0,0 +1,54 @
#ifndef __FLEXITREE_H
#define __FLEXITREE_H
#include <iostream>
#include <vector>
using std::ostream;
using std::istream;
using std::vector;
class FlexiTreeNode;
class FlexiTree {
FlexiTreeNode *children;
int root;
int id;
void Write(ostream &s, int &depth) const;
int Read(istream &s, int &depth);
int OutputGraph(ostream &s) const;
FlexiTree(int d);
// FlexiTree(const FlexiTree& ft);
void SetRoot(int d) {root = d;}
int InsertSeq(const vector<int> &seq, int first, int last);
int IsSeqInTree(const vector<int> &seq, int first, int last) const;
int ComputeHDistForTree(vector<int> &seq, int first, int last) const;
friend ostream &operator<<(ostream &s, const FlexiTree &tn);
friend istream &operator>>(istream &s, FlexiTree &tn);
int NumNodes() const; // returns the number of nodes in the tree
int NumLeaves() const; // returns the number of leaves in the tree, i.e num of distinct seqs
int NumBranches() const; // returns the total # of branches, of all nodes
class SeqForest {
// this structure is an array of N tree nodes, i.e. a tree for each value
// type
vector<FlexiTree> trees;
// this structure is to record what types of values actually occurred -
// for efficiency, if there were actually fewer value types than
// specified in the config
vector<int> trees_found;
SeqForest(int max_trees);
int IsSeqInForest(const vector<int> &seq, int seq_len) const;
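`ComputeHDistForTree()` reports the minimum number of mismatching positions between a sequence and any sequence stored in the tree. A brute-force reference over a flat list of sequences makes that quantity easy to check. `MinHammingDistance` and the `db` parameter are hypothetical names, and the flat list is an assumption of this sketch rather than STIDE's storage.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical brute-force reference: the minimum number of positions at
// which seq differs from any stored sequence of the same length.
int MinHammingDistance(const std::vector<std::vector<int>> &db,
                       const std::vector<int> &seq) {
  // Worst case: every position differs (also returned for an empty db).
  int best = static_cast<int>(seq.size());
  for (const std::vector<int> &stored : db) {
    if (stored.size() != seq.size()) continue;  // only compare equal lengths
    int misses = 0;
    for (std::size_t i = 0; i < seq.size(); ++i) {
      if (stored[i] != seq[i]) ++misses;
    }
    best = std::min(best, misses);
  }
  return best;
}
```

The tree version reaches the same minimum by following every path, which is why the comment in flexitree.C calls it highly compute-intensive.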

@ -0,0 +1,34 @
#ifndef __OPT_INFO_H
#define __OPT_INFO_H
#include <string>
#define NUM_OPTS 16
#define SHORT_NAME 0
#define LONG_NAME 1
using std::string;
class OptInfo {
string long_name; // Long name of this option; used in
// configuration file and with the -- marker
// on the command line
string short_name; // Short name of this option; used with the -
// marker on the command line
int set; // Flag indicating if this option has already
// been set
char type; // type of value: legitimate values are f
// (flag, i.e., boolean), i (int), s (string)
// or h (help)
union { // pointer to actual value to be set
int *flag_val; // value if type = 'f'
int *int_val; // value if type = 'i'
string *str_val; // value if type = 's'
OptInfo() {};

@ -0,0 +1,54 @
#ConfigFileRev: 1
#Sample STIDE configuration file containing default values.
db_name: default.db # name of database
seq_len: 6 # length of sequences
max_elements: 500 # maximum number of unique elements in input
max_streams: 100 # maximum number of unique streams in input
pair_offset: 0 # offset for pair number count
add_output_format: \
"DB Size: %d\tStream: %s\tPair Number: %p\n"
# In verbose mode, STIDE will print
# this information for every new
# sequence added to the database. In
# very verbose mode, STIDE will print
# this information for every sequence
# considered. Possible data:
# %d Database Size
# %i Pair number of last data element of
# sequence in its particular
# data stream
# %p Pair number of last data element of
# sequence in the whole input
# stream
# %s Stream Number
compare_output_format: \
"Pair Number: %p\tStream Number: %s\n"
# In verbose mode, STIDE will print
# this information for every sequence
# which is itself an anomaly or whose
# locality frame contains an anomaly.
# In very verbose mode, STIDE will
# print this information for every
# sequence. Possible data:
# %a 1 if this sequence is an anomaly, 0
# otherwise
# %c locality frame count of this sequence
# %h Hamming distance
# %i Pair number of last data element of
# sequence in its particular data stream
# %p Pair number of last data element of
# sequence in the entire input
# %s Stream Number
lf_size: 1 # 1 causes locality frame counts not
# to be computed
add_to_db: off # Add this data to the database, or, if there
# is no database, create a new one -- do not
# do comparisons
output_graph: off # Outputs graphing information in Dot
# format
compute_hdist: off # Compute Hamming distances
write_db_stats: off # At end, print out statistics about database
verbose: off # See add_output_format and compare_output_format
very_verbose: off # See add_output_format and compare_output_format

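The sample configuration above uses `key: value # comment` lines. A minimal sketch of how one such line could be split is shown below. `ParseConfigLine` and `Trim` are hypothetical helpers, not part of STIDE's `Config` class, and the sketch assumes that backslash continuations have already been joined and that the `#ConfigFileRev:` header is handled elsewhere (here it is simply treated as a comment).

```cpp
#include <string>

// Hypothetical helper, not part of STIDE: trims whitespace from both ends.
static std::string Trim(const std::string &s) {
  const char *ws = " \t\r\n";
  const std::string::size_type first = s.find_first_not_of(ws);
  if (first == std::string::npos) return "";
  const std::string::size_type last = s.find_last_not_of(ws);
  return s.substr(first, last - first + 1);
}

// Hypothetical helper: splits one "key: value # comment" line.
// Returns false for blank lines, comment-only lines, and lines without
// a ':' separator. A '#' inside double quotes is kept as part of the value.
bool ParseConfigLine(const std::string &line, std::string &key,
                     std::string &value) {
  std::string body;
  bool in_quotes = false;
  for (std::string::size_type i = 0; i < line.size(); ++i) {
    const char c = line[i];
    if (c == '"') in_quotes = !in_quotes;
    if (c == '#' && !in_quotes) break;  // rest of the line is a comment
    body += c;
  }
  const std::string::size_type colon = body.find(':');
  if (colon == std::string::npos) return false;
  key = Trim(body.substr(0, colon));
  value = Trim(body.substr(colon + 1));
  return !key.empty();
}
```

Splitting on the first colon only keeps values such as `"DB Size: %d"` intact, which matters for the two output format strings.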
@ -0,0 +1,576 @
* *
* STIDE: Sequence Time-Delay Embedding v1.2 *
* *
* Written by Steve Hofmeyr 7/21/1996 *
* Revised by Julie Rehmeyer 3/1998 *
* Revised by Hajime Inoue 11/2006 *
* *
* Copyright (C) 1996, 1998 Regents of the University of New Mexico. *
* Copyright (C) 2006 Hajime Inoue. *
* All Rights Reserved. *
* *
* This program is free software; you can redistribute it and/or *
* modify it under the terms of the GNU General Public License as *
* published by the Free Software Foundation; either version 2 of *
* the License, or (at your option) any later version. *
* *
* This program is distributed in the hope that it will be useful, *
* but WITHOUT ANY WARRANTY; without even the implied warranty of *
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the *
* GNU General Public License for more details. *
* *
* You should have received a copy of the GNU General Public *
* License along with this program; if not, write to the Free *
* Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, *
* USA. *
* *
#include <stdlib.h>
#include <string>
#include <iostream>
#include <fstream>
#include <vector>
#include <map>
#include "config.h"
#include "stream.h"
#include "flexitree.h"
#define DBREV 1
using std::vector;
using std::cin;
using std::cerr;
using std::cout;
using std::endl;
using std::ofstream;
typedef std::map<int, int> HashTableInt;
int counter = 0;
Stream *GetReadyStream(vector<Stream> &streams, HashTableInt
&sid_table, int &num_streams_fnd, int
&total_pairs_read, const Config &cfg);
int ReadDB(SeqForest &db_forest, const string &db_name,
int &seq_len);
void WriteDB(const SeqForest &db_forest, const string &db_name, const
int db_size, const int seq_len);
void FinalReport(const Config &cfg, const SeqForest &normal, const int
num_streams_fnd, const int num_seqs_added, const
vector<Stream> &streams, const int db_size);
void WriteDBStats(const SeqForest &db_forest, ostream &out_stream,
const int db_size);
void OutputGraph(const SeqForest &db_forest, string db_name);
int GetPrimeLargerThan(const int n);
int ExtToInt(HashTableInt &sid_table, int key, int next_value)
if(sid_table.find(key) == sid_table.end())
sid_table[key] = next_value;
return sid_table[key];
* main() *
* Input: int argc: Number of command-line arguments *
* char *argv[]: array of strings containing *
* command-line arguments *
* Output: 0 if successful, -1 if unsuccessful *
int main(int argc, char *argv[])
Config cfg((const int) argc, (const char **) argv);
// Declare configuration object and do
// the configuration on the basis of the
// command line arguments and the
// configuration file
Stream *active_stream; // This will point to the stream that
// currently has a sequence to be worked
// on (either added to the database or
// compared).
HashTableInt sid_table;
// Hash table relating external stream ids to
// internal sids; make size of table
// smallest prime larger than the number
// of streams
SeqForest normal(cfg.max_elements); // Uninitialized forest of
// normal sequences
vector<Stream> streams(cfg.max_streams); // Array of stream objects,
// one for each data stream
// in input, which are
// allocated as needed
int num_streams_fnd = 0; // Number of data streams
// encountered to date
int total_pairs_read = cfg.pair_offset; // Number of pairs read from
// input to date from all
// the data streams combined
// -- can be offset using
// the "-n" switch
int db_size; // Total number of unique
// sequences in the database
int init_db_size = 0; // Number of unique
// sequences in the
// pre-existing database
// Read database into normal, if database exists
db_size = init_db_size = ReadDB(normal, cfg.db_name, cfg.seq_len);
if (cfg.add_to_db)
while ((active_stream =
GetReadyStream(streams, sid_table, num_streams_fnd,
total_pairs_read, cfg)) != NULL)
active_stream->AddToDB(normal, db_size, total_pairs_read, cfg);
WriteDB(normal, cfg.db_name, db_size, cfg.seq_len);
if (cfg.output_graph)
int i = 0;
while ((active_stream =
GetReadyStream(streams, sid_table, num_streams_fnd,
total_pairs_read, cfg)) != NULL)
active_stream->CompareSeq(cfg, normal, total_pairs_read);
FinalReport(cfg, normal, num_streams_fnd, db_size - init_db_size,
streams, db_size);
* GetReadyStream() *
* This function reads a pair from the input, appends the element *
* to the current sequence string in the appropriate data stream, *
* finds out if that data stream has a complete sequence to be *
* processed, continues until it has found such a data stream, and *
* returns a pointer to it. It updates num_streams_fnd, *
* total_pairs_read, sid_table, and streams. *
* *
* Input: vector<Stream> &streams: the array of streams that we have *
* found so far *
* HashTableInt &sid_table: hash table relating external sids *
* to internal sids *
* int &num_streams_fnd: the number of streams found so far; *
* int &total_pairs_read: the number of pairs read from the *
* input stream so far *
* const Config &cfg: configuration information *
* *
* Output: a pointer to the next stream that is ready for processing *
Stream *GetReadyStream(vector<Stream> &streams, HashTableInt
&sid_table, int &num_streams_fnd, int
&total_pairs_read, const Config &cfg)
Stream *ready_stream = NULL;
int ext_sid;
int int_sid;
int sval;
cin >> ext_sid;
while (!cin.eof()) {
if (ext_sid == -1) {
// int_sid = sid_table.ExtToInt(ext_sid, num_streams_fnd);
int_sid = ExtToInt(sid_table, ext_sid, num_streams_fnd);
cin >> sval;
// Update num_streams_fnd, if necessary
if (int_sid >= num_streams_fnd) {
if (int_sid > cfg.max_streams) {
cerr<<"ERROR: Too many streams to follow, aborting..."<<endl;
// We need a new stream object
if(num_streams_fnd == streams.size())
cerr << "num_streams_fnd: " << num_streams_fnd << endl;
cerr << "cfg.max_streams: " << cfg.max_streams << endl;
streams[num_streams_fnd].Init(cfg, int_sid, ext_sid);
num_streams_fnd = int_sid + 1;
if (streams[int_sid].Ready()) {
ready_stream = &streams[int_sid];
cin >> ext_sid;
return ready_stream;
* ReadDB() *
* Reads the database from a file and returns the number of unique *
* sequences in the database. Checks for appropriate revision *
* number. If it is a revision DBREV database, the second line *
* will be "#DBseq_len: " followed by the sequence length. The *
* next line will contain a single number, giving the root of the *
* first tree. The following lines will contain the tree itself. *
* The first seq_len numbers make up the first sequence (so the *
* first number of the second line will be the same as the number *
* on the first line). The next number will be a negative number *
* between -(seq_len-1) and -2, indicating how far to backtrack in *
* the first sequence, and the following positive numbers give the *
* rest of the second sequence. So, for example, -3 would mean *
* backtrack 3 numbers, take the previous numbers including the *
* one you're on, and append the next two numbers. So after the *
* -3 you would find two positive numbers, followed by a negative *
* number (which you would use the same way as you used the -3, on *
* the most recent sequence). Each tree is terminated by the *
* number -1. So the sample input file *
* 3 *
* 3 4 2 9 10 3 -4 3 9 8 -2 3 -3 4 9 -1 *
* 2 *
* 2 3 4 5 6 7 -3 2 9 -1 *
* yields the sequences: *
* 3 4 2 9 10 3 *
* 3 4 2 3 9 8 *
* 3 4 2 3 9 3 *
* 3 4 2 3 4 9 *
* 2 3 4 5 6 7 *
* 2 3 4 5 2 9 *
* *
* Input: SeqForest &db_forest Forest of sequences *
* const string &db_name Name of database *
* int &seq_len User-specified sequence length *
* *
* Output: the number of unique sequences in the database *
* *
int ReadDB(SeqForest &db_forest, const string &db_name,
int &seq_len)
ifstream in_db_file(db_name.c_str()); // file to read the database from
int db_size = 0; // size of the database
int root; // the first element of the sequences
// we are reading in at the moment;
// i.e., the root of this tree
string buff;
int db_seq_len;
int rev_num;
if (!in_db_file.is_open()) {
cerr<<"WARNING: Cannot open database file " << db_name
<< " for input"<<endl<<"Creating a new file"<<endl;
return 0;
// Check to see if the first line contains "#DBrev:"
if (buff == "#DBrev:") {
if (rev_num > DBREV) {
cerr << "ERROR: The revision number is greater than " << DBREV
<< ". This version of STIDE is only capable of dealing "
<< "with databases through DBrev " << DBREV
<< ". Aborting..."<<endl;
if (rev_num < DBREV) {
cerr << "ERROR: Revision number of database must be >= " << DBREV
<< endl;
// Now we know that it is revision DBREV. Check sequence length of
// database against user-indicated sequence length
// Now check to see if next line is "#DBseq_len: " followed by a
// number
if (buff != "#DBseq_len:") {
cerr << "ERROR: The second line of the database does not "
<< "contain the string \"#DBseq_len: \"" << endl
<< "followed by the sequence length of the database, as "
<< "required of revision " << DBREV
<< " databases. Aborting..."<< endl;
if (db_seq_len != seq_len) {
cerr << "WARNING: Database sequence length is " << db_seq_len
<< ", which does not match "
<< "sequence length specified" << endl
<< "by user (or by default if no specification was given), "
<< "which is " << seq_len << endl
<< "I will use the database sequence length. If that is "
<< "not what you intended, type Ctrl-C to abort." << endl;
seq_len = db_seq_len;
// Read next number into root
in_db_file >> root;
// Otherwise, we assume we have an old-style database, and let the
// user know that that's our assumption
else {
cerr << "WARNING: The string \"#DBrev: \" is not in the first "
<< "line of the database." << endl
<< "I'm assuming that it's an older style of database, and "
<< "will read it in" << endl
<< "based on that assumption. If that is not what you want "
<< "me to do, type CTRL-C" << endl << endl;
// we have just read the first root into buff -- put it in root
// instead
root = atoi(buff.c_str());
while (!in_db_file.eof()) {
if (root == -1) break;
db_size += db_forest.trees[root].NumLeaves();
return db_size;
* WriteDB() *
* Writes db_forest to the file db_name, with the format described *
* in the header of ReadDB(). Prints database statistics at the *
* end of the file. *
* *
* Input: const SeqForest &db_forest Forest of sequences in *
* database *
* const string &db_name Name of file in which to *
* put database. *
* const int db_size Number of unique sequences *
* in the database *
* const int seq_len Sequence length *
* *
* Output: none *
void WriteDB(const SeqForest &db_forest, const string &db_name, const
int db_size, const int seq_len)
ofstream out_db_file(db_name.c_str());
if (!out_db_file.is_open())
cerr << "ERROR: Cannot open database file " << db_name
<< " for output, aborting..." << endl ;
out_db_file << "#DBrev: " << DBREV << endl;
out_db_file << "#DBseq_len: " << seq_len << endl;
for (int i = 0; i < db_forest.trees.size(); i++)
if (db_forest.trees_found[i])
out_db_file << db_forest.trees[i] << endl;
out_db_file<<" -1"<<endl;
// we can now write anything, so I will write the db stats
out_db_file<<"; DB STATS"<<endl;
WriteDBStats(db_forest, out_db_file, db_size);
* FinalReport() *
* Reports data at end of run. The number of streams, the number *
* of input pairs, and the number of sequences in the input are *
* always reported. If we have done a comparison run, we report *
* the number of anomalies, and the percentage of sequences that *
* were anomalous. Additionally, if asked for, the Hamming *
* distance or locality frame count is reported. If we have added *
* to the database, we report having done so and report the number *
* of sequences added. If database statistics are asked for, we *
* report the number of nodes, the number of unique sequences, the *
* number of branches, and the average database branch factor. *
* *
* Input: const Config &cfg: Configuration information *
* const SeqForest &normal: DB of normal sequences *
* const int num_streams_fnd: Total number of streams found*
* const int num_seqs_added: Number of unique sequences *
* added *
* const vector<Stream> &streams: Array of data streams *
* const int db_size: Number of unique sequences *
* in DB *
* *
* Output: none *
* *
void FinalReport(const Config &cfg, const SeqForest &normal, const int
num_streams_fnd, const int num_seqs_added, const
vector<Stream> &streams, const int db_size)
int total_pairs = 0;
int total_seqs = 0;
int total_anoms = 0;
int total_max_lfc = 0;
int total_max_hdist = 0;
int db_nodes = 0;
int db_seqs = 0;
int db_branches = 0;
int j;
// Sum up number of pairs input and number of seqs from all the streams
for (j = 0; j < num_streams_fnd; j++) {
total_seqs += streams[j].GetNumSeqsFnd();
total_pairs += streams[j].GetNumPairsRead();
cout << endl;
cout << "Number of different streams in input = "
<< num_streams_fnd << endl;
cout << "Total number of input pairs = "
<< total_pairs << endl;
cout << "Total number of sequences in input = "
<< total_seqs << endl;
if (cfg.add_to_db) {
cout << "File added to database" << endl;
cout << "Number of new sequences added to the database: "
<< num_seqs_added << endl;
else {
cout << "Scan completed" << endl;
// Sum up number of anomalies from all the streams
for (j = 0; j < num_streams_fnd; j++) {
total_anoms += streams[j].GetNumAnoms();
cout << "Number of anomalies = "
<< total_anoms << endl;
cout << "Percentage anomalous = "
<< ((float)total_anoms * 100.0)/total_seqs << endl;
// If asked for, compute Hamming distances across streams and report
if (cfg.compute_hdist) {
for (j = 0; j < num_streams_fnd; j++) {
if (streams[j].GetMaxHDist() > total_max_hdist) {
total_max_hdist = streams[j].GetMaxHDist();
cout << "Largest minimum Hamming distance = "
<< total_max_hdist << endl;
// If asked for, compute lfc across streams and report
if (cfg.lf_size > 1) {
for (j = 0; j < num_streams_fnd; j++) {
if (streams[j].GetMaxLFC() > total_max_lfc) {
total_max_lfc = streams[j].GetMaxLFC();
cout << "Maximum lfc = " << total_max_lfc << endl;
// If asked for, compute db stats and report
if (cfg.write_db_stats) {
WriteDBStats(normal, cout, db_size);
* WriteDBStats() *
* Computes and writes to out_stream the number of nodes in *
* the database, the number of unique sequences, the number of *
* branches, and the average database branch factor. *
* *
* Input: const SeqForest &db_forest Forest of sequences in *
* database *
* ostream &out_stream Where to write info *
* const int db_size Number of unique sequences in the *
* database *
* *
* Output: none *
void WriteDBStats(const SeqForest &db_forest, ostream &out_stream,
const int db_size)
int db_nodes = 0;
int db_branches = 0;
for (int i = 0; i < db_forest.trees.size(); i++)
if (db_forest.trees_found[i])
db_nodes += db_forest.trees[i].NumNodes();
db_branches += db_forest.trees[i].NumBranches();
out_stream << "Number of DB nodes = " << db_nodes << endl;
out_stream << "Number of unique sequences = "<<db_size << endl;
out_stream << "Number of branches (edges) = "<<db_branches << endl;
out_stream << "Average DB branch factor = "
<<((float)db_branches/(db_nodes - db_size))<<endl;
* OutputGraph() *
* Writes a file containing input for the program Dot. *
* Running Dot on that file produces a PostScript file *
* containing a picture of the whole database tree. *
* *
* Input: const SeqForest &db_forest Forest of sequences in *
* database *
* const string db_name Filename to use *
* *
* Output: none *
void OutputGraph(const SeqForest &db_forest, const string db_name)
char *dot_filename;
dot_filename = new char [strlen(db_name.c_str())+5]; // 4 for ".dot" plus 1 for the terminating NUL
strcpy(dot_filename, db_name.c_str());
ofstream output_file(strcat(dot_filename,".dot"));
output_file<<"digraph \""<<db_name<<"\" {"<<endl;
output_file<<" ratio=auto;"<<endl;
output_file<<" page=\"8.5,11\";"<<endl;
for (int i = 0; i < db_forest.trees.size(); i++) {
if (db_forest.trees_found[i])

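The header comment of ReadDB() describes a compact encoding: the first `seq_len` numbers of a tree are a full sequence, and each later negative marker `-k` means "keep the shared prefix and read `k-1` new trailing elements", with `-1` closing the tree. The sketch below decodes one tree line under that reading. `DecodeTree` is a hypothetical helper written for illustration; it is not how STIDE itself loads the database (STIDE builds a `SeqForest`), and it stops quietly on malformed input.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of STIDE: expands the tokens of one
// encoded tree (ending in -1) into the full sequences it represents.
std::vector<std::vector<int> > DecodeTree(const std::vector<int> &tokens,
                                          int seq_len) {
  std::vector<std::vector<int> > seqs;
  if (seq_len <= 0 || tokens.size() < static_cast<std::size_t>(seq_len))
    return seqs;

  // The first seq_len tokens are the first sequence, written out in full.
  std::vector<int> cur(tokens.begin(), tokens.begin() + seq_len);
  seqs.push_back(cur);

  std::size_t pos = static_cast<std::size_t>(seq_len);
  while (pos < tokens.size()) {
    const int marker = tokens[pos++];
    if (marker == -1) break;        // end of this tree
    const int fresh = -marker - 1;  // a marker -k introduces k-1 new elements
    if (fresh <= 0 || fresh >= seq_len ||
        pos + static_cast<std::size_t>(fresh) > tokens.size())
      break;                        // malformed input: stop decoding
    // Keep the shared prefix and overwrite the tail with the new elements.
    for (int i = 0; i < fresh; ++i)
      cur[seq_len - fresh + i] = tokens[pos++];
    seqs.push_back(cur);
  }
  return seqs;
}
```

Feeding it the first sample tree from the ReadDB() comment (`3 4 2 9 10 3 -4 3 9 8 -2 3 -3 4 9 -1` with a sequence length of 6) reproduces the four sequences listed there.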
@ -0,0 +1,367 @
#include <stdlib.h>
#include <stdio.h>
#include <string>
#include <iostream>
#include <fstream>
#include "stream.h"
* Init() *
* Initializes an instance of Stream. *
* *
* Input: const Config &cfg Configuration information *
* const int intern_id internal stream identifier *
* const int extern_id external stream identifier *
* Output: none *
using std::cerr;
using std::endl;
void Stream::Init(const Config &cfg,
const int intern_id, const int extern_id) {
int i;
// initialize all the arrays
for(i=0; i < cfg.seq_len; i++)
current_seq[i] = -1;
num_in_seq = -1;
num_pairs_read = 0;
num_anoms = 0;
num_seqs_fnd = 0;
int_sid = intern_id;
ext_sid = extern_id;
max_hdist = 0;
seq_hdist = 0;
for(i=0; i < cfg.lf_size; i++)
lf[i] = 0;
seq_lfc = 0;
max_lfc = 0;
ready = 0;
seq_len = cfg.seq_len;
* Append() *
* This function puts the integer given into the current_seq array *
* as the last element. It flags ready according to whether *
* current_seq is full. Updates num_in_seq, ready, current_seq, *
* num_seqs_fnd, and num_pairs_read. *
* *
* Input: const int new_value The next value to be put into the *
* current_seq array *
* Output: none *
void Stream::Append(const int new_value)
// missing system call - zero the current sequence
if (new_value == -1) {
num_in_seq = -1;
ready = 0;
else {
if (num_in_seq < seq_len - 1) { // window not yet full
current_seq[num_in_seq] = new_value;
if (num_in_seq == seq_len - 1) {
ready = 1;
else {
// Roll over current_seq array
for (int k = 0; k < num_in_seq; k++) {
current_seq[k] = current_seq[k + 1];
current_seq[num_in_seq] = new_value;
* AddToDB() *
* *
* Adds current_seq to the database if it isn't already there; *
* Returns 0 if it is already there, 1 if it is new. Updates *
* normal and db_size. *
* *
* Input: SeqForest &normal Forest of normal sequences *
* int &db_size Number of unique sequences in the *
* database *
* const int total_pairs_read Number of pairs read from the *
* entire input stream *
* const Config &cfg Configuration Information *
* Output: 0 if sequence isn't new, 1 if it is *
int Stream::AddToDB(SeqForest &normal, int &db_size, const int
total_pairs_read, const Config &cfg) const
int is_new;
// If there is not a tree with the same root as this sequence has,
// make a new tree with that root and flag trees_found
if (!normal.trees_found[current_seq[0]])
normal.trees_found[current_seq[0]] = 1;
// Try to add the sequence. If it's already there, is_new will be
// set to 0, otherwise it will be set to 1.
is_new = normal.trees[current_seq[0]].InsertSeq(current_seq, 0, seq_len-1);
db_size += is_new;
if ((is_new && cfg.verbose) || cfg.very_verbose)
ReportNewSeq(cfg, total_pairs_read, db_size);
if (is_new)
return 1;
return 0;
* CompareSeq() *
* Compares the current sequence in this stream to the database, *
* in the manner indicated by the configuration file. Reports *
* on anomalies if told to by the configuration file. Updates *
* num_anoms, seq_hdist, max_hdist, seq_lfc, and max_lfc. *
* *
* Input: const Config &cfg: Information from configuration file *
* const SeqForest &normal: DB of normal sequences *
* const int total_pairs_read: Number of pairs read from *
* all of the streams *
* Output: none *
void Stream::CompareSeq(const Config &cfg, const SeqForest &normal,
const int total_pairs_read)
int is_anom; // flag to indicate whether current_seq is an anomaly
is_anom = ComputeMisses(normal);
if ((is_anom) && (cfg.compute_hdist)) {
if (cfg.lf_size > 1) {
ComputeLF(is_anom, cfg.lf_size);
// if we're in verbose mode and either current_seq is an anomaly or
// its locality frame contains an anomaly, report it
if ((cfg.very_verbose) || (cfg.verbose && (is_anom || seq_lfc))) {
ReportSeq(cfg, total_pairs_read, is_anom);
* ComputeMisses() *
* Compares the current sequence to the database sequences. If *
* there is an exact match, we return 0. Otherwise we return 1. *
* Updates num_anoms and seq_hdist. *
* *
* Input: const SeqForest &normal: DB of normal sequences *
* Output: 0 if there is an exact match *
* 1 if the sequence is anomalous *
int Stream::ComputeMisses(const SeqForest &normal)
if (normal.IsSeqInForest(current_seq, seq_len)) {
seq_hdist = 0;
// We have an anomaly
* ComputeHDist() *
* Compares the current sequence in this stream to each sequence *
* in the database in turn, adding up the number of mismatches *
* between the two sequences. The smallest difference between *
* the current sequence and the database sequences is the minimum *
* Hamming distance for the current sequence. If this minimum *
* Hamming distance is greater than the largest minimum Hamming *
* distance encountered so far, then the variable max_hdist is *
* updated. Updates seq_hdist and max_hdist. *
* *
* Input: const SeqForest &normal: DB of normal sequences *
* *
* Output: none *
void Stream::ComputeHDist(const SeqForest &normal)
int misses_on_this_seq; // the number of mismatches between
// current_seq and the sequence we're
// comparing it with at the moment
seq_hdist = seq_len; // start with seq_hdist as high as
// possible
// We compare current_seq with each sequence in our database tree
for (int i = 0; i < normal.trees.size(); i++) {
// Have we seen any sequences starting with element i? If not, we
// can go on to consider sequences starting with element i+1.
if (normal.trees_found[i]) {
misses_on_this_seq =
normal.trees[i].ComputeHDistForTree(current_seq, 0, seq_len-1);
if (misses_on_this_seq < seq_hdist) {
seq_hdist = misses_on_this_seq;
if (seq_hdist > max_hdist) {
max_hdist = seq_hdist;
* ComputeLF() *
* Computes the number of misses in current_seq's locality frame. *
* Updates lf, seq_lfc and max_lfc. *
* *
* Input: const int is_anom Flag to indicate whether *
* current_seq is an anomaly *
* const int lf_size Size of locality frame *
* Output: none *
void Stream::ComputeLF(const int is_anom, const int lf_size)
// While num_seqs_fnd is no greater than lf_size, the locality frame
// array is still being filled
if (num_seqs_fnd <= lf_size) {
lf[num_seqs_fnd-1] = is_anom;
seq_lfc += is_anom;
else {
// We're about to remove the first element of lf; since seq_lfc is
// the sum of the elements of lf, we should subtract lf[0] from
// seq_lfc to remove it from the sum.
seq_lfc -= lf[0];
// Now we add is_anom and seq_lfc is the sum of the new locality
// frame.
seq_lfc += is_anom;
// roll over the array
for (int i = 0; i < lf_size-1; i++) {
lf[i] = lf[i+1];
lf[lf_size-1] = is_anom;
if (seq_lfc > max_lfc) {
max_lfc = seq_lfc;
* ReportSeq() *
* This function reports data about a sequence. Specifically, it *
* can report the external stream id, a number indicating where *
* the first element of the current sequence occurs in the input, *
* a number indicating how many pairs from this particular data *
* stream have been read prior to the first element of the *
* sequence, the minimum Hamming distance for the current *
* sequence, the locality frame count, *
* and whether this particular sequence is itself an anomaly (it *
* could be that some other sequence in its locality frame is *
* anomalous). The configuration file determines which of those *
* possible data are reported and in what format. Updates no *
* values. *
* *
* Input: const Config &cfg Configuration information *
* const int total_pairs_read Total number of pairs read *
* from the input stream from any data *
* stream, not just this one *
* const int is_anom flag for whether the current *
* sequence is itself an anomaly *
* Output: none *
void Stream::ReportSeq(const Config &cfg, const int total_pairs_read,
const int is_anom) const
for (int i = 0; i < cfg.num_fvars; i++) {
switch (cfg.write_val[i]) {
case 'a':
printf(cfg.fmt_str[i], is_anom); break;
case 'c':
if (cfg.lf_size > 1) {
printf(cfg.fmt_str[i], seq_lfc);
case 'h':
if (cfg.compute_hdist) {
printf(cfg.fmt_str[i], seq_hdist);
case 'i':
printf(cfg.fmt_str[i], num_pairs_read); break;
case 'p':
printf(cfg.fmt_str[i], total_pairs_read); break;
case 's':
printf(cfg.fmt_str[i], ext_sid); break;
* ReportNewSeq() *
* This function reports on sequences which have been newly added *
* to the database. It can report the external stream *
* identifier, where the first element of the sequence occurs *
* both within the whole input stream and within its own data *
* stream, and the number of unique sequences in the database *
* after this sequence has been added. The configuration file *
* determines which of those possible data are reported and in *
* what format. Updates no values. *
* *
* Input: const Config &cfg Configuration information *
* const int total_pairs_read Total number of pairs read *
* from the input stream from any data *
* stream, not just this one *
* const int db_size Number of unique sequences *
* in the database *
* Output: none *
void Stream::ReportNewSeq(const Config &cfg, const int total_pairs_read,
const int db_size) const
for (int i = 0; i < cfg.num_fvars; i++) {
switch (cfg.write_val[i]) {
case 'd':
printf(cfg.fmt_str[i], db_size); break;
case 'i':
printf(cfg.fmt_str[i], num_pairs_read); break;
case 'p':
printf(cfg.fmt_str[i], total_pairs_read); break;
case 's':
printf(cfg.fmt_str[i], ext_sid); break;

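The locality frame count maintained by ComputeLF() is the number of anomalous sequences among the most recent `lf_size` sequences of a stream. The sketch below shows the same bookkeeping with a `std::deque` in place of the manual array roll-over. `LocalityFrame` is a hypothetical class introduced for illustration only; a frame size of 0 is clamped to 1 here, which is an assumption of this sketch rather than STIDE behaviour.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical class, not part of STIDE: counts how many of the last
// `size` sequences were anomalous and remembers the largest count seen.
class LocalityFrame {
 public:
  explicit LocalityFrame(std::size_t size)
      : size_(size == 0 ? 1 : size), count_(0), max_count_(0) {}

  // Records one sequence (is_anom is 0 or 1) and returns the updated count.
  int Push(int is_anom) {
    if (frame_.size() == size_) {  // frame is full: drop the oldest entry
      count_ -= frame_.front();
      frame_.pop_front();
    }
    frame_.push_back(is_anom);
    count_ += is_anom;
    if (count_ > max_count_) max_count_ = count_;
    return count_;
  }

  int Count() const { return count_; }
  int MaxCount() const { return max_count_; }

 private:
  std::size_t size_;       // capacity of the frame
  std::deque<int> frame_;  // anomaly flags, oldest first
  int count_;              // sum of the flags currently in the frame
  int max_count_;          // largest value count_ has reached
};
```

Keeping a running sum avoids re-adding the whole frame on every sequence, which is the same idea as the `seq_lfc -= lf[0]` step in ComputeLF().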
@ -0,0 +1,63 @
#ifndef __STREAM_H
#define __STREAM_H
#include <vector>
#include "config.h"
#include "flexitree.h"
using std::vector;
class Stream {
Stream() {};
void Init(const Config &cfg, const int intern_id, const int
extern_id);
void Append(const int next_value);
int AddToDB(SeqForest &normal, int &db_size, int total_pairs_read,
const Config &cfg) const;
void CompareSeq(const Config &cfg, const SeqForest &normal, const
int total_pairs_read);
int GetMaxHDist(void) const {return max_hdist;}
int GetMaxLFC(void) const {return max_lfc;}
int Ready(void) const {return ready;}
int GetNumAnoms(void) const {return num_anoms;}
int GetNumPairsRead(void) const {return num_pairs_read;}
int GetNumSeqsFnd(void) const {return num_seqs_fnd;}
vector<int> current_seq; // current sequence being filled or
// processed
int num_in_seq; // current_seq is full up through
// num_in_seq
int num_pairs_read; // the number of input pairs belonging to
// this stream that have been read so far
int num_anoms; // the number of anomalies found so far
int num_seqs_fnd; // the number of (not necessarily unique)
// sequences belonging to this stream
// found so far
int ext_sid; // the external stream id
int int_sid; // the internal stream id
int max_hdist; // the largest minimum Hamming distance
// found in this stream
int seq_hdist; // the minimum Hamming distance for
// current_seq
vector<int> lf; // array for locality frame
int seq_lfc; // the locality frame count for this
// sequence
int max_lfc; // the largest locality frame count
// encountered so far
int ready; // a flag to indicate whether this stream
// has a full sequence ready to be
// processed. 0 = no, 1 = yes.
int seq_len; // sequence length
int ComputeMisses(const SeqForest &normal);
void ComputeHDist(const SeqForest &normal);
void ComputeLF(const int is_anom, const int lf_size);
void ReportSeq(const Config &cfg, const int total_pairs_read,
const int is_anom) const;
void ReportNewSeq(const Config &cfg, const int total_pairs_read,
const int db_size) const;

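The `current_seq`, `num_in_seq` and `ready` members declared above implement a fixed-length sliding window over a stream's elements. A minimal sketch of that idea follows. `SlidingWindow` is a hypothetical class, not the `Stream` class itself: it uses `std::vector` operations instead of an index, clamps a length of 0 to 1 (an assumption of this sketch), and treats `-1` as a reset marker in the way the comment in Stream::Append() describes for a missing system call.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical class, not part of STIDE: keeps the last seq_len elements
// of a stream and reports when a full sequence is available.
class SlidingWindow {
 public:
  explicit SlidingWindow(std::size_t seq_len)
      : seq_len_(seq_len == 0 ? 1 : seq_len) {}

  // Appends one element. A value of -1 clears the window.
  // Returns true when the window holds a complete sequence.
  bool Append(int value) {
    if (value == -1) {
      window_.clear();
      return false;
    }
    if (window_.size() == seq_len_)
      window_.erase(window_.begin());  // roll over: drop the oldest element
    window_.push_back(value);
    return window_.size() == seq_len_;
  }

  const std::vector<int> &Current() const { return window_; }

 private:
  std::size_t seq_len_;      // length of a complete sequence
  std::vector<int> window_;  // elements collected so far, oldest first
};
```

Once the window is full, every further element yields a new complete sequence, so a stream of n elements with no resets produces n - seq_len + 1 sequences when n is at least seq_len.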
