This tool dynamically finds segments in event data. It takes a list of events (each occurring at a certain time) and segments the events into time intervals.
The first input is some InputFile. The most basic format has one column for time; all remaining columns are interpreted as strings and specify the event.
For efficiency reasons, the input file can be compiled into an sqlite database using events.py. This allows optimized lookups on all events and the ability to quickly re-run with different options. This happens internally when segmenting anyway, so you may as well do it yourself first.
Finally, one would use dynsnap.py to run on the compiled data (or the raw data, if you skipped that step) and produce some text output files and, optionally, plot files. You then analyze these yourself to understand what is going on.
This is not fully explanatory, but provides an idea of how this tool is used.
Input file:
0 1 A B
0 2 A C
1 1 B D
1 4 B C
...
In this file, the first column is the time, the second column is an optional event weight, and the remaining columns are the event identifiers. In this case, the event identifiers look something like graph edges, with two identifier columns per event. The events are ordered tuples (A, B), etc. (There are extra options for unordered tuples or for selecting only certain columns.)
Now, we process this into the optimized sqlite database:
python events.py input.txt input.sqlite -t 0 -w 1
input.txt and input.sqlite are the input and output filenames. -t 0 indicates that time is in the zeroth column. -w 1 indicates that the first (zero-indexed) column contains weights. These weights are recorded within the database, but to perform a weighted analysis you need to give the weighted option to the next step, too. Weights default to 1 if -w is not specified.
Then, we can run the actual analysis:
python dynsnap.py input.sqlite input.out -p -w -1
The -p
option indicates that we want to create some plots, in
addition to the standard output files. You must include the -w -1
option here to get the weighted analysis, but the column number is
irrelevant since weights are already stored in the database (so -1
is used).
Several output files are written, starting with input.out.
- input.out.txt: Contains columns t_low t_high interval_length similarity_value ... (a minimal loading sketch follows this list).
- input.out.J.txt: Contains information on the maximization process.
- input.out.{pdf,png}: Plots of the segmentation.
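As an illustration, the per-interval output can be read with a few lines of Python. This is only a sketch: it assumes the file is named input.out.txt as in the example above and that the first four columns are those listed; rows that do not parse (such as the differently formatted first row) are simply skipped.

# Minimal sketch (not part of the package): read input.out.txt and print the
# detected intervals. Assumes the first four columns are t_low, t_high,
# interval_length and the similarity value, as listed above.
intervals = []
with open("input.out.txt") as f:
    for line in f:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        try:
            t_low, t_high, length, sim = (float(x) for x in fields[:4])
        except ValueError:
            continue  # e.g. the differently formatted first row
        intervals.append((t_low, t_high, length, sim))

for t_low, t_high, length, sim in intervals:
    print("interval [%g, %g]  length=%g  similarity=%g" % (t_low, t_high, length, sim))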
- Be able to run the code on your input file. The relevant options are -w for the weights (if any) and -t for the column with times. See the options for events.py; they also apply to dynsnap.py.
- If needed, look at the options related to time scales.
- Save output by giving an output prefix as the second positional argument on the command line.
This program does the actual analysis. The most efficient method is to first use events.py to create a database of events, and then run this program on that database. However, you can run directly on .txt input files, too, and for that some options of events.py are duplicated here.
There are many options for controlling this process. A description of the options is given here; details are in the accompanying scientific README.
Basic syntax:
python dynsnap.py {InputFile} {OutputName} [options]
Arguments:
InputFile: | Raw input file, or processed .sqlite input file. |
OutputName: | Prefix for output files. Output is written to OutputName.out.txt, etc. |
User interaction options:
--plot, -p | Write some plots at OutputName.{pdf,png}. Requires matplotlib. |
--plotstyle | Plot style format: 1 or 2 for different plotting formats. These are designed for quick information, not for use in publications. The default is 2. Style 1 plots intervals, interval length, and similarity values. Style 2 plots intervals, local event density, and similarity score. |
--interact | If given, start an interactive IPython shell at the end of the main function. This could be useful for investigating the results and data some, but would require looking at the code quite a bit. |
--tstart T | Start analysis only after this time. Events before this time are not segmented, and the first segment starts at this time. Argument should be a number matching the numeric time scale. |
--tstop T | Stop analysis after this time. Events after this time are not
segmented. Same format as --tstart . |
--tformat=FORMAT | Options available: unixtime. The time number is interpreted according to this rule and converted into a date and time. This is used for plotting and also for printing to stdout and the output files. |
--stats | Don't do anything, just print some stats on the data and exit.
Perhaps this option would be better placed in events.py , but
here it can use the values from --tstart , --tstop , and
--tformat to make the results more useful. |
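For example, a run restricted to a window of Unix timestamps, with times printed as dates, might look like the following (the timestamp values and filenames are purely illustrative):

python dynsnap.py input.sqlite input.out --tformat=unixtime --tstart 1230768000 --tstop 1262304000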
Options related to the segmentation algorithm:
-w N | Use as -w -1 if running without a text file input. This turns on weighted analysis, even if not reading from an input text file. Normally, this option is used to specify the weight column in input files. However, if weights are stored in the database but -w -1 is not given, unweighted sets will be used and the weights are lost. Basically, make sure that some form of the -w option is on the command line whenever you want to use a weighted similarity measure. |
--measure | Specify the similarity measure to use. See the accompanying scientific README for the available measures. |
--merge-first | Perform the "merge first two intervals" process. This might be useful in cases where your data always has sharp transitions or critical events. |
--dtmode=NAME | Select among the three types of search patterns. Options specific to the linear and log scan modes are given below. |
Options related to timescales:
These options are particularly important. The method works by finding similarity peaks as a function of dt. How do we find a maximum, global or local? By default, the algorithm tries hard to find global maxima. This means that it searches forward in time very far (but also efficiently). On the other hand, once we find a local maximum, we could stop and use that. Using these options, you can balance these factors.
Let's say that your data has long-term similarity. To check this, ask what the expected similarity is between the first 1/3rd and last 1/3rd of all data. If this is high (say, close to 1), that means that according to our method, the longest self-similar time scale is all time. An example of this type of data is something with a uniformly repeating pattern: there is local similarity, differences on medium scales, and then similarity on the longest scale. The method is expected to detect the longest-scale similarity.
This method is supposed to be parameter free. Small changes in these options will not affect "true" intervals in well-formed data. However, if there are multiple timescales of similarity, these options can help to switch from one to another.
In the options below, the "peak" means the similarity maximum value, as detected by the corresponding --peakfinder. We always start searching at a small dt, and increase. At each time point, --peakfinder proposes some dt_peak, and then the other options control the stop condition. These options are considered in this order: --dt-search-min first, then --dt-pastpeak-factor / --dt-peak-factor / --dt-pastpeak-max.
By default, we try very hard to find the longest self-similar timescales. If that is too long (for example, all time), first try adjusting --dt-peak-factor (maybe 0.2 to 0.8). If you then get a lot of noisy, short intervals, then use --dt-search-min (something large enough to compensate for random noise at short times). --dt-pastpeak-max (shorter than the long time scale) and --dt-pastpeak-factor (try as low as 2) can be used to try to exclude whole-interval timescales, too.
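For example, a run that stops searching once similarity drops to half of the peak value, while still scanning at least 20 time units forward, might look like the following (the filenames and the values 0.5 and 20 are illustrative, not recommendations):

python dynsnap.py input.sqlite input.out -w -1 --dt-peak-factor=0.5 --dt-search-min=20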
--peakfinder=NAME | Method of finding peaks of similarity when there is a plateau of the same value. |
--dt-peak-factor=FACTOR | If given, we stop searching once the similarity drops to FACTOR*similarity_max. In other words, this makes a trade-off between greedy and longest. For example, if FACTOR=0.5, then if, after a maximum, the similarity drops to 0.5 of that maximal value, we stop searching and use that maximum. greedy is in effect --dt-peak-factor=0.999... and longest/shortest is --dt-peak-factor=0. By default, this is set to -1, so this condition is never triggered. However, in all cases we will scan at least --dt-search-min time units and 10 steps forward before possibly using the peak factor; this prevents finding extremely small dts at short times. (default: -1) |
--dt-pastpeak-factor=FACTOR | Scan forward in time to at least FACTOR*dt_peak. A smaller value means the possibility of shorter intervals, but only if there is a true short-to-long timescale transition. This does not cause an early stop if similarity is continually increasing. (default: 25) |
--dt-pastpeak-max=DT | Never scan more than this far past a peak; in other words, there is a mandatory stop condition DT past the detected peak. This does not cause an early stop if similarity is continually increasing. (default: None) |
--dt-pastpeak-min=DT | Scan forward in time at least this far past the maximum. The difference with --dt-search-min is that this option measures from the peak and --dt-peak-factor overrides it, while --dt-search-min measures from dt=0 and must be satisfied before --dt-peak-factor can stop the search. In most cases, you can just use --dt-search-min. (default: 0) |
--dt-search-min=DT | When scanning forward in time, always search at least this amount of time before stopping, even if other stop conditions are met. This prevents finding similarity maxima caused by noise at very small time scales. Note that this is already mostly compensated for by --dt-peak-factor; one would only need this in special cases. This option only applies to --peakfinder=longest and --peakfinder=shortest. (default: 0) |
Options for --dtmode=linear
--dtstep | Set the increment for searching. Only for the linear scan
mode. Default is 1. |
--dtmin | Set the minimum search time. Only for the linear scan mode.
Default is dtstep. |
--dtmax | Set the maximum search time. Only for the linear scan mode.
Default is 1000*dtstep. |
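For example, a linear scan with a step of 0.5 time units and a maximum search time of 100 might be requested as follows (the values and filenames are illustrative):

python dynsnap.py input.sqlite input.out --dtmode=linear --dtstep=0.5 --dtmax=100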
Options for --dtmode=log
--log-dtmin | Set the increment for searching. Only for the log scan mode.
Default is an adaptive mode. |
--log-dtmax | Not used. |
Options related to input. These options relate to data input and have the same usage as in events.py. See that section for full information. Column numbers start from 0.
-w N | Specify weighted analysis. If operating on raw input, the
zero-indexed column number of the weights in the file. If
operating on an sqlite database, specify -1 or anything to
turn on weighted analysis. |
-t N | Time column |
--unordered | Data columns should be treated as unordered. |
--grouped | See documentation for events.py below. |
--datacols | Data columns. See events.py below. |
--cache | If given, dynsnap.py operates a bit like events.py, so that the initial data is stored in an sqlite database with a name based on the input filename. If the cache already exists and this option is given, use that cache and don't re-read the original data. Note that the data-related options thus have no effect (except -w -1). It is recommended to compile using events.py instead, since there is less risk of unexpected behavior. This option is deprecated and will be removed eventually. |
--regen | Delete and regenerate the cache. Only has any effect if
--cache is specified. |
This program will compile an input file into an optimized sqlite database of events. This is used to make runs of the segmentation faster, since in general data doesn't change, but segmentation is often re-run with different options.
The output is an sqlite database, self-contained within one file. It can be examined using the sqlite3 command line utility, and the format is somewhat self-explanatory.
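For example, assuming the database was written to input.sqlite, the tables and their schema can be listed with the standard sqlite3 dot-commands (the actual table names depend on events.py):

sqlite3 input.sqlite ".tables"
sqlite3 input.sqlite ".schema"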
Basic syntax:
python events.py {InputFile} {OutputName} [options]
Arguments:
InputFile: | Raw input file. Should be a space-separated text file. See the section Input Formats for more information. |
OutputName: | Output sqlite database. |
-t N | Specify column holding time. Columns are zero-indexed! Default: 0. |
-w N | Specify column holding weight. Columns are zero-indexed! Default: None. |
--unordered | If given, the ordering of other columns does not matter, and lines with events "aaa bbb ccc" and "aaa ccc bbb" are considered the same. This, for example, makes graph edges be considered as undirected. |
--grouped | Alternative input format where each line has multiple space-separated events. See the section Input Formats. |
--datacols | If given, only these columns are considered for specifying
events. All other columns (besides the time and weight columns)
are ignored. Format must be as in --datacols="0,1,2,5" .
Columns are zero-indexed! |
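For example, to compile the weighted-graph input shown in the Input Formats section (time in the third column, weight in the second, undirected edges), one could run the following (the filenames are illustrative):

python events.py input.txt input.sqlite -t 3 -w 2 --unordered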
This is an interface to various toy models. Run the program with the name of a model to generate output on stdout. The --grouped option can be given to produce output in grouped format (see below).
Syntax:
python models.py {ModelName} [options]
Options are model-dependent and not documented here, and the models and usage of this module is subject to change.
Models include:
toy101 toy102 toy103 drift periodic
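As a rough sketch of how the pieces fit together, a toy model's output can be fed through the rest of the pipeline; the filenames here are illustrative and model-specific column options may be needed:

python models.py toy101 > toy101.txt
python events.py toy101.txt toy101.sqlite
python dynsnap.py toy101.sqlite toy101.out -p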
Automated tests of various models and options.
Above, a command line interface is presented. All code is modular and can be imported and used directly from Python, without needing to create temporary files. This is the author's primary method of using this program.
Unfortunately, this isn't documented yet (and the interface isn't totally settled yet).
Input is text files. There is one row for each event. One column represents the time. Optionally, one column can represent the weight of each event. All other columns specify the event. Comments (lines beginning with '#') are ignored. Time and weight columns must be numbers, but all other columns can be any string.
Use the "-t" option to specify the column with times (default: 0th column), and use "-w" to specify the weight column (default: unweighted). NOTE: column indexes begin from zero!
Example 1: simple file:
#t event
0 aaa
0 bbb
1 aaa
2 ccc
Example 2: directed graph. 'a', 'b', 'c' are nodes. To use an undirected graph, use the "--unordered" option:
#t e1 e2
0 a b
0 a c
1 c b
2 a c
Example 3: weighted graph. Note the column order. To read this, use the options "-t 3 -w 2". For an undirected graph, use the "--unordered" option:
# e1 e2 weight time
a b 1.0 0
a . 1.0 0
c b 2.0 1
a c 1.0 2
GROUPED FORMAT: With the option "--grouped", you can have multiple events on the same line. Each event is one space-separated string. Times can repeat on different lines. Use "-t" to specify the time column, if not the first, and "-w" to specify a weight column if it exists. The same weight applies to everything on the line:
# t events
0 a b d
1 a e f g h
2 b c d
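To compile such a grouped file, pass the --grouped option to events.py (the filenames are illustrative):

python events.py grouped.txt grouped.sqlite --grouped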
The following files are written:
OutputName.out.txt: | Contains one row for each interval. There is a comment at the top describing the format. Columns are as described above (t_low t_high interval_length similarity_value ...). Note: the first line has a slightly different format, since it is the starting interval. |
OutputName.out.J.txt: | Contains information on each maximization process. There is one block for each segment interval, separated by blank lines. |
OutputName.out.J.{pdf,png}: | Plots |