Skip to content

Fast Locate

Jouni Siren edited this page Jun 9, 2022 · 23 revisions

General

The r-index is a fast locate() structure that can be used to augment the compressed GBWT. It is defined as class FastLocate in fast_locate.h.

The implementation is based on the original r-index described in:

Travis Gagie, Gonzalo Navarro, and Nicola Prezza: Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space. Journal of the ACM, 2020. DOI: 10.1145/3375890

The r-index stores one suffix array sample at the start and end of each logical run in the BWT. (A logical run is either a maximal run of actual nodes or a single occurrence of the endmarker node in the BWT.) In contrast, the default locate() structure stores the path identifiers (samples the document array) at fixed intervals. Because the r-index is based on global properties of the sequences, it must be rebuilt whenever the GBWT changes.

In a typical case, the r-index uses 3-4 times more space than the BWT itself and much more than the default locate() structure with sample rate 1024. In a 1000GP GBWT, it can report roughly 1.5 million occurrences/second. In contrast, the default structure can report ~10 thousand individual occurrences/second or ~100 thousand occurrences/second when reporting ranges of occurrences.

Version history

R-index version GBWT version Changes
1 v1.1 First version.

Construction

Interface

explicit FastLocate(const GBWT& source)

The construction is very straightforward. It extracts all sequences from the GBWT using multiple OMP threads and stores the suffix array samples at the start/end of each logical run. After the extraction has finished, the construction sorts and compresses the samples.

By default, the construction writes some statistics to stderr. This can be adjusted with the verbosity settings.

Serialization and loading

Like other GBWT structures, the r-index uses the SDSL interface for serialization and loading.

size_type serialize(std::ostream& out, sdsl::structure_tree_node* v = nullptr, std::string name = "") const;
void load(std::istream& in);

The r-index cannot be used without the GBWT. After the index has been loaded, the GBWT must be set explicitly.

void setGBWT(const GBWT& source)

Construction tool

build_ri is a simple tool for building the r-index.

build_ri [options] base_name

The tool reads base_name.gbwt, builds the r-index, and writes it to base_name.ri.

  • -t N: Use N parallel extraction threads.
  • -v: Verify the r-index. This requires the original sequences in file base_name.

Queries

The r-index is a structure that enumerates the suffixes of the sequences in lexicographic order. Because the GBWT is an FM-index of the reverse sequences, the implementation actually enumerates the reverse prefixes.

For best results, use the r-index version of find() that returns the first occurrence in the range in addition to the range itself. The occurrence can then be given as a seed to a locate() query. If no seed is given, the locate() query uses the start of the first run overlapping with the range as a seed. This may cause significant slowdown if the runs are long but the range is short. In such cases, it may be better to repeat the search for the final pattern using the r-index find().

Searching

The r-index uses the following variants of the find() / extend() queries (see Search Interface).

SearchState find(node_type node, size_type& first) const;
template<class Iterator> SearchState find(Iterator begin, Iterator end, size_type& first) const;

SearchState extend(SearchState state, node_type node, size_type& first) const;
template<class Iterator> SearchState extend(SearchState state, Iterator begin, Iterator end, size_type& first) const;

If the search is successful (the returned state is non-empty), first will contain the first occurrence in the range.

Locating the occurrences

The r-index version locate() can optionally take the first occurrence as a seed. If no seed is given, the query will be slower. See also Search Interface.

std::vector<size_type> locate(SearchState state, size_type first = NO_POSITION) const;
std::vector<size_type> locate(node_type node, range_type range, size_type first = NO_POSITION) const;

Example:

size_type first;
SearchState state = r_index.find(pattern.begin(), pattern.end(), first);
std::vector<size_type> result = r_index.locate(state, first);

This searches for vector_type pattern in FastLocate r_index and stores the identifiers of the paths containing it in result.

Decompressing the SA / DA

The r-index can also decompress the part of the suffix array or the document array corresponding to a node. This is almost the same as locate(), but the duplicates are not filtered out and the results are returned in order.

std::vector<size_type> decompressSA(node_type node) const;
std::vector<size_type> decompressDA(node_type node) const;

Example:

std::vector<size_type> result = r_index.decompressSA(node);
for(size_type i = 0; i < result.size(); i++)
{
  result[i] = r_index.seqId(result[i]);
}

This is equivalent to decompressDA(node).

Example:

std::vector<size_type> result = r_index.decompressDA(node);
removeDuplicates(result, false);

This is equivalent to r_index.locate(r_index.index->find(node)).

Low-level interface

A suffix array value (offset) encodes both a path identifier (seq_id) and a sequence offset (seq_offset) in size_type. As a result, the product of the number of sequences and the length of the longest sequence must fit in 64 bits.

size_type pack(size_type seq_id, size_type seq_offset) const;
size_type seqId(size_type offset) const;
size_type seqOffset(size_type offset) const;
std::pair<size_type, size_type> unpack(size_type offset) const;
  • pack(): Encode path identifier and sequence offset as a suffix array value.
  • seqId(): Extract the path identifier.
  • seqOffset(): Extract the sequence offset.
  • unpack(): Decode the path identifier and the sequence offset (in this order).

The following function is the backbone of the r-index:

size_type locateNext(size_type prev) const;

Given the value prev at unknown suffix array offset i, return the next value at suffix array offset i + 1.

size_type locateFirst(node_type node) const;

This returns the first suffix array value in the given node. It can be used as a starting point for iterating over the node, as in decompressSA() and decompressDA().

File format

See also File Formats.

class FastLocate
{
  Header              header;
  sdsl::int_vector<0> samples;
  sdsl::sd_vector<>   last;
  sdsl::int_vector<0> last_to_run;
  sdsl::int_vector<0> comp_to_run;
};
  • header: R-index header (see below).
  • samples: The suffix array value at the start of each logical run in the BWT. In lexicographic order.
  • last: The suffix array value at the end of each logical run in the BWT. In text order, with the values marked with 1-bits.
  • last_to_run: The run identifier for each marked position in last.
  • comp_to_run: The cumulative number of logical runs before each record in the BWT.

Header

struct FastLocate::Header
{
  std::uint32_t tag;
  std::uint32_t version;
  std::uint64_t max_length;
  std::uint64_t flags;
};
  • tag: Must be 0x6B3741D8.
  • version: R-index version. Must be 1.
  • max_length: Length of the longest sequence in the index. Used for pack() / unpack().
  • flags: Must be 0.