# Proposal: Bloom Filter Index

Discussion at https://github.com/microsoft/hyperspace/issues/161

## Abstract

Introduction of a Bloom Filter Index as a new type of index in Hyperspace.

We could use [Spark's BloomFilter API](https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html)
or another efficient Bloom filter implementation.

## Background

TBD


## Proposal

We could utilize Bloom filters in two main ways:
1. A file-level Bloom filter, which can be applied to a pushed-down equality condition.
2. A global Bloom filter, which can be applied to an equi-join plan.

For the file-level BF, we build a BF on a column for each input file and check
whether each file might contain the value of the equality condition or certainly
does not. If the check returns false, we can exclude the file path from the source
file list without reading and filtering the file, because the BF guarantees that
the value does not appear in that file.

For the global BF, we build a Bloom filter on a column over the whole source data
so that the BF can be used to check the existence of a value across all
source files. This BF can be applied to the counterpart relation of an equi-join.
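As a minimal sketch (not a final design), Spark's BloomFilter API mentioned above could be used as follows; the column name, path, and `expectedNumItems`/`fpp` values are illustrative only.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

val spark = SparkSession.builder().getOrCreate()
val df = spark.read.parquet("path/to/source")

// Build a BF on one column: expected number of items and target
// false-positive probability (fpp).
val bf: BloomFilter = df.stat.bloomFilter("indexedColumnName1", 1000000L, 0.01)

// mightContain never returns a false negative: if it returns false, the value
// is guaranteed to be absent and the corresponding data can be skipped.
val maybePresent = bf.mightContain("someValue")
```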


## Rationale

[A discussion of alternate approaches and the trade-offs, advantages, and disadvantages of the specified approach.]
TBD

## Compatibility

[A discussion of the change with regard to the
[compatibility guidelines](../../COMPATIBILITY.md).]
TBD

## Design

[A description of the proposed design/algorithm. This should include a discussion of how the work fits
into [Hyperspace's roadmap](../ROADMAP.md).]

### Index creation

#### createIndex API Extension
```
hs.createIndex(df, BFIndexConfig("indexName", Seq("indexedColumnName1", "indexedColumnName2"), expectedNumItems, fpp))
```
- New `BFIndexConfig`
- `"indexedColumnName"`, `expectedNumItems`, `fpp` => passed to the `BloomFilterIndex` constructor
- Here, we build a separate Bloom filter (BF) for each indexed column, rather than
a single index keyed on a composite of the given columns, because a BF with a
single key column is more widely applicable. The multiple BF entries can be
stored as sub-directories under the index root directory. A sketch of the config
class follows this list.
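A hypothetical sketch of the new config class used in the `createIndex` call above; the field names and types are assumptions (with `Long`/`Double` chosen to match Spark's `BloomFilter.create` signature), not a final API.

```
// Hypothetical shape of the new index config; not a final API.
case class BFIndexConfig(
    indexName: String,
    indexedColumns: Seq[String],
    expectedNumItems: Long,
    fpp: Double)

// Example usage mirroring the createIndex call above (values are illustrative):
// hs.createIndex(df, BFIndexConfig("indexName", Seq("indexedColumnName1", "indexedColumnName2"), 1000000L, 0.01))
```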

#### Metadata extension
```
case class BloomFilterIndex(properties: Seq[BloomFilterIndex.Properties]) {
  val kind = "BloomFilterIndex"
}

object BloomFilterIndex {
  case class Properties(
      indexColumn: String,
      expectedNumItems: Int,
      fpp: Double,
      globalBFPath: String)
}
```
- `Seq[Properties]`, as we support multiple BF entries.

#### Index data schema

##### File-level BF

- In Parquet format
- fileName: file path (i.e. the lineage column)
- BFBinary: BF data for each file, produced by the [writeTo(java.io.OutputStream out) API](https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html#writeTo-java.io.OutputStream-); a build sketch follows the table below

| fileName | BFBinary |
|----------------------------------------|---------------------------------------------|
| path/to/part-0001-xxxxx.parquet | 0xacdfacdfacdfacdfacdfacdf… |
| path/to/part-0002-xxxxx.parquet | 0xabcdabcdabcdabcdabcda… |
| path/to/part-0003-xxxxx.parquet | 0xfffffffffffffffffffffasdfffffffff… |
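A minimal sketch, not the final implementation, of how the file-level index data above could be built: one BF per source file for a single indexed column, serialized with `writeTo()`. The helper name `buildFileLevelBFEntries`, the column name, and the `expectedNumItems`/`fpp` defaults are illustrative assumptions.

```
import java.io.ByteArrayOutputStream

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, input_file_name}
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helper: build one (fileName, BFBinary) row per source file.
def buildFileLevelBFEntries(
    spark: SparkSession,
    sourcePaths: Seq[String],
    indexedColumn: String,
    expectedNumItems: Long = 1000000L,
    fpp: Double = 0.01): DataFrame = {
  import spark.implicits._
  spark.read.parquet(sourcePaths: _*)
    .select(input_file_name().as("fileName"), col(indexedColumn).cast("string").as("value"))
    .as[(String, String)]
    .groupByKey(_._1)
    .mapGroups { case (fileName, rows) =>
      // One BF per file; skip nulls, then serialize via writeTo().
      val bf = BloomFilter.create(expectedNumItems, fpp)
      rows.foreach { case (_, value) => if (value != null) bf.putString(value) }
      val out = new ByteArrayOutputStream()
      bf.writeTo(out)
      (fileName, out.toByteArray)
    }
    .toDF("fileName", "BFBinary")
}

// Example: write the index data for one indexed column under its own sub-directory.
// buildFileLevelBFEntries(spark, Seq("path/to/source"), "indexedColumnName1")
//   .write.parquet("path/to/index/indexedColumnName1")
```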

##### Global BF
- A single BF over all values in the indexed column
- Stored as a separate file
- Build
  - Option 1) merge all file-level BF entries, if possible (see the sketch after this list)
  - Option 2) a multi-level BF built on top of the file-level BFs
  - Option 3) read all values from all files => refresh also requires a full scan
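A sketch of Option 1 under stated assumptions: all file-level BFs were created with the same `expectedNumItems`/`fpp` (required for `mergeInPlace`), the index data path is illustrative, and the effective false-positive rate degrades if the merged item count exceeds `expectedNumItems`.

```
import java.io.ByteArrayInputStream

import org.apache.spark.util.sketch.BloomFilter

// Read every file-level BF entry and deserialize it on the driver.
val fileLevelBFs = spark.read.parquet("path/to/index/indexedColumnName1")
  .select("BFBinary")
  .collect()
  .map(row => BloomFilter.readFrom(new ByteArrayInputStream(row.getAs[Array[Byte]](0))))

// mergeInPlace throws IncompatibleMergeException if the BFs differ in configuration.
val globalBF = fileLevelBFs.reduce((left, right) => left.mergeInPlace(right))
```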

### Index refreshment
#### Full

File-level BF

1. Get the deleted file paths and the appended file paths
2. Remove the BF entries of deleted file paths from the file-level index data
   - the index data keeps only valid rows after removing the invalidated entries
3. Construct a df with new BF entries for the appended files
4. Merge (union) both and write the result as a new index version (see the sketch after this list)
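A sketch of this full-refresh flow for the file-level index data, assuming the `spark` session and the hypothetical `buildFileLevelBFEntries` helper from the earlier sketches; `deletedFilePaths`, `appendedFilePaths`, and the version paths are illustrative inputs.

```
// Assumed inputs: deletedFilePaths: Seq[String], appendedFilePaths: Seq[String].
import org.apache.spark.sql.functions.col

val previousIndexDF = spark.read.parquet("path/to/index/v0")

// 1-2. Drop BF entries whose source files were deleted.
val validEntriesDF = previousIndexDF.filter(!col("fileName").isin(deletedFilePaths: _*))

// 3. Build new BF entries for the appended files.
val appendedEntriesDF = buildFileLevelBFEntries(spark, appendedFilePaths, "indexedColumnName1")

// 4. Union both and write as a new index version.
validEntriesDF.union(appendedEntriesDF).write.parquet("path/to/index/v1")
```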

Global BF

1. Update the global BF using the updated file-level BF entries
   - via the [mergeInPlace(BloomFilter other) API](https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html#mergeInPlace-org.apache.spark.util.sketch.BloomFilter-),
     if possible

... TBD

How the global BF is refreshed depends on the chosen BF schema.

#### Quick

##### Appended files

1. Construct a df with new BF entries for the appended files
2. Write the df in "append" mode, stored as a separate file (see the sketch after this list)
3. Merge the new BF entries into the global BF (TBD)
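A sketch of the quick-refresh path for appended files, reusing the hypothetical `buildFileLevelBFEntries` helper from the earlier sketch; paths and the column name are illustrative.

```
// Build BF entries only for the appended files and append them to the index data.
val appendedEntriesDF = buildFileLevelBFEntries(spark, appendedFilePaths, "indexedColumnName1")
appendedEntriesDF.write.mode("append").parquet("path/to/index/v1")
```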

##### Deleted files

1. Keep the deleted file list in `excluded`
2. The globalBF is still valid
   - update the globalBF for better performance

### New Rules

#### PushDownBFIndexRule

For each relation with a pushed-down equality (EQ) filter:
1. Get the available BF indexes and pick the best one
2. Apply the index
   - Option 1) Optimizer (apply): run a simple Spark job to get the list of files that can be excluded (see the sketch after this list)
   - Option 2) physical operator: keep each BF as a separate file and check it in FileScanRDD?
   - TBD) how can we _broadcast_ BF data?
     - Broadcast => might cause OOM in case of a huge BF
     - Read from storage => remote storage throttling
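A sketch of Option 1 under stated assumptions: a small driver-side job checks the pushed-down equality literal against each file-level BF and keeps only the files that might contain it. `filterValue` and the index path are illustrative; this is not the final rule implementation.

```
import java.io.ByteArrayInputStream

import org.apache.spark.util.sketch.BloomFilter

val filterValue = "someValue" // literal from the pushed-down EQ filter

val candidateFiles = spark.read.parquet("path/to/index/indexedColumnName1")
  .collect()
  .filter { row =>
    val bf = BloomFilter.readFrom(new ByteArrayInputStream(row.getAs[Array[Byte]]("BFBinary")))
    bf.mightContain(filterValue)
  }
  .map(_.getAs[String]("fileName"))

// candidateFiles replaces the relation's source file list; every excluded file is
// guaranteed not to contain filterValue.
```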

#### JoinBFIndexRule

For each equi-join:
1. Get the available BF indexes for the counterpart relation
   - e.g., Table_A join Table_B on col1 => the global BF of Table_A.col1 can be applied to Table_B.col1
2. Inject a BF filter plan before the shuffle or broadcast (?)
   - using a UDF - for a functionality check (see the sketch after this list)
   - using a newly defined physical operator - for better optimization
   - TBD) need to find an efficient way of broadcasting (and maintaining) the global BF
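A sketch of the UDF-based functionality check, assuming a `globalBF` built on Table_A.col1 (for example, via the merge sketch above) and DataFrames `tableA`/`tableB`; names are illustrative, and the OOM concern about broadcasting a huge BF applies here as well.

```
import org.apache.spark.sql.functions.{col, udf}

// Broadcast the global BF and pre-filter Table_B before the join: rows whose join
// key cannot exist in Table_A are dropped early.
val globalBFBroadcast = spark.sparkContext.broadcast(globalBF)
val mightContainInA =
  udf((value: String) => value != null && globalBFBroadcast.value.mightContain(value))

val filteredB = tableB.filter(mightContainInA(col("col1")))
val joined = tableA.join(filteredB, Seq("col1"))
```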

#### Hybrid Scan

For appended files
- File-level BF (PushDownBFIndexRule)
  - appended file paths remain in the relation's file list
  - => no additional work
- Global BF (JoinBFIndexRule)
  - appended files invalidate the globalBF; the outdated globalBF cannot be used without an update
  - Option 1) quick refresh
  - Option 2) build new BF entries for the appended files on the fly and merge them into the globalBF (see the sketch after this list)
  - TBD
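A sketch of Option 2 for appended files, reusing the hypothetical `buildFileLevelBFEntries` helper and the deserialization pattern from the earlier sketches; names and paths are illustrative.

```
import java.io.ByteArrayInputStream

import org.apache.spark.util.sketch.BloomFilter

// Build BF entries for the appended files on the fly and fold them into globalBF.
// mergeInPlace requires the same expectedNumItems/fpp as the existing entries.
buildFileLevelBFEntries(spark, appendedFilePaths, "indexedColumnName1")
  .select("BFBinary")
  .collect()
  .foreach { row =>
    val bf = BloomFilter.readFrom(new ByteArrayInputStream(row.getAs[Array[Byte]](0)))
    globalBF.mergeInPlace(bf)
  }
```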

For deleted files
- File-level BF (PushDownBFIndexRule)
  - deleted file paths are not present in the relation's file list
  - the corresponding BF entries in the index data can simply be ignored
  - => no additional work
- Global BF (JoinBFIndexRule)
  - the globalBF is still valid (deleted values can only cause false positives, never false negatives)
    - => no additional work
  - Otherwise
    - Option 1) quick refresh for better performance


## Implementation

[A description of the steps in the implementation, who will do them, and when.]

> Note: If you want to use any images, please upload the .svg AND .png/.jpg files to `/docs/design/img/` and link to them here.

## Impact on Performance (if applicable)

[A discussion of impact on performance and any corner cases that the author is aware of. If there is a negative impact on performance, please make sure
to capture an issue in the next section. This section may be omitted if there are none.]

## Open issues (if applicable)

[A discussion of issues relating to this proposal for which the author does not
know the solution. If you have already opened the corresponding issues, please link
to them here. This section may be omitted if there are none.]

- This is the first issue ([issue-link]())
- This is the second issue ([issue-link]())
- ...
