This repository has been archived by the owner on Jun 14, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 115
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add draft design doc for Bloom Filter Index
- Loading branch information
Showing
1 changed file
with
196 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,196 @@ | ||
# Proposal: Bloom Filter Index | ||
|
||
Discussion at https://github.com/microsoft/hyperspace/161 | ||
|
||
## Abstract | ||
|
||
## Background | ||
|
||
TBD | ||
|
||
Introduction of Bloom Filter Index as a new type of index in Hyperspace. | ||
|
||
We could use [Spark's BloomFilter API](https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html) | ||
or other efficient implementations of Bloom Filter. | ||
|
||
|
||
## Proposal | ||
|
||
We could utilize the bloom filter in 2 main uses: | ||
1. File-level Bloom Filter, which can apply to a pushed down equal condition. | ||
2. Global Bloom Filter, which can apply to an Equi join plan. | ||
|
||
For File-level BF, we could build a BF on a column for each input file and check | ||
if each file might contain the value of the equal condition or certainly not. | ||
If it returns false, we can exclude the file path from the source file list | ||
without reading and filtering the file as BF guarantees that there is no such | ||
value in the file. | ||
|
||
For Global BF, we could build a bloom filter on a column for whole source data | ||
so that the BF can be utilized to check the existence of a value across all | ||
source files. This BF can be applied to the counterpart of an Equi-join. | ||
|
||
|
||
## Rationale | ||
|
||
[A discussion of alternate approaches and the trade offs, advantages, and disadvantages of the specified approach.] | ||
TBD | ||
|
||
## Compatibility | ||
|
||
[A discussion of the change with regard to the | ||
[compatibility guidelines](../../COMPATIBILITY.md).] | ||
TBD | ||
|
||
## Design | ||
|
||
[A description of the proposed design/algorithm. This should include a discussion of how the work fits | ||
into [Hyperspace's roadmap](../ROADMAP.md).] | ||
|
||
### Index creation | ||
|
||
#### createIndex API Extension | ||
``` | ||
hs.createIndex(df, BFIndexConfig(“indexName”, Seq("indexedColumnName1", "indexedColumnName2"), expectedNumItems, fpp)) | ||
``` | ||
- new `BFIndexConfig` | ||
- `"indexedColumnName"`, `expectedNumItems`, `fpp` => BloomFilterIndex constructor | ||
- Here, we build multiple bloom filters(BF) for each indexed column, not just | ||
1 index using the composite key with given indexed columns. It is because the | ||
BF with single key column would be usable more widely. The multiple BF entries can | ||
be handled by sub-directories in the root directory of index. | ||
|
||
#### Metadata extension | ||
``` | ||
case class BloomFilterIndex(properties: Seq[BloomFilterIndex.Properties]) { | ||
val kind = "BloomFilterIndex" | ||
} | ||
object BloomFilterIndex { | ||
case class Properties(indexColumn: String, expectedNumItems: Int, fpp: float, globalBFPath: String)} | ||
} | ||
``` | ||
- Seq[Properties] as we support multiple BF entries. | ||
|
||
#### Index data schema | ||
|
||
##### File-level BF | ||
|
||
- In Parquet format | ||
- fileName: file path (i.e. linage column) | ||
- BFBinary: BF data for each file from [writeTo(java.io.OutputStream out) API](https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html#writeTo-java.io.OutputStream-) | ||
|
||
| fileName | BFBinary | | ||
|----------------------------------------|---------------------------------------------| | ||
| path/to/part-0001-xxxxx.parquet | 0xacdfacdfacdfacdfacdfacdf… | | ||
| path/to/part-0002-xxxxx.parquet | 0xabcdabcdabcdabcdabcda… | | ||
| path/to/part-0003-xxxxx.parquet | 0xfffffffffffffffffffffasdfffffffff… | | ||
|
||
##### Global BF | ||
- Single BF for all values in the indexed column | ||
- Stored as a separate file | ||
- Build | ||
- Option 1) can be generated by merging all file-level BF entries if possible | ||
- Option 2) multi-level BF based on file-level BF | ||
- Option 3) reading all values from all files => refresh also requires full scan | ||
|
||
### Index refreshment | ||
#### Full | ||
|
||
File-level BF | ||
|
||
1. get deleted file paths & appended file paths | ||
2. remove BF entries of deleted file paths in File-level index data | ||
- index data has only valid rows after removing invalidated entries | ||
3. construct a df with new BF entries for appended files | ||
4. merge(union) both and write as a new version | ||
|
||
Global BF | ||
|
||
1. Update global BF using the updated file-level BF entries | ||
- [mergeInPlace(BloomFilter other) API](https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html#mergeInPlace-org.apache.spark.util.sketch.BloomFilter-) | ||
if possible | ||
... TBD | ||
|
||
Refresh of global BF depends on the BF schema. | ||
|
||
#### Quick | ||
|
||
#### Appended files | ||
|
||
1. Construct a df with new BF entries for appended files | ||
2. write the df as “append” mode - stored as a separate file | ||
3. merge new BF entries into global BF (TBD) | ||
|
||
#### Deleted files | ||
|
||
1. keep the deleted file list in `excluded` | ||
2. globalBF is still valid | ||
- Update globalBF for better performance | ||
|
||
### New Rules | ||
|
||
#### PushDownBFIndexRule | ||
|
||
For each a relation including pushed down EQ filter | ||
1. get available BF indexes & pick the best | ||
2. apply index | ||
- Option 1) Optimizer(apply): run a simple spark job and get the list of files can be excluded | ||
- Option 2) physical operator: keep each BF as a separate file and check in FileScanRDD? | ||
- TBD) how can we _broadcast_ BF data? | ||
- Broadcast => might cause OOM in case of huge BF | ||
- Read from storage => remote storage throttling | ||
|
||
#### JoinBFIndexRule | ||
|
||
For each Equi-join | ||
1. get available BF indexes for the counterpart relation | ||
- ex) Table_A join Table_B on col1 => can apply global BF of Table_A.col1 to Table_B.col1 | ||
2. inject a BF filter plan before Shuffle or broadcast(?) | ||
- Using UDF – for functionality check | ||
- Using a newly defined physical operator – for better optimization | ||
- TBD) need to find an efficient way of broadcasting (& maintaining) global BF | ||
|
||
#### Hybrid Scan | ||
|
||
For appended files | ||
- File-level BF (PushDownBFIndexRule) | ||
- appended file paths are left in the relation file list | ||
- => no additional work | ||
- Global BF (JoinBFIndexRule) | ||
- appended files invalidate globalBF; cannot utilize the outdated globalBF without update | ||
- Option 1) quick refresh | ||
- Option 2) on-the-fly build new BF entries for appended files and merge them into globalBF | ||
- TBD | ||
|
||
For deleted files | ||
- File-level BF (PushDownBFIndexRule) | ||
- deleted file paths are not present in the relation file list | ||
- can just ignore the corresponding BF entries in BF index data | ||
- => no additional work | ||
- Global BF (JoinBFIndexRule) | ||
- globalBF is still valid | ||
=> no additional work | ||
- Otherwise | ||
- Option 1) quick refresh for better performance | ||
|
||
|
||
## Implementation | ||
|
||
[A description of the steps in the implementation, who will do them, and when.] | ||
|
||
> Note: If you want to use any images, please upload the .svg AND .png/.jpg file them to `/docs/design/img/` and link to them here. | ||
## Impact on Performance (if applicable) | ||
|
||
[A discussion of impact on performance and any corner cases that the author is aware of. If there is a negative impact on performance, please make sure | ||
to capture an issue in the next section. This section may be omitted if there are none.] | ||
|
||
## Open issues (if applicable) | ||
|
||
[A discussion of issues relating to this proposal for which the author does not | ||
know the solution. If you have already opened the corresponding issues, please link | ||
to them here. This section may be omitted if there are none.] | ||
|
||
- This is the first issue ([issue-link]()) | ||
- This is the second issue ([issue-link]()) | ||
- ... |