From 2eb669b5ccfab664ca6a71175c04f8b6adf37f66 Mon Sep 17 00:00:00 2001
From: sezruby
Date: Fri, 25 Sep 2020 00:17:19 +0900
Subject: [PATCH] Add draft design doc for Bloom Filter Index

---
 docs/design/161-bloom-filter-index.md | 196 ++++++++++++++++++++++++++
 1 file changed, 196 insertions(+)
 create mode 100644 docs/design/161-bloom-filter-index.md

diff --git a/docs/design/161-bloom-filter-index.md b/docs/design/161-bloom-filter-index.md
new file mode 100644
index 000000000..6877a70fe
--- /dev/null
+++ b/docs/design/161-bloom-filter-index.md
@@ -0,0 +1,196 @@
+# Proposal: Bloom Filter Index
+
+Discussion at https://github.com/microsoft/hyperspace/161
+
+## Abstract
+
+## Background
+
+TBD
+
+Introduction of Bloom Filter Index as a new type of index in Hyperspace.
+
+We could use [Spark's BloomFilter API](https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html)
+or other efficient implementations of Bloom Filter.
+
+
+## Proposal
+
+We could use the bloom filter in two main ways:
+1. File-level Bloom Filter, which can be applied to a pushed-down equality condition.
+2. Global Bloom Filter, which can be applied to an Equi-join plan.
+
+For File-level BF, we could build a BF on a column for each input file and check
+whether each file might contain the value of the equality condition or certainly does not.
+If the check returns false, we can exclude the file path from the source file list
+without reading and filtering the file, as a BF guarantees that there is no such
+value in the file.
+
+For Global BF, we could build a bloom filter on a column for the whole source data
+so that the BF can be used to check the existence of a value across all
+source files. This BF can be applied to the counterpart of an Equi-join.
+
+
+## Rationale
+
+[A discussion of alternate approaches and the trade offs, advantages, and disadvantages of the specified approach.]
+TBD
+
+## Compatibility
+
+[A discussion of the change with regard to the
+[compatibility guidelines](../../COMPATIBILITY.md).]
+TBD
+
+## Design
+
+[A description of the proposed design/algorithm. This should include a discussion of how the work fits
+into [Hyperspace's roadmap](../ROADMAP.md).]
+
+### Index creation
+
+#### createIndex API Extension
+```
+hs.createIndex(df, BFIndexConfig("indexName", Seq("indexedColumnName1", "indexedColumnName2"), expectedNumItems, fpp))
+```
+ - new `BFIndexConfig`
+ - `"indexedColumnName"`, `expectedNumItems`, `fpp` => BloomFilterIndex constructor
+ - Here, we build a separate bloom filter (BF) for each indexed column, not just
+   one index using the composite key of the given indexed columns, because a
+   BF on a single key column is usable more widely. The multiple BF entries can
+   be handled by sub-directories in the root directory of the index.
+
+#### Metadata extension
+```
+case class BloomFilterIndex(properties: Seq[BloomFilterIndex.Properties]) {
+  val kind = "BloomFilterIndex"
+}
+object BloomFilterIndex {
+  case class Properties(indexColumn: String, expectedNumItems: Int, fpp: Double, globalBFPath: String)
+}
+```
+ - Seq[Properties] as we support multiple BF entries.
+
+#### Index data schema
+
+##### File-level BF
+
+- In Parquet format
+  - fileName: file path (i.e. lineage column)
+  - BFBinary: BF data for each file from [writeTo(java.io.OutputStream out) API](https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html#writeTo-java.io.OutputStream-)
+
+| fileName                        | BFBinary                             |
+|---------------------------------|--------------------------------------|
+| path/to/part-0001-xxxxx.parquet | 0xacdfacdfacdfacdfacdfacdf…          |
+| path/to/part-0002-xxxxx.parquet | 0xabcdabcdabcdabcdabcda…             |
+| path/to/part-0003-xxxxx.parquet | 0xfffffffffffffffffffffasdfffffffff… |
+
+##### Global BF
+- Single BF for all values in the indexed column
+- Stored as a separate file
+- Build
+  - Option 1) can be generated by merging all file-level BF entries if possible
+  - Option 2) multi-level BF based on file-level BF
+  - Option 3) reading all values from all files => refresh also requires full scan
+
+### Index refresh
+#### Full
+
+File-level BF
+
+1. get deleted file paths & appended file paths
+2. remove BF entries of deleted file paths from the File-level index data
+   - index data has only valid rows after removing invalidated entries
+3. construct a df with new BF entries for appended files
+4. merge (union) both and write as a new version
+
+Global BF
+
+1. Update global BF using the updated file-level BF entries
+   - [mergeInPlace(BloomFilter other) API](https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html#mergeInPlace-org.apache.spark.util.sketch.BloomFilter-)
+   if possible
+... TBD
+
+Refresh of global BF depends on the BF schema.
+
+#### Quick
+
+##### Appended files
+
+1. Construct a df with new BF entries for appended files
+2. write the df in "append" mode - stored as a separate file
+3. merge new BF entries into global BF (TBD)
+
+##### Deleted files
+
+1. keep the deleted file list in `excluded`
+2. globalBF is still valid (stale entries can only add false positives, never false negatives)
+   - Update globalBF for better performance
+
+### New Rules
+
+#### PushDownBFIndexRule
+
+For each relation that includes a pushed-down EQ filter
+1. get available BF indexes & pick the best
+2. apply index
+   - Option 1) Optimizer(apply): run a simple spark job and get the list of files that can be excluded
+   - Option 2) physical operator: keep each BF as a separate file and check in FileScanRDD?
+   - TBD) how can we _broadcast_ BF data?
+     - Broadcast => might cause OOM in case of huge BF
+     - Read from storage => remote storage throttling
+
+#### JoinBFIndexRule
+
+For each Equi-join
+1. get available BF indexes for the counterpart relation
+   - ex) Table_A join Table_B on col1 => can apply global BF of Table_A.col1 to Table_B.col1
+2. inject a BF filter plan before Shuffle or broadcast(?)
+   - Using UDF – for functionality check
+   - Using a newly defined physical operator – for better optimization
+   - TBD) need to find an efficient way of broadcasting (& maintaining) global BF
+
+#### Hybrid Scan
+
+For appended files
+- File-level BF (PushDownBFIndexRule)
+  - appended file paths are left in the relation file list
+  - => no additional work
+- Global BF (JoinBFIndexRule)
+  - appended files invalidate globalBF; the outdated globalBF cannot be used without an update
+    - Option 1) quick refresh
+    - Option 2) build new BF entries for appended files on the fly and merge them into globalBF
+    - TBD
+
+For deleted files
+- File-level BF (PushDownBFIndexRule)
+  - deleted file paths are not present in the relation file list
+  - can just ignore the corresponding BF entries in BF index data
+  - => no additional work
+- Global BF (JoinBFIndexRule)
+  - globalBF is still valid (stale entries can only add false positives)
+  - => no additional work
+  - Otherwise
+    - Option 1) quick refresh for better performance
+
+
+## Implementation
+
+[A description of the steps in the implementation, who will do them, and when.]
+
+> Note: If you want to use any images, please upload the .svg AND .png/.jpg files to `/docs/design/img/` and link to them here.
+
+## Impact on Performance (if applicable)
+
+[A discussion of impact on performance and any corner cases that the author is aware of. If there is a negative impact on performance, please make sure
+to capture an issue in the next section. This section may be omitted if there are none.]
+
+## Open issues (if applicable)
+
+[A discussion of issues relating to this proposal for which the author does not
+know the solution. If you have already opened the corresponding issues, please link
+to them here. This section may be omitted if there are none.]
+
+ - This is the first issue ([issue-link]())
+ - This is the second issue ([issue-link]())
+ - ...
\ No newline at end of file
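
A minimal sketch of the file-level pruning and the "Option 1" global BF build described in the design doc above, using Spark's `org.apache.spark.util.sketch.BloomFilter`. It assumes the `spark-sketch` artifact is on the classpath. The names `BloomFilterIndexSketch`, `FileBF`, `buildFileBF`, `candidateFiles`, and `buildGlobalBF` are hypothetical and not part of Hyperspace, and index rows are held in memory here rather than written to Parquet.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.util.sketch.BloomFilter

object BloomFilterIndexSketch {
  // Hypothetical row type mirroring the proposed (fileName, BFBinary) schema.
  final case class FileBF(fileName: String, bfBinary: Array[Byte])

  // Build one BF over a file's column values and serialize it with writeTo.
  def buildFileBF(fileName: String, values: Seq[String],
                  expectedNumItems: Long, fpp: Double): FileBF = {
    val bf = BloomFilter.create(expectedNumItems, fpp)
    values.foreach(v => bf.putString(v))
    val out = new ByteArrayOutputStream()
    bf.writeTo(out)
    FileBF(fileName, out.toByteArray)
  }

  private def deserialize(entry: FileBF): BloomFilter =
    BloomFilter.readFrom(new ByteArrayInputStream(entry.bfBinary))

  // Keep only files whose BF says the literal might be present.
  // A file that really contains the literal is never dropped (no false negatives).
  def candidateFiles(index: Seq[FileBF], literal: String): Seq[String] =
    index.filter(e => deserialize(e).mightContainString(literal)).map(_.fileName)

  // Option 1: merge all file-level BFs into one global BF.
  // mergeInPlace only accepts compatible filters, so every file-level BF is
  // assumed to have been created with the same expectedNumItems and fpp.
  def buildGlobalBF(index: Seq[FileBF], expectedNumItems: Long, fpp: Double): BloomFilter = {
    val global = BloomFilter.create(expectedNumItems, fpp)
    index.foreach(e => global.mergeInPlace(deserialize(e)))
    global
  }

  def main(args: Array[String]): Unit = {
    val index = Seq(
      buildFileBF("path/to/part-0001.parquet", Seq("a", "b"), 100L, 0.01),
      buildFileBF("path/to/part-0002.parquet", Seq("c"), 100L, 0.01))
    // The file holding "c" must survive pruning; other files may or may not.
    assert(candidateFiles(index, "c").contains("path/to/part-0002.parquet"))
    val global = buildGlobalBF(index, 100L, 0.01)
    assert(Seq("a", "b", "c").forall(v => global.mightContainString(v)))
  }
}
```

Because a merged filter keeps the bit size of its inputs, the effective false-positive rate of the global BF rises once the total number of distinct values exceeds `expectedNumItems`; sizing the file-level filters for the whole dataset is one way to account for that, at the cost of larger `BFBinary` entries.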