From 2eb669b5ccfab664ca6a71175c04f8b6adf37f66 Mon Sep 17 00:00:00 2001
From: sezruby <sezruby@gmail.com>
Date: Fri, 25 Sep 2020 00:17:19 +0900
Subject: [PATCH] Add draft design doc for Bloom Filter Index

---
 docs/design/161-bloom-filter-index.md | 196 ++++++++++++++++++++++++++
 1 file changed, 196 insertions(+)
 create mode 100644 docs/design/161-bloom-filter-index.md

diff --git a/docs/design/161-bloom-filter-index.md b/docs/design/161-bloom-filter-index.md
new file mode 100644
index 000000000..6877a70fe
--- /dev/null
+++ b/docs/design/161-bloom-filter-index.md
@@ -0,0 +1,196 @@
+# Proposal: Bloom Filter Index
+
+Discussion at https://github.com/microsoft/hyperspace/161
+
+## Abstract
+
+## Background
+
+TBD
+
+Introduction of Bloom Filter Index as a new type of index in Hyperspace.
+
+We could use [Spark's BloomFilter API](https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html)
+or other efficient implementations of Bloom Filter.
+
+
+## Proposal
+
+We could utilize the bloom filter in 2 main uses:
+1. File-level Bloom Filter, which can apply to a pushed down equal condition.
+2. Global Bloom Filter, which can apply to an Equi join plan.
+
+For File-level BF, we could build a BF on a column for each input file and check
+if each file might contain the value of the equal condition or certainly not.
+If it returns false, we can exclude the file path from the source file list
+without reading and filtering the file as BF guarantees that there is no such
+value in the file.
+
+For Global BF, we could build a bloom filter on a column for whole source data
+so that the BF can be utilized to check the existence of a value across all
+source files. This BF can be applied to the counterpart of an Equi-join.
+
+
+## Rationale
+
+[A discussion of alternate approaches and the trade offs, advantages, and disadvantages of the specified approach.]
+TBD
+
+## Compatibility
+
+[A discussion of the change with regard to the
+[compatibility guidelines](../../COMPATIBILITY.md).]
+TBD
+
+## Design
+
+[A description of the proposed design/algorithm. This should include a discussion of how the work fits 
+into [Hyperspace's roadmap](../ROADMAP.md).]
+
+### Index creation
+
+#### createIndex API Extension
+```
+hs.createIndex(df, BFIndexConfig(“indexName”, Seq("indexedColumnName1", "indexedColumnName2"), expectedNumItems, fpp))
+```
+  - new `BFIndexConfig`
+  - `"indexedColumnName"`, `expectedNumItems`, `fpp` => BloomFilterIndex constructor
+  - Here, we build multiple bloom filters(BF) for each indexed column, not just
+   1 index using the composite key with given indexed columns. It is because the
+   BF with single key column would be usable more widely. The multiple BF entries can
+   be handled by sub-directories in the root directory of index.
+  
+#### Metadata extension
+```
+case class BloomFilterIndex(properties: Seq[BloomFilterIndex.Properties]) {
+  val kind = "BloomFilterIndex"
+}
+object BloomFilterIndex {
+  case class Properties(indexColumn: String, expectedNumItems: Int, fpp: float, globalBFPath: String)}
+}
+```
+  - Seq[Properties] as we support multiple BF entries.
+  
+#### Index data schema
+
+##### File-level BF
+
+- In Parquet format
+  - fileName: file path (i.e. linage column)
+  - BFBinary: BF data for each file from [writeTo(java.io.OutputStream out) API](https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html#writeTo-java.io.OutputStream-)
+
+| fileName                               | BFBinary                                    |
+|----------------------------------------|---------------------------------------------|
+|     path/to/part-0001-xxxxx.parquet    |          0xacdfacdfacdfacdfacdfacdf…        |
+|     path/to/part-0002-xxxxx.parquet    |           0xabcdabcdabcdabcdabcda…          |
+|     path/to/part-0003-xxxxx.parquet    |     0xfffffffffffffffffffffasdfffffffff…    |
+
+##### Global BF
+- Single BF for all values in the indexed column
+- Stored as a separate file
+- Build
+  - Option 1) can be generated by merging all file-level BF entries if possible
+  - Option 2) multi-level BF based on file-level BF
+  - Option 3) reading all values from all files => refresh also requires full scan
+
+### Index refreshment
+#### Full
+
+File-level BF
+
+1. get deleted file paths & appended file paths
+2. remove BF entries of deleted file paths in File-level index data
+   - index data has only valid rows after removing invalidated entries
+3. construct a df with new BF entries for appended files
+4. merge(union) both and write as a new version
+
+Global BF
+
+1. Update global BF using the updated file-level BF entries
+   - [mergeInPlace(BloomFilter other) API](https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html#mergeInPlace-org.apache.spark.util.sketch.BloomFilter-)
+  if possible
+... TBD
+
+Refresh of global BF depends on the BF schema.
+
+#### Quick
+
+#### Appended files
+
+1. Construct a df with new BF entries for appended files
+2. write the df as “append” mode - stored as a separate file
+3. merge new BF entries into global BF (TBD)
+
+#### Deleted files
+
+1. keep the deleted file list in `excluded`
+2. globalBF is still valid
+   - Update globalBF for better performance 
+  
+### New Rules
+
+#### PushDownBFIndexRule
+
+For each a relation including pushed down EQ filter
+1. get available BF indexes & pick the best
+2. apply index
+  - Option 1) Optimizer(apply): run a simple spark job and get the list of files can be excluded
+  - Option 2) physical operator: keep each BF as a separate file and check in FileScanRDD?
+  - TBD) how can we _broadcast_ BF data?
+    - Broadcast => might cause OOM in case of huge BF
+    - Read from storage => remote storage throttling
+
+#### JoinBFIndexRule
+
+For each Equi-join
+1. get available BF indexes for the counterpart relation
+   - ex) Table_A join Table_B on col1 => can apply global BF of Table_A.col1 to Table_B.col1
+2. inject a BF filter plan before Shuffle or broadcast(?)
+   - Using UDF – for functionality check
+   - Using a newly defined physical operator – for better optimization
+   - TBD) need to find an efficient way of broadcasting (& maintaining) global BF
+  
+#### Hybrid Scan
+
+For appended files
+- File-level BF (PushDownBFIndexRule)
+  - appended file paths are left in the relation file list
+  - => no additional work
+- Global BF (JoinBFIndexRule)
+  - appended files invalidate globalBF; cannot utilize the outdated globalBF without update
+  - Option 1) quick refresh
+  - Option 2) on-the-fly build new BF entries for appended files and merge them into globalBF
+  - TBD
+
+For deleted files
+- File-level BF (PushDownBFIndexRule)
+  - deleted file paths are not present in the relation file list
+  - can just ignore the corresponding BF entries in BF index data
+  - => no additional work
+- Global BF (JoinBFIndexRule)
+  - globalBF is still valid
+    => no additional work
+  - Otherwise 
+    - Option 1) quick refresh for better performance
+
+
+## Implementation
+
+[A description of the steps in the implementation, who will do them, and when.]
+
+> Note: If you want to use any images, please upload the .svg AND .png/.jpg file them to `/docs/design/img/` and link to them here.
+
+## Impact on Performance (if applicable)
+
+[A discussion of impact on performance and any corner cases that the author is aware of. If there is a negative impact on performance, please make sure 
+to capture an issue in the next section. This section may be omitted if there are none.]
+
+## Open issues (if applicable)
+
+[A discussion of issues relating to this proposal for which the author does not
+know the solution. If you have already opened the corresponding issues, please link
+to them here. This section may be omitted if there are none.]
+
+  - This is the first issue ([issue-link]())
+  - This is the second issue ([issue-link]())
+  - ...
\ No newline at end of file