
[PROPOSAL]: Design Doc Bloom Filter #341

Closed
thugsatbay opened this issue Jan 27, 2021 · 1 comment
Labels: proposal, untriaged

Comments

thugsatbay (Contributor) commented Jan 27, 2021

Bloom Filter non-covering index for HyperSpace

Discussion of #161, Introduction to Bloom Filter.

A design doc proposing how we might go about implementing a Bloom filter in Hyperspace.

Describe the problem (Background)

Hyperspace currently supports only covering indexes over datasets. A covering index works well
when the user knows, or has a predefined set of, queries to run on the data. However, when the
user wants to query columns that are not widely used yet still leverage our indexing system,
maintaining a full-fledged covering index can be expensive. Likewise, the user's data may simply
be too big for a covering index to be worthwhile in storage terms. Hence, we propose the Bloom
filter: a non-covering index built on a space-efficient probabilistic data structure that computes
and stores a "contains" property, ultimately benefiting user queries by reducing the files scanned and thus scan time.
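
For background, a minimal sketch of that "contains" property using Spark's built-in sketch
library (org.apache.spark.util.sketch.BloomFilter); the values are illustrative only:

import org.apache.spark.util.sketch.BloomFilter

// Size the filter for ~1000 distinct items with a 3% false-positive rate.
val bf = BloomFilter.create(1000, 0.03)
bf.put("user_123")

bf.mightContain("user_123") // true: the item was inserted
bf.mightContain("user_999") // usually false; true only on a false positive

// A Bloom filter never yields a false negative, so a `false` answer
// lets us skip a data file with certainty.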

Proposal

In this design document, we propose an addition to the Hyperspace indexing system: a potential
first 'non-covering' index, together with covering and non-covering index config APIs that allow
users to build indexes on their datasets.
This doc goes in parallel with #342.

Design

1) Creating covering / non-covering index configs.

Initial config (shared by the Bloom filter config design and the covering index config design changes):

sealed trait IndexConfigBase {
  def indexName: String
}

trait CoveringIndexConfig extends IndexConfigBase {
  def includedColumns: Seq[String]
  def indexedColumns: Seq[String]
}

trait NonCoveringIndexConfig extends IndexConfigBase
Defining the Bloom filter config (the auxiliary constructors delegate to the private primary
constructor; -1 is a sentinel meaning "derive this value at index creation time"):

case class BloomIndexConfig private (
    indexName: String,
    indexedColumns: Seq[String],
    expectedNumItems: Long,
    fpp: Double,
    numBits: Long
) extends NonCoveringIndexConfig {

  // fpp and numBits are derived from expectedNumItems.
  def this(indexName: String, indexedColumns: Seq[String], expectedNumItems: Long) =
    this(indexName, indexedColumns, expectedNumItems, -1d, -1L)

  // fpp is derived from expectedNumItems and numBits.
  def this(
      indexName: String,
      indexedColumns: Seq[String],
      expectedNumItems: Long,
      numBits: Long) =
    this(indexName, indexedColumns, expectedNumItems, -1d, numBits)

  // numBits is derived from expectedNumItems and fpp.
  def this(
      indexName: String,
      indexedColumns: Seq[String],
      expectedNumItems: Long,
      fpp: Double) =
    this(indexName, indexedColumns, expectedNumItems, fpp, -1L)
}

Alternatively, we could substitute these auxiliary constructors with a three-builder design, sketched below.
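
A hedged sketch of what one such builder could look like; the builder method names here are
hypothetical, not an agreed API:

// Hypothetical fluent builder, one of the three proposed variants.
val config = BloomIndexConfig.builder()
  .indexName("indexName")
  .indexedColumns(Seq("colA"))
  .expectedNumItems(1000L)
  .fpp(0.2)
  .build()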

The covering index config remains unchanged:

final case class IndexConfig(
    indexName: String,
    indexedColumns: Seq[String],
    includedColumns: Seq[String] = Seq()
) extends CoveringIndexConfig

By letting IndexConfig remain the same, we keep backward compatibility with older scripts.

Additional methods:

// Returns the probability that this BloomFilter erroneously
// reports true for an element that was never put into it.
def expectedFpp(): Double

// TODO - proposed
def addAllIndexedColumns(columnName: String*): IndexConfig
def removeAllIndexedColumns(columnName: String*): IndexConfig
def addAllIncludedColumns(columnName: String*): IndexConfig
def removeAllIncludedColumns(columnName: String*): IndexConfig
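
For reference, the standard Bloom filter sizing math that expectedFpp() and the derived numBits
default could rest on; these helper functions are a sketch for illustration, not part of the
proposed API:

// Optimal number of bits m for n expected items at false-positive rate p:
//   m = -n * ln(p) / (ln 2)^2
def optimalNumBits(expectedNumItems: Long, fpp: Double): Long =
  math.ceil(-expectedNumItems * math.log(fpp) / (math.log(2) * math.log(2))).toLong

// Optimal number of hash functions k for m bits and n items: k = (m / n) * ln 2
def optimalNumHashFunctions(expectedNumItems: Long, numBits: Long): Int =
  math.max(1, math.round(numBits.toDouble / expectedNumItems * math.log(2)).toInt)
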
Creating the index. There are three different ways we could create the Bloom filter; if the
builder configuration is used, the declaration of BloomIndexConfig will change but will look
more or less the same.

hs.createIndex(df, new BloomIndexConfig("indexName", Seq("indexedColumn"), 1000))
hs.createIndex(df, new BloomIndexConfig("indexName", Seq("indexedColumn"), 1000, 128L))
hs.createIndex(df, new BloomIndexConfig("indexName", Seq("indexedColumn"), 1000, 0.2))
hs.createIndex(df,
  IndexConfig("indexName", Seq("indexedColumn"), Seq("includedColumn")))

Other ops on the index should need no change in signature, though the behavior may end up different for refresh and optimize.

2) Index Refresh & Optimize

How the merge BloomFilter op works:
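
A minimal sketch of the merge, assuming we standardize on Spark's
org.apache.spark.util.sketch.BloomFilter; mergeInPlace bitwise-ORs two filters and throws if
they were built with incompatible parameters:

import org.apache.spark.util.sketch.BloomFilter

// Fold the per-file filters into the global filter. All filters must share
// the same expectedNumItems/fpp parameters to be mergeable.
def mergeIntoGlobal(global: BloomFilter, perFile: Seq[BloomFilter]): BloomFilter =
  perFile.foldLeft(global)((acc, bf) => acc.mergeInPlace(bf))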

Index Refresh

Quick:
- Only find files that have been deleted and invalidate their BF data.

Incremental:
- Appended files:
  - Construct a df with new BF entries for the appended files.
  - Write the df in "append" mode, wherever we end up storing the BF information.
  - Merge the new BF entries into the global BF.
- Deleted files:
  - Keep the deleted-file list in `excluded`.
  - The global BF is still valid; update the global BF for better performance?

Full:
- File level:
  - Get the deleted file paths and the appended file paths.
  - Remove the BF entries of deleted file paths from the file-level index data.
  - The index data then holds only valid rows after removing the invalidated entries.
  - Construct a df with new BF entries for the appended files.
- Global level:
  - Update the global BF using the updated file-level BF entries.
  - Merge in place or union; TODO: explore.

Index Optimize
This op should be a no-op, since we need to maintain a BF for each file.


3) Creating covering / non-covering indexes inside Hyperspace, and refactoring to support non-covering indexes.

There will be code changes inside CreateActionBase and IndexLogEntry that should span many small
refactorings across the codebase. TODO: needs experimentation with the current JSON parser.

Base trait (shared by the covering and non-covering index):

sealed trait HyperSpaceIndex {
  def kind: String
  def kindAbbr: String
}

Definitions:

case class CoveringIndex(
    kind: String = "Covering",
    kindAbbr: String = "CI",
    properties: CoveringIndex.Properties
) extends HyperSpaceIndex

case class BloomIndex(
    kind: String = "NonCovering",
    kindAbbr: String = "BFNC",
    properties: BloomIndex.Properties
) extends HyperSpaceIndex
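
A sketch of how the kind discriminator could drive dispatch once an IndexLogEntry is parsed back
from JSON; this is an assumption about the eventual refactoring, pending the JSON-parser
experimentation noted above:

// Hypothetical dispatch on the persisted discriminator field.
def describe(index: HyperSpaceIndex): String = index.kind match {
  case "Covering"    => s"covering index (${index.kindAbbr})"     // existing rules apply
  case "NonCovering" => s"non-covering index (${index.kindAbbr})" // Bloom filter rules apply
  case other         => throw new IllegalArgumentException(s"Unknown index kind: $other")
}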

4) Storing the BF binary data and the BF metadata file.

Store it either in a parquet file per user data file, or directly in the index file / config.

FileName                        | BloomFilterBinaryData       | Updated
path/to/part-0001-xxxxx.parquet | 0xacdfacdfacdfacdfacdfacdf… | 1
path/to/part-0002-xxxxx.parquet | 0xabcdabcdabcdabcdabcda…    | 0
path/to/part-0003-xxxxx.parquet | 0xabcdabasdadcdabcdabdda…   | 1

The Updated column tells whether the BF data can still be used or is stale.
Should the global BF, covering all data in the indexed column, be stored as a separate file or in the IndexLogEntry file?
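
A hedged sketch of how these rows could prune the file list at query time. BloomFilter.readFrom
and mightContain are real Spark sketch APIs, but the metadata path, the binary-to-bytes storage,
and the staleness handling are assumptions:

import java.io.ByteArrayInputStream
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

// Keep only files whose BF might contain the filter value. Rows with
// Updated == 0 carry stale BFs, so those files are conservatively kept.
def pruneFiles(spark: SparkSession, metadataPath: String, value: Any): Seq[String] = {
  import spark.implicits._
  spark.read.parquet(metadataPath)
    .select("FileName", "BloomFilterBinaryData", "Updated")
    .as[(String, Array[Byte], Int)]
    .collect()
    .collect {
      case (file, _, 0) => file // stale BF: must scan the file anyway
      case (file, bytes, _)
          if BloomFilter.readFrom(new ByteArrayInputStream(bytes)).mightContain(value) =>
        file
    }
    .toSeq
}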


5) Rules

Detailed discussion in #342.

Proposed:

trait HyperSpaceRule {
  def ruleName: String
  def diffIndexSupportedByRule(): Seq[HyperSpaceIndex]
  def forceIndexOnPlan(index: HyperSpaceIndex): Unit
  def canUseRule(plan: LogicalPlan): Boolean
}

[TODO] Are we allowed to compare two different index types via HyperSpaceIndex#kindAbbr, or would we always prefer one over the other?
[TODO] Each IndexLogEntry should now have equals and compareTo methods, which may or may not be used in the ranker; they define how each index compares to other indexes based on certain heuristics or rules.
[TODO] Can the comparison methods/routines of an index change based on the rule?

  • All answers are provided through the construction of rankers for the indexes and the rules.
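
A sketch of what such a ranker could look like; the ordering heuristic (prefer covering over
non-covering) is purely illustrative, not a settled rule:

// Hypothetical ranker ordering candidate indexes for a rule.
object IndexRanker extends Ordering[HyperSpaceIndex] {
  private def score(index: HyperSpaceIndex): Int = index.kind match {
    case "Covering"    => 2 // a covering index can answer the query alone
    case "NonCovering" => 1 // a BF only prunes files to scan
    case _             => 0
  }
  def compare(x: HyperSpaceIndex, y: HyperSpaceIndex): Int = score(x) - score(y)
}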

Implementation

  • Refactor the index config to introduce non-covering indexes (1 week) - #357
  • Support new index configs for the Bloom filter (1.5 weeks) - #357
    • Add a new case class for the non-covering index
  • Modify IndexLogEntry to support the design changes, including the non-covering index (1 week)
  • Create the index using the config (2 weeks)
  • Store the newly created index config info in a usable file format (1 week)
  • Define a ranker for this index (3 weeks)
  • Plug the index into the FilterIndexRule (2 weeks)
  • Plug the index into the JoinIndexRule - separate design and time requirement
    • The join rule is complex and can be divided as follows:
      • When the BF moves to the right side, which files does it allow us to scan?
      • Can the BF be used for each join key to decide, per reducer, how the scan/shuffle takes place?

Impact on Performance (if applicable)

[A discussion of impact on performance and any corner cases that the author is aware of.

If there is a negative impact on performance, please make sure
to capture an issue in the next section. This section may be omitted if there are none.]

  • Mostly we need to run a small Spark job over the Bloom filter information to figure out which
    data files are worth scanning. TODO: detailed analysis required.

Open issues (if applicable)

[A discussion of issues relating to this proposal for which the author does not
know the solution. If you have already opened the corresponding issues, please link
to them here. This section may be omitted if there are none.]

  • Understanding how the plan will look for the join operation; working on it. TODO(thugsatbay)

clee704 commented Jun 22, 2021

Closing old issues. Further discussions can continue in #441 and #405.

@clee704 clee704 closed this as completed Jun 22, 2021