
[PROPOSAL]: Design Doc Bloom Filter #341

Closed
thugsatbay opened this issue Jan 27, 2021 · 1 comment
Labels: proposal, untriaged

Comments

thugsatbay (Contributor) commented Jan 27, 2021

Bloom Filter non-covering index for HyperSpace

Discussion of #161, Introduction to Bloom Filter.

A design doc proposing how we might go about implementing a Bloom filter in Hyperspace.

Describe the problem (Background)

Hyperspace currently supports only covering indexes over datasets. A covering index works well
when the user knows, or has a predefined set of, queries to run on the data. However, when the
user wants to query columns that are not widely used yet still leverage our indexing system,
maintaining a full-fledged covering index can be expensive. Likewise, the user's data may simply
be too big for a covering index to be worthwhile in storage terms. Hence, we propose the Bloom
filter: a non-covering index built on a space-efficient probabilistic data structure that computes
and stores a "contains" property, ultimately benefiting user queries by reducing the files scanned and thus scan time.
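
For background, a minimal sketch of that "contains" property using Spark's built-in sketch
library (org.apache.spark.util.sketch.BloomFilter); the values are illustrative only:

import org.apache.spark.util.sketch.BloomFilter

// Size the filter for ~1000 distinct items with a 3% false-positive rate.
val bf = BloomFilter.create(1000, 0.03)
bf.put("user_123")

bf.mightContain("user_123") // true: the item was inserted
bf.mightContain("user_999") // usually false; true only on a false positive

// A Bloom filter never yields a false negative, so a `false` answer
// lets us skip a data file with certainty.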

Proposal

In this design document, we propose an addition to the Hyperspace indexing system: a potential
first 'non-covering' index, together with covering and non-covering index config APIs that allow
users to build indexes on their datasets.
This doc goes in parallel with #342.

Design

1) Creating covering / non-covering index configs.

Initial config (shared by the Bloom filter config design and the covering index config design changes):

sealed trait IndexConfigBase {
  def indexName: String
}

trait CoveringIndexConfig extends IndexConfigBase {
  def includedColumns: Seq[String]
  def indexedColumns: Seq[String]
}

trait NonCoveringIndexConfig extends IndexConfigBase
Defining the Bloom filter config (the auxiliary constructors delegate to the private primary
constructor; -1 is a sentinel meaning "derive this value at index creation time"):

case class BloomIndexConfig private (
    indexName: String,
    indexedColumns: Seq[String],
    expectedNumItems: Long,
    fpp: Double,
    numBits: Long
) extends NonCoveringIndexConfig {

  // fpp and numBits are derived from expectedNumItems.
  def this(indexName: String, indexedColumns: Seq[String], expectedNumItems: Long) =
    this(indexName, indexedColumns, expectedNumItems, -1d, -1L)

  // fpp is derived from expectedNumItems and numBits.
  def this(
      indexName: String,
      indexedColumns: Seq[String],
      expectedNumItems: Long,
      numBits: Long) =
    this(indexName, indexedColumns, expectedNumItems, -1d, numBits)

  // numBits is derived from expectedNumItems and fpp.
  def this(
      indexName: String,
      indexedColumns: Seq[String],
      expectedNumItems: Long,
      fpp: Double) =
    this(indexName, indexedColumns, expectedNumItems, fpp, -1L)
}

Alternatively, we could substitute these auxiliary constructors with a three-builder design, sketched below.
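
A hedged sketch of what one such builder could look like; the builder method names here are
hypothetical, not an agreed API:

// Hypothetical fluent builder, one of the three proposed variants.
val config = BloomIndexConfig.builder()
  .indexName("indexName")
  .indexedColumns(Seq("colA"))
  .expectedNumItems(1000L)
  .fpp(0.2)
  .build()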

The covering index config remains unchanged:

final case class IndexConfig(
    indexName: String,
    indexedColumns: Seq[String],
    includedColumns: Seq[String] = Seq()
) extends CoveringIndexConfig

By letting IndexConfig remain the same, we keep backward compatibility with older scripts.

Additional methods:

// Returns the probability that this BloomFilter erroneously
// reports true for an element that was never put into it.
def expectedFpp(): Double

// TODO - proposed
def addAllIndexedColumns(columnName: String*): IndexConfig
def removeAllIndexedColumns(columnName: String*): IndexConfig
def addAllIncludedColumns(columnName: String*): IndexConfig
def removeAllIncludedColumns(columnName: String*): IndexConfig
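
For reference, the standard Bloom filter sizing math that expectedFpp() and the derived numBits
default could rest on; these helper functions are a sketch for illustration, not part of the
proposed API:

// Optimal number of bits m for n expected items at false-positive rate p:
//   m = -n * ln(p) / (ln 2)^2
def optimalNumBits(expectedNumItems: Long, fpp: Double): Long =
  math.ceil(-expectedNumItems * math.log(fpp) / (math.log(2) * math.log(2))).toLong

// Optimal number of hash functions k for m bits and n items: k = (m / n) * ln 2
def optimalNumHashFunctions(expectedNumItems: Long, numBits: Long): Int =
  math.max(1, math.round(numBits.toDouble / expectedNumItems * math.log(2)).toInt)
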
Creating the index. There are three different ways we could create the Bloom filter; if the
builder configuration is used, the declaration of BloomIndexConfig will change but will look
more or less the same.

hs.createIndex(df, new BloomIndexConfig("indexName", Seq("indexedColumn"), 1000))
hs.createIndex(df, new BloomIndexConfig("indexName", Seq("indexedColumn"), 1000, 128L))
hs.createIndex(df, new BloomIndexConfig("indexName", Seq("indexedColumn"), 1000, 0.2))
hs.createIndex(df,
  IndexConfig("indexName", Seq("indexedColumn"), Seq("includedColumn")))

Other ops on the index should need no change in signature, though the behavior may end up different for refresh and optimize.

2) Index Refresh & Optimize

How the merge BloomFilter op works:
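
A minimal sketch of the merge, assuming we standardize on Spark's
org.apache.spark.util.sketch.BloomFilter; mergeInPlace bitwise-ORs two filters and throws if
they were built with incompatible parameters:

import org.apache.spark.util.sketch.BloomFilter

// Fold the per-file filters into the global filter. All filters must share
// the same expectedNumItems/fpp parameters to be mergeable.
def mergeIntoGlobal(global: BloomFilter, perFile: Seq[BloomFilter]): BloomFilter =
  perFile.foldLeft(global)((acc, bf) => acc.mergeInPlace(bf))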

Index Refresh

Quick:
- Only find files that have been deleted and invalidate their BF data.

Incremental:
- Appended files:
  - Construct a df with new BF entries for the appended files.
  - Write the df in "append" mode, wherever we end up storing the BF information.
  - Merge the new BF entries into the global BF.
- Deleted files:
  - Keep the deleted-file list in `excluded`.
  - The global BF is still valid; update the global BF for better performance?

Full:
- File level:
  - Get the deleted file paths and the appended file paths.
  - Remove the BF entries of deleted file paths from the file-level index data.
  - The index data then holds only valid rows after removing the invalidated entries.
  - Construct a df with new BF entries for the appended files.
- Global level:
  - Update the global BF using the updated file-level BF entries.
  - Merge in place or union; TODO: explore.

Index Optimize
This op should be a no-op, since we need to maintain a BF for each file.


3) Creating covering / non-covering indexes inside Hyperspace, and refactoring to support non-covering indexes.

There will be code changes inside CreateActionBase and IndexLogEntry that should span many small
refactorings across the codebase. TODO: needs experimentation with the current JSON parser.

Base trait (shared by the covering and non-covering index):

sealed trait HyperSpaceIndex {
  def kind: String
  def kindAbbr: String
}

Definitions:

case class CoveringIndex(
    kind: String = "Covering",
    kindAbbr: String = "CI",
    properties: CoveringIndex.Properties
) extends HyperSpaceIndex

case class BloomIndex(
    kind: String = "NonCovering",
    kindAbbr: String = "BFNC",
    properties: BloomIndex.Properties
) extends HyperSpaceIndex
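
A sketch of how the kind discriminator could drive dispatch once an IndexLogEntry is parsed back
from JSON; this is an assumption about the eventual refactoring, pending the JSON-parser
experimentation noted above:

// Hypothetical dispatch on the persisted discriminator field.
def describe(index: HyperSpaceIndex): String = index.kind match {
  case "Covering"    => s"covering index (${index.kindAbbr})"     // existing rules apply
  case "NonCovering" => s"non-covering index (${index.kindAbbr})" // Bloom filter rules apply
  case other         => throw new IllegalArgumentException(s"Unknown index kind: $other")
}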

4) Storing the BF binary data and the BF metadata file.

Store it either in a parquet file per user data file, or directly in the index file / config.

FileName                        | BloomFilterBinaryData       | Updated
path/to/part-0001-xxxxx.parquet | 0xacdfacdfacdfacdfacdfacdf… | 1
path/to/part-0002-xxxxx.parquet | 0xabcdabcdabcdabcdabcda…    | 0
path/to/part-0003-xxxxx.parquet | 0xabcdabasdadcdabcdabdda…   | 1

The Updated column tells whether the BF data can still be used or is stale.
Should the global BF, covering all data in the indexed column, be stored as a separate file or in the IndexLogEntry file?
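
A hedged sketch of how these rows could prune the file list at query time. BloomFilter.readFrom
and mightContain are real Spark sketch APIs, but the metadata path, the binary-to-bytes storage,
and the staleness handling are assumptions:

import java.io.ByteArrayInputStream
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

// Keep only files whose BF might contain the filter value. Rows with
// Updated == 0 carry stale BFs, so those files are conservatively kept.
def pruneFiles(spark: SparkSession, metadataPath: String, value: Any): Seq[String] = {
  import spark.implicits._
  spark.read.parquet(metadataPath)
    .select("FileName", "BloomFilterBinaryData", "Updated")
    .as[(String, Array[Byte], Int)]
    .collect()
    .collect {
      case (file, _, 0) => file // stale BF: must scan the file anyway
      case (file, bytes, _)
          if BloomFilter.readFrom(new ByteArrayInputStream(bytes)).mightContain(value) =>
        file
    }
    .toSeq
}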


5) Rules

Detailed discussion in #342.

Proposed:

trait HyperSpaceRule {
  def ruleName: String
  def diffIndexSupportedByRule(): Seq[HyperSpaceIndex]
  def forceIndexOnPlan(index: HyperSpaceIndex): Unit
  def canUseRule(plan: LogicalPlan): Boolean
}

[TODO] Are we allowed to compare two different index types via HyperSpaceIndex#kindAbbr, or would we always prefer one over the other?
[TODO] Each IndexLogEntry should now have equals and compareTo methods, which may or may not be used in the ranker; they define how each index compares to other indexes based on certain heuristics or rules.
[TODO] Can the comparison methods/routines of an index change based on the rule?

  • All answers are provided through the construction of rankers for the indexes and the rules.
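
A sketch of what such a ranker could look like; the ordering heuristic (prefer covering over
non-covering) is purely illustrative, not a settled rule:

// Hypothetical ranker ordering candidate indexes for a rule.
object IndexRanker extends Ordering[HyperSpaceIndex] {
  private def score(index: HyperSpaceIndex): Int = index.kind match {
    case "Covering"    => 2 // a covering index can answer the query alone
    case "NonCovering" => 1 // a BF only prunes files to scan
    case _             => 0
  }
  def compare(x: HyperSpaceIndex, y: HyperSpaceIndex): Int = score(x) - score(y)
}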

Implementation

  • Refactor the index config to introduce non-covering indexes (1 week) - #357
  • Support new index configs for the Bloom filter (1.5 weeks) - #357
    • Add a new case class for the non-covering index
  • Modify IndexLogEntry to support the design changes, including the non-covering index (1 week)
  • Create the index using the config (2 weeks)
  • Store the newly created index config info in a usable file format (1 week)
  • Define a ranker for this index (3 weeks)
  • Plug the index into the FilterIndexRule (2 weeks)
  • Plug the index into the JoinIndexRule - separate design and time requirement
    • The join rule is complex and can be divided as follows:
      • When the BF moves to the right side, which files does it allow us to scan?
      • Can the BF be used for each join key to decide, per reducer, how the scan/shuffle takes place?

Impact on Performance (if applicable)

[A discussion of impact on performance and any corner cases that the author is aware of.

If there is a negative impact on performance, please make sure
to capture an issue in the next section. This section may be omitted if there are none.]

  • Mostly we need to run a small Spark job over the Bloom filter information to figure out which
    data files are worth scanning. TODO: detailed analysis required.

Open issues (if applicable)

[A discussion of issues relating to this proposal for which the author does not
know the solution. If you have already opened the corresponding issues, please link
to them here. This section may be omitted if there are none.]

  • Understanding how the plan will look for the join operation; working on it. TODO(thugsatbay)

clee704 commented Jun 22, 2021

Closing old issues. Further discussions can continue in #441 and #405.

@clee704 clee704 closed this as completed Jun 22, 2021