
[PROPOSAL]: Introduce BloomFilter index #161

Closed

sezruby opened this issue Sep 14, 2020 · 1 comment
Labels: proposal, untriaged

Comments

@sezruby
Collaborator

sezruby commented Sep 14, 2020

This is WIP, but please feel free to add any comment on this 😃

Support a BloomFilter index utilizing Spark's built-in BloomFilter implementation (https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html)
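
For reference, a minimal sketch (not Hyperspace code) of the Spark BloomFilter API this proposal would rely on; the values below are placeholders:

import java.io.ByteArrayOutputStream
import org.apache.spark.util.sketch.BloomFilter

// Build a filter sized for an expected number of items and a target false positive probability.
val bf = BloomFilter.create(1000000L, 0.03)
bf.putString("some_value")                          // insert an indexed-column value
val maybe = bf.mightContainString("some_value")     // true; no false negatives, only false positives

// Serialize for storage as index content; a compatible filter can be merged in.
val out = new ByteArrayOutputStream()
bf.writeTo(out)
val serialized = out.toByteArray
// bf.mergeInPlace(otherBf)                         // requires filters created with the same parameters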

This is the second type of index in Hyperspace, so it may require additional work across the codebase to support different index types. For example, getCandidateIndex needs to be fixed so that the current FilterIndexRule and JoinIndexRule use only covering indexes.

How can we derive benefit from the index

We could use BloomFilter in 2 ways:

  • Bloom Filter for each file
    • where: pushed-down "equal" (equality) filter condition
    • how: reducing the number of files to read
  • Bloom Filter for Join
    • where: Join plan
    • how: reducing the number of rows on the counterpart side of the join

Is the overhead during query time reasonable?

TODO

How can we create and refresh the index (immutable dataset)

API extension

  • IndexConfig for BF index => BFIndexConfig
hs.createIndex(df, BFIndexConfig("indexed_column", expectedNumItems, fpp))
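
A possible shape for BFIndexConfig; the field names and defaults here are assumptions, not a final API:

// Hypothetical config for the BloomFilter index type.
case class BFIndexConfig(
    indexedColumn: String,
    expectedNumItems: Long = 1000000L,  // sizing hint for BloomFilter.create
    fpp: Double = 0.03)                 // target false positive probability

// hs.createIndex(df, BFIndexConfig("indexed_column", expectedNumItems = 1000000L, fpp = 0.03))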

Metadata change

case class BloomFilterIndex(properties: BloomFilterIndex.Properties) {
  val kind = "BloomFilterIndex"
}
object BloomFilterIndex {
  case class Properties(
      indexColumn: String,      // column the BloomFilter is built on
      expectedNumItems: Long,   // expected number of distinct items
      fpp: Double,              // target false positive probability
      globalBF: Array[Byte])    // serialized BloomFilter for the whole dataset
}
  • indexColumn, expectedNumItems, fpp: parameters for BloomFilter construction
  • globalBF: BloomFilter binary data for the whole dataset.

At creation

We first build a bloom filter for each source file and then merge them into one globalBF for the BF join use case.
The file-level BF index can be stored in Parquet format with 2 columns - "FilePath" and "BloomFilterData" (see the sketch below).

  • in parquet format
    • FilePath: file path string
    • BloomFilterData: Binary (serialized via the writeTo function of the BloomFilter API)

For globalBF, it can be stored either in the IndexLogEntry or in a separate file, but it should be managed together with the file-level index.
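
A minimal creation sketch under the layout above, grouping rows by their source file via input_file_name(); the helper name and the cast of the indexed column to string are assumptions:

import java.io.ByteArrayOutputStream
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helper: builds one BloomFilter per source file and returns a DataFrame
// with the proposed ("FilePath", "BloomFilterData") schema, ready to be written as Parquet.
def buildFileLevelBFIndex(
    spark: SparkSession,
    df: DataFrame,
    indexedColumn: String,
    expectedNumItems: Long,
    fpp: Double): DataFrame = {
  import spark.implicits._
  df.select(input_file_name().as("FilePath"), df(indexedColumn).cast("string").as("value"))
    .as[(String, String)]
    .groupByKey(_._1)
    .mapGroups { (filePath, rows) =>
      val bf = BloomFilter.create(expectedNumItems, fpp)
      rows.foreach { case (_, value) => if (value != null) bf.putString(value) }
      val out = new ByteArrayOutputStream()
      bf.writeTo(out)
      (filePath, out.toByteArray)
    }
    .toDF("FilePath", "BloomFilterData")
}

The globalBF can then be obtained by merging these per-file filters with mergeInPlace, as in the refresh sketch below.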

At refresh

  • Full refresh

    1. filter the file-level index data by the deleted file paths => now it contains only valid rows
    2. create a DF by generating new entries for the appended files
    3. union both and write them as a new index version
    4. update globalBF using the updated file-level BFs (see the sketch after this list)
  • quick refresh (if full refresh is not too slow, we might not need to implement this)

    • For appended files
      • add new BF entries and "append" them to the index content directory => similar to incremental refresh.
      • merge the newly created BFs into the global BF.
    • For deleted files
      • basically it's okay not to update BF indexes for deleted files because they only cause false-positive matches.
      • for better performance, we can recompute the global BF using only the valid BF entries.
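
A sketch for step 4 of the full refresh (also usable for the "recompute the global BF" part of quick refresh), assuming the file-level layout above; collecting the per-file filters to the driver assumes their count is modest:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helper: rebuilds the global BF by merging the (valid) file-level entries.
def rebuildGlobalBF(fileLevelIndex: DataFrame): Array[Byte] = {
  val merged = fileLevelIndex
    .select("BloomFilterData")
    .collect()                           // assumes the number of per-file filters fits on the driver
    .map(_.getAs[Array[Byte]](0))
    .map(bytes => BloomFilter.readFrom(new ByteArrayInputStream(bytes)))
    .reduce((a, b) => a.mergeInPlace(b)) // all filters must share the same sizing parameters
  val out = new ByteArrayOutputStream()
  merged.writeTo(out)
  out.toByteArray
}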

How can we apply the index

New rule for push-down EQ filter

  • find a relation with a pushed-down EQ filter
  • get the available BF indexes & pick the best one
  • run a simple Spark job and get the list of files that can be excluded (sketched below)
  • remove those files from the relation's file list

Note that this is done at optimization time, so we need to check for performance regressions caused by the Spark job.
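
A sketch of the "simple Spark job" above, assuming a string-typed equality literal and the file-level index layout described earlier; the helper name is hypothetical:

import java.io.ByteArrayInputStream
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helper: returns the files whose BloomFilter proves they cannot
// contain the pushed-down equality literal, i.e. the files that can be excluded.
def excludableFiles(spark: SparkSession, bfIndex: DataFrame, literal: String): Seq[String] = {
  import spark.implicits._
  bfIndex
    .select("FilePath", "BloomFilterData")
    .as[(String, Array[Byte])]
    .map { case (path, bytes) =>
      val bf = BloomFilter.readFrom(new ByteArrayInputStream(bytes))
      (path, bf.mightContainString(literal))
    }
    .filter(!_._2)                // mightContain == false => the value is definitely not in this file
    .map(_._1)
    .collect()
    .toSeq
}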

New rule for join

  • for each Join plan
  • get available BF indexes for the counterpart relation
  • inject a BF filter operator (using a UDF?) before the shuffle (see the sketch below)
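
A rough DataFrame-level sketch of what the injected operator could do (the actual rule would rewrite the logical plan, and whether a UDF is the right mechanism is an open question in this proposal):

import java.io.ByteArrayInputStream
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical sketch: filter the counterpart relation of the join with the global BF
// before the shuffle, so rows that cannot match are dropped early.
def applyBFJoinFilter(globalBF: Array[Byte], counterpart: DataFrame, joinCol: String): DataFrame = {
  // Deserialize once on the driver; Spark's BloomFilter implementation is serializable,
  // so it can be captured by the UDF closure and shipped to executors.
  val bf = BloomFilter.readFrom(new ByteArrayInputStream(globalBF))
  val mightContain = udf((v: String) => v != null && bf.mightContainString(v))
  counterpart.filter(mightContain(counterpart(joinCol).cast("string")))
}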

How can we get available BF indexes

  • new parameters => getCandidateIndex(type, indexed column)

Work item

(PR-level todo list)

  • check feasibility with a prototype for plan modification - running a Spark job, injecting a BF filter, etc.
  • import BloomFilter & IndexLogEntry metadata extension
  • API extension for a new index type (empty impl) in general
  • create a BF Index
  • full refresh a BF Index
  • add a utility function for the simple Spark job that gets "excludable" files based on the BF index
  • implement candidate selection among BF indexes
  • implement BFFileFilterRule
  • create an enable/disable config & add the new rule to extraOptimizations in the package
  • implement BFJoinIndexRule
  • quick refresh a BF Index

How can we validate

Running TPC-H / TPC-DS benchmarks

Design document link

TODO

How can we extend for Hybrid Scan (using index w/o refresh)

  • For appended files
    • Rule1: push down EQ filter
      • This is based on file-level BF, so it won't affect the other BF indexes.
      • the newly appended files are left in the file list after removing "excludable" files.
      • => no additional work required.
    • Rule2: BF Join
      • With newly appended files, we cannot utilize the outdated global BF without a refresh.
      • So: 1) quick refresh, or 2) build a new BF from the appended files and merge it into the outdated global BF (see the sketch after this list).
  • For deleted files
    • Rule1: push-down EQ filter
      • The deleted files no longer appear in the relation's file list, so we can just ignore the corresponding BF entries in the BF index.
      • => no additional work required.
    • Rule2: BF join
      • We can proceed with the outdated global BF without any update.
      • Or a quick refresh can be an option for better performance.
      • => no additional work required.
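
A sketch for option 2) of the appended-files case in the BF Join rule, using DataFrameStatFunctions.bloomFilter to build a filter over the appended files only; parameter handling here is an assumption (the two filters must be built with the same sizing parameters to be mergeable):

import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helper: builds a BF over the appended files and merges it into the outdated global BF.
def mergeAppendedIntoGlobalBF(
    appendedDf: DataFrame,          // data read from the appended files only
    indexedColumn: String,
    outdatedGlobalBF: Array[Byte],
    expectedNumItems: Long,
    fpp: Double): Array[Byte] = {
  val appendedBF = appendedDf.stat.bloomFilter(indexedColumn, expectedNumItems, fpp)
  val globalBF = BloomFilter.readFrom(new ByteArrayInputStream(outdatedGlobalBF))
  globalBF.mergeInPlace(appendedBF)  // throws if the filters were created with incompatible parameters
  val out = new ByteArrayOutputStream()
  globalBF.writeTo(out)
  out.toByteArray
}
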
@sezruby added the proposal and untriaged labels Sep 14, 2020
@rapoth
Contributor

rapoth commented Jan 28, 2021

Closing in favor of #341. Thanks @sezruby!

@rapoth rapoth closed this as completed Jan 28, 2021