
[PROPOSAL]: Introduce BloomFilter index #161

Closed

sezruby opened this issue Sep 14, 2020 · 1 comment
Labels: proposal, untriaged

Comments

@sezruby
Collaborator

sezruby commented Sep 14, 2020

This is WIP, but please feel free to add any comment on this 😃

Support a BloomFilter index utilizing Spark's built-in BloomFilter implementation (https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/BloomFilter.html)
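
For reference, a minimal sketch (not Hyperspace code) of the Spark BloomFilter API this proposal would rely on; the values below are placeholders:

import java.io.ByteArrayOutputStream
import org.apache.spark.util.sketch.BloomFilter

// Build a filter sized for an expected number of items and a target false positive probability.
val bf = BloomFilter.create(1000000L, 0.03)
bf.putString("some_value")                          // insert an indexed-column value
val maybe = bf.mightContainString("some_value")     // true; no false negatives, only false positives

// Serialize for storage as index content; a compatible filter can be merged in.
val out = new ByteArrayOutputStream()
bf.writeTo(out)
val serialized = out.toByteArray
// bf.mergeInPlace(otherBf)                         // requires filters created with the same parameters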

This is the second type of index in Hyperspace, so it may require additional work across the codebase to support different index types. For example, getCandidateIndex needs to be fixed so that the current FilterIndexRule and JoinIndexRule use only covering indexes.

How can we derive benefit from the index

We could use BloomFilter in 2 ways:

  • Bloom Filter for each file
    • where: pushed-down "equal" (equality) filter condition
    • how: reducing the number of files to read
  • Bloom Filter for Join
    • where: Join plan
    • how: reducing the number of rows on the counterpart side of the join

Is the overhead during query time reasonable?

TODO

How can we create and refresh the index (immutable dataset)

API extension

  • IndexConfig for BF index => BFIndexConfig
hs.createIndex(df, BFIndexConfig("indexed_column", expectedNumItems, fpp))
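
A possible shape for BFIndexConfig; the field names and defaults here are assumptions, not a final API:

// Hypothetical config for the BloomFilter index type.
case class BFIndexConfig(
    indexedColumn: String,
    expectedNumItems: Long = 1000000L,  // sizing hint for BloomFilter.create
    fpp: Double = 0.03)                 // target false positive probability

// hs.createIndex(df, BFIndexConfig("indexed_column", expectedNumItems = 1000000L, fpp = 0.03))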

Metadata change

case class BloomFilterIndex(properties: BloomFilterIndex.Properties) {
  val kind = "BloomFilterIndex"
}
object BloomFilterIndex {
  case class Properties(
      indexColumn: String,      // column the BloomFilter is built on
      expectedNumItems: Long,   // expected number of distinct items
      fpp: Double,              // target false positive probability
      globalBF: Array[Byte])    // serialized BloomFilter for the whole dataset
}
  • indexColumn, expectedNumItems, fpp: parameters for BloomFilter construction
  • globalBF: BloomFilter binary data for the whole dataset.

At creation

We first build a bloom filter for each source file and then merge them into one globalBF for the BF join use case.
The file-level BF index can be stored in Parquet format with 2 columns - "FilePath" and "BloomFilterData" (see the sketch below).

  • in parquet format
    • FilePath: file path string
    • BloomFilterData: Binary (serialized via the writeTo function of the BloomFilter API)

For globalBF, it can be stored either in the IndexLogEntry or in a separate file, but it should be managed together with the file-level index.
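
A minimal creation sketch under the layout above, grouping rows by their source file via input_file_name(); the helper name and the cast of the indexed column to string are assumptions:

import java.io.ByteArrayOutputStream
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helper: builds one BloomFilter per source file and returns a DataFrame
// with the proposed ("FilePath", "BloomFilterData") schema, ready to be written as Parquet.
def buildFileLevelBFIndex(
    spark: SparkSession,
    df: DataFrame,
    indexedColumn: String,
    expectedNumItems: Long,
    fpp: Double): DataFrame = {
  import spark.implicits._
  df.select(input_file_name().as("FilePath"), df(indexedColumn).cast("string").as("value"))
    .as[(String, String)]
    .groupByKey(_._1)
    .mapGroups { (filePath, rows) =>
      val bf = BloomFilter.create(expectedNumItems, fpp)
      rows.foreach { case (_, value) => if (value != null) bf.putString(value) }
      val out = new ByteArrayOutputStream()
      bf.writeTo(out)
      (filePath, out.toByteArray)
    }
    .toDF("FilePath", "BloomFilterData")
}

The globalBF can then be obtained by merging these per-file filters with mergeInPlace, as in the refresh sketch below.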

At refresh

  • Full refresh

    1. filter the file-level index data by the deleted file paths => now it contains only valid rows
    2. create a DF by generating new entries for the appended files
    3. union both and write them as a new index version
    4. update globalBF using the updated file-level BFs (see the sketch after this list)
  • quick refresh (if full refresh is not too slow, we might not need to implement this)

    • For appended files
      • add new BF entries and "append" them to the index content directory => similar to incremental refresh.
      • merge the newly created BFs into the global BF.
    • For deleted files
      • basically it's okay not to update BF indexes for deleted files because they only cause false-positive matches.
      • for better performance, we can recompute the global BF using only the valid BF entries.
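
A sketch for step 4 of the full refresh (also usable for the "recompute the global BF" part of quick refresh), assuming the file-level layout above; collecting the per-file filters to the driver assumes their count is modest:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helper: rebuilds the global BF by merging the (valid) file-level entries.
def rebuildGlobalBF(fileLevelIndex: DataFrame): Array[Byte] = {
  val merged = fileLevelIndex
    .select("BloomFilterData")
    .collect()                           // assumes the number of per-file filters fits on the driver
    .map(_.getAs[Array[Byte]](0))
    .map(bytes => BloomFilter.readFrom(new ByteArrayInputStream(bytes)))
    .reduce((a, b) => a.mergeInPlace(b)) // all filters must share the same sizing parameters
  val out = new ByteArrayOutputStream()
  merged.writeTo(out)
  out.toByteArray
}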

How can we apply the index

New rule for push-down EQ filter

  • find a relation with a pushed-down EQ filter
  • get the available BF indexes & pick the best one
  • run a simple Spark job and get the list of files that can be excluded (sketched below)
  • remove those files from the relation's file list

Note that this is done at optimization time, so we need to check for performance regressions caused by the Spark job.
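
A sketch of the "simple Spark job" above, assuming a string-typed equality literal and the file-level index layout described earlier; the helper name is hypothetical:

import java.io.ByteArrayInputStream
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helper: returns the files whose BloomFilter proves they cannot
// contain the pushed-down equality literal, i.e. the files that can be excluded.
def excludableFiles(spark: SparkSession, bfIndex: DataFrame, literal: String): Seq[String] = {
  import spark.implicits._
  bfIndex
    .select("FilePath", "BloomFilterData")
    .as[(String, Array[Byte])]
    .map { case (path, bytes) =>
      val bf = BloomFilter.readFrom(new ByteArrayInputStream(bytes))
      (path, bf.mightContainString(literal))
    }
    .filter(!_._2)                // mightContain == false => the value is definitely not in this file
    .map(_._1)
    .collect()
    .toSeq
}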

New rule for join

  • for each Join plan
  • get available BF indexes for the counterpart relation
  • inject a BF filter operator (using a UDF?) before the shuffle (see the sketch below)
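
A rough DataFrame-level sketch of what the injected operator could do (the actual rule would rewrite the logical plan, and whether a UDF is the right mechanism is an open question in this proposal):

import java.io.ByteArrayInputStream
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical sketch: filter the counterpart relation of the join with the global BF
// before the shuffle, so rows that cannot match are dropped early.
def applyBFJoinFilter(globalBF: Array[Byte], counterpart: DataFrame, joinCol: String): DataFrame = {
  // Deserialize once on the driver; Spark's BloomFilter implementation is serializable,
  // so it can be captured by the UDF closure and shipped to executors.
  val bf = BloomFilter.readFrom(new ByteArrayInputStream(globalBF))
  val mightContain = udf((v: String) => v != null && bf.mightContainString(v))
  counterpart.filter(mightContain(counterpart(joinCol).cast("string")))
}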

How can we get available BF indexes

  • new parameters => getCandidateIndex(type, indexed column)

Work item

(PR-level todo list)

  • check feasibility with a prototype for plan modification - running a Spark job, injecting a BF filter, etc.
  • import BloomFilter & IndexLogEntry metadata extension
  • API extension for a new index type (empty impl) in general
  • create a BF Index
  • full refresh a BF Index
  • add a utility function for the simple Spark job that gets "excludable" files based on the BF index
  • implement candidate selection among BF indexes
  • implement BFFileFilterRule
  • create an enable/disable config & add the new rule to extraOptimizations in the package
  • implement BFJoinIndexRule
  • quick refresh a BF Index

How can we validate

Running TPC-H / TPC-DS benchmarks

Design document link

TODO

How can we extend for Hybrid Scan (using index w/o refresh)

  • For appended files
    • Rule1: push down EQ filter
      • This is based on file-level BF, so it won't affect the other BF indexes.
      • the newly appended files are left in the file list after removing "excludable" files.
      • => no additional work required.
    • Rule2: BF Join
      • With newly appended files, we cannot utilize the outdated global BF without a refresh.
      • So: 1) quick refresh, or 2) build a new BF from the appended files and merge it into the outdated global BF (see the sketch after this list).
  • For deleted files
    • Rule1: push-down EQ filter
      • The deleted files no longer appear in the relation's file list, so we can just ignore the corresponding BF entries in the BF index.
      • => no additional work required.
    • Rule2: BF join
      • We can proceed with the outdated global BF without any update.
      • Or a quick refresh can be an option for better performance.
      • => no additional work required.
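
A sketch for option 2) of the appended-files case in the BF Join rule, using DataFrameStatFunctions.bloomFilter to build a filter over the appended files only; parameter handling here is an assumption (the two filters must be built with the same sizing parameters to be mergeable):

import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helper: builds a BF over the appended files and merges it into the outdated global BF.
def mergeAppendedIntoGlobalBF(
    appendedDf: DataFrame,          // data read from the appended files only
    indexedColumn: String,
    outdatedGlobalBF: Array[Byte],
    expectedNumItems: Long,
    fpp: Double): Array[Byte] = {
  val appendedBF = appendedDf.stat.bloomFilter(indexedColumn, expectedNumItems, fpp)
  val globalBF = BloomFilter.readFrom(new ByteArrayInputStream(outdatedGlobalBF))
  globalBF.mergeInPlace(appendedBF)  // throws if the filters were created with incompatible parameters
  val out = new ByteArrayOutputStream()
  globalBF.writeTo(out)
  out.toByteArray
}
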
@sezruby added the proposal and untriaged labels Sep 14, 2020
@rapoth
Contributor

rapoth commented Jan 28, 2021

Closing in favor of #341. Thanks @sezruby!

@rapoth rapoth closed this as completed Jan 28, 2021