[PERF] Micropartition, lazy loading and Column Stats #1470

samster25 · 2023-10-06T03:09:58Z

No description provided.

codecov · 2023-10-06T03:21:06Z

Codecov Report

Merging #1470 (29aaf59) into main (9d20890) will not change coverage.
Report is 1 commit behind head on main.
The diff coverage is n/a.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1470   +/-   ##
=======================================
  Coverage   74.87%   74.87%           
=======================================
  Files          59       59           
  Lines        6106     6106           
=======================================
  Hits         4572     4572           
  Misses       1534     1534

jaychia · 2023-10-12T04:32:20Z

src/daft-micropartition/src/column_stats/comparison.rs

+impl DaftCompare<&ColumnRangeStatistics> for ColumnRangeStatistics {
+    type Output = crate::Result<ColumnRangeStatistics>;
+    fn equal(&self, rhs: &ColumnRangeStatistics) -> Self::Output {
+        // lower_bound: do they exactly overlap


Is this correct? Even when the ranges overlap exactly, I think the result is still a maybe:

Lhs = [0, 1, 2, 3] -> (min=0, max=3)
Rhs = [2, 1, 0, 3] -> (min=0, max=3)
Result = [0, 1, 0, 1] -> (min=0, max=1)

Shouldn't the correct logic be:
(False, False) // no overlap at all
(False, True) // some overlap
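
The rule proposed in this comment can be sketched as follows. This is a minimal illustration over plain `i64` bounds, not the PR's implementation: `Range` and `equal_bounds` are hypothetical names, and the `(true, true)` result for two identical single-point ranges is an added assumption that the comment itself does not mention.

```rust
/// Hypothetical stand-in for per-column min/max statistics.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Range {
    min: i64,
    max: i64,
}

/// Returns (lower, upper) bounds on the boolean result of `lhs == rhs`.
/// (false, false): no row can match; (false, true): some rows may match.
fn equal_bounds(lhs: Range, rhs: Range) -> (bool, bool) {
    // Two closed intervals intersect when each starts before the other ends.
    let overlap = lhs.min <= rhs.max && rhs.min <= lhs.max;
    // Assumption: equality is only guaranteed when both sides are the same single value.
    let same_point = lhs.min == lhs.max && rhs.min == rhs.max && lhs.min == rhs.min;
    (same_point, overlap)
}
```

With the example from the comment, `equal_bounds` on two `(min=0, max=3)` ranges gives `(false, true)`, which matches the observed result range of `(min=0, max=1)`.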

jaychia · 2023-10-12T04:53:48Z

src/daft-micropartition/src/column_stats/mod.rs

+impl ColumnRangeStatistics {
+    pub fn new(lower: Option<Series>, upper: Option<Series>) -> Result<Self> {
+        match (lower, upper) {
+            //TODO: also need to check dtype and length==1, and upper > lower.


Should we just throw an assert in here? Would be pretty nasty if this assumption isn't met
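
A rough sketch of what the checks named in the TODO could look like. Everything here is illustrative: `RangeStats` and `new_range_stats` are hypothetical names, `Vec<i64>` stands in for a `Series` (so the dtype check is implied by the type), and treating a missing bound as `Missing` is an assumption. It returns an error instead of asserting; an `assert!` would be the panicking equivalent. It also accepts `lower == upper`, since a column holding a single distinct value would produce equal bounds.

```rust
/// Hypothetical stand-in for the range statistics type; Vec<i64> plays the role of a Series.
#[derive(Debug, PartialEq)]
enum RangeStats {
    Missing,
    Loaded { lower: Vec<i64>, upper: Vec<i64> },
}

fn new_range_stats(
    lower: Option<Vec<i64>>,
    upper: Option<Vec<i64>>,
) -> Result<RangeStats, String> {
    match (lower, upper) {
        (Some(l), Some(u)) => {
            // Each bound must hold exactly one value.
            if l.len() != 1 || u.len() != 1 {
                return Err(format!(
                    "expected length-1 bounds, got {} and {}",
                    l.len(),
                    u.len()
                ));
            }
            // The lower bound must not exceed the upper bound.
            if l[0] > u[0] {
                return Err(format!("lower bound {} exceeds upper bound {}", l[0], u[0]));
            }
            Ok(RangeStats::Loaded { lower: l, upper: u })
        }
        // Assumption: if either bound is absent, the statistics are unusable.
        _ => Ok(RangeStats::Missing),
    }
}
```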

jaychia · 2023-10-12T05:16:21Z

src/daft-micropartition/src/column_stats/from_parquet.rs

+                    let lower =
+                        Utf8Array::from(("lower", [lower.as_str()].as_slice())).into_series();
+                    let upper =
+                        Utf8Array::from(("upper", [upper.as_str()].as_slice())).into_series();


Oh just realized too that if we ever do any arithmetic on the series, they might get really weird with the names if we don't slot them in the correct spots (e.g. lower + upper will be called lower)

Doesn't really matter though?

jaychia · 2023-10-12T05:19:28Z

src/daft-micropartition/src/micropartition.rs

+                            .as_ref()
+                            .map(|v| v.iter().map(|s| s.as_ref()).collect::<Vec<_>>());
+                        let urls = params.urls.iter().map(|s| s.as_str()).collect::<Vec<_>>();
+                        daft_parquet::read::read_parquet_bulk(


Big money move right here 😍

jaychia · 2023-10-12T05:22:58Z

src/daft-micropartition/src/micropartition.rs

+        if predicate.is_empty() {
+            return Ok(Self::new(
+                self.schema.clone(),
+                TableState::Loaded(vec![].into()),


Why is empty predicate a full filter instead of a no-op?

E.g. if we perform some predicate pushdown and somehow end up with a filter([]), shouldn't that be a no-op?
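
To illustrate the no-op reading: if a filter is the conjunction of its predicates, an empty list is vacuously true for every row. A minimal sketch, assuming rows are plain `i64` values and predicates are function pointers (`filter_rows` and `is_even` are hypothetical and unrelated to the PR's API):

```rust
/// Example predicate used only for illustration.
fn is_even(x: i64) -> bool {
    x % 2 == 0
}

/// Keeps the rows for which every predicate holds.
/// `Iterator::all` returns true on an empty iterator, so an empty
/// predicate list keeps every row.
fn filter_rows(rows: &[i64], predicates: &[fn(i64) -> bool]) -> Vec<i64> {
    rows.iter()
        .copied()
        .filter(|&row| predicates.iter().all(|p| p(row)))
        .collect()
}
```

Under this convention, `filter_rows(&[1, 2, 3, 4], &[])` returns all four rows rather than an empty result.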

jaychia · 2023-10-12T05:24:58Z

src/daft-micropartition/src/micropartition.rs

+    }
+}
+
+fn read_parquet_into_micropartition(


Is this only used in tests? Maybe we can move into the test block below

jaychia · 2023-10-12T05:26:04Z

src/daft-micropartition/src/utils/deserialize.rs

@@ -0,0 +1,45 @@
+use parquet2::{schema::types::TimeUnit, types::int96_to_i64_ns};
+


We should add references to where we got these functions

jaychia · 2023-10-12T05:31:07Z

src/daft-micropartition/src/table_stats/mod.rs

+                    Gt => lhs.gt(&rhs),
+                    Plus => &lhs + &rhs,
+                    Minus => &lhs - &rhs,
+                    _ => todo!(),


Should we return a Missing here instead of todo? Ditto for the todo at the bottom.
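
One way to picture this suggestion is a fallthrough arm that yields a "missing" statistic instead of panicking. The sketch below is an assumption-laden illustration: `Stats`, `Op`, and `eval_binary` are hypothetical names over `i64` bounds, and overflow is ignored.

```rust
/// Hypothetical statistics value: either unknown or a closed [min, max] range.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Stats {
    Missing,
    Range { min: i64, max: i64 },
}

/// Hypothetical subset of binary operators.
#[derive(Debug, Clone, Copy)]
enum Op {
    Plus,
    Minus,
    Modulus,
}

fn eval_binary(op: Op, lhs: Stats, rhs: Stats) -> Stats {
    match (op, lhs, rhs) {
        // Sum of two ranges: add the matching bounds.
        (Op::Plus, Stats::Range { min: a, max: b }, Stats::Range { min: c, max: d }) => {
            Stats::Range { min: a + c, max: b + d }
        }
        // Difference of two ranges: smallest minus largest, largest minus smallest.
        (Op::Minus, Stats::Range { min: a, max: b }, Stats::Range { min: c, max: d }) => {
            Stats::Range { min: a - d, max: b - c }
        }
        // Unsupported operators and missing inputs degrade to Missing instead of panicking.
        _ => Stats::Missing,
    }
}
```

Returning `Missing` keeps the result conservative: a caller that cannot prove anything from the statistics simply has to load and evaluate the data.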

samster25 added 9 commits October 9, 2023 15:24

first stab at column stats

f35f53e

impl gt and into truthvalue

800012a

add display for truthcvalues and correct bug in lt gt

9686161

finish comparision ops

21c0c33

factor out ops

a643bcf

test case for equal

e16164f

filter on table stats

18189fe

add bit or

a068110

wip

87dd2bd

samster25 force-pushed the sammy/init-micropartition branch from 5ca95e1 to 87dd2bd Compare October 9, 2023 22:24

samster25 added 9 commits October 9, 2023 17:41

column stats to range stats

32e3b2b

support for binarystats

2a3f35c

fix date repr

d27e279

date type

030bb2a

add timestamp type

d00175d

factor out conversion code

6cc4335

refactor into try from

3c627ba

factor columnrangestats into option

30a3f57

add handling of missing

f5f40f6

samster25 marked this pull request as ready for review October 11, 2023 22:48

samster25 added 8 commits October 11, 2023 16:06

factor out timestamp into common fn

0499fd0

factor out timestamp into common fn

91d7df6

lint fixes

3e485cf

clippy fixes

d5d1237

drop some prints

557a24a

connect to parquet for defered loading

521c67d

clippy fixes

be5c162

comments

c9d1468

jaychia reviewed Oct 12, 2023

samster25 added 2 commits October 16, 2023 13:50

thread in iostats

bfcd297

add asserts

29aaf59

samster25 changed the title ~~Sammy/init micropartition~~ [PEF] Micropartition, lazy loading and Column Stats Oct 16, 2023

samster25 changed the title ~~[PEF] Micropartition, lazy loading and Column Stats~~ [PERF] Micropartition, lazy loading and Column Stats Oct 16, 2023

github-actions bot added the performance label Oct 16, 2023

samster25 merged commit 85dac45 into main Oct 16, 2023
26 of 28 checks passed

samster25 deleted the sammy/init-micropartition branch October 16, 2023 21:38
