Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PERF] Micropartition, lazy loading and Column Stats #1470

Merged
merged 28 commits into from
Oct 16, 2023

Conversation

samster25
Copy link
Member

No description provided.

@codecov
Copy link

codecov bot commented Oct 6, 2023

Codecov Report

Merging #1470 (29aaf59) into main (9d20890) will not change coverage.
Report is 1 commits behind head on main.
The diff coverage is n/a.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1470   +/-   ##
=======================================
  Coverage   74.87%   74.87%           
=======================================
  Files          59       59           
  Lines        6106     6106           
=======================================
  Hits         4572     4572           
  Misses       1534     1534           

@samster25 samster25 marked this pull request as ready for review October 11, 2023 22:48
impl DaftCompare<&ColumnRangeStatistics> for ColumnRangeStatistics {
type Output = crate::Result<ColumnRangeStatistics>;
fn equal(&self, rhs: &ColumnRangeStatistics) -> Self::Output {
// lower_bound: do they exactly overlap
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? Even in the case when ranges exactly overlap, I think the result is stilI a maybe:

Lhs = [0, 1, 2, 3] -> (min=0, max=3)
Rhs = [2, 1, 0, 3] -> (min=0, max=3)
Result = [0, 1, 0, 1] -> (min=0, max=1)

Shouldn't the correct logic be:
(False, False) // no overlap at all
(False, True) // some overlap

impl ColumnRangeStatistics {
pub fn new(lower: Option<Series>, upper: Option<Series>) -> Result<Self> {
match (lower, upper) {
//TODO: also need to check dtype and length==1, and upper > lower.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just throw an assert in here? Would be pretty nasty if this assumption isn't met

let lower =
Utf8Array::from(("lower", [lower.as_str()].as_slice())).into_series();
let upper =
Utf8Array::from(("upper", [upper.as_str()].as_slice())).into_series();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh just realized too that if we ever do any arithmetic on the series, they might get really weird with the names if we don't slot them in the correct spots (e.g. lower + upper will be called lower)

Doesn't really matter though?

.as_ref()
.map(|v| v.iter().map(|s| s.as_ref()).collect::<Vec<_>>());
let urls = params.urls.iter().map(|s| s.as_str()).collect::<Vec<_>>();
daft_parquet::read::read_parquet_bulk(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big money move right here 😍

if predicate.is_empty() {
return Ok(Self::new(
self.schema.clone(),
TableState::Loaded(vec![].into()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is empty predicate a full filter instead of a no-op?

E.g. if we perform some predicate pushdown and somehow end up with a filter([]), shouldn't that be a no-op?

}
}

fn read_parquet_into_micropartition(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only used in tests? Maybe we can move into the test block below

@@ -0,0 +1,45 @@
use parquet2::{schema::types::TimeUnit, types::int96_to_i64_ns};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add references to where we got these functions

Gt => lhs.gt(&rhs),
Plus => &lhs + &rhs,
Minus => &lhs - &rhs,
_ => todo!(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return a Missing here instead of todo? Ditto for the todo at the bottom.

@samster25 samster25 changed the title Sammy/init micropartition [PEF] Micropartition, lazy loading and Column Stats Oct 16, 2023
@samster25 samster25 changed the title [PEF] Micropartition, lazy loading and Column Stats [PERF] Micropartition, lazy loading and Column Stats Oct 16, 2023
@samster25 samster25 merged commit 85dac45 into main Oct 16, 2023
26 of 28 checks passed
@samster25 samster25 deleted the sammy/init-micropartition branch October 16, 2023 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants