Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: fix case where num_rows and total_byte_size are not defined (stat should be None instead of Some(0)) #2976

Open
thomas-k-cameron opened this issue Jul 27, 2022 · 0 comments
Labels
bug Something isn't working

Comments

@thomas-k-cameron
Copy link
Contributor

thomas-k-cameron commented Jul 27, 2022

I'd love to create a pull request if there isn't any problem.
master...thomas-k-cameron:arrow-datafusion:stat-should-be-None-instead-of-Some(0)

Describe the bug
Some fields of Statistics could return Some(0) when the value is not available.

TODO comment says that some fields are supposed to be None instead of Some(0) under some circumstances but the value is set to Some(0) even when it is supposed to have None.

/// Get all files as well as the file level summary statistics (no statistic for partition columns).
/// If the optional `limit` is provided, includes only sufficient files.
/// Needed to read up to `limit` number of rows.
/// TODO fix case where `num_rows` and `total_byte_size` are not defined (stat should be None instead of Some(0))
pub async fn get_statistics_with_limit(
    all_files: impl Stream<Item = Result<(PartitionedFile, Statistics)>>,
    file_schema: SchemaRef,
    limit: Option<usize>,
) -> Result<(Vec<PartitionedFile>, Statistics)>

To Reproduce
I haven't ran into any breaking bugs yet, however, the functions is used in a method provided by one of the method implemented on ListingTable struct and Some(0) is returned when the options.collect_stat is set to false (aka when it is not supposed to be available).

And it appears that it is generating

Statistics is passed to a method implemented on Arc<dyn FileFormat> so chance of someone running into a bug is not zero.

Also, pub async fn get_statistics_with_limit is a exported function.

Expected behavior
returns None instead of Some(0).

Alternatives Considered

  1. Turn Some(0) to None.
    However, this solution would be rather labour intensive as the field is being used by everywhere (over 200 occurrence)
    Also anyone using it cannot determine is the size is 0 or the value is just not available
  2. Assign None in ListingTable::async fn list_files_for_scan<'a>
    ListingTable::async fn list_files_for_scan<'a> is the only function calling get_statistics_with_limit. It can be fixed by assigning None to statistics.
    However, get_statistics_with_limit is being exported so I didn't think it really solves the problem.
  3. Return a number that was available
    The Statistics struct is basically an aggregate of all Statistics struct passed onto the function.
    Some values may have fields that is not None.
    Considering that Statistics is not meant to always return a accurate value, I think this could work too.

Additional context
Is it suppose to run get_statistics_with_limit when options.collect_stat is set to false?

Thank you very much for reading my issue and please let me know if I'm getting it wrong.

@thomas-k-cameron thomas-k-cameron added the bug Something isn't working label Jul 27, 2022
@xudong963 xudong963 changed the title /// TODO fix case where num_rows and total_byte_size are not defined (stat should be None instead of Some(0)) Bug: fix case where num_rows and total_byte_size are not defined (stat should be None instead of Some(0)) Jul 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant