Add data type field to NullArray #5173

Closed
waynexia opened this issue Dec 6, 2023 · 6 comments
Labels
development-process (Related to development process of arrow-rs), enhancement (Any new improvement worthy of an entry in the changelog)

Comments

@waynexia
Member

waynexia commented Dec 6, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

NullArray::data_type() returns the Null type unconditionally. This limits its use to arrays of the Null type only. I'm trying to leverage NullArray in scenarios where I know all the values of an array are null (but typed). This can save a lot of memory.

Describe the solution you'd like

Add a data_type field to NullArray, something like

pub struct NullArray {
    data_type: DataType,
    len: usize,
}

I've referenced the C++ NullArray implementation, which inherits a complete ArrayData from the Array class. We don't need to do the same in arrow-rs, but we can consider adding useful fields like len and data_type separately.
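To make the proposal concrete, here is a minimal sketch of what a typed all-null array could look like. It is a toy model only: the `DataType` enum below has three illustrative variants and is not the arrow-rs `DataType`, and `TypedNullArray` with its methods is a hypothetical name, not an existing or planned arrow-rs type.

```rust
// Toy model only: these are NOT the arrow-rs definitions.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Null,
    Int64,
    Utf8,
}

/// Hypothetical all-null array that remembers a logical type
/// but allocates no value or validity buffers.
#[derive(Debug, Clone, PartialEq)]
struct TypedNullArray {
    data_type: DataType,
    len: usize,
}

#[allow(dead_code)]
impl TypedNullArray {
    fn new(data_type: DataType, len: usize) -> Self {
        Self { data_type, len }
    }

    /// Returns the remembered logical type instead of always `Null`.
    fn data_type(&self) -> &DataType {
        &self.data_type
    }

    fn len(&self) -> usize {
        self.len
    }

    /// Every in-bounds slot is null by construction.
    fn is_null(&self, i: usize) -> bool {
        assert!(i < self.len, "index out of bounds");
        true
    }
}
```

In this sketch the memory cost is the size of the struct itself regardless of `len`, which is the saving the request is after.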

Describe alternatives you've considered

Additional context

@waynexia added the enhancement label Dec 6, 2023
@tustvold
Contributor

tustvold commented Dec 6, 2023

I'm not sure this will work. Aside from diverging from the Arrow standard, DataType must correspond to the appropriate array variant. Perhaps you might consider using RunArray instead?

@waynexia
Member Author

waynexia commented Dec 6, 2023

BTW, I also found that Array::is_null() behaves differently in the C++ and Rust implementations (related discussion: #4835, cc @alamb @tustvold). For a NullArray, the C++ implementation always returns true and the Rust implementation always returns false.

source:

  /// \brief Return true if value at index is null. Does not boundscheck
  bool IsNull(int64_t i) const { return !IsValid(i); }

  /// \brief Return true if value at index is valid (not null). Does not
  /// boundscheck
  bool IsValid(int64_t i) const {
    if (null_bitmap_data_ != NULLPTR) {
      return bit_util::GetBit(null_bitmap_data_, i + data_->offset);
    }
    // Dispatching with a few conditionals like this makes IsNull more
    // efficient for how it is used in practice. Making IsNull virtual
    // would add a vtable lookup to every call and prevent inlining +
    // a potential inner-branch removal.
    if (type_id() == Type::SPARSE_UNION) {
      return !internal::IsNullSparseUnion(*data_, i);
    }
    if (type_id() == Type::DENSE_UNION) {
      return !internal::IsNullDenseUnion(*data_, i);
    }
    if (type_id() == Type::RUN_END_ENCODED) {
      return !internal::IsNullRunEndEncoded(*data_, i);
    }
    return data_->null_count != data_->length;
  }

This raises another question: should we strive to maintain consistent behavior of common APIs across various implementations?
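For readers who do not read C++, the fallback logic in the snippet above can be restated as a small Rust sketch. This is a toy model under stated assumptions: `ArrayDataModel` is a hypothetical struct, the validity bitmap is modeled as one `bool` per slot, and the union and run-end-encoded branches are omitted. It is not arrow-rs code.

```rust
/// Toy stand-in for the fields the C++ snippet reads; not a real Arrow type.
struct ArrayDataModel {
    /// Models `null_bitmap_data_`: one bool per physical slot, true = valid.
    validity: Option<Vec<bool>>,
    offset: usize,
    length: usize,
    null_count: usize,
}

impl ArrayDataModel {
    /// Mirrors the C++ `IsValid` (union and run-end-encoded cases omitted).
    fn is_valid(&self, i: usize) -> bool {
        if let Some(bits) = &self.validity {
            return bits[i + self.offset];
        }
        // No bitmap: the array is either all valid or all null.
        self.null_count != self.length
    }

    /// Mirrors the C++ `IsNull`.
    fn is_null(&self, i: usize) -> bool {
        !self.is_valid(i)
    }
}
```

With no bitmap and `null_count == length`, which is how a null-only array looks in this model, `is_null` is true for every index.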

@tustvold
Contributor

tustvold commented Dec 6, 2023

See #4840

consistent behavior of common APIs across various implementations

Where possible, yes, but keeping the APIs coherent across the crate is more important. Some divergence is to be expected.

@waynexia
Member Author

waynexia commented Dec 6, 2023

I'm not sure this will work. Aside from diverging from the Arrow standard, DataType must correspond to the appropriate array variant. Perhaps you might consider using RunArray instead?

Thanks for replying. What confuses me is how to understand the Null type. Sometimes it indicates a missing value (maybe this is the "logical null" in #4840), in which case there should be another type for the possible value, like None for Option<i64>. This is how I interpret NullArray: a container that holds only None.

But when Null means null itself (maybe nullptr? not an appropriate analogy), it doesn't carry any extra info, like the current NullArray implementation.

From the spec I can't figure out which interpretation matches NullArray's definition:

We provide a simplified memory-efficient layout for the Null data type where all values are null. In this case no memory buffers are allocated.

@tustvold
Contributor

tustvold commented Dec 6, 2023

NullArray is a bit of an odd one, but it is for the case where no value type is known or possible, e.g. a SQL NULL literal, and it is therefore necessary to provide a container with no value type. There may be other use cases, but that is the major one. The notion of then including a type on a container that is by design untyped is a little surprising to me.

@alamb
Contributor

alamb commented Dec 6, 2023

My opinion is that the behavior described in #4835 / #4840 may be pedantically correct but is incredibly confusing from a user perspective.

NullArray::data_type() returns the Null type unconditionally. This limits its use to arrays of the Null type only. I'm trying to leverage NullArray in scenarios where I know all the values of an array are null (but typed). This can save a lot of memory.

The idea of saving memory for null-only arrays sounds like a very reasonable use case to me, and it is my understanding that this is one of the main uses of DataType::Null / NullArray in the first place.

Representing this idea via the existing NullArray is likely challenging for several of the reasons you and @tustvold have mentioned.

I wonder if you could use a RunEndEncoded / RunArray array 🤔 It should allow you to make an underlying array with a single element and then a run length however long you wanted.

I realize the support for RunEndEncoded is relatively sparse at the moment, but maybe this could be a reason to improve it 🤔
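As a rough illustration of the run-end-encoding idea, the sketch below stores one physical null value plus a single run end, so a typed all-null column of any length costs a constant amount of memory. It is a toy model in plain Rust: `RunEncoded`, `all_null`, and `get` are hypothetical names for this example, the value type is fixed to `Option<i64>`, and none of this is the arrow-rs RunArray API.

```rust
/// Toy run-end-encoded column of optional i64 values; not the arrow-rs RunArray.
struct RunEncoded {
    /// Exclusive logical end index of each run, strictly increasing.
    run_ends: Vec<usize>,
    /// One physical value per run; `None` models a null.
    values: Vec<Option<i64>>,
}

impl RunEncoded {
    /// A typed, all-null column: a single run covering `len` slots.
    fn all_null(len: usize) -> Self {
        if len == 0 {
            return Self { run_ends: Vec::new(), values: Vec::new() };
        }
        Self { run_ends: vec![len], values: vec![None] }
    }

    /// Logical length is the end of the last run.
    fn len(&self) -> usize {
        self.run_ends.last().copied().unwrap_or(0)
    }

    /// Looks up the logical slot `i` by binary-searching the run ends.
    fn get(&self, i: usize) -> Option<i64> {
        assert!(i < self.len(), "index out of bounds");
        // Index of the first run whose exclusive end is greater than `i`.
        let run = self.run_ends.partition_point(|&end| end <= i);
        self.values[run]
    }
}
```

The value type stays known (here `i64`) even though every slot is null, which is the property the untyped NullArray cannot offer.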

@tustvold closed this as not planned Jan 1, 2024
@tustvold added the development-process label Jan 5, 2024