GH-44010: [C++] Add arrow::RecordBatch::MakeStatisticsArray() #44252

Open: 3 commits to be merged into main

Conversation

kou
Member

@kou kou commented Sep 30, 2024

Rationale for this change

The statistics schema for the Arrow C data interface (GH-43553) is complex because it uses nested types (struct, map, and union), so a reusable implementation for building the statistics array is useful.

What changes are included in this PR?

arrow::RecordBatch::MakeStatisticsArray() is a convenience function that converts the arrow::ArrayStatistics in an arrow::RecordBatch to an arrow::Array for the Arrow C data interface.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

⚠️ GitHub issue #44010 has been automatically assigned in GitHub to PR creator.

cpp/src/arrow/compare.cc (outdated)
Comment on lines +475 to +541
  auto enumerate_statistics =
      [&](std::function<Status(int nth_statistics, bool start_new_column,
                               std::optional<int32_t> nth_column, const char* key,
                               const std::shared_ptr<DataType>& type,
                               const ArrayStatistics::ValueType& value)>
              yield) {
        int nth_statistics = 0;
        RETURN_NOT_OK(yield(nth_statistics++, true, std::nullopt,
                            ARROW_STATISTICS_KEY_ROW_COUNT_EXACT, int64(),
                            ArrayStatistics::ValueType{num_rows_}));

        int num_fields = schema_->num_fields();
        for (int nth_column = 0; nth_column < num_fields; ++nth_column) {
          auto statistics = column(nth_column)->statistics();
          if (!statistics) {
            continue;
          }

          bool start_new_column = true;
          if (statistics->null_count.has_value()) {
            RETURN_NOT_OK(yield(
                nth_statistics++, start_new_column, std::optional<int32_t>(nth_column),
                ARROW_STATISTICS_KEY_NULL_COUNT_EXACT, int64(),
                ArrayStatistics::ValueType{statistics->null_count.value()}));
            start_new_column = false;
          }

          if (statistics->distinct_count.has_value()) {
            RETURN_NOT_OK(yield(
                nth_statistics++, start_new_column, std::optional<int32_t>(nth_column),
                ARROW_STATISTICS_KEY_DISTINCT_COUNT_EXACT, int64(),
                ArrayStatistics::ValueType{statistics->distinct_count.value()}));
            start_new_column = false;
          }

          if (statistics->min.has_value()) {
            if (statistics->is_min_exact) {
              RETURN_NOT_OK(yield(nth_statistics++, start_new_column,
                                  std::optional<int32_t>(nth_column),
                                  ARROW_STATISTICS_KEY_MIN_VALUE_EXACT,
                                  statistics->MinArrowType(), statistics->min.value()));
            } else {
              RETURN_NOT_OK(yield(nth_statistics++, start_new_column,
                                  std::optional<int32_t>(nth_column),
                                  ARROW_STATISTICS_KEY_MIN_VALUE_APPROXIMATE,
                                  statistics->MinArrowType(), statistics->min.value()));
            }
            start_new_column = false;
          }

          if (statistics->max.has_value()) {
            if (statistics->is_max_exact) {
              RETURN_NOT_OK(yield(nth_statistics++, start_new_column,
                                  std::optional<int32_t>(nth_column),
                                  ARROW_STATISTICS_KEY_MAX_VALUE_EXACT,
                                  statistics->MaxArrowType(), statistics->max.value()));
            } else {
              RETURN_NOT_OK(yield(nth_statistics++, start_new_column,
                                  std::optional<int32_t>(nth_column),
                                  ARROW_STATISTICS_KEY_MAX_VALUE_APPROXIMATE,
                                  statistics->MaxArrowType(), statistics->max.value()));
            }
            start_new_column = false;
          }
        }
        return Status::OK();
      };
Member Author

We may want to extract this as an internal function.
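
For illustration, one possible shape of such an extraction is sketched below as a free function that takes the row count, the per-column statistics, and a callback. This is only a sketch under assumptions: `ColumnStats`, `EnumeratedStat`, `EnumerateStatistics`, and the `k...` key constants are hypothetical stand-ins rather than Arrow types or the real `ARROW_STATISTICS_KEY_*` values, a `bool` return replaces `Status`, and the value's Arrow type argument is omitted for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for ArrayStatistics::ValueType.
using StatValue = std::variant<bool, int64_t, uint64_t, double, std::string>;

// Hypothetical stand-in for arrow::ArrayStatistics.
struct ColumnStats {
  std::optional<int64_t> null_count;
  std::optional<int64_t> distinct_count;
  std::optional<StatValue> min;
  bool is_min_exact = false;
  std::optional<StatValue> max;
  bool is_max_exact = false;
};

// One enumerated statistic, mirroring the arguments passed to yield above.
struct EnumeratedStat {
  int nth_statistics;
  bool start_new_column;
  std::optional<int32_t> nth_column;  // nullopt: statistic of the whole batch
  std::string key;
  StatValue value;
};

// Placeholder key strings (assumptions, not the real macro values).
constexpr const char* kRowCountExact = "row_count:exact";
constexpr const char* kNullCountExact = "null_count:exact";
constexpr const char* kDistinctCountExact = "distinct_count:exact";
constexpr const char* kMinValueExact = "min_value:exact";
constexpr const char* kMinValueApproximate = "min_value:approximate";
constexpr const char* kMaxValueExact = "max_value:exact";
constexpr const char* kMaxValueApproximate = "max_value:approximate";

using Yield = std::function<bool(const EnumeratedStat&)>;

// Calls yield once per available statistic; stops and returns false as soon
// as yield returns false. Columns without statistics are skipped.
bool EnumerateStatistics(int64_t num_rows,
                         const std::vector<std::optional<ColumnStats>>& columns,
                         const Yield& yield) {
  int nth_statistics = 0;
  // The row count describes the batch itself, so it carries no column index.
  if (!yield({nth_statistics++, true, std::nullopt, kRowCountExact,
              StatValue{num_rows}})) {
    return false;
  }
  for (std::size_t i = 0; i < columns.size(); ++i) {
    const auto& stats = columns[i];
    if (!stats) continue;
    bool start_new_column = true;
    // Only the first statistic emitted for a column has start_new_column set.
    auto emit = [&](const char* key, const StatValue& value) {
      bool ok = yield({nth_statistics++, start_new_column,
                       static_cast<int32_t>(i), key, value});
      start_new_column = false;
      return ok;
    };
    if (stats->null_count && !emit(kNullCountExact, StatValue{*stats->null_count}))
      return false;
    if (stats->distinct_count &&
        !emit(kDistinctCountExact, StatValue{*stats->distinct_count}))
      return false;
    if (stats->min &&
        !emit(stats->is_min_exact ? kMinValueExact : kMinValueApproximate, *stats->min))
      return false;
    if (stats->max &&
        !emit(stats->is_max_exact ? kMaxValueExact : kMaxValueApproximate, *stats->max))
      return false;
  }
  return true;
}
```

With a shape like this, the same enumeration could be driven twice with different callbacks, for example once to count entries and once to append values, without duplicating the per-statistic branching.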

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 30, 2024
@kou kou force-pushed the cpp-record-batch-make-statistics-array branch from 903e3f4 to 92afc83 Compare September 30, 2024 08:17
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 30, 2024
It's a convenience function that converts the `arrow::ArrayStatistics` in an
`arrow::RecordBatch` to an `arrow::Array` for the Arrow C data interface.
@kou kou force-pushed the cpp-record-batch-make-statistics-array branch from 92afc83 to b194430 Compare September 30, 2024 09:03
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Sep 30, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 30, 2024
Member Author

kou commented Oct 2, 2024

@pitrou @ianmcook What do you think about this?

The statistics schema https://github.com/apache/arrow/pull/43553/files#diff-f3758fb6986ea8d24bb2e13c2feb625b68bbd6b93b3fbafd3e2a03dcdc7ba263R86-R95 is compact, but it may be complex to build because it uses many nested types.
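
To make the nesting concrete, here is a rough model of the shape using plain C++ containers in place of Arrow's struct, map, and dense union types. It assumes the layout is one struct per target, holding a nullable column index and a map from statistic key to a union value. `StatisticsValue`, `StatisticsRow`, `FlatEntry`, and `BuildRows` are hypothetical names for illustration only, and the real schema dictionary-encodes the keys, which this sketch ignores.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Stand-in for the dense union of statistic value types.
using StatisticsValue = std::variant<int64_t, double, std::string>;

// Stand-in for one struct element of the statistics array.
struct StatisticsRow {
  std::optional<int32_t> column;  // nullopt: the record batch itself
  // Ordered key/value pairs standing in for the map type.
  std::vector<std::pair<std::string, StatisticsValue>> statistics;
};

// One flat (column, key, value) entry as a producer might emit it.
struct FlatEntry {
  std::optional<int32_t> column;
  std::string key;
  StatisticsValue value;
};

// Groups consecutive entries that share a column index into one row, which
// is the bookkeeping that the nested struct/map layout requires.
std::vector<StatisticsRow> BuildRows(const std::vector<FlatEntry>& entries) {
  std::vector<StatisticsRow> rows;
  for (const auto& entry : entries) {
    if (rows.empty() || rows.back().column != entry.column) {
      rows.push_back(StatisticsRow{entry.column, {}});
    }
    rows.back().statistics.emplace_back(entry.key, entry.value);
  }
  return rows;
}
```

Even in this simplified form, the builder has to track row boundaries, map entries, and the union's active type at once, which is the complexity a reusable helper would hide.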