Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-32763: [C++] Add FromProto for fetch & sort #34651

Merged

Conversation

westonpace
Copy link
Member

@westonpace westonpace commented Mar 21, 2023

This does not support the clustered sort direction, custom sort functions, or complex (non-reference) expressions as sort keys.

@westonpace westonpace marked this pull request as draft March 21, 2023 05:07
@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #32763 has been automatically assigned in GitHub to PR creator.

@westonpace westonpace force-pushed the feature/GH-32763--substrait-fetch-order-by branch from d77339c to 167d4a3 Compare March 21, 2023 05:10
@westonpace westonpace marked this pull request as ready for review March 21, 2023 05:10
@westonpace
Copy link
Member Author

CC @vibhatha / @rtpsw mind taking a look?

@amol-
Copy link
Member

amol- commented Apr 5, 2023

@vibhatha @rtpsw are you able to take a look as suggest by @westonpace ?
If this one is mostly complete and only needs rebasing, it would be great to get it reviewed and ship before v12 is released.

@westonpace westonpace force-pushed the feature/GH-32763--substrait-fetch-order-by branch from 167d4a3 to 87451a6 Compare April 10, 2023 21:32
@westonpace
Copy link
Member Author

Rebased.

@westonpace
Copy link
Member Author

I think @rtpsw is out this week but maybe @icexelloss can take a look

@vibhatha
Copy link
Collaborator

@amol- @westonpace I will take a look.

compute::SortOrder SortOrderFromDirection(
const substrait::SortField::SortDirection& direction) {
if (direction < 3) {
return compute::SortOrder::Ascending;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we consider UNSPECIFIED as Ascending?

Here:

 enum SortDirection {
    SORT_DIRECTION_UNSPECIFIED = 0;
    SORT_DIRECTION_ASC_NULLS_FIRST = 1;
    SORT_DIRECTION_ASC_NULLS_LAST = 2;
    SORT_DIRECTION_DESC_NULLS_FIRST = 3;
    SORT_DIRECTION_DESC_NULLS_LAST = 4;
    SORT_DIRECTION_CLUSTERED = 5;
  }

I assume since the 1, 2 are Ascending, <3 is to pick the first three as Ascending.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, technically unspecified would be an invalid plan I think. So it would probably be better to reject.

Copy link
Member Author

@westonpace westonpace May 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now an explicit rejection.

      case substrait::SortField::SortDirection::
          SortField_SortDirection_SORT_DIRECTION_UNSPECIFIED:
        return Status::Invalid("The substrait plan does not specify a sort direction");

namespace {

bool IsSortNullsFirst(const substrait::SortField::SortDirection& direction) {
return direction % 2 == 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is efficient, but shall we leave a comment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But CLUSTERED has no null preference though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now throw an error on CLUSTERED:

      case substrait::SortField::SortDirection::
          SortField_SortDirection_SORT_DIRECTION_CLUSTERED:
      default:
        return Status::NotImplemented(
            "Acero does not support the specified sort direction: ", dir);

Comment on lines 5257 to 5374
NamedTableProvider table_provider = [&](const std::vector<std::string>& names,
const Schema&) {
std::shared_ptr<acero::ExecNodeOptions> options =
std::make_shared<acero::TableSourceNodeOptions>(input_table);
return acero::Declaration("table_source", {}, options, "mock_source");
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead could we use the following?

Suggested change
NamedTableProvider table_provider = [&](const std::vector<std::string>& names,
const Schema&) {
std::shared_ptr<acero::ExecNodeOptions> options =
std::make_shared<acero::TableSourceNodeOptions>(input_table);
return acero::Declaration("table_source", {}, options, "mock_source");
};
NamedTableProvider table_provider = AlwaysProvideSameTable(std::move(input_table));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good cleanup, thanks. I've switched to this.

Copy link
Collaborator

@vibhatha vibhatha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonpace the PR looks good to me. I have added a few suggestions.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 11, 2023
Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few questions. Mostly looks good.

@westonpace westonpace force-pushed the feature/GH-32763--substrait-fetch-order-by branch from 87451a6 to 1eb683a Compare April 14, 2023 21:52
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 14, 2023
@westonpace
Copy link
Member Author

I'll merge this when green but I don't think there is any significant rush to get this into 12.0.0?

@danepitkin
Copy link
Member

With Arrow v12 released (🎉 ) you should be all clear to merge at your leisure!

@westonpace westonpace force-pushed the feature/GH-32763--substrait-fetch-order-by branch from 1eb683a to 4c457e8 Compare May 15, 2023 19:04
@westonpace
Copy link
Member Author

I've rebased just in case and will let CI run one more time.

@westonpace
Copy link
Member Author

Also, I see I missed some feedback from @vibhatha so I will try and get to that today.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 15, 2023
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments, but I'm not competent enough to comment on the core logic.

namespace {

bool IsSortNullsFirst(const substrait::SortField::SortDirection& direction) {
return direction % 2 == 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, rather than arithmetic on enum values, I'd rather see a proper switch/case statement to make code more readable and maintainable. This is not so performance-sensitive that it must be optimized to the latest nanosecond.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've switched to proper handling of these cases and I return NullPlacement and SortOrder. I agree the code is more readable now.


compute::SortOrder SortOrderFromDirection(
const substrait::SortField::SortDirection& direction) {
if (direction < 3) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if ((null_placement == compute::NullPlacement::AtStart &&
!IsSortNullsFirst(sort.direction())) ||
(null_placement == compute::NullPlacement::AtEnd &&
IsSortNullsFirst(sort.direction()))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can change IsSortNullsFirst to return a compute::NullPlacement directly and this will make these lines a bit simpler (just compare the old null placement with the new one).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've cleaned this up.

@westonpace westonpace force-pushed the feature/GH-32763--substrait-fetch-order-by branch from 4c457e8 to 1e3b763 Compare May 19, 2023 13:16
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels May 19, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels May 19, 2023
@westonpace westonpace requested a review from pitrou May 19, 2023 13:39
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks for the update @westonpace

@pitrou pitrou merged commit fbe0d5f into apache:main May 22, 2023
@ursabot
Copy link

ursabot commented May 31, 2023

Benchmark runs are scheduled for baseline = 2a6848c and contender = fbe0d5f. fbe0d5f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.18% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.98% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.51% ⬆️0.09%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] fbe0d5f1 ec2-t3-xlarge-us-east-2
[Finished] fbe0d5f1 test-mac-arm
[Finished] fbe0d5f1 ursa-i9-9960x
[Finished] fbe0d5f1 ursa-thinkcentre-m75q
[Finished] 2a6848c8 ec2-t3-xlarge-us-east-2
[Finished] 2a6848c8 test-mac-arm
[Finished] 2a6848c8 ursa-i9-9960x
[Finished] 2a6848c8 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Adding Fetch Node FromProto
7 participants