Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add comparison Spark functions #5569

Closed
wants to merge 10 commits into from
Closed

Conversation

yma11
Copy link
Contributor

@yma11 yma11 commented Jul 8, 2023

Compare functions for spark have different rules for value NaN, which is not same as PrestoSQL. This PR provides corresponding vector functions and UTs are added.

@netlify
Copy link

netlify bot commented Jul 8, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit de52e2d
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65095cc1c972070008663d1c

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 8, 2023
@yma11
Copy link
Contributor Author

yma11 commented Jul 10, 2023

@Yuhta can you also help on review this PR? Thanks.


template <typename T>
struct BetweenFunction {
template <typename TInput>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BetweenFunction exists in presto velox/functions/prestosql/Comparisons.h, looks like it is same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I copy them here to remove the dependency of velox/functions/prestosql/Comparisons.h.

makeGreaterThanOrEqual);
// Compare nullsafe functions
exec::registerStatefulVectorFunction(
prefix + "equalnullsafe", equalNullSafeSignatures(), makeEqualNullSafe);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the register equalnullsafe from velox/functions/sparksql/Register.cpp

@@ -40,6 +50,10 @@ void registerCompareFunctions(const std::string& prefix) {
{prefix + "between"});
registerFunction<BetweenFunction, bool, float, float, float>(
{prefix + "between"});
registerFunction<BetweenFunction, bool, int64_t, int64_t, int64_t>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move it to line 49, integral input function should put together.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated with Line 47

protected:
template <typename T>
std::optional<bool> equaltonullsafe(std::optional<T> a, std::optional<T> b) {
return evaluateOnce<bool>("equalnullsafe(c0, c1)", a, b);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated with CompareNullSafeTests.cpp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CompareNullSafeTests.cpp should be removed.

@@ -40,6 +50,10 @@ void registerCompareFunctions(const std::string& prefix) {
{prefix + "between"});
registerFunction<BetweenFunction, bool, float, float, float>(
{prefix + "between"});
registerFunction<BetweenFunction, bool, int64_t, int64_t, int64_t>(
{prefix + "between"});
registerFunction<BetweenFunction, bool, int128_t, int128_t, int128_t>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the document velox/docs/functions/spark/comparison.rst

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are already added in doc, even previously uses Presto implementation. Refer comparison.rst.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

@@ -40,6 +50,10 @@ void registerCompareFunctions(const std::string& prefix) {
{prefix + "between"});
registerFunction<BetweenFunction, bool, float, float, float>(
{prefix + "between"});
registerFunction<BetweenFunction, bool, int64_t, int64_t, int64_t>(
{prefix + "between"});
registerFunction<BetweenFunction, bool, int128_t, int128_t, int128_t>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now int128 only used in long decimal, I suppose it cannot compare like this because the scale maybe different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought is that before doing decimal comparison, scale/precision should have been already unified for lhs and rhs. Isn't it right? For comparison like equal, gt, etc. will pop error if unification not done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense.

@yma11
Copy link
Contributor Author

yma11 commented Jul 10, 2023

@jinchengchenghh Thanks for your nice review. I updated so please take a look again.

@@ -17,6 +17,7 @@ add_library(
ArraySort.cpp
Bitwise.cpp
CompareFunctionsNullSafe.cpp
Copy link
Contributor

@jinchengchenghh jinchengchenghh Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you move CompareFunctionsNullSafe.cpp to Comparisons.cpp? What do you think? @Yuhta

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think we should put them in same file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yuhta @jinchengchenghh , equalnullsafe is now moved into cmp functions family. Please take a further look.

std::shared_ptr<exec::VectorFunction> makeGreaterThan(
const std::string& name,
const std::vector<exec::VectorFunctionArg>& inputArgs,
const core::QueryConfig& /*config*/);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@yma11 yma11 Jul 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

registerBinaryScalar<GteFunction, bool>({prefix + "greaterthanorequal"});

// Register compare functions
exec::registerStatefulVectorFunction(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it still be SimpleFunction?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it is stateless #4029 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are stateless. It has some problems when try to use registerVectorFunction API as there are two input parameters needed as input when constructing the function, which will cause convert failure. It also happens for the existing "EqualNullSafe" and should be why it also use the registerStatefulVectorFunction. By the way, registerVectorFunction actually do a function factory wrapper and then call registerStatefulVectorFunction eventually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yuhta Any suggestion on this open? I see Presto has SIMD implementation for these functions. Maybe we can have such for Spark in later PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any difference between the comparisons in Presto and Spark? Maybe we should move the one in Presto to a common place and reuse them in both engines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Spark comparison functions have some different rule for NaN handling, so they will call Cmp like Less.

registerBinaryScalar<GteFunction, bool>({prefix + "greaterthanorequal"});

// Register compare functions
exec::registerStatefulVectorFunction(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any difference between the comparisons in Presto and Spark? Maybe we should move the one in Presto to a common place and reuse them in both engines.

@@ -6,7 +6,7 @@ Comparison Functions

Returns true if x is within the specified [min, max] range
inclusive. The types of all arguments must be the same.
Supported types are: TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, REAL.
Supported types are: TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, REAL, HUGEINT.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HUGEINT is not a logical type. We probably want to support DECIMAL here

@@ -17,6 +17,7 @@ add_library(
ArraySort.cpp
Bitwise.cpp
CompareFunctionsNullSafe.cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think we should put them in same file

registerBinaryScalar<LteFunction, bool>({prefix + "lessthanorequal"});
registerBinaryScalar<GteFunction, bool>({prefix + "greaterthanorequal"});

// Register compare functions
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we follow the comment style by adding . at the end?

prefix + "greaterthanorequal",
comparisonSignatures(),
makeGreaterThanOrEqual);
// Compare nullsafe functions
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

return BaseVector::wrapInConstant(base->size(), 0, base);
};
// lhs: 0, null, 2, null, 4

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra empty line?

auto makeConstantDic = [&](const VectorPtr& base) {
return BaseVector::wrapInConstant(base->size(), 0, base);
};
// lhs: 0, null, 2, null, 4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please capitalize the first letter and add . at the end.

auto rowVector = makeRowVector({lhsVector, rhsVector});
auto result = evaluate<SimpleVector<bool>>(
fmt::format("{}(c0, c1)", "greaterthan"), rowVector);
// result : false, null, false, null, false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

makeRowVector({makeDictionary(lhs), makeConstantDic(constVector)});
// lhs: 0, null, 2, null, 4
// rhs: const 100
// lessthanorequal result : true, null, true, null, true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

5, [](auto row) { return true; }, nullEvery(2, 1)));
// lhs: const 100
// rhs: 0, null, 2, null, 4
// greaterthanorequal result : true, null, true, null, true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

@yma11 yma11 changed the title Implement Compare functions for Spark Add comparison Spark functions Aug 7, 2023
@yma11
Copy link
Contributor Author

yma11 commented Aug 8, 2023

@rui-mo Thanks for review, your comments are addressed, please confirm. @Yuhta can you help review this PR again? The key point left is whether it's acceptable to use registerStatefulFunction API here. Thanks advance!

@Yuhta Yuhta self-requested a review August 8, 2023 23:08
@Yuhta
Copy link
Contributor

Yuhta commented Aug 22, 2023

@kgpai Can you help review this one as you wrote the counterpart in Presto?

* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename the file to ComparisonTest.cpp

@kgpai
Copy link
Contributor

kgpai commented Aug 29, 2023

Looking.

} else {
rows.applyToSelected([&](auto i) {
flatResult->set(
i, cmp(decodedArg0->valueAt<T>(i), decodedArg1->valueAt<T>(i)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When both the args are constants or identity mapping we can not use valueAt and compare buffers - presto uses simd to compare(https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/Comparisons.cpp) but atleast bypassing valueAt would be a good start.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Please take a look again. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgpai Have time to take a further look? There is a function signature mismatch in CI but I think it's expected?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will have a look soon !

Copy link
Contributor

@kgpai kgpai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Some nits.

flatResult->set(i, cmp(rawValues[i], constant));
});
} else {
// Fast path if one or more arguments are encoded.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Once we have to decode its not really fast path, just change it to Path if one or more arguments are encoded.

} else {
// Fast path if one or more arguments are encoded.
exec::DecodedArgs decodedArgs(rows, args, context);
auto decoded0 = decodedArgs.at(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for consistencies sake, I would also name this decodedA, decodedB like you have rawA, rawB above (or even decodedLhs, decodedRhs ).

}
};

// BoolComparisonFunction for bool as it uses compact representation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can you change comment to say , 'ComparisonFunction instance for bool as it uses compact representation'.

void applyTyped(
const SelectivityVector& rows,
std::vector<VectorPtr>& args,
DecodedVector* decoded0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Instead of naming decoded0,1 , can you name it decodedLhs, decodedRhs etc.

auto isNull1 = rawNulls1 && bits::isBitNull(rawNulls1, i);
flatResult->set(
i,
(isNull0 || isNull1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure this is spark semantics , if any of values is Null, then its only true if both are null , right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in Spark, it will return true if both sides are null and return false if only one side is null.

DecodedVector* /* decoded1 */,
exec::EvalCtx& /* context */,
FlatVector<bool>* /* flatResult */) {
VELOX_NYI("equalnullsafe does not support arrays.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: change name to equaltonullsafe ?

@kgpai
Copy link
Contributor

kgpai commented Sep 13, 2023

Can you check whether the fuzzer failure run is related to these changes?

@kgpai
Copy link
Contributor

kgpai commented Sep 18, 2023

@yma11 Please let me know when the CI runs are green and I can review again/start the merge process. Thanks !

@kgpai
Copy link
Contributor

kgpai commented Sep 19, 2023

@yma11 Can you fix the mac build failures in ci ; I can validate the signature checks after that.

@yma11
Copy link
Contributor Author

yma11 commented Sep 20, 2023

@yma11 Can you fix the mac build failures in ci ; I can validate the signature checks after that.

@kgpai Thanks for your review. The mac build passes now.

@yma11
Copy link
Contributor Author

yma11 commented Sep 27, 2023

@yma11 Can you fix the mac build failures in ci ; I can validate the signature checks after that.

@kgpai Thanks for your review. The mac build passes now.

@kgpai can you help do final check?

@facebook-github-bot
Copy link
Contributor

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@kgpai merged this pull request in 6b6dd58.

@conbench-facebook
Copy link

Conbench analyzed the 1 benchmark run on commit 6b6dd58d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

ericyuliu pushed a commit to ericyuliu/velox that referenced this pull request Oct 12, 2023
Summary:
Compare functions for spark have different rules for value `NaN`, which is not same as PrestoSQL. This PR provides corresponding vector functions and UTs are added.

Pull Request resolved: facebookincubator#5569

Reviewed By: xiaoxmeng

Differential Revision: D49676085

Pulled By: kgpai

fbshipit-source-id: 2fc5c7741a620b0be77d571b4828cc7df3557f9b
@yma11 yma11 deleted the compare-func branch January 24, 2024 05:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants