-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add comparison Spark functions #5569
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@Yuhta can you also help on review this PR? Thanks. |
|
||
template <typename T> | ||
struct BetweenFunction { | ||
template <typename TInput> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BetweenFunction
exists in presto velox/functions/prestosql/Comparisons.h
, looks like it is same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I copy them here to remove the dependency of velox/functions/prestosql/Comparisons.h
.
makeGreaterThanOrEqual); | ||
// Compare nullsafe functions | ||
exec::registerStatefulVectorFunction( | ||
prefix + "equalnullsafe", equalNullSafeSignatures(), makeEqualNullSafe); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the register equalnullsafe from velox/functions/sparksql/Register.cpp
@@ -40,6 +50,10 @@ void registerCompareFunctions(const std::string& prefix) { | |||
{prefix + "between"}); | |||
registerFunction<BetweenFunction, bool, float, float, float>( | |||
{prefix + "between"}); | |||
registerFunction<BetweenFunction, bool, int64_t, int64_t, int64_t>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move it to line 49, integral input function should put together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicated with Line 47
protected: | ||
template <typename T> | ||
std::optional<bool> equaltonullsafe(std::optional<T> a, std::optional<T> b) { | ||
return evaluateOnce<bool>("equalnullsafe(c0, c1)", a, b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicated with CompareNullSafeTests.cpp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CompareNullSafeTests.cpp
should be removed.
@@ -40,6 +50,10 @@ void registerCompareFunctions(const std::string& prefix) { | |||
{prefix + "between"}); | |||
registerFunction<BetweenFunction, bool, float, float, float>( | |||
{prefix + "between"}); | |||
registerFunction<BetweenFunction, bool, int64_t, int64_t, int64_t>( | |||
{prefix + "between"}); | |||
registerFunction<BetweenFunction, bool, int128_t, int128_t, int128_t>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the document velox/docs/functions/spark/comparison.rst
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions are already added in doc, even previously uses Presto implementation. Refer comparison.rst.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the supported type https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/spark/comparison.rst?plain=1#L9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
@@ -40,6 +50,10 @@ void registerCompareFunctions(const std::string& prefix) { | |||
{prefix + "between"}); | |||
registerFunction<BetweenFunction, bool, float, float, float>( | |||
{prefix + "between"}); | |||
registerFunction<BetweenFunction, bool, int64_t, int64_t, int64_t>( | |||
{prefix + "between"}); | |||
registerFunction<BetweenFunction, bool, int128_t, int128_t, int128_t>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now int128 only used in long decimal, I suppose it cannot compare like this because the scale maybe different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought is that before doing decimal comparison, scale/precision should have been already unified for lhs
and rhs
. Isn't it right? For comparison like equal
, gt
, etc. will pop error if unification not done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense.
@jinchengchenghh Thanks for your nice review. I updated so please take a look again. |
@@ -17,6 +17,7 @@ add_library( | |||
ArraySort.cpp | |||
Bitwise.cpp | |||
CompareFunctionsNullSafe.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you move CompareFunctionsNullSafe.cpp
to Comparisons.cpp
? What do you think? @Yuhta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think we should put them in same file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta @jinchengchenghh , equalnullsafe
is now moved into cmp functions family. Please take a further look.
std::shared_ptr<exec::VectorFunction> makeGreaterThan( | ||
const std::string& name, | ||
const std::vector<exec::VectorFunctionArg>& inputArgs, | ||
const core::QueryConfig& /*config*/); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't comment argument in header file, example is this one https://github.com/facebookincubator/velox/blob/main/velox/functions/sparksql/RegexFunctions.h#L56
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a common practice or C++ standard? I found there are such comments in header files like https://github.com/facebookincubator/velox/blob/main/velox/functions/sparksql/DateTimeFunctions.h and https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/HyperLogLogFunctions.h.
registerBinaryScalar<GteFunction, bool>({prefix + "greaterthanorequal"}); | ||
|
||
// Register compare functions | ||
exec::registerStatefulVectorFunction( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it still be SimpleFunction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it is stateless #4029 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions are stateless. It has some problems when try to use registerVectorFunction
API as there are two input parameters needed as input when constructing the function, which will cause convert failure. It also happens for the existing "EqualNullSafe" and should be why it also use the registerStatefulVectorFunction
. By the way, registerVectorFunction
actually do a function factory wrapper and then call registerStatefulVectorFunction
eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta Any suggestion on this open? I see Presto has SIMD implementation for these functions. Maybe we can have such for Spark in later PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any difference between the comparisons in Presto and Spark? Maybe we should move the one in Presto to a common place and reuse them in both engines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Spark comparison functions have some different rule for NaN
handling, so they will call Cmp
like Less.
registerBinaryScalar<GteFunction, bool>({prefix + "greaterthanorequal"}); | ||
|
||
// Register compare functions | ||
exec::registerStatefulVectorFunction( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any difference between the comparisons in Presto and Spark? Maybe we should move the one in Presto to a common place and reuse them in both engines.
@@ -6,7 +6,7 @@ Comparison Functions | |||
|
|||
Returns true if x is within the specified [min, max] range | |||
inclusive. The types of all arguments must be the same. | |||
Supported types are: TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, REAL. | |||
Supported types are: TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, REAL, HUGEINT. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HUGEINT
is not a logical type. We probably want to support DECIMAL
here
@@ -17,6 +17,7 @@ add_library( | |||
ArraySort.cpp | |||
Bitwise.cpp | |||
CompareFunctionsNullSafe.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think we should put them in same file
registerBinaryScalar<LteFunction, bool>({prefix + "lessthanorequal"}); | ||
registerBinaryScalar<GteFunction, bool>({prefix + "greaterthanorequal"}); | ||
|
||
// Register compare functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we follow the comment style by adding .
at the end?
prefix + "greaterthanorequal", | ||
comparisonSignatures(), | ||
makeGreaterThanOrEqual); | ||
// Compare nullsafe functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.
return BaseVector::wrapInConstant(base->size(), 0, base); | ||
}; | ||
// lhs: 0, null, 2, null, 4 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra empty line?
auto makeConstantDic = [&](const VectorPtr& base) { | ||
return BaseVector::wrapInConstant(base->size(), 0, base); | ||
}; | ||
// lhs: 0, null, 2, null, 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please capitalize the first letter and add .
at the end.
auto rowVector = makeRowVector({lhsVector, rhsVector}); | ||
auto result = evaluate<SimpleVector<bool>>( | ||
fmt::format("{}(c0, c1)", "greaterthan"), rowVector); | ||
// result : false, null, false, null, false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.
makeRowVector({makeDictionary(lhs), makeConstantDic(constVector)}); | ||
// lhs: 0, null, 2, null, 4 | ||
// rhs: const 100 | ||
// lessthanorequal result : true, null, true, null, true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.
5, [](auto row) { return true; }, nullEvery(2, 1))); | ||
// lhs: const 100 | ||
// rhs: 0, null, 2, null, 4 | ||
// greaterthanorequal result : true, null, true, null, true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.
@kgpai Can you help review this one as you wrote the counterpart in Presto? |
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename the file to ComparisonTest.cpp
Looking. |
} else { | ||
rows.applyToSelected([&](auto i) { | ||
flatResult->set( | ||
i, cmp(decodedArg0->valueAt<T>(i), decodedArg1->valueAt<T>(i))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When both the args are constants or identity mapping we can not use valueAt and compare buffers - presto uses simd to compare(https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/Comparisons.cpp) but atleast bypassing valueAt would be a good start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Please take a look again. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgpai Have time to take a further look? There is a function signature mismatch in CI but I think it's expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will have a look soon !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Some nits.
flatResult->set(i, cmp(rawValues[i], constant)); | ||
}); | ||
} else { | ||
// Fast path if one or more arguments are encoded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Once we have to decode its not really fast path, just change it to Path if one or more arguments are encoded.
} else { | ||
// Fast path if one or more arguments are encoded. | ||
exec::DecodedArgs decodedArgs(rows, args, context); | ||
auto decoded0 = decodedArgs.at(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: for consistencies sake, I would also name this decodedA, decodedB like you have rawA, rawB above (or even decodedLhs, decodedRhs ).
} | ||
}; | ||
|
||
// BoolComparisonFunction for bool as it uses compact representation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can you change comment to say , 'ComparisonFunction instance for bool as it uses compact representation'.
void applyTyped( | ||
const SelectivityVector& rows, | ||
std::vector<VectorPtr>& args, | ||
DecodedVector* decoded0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Instead of naming decoded0,1 , can you name it decodedLhs, decodedRhs etc.
auto isNull1 = rawNulls1 && bits::isBitNull(rawNulls1, i); | ||
flatResult->set( | ||
i, | ||
(isNull0 || isNull1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to make sure this is spark semantics , if any of values is Null, then its only true if both are null , right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in Spark, it will return true if both sides are null and return false if only one side is null.
DecodedVector* /* decoded1 */, | ||
exec::EvalCtx& /* context */, | ||
FlatVector<bool>* /* flatResult */) { | ||
VELOX_NYI("equalnullsafe does not support arrays."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: change name to equaltonullsafe ?
Can you check whether the fuzzer failure run is related to these changes? |
@yma11 Please let me know when the CI runs are green and I can review again/start the merge process. Thanks ! |
@yma11 Can you fix the mac build failures in ci ; I can validate the signature checks after that. |
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Summary: Compare functions for spark have different rules for value `NaN`, which is not same as PrestoSQL. This PR provides corresponding vector functions and UTs are added. Pull Request resolved: facebookincubator#5569 Reviewed By: xiaoxmeng Differential Revision: D49676085 Pulled By: kgpai fbshipit-source-id: 2fc5c7741a620b0be77d571b4828cc7df3557f9b
Compare functions for spark have different rules for value
NaN
, which is not same as PrestoSQL. This PR provides corresponding vector functions and UTs are added.