Add comparison Spark functions #5569

yma11 · 2023-07-08T02:09:43Z

Spark's comparison functions handle NaN differently from PrestoSQL's. This PR adds the corresponding vector functions along with unit tests.
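As a rough illustration of the difference, Spark documents that NaN equals NaN and that NaN sorts above every other numeric value, while plain IEEE comparison treats NaN as unordered. The sketch below is standalone C++ and does not use the Velox API; the helper names `sparkEqual`, `sparkLess` and `sparkGreater` are hypothetical.

```cpp
#include <cmath>

// Hypothetical helpers (not Velox APIs) that follow Spark's documented NaN
// rules: NaN equals NaN, and NaN is larger than any other value.
template <typename T>
bool sparkEqual(T a, T b) {
  if (std::isnan(a)) {
    return std::isnan(b); // NaN == NaN is true; NaN == x is false.
  }
  return a == b; // Also false when only b is NaN.
}

template <typename T>
bool sparkLess(T a, T b) {
  if (std::isnan(a)) {
    return false; // NaN is never less than anything, including NaN.
  }
  if (std::isnan(b)) {
    return true; // Every non-NaN value is less than NaN.
  }
  return a < b;
}

template <typename T>
bool sparkGreater(T a, T b) {
  return sparkLess(b, a);
}
```

With these helpers, `sparkGreater(NaN, +infinity)` is true, whereas the built-in `>` operator would return false for the same operands.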

yma11 · 2023-07-10T05:05:06Z

@Yuhta can you also help review this PR? Thanks.

velox/functions/sparksql/Comparisons.cpp

velox/functions/sparksql/tests/CompareTests.cpp

velox/functions/sparksql/Comparisons.h

jinchengchenghh · 2023-07-10T07:35:18Z

velox/functions/sparksql/Comparisons.h

+
+template <typename T>
+struct BetweenFunction {
+  template <typename TInput>


BetweenFunction already exists for Presto in velox/functions/prestosql/Comparisons.h. It looks like this one is the same?

Yes. I copied it here to remove the dependency on velox/functions/prestosql/Comparisons.h.

jinchengchenghh · 2023-07-10T07:36:57Z

velox/functions/sparksql/RegisterCompare.cpp

+      makeGreaterThanOrEqual);
+  // Compare nullsafe functions
+  exec::registerStatefulVectorFunction(
+      prefix + "equalnullsafe", equalNullSafeSignatures(), makeEqualNullSafe);


Please remove the equalnullsafe registration from velox/functions/sparksql/Register.cpp.

jinchengchenghh · 2023-07-10T07:38:00Z

velox/functions/sparksql/RegisterCompare.cpp

@@ -40,6 +50,10 @@ void registerCompareFunctions(const std::string& prefix) {
      {prefix + "between"});
  registerFunction<BetweenFunction, bool, float, float, float>(
      {prefix + "between"});
+  registerFunction<BetweenFunction, bool, int64_t, int64_t, int64_t>(


Please move it to line 49; the functions for integral inputs should be grouped together.

This duplicates line 47.

jinchengchenghh · 2023-07-10T07:39:22Z

velox/functions/sparksql/tests/CompareTests.cpp

+ protected:
+  template <typename T>
+  std::optional<bool> equaltonullsafe(std::optional<T> a, std::optional<T> b) {
+    return evaluateOnce<bool>("equalnullsafe(c0, c1)", a, b);


Duplicated with CompareNullSafeTests.cpp?

CompareNullSafeTests.cpp should be removed.

jinchengchenghh · 2023-07-10T07:46:53Z

velox/functions/sparksql/RegisterCompare.cpp

@@ -40,6 +50,10 @@ void registerCompareFunctions(const std::string& prefix) {
      {prefix + "between"});
  registerFunction<BetweenFunction, bool, float, float, float>(
      {prefix + "between"});
+  registerFunction<BetweenFunction, bool, int64_t, int64_t, int64_t>(
+      {prefix + "between"});
+  registerFunction<BetweenFunction, bool, int128_t, int128_t, int128_t>(


Please update the document velox/docs/functions/spark/comparison.rst.

These functions are already documented, even though they previously used the Presto implementation. See comparison.rst.

Please update the supported types: https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/spark/comparison.rst?plain=1#L9

jinchengchenghh · 2023-07-10T07:52:49Z

velox/functions/sparksql/RegisterCompare.cpp

@@ -40,6 +50,10 @@ void registerCompareFunctions(const std::string& prefix) {
      {prefix + "between"});
  registerFunction<BetweenFunction, bool, float, float, float>(
      {prefix + "between"});
+  registerFunction<BetweenFunction, bool, int64_t, int64_t, int64_t>(
+      {prefix + "between"});
+  registerFunction<BetweenFunction, bool, int128_t, int128_t, int128_t>(


Currently int128 is only used for long decimals. I suppose they cannot be compared like this because the scales may differ.

My understanding is that the scale and precision of lhs and rhs should already have been unified before a decimal comparison takes place. Is that right? Comparisons such as equal, gt, etc. will raise an error if that unification has not been done.

Makes sense.
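To make the scale concern concrete, here is a small standalone sketch. It assumes a hypothetical `Decimal` struct holding an unscaled integer and a scale, and uses `int64_t` instead of a 128-bit integer for portability; it is not how Velox represents decimals. Comparing unscaled values directly is only meaningful once both sides share the same scale.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decimal representation: value = unscaled / 10^scale.
struct Decimal {
  int64_t unscaled;
  int scale;
};

// Rescales `d` up to `targetScale`, which must be >= d.scale.
// Overflow is ignored in this sketch.
inline Decimal rescale(Decimal d, int targetScale) {
  assert(targetScale >= d.scale);
  for (int s = d.scale; s < targetScale; ++s) {
    d.unscaled *= 10;
  }
  d.scale = targetScale;
  return d;
}

// Compares two decimals after bringing both to the larger of the two scales.
inline bool decimalLess(Decimal a, Decimal b) {
  const int scale = a.scale > b.scale ? a.scale : b.scale;
  return rescale(a, scale).unscaled < rescale(b, scale).unscaled;
}
```

For example, 1.5 stored as `{15, 1}` and 1.50 stored as `{150, 2}` are equal, yet a raw comparison of the unscaled values would report 15 < 150. Unifying the scale first avoids that.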

yma11 · 2023-07-10T11:19:09Z

@jinchengchenghh Thanks for your thorough review. I have updated the PR, so please take another look.

jinchengchenghh · 2023-07-11T01:27:23Z

velox/functions/sparksql/CMakeLists.txt

@@ -17,6 +17,7 @@ add_library(
  ArraySort.cpp
  Bitwise.cpp
  CompareFunctionsNullSafe.cpp


Would you move the contents of CompareFunctionsNullSafe.cpp into Comparisons.cpp? What do you think? @Yuhta

Yes, I think we should put them in the same file.

@Yuhta @jinchengchenghh, equalnullsafe has now been moved into the cmp function family. Please take another look.

jinchengchenghh · 2023-07-11T01:37:30Z

velox/functions/sparksql/Comparisons.h

+std::shared_ptr<exec::VectorFunction> makeGreaterThan(
+    const std::string& name,
+    const std::vector<exec::VectorFunctionArg>& inputArgs,
+    const core::QueryConfig& /*config*/);


Please don't comment out argument names in header files. See this example: https://github.com/facebookincubator/velox/blob/main/velox/functions/sparksql/RegexFunctions.h#L56

Is this a common practice or part of the C++ standard? I found such comments in header files like https://github.com/facebookincubator/velox/blob/main/velox/functions/sparksql/DateTimeFunctions.h and https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/HyperLogLogFunctions.h.

jinchengchenghh · 2023-07-12T08:28:11Z

velox/functions/sparksql/RegisterCompare.cpp

-  registerBinaryScalar<GteFunction, bool>({prefix + "greaterthanorequal"});
-
+  // Register compare functions
+  exec::registerStatefulVectorFunction(


Can it still be a SimpleFunction?

Looks like it is stateless #4029 (comment)

These functions are stateless. However, using the registerVectorFunction API is problematic: constructing the function requires two input parameters, which causes a conversion failure. The same problem applies to the existing "EqualNullSafe", which is probably why it also uses registerStatefulVectorFunction. By the way, registerVectorFunction just wraps the function in a factory and eventually calls registerStatefulVectorFunction.

@Yuhta Any suggestions on this open question? I see that Presto has a SIMD implementation for these functions. Maybe we can add one for Spark in a later PR.

Is there any difference between the comparisons in Presto and Spark? Maybe we should move the one in Presto to a common place and reuse them in both engines.

Yes. Spark comparison functions have different rules for NaN handling, so they go through a Cmp functor such as Less.
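The relationship described in this thread, where the stateless registration wraps an instance in a factory and delegates to the stateful registration, can be sketched with a toy registry. Everything below (`ToyFunction`, `ToyFactory`, `registerToy`, `registerStatefulToy`, `makeToyLess`) is hypothetical and only mirrors the shape of the idea; it is not the Velox API.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins; none of these are Velox types.
struct ToyFunction {
  virtual ~ToyFunction() = default;
  virtual bool apply(double a, double b) const = 0;
};

// A factory receives the function name and a description of the arguments.
using ToyFactory = std::function<std::shared_ptr<ToyFunction>(
    const std::string& name,
    const std::vector<std::string>& argTypes)>;

inline std::map<std::string, ToyFactory>& toyRegistry() {
  static std::map<std::string, ToyFactory> registry;
  return registry;
}

// "Stateful" entry point: stores the factory itself.
inline void registerStatefulToy(const std::string& name, ToyFactory factory) {
  toyRegistry()[name] = std::move(factory);
}

// "Stateless" entry point: wraps a ready-made instance in a factory that
// ignores its inputs, then delegates to the stateful entry point.
inline void registerToy(
    const std::string& name,
    std::shared_ptr<ToyFunction> fn) {
  registerStatefulToy(
      name,
      [fn](const std::string&, const std::vector<std::string>&) {
        return fn;
      });
}

struct ToyLess : ToyFunction {
  bool apply(double a, double b) const override {
    return a < b;
  }
};

// A factory that could pick an implementation from the argument types.
// Here it always returns the same comparator.
inline std::shared_ptr<ToyFunction> makeToyLess(
    const std::string& /*name*/,
    const std::vector<std::string>& /*argTypes*/) {
  return std::make_shared<ToyLess>();
}
```

In this toy model, `registerStatefulToy("less", makeToyLess)` lets the factory inspect its inputs before building the function, which is the kind of construction-time input the stateless path has no place for.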

Yuhta · 2023-07-17T14:40:30Z

velox/functions/sparksql/RegisterCompare.cpp

-  registerBinaryScalar<GteFunction, bool>({prefix + "greaterthanorequal"});
-
+  // Register compare functions
+  exec::registerStatefulVectorFunction(


Is there any difference between the comparisons in Presto and Spark? Maybe we should move the one in Presto to a common place and reuse them in both engines.

Yuhta · 2023-07-17T14:41:09Z

velox/docs/functions/spark/comparison.rst

@@ -6,7 +6,7 @@ Comparison Functions

    Returns true if x is within the specified [min, max] range
    inclusive. The types of all arguments must be the same.
-    Supported types are: TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, REAL.
+    Supported types are: TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, REAL, HUGEINT.


HUGEINT is not a logical type. We probably want to support DECIMAL here

Yuhta · 2023-07-17T14:41:26Z

velox/functions/sparksql/CMakeLists.txt

@@ -17,6 +17,7 @@ add_library(
  ArraySort.cpp
  Bitwise.cpp
  CompareFunctionsNullSafe.cpp


Yes, I think we should put them in the same file.

rui-mo · 2023-08-04T00:32:44Z

velox/functions/sparksql/RegisterCompare.cpp

-  registerBinaryScalar<LteFunction, bool>({prefix + "lessthanorequal"});
-  registerBinaryScalar<GteFunction, bool>({prefix + "greaterthanorequal"});
-
+  // Register compare functions


Nit: can we follow the comment style by adding . at the end?

rui-mo · 2023-08-04T00:33:06Z

velox/functions/sparksql/RegisterCompare.cpp

+      prefix + "greaterthanorequal",
+      comparisonSignatures(),
+      makeGreaterThanOrEqual);
+  // Compare nullsafe functions


rui-mo · 2023-08-04T00:34:03Z

velox/functions/sparksql/tests/CompareTests.cpp

+    return BaseVector::wrapInConstant(base->size(), 0, base);
+  };
+  // lhs: 0, null, 2, null, 4
+


Extra empty line?

rui-mo · 2023-08-04T00:34:35Z

velox/functions/sparksql/tests/CompareTests.cpp

+  auto makeConstantDic = [&](const VectorPtr& base) {
+    return BaseVector::wrapInConstant(base->size(), 0, base);
+  };
+  // lhs: 0, null, 2, null, 4


Please capitalize the first letter and add . at the end.

rui-mo · 2023-08-04T00:34:45Z

velox/functions/sparksql/tests/CompareTests.cpp

+  auto rowVector = makeRowVector({lhsVector, rhsVector});
+  auto result = evaluate<SimpleVector<bool>>(
+      fmt::format("{}(c0, c1)", "greaterthan"), rowVector);
+  // result : false, null, false, null, false


rui-mo · 2023-08-04T00:34:53Z

velox/functions/sparksql/tests/CompareTests.cpp

+      makeRowVector({makeDictionary(lhs), makeConstantDic(constVector)});
+  // lhs: 0, null, 2, null, 4
+  // rhs: const 100
+  // lessthanorequal result : true, null, true, null, true


rui-mo · 2023-08-04T00:35:00Z

velox/functions/sparksql/tests/CompareTests.cpp

+          5, [](auto row) { return true; }, nullEvery(2, 1)));
+  // lhs: const 100
+  // rhs: 0, null, 2, null, 4
+  // greaterthanorequal result : true, null, true, null, true


yma11 · 2023-08-08T01:46:44Z

@rui-mo Thanks for the review. Your comments have been addressed, please confirm. @Yuhta can you help review this PR again? The key remaining question is whether it's acceptable to use the registerStatefulVectorFunction API here. Thanks in advance!

Yuhta · 2023-08-22T15:14:37Z

@kgpai Can you help review this one as you wrote the counterpart in Presto?

jinchengchenghh · 2023-08-24T00:29:50Z

velox/functions/sparksql/tests/CompareTests.cpp

+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,


Please rename the file to ComparisonTest.cpp

kgpai · 2023-08-29T01:46:19Z

Looking.

kgpai · 2023-08-29T02:10:36Z

velox/functions/sparksql/Comparisons.cpp

+    } else {
+      rows.applyToSelected([&](auto i) {
+        flatResult->set(
+            i, cmp(decodedArg0->valueAt<T>(i), decodedArg1->valueAt<T>(i)));


When both args are constant or identity-mapped, we can skip valueAt and compare the raw buffers directly. Presto uses SIMD for the comparison (https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/Comparisons.cpp), but at least bypassing valueAt would be a good start.

Updated. Please take a look again. Thanks.

@kgpai Do you have time to take another look? There is a function signature mismatch in CI, but I think that's expected?

Will have a look soon !
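The fast path suggested in this thread can be sketched in standalone C++. The `ToyDecoded` struct below is a hypothetical stand-in for a decoded vector (it is not a Velox type), and nulls are ignored for brevity. When both inputs are flat, or one is flat and the other constant, the loop reads raw buffers directly; otherwise it falls back to per-row `valueAt` lookups.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Toy decoded view; not a Velox type. Logical row i maps to base[indices[i]].
struct ToyDecoded {
  const int64_t* base;
  const int32_t* indices; // nullptr means identity mapping (flat).
  bool isConstant;

  bool isFlat() const {
    return !isConstant && indices == nullptr;
  }

  int64_t valueAt(std::size_t row) const {
    if (isConstant) {
      return base[0];
    }
    return indices ? base[indices[row]] : base[row];
  }
};

template <typename Cmp>
std::vector<bool> compareRows(
    const ToyDecoded& lhs,
    const ToyDecoded& rhs,
    std::size_t numRows,
    Cmp cmp) {
  std::vector<bool> result(numRows);
  if (lhs.isFlat() && rhs.isFlat()) {
    // Fast path: both sides are flat, so compare the raw buffers.
    const int64_t* rawA = lhs.base;
    const int64_t* rawB = rhs.base;
    for (std::size_t i = 0; i < numRows; ++i) {
      result[i] = cmp(rawA[i], rawB[i]);
    }
  } else if (lhs.isFlat() && rhs.isConstant) {
    // Fast path: flat values against a single constant.
    const int64_t* rawValues = lhs.base;
    const int64_t constant = rhs.base[0];
    for (std::size_t i = 0; i < numRows; ++i) {
      result[i] = cmp(rawValues[i], constant);
    }
  } else {
    // General path: at least one side needs index translation.
    for (std::size_t i = 0; i < numRows; ++i) {
      result[i] = cmp(lhs.valueAt(i), rhs.valueAt(i));
    }
  }
  return result;
}

// Usage: the fast and general paths should produce identical results.
inline bool demoPathsAgree() {
  const int64_t a[] = {1, 5, 3};
  const int64_t b[] = {2, 4, 3};
  const int32_t identity[] = {0, 1, 2};
  ToyDecoded flatA{a, nullptr, false};
  ToyDecoded flatB{b, nullptr, false};
  ToyDecoded dictB{b, identity, false}; // Same rows, forces the general path.
  auto fast = compareRows(flatA, flatB, 3, std::less<int64_t>());
  auto slow = compareRows(flatA, dictB, 3, std::less<int64_t>());
  return fast == slow && fast == std::vector<bool>{true, false, false};
}
```

The raw-buffer loops are also the natural place to introduce SIMD later, since they touch contiguous memory without indirection.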

kgpai

Looks good to me. Some nits.

kgpai · 2023-09-12T23:38:51Z

velox/functions/sparksql/Comparisons.cpp

+        flatResult->set(i, cmp(rawValues[i], constant));
+      });
+    } else {
+      // Fast path if one or more arguments are encoded.


nit: Once we have to decode, it's not really a fast path. Just change it to "Path if one or more arguments are encoded."

kgpai · 2023-09-12T23:39:51Z

velox/functions/sparksql/Comparisons.cpp

+    } else {
+      // Fast path if one or more arguments are encoded.
+      exec::DecodedArgs decodedArgs(rows, args, context);
+      auto decoded0 = decodedArgs.at(0);


nit: for consistency's sake, I would also name these decodedA, decodedB, as you have rawA, rawB above (or even decodedLhs, decodedRhs).

kgpai · 2023-09-12T23:45:09Z

velox/functions/sparksql/Comparisons.cpp

+  }
+};
+
+// BoolComparisonFunction for bool as it uses compact representation


nit: Can you change the comment to say 'ComparisonFunction instance for bool as it uses compact representation'?
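For context on why bool needs its own instance: a compact representation stores one bit per row rather than one byte, so values cannot be read as a plain `bool` array. The following standalone sketch assumes a packed layout in 64-bit words with row i at bit `i % 64` of word `i / 64`; the helper names are hypothetical and this is not the Velox implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Reads row `i` from a bit-packed boolean buffer.
inline bool isBitSet(const uint64_t* bits, std::size_t i) {
  return (bits[i / 64] >> (i % 64)) & 1ULL;
}

// Writes row `i` of a bit-packed boolean buffer.
inline void setBit(uint64_t* bits, std::size_t i, bool value) {
  const uint64_t mask = 1ULL << (i % 64);
  if (value) {
    bits[i / 64] |= mask;
  } else {
    bits[i / 64] &= ~mask;
  }
}

// Computes out[i] = lhs[i] < rhs[i] for packed bools, where false < true.
inline void lessThanPacked(
    const uint64_t* lhs,
    const uint64_t* rhs,
    uint64_t* out,
    std::size_t numRows) {
  for (std::size_t i = 0; i < numRows; ++i) {
    setBit(out, i, !isBitSet(lhs, i) && isBitSet(rhs, i));
  }
}

// Usage: only row 1 has lhs == false and rhs == true.
inline bool demoPackedLess() {
  uint64_t lhs[1] = {0b0101}; // Rows 0..3: true, false, true, false.
  uint64_t rhs[1] = {0b0110}; // Rows 0..3: false, true, true, false.
  uint64_t out[1] = {0};
  lessThanPacked(lhs, rhs, out, 4);
  return out[0] == 0b0010;
}
```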

kgpai · 2023-09-13T00:01:05Z

velox/functions/sparksql/Comparisons.cpp

+void applyTyped(
+    const SelectivityVector& rows,
+    std::vector<VectorPtr>& args,
+    DecodedVector* decoded0,


nit: Instead of naming them decoded0, decoded1, can you name them decodedLhs, decodedRhs, etc.?

kgpai · 2023-09-13T00:03:47Z

velox/functions/sparksql/Comparisons.cpp

+      auto isNull1 = rawNulls1 && bits::isBitNull(rawNulls1, i);
+      flatResult->set(
+          i,
+          (isNull0 || isNull1)


I want to make sure this matches Spark semantics: if either value is null, the result is true only if both are null, right?

Yes, in Spark, it will return true if both sides are null and return false if only one side is null.
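The semantics confirmed here can be written down in a few lines. This is a minimal standalone sketch using `std::optional` to model nullable values; the free function `equalNullSafe` is illustrative and not the Velox implementation. For simplicity it uses the built-in `==` on non-null values, so Spark's separate NaN rules are not modeled.

```cpp
#include <optional>

// Null-safe equality: two nulls are equal, a null and a non-null value are
// not, and two non-null values are compared normally. The result is never
// null.
template <typename T>
bool equalNullSafe(const std::optional<T>& a, const std::optional<T>& b) {
  if (!a.has_value() || !b.has_value()) {
    return a.has_value() == b.has_value();
  }
  return *a == *b;
}
```

For example, `equalNullSafe<int>(std::nullopt, std::nullopt)` is true and `equalNullSafe<int>(1, std::nullopt)` is false, whereas a regular equality comparison would yield null in both cases.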

kgpai · 2023-09-13T00:09:37Z

velox/functions/sparksql/Comparisons.cpp

+    DecodedVector* /* decoded1 */,
+    exec::EvalCtx& /* context */,
+    FlatVector<bool>* /* flatResult */) {
+  VELOX_NYI("equalnullsafe does not support arrays.");


nit: change name to equaltonullsafe ?

kgpai · 2023-09-13T00:22:23Z

Can you check whether the fuzzer failure run is related to these changes?

kgpai · 2023-09-18T15:13:36Z

@yma11 Please let me know when the CI runs are green and I can review again/start the merge process. Thanks !

kgpai · 2023-09-19T06:05:01Z

@yma11 Can you fix the mac build failures in ci ; I can validate the signature checks after that.

yma11 · 2023-09-20T01:10:24Z

@yma11 Can you fix the mac build failures in ci ; I can validate the signature checks after that.

@kgpai Thanks for your review. The mac build passes now.

yma11 · 2023-09-27T07:04:55Z

@yma11 Can you fix the mac build failures in ci ; I can validate the signature checks after that.

@kgpai Thanks for your review. The mac build passes now.

@kgpai can you help do final check?

facebook-github-bot · 2023-09-27T07:17:45Z

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2023-09-28T08:43:12Z

@kgpai merged this pull request in 6b6dd58.

conbench-facebook · 2023-09-28T09:02:35Z

Conbench analyzed the 1 benchmark run on commit 6b6dd58d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Summary: Compare functions for spark have different rules for value `NaN`, which is not same as PrestoSQL. This PR provides corresponding vector functions and UTs are added.

Pull Request resolved: facebookincubator#5569

Reviewed By: xiaoxmeng

Differential Revision: D49676085

Pulled By: kgpai

fbshipit-source-id: 2fc5c7741a620b0be77d571b4828cc7df3557f9b

facebook-github-bot added the CLA Signed label Jul 8, 2023

yma11 force-pushed the compare-func branch from 4872bc9 to 1c31582 July 10, 2023 02:57