Diff on parquet filter agg #11257

Open
zml1206 opened this issue Oct 15, 2024 · 7 comments
Labels
bug (Something isn't working), gcc, triage (Newly created issue that needs attention)

Comments

@zml1206
Contributor

zml1206 commented Oct 15, 2024

Bug description

Writing the parquet file requires disabling Gluten:

spark.sql("set spark.gluten.enabled=false")
spark.range(1000).selectExpr("id%2 as c1", "id%5 as c2", "id as c3").write.mode("overwrite").parquet("tmp/t1")
spark.sql("set spark.gluten.enabled=true")
spark.read.parquet("tmp/t1").createOrReplaceTempView("t1")
spark.sql("select c2, sum(c3) from t1 where c1 = 1 group by c2").show

result

+---+---------------+
| c2|        sum(c3)|
+---+---------------+
|  0|559882429285360|
|  1|559885503421750|
|  3|839826576815406|
|  2|839827141809990|
|  4|559885785918562|
+---+---------------+
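For comparison, the per-group sums the query should produce can be computed directly from the same data; a quick Python sketch mirroring the repro's logic (ids 0..999, keep rows where id % 2 == 1, group by id % 5, sum the id):

```python
# Independent check of the expected aggregation result:
# c1 = id % 2 (filter c1 = 1), c2 = id % 5 (group key), c3 = id (summed).
expected = {}
for i in range(1000):
    if i % 2 == 1:
        expected[i % 5] = expected.get(i % 5, 0) + i

print(expected)
# → {1: 49600, 3: 49800, 0: 50000, 2: 50200, 4: 50400}
```

These expected values are orders of magnitude smaller than the sums in the table above, so the results are clearly corrupted rather than merely reordered.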

Through testing, I found that #11010 caused this; it works again after reverting that change.

System information

Velox System Info v0.0.2
Commit: 2883361
CMake Version: 3.28.3
System: Linux-5.15.0-113-generic
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 11.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr/local/lib/python3.10/dist-packages/cmake/data;/usr/local;/usr/X11R6;/usr/pkg;/opt


Relevant logs

No response

@zml1206 added the bug and triage labels Oct 15, 2024
@zml1206
Contributor Author

zml1206 commented Oct 15, 2024

cc @Yuhta

@Yuhta
Contributor

Yuhta commented Oct 16, 2024

Can you upload the tmp/t1 here?

@zml1206
Contributor Author

zml1206 commented Oct 17, 2024

Can you upload the tmp/t1 here?

t1.tar.gz
Also, using spark.range(1000) makes it easier to reproduce. @Yuhta

@Yuhta
Contributor

Yuhta commented Oct 20, 2024

@zml1206 I cannot repro it using the table scan operator and Hive connector in Velox. It is probably some bug in Gluten integration. The test code:

TEST_F(ParquetTableScanTest, aggregatePushdown) {
  auto outputType = ROW({"c1", "c2", "c3"}, {BIGINT(), BIGINT(), BIGINT()});
  auto plan = PlanBuilder()
                  .tableScan(outputType, {"c1 = 1"}, "")
                  .singleAggregation({"c2"}, {"sum(c3)"})
                  .planNode();
  std::vector<std::shared_ptr<connector::ConnectorSplit>> splits;
  for (int i = 0; i < 32; ++i) {
    splits.push_back(makeSplit(getExampleFilePath(fmt::format(
        "t1/part-{:05}-6c0bb0b9-d8d5-464c-bb5f-6a4eaeb83228-c000.snappy.parquet",
        i))));
  }
  auto result = AssertQueryBuilder(plan).splits(splits).copyResults(pool());
  FAIL() << result->toString(0, result->size());
}

Output:

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ParquetTableScanTest
[ RUN      ] ParquetTableScanTest.aggregatePushdown
fbcode/velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp:1145: Failure
Failed
0: {1, 49600}
1: {3, 49800}
2: {0, 50000}
3: {2, 50200}
4: {4, 50400}

[  FAILED  ] ParquetTableScanTest.aggregatePushdown (243 ms)
[----------] 1 test from ParquetTableScanTest (243 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (830 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ParquetTableScanTest.aggregatePushdown

@zml1206
Contributor Author

zml1206 commented Oct 21, 2024

@Yuhta What is your system info? I can reproduce it on Ubuntu 22.04.
I tried on macOS before and could not reproduce it there.

root@from:/velox# _build/release/velox/dwio/parquet/tests/reader/velox_dwio_parquet_table_scan_test --velox_exception_user_stacktrace_enabled=true --gtest_filter="ParquetTableScanTest.aggregatePushdown"
Note: Google Test filter = ParquetTableScanTest.aggregatePushdown
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ParquetTableScanTest
[ RUN      ] ParquetTableScanTest.aggregatePushdown
/velox/velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp:256: Failure
Failed
0: {1, 5049244363222836}
1: {3, 5610271514345372}
2: {0, 5610271513904604}
3: {2, 5610271514657156}
4: {4, 5049244363721088}
[  FAILED  ] ParquetTableScanTest.aggregatePushdown (40 ms)
[----------] 1 test from ParquetTableScanTest (40 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (70 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ParquetTableScanTest.aggregatePushdown

 1 FAILED TEST
root@from:/velox# ./scripts/info.sh

Velox System Info v0.0.2
Commit: 288336153060b4c2ac9bd231a353f98dceb48c8a
CMake Version: 3.28.3
System: Linux-5.15.0-113-generic
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 11.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr/local/lib/python3.10/dist-packages/cmake/data;/usr/local;/usr/X11R6;/usr/pkg;/opt


@Yuhta
Contributor

Yuhta commented Oct 21, 2024

It could be a compiler difference. Are you using gcc? Can you try clang and see if it still repros?
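For anyone trying this, a common way to switch compilers for a CMake-based build such as Velox's is the standard CC/CXX environment variables; a minimal sketch (the compiler paths are assumptions, adjust to your installed clang version):

```shell
# Hypothetical sketch: point the build at clang instead of gcc.
# CMake reads CC/CXX when it first configures a build directory.
export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
# A clean reconfigure is needed so CMake re-detects the compiler, e.g.:
#   make clean && make release
echo "CC=$CC CXX=$CXX"
```

Note that changing CC/CXX has no effect on an already-configured build directory; it must be wiped first.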

@zml1206
Contributor Author

zml1206 commented Oct 22, 2024

It could be a compiler difference. Are you using gcc? Can you try clang and see if it still repros?

Yes, it cannot be reproduced with clang, but it can with gcc.

@Yuhta Yuhta added the gcc label Oct 22, 2024