
[BUG] Data corruption writing ORC data with lots of nulls before timestamp #13460

Closed
revans2 opened this issue May 26, 2023 · 10 comments

revans2 (Contributor) commented May 26, 2023

Describe the bug
I am still working on a repro case for this in pure cuDF, but I wanted to get this up ASAP while I work on it.

We have a customer that got data corruption trying to write out a file in ORC that had lots of nulls before it hit a non-null timestamp value.

I am still working on a pure cuDF C++ repro case, but for now here is what I have:

// Write 23,838 null timestamps followed by 100 non-null ones to a single ORC file (GPU write).
val nulls = spark.range(23838).selectExpr("CAST(NULL as timestamp) as ts")
val non_nulls = spark.range(100L).selectExpr("timestamp_micros(CAST(rand(0) * 1000000 as LONG) + 1684830860000000) as ts")
nulls.union(non_nulls).repartition(1).orderBy("ts").write.mode("overwrite").orc("./target/TMP_ORC")

// Read the file back on the CPU to verify it.
spark.conf.set("spark.rapids.sql.enabled", "false")
spark.time(spark.read.orc("./target/TMP_ORC").selectExpr("COUNT(ts)").show())

I also wrote the same data out to a Parquet file, and if I transcode it to ORC I get the same error.
data.zip

// Transcode the Parquet copy to ORC on the GPU, then read it back on the CPU.
spark.read.parquet("./target/TMP_PAR").write.mode("overwrite").orc("./target/TMP_ORC")
spark.conf.set("spark.rapids.sql.enabled", "false")
spark.time(spark.read.orc("./target/TMP_ORC").selectExpr("COUNT(ts)").show())

The error that the CPU outputs when reading the corrupt file is:

Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream column 1 kind DATA position: 8 length: 8 range: 0 offset: 8 limit: 8 range 0 = 120 to 128;  range 1 = 128 to 150 uncompressed: 8 to 8
  at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:62)
  at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333)

This is very similar to the error that I get when I try to read the file using the ORC java command line tools.

$ java -jar $ORC_TOOLS_DIR/target/orc-tools-1.8.0-uber.jar data ./target/TMP_ORC/*.orc
...
Unable to dump data for file: TMP_ORC/part-00000-a7790406-0fa6-4a4e-97ae-c66febd45b37-c000.snappy.orc
java.io.IOException: Error reading file: TMP_ORC/part-00000-a7790406-0fa6-4a4e-97ae-c66febd45b37-c000.snappy.orc
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1450)
	at org.apache.orc.tools.PrintData.printJsonData(PrintData.java:209)
	at org.apache.orc.tools.PrintData.main(PrintData.java:288)
	at org.apache.orc.tools.Driver.main(Driver.java:115)
Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream column 1 kind DATA position: 8 length: 8 range: 0 offset: 8 limit: 8 range 0 = 120 to 128;  range 1 = 128 to 150 uncompressed: 8 to 8
	at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:60)

I tested this with 23.04.1 and didn't see the problem at all, so I think this is something that was introduced recently in cuDF.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels May 26, 2023
GregoryKimball (Contributor) commented May 26, 2023

It looks like the cudf ORC writer data corruption is triggered when there are >= 10,000 nulls at the start of the series. Here is an example Python repro that fails on the host read:

import cudf
import pandas as pd

i = 10000
df = cudf.DataFrame({'a': [None] * i + [100] * 5})
df.to_orc('temp.orc')
print('finished cudf write')
pdf = pd.read_orc('temp.orc')
print('finished pandas read')
finished cudf write
terminate called after throwing an instance of 'orc::ParseError'
  what():  bad read in RleDecoderV2::readByte
Aborted (core dumped)

Also, even though cudf can read this file, the data has changed:

df2 = cudf.read_orc('temp.orc')
print(df2)
0      <NA>
1      <NA>
2      <NA>
3      <NA>
4      <NA>
...     ...
10000     0
10001     0
10002     0
10003     0
10004     0

So perhaps the solution to #13460 should detect the RLE encoding failure and fail loudly instead of returning zeros in this case.

I'm working on a recent 23.06 commit (5b4f9f5cc).

revans2 (Contributor, Author) commented May 26, 2023

Also, the crash on the CPU only appears to happen if there are about 8 or more rows of data after the nulls. If there are fewer, we get data corruption but not a crash.
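
One way to see both behaviors is to sweep the number of trailing non-null rows. This is a sketch built on top of @GregoryKimball's Python repro above, not a test from this issue; each host read runs in a subprocess because the reader can abort the whole process, as in the core dump above.

# Sketch: sweep trailing non-null counts to separate silent corruption from reader crashes.
import subprocess
import sys
import cudf

for tail in range(1, 12):
    cudf.DataFrame({'a': [None] * 10000 + [100] * tail}).to_orc('temp.orc')
    # Isolate the host read; it can die with SIGABRT rather than raise.
    check = "import pandas as pd; print(pd.read_orc('temp.orc')['a'].count())"
    r = subprocess.run([sys.executable, '-c', check], capture_output=True, text=True)
    if r.returncode == 0:
        # A non-null count below `tail` means the read "succeeded" on corrupt data.
        print(f"tail={tail}: read OK, non-null count={r.stdout.strip()}")
    else:
        print(f"tail={tail}: host reader crashed (returncode {r.returncode})")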

revans2 (Contributor, Author) commented May 26, 2023

I saw the 10,000 limit too.

I think that this might be related:

constexpr size_type default_row_index_stride = 10000; ///< 10K rows default orc row index stride
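
If the row index stride is the trigger, the corruption threshold should move with it. A minimal sketch, assuming cudf's Python `to_orc` exposes a `row_index_stride` parameter mirroring the C++ writer option (an assumption, not something confirmed in this issue):

# Sketch: probe whether the corruption threshold tracks the row index stride.
# row_index_stride= is assumed to be accepted here; otherwise the constant
# above has to be changed and cudf recompiled, as done below.
import cudf
import pandas as pd

stride = 5000
df = cudf.DataFrame({'a': [None] * stride + [100] * 5})
df.to_orc('temp.orc', row_index_stride=stride)
print(pd.read_orc('temp.orc')['a'].count())  # expect 5 if the write is correct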

revans2 (Contributor, Author) commented May 26, 2023

Yup, that is it. I changed it to 5000 and now the corruption shows up at >= 5000 nulls.

revans2 (Contributor, Author) commented May 26, 2023

Nope, I was wrong. We still get data corruption on longs; it just does not always throw the exception.

revans2 (Contributor, Author) commented May 26, 2023

I also see the corruption with ints. So it looks like it is a generic issue that is not specific to timestamps.
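
A quick way to check other types is to round-trip the same shape of data per dtype and compare against the original; this is a sketch based on the Python repro above (timestamps, the type from the original report, would follow the same pattern):

# Sketch: round-trip the 10K-null boundary for several integer dtypes.
import cudf

for dtype in ['int32', 'int64']:
    df = cudf.DataFrame({'a': cudf.Series([None] * 10000 + [100] * 5, dtype=dtype)})
    df.to_orc('temp.orc')
    back = cudf.read_orc('temp.orc')
    # Compare via pandas; before the fix the trailing values came back as zeros.
    ok = back['a'].to_pandas().equals(df['a'].to_pandas())
    print(dtype, 'round trip OK:', ok)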

@GregoryKimball GregoryKimball added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels May 26, 2023
vuule (Contributor) commented May 26, 2023

Here's my C++ repro; it should be equivalent to @GregoryKimball's Python code above:

  // Uses the libcudf test utilities; int32_col is the usual test alias
  // (cudf::test::fixed_width_column_wrapper<int32_t>).
  // 10,005 rows of -1; the validity mask marks the first 10,000 as null.
  std::vector<int> ints(10000 + 5, -1);
  auto mask = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i >= 10000; });
  int32_col col{ints.begin(), ints.end(), mask};
  table_view expected({col});

  auto filepath = temp_env->get_temp_filepath("OrcTooManyNulls.orc");
  cudf::io::orc_writer_options out_opts =
    cudf::io::orc_writer_options::builder(cudf::io::sink_info{filepath}, expected);
  cudf::io::write_orc(out_opts);

  cudf::io::orc_reader_options in_opts =
    cudf::io::orc_reader_options::builder(cudf::io::source_info{filepath});
  auto result = cudf::io::read_orc(in_opts);
  CUDF_TEST_EXPECT_TABLES_EQUAL(expected, result.tbl->view());

vuule (Contributor) commented May 26, 2023

Opened #13466, which fixes the basic repro.
@revans2 please run your tests with the change.

raydouglass pushed a commit that referenced this issue May 30, 2023
Issue #13460

Fixes the bug in `gpuCompactOrcDataStreams` where the stream pointer would not get updated for empty row groups.

Authors:
   - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
   - Robert (Bobby) Evans (https://github.com/revans2)
   - MithunR (https://github.com/mythrocks)
   - Nghia Truong (https://github.com/ttnghia)
   - Bradley Dice (https://github.com/bdice)
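
For completeness, a minimal round-trip check at the stride boundary, sketched from the repros above (this is not the actual test added with the fix):

# Sketch: after the fix, the file must read back identically on host and device.
import cudf
import pandas as pd

df = cudf.DataFrame({'a': [None] * 10000 + [100] * 5})
df.to_orc('temp.orc')

# Host read: aborted outright before the fix (see the core dump above).
pdf = pd.read_orc('temp.orc')
assert pdf['a'].count() == 5
assert list(pdf['a'].dropna()) == [100] * 5

# Device read: returned zeros instead of the written values before the fix.
gdf = cudf.read_orc('temp.orc')
assert gdf['a'].count() == 5
assert gdf['a'].dropna().to_pandas().tolist() == [100] * 5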
vuule (Contributor) commented May 30, 2023

@revans2 can this issue be closed now? I didn't set the PR to close the issue because I only tested a repro derived from the original.

revans2 (Contributor, Author) commented May 31, 2023

yes
