
[BUG] Data corruption writing ORC data with lots of nulls before timestamp #13460

Closed
revans2 opened this issue May 26, 2023 · 10 comments

revans2 (Contributor) commented May 26, 2023

Describe the bug
I am still working on a repro case for this in pure cuDF, but I wanted to get this up ASAP while I work on it.

We have a customer that got data corruption trying to write out a file in ORC that had lots of nulls before it hit a non-null timestamp value.

I am still working on a pure cuDF C++ repro case, but for now here is what I have:

// Write 23,838 null timestamps followed by 100 non-null ones to a single ORC file (GPU write).
val nulls = spark.range(23838).selectExpr("CAST(NULL as timestamp) as ts")
val non_nulls = spark.range(100L).selectExpr("timestamp_micros(CAST(rand(0) * 1000000 as LONG) + 1684830860000000) as ts")
nulls.union(non_nulls).repartition(1).orderBy("ts").write.mode("overwrite").orc("./target/TMP_ORC")

// Read the file back on the CPU to verify it.
spark.conf.set("spark.rapids.sql.enabled", "false")
spark.time(spark.read.orc("./target/TMP_ORC").selectExpr("COUNT(ts)").show())

I also wrote the same data out to a Parquet file, and if I transcode it to ORC I get the same error.
data.zip

// Transcode the Parquet copy to ORC on the GPU, then read it back on the CPU.
spark.read.parquet("./target/TMP_PAR").write.mode("overwrite").orc("./target/TMP_ORC")
spark.conf.set("spark.rapids.sql.enabled", "false")
spark.time(spark.read.orc("./target/TMP_ORC").selectExpr("COUNT(ts)").show())

The error that the CPU outputs when reading the corrupt file is:

Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream column 1 kind DATA position: 8 length: 8 range: 0 offset: 8 limit: 8 range 0 = 120 to 128;  range 1 = 128 to 150 uncompressed: 8 to 8
  at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:62)
  at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333)

This is very similar to the error that I get when I try to read the file using the ORC java command line tools.

$ java -jar $ORC_TOOLS_DIR/target/orc-tools-1.8.0-uber.jar data ./target/TMP_ORC/*.orc
...
Unable to dump data for file: TMP_ORC/part-00000-a7790406-0fa6-4a4e-97ae-c66febd45b37-c000.snappy.orc
java.io.IOException: Error reading file: TMP_ORC/part-00000-a7790406-0fa6-4a4e-97ae-c66febd45b37-c000.snappy.orc
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1450)
	at org.apache.orc.tools.PrintData.printJsonData(PrintData.java:209)
	at org.apache.orc.tools.PrintData.main(PrintData.java:288)
	at org.apache.orc.tools.Driver.main(Driver.java:115)
Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream column 1 kind DATA position: 8 length: 8 range: 0 offset: 8 limit: 8 range 0 = 120 to 128;  range 1 = 128 to 150 uncompressed: 8 to 8
	at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:60)

I tested this with 23.04.1 and didn't see the problem at all, so I think this is something that was introduced recently in cuDF.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels May 26, 2023
GregoryKimball (Contributor) commented May 26, 2023

It looks like the cudf ORC writer data corruption is triggered when there are >= 10,000 nulls at the start of the series. Here is an example Python repro that fails on the host read:

import cudf
import pandas as pd

i = 10000
df = cudf.DataFrame({'a': [None] * i + [100] * 5})
df.to_orc('temp.orc')
print('finished cudf write')
pdf = pd.read_orc('temp.orc')
print('finished pandas read')
finished cudf write
terminate called after throwing an instance of 'orc::ParseError'
  what():  bad read in RleDecoderV2::readByte
Aborted (core dumped)

Also, even though cudf can read this file, the data has changed:

df2 = cudf.read_orc('temp.orc')
print(df2)
0      <NA>
1      <NA>
2      <NA>
3      <NA>
4      <NA>
...     ...
10000     0
10001     0
10002     0
10003     0
10004     0

So perhaps the solution to #13460 should detect the RLE encoding failure and fail loudly instead of returning zeros in this case.

I'm working on a recent 23.06 commit (5b4f9f5cc).

revans2 (Contributor, Author) commented May 26, 2023

Also, the crash on the CPU only appears to happen if there are about 8 or more rows of data after the nulls. If there are fewer, we get data corruption but not a crash.
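
One way to see both behaviors is to sweep the number of trailing non-null rows. This is a sketch built on top of @GregoryKimball's Python repro above, not a test from this issue; each host read runs in a subprocess because the reader can abort the whole process, as in the core dump above.

# Sketch: sweep trailing non-null counts to separate silent corruption from reader crashes.
import subprocess
import sys
import cudf

for tail in range(1, 12):
    cudf.DataFrame({'a': [None] * 10000 + [100] * tail}).to_orc('temp.orc')
    # Isolate the host read; it can die with SIGABRT rather than raise.
    check = "import pandas as pd; print(pd.read_orc('temp.orc')['a'].count())"
    r = subprocess.run([sys.executable, '-c', check], capture_output=True, text=True)
    if r.returncode == 0:
        # A non-null count below `tail` means the read "succeeded" on corrupt data.
        print(f"tail={tail}: read OK, non-null count={r.stdout.strip()}")
    else:
        print(f"tail={tail}: host reader crashed (returncode {r.returncode})")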

revans2 (Contributor, Author) commented May 26, 2023

I saw the 10,000 limit too.

I think that this might be related:

constexpr size_type default_row_index_stride = 10000; ///< 10K rows default orc row index stride
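
If the row index stride is the trigger, the corruption threshold should move with it. A minimal sketch, assuming cudf's Python `to_orc` exposes a `row_index_stride` parameter mirroring the C++ writer option (an assumption, not something confirmed in this issue):

# Sketch: probe whether the corruption threshold tracks the row index stride.
# row_index_stride= is assumed to be accepted here; otherwise the constant
# above has to be changed and cudf recompiled, as done below.
import cudf
import pandas as pd

stride = 5000
df = cudf.DataFrame({'a': [None] * stride + [100] * 5})
df.to_orc('temp.orc', row_index_stride=stride)
print(pd.read_orc('temp.orc')['a'].count())  # expect 5 if the write is correct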

revans2 (Contributor, Author) commented May 26, 2023

Yup, that is it. I changed it to 5000 and now the corruption shows up at >= 5000 nulls.

revans2 (Contributor, Author) commented May 26, 2023

Nope, I was wrong. We still get data corruption on longs; it just does not always throw the exception.

revans2 (Contributor, Author) commented May 26, 2023

I also see the corruption with ints. So it looks like it is a generic issue that is not specific to timestamps.
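
A quick way to check other types is to round-trip the same shape of data per dtype and compare against the original; this is a sketch based on the Python repro above (timestamps, the type from the original report, would follow the same pattern):

# Sketch: round-trip the 10K-null boundary for several integer dtypes.
import cudf

for dtype in ['int32', 'int64']:
    df = cudf.DataFrame({'a': cudf.Series([None] * 10000 + [100] * 5, dtype=dtype)})
    df.to_orc('temp.orc')
    back = cudf.read_orc('temp.orc')
    # Compare via pandas; before the fix the trailing values came back as zeros.
    ok = back['a'].to_pandas().equals(df['a'].to_pandas())
    print(dtype, 'round trip OK:', ok)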

@GregoryKimball GregoryKimball added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels May 26, 2023
vuule (Contributor) commented May 26, 2023

Here's my C++ repro; it should be equivalent to @GregoryKimball's Python code above:

  // Uses the libcudf test utilities; int32_col is the usual test alias
  // (cudf::test::fixed_width_column_wrapper<int32_t>).
  // 10,005 rows of -1; the validity mask marks the first 10,000 as null.
  std::vector<int> ints(10000 + 5, -1);
  auto mask = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i >= 10000; });
  int32_col col{ints.begin(), ints.end(), mask};
  table_view expected({col});

  auto filepath = temp_env->get_temp_filepath("OrcTooManyNulls.orc");
  cudf::io::orc_writer_options out_opts =
    cudf::io::orc_writer_options::builder(cudf::io::sink_info{filepath}, expected);
  cudf::io::write_orc(out_opts);

  cudf::io::orc_reader_options in_opts =
    cudf::io::orc_reader_options::builder(cudf::io::source_info{filepath});
  auto result = cudf::io::read_orc(in_opts);
  CUDF_TEST_EXPECT_TABLES_EQUAL(expected, result.tbl->view());

vuule (Contributor) commented May 26, 2023

Opened #13466, which fixes the basic repro.
@revans2 please run your tests with the change.

raydouglass pushed a commit that referenced this issue May 30, 2023
Issue #13460

Fixes the bug in `gpuCompactOrcDataStreams` where the stream pointer would not get updated for empty row groups.

Authors:
   - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
   - Robert (Bobby) Evans (https://github.com/revans2)
   - MithunR (https://github.com/mythrocks)
   - Nghia Truong (https://github.com/ttnghia)
   - Bradley Dice (https://github.com/bdice)
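
For completeness, a minimal round-trip check at the stride boundary, sketched from the repros above (this is not the actual test added with the fix):

# Sketch: after the fix, the file must read back identically on host and device.
import cudf
import pandas as pd

df = cudf.DataFrame({'a': [None] * 10000 + [100] * 5})
df.to_orc('temp.orc')

# Host read: aborted outright before the fix (see the core dump above).
pdf = pd.read_orc('temp.orc')
assert pdf['a'].count() == 5
assert list(pdf['a'].dropna()) == [100] * 5

# Device read: returned zeros instead of the written values before the fix.
gdf = cudf.read_orc('temp.orc')
assert gdf['a'].count() == 5
assert gdf['a'].dropna().to_pandas().tolist() == [100] * 5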
vuule (Contributor) commented May 30, 2023

@revans2 can this issue be closed now? I didn't set the PR to close the issue because I only tested a repro derived from the original.

revans2 (Contributor, Author) commented May 31, 2023

yes
