Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1065: Fix IndexOutOfBoundsException in ReaderImpl.extractFileTail #979

Merged
merged 1 commit into from
Dec 28, 2021

Conversation

cxzl25
Copy link
Contributor

@cxzl25 cxzl25 commented Dec 28, 2021

What changes were proposed in this pull request?

Use buffer limit as readSize to avoid IndexOutOfBoundsException.

main

public static OrcTail extractFileTail(ByteBuffer buffer, long fileLen, long modificationTime)
throws IOException {
OrcProto.PostScript ps;
long readSize = fileLen != -1 ? fileLen : buffer.limit();
OrcProto.FileTail.Builder fileTailBuilder = OrcProto.FileTail.newBuilder();
fileTailBuilder.setFileLength(readSize);

branch-1.5

public static OrcTail extractFileTail(ByteBuffer buffer, long fileLength, long modificationTime)
throws IOException {
int readSize = buffer.limit();
int psLen = buffer.get(readSize - 1) & 0xff;

Why are the changes needed?

ORC-251 remove ReaderImpl.extractFileTail
ORC-685 Add ReaderImpl.extractFileTail back

In ORC-685, file length is used as readsize, which causes that if the buffer is read from the cache, the use of length is incorrect, resulting in IndexOutOfBoundsException.

long readSize = fileLen != -1? fileLen: buffer.limit();
int psLen = buffer.get((int) (readSize-1)) & 0xff; 
Caused by: java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkIndex(Buffer.java:540)
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:726)
    at org.apache.hadoop.hive.ql.io.orc.LocalCache.getAndValidate(LocalCache.java:103)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.getSplits(OrcInputFormat.java:798)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.runGetSplitsSync(OrcInputFormat.java:916)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.generateSplitWork(OrcInputFormat.java:885)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.scheduleSplits(OrcInputFormat.java:1759)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1703) 

How was this patch tested?

local test

@github-actions github-actions bot added the JAVA label Dec 28, 2021
@cxzl25
Copy link
Contributor Author

cxzl25 commented Dec 28, 2021

main branch

public static OrcTail extractFileTail(ByteBuffer buffer, long fileLen, long modificationTime)
throws IOException {
OrcProto.PostScript ps;
long readSize = fileLen != -1 ? fileLen : buffer.limit();
OrcProto.FileTail.Builder fileTailBuilder = OrcProto.FileTail.newBuilder();
fileTailBuilder.setFileLength(readSize);

branch-1.5

public static OrcTail extractFileTail(ByteBuffer buffer, long fileLength, long modificationTime)
throws IOException {
int readSize = buffer.limit();
int psLen = buffer.get(readSize - 1) & 0xff;

@cxzl25
Copy link
Contributor Author

cxzl25 commented Dec 28, 2021

We used Spark 3.2.0, Hive2.3.9, Orc 1.6.11,
Set spark.sql.hive.convertMetastoreOrc=false in spark, and querying a table triggers this problem for the second time.

The current workaround is to add configuration in hive-site.xml

  <property>
    <name>hive.orc.cache.stripe.details.mem.size</name>
    <value>0</value>
  </property>
    HIVE_ORC_CACHE_STRIPE_DETAILS_MEMORY_SIZE("hive.orc.cache.stripe.details.mem.size", "256Mb",
        new SizeValidator(), "Maximum size of orc splits cached in the client."),

@cxzl25
Copy link
Contributor Author

cxzl25 commented Dec 28, 2021

cc @pgaref @omalley @dongjoon-hyun

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Dec 28, 2021

It seems that the PR doesn't pass the UTs. Could you check the UT failures?

Error:  Failures: 
Error:    TestReaderImpl.testOrcTailStripeStats:382 expected: <1980> but was: <-417>

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test case for your code, @cxzl25 ?

@cxzl25
Copy link
Contributor Author

cxzl25 commented Dec 28, 2021

Could you add a test case for your code, @cxzl25 ?

ok, let me see how to add a ut to cover this case.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @cxzl25 .

@dongjoon-hyun dongjoon-hyun changed the title ORC-1065: Fix ReaderImpl.extractFileTail IndexOutOfBoundsException ORC-1065: Fix IndexOutOfBoundsException in ReaderImpl.extractFileTail Dec 28, 2021
@dongjoon-hyun dongjoon-hyun merged commit f53b149 into apache:main Dec 28, 2021
@dongjoon-hyun
Copy link
Member

cc @pgaref and @williamhyun

dongjoon-hyun pushed a commit that referenced this pull request Dec 28, 2021
…979

### What changes were proposed in this pull request?
Use buffer limit as `readSize` to avoid `IndexOutOfBoundsException`.

**main**
https://github.com/apache/orc/blob/3a2cb60e4ab6af6305c351fbdb51b98f460f64a0/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L720-L725

**branch-1.5**
https://github.com/apache/orc/blob/5f88704d9bd36fc55b57a60c2fbbd35980b1b7e5/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L487-L490

### Why are the changes needed?
ORC-251 remove `ReaderImpl.extractFileTail`
ORC-685 Add `ReaderImpl.extractFileTail` back

In ORC-685, file length is used as readsize, which causes that if the buffer is read from the cache, the use of length is incorrect, resulting in IndexOutOfBoundsException.
```
long readSize = fileLen != -1? fileLen: buffer.limit();
int psLen = buffer.get((int) (readSize-1)) & 0xff;
```
```
Caused by: java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkIndex(Buffer.java:540)
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:726)
    at org.apache.hadoop.hive.ql.io.orc.LocalCache.getAndValidate(LocalCache.java:103)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.getSplits(OrcInputFormat.java:798)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.runGetSplitsSync(OrcInputFormat.java:916)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.generateSplitWork(OrcInputFormat.java:885)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.scheduleSplits(OrcInputFormat.java:1759)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1703)
```

### How was this patch tested?
local test

(cherry picked from commit f53b149)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Dec 28, 2021
…979

### What changes were proposed in this pull request?
Use buffer limit as `readSize` to avoid `IndexOutOfBoundsException`.

**main**
https://github.com/apache/orc/blob/3a2cb60e4ab6af6305c351fbdb51b98f460f64a0/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L720-L725

**branch-1.5**
https://github.com/apache/orc/blob/5f88704d9bd36fc55b57a60c2fbbd35980b1b7e5/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L487-L490

### Why are the changes needed?
ORC-251 remove `ReaderImpl.extractFileTail`
ORC-685 Add `ReaderImpl.extractFileTail` back

In ORC-685, file length is used as readsize, which causes that if the buffer is read from the cache, the use of length is incorrect, resulting in IndexOutOfBoundsException.
```
long readSize = fileLen != -1? fileLen: buffer.limit();
int psLen = buffer.get((int) (readSize-1)) & 0xff;
```
```
Caused by: java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkIndex(Buffer.java:540)
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:726)
    at org.apache.hadoop.hive.ql.io.orc.LocalCache.getAndValidate(LocalCache.java:103)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.getSplits(OrcInputFormat.java:798)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.runGetSplitsSync(OrcInputFormat.java:916)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.generateSplitWork(OrcInputFormat.java:885)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.scheduleSplits(OrcInputFormat.java:1759)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1703)
```

### How was this patch tested?
local test

(cherry picked from commit f53b149)
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 546f72a)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

@cxzl25 . I added you to the Apache ORC contributor group and assigned ORC-1065 to you. Thank you again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants