ORC-1065: Fix IndexOutOfBoundsException in ReaderImpl.extractFileTail #979

cxzl25 · 2021-12-28T04:28:32Z

What changes were proposed in this pull request?

Use buffer limit as readSize to avoid IndexOutOfBoundsException.

main

orc/java/core/src/java/org/apache/orc/impl/ReaderImpl.java

Lines 720 to 725 in 3a2cb60

    
    public static OrcTail extractFileTail(ByteBuffer buffer, long fileLen, long modificationTime)
        throws IOException {
      OrcProto.PostScript ps;
      long readSize = fileLen != -1 ? fileLen : buffer.limit();
      OrcProto.FileTail.Builder fileTailBuilder = OrcProto.FileTail.newBuilder();
      fileTailBuilder.setFileLength(readSize);

branch-1.5

orc/java/core/src/java/org/apache/orc/impl/ReaderImpl.java

Lines 487 to 490 in 5f88704

    
    public static OrcTail extractFileTail(ByteBuffer buffer, long fileLength, long modificationTime)
        throws IOException {
      int readSize = buffer.limit();
      int psLen = buffer.get(readSize - 1) & 0xff;

Why are the changes needed?

ORC-251 removed ReaderImpl.extractFileTail.
ORC-685 added ReaderImpl.extractFileTail back.

In ORC-685, the file length is used as readSize. When the buffer is read from the cache, using the file length is incorrect and results in an IndexOutOfBoundsException.

    long readSize = fileLen != -1 ? fileLen : buffer.limit();
    int psLen = buffer.get((int) (readSize - 1)) & 0xff;

Caused by: java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkIndex(Buffer.java:540)
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:726)
    at org.apache.hadoop.hive.ql.io.orc.LocalCache.getAndValidate(LocalCache.java:103)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.getSplits(OrcInputFormat.java:798)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.runGetSplitsSync(OrcInputFormat.java:916)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.generateSplitWork(OrcInputFormat.java:885)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.scheduleSplits(OrcInputFormat.java:1759)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1703)
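
To make the failure mode concrete, here is a minimal, self-contained sketch that uses only java.nio.ByteBuffer. It assumes the cached buffer holds just the tail of the file, so the file length is larger than buffer.limit(). The class name TailReadSizeDemo and both helper methods are hypothetical and exist only for this illustration; they are not the ORC implementation.

```java
import java.nio.ByteBuffer;

final class TailReadSizeDemo {
    // Hypothetical helper mirroring the ORC-685 logic quoted above:
    // prefer the file length whenever it is known.
    static int psLenFromFileLength(ByteBuffer buffer, long fileLen) {
        long readSize = fileLen != -1 ? fileLen : buffer.limit();
        return buffer.get((int) (readSize - 1)) & 0xff;
    }

    // Hypothetical helper mirroring the branch-1.5 logic:
    // always index relative to the buffer itself.
    static int psLenFromBufferLimit(ByteBuffer buffer) {
        int readSize = buffer.limit();
        return buffer.get(readSize - 1) & 0xff;
    }

    public static void main(String[] args) {
        // Assumption: the cache kept only the last 16 bytes of a 1 MiB file.
        byte[] tail = new byte[16];
        tail[15] = (byte) 200; // the last byte stands in for the postscript length
        ByteBuffer buffer = ByteBuffer.wrap(tail);
        long fileLen = 1L << 20;

        // Indexing by the buffer limit reads the last byte of the buffer.
        System.out.println("psLen = " + psLenFromBufferLimit(buffer));

        // Indexing by the file length points far past the buffer limit.
        try {
            psLenFromFileLength(buffer, fileLen);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("file length is not a valid index into the buffer");
        }
    }
}
```

The absolute ByteBuffer.get(int) throws IndexOutOfBoundsException whenever the index is not smaller than the buffer's limit, which matches the top frames of the stack trace above.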

How was this patch tested?

local test

cxzl25 · 2021-12-28T04:32:26Z

main branch

orc/java/core/src/java/org/apache/orc/impl/ReaderImpl.java

Lines 720 to 725 in 3a2cb60

    
    public static OrcTail extractFileTail(ByteBuffer buffer, long fileLen, long modificationTime)
        throws IOException {
      OrcProto.PostScript ps;
      long readSize = fileLen != -1 ? fileLen : buffer.limit();
      OrcProto.FileTail.Builder fileTailBuilder = OrcProto.FileTail.newBuilder();
      fileTailBuilder.setFileLength(readSize);

branch-1.5

orc/java/core/src/java/org/apache/orc/impl/ReaderImpl.java

Lines 487 to 490 in 5f88704

    
    public static OrcTail extractFileTail(ByteBuffer buffer, long fileLength, long modificationTime)
        throws IOException {
      int readSize = buffer.limit();
      int psLen = buffer.get(readSize - 1) & 0xff;

cxzl25 · 2021-12-28T04:37:27Z

We use Spark 3.2.0, Hive 2.3.9, and ORC 1.6.11.
With spark.sql.hive.convertMetastoreOrc=false set in Spark, the problem is triggered the second time a table is queried.

The current workaround is to add the following configuration to hive-site.xml:

  <property>
    <name>hive.orc.cache.stripe.details.mem.size</name>
    <value>0</value>
  </property>

    HIVE_ORC_CACHE_STRIPE_DETAILS_MEMORY_SIZE("hive.orc.cache.stripe.details.mem.size", "256Mb",
        new SizeValidator(), "Maximum size of orc splits cached in the client."),

cxzl25 · 2021-12-28T04:40:58Z

cc @pgaref @omalley @dongjoon-hyun

dongjoon-hyun · 2021-12-28T05:46:44Z

It seems that the PR doesn't pass the UTs. Could you check the UT failures?

Error:  Failures: 
Error:    TestReaderImpl.testOrcTailStripeStats:382 expected: <1980> but was: <-417>

dongjoon-hyun

Could you add a test case for your code, @cxzl25 ?

cxzl25 · 2021-12-28T07:12:37Z

Could you add a test case for your code, @cxzl25 ?

OK, let me see how to add a UT to cover this case.

dongjoon-hyun

+1, LGTM. Thank you, @cxzl25 .

dongjoon-hyun · 2021-12-28T19:58:24Z

cc @pgaref and @williamhyun

…979 (cherry picked from commit f53b149) Signed-off-by: Dongjoon Hyun <[email protected]>

…979 (cherry picked from commit f53b149) Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 546f72a) Signed-off-by: Dongjoon Hyun <[email protected]>

dongjoon-hyun · 2021-12-28T20:05:30Z

@cxzl25 . I added you to the Apache ORC contributor group and assigned ORC-1065 to you. Thank you again.

github-actions bot added the JAVA label Dec 28, 2021

dongjoon-hyun requested changes Dec 28, 2021

fix extractFileTail IndexOutOfBoundsException

d34f79d

cxzl25 force-pushed the ORC-1065 branch from 6b0d1da to d34f79d Compare December 28, 2021 10:06

dongjoon-hyun modified the milestones: 1.7.3, 1.6.13 Dec 28, 2021

dongjoon-hyun approved these changes Dec 28, 2021

dongjoon-hyun changed the title ~~ORC-1065: Fix ReaderImpl.extractFileTail IndexOutOfBoundsException~~ ORC-1065: Fix IndexOutOfBoundsException in ReaderImpl.extractFileTail Dec 28, 2021

dongjoon-hyun merged commit f53b149 into apache:main Dec 28, 2021

cxzl25 mentioned this pull request Jun 23, 2022

[Bug] Kyuubi integrated Ranger failed to query ORC and Parquet data apache/kyuubi#2939

Closed

3 tasks
