High latency reads in GCS connector #114

medb · 2018-07-12T14:51:28Z

Issue that focuses on high latency reads discussed in #108

Any thoughts on how the first read performance can be improved, or at least reuse connections when 100s or 1000s of files are being opened once?

The access pattern is quite simple to reproduce.
Generate a bunch of files, then perform the following on each of them (by default ORC launches 10 threads to read the stripe details from these files):

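// fileLength is the size of the file in bytes, assumed to be known to the caller
byte[] buffer = new byte[16384];
FileSystem fs = FileSystem.get(new URI(pathString), conf);
FSDataInputStream is = fs.open(new Path(pathString));
// positioned read of the last 16 KiB of the file
is.readFully(fileLength - 16384, buffer, 0, 16384);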

https://github.com/rajeshbalamohan/hadoop-aws-wrapper is useful in tracking various invocations in the readFully call. This may need some modifications. (Can post a PR to that repo tomorrow)
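For anyone who wants to see the shape of the access pattern without a GCS bucket, here is a minimal sketch that runs it against local temporary files. It is only a stand-in: `FooterReadPattern`, `readTail`, `readAllTails` and `generateFiles` are hypothetical names, and it uses `java.nio` rather than the Hadoop `FileSystem` API, so it says nothing about GCS latency itself. It just shows many files, each opened once, with a single 16 KiB read at the tail, driven by a pool of 10 threads.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical local-file stand-in for the access pattern described above.
final class FooterReadPattern {
  static final int FOOTER_BYTES = 16384;

  // Opens the file once and reads its last FOOTER_BYTES (or the whole file if shorter).
  static int readTail(Path file) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      long size = ch.size();
      int want = (int) Math.min(FOOTER_BYTES, size);
      ByteBuffer buf = ByteBuffer.allocate(want);
      long start = size - want;
      while (buf.hasRemaining()) {
        // Positioned read, analogous to readFully(position, ...)
        int n = ch.read(buf, start + buf.position());
        if (n < 0) {
          break;
        }
      }
      return buf.position();
    }
  }

  // Reads the tail of every file with a fixed pool of `threads` workers.
  // Returns the total number of bytes read.
  static long readAllTails(List<Path> files, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> results = new ArrayList<>();
      for (Path f : files) {
        Callable<Integer> task = () -> readTail(f);
        results.add(pool.submit(task));
      }
      long total = 0;
      for (Future<Integer> r : results) {
        total += r.get();
      }
      return total;
    } finally {
      pool.shutdown();
    }
  }

  // Creates `count` temporary files of `size` zero bytes each.
  static List<Path> generateFiles(int count, int size) throws IOException {
    Path dir = Files.createTempDirectory("footer-read");
    List<Path> files = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      files.add(Files.write(dir.resolve("file-" + i + ".bin"), new byte[size]));
    }
    return files;
  }
}
```

Something like `FooterReadPattern.readAllTails(FooterReadPattern.generateFiles(1000, 1 << 20), 10)` gives the same one-open-one-tail-read shape; to measure the connector itself, the body of `readTail` would need to be swapped for the `fs.open` / `readFully` calls above.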

medb · 2018-07-12T14:54:11Z

@sidseth I have made some optimizations to address this issue in #110, I plan to mainline them soon.

Could you check whether they help your use case?

sidseth · 2018-07-12T19:42:58Z

@medb - will try out the patch and get back with details, will likely be early next week though.

medb · 2018-07-18T17:46:01Z

We just released GCS connector 1.9.2 which includes all the performance optimizations.

To take advantage of all available optimizations set the properties:

fs.gs.inputstream.fadvise=RANDOM
fs.gs.io.buffersize=524288
fs.gs.inputstream.footer.prefetch.size=65536
fs.gs.performance.cache.enable=true
fs.gs.performance.cache.max.entry.age.ms=1800000

sidseth · 2018-07-18T18:30:26Z

Thanks @medb. I've got some information from the previous patch (using FADVISE=RANDOM). Will run some more tests with the new changes. From a quick glance, it seemed like the footer reads from multiple files were faster. Let me analyze the results a little more before posting details. I also noticed that listLocatedStatus was taking a (relatively) very long time, which causes a slowdown.
Any particular reason to reduce the buffer size from the default 8MB to 0.5MB?

medb · 2018-07-18T18:40:39Z

Buffer size limits the minimum HTTP range request size in RANDOM mode.
In my SparkSQL tests with ORC files it led to redundant data transfer (up to 2x). My guess is that ORC reads ~1MB pages at a time, so 8MB range requests are redundant for it.
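To make the buffer-size point concrete, here is a toy model of how a minimum range request size can inflate the bytes transferred. It is a sketch under loud assumptions and not the connector's implementation: `RangeRequestModel` and `bytesTransferred` are hypothetical names, each read that misses the currently buffered range is assumed to trigger one request of at least `minRequestSize` bytes starting at the read offset, and only the most recent range is kept. It does not try to reproduce the up-to-2x figure above, since that depends on how many neighbouring reads end up reusing an oversized request.

```java
// Toy model (an assumption, not the GCS connector's logic) of range request overhead.
final class RangeRequestModel {

  // Returns the total bytes fetched for `reads`, given as {offset, length} pairs.
  static long bytesTransferred(long fileSize, long minRequestSize, long[][] reads) {
    long bufStart = 0;
    long bufEnd = 0; // currently buffered range is [bufStart, bufEnd)
    long transferred = 0;
    for (long[] r : reads) {
      long start = r[0];
      long end = Math.min(fileSize, r[0] + r[1]);
      if (start >= bufStart && end <= bufEnd) {
        continue; // served entirely from the buffered range, no new request
      }
      // A new request covers the read and is padded up to the minimum size.
      long reqEnd = Math.min(fileSize, Math.max(end, start + minRequestSize));
      transferred += reqEnd - start;
      bufStart = start;
      bufEnd = reqEnd;
    }
    return transferred;
  }
}
```

In this model, scattered 1 MiB reads pay for the full minimum request each time, while back-to-back reads can be served from one request, so the overhead depends heavily on how clustered the reads are.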

sidseth · 2018-07-19T02:47:11Z

Got it.
Here's what I've noticed from a bunch of runs.

The patches for fadvise=RANDOM have not made a significant difference to the footer read times (this was with fadvise=RANDOM and the set of patches from 2 days ago, without fs.gs.performance.cache or footer prefetch). This was from running queries, but a few microbenchmarks on my local system had similar results. I will run a few microbenchmarks from GCS nodes to make sure that's the case.
With the latest set of patches and the settings mentioned above, the readFully call is much faster. However, open is now noticeably slower, since that's where the prefetch happens. Combined, though, queries (not microbenchmarks) show a 20-40% improvement.
listLocatedStatus sees a significant improvement with the performance cache enabled. Is enabling it recommended? (I will look at the memory impact before enabling this by default.)

medb · 2018-07-19T18:04:01Z

Thanks for sharing test results!

Yes, for footer-only reads RANDOM mode is not necessarily beneficial: the footer is relatively small and sits at the end of the file, so there is no difference between STREAMING and RANDOM mode in this case.
Regarding open: if the test doesn't actually read the footer to parse data in the file (as I think is the case with the readFully call), you can set fs.gs.inputstream.footer.prefetch.size=0 so that a GCS metadata request is sent in the open call instead of pre-fetching the footer. That could be better for this use case.
Another thing to consider when using the performance cache is data staleness; by default it is set to 5 seconds.

medb · 2018-08-08T00:25:35Z

GCS connector 1.9.4 was released with improvements to random read latency.
To take advantage of them, set the following properties:

fs.gs.io.buffersize=0
fs.gs.inputstream.min.range.request.size=262144
fs.gs.performance.cache.enable=true
fs.gs.performance.cache.max.entry.age.ms=300000
fs.gs.inputstream.fast.fail.on.not.found.enable=false
fs.gs.inputstream.fadvise=RANDOM
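If it is easier to set these from code than from a config file, one option is a small helper like the sketch below. The property names and values are copied from the list above; the class `GcsRandomReadSettings` and its methods are hypothetical names for illustration. It depends only on the JDK, on the assumption that the caller passes in whatever setter the job uses, for example `conf::set` for a Hadoop `Configuration`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical helper; property names and values are taken from the comment above.
final class GcsRandomReadSettings {

  // Returns the suggested properties in the order they were listed.
  static Map<String, String> properties() {
    Map<String, String> p = new LinkedHashMap<>();
    p.put("fs.gs.io.buffersize", "0");
    p.put("fs.gs.inputstream.min.range.request.size", "262144");
    p.put("fs.gs.performance.cache.enable", "true");
    p.put("fs.gs.performance.cache.max.entry.age.ms", "300000");
    p.put("fs.gs.inputstream.fast.fail.on.not.found.enable", "false");
    p.put("fs.gs.inputstream.fadvise", "RANDOM");
    return p;
  }

  // Pushes every property through `setter`, e.g. conf::set.
  static void applyTo(BiConsumer<String, String> setter) {
    properties().forEach(setter);
  }
}
```

With Hadoop on the classpath this would be used as `GcsRandomReadSettings.applyTo(conf::set);` before calling `FileSystem.get(...)`. The performance cache setting trades freshness for speed, so the data staleness note above still applies.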

selimelawwa mentioned this issue Oct 18, 2022

Performance degradation when upgrading 3-2.2.8 #891

Open

medb closed this as completed Dec 29, 2022
