Update TrinoFileSystemCache to represent latest hadoop implementation #13243

jitheshtr · 2022-07-20T06:46:39Z

Motivation

Hadoop's implementation of the filesystem cache (hadoop 3.2 filesystem cache) creates the filesystem object outside of the synchronized block. A side effect of this (in addition to reduced locking duration for slow-to-create filesystem implementations) is that there is no interaction between the lock in the caching infrastructure and locks internal to a filesystem implementation (during filesystem object creation). We would like to bring this approach to TrinoFileSystemCache.

A secondary motivation is to improve the concurrency of TrinoFileSystemCache operations by avoiding synchronized blocks.

Description

- Use ConcurrentHashMap to cache filesystem objects - improves concurrency by removing synchronized blocks
- Filesystem object is created outside cache's lock - similar to latest hadoop fs cache impl, further reducing code in critical section. Helps with systems where filesystem creation is expensive.
- Only one thread exclusively creates the filesystem object for a given key. Avoids speculative creation and then later discarding of filesystem objects compared to hadoop fs cache impl.
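The points above can be sketched roughly as follows. This is a minimal illustration, not the TrinoFileSystemCache code: `CreateOnceCache`, `Holder`, `createOnce`, and `CreateOnceCacheDemo` are hypothetical names, the factory is assumed to return non-null values, and `computeIfAbsent` is used for brevity where the PR itself uses `compute`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.IntStream;

// Hypothetical generic cache; not the TrinoFileSystemCache code.
final class CreateOnceCache<K, V>
{
    private final Map<K, Holder<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> factory; // assumed to return non-null values

    CreateOnceCache(Function<K, V> factory)
    {
        this.factory = factory;
    }

    V get(K key)
    {
        // Inside the map operation only a cheap, empty holder is allocated.
        Holder<V> holder = cache.computeIfAbsent(key, ignored -> new Holder<>());
        // The expensive creation runs after the map operation has returned,
        // guarded by the holder's own monitor, so it only blocks callers of this key.
        return holder.createOnce(key, factory);
    }

    private static final class Holder<V>
    {
        private V value; // guarded by "this"

        synchronized <K> V createOnce(K key, Function<K, V> factory)
        {
            if (value == null) {
                value = factory.apply(key);
            }
            return value;
        }
    }
}

final class CreateOnceCacheDemo
{
    private CreateOnceCacheDemo() {}

    // Calls get() for one key from many callers and reports how often the factory ran.
    static int countCreations(int callers)
    {
        AtomicInteger creations = new AtomicInteger();
        CreateOnceCache<String, Object> cache = new CreateOnceCache<>(key -> {
            creations.incrementAndGet();
            return new Object();
        });
        IntStream.range(0, callers).parallel().forEach(i -> cache.get("user-0"));
        return creations.get();
    }

    public static void main(String[] args)
    {
        System.out.println("creations: " + countCreations(64));
    }
}
```

Because the holder's monitor is separate from the map's internal locking, a slow factory for one key does not delay lookups for other keys, and callers racing on the same key wait for the single creation instead of creating their own copy.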

There is a more recent update in the hadoop 3.3.x branch that limits the number of parallel filesystem object creations using a semaphore. Judging by the issue description (HADOOP-17313), it appears to be a workaround for the speculative create-and-discard approach used in the hadoop implementation, which this code avoids.
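For contrast, a speculative create-and-discard cache can be sketched as below. This is a simplified illustration of the general idea only, not Hadoop's actual code; `SpeculativeCache` and `discardedCount` are hypothetical names, and the semaphore from HADOOP-17313 is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch of speculative creation; not Hadoop's FileSystem cache code.
final class SpeculativeCache<K, V>
{
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final AtomicInteger discarded = new AtomicInteger();
    private final Function<K, V> factory;

    SpeculativeCache(Function<K, V> factory)
    {
        this.factory = factory;
    }

    V get(K key)
    {
        V existing = cache.get(key);
        if (existing != null) {
            return existing;
        }
        // Created without holding any cache lock: several callers may get here for the same key.
        V created = factory.apply(key);
        V winner = cache.putIfAbsent(key, created);
        if (winner == null) {
            return created;
        }
        // Another caller published its object first, so the one created here is wasted work.
        discarded.incrementAndGet();
        return winner;
    }

    int discardedCount()
    {
        return discarded.get();
    }
}
```

When object creation is expensive, every discarded instance is work that a create-once-per-key design does not perform in the first place.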

Benchmark output -
Before:

Benchmark                         (numGetCallsPerInvocation)  (numThreads)  (numUsers)  Mode  Cnt   Score   Error  Units
BenchmarkGetFileSystem.benchmark                        1000             1          10  avgt   10   7.747 ± 0.448  ms/op
BenchmarkGetFileSystem.benchmark                        1000             1         100  avgt   10   8.041 ± 0.352  ms/op
BenchmarkGetFileSystem.benchmark                        1000             1        1000  avgt   10   7.492 ± 0.340  ms/op
BenchmarkGetFileSystem.benchmark                        1000            16          10  avgt   10  69.900 ± 8.675  ms/op
BenchmarkGetFileSystem.benchmark                        1000            16         100  avgt   10  66.847 ± 2.937  ms/op
BenchmarkGetFileSystem.benchmark                        1000            16        1000  avgt   10  70.222 ± 4.286  ms/op

After:

Benchmark                         (numGetCallsPerInvocation)  (numThreads)  (numUsers)  Mode  Cnt   Score   Error  Units
BenchmarkGetFileSystem.benchmark                        1000             1          10  avgt   10   7.767 ± 0.511  ms/op
BenchmarkGetFileSystem.benchmark                        1000             1         100  avgt   10   7.412 ± 0.267  ms/op
BenchmarkGetFileSystem.benchmark                        1000             1        1000  avgt   10   7.385 ± 0.284  ms/op
BenchmarkGetFileSystem.benchmark                        1000            16          10  avgt   10  26.333 ± 2.696  ms/op
BenchmarkGetFileSystem.benchmark                        1000            16         100  avgt   10  28.163 ± 1.489  ms/op
BenchmarkGetFileSystem.benchmark                        1000            16        1000  avgt   10  29.545 ± 4.080  ms/op

(The above results are from an 8-core Intel MacBook Pro.)

Is this change a fix, improvement, new feature, refactoring, or other?

Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Update the implementation of TrinoFileSystemCache class in Hive connector.

How would you describe this change to a non-technical end user or system administrator?

Bring the TrinoFileSystemCache implementation in line with the latest hadoop implementation and improve cache performance in the process.

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Hive
* Improve performance by reducing contention in Trino's file system cache. ({issue}`13243`)

jitheshtr · 2022-11-17T20:01:36Z

Todo -

Rebase with master
Fix CI failures due to interaction between various product tests and the new test added here, both accessing global TrinoFileSystemCache within the same JVM

phd3

some initial comments

phd3 · 2022-11-25T23:40:55Z

lib/trino-hdfs/src/main/java/io/trino/hdfs/TrinoFileSystemCache.java

+        int maxSize = conf.getInt("fs.cache.max-size", 1000);
+        FileSystemHolder fileSystemHolder;
+        try {
+            fileSystemHolder = cache.compute(key, (k, currFileSystemHolder) -> {


nit: currFileSystemHolder -> currentFileSystemHolder

phd3 · 2022-11-26T00:10:54Z

lib/trino-hdfs/src/main/java/io/trino/hdfs/TrinoFileSystemCache.java

@@ -306,23 +313,49 @@ public String toString()

    private static class FileSystemHolder
    {
-        private final FileSystem fileSystem;
+        private final URI uri;


do we need to store uri/conf here? I think we should be able to put them in createFileSystemOnce, right?

jitheshtr · 2022-11-28

The thought process was to keep the uri and conf provided by the thread that first created the FileSystemHolder for the key, which is what happens in the existing implementation. createFileSystemOnce() could be called by a different thread, with a different uri and/or conf object, if the operating system schedules the original thread out just before it invokes createFileSystemOnce().

phd3 · 2022-11-29

That shouldn't be a concern, right?

jitheshtr · 2022-11-30

Yes, should be fine, will update. With this change we can avoid storing uri/conf in FileSystemHolder.

phd3 · 2022-11-28T02:08:15Z

lib/trino-hdfs/src/main/java/io/trino/hdfs/TrinoFileSystemCache.java

+                }
+            });
+
+            fileSystemHolder.createFileSystemOnce();


Can we add a comment here why this is outside of the cache compute? Seems like that's an important piece of why this works

phd3

The implementation looks good to me, but IMO would be useful to get some more 👀 on this as well since the change is a bit involved.

phd3 · 2022-11-29T16:03:01Z

lib/trino-hdfs/src/main/java/io/trino/hdfs/TrinoFileSystemCache.java

@@ -70,17 +68,13 @@

    private final TrinoFileSystemCacheStats stats;

-    @GuardedBy("this")
-    private final Map<FileSystemKey, FileSystemHolder> map = new HashMap<>();
+    private final Map<FileSystemKey, FileSystemHolder> cache = new ConcurrentHashMap<>();


Let's add a comment saying why we need cacheSize separately.

phd3 · 2022-11-29T16:25:20Z

lib/trino-hdfs/src/main/java/io/trino/hdfs/TrinoFileSystemCache.java

@@ -306,23 +313,49 @@ public String toString()

    private static class FileSystemHolder
    {
-        private final FileSystem fileSystem;
+        private final URI uri;


That shouldn't be a concern, right?

phd3 · 2022-11-29T16:32:50Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/TestFileSystemCache.java

+        int maxCacheSize = 1000;
+        for (int i = 0; i < maxCacheSize; ++i) {
+            assertEquals(TrinoFileSystemCache.INSTANCE.getFileSystemCacheStats().getCacheSize(), i);
+            getFileSystem(environment, ConnectorIdentity.ofUser("user" + String.valueOf(i)));


String.valueOf seems redundant

phd3 · 2022-11-29T16:33:25Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/TestFileSystemCache.java

+        }
+        assertEquals(TrinoFileSystemCache.INSTANCE.getFileSystemCacheStats().getCacheSize(), maxCacheSize);
+
+        try {


use assertThatThrownBy

phd3 · 2022-11-29T16:33:50Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/TestFileSystemCache.java

+                new ImpersonatingHdfsAuthentication(new SimpleHadoopAuthentication(), new SimpleUserNameProvider()));
+
+        int maxCacheSize = 1000;
+        for (int i = 0; i < maxCacheSize; ++i) {


i++ is followed generally in codebase

phd3 · 2022-11-29T16:38:27Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/TestFileSystemCache.java

+        void consume(FileSystem fileSystem) throws IOException;
+    }
+
+    private static class FileSystemCloser


nit: this can be lambda

jitheshtr · 2022-11-30

If we use a lambda, we get the error "Hadoop FileSystem instances are shared and should not be closed". I had to add FileSystemCloser and annotate its consume() method with @SuppressModernizer to fix this.

phd3 · 2022-11-29T16:40:19Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/TestFileSystemCache.java

+        }
+    }
+
+    // A callable that creates (and consumes) filesystem objects X times for Y users


nit: maybe just use the variable names in the comment

phd3 · 2022-11-29T16:51:07Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/TestFileSystemCache.java

+                            new FileSystemCloser()));
+        }
+
+        FileSystem.closeAll();


can this be in @BeforeMethod

phd3 · 2022-11-29T17:01:21Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/TestFileSystemCache.java

+
+        CreateFileSystemAndConsume(SplittableRandom random, int numUsers, int numGetCallsPerInvocation, FileSystemConsumer consumer)
+        {
+            this.random = random;


nit: null check since this is also exposed outside of the class

phd3 · 2022-11-29T17:03:57Z

cc @electrum

sopel39 · 2022-12-20T09:16:25Z

please rebase

phd3

Final set of comments - looks good to me

phd3 · 2023-01-09T21:44:33Z

lib/trino-hdfs/src/main/java/io/trino/hdfs/TrinoFileSystemCache.java

+        try {
+            fileSystemHolder = cache.compute(key, (k, currentFileSystemHolder) -> {
+                if (currentFileSystemHolder == null) {
+                    if (cacheSize.getAndUpdate(currentSize -> currentSize < maxSize ?


would be simpler to write the following

cacheSize.getAndUpdate(currentSize -> Math.min(currentSize + 1, maxSize)) == maxSize
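The idea behind this suggestion is that `getAndUpdate` returns the value from before the update, so a single atomic call can both reserve a slot and report whether the cache was already full. A minimal sketch, assuming the counter is an `AtomicInteger` (the PR's actual field type is not shown here); `BoundedCounter`, `tryReserve`, and `release` are hypothetical names:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper illustrating a capped counter; not code from the PR.
final class BoundedCounter
{
    private final AtomicInteger size = new AtomicInteger();
    private final int maxSize;

    BoundedCounter(int maxSize)
    {
        this.maxSize = maxSize;
    }

    // Reserves one slot. Returns false, leaving the count unchanged, when the limit is reached.
    boolean tryReserve()
    {
        // The update never exceeds maxSize; the returned value is the count before the update.
        int previous = size.getAndUpdate(current -> Math.min(current + 1, maxSize));
        return previous < maxSize;
    }

    // Gives a slot back, never dropping below zero.
    void release()
    {
        size.getAndUpdate(current -> Math.max(current - 1, 0));
    }

    int size()
    {
        return size.get();
    }
}
```

A caller would typically treat a `false` result as "cache is full" and fail the request instead of inserting a new entry.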

phd3 · 2023-01-10T00:17:56Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/BenchmarkGetFileSystem.java

+    public static class BenchmarkData
+    {
+        @Param({"10", "100", "1000"})
+        private int numUsers;


nit: userCount, threadCount, getCallsPerInvocation to avoid abbreviations;

phd3 · 2023-01-10T00:24:41Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/BenchmarkGetFileSystem.java

+                throws IOException
+        {
+            TrinoFileSystemCache.INSTANCE.closeAll();
+            executor.shutdown();


shutdownNow() as we do not need to wait here
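As background for this comment: `shutdown()` stops accepting new tasks but lets queued ones run, while `shutdownNow()` interrupts running tasks and hands back the tasks that never started. A small sketch of that behavior, using a hypothetical `ShutdownDemo` class that is not part of the benchmark:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class; not part of BenchmarkGetFileSystem.
final class ShutdownDemo
{
    private ShutdownDemo() {}

    // Queues tasks behind one blocking task and returns how many shutdownNow() hands back unstarted.
    static int pendingAfterShutdownNow(int queuedTasks)
    {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        executor.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000);
            }
            catch (InterruptedException e) {
                // Interrupted by shutdownNow(): finish immediately.
            }
        });
        for (int i = 0; i < queuedTasks; i++) {
            executor.submit(() -> {});
        }
        try {
            // Make sure the single worker is busy before shutting down.
            started.await();
            return executor.shutdownNow().size();
        }
        catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In a teardown method there is usually nothing worth waiting for, which is why interrupting and discarding leftover work is acceptable there.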

phd3 · 2023-01-10T00:26:06Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/BenchmarkGetFileSystem.java

+            throws InterruptedException, ExecutionException
+    {
+        List<Future<Void>> futures = data.executor.invokeAll(data.callableTasks);
+        for (Future<Void> fut : futures) {


this can be simplified (similar to the other comment)

phd3 · 2023-01-10T00:32:49Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/TestFSDataInputStreamTail.java

@@ -70,6 +70,7 @@ public void tearDown()
        closeAll(
                () -> fs.delete(new Path(tempRoot.toURI()), true),
                fs);
+        fs = null;


why is this a related change? Can we keep this in a separate commit?

jitheshtr · 2023-01-17

This came in as part of bringing the trino-testing-services dependency into trino-hdfs. The resource deallocation check added via ManageTestResources (in PR #15165) got activated in lib/trino-hdfs, resulting in this test failure. Adding fs = null; fixes this, as was the case with similar changes in the referenced PR. Can move this to a different commit.

phd3 · 2023-01-10T00:44:41Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/TestFileSystemCache.java

+        List<Future<Void>> futures = executor.invokeAll(callableTasks);
+        for (Future<Void> fut : futures) {
+            fut.get();
+        }


executor.invokeAll(callableTasks).forEach(MoreFutures::getFutureValue);
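The one-liner works because the helper hides the checked exceptions of `Future.get()`, which lets it be used as a method reference. A plain-Java sketch of the same shape, where `FutureValues.getUnchecked` is a hypothetical stand-in whose exception handling is an assumption and may differ from `MoreFutures.getFutureValue`:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Hypothetical helper class; a stand-in for a utility like MoreFutures.
final class FutureValues
{
    private FutureValues() {}

    // Waits for the future and rethrows failures as unchecked exceptions.
    static <T> T getUnchecked(Future<T> future)
    {
        try {
            return future.get();
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while waiting", e);
        }
        catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
    }

    // Runs all tasks on a temporary pool and returns their results in submission order.
    static <T> List<T> runAll(int threadCount, List<Callable<T>> tasks)
    {
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        try {
            return executor.invokeAll(tasks).stream()
                    .map(FutureValues::getUnchecked)
                    .collect(Collectors.toList());
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while invoking tasks", e);
        }
        finally {
            executor.shutdownNow();
        }
    }
}
```

Since `invokeAll` returns only after every task has completed, the subsequent `get()` calls do not block; they exist to surface any exception a task threw.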

phd3 · 2023-01-10T00:45:54Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/TestFileSystemCache.java

+        private final int numGetCallsPerInvocation;
+        private final FileSystemConsumer consumer;
+
+        private HdfsEnvironment environment = new HdfsEnvironment(


private static final

phd3 · 2023-01-10T00:46:09Z

lib/trino-hdfs/src/test/java/io/trino/hdfs/TestFileSystemCache.java

+            this.numUsers = numUsers;
+            this.numGetCallsPerInvocation = numGetCallsPerInvocation;


same comment about naming

cla-bot bot added the cla-signed label Jul 20, 2022

jitheshtr requested review from posulliv, electrum, findepi and phd3 July 20, 2022 06:46

github-actions bot added the tests:hive label Jul 20, 2022

jitheshtr marked this pull request as draft July 20, 2022 08:18

jitheshtr force-pushed the trinofscache_chm_lazy branch 2 times, most recently from e904560 to 2346c4f Compare July 21, 2022 05:51

jitheshtr force-pushed the trinofscache_chm_lazy branch from 2346c4f to 5dd03bb Compare November 17, 2022 19:46

jitheshtr mentioned this pull request Nov 17, 2022

Trino worker parallelism becomes 0 #15055

Open

jitheshtr force-pushed the trinofscache_chm_lazy branch 2 times, most recently from a343cd2 to a90461f Compare November 18, 2022 07:17

phd3 reviewed Nov 28, 2022

jitheshtr force-pushed the trinofscache_chm_lazy branch from a90461f to ab05c55 Compare November 28, 2022 20:20

jitheshtr marked this pull request as ready for review November 28, 2022 23:33

phd3 reviewed Nov 29, 2022

jitheshtr force-pushed the trinofscache_chm_lazy branch from ab05c55 to 2144552 Compare November 30, 2022 06:37

jitheshtr force-pushed the trinofscache_chm_lazy branch from 2144552 to 5fc7018 Compare December 14, 2022 08:14

martint force-pushed the master branch from 40dbb4f to 0d73d10 Compare December 19, 2022 20:15

jitheshtr force-pushed the trinofscache_chm_lazy branch from 5fc7018 to 53a3ed2 Compare December 21, 2022 08:07

phd3 approved these changes Jan 10, 2023

jitheshtr added 2 commits January 16, 2023 23:07

Add BenchmarkGetFileSystem to benchmark TrinoFileSystemCache

f7f8460

jitheshtr force-pushed the trinofscache_chm_lazy branch from 53a3ed2 to f7f8460 Compare January 17, 2023 07:33

phd3 approved these changes Jan 24, 2023

phd3 merged commit 3e40710 into trinodb:master Jan 24, 2023

github-actions bot added this to the 406 milestone Jan 24, 2023

colebow mentioned this pull request Jan 25, 2023

Add Trino 406 release notes #15625

Merged

jitheshtr deleted the trinofscache_chm_lazy branch February 3, 2023 22:11

findepi mentioned this pull request Apr 27, 2023

TestFileSystemCache is flaky #17158

Closed
