-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow to change compression codec with a session property #12647
Allow to change compression codec with a session property #12647
Conversation
f562673
to
49d6569
Compare
// This code assumes that if configuration is an instance of FileSystemFactory, a new instance is always supplied | ||
if (config instanceof FileSystemFactory) { | ||
checkArgument(config instanceof JobConf, "config is not an instance of JobConf: %s", config.getClass()); | ||
checkArgument(config.get(IS_NOT_A_COPY) == null, "config is not a copy"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't understand this. What's the is_not_a_copy flag, and what does the error message here mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, does this commit need to get squashed into add compression_codec session property for correctness purposes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the is_not_a_copy flag, and what does the error message here mean?
Basically that's a sanity check to verify that the configuration passed to this method is always a clean copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, does this commit need to get squashed into add compression_codec session property for correctness purposes?
Sure, i don't mind squashing it once the review is done. Just wanted to extract it to a separate commit so i can provide a more maningful explanation in the commit message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand. Here's what I think you're saying:
"is_not_a_copy" is your own made up flag to indicate if we are dealing with a new instance of a jobconf (a copy) or whether we have a shared instance.
- if "is_not_a_copy" is true, that means we have a shared instance
- if it's false then it's a new unique instance
- and if it's null/unset that also means it's unique because we assume we always get a new instance from FileSystemFactory? (if this is true, the error message for the checkArgument needs to be fixed).
I still don't understand why we care. If mutating a shared JobConf is a problem, why don't we create our own copy in the method? that seems much more standard. Also, why do we set it to true next? all that we do with it is set some properties in a helper method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. Decided to remove the assertion check, as it looks incredibly confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked in details, but I have a high-level question:
It looks making compression codec configurable by session property has some unexpected complications (probably due to JobConf
copy issue?). So, does it make sense to only have config properties for temporary tables? -- After all, we probably don't have to change this in a per query basis :)
We can also make Temporary Table write only use optimized ORC writer -- they are quite Presto internal so we should just fix to something works best. And hard code it to use the compression codec. (See, for example, the TempFileWriter
: https://github.com/prestodb/presto/blob/9b8b08cc762a1c33ec4fec9b65018a4d18c7e9d8/presto-hive/src/main/java/com/facebook/presto/hive/util/TempFileWriter.java )
Feel free to correct me if I overlook anything (or talk in person :) )
presto-hive/src/main/java/com/facebook/presto/hive/HiveSessionProperties.java
Outdated
Show resolved
Hide resolved
What I'm trying to do here - is to be able to set a different compression algorithm for a temporary table, without changing the compression algorithm for a regular table. I think that's what introduces the complication. Adding a session property on top of it doesn't change much. The only simpler approach i can think of is to pass a |
@arhimondr : What about a solution similar to this: wenleix@592f5bb ? It's a hacky POC just to demonstrate the idea. Basically the |
@wenleix I see the point. So the point is to make it work only for ORC. That might work. Though i find the solution with a session property for compression cleaner, and more useful in general. |
@arhimondr : It can also work fo |
presto-hive/src/main/java/com/facebook/presto/hive/util/ConfigurationUtils.java
Outdated
Show resolved
Hide resolved
49d6569
to
b5b82b6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Add compression_codec session property in Hive connector": LGTM.
public static JobConf configureCompression(Configuration config, HiveCompressionCodec compression) | ||
{ | ||
JobConf result = new JobConf(false); | ||
copy(config, result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious, what's the difference between doing this vs.
presto/presto-hive/src/test/java/com/facebook/presto/hive/benchmark/FileFormat.java
Line 421 in 6727da1
JobConf config = new JobConf(conf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JobConf config = new JobConf(conf);
This will re-read the configuration xml files in the classpath 😄
presto-hive/src/main/java/com/facebook/presto/hive/HiveWritableTableHandle.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- "Refactor HiveClientConfig"
- "Add temporary_table_compression_codec session property"
- "Simplify HiveWriterFactory#getFileExtension"
Looks good.
} | ||
catch (RuntimeException e) { | ||
throw new PrestoException(HIVE_UNSUPPORTED_FORMAT, "Failed to load compression codec: " + compressionCodecClass, e); | ||
catch (ReflectiveOperationException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean RuntimeException
caught in the original code is too broad? (i.e. it should always catch ReflectiveOperationException
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. I manually checked the constructors, and those constructors don't throw any exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Add workaround for FileSystemFactory"
Looks good. Maybe just change the title to
Fix configureCompression method for FileSystemFactory
"Add workaround" sounds like a hack :) .
JobConf result = new JobConf(false); | ||
copy(config, result); | ||
JobConf result; | ||
// Workaround for https://github.com/prestodb/presto-hadoop-apache2/commit/aa3a59101cb14659a96400a3943cd1bb740399b9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we usually don't have commit in code comment? What about the following (feel free to modify it)
FileSystemFactory is used to hack around the abuse of Configuration as a
cache for FileSystem. See FileSystemFactory class for more details.
It is caller's responsibility to create a copy if FileSystemFactory is used.
And add javadoc comment to FileSystemFactory
(perhaps just copy the commit message from prestodb/presto-hadoop-apache2@aa3a591)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, i aggree that using commit ids is not what we do. I changed the commit message.
And add javadoc comment to FileSystemFactory (perhaps just copy the commit message from prestodb/presto-hadoop-apache2@aa3a591)
Do do that we need to create a PR to the presto-hadoop-apache2
, release the library and update it's version. Do we really think that the comment worth it?
b5b82b6
to
3a5d1f2
Compare
@wenleix , @rschlussel updated |
Rename hiveCompressionCodec to simply compressionCodec
And hive.temporary-table-compression-codec configuration property
3a5d1f2
to
785a503
Compare
Required by #12387