Allow to change compression codec with a session property #12647

Merged

arhimondr
Member

@arhimondr arhimondr commented Apr 11, 2019

Required by #12387

@arhimondr arhimondr changed the title from "Allow to change compression codec with a session property" to "[WIP] Allow to change compression codec with a session property" Apr 11, 2019
@arhimondr arhimondr force-pushed the change-compression-for-temporary-table branch from f562673 to 49d6569 Compare April 11, 2019 23:49
@arhimondr arhimondr changed the title from "[WIP] Allow to change compression codec with a session property" to "Allow to change compression codec with a session property" Apr 11, 2019
// This code assumes that if configuration is an instance of FileSystemFactory, a new instance is always supplied
if (config instanceof FileSystemFactory) {
checkArgument(config instanceof JobConf, "config is not an instance of JobConf: %s", config.getClass());
checkArgument(config.get(IS_NOT_A_COPY) == null, "config is not a copy");
Contributor

I don't understand this. What's the is_not_a_copy flag, and what does the error message here mean?

Contributor

also, does this commit need to get squashed into add compression_codec session property for correctness purposes?

Member Author

What's the is_not_a_copy flag, and what does the error message here mean?

Basically that's a sanity check to verify that the configuration passed to this method is always a clean copy.

Member Author

also, does this commit need to get squashed into add compression_codec session property for correctness purposes?

Sure, I don't mind squashing it once the review is done. I just wanted to extract it into a separate commit so I can provide a more meaningful explanation in the commit message.

Contributor

I'm not sure I understand. Here's what I think you're saying:
"is_not_a_copy" is your own made up flag to indicate if we are dealing with a new instance of a jobconf (a copy) or whether we have a shared instance.

  1. if "is_not_a_copy" is true, that means we have a shared instance
  2. if it's false then it's a new unique instance
  3. and if it's null/unset that also means it's unique because we assume we always get a new instance from FileSystemFactory? (if this is true, the error message for the checkArgument needs to be fixed).

I still don't understand why we care. If mutating a shared JobConf is a problem, why don't we create our own copy in the method? That seems much more standard. Also, why do we set it to true next? All we do with it is set some properties in a helper method.

Member Author

Discussed offline. Decided to remove the assertion check, as it looks incredibly confusing.

Contributor

@wenleix wenleix left a comment

I haven't looked in detail, but I have a high-level question:

It looks like making the compression codec configurable by session property has some unexpected complications (probably due to the JobConf copy issue?). So, does it make sense to only have config properties for temporary tables? After all, we probably don't have to change this on a per-query basis :)

We could also make temporary table writes use only the optimized ORC writer. They are quite Presto-internal, so we should just fix them to whatever works best and hard-code the compression codec. (See, for example, the TempFileWriter: https://github.com/prestodb/presto/blob/9b8b08cc762a1c33ec4fec9b65018a4d18c7e9d8/presto-hive/src/main/java/com/facebook/presto/hive/util/TempFileWriter.java )

Feel free to correct me if I've overlooked anything (or we can talk in person :) )

@arhimondr
Member Author

So, does it make sense to only have config properties for temporary tables?

What I'm trying to do here is set a different compression algorithm for a temporary table without changing the compression algorithm for a regular table. I think that's what introduces the complication. Adding a session property on top of it doesn't change much.

The only simpler approach I can think of is to pass a CompressionCodec to HdfsEnvironment#getConfiguration. But that seems like a weird, counterintuitive API.

@wenleix
Contributor

wenleix commented Apr 15, 2019

@arhimondr : What about a solution similar to this: wenleix@592f5bb ?

It's a hacky POC just to demonstrate the idea. Basically, HiveWritableTableHandle carries an extra bit indicating whether the table is temporary, and this flag is passed down to HiveWriterFactory -> HiveFileWriterFactory#createFileWriter. So if it's a temporary file, the writer is hard-coded to SNAPPY (and in the future to ZSTANDARD).
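
The idea described above can be sketched roughly as follows. This is a minimal illustration only, not the code from the linked POC: `SketchCodec`, `WriterCodecSketch`, and `selectCodec` are hypothetical names, and the real writer factories take many more parameters.

```java
// Hypothetical stand-in for a compression codec enum; not Presto's HiveCompressionCodec.
enum SketchCodec
{
    NONE, SNAPPY, GZIP
}

// Hypothetical helper showing the "temporary flag decides the codec" idea.
final class WriterCodecSketch
{
    private WriterCodecSketch() {}

    // Temporary tables always get a fixed codec; regular tables keep
    // whatever codec the session or config asked for.
    static SketchCodec selectCodec(boolean temporaryTable, SketchCodec requestedCodec)
    {
        return temporaryTable ? SketchCodec.SNAPPY : requestedCodec;
    }

    public static void main(String[] args)
    {
        System.out.println(selectCodec(true, SketchCodec.GZIP));  // SNAPPY
        System.out.println(selectCodec(false, SketchCodec.GZIP)); // GZIP
    }
}
```

The trade-off is the one raised in this thread: the codec choice bypasses the Hadoop JobConf entirely, but it only applies to writers that accept a codec directly.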

@arhimondr
Member Author

@wenleix I see the point: make it work only for ORC. That might work, though I find the solution with a session property for compression cleaner and more useful in general.

@wenleix
Contributor

wenleix commented Apr 15, 2019

@arhimondr : It can also work for RCFile, as long as it has "native" Presto optimized writer support :) . We can still have the session property; we just avoid going through the Hadoop JobConf stuff :)

@arhimondr arhimondr force-pushed the change-compression-for-temporary-table branch from 49d6569 to b5b82b6 Compare April 17, 2019 20:24
Contributor

@wenleix wenleix left a comment

"Add compression_codec session property in Hive connector": LGTM.

public static JobConf configureCompression(Configuration config, HiveCompressionCodec compression)
{
JobConf result = new JobConf(false);
copy(config, result);
Contributor

Curious, what's the difference between doing this vs.

Member Author

JobConf config = new JobConf(conf);

This will re-read the configuration xml files in the classpath 😄
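
The copy-then-configure pattern used by configureCompression can be illustrated without Hadoop on the classpath. The sketch below is an assumption-laden stand-in, not the PR's code: a plain `Map<String, String>` plays the role of the Hadoop configuration, and `CODEC_KEY` is a made-up property name. It only shows that the caller's shared instance is left untouched.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a Map stands in for a Hadoop Configuration/JobConf.
final class CompressionConfigSketch
{
    // Made-up property key, for illustration only.
    static final String CODEC_KEY = "example.compression.codec";

    private CompressionConfigSketch() {}

    // Copies the incoming settings first, then changes only the copy,
    // so a configuration shared between callers is never mutated.
    static Map<String, String> configureCompression(Map<String, String> config, String codec)
    {
        Map<String, String> result = new HashMap<>(config);
        result.put(CODEC_KEY, codec);
        return result;
    }

    public static void main(String[] args)
    {
        Map<String, String> shared = new HashMap<>();
        shared.put(CODEC_KEY, "GZIP");

        Map<String, String> forTemporaryTable = configureCompression(shared, "SNAPPY");

        System.out.println(shared.get(CODEC_KEY));            // GZIP
        System.out.println(forTemporaryTable.get(CODEC_KEY)); // SNAPPY
    }
}
```

In the real method the copy is made entry by entry into a `new JobConf(false)` for the reason given above: constructing the JobConf from the existing configuration would re-read the XML files on the classpath.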

Contributor

@wenleix wenleix left a comment

  • "Refactor HiveClientConfig"
  • "Add temporary_table_compression_codec session property"
  • "Simplify HiveWriterFactory#getFileExtension"

Looks good.

}
catch (RuntimeException e) {
throw new PrestoException(HIVE_UNSUPPORTED_FORMAT, "Failed to load compression codec: " + compressionCodecClass, e);
catch (ReflectiveOperationException e) {
Contributor

Does that mean RuntimeException caught in the original code is too broad? (i.e. it should always catch ReflectiveOperationException)

Member Author

I think so. I manually checked the constructors, and those constructors don't throw any exceptions.
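
For reference, here is a small stand-alone sketch of reflective instantiation that catches only ReflectiveOperationException. It is not the PR's code: `CodecLoaderSketch` and `newInstance` are hypothetical names, and IllegalArgumentException stands in for PrestoException. Note that an exception thrown by the constructor itself would surface as InvocationTargetException, which is also a ReflectiveOperationException.

```java
import java.lang.reflect.Constructor;

// Hypothetical sketch of loading a class by name with a narrow catch.
final class CodecLoaderSketch
{
    private CodecLoaderSketch() {}

    static Object newInstance(String className)
    {
        try {
            Class<?> clazz = Class.forName(className);
            Constructor<?> constructor = clazz.getConstructor();
            return constructor.newInstance();
        }
        catch (ReflectiveOperationException e) {
            // Covers ClassNotFoundException, NoSuchMethodException,
            // InstantiationException, IllegalAccessException and
            // InvocationTargetException, but not unrelated runtime errors.
            throw new IllegalArgumentException("Failed to load compression codec: " + className, e);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(newInstance("java.lang.StringBuilder").getClass().getName());
        try {
            newInstance("no.such.Codec");
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```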

Contributor

@wenleix wenleix left a comment

"Add workaround for FileSystemFactory"

Looks good. Maybe just change the title to

Fix configureCompression method for FileSystemFactory

"Add workaround" sounds like a hack :) .

JobConf result = new JobConf(false);
copy(config, result);
JobConf result;
// Workaround for https://github.com/prestodb/presto-hadoop-apache2/commit/aa3a59101cb14659a96400a3943cd1bb740399b9
Contributor

I think we usually don't reference commits in code comments? What about the following (feel free to modify it):

FileSystemFactory is used to hack around the abuse of Configuration as a
cache for FileSystem. See FileSystemFactory class for more details.

It is caller's responsibility to create a copy if FileSystemFactory is used.

And add javadoc comment to FileSystemFactory (perhaps just copy the commit message from prestodb/presto-hadoop-apache2@aa3a591)

Member Author

Yeah, I agree that using commit ids is not what we do. I changed the commit message.

And add javadoc comment to FileSystemFactory (perhaps just copy the commit message from prestodb/presto-hadoop-apache2@aa3a591)

To do that we need to create a PR to presto-hadoop-apache2, release the library, and update its version. Do we really think the comment is worth it?

@wenleix wenleix assigned arhimondr and unassigned wenleix Apr 19, 2019
@arhimondr arhimondr force-pushed the change-compression-for-temporary-table branch from b5b82b6 to 3a5d1f2 Compare April 22, 2019 20:42
@arhimondr arhimondr assigned rschlussel and unassigned arhimondr and rschlussel Apr 22, 2019
@arhimondr
Member Author

@wenleix , @rschlussel updated

@arhimondr arhimondr force-pushed the change-compression-for-temporary-table branch from 3a5d1f2 to 785a503 Compare April 29, 2019 15:57
@arhimondr arhimondr merged commit 0dbcfe8 into prestodb:master May 1, 2019
@arhimondr arhimondr deleted the change-compression-for-temporary-table branch May 1, 2019 14:58