-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zst inputs for vep #78
base: master
Are you sure you want to change the base?
Conversation
@@ -75,6 +75,7 @@ def main(): | |||
# output the variants as CSV part files | |||
df.write \ | |||
.mode('overwrite') \ | |||
.option("compression", "org.apache.hadoop.io.compress.ZStandardCodec") \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the stage that provides inputs to VEP, we want it to leave zst files in s3.
extension="${PART##*.}" | ||
if [ "$extension" = "zst" ] | ||
then | ||
unzstd "$PART" | ||
rm "$PART" | ||
PART="${PART%.zst}" | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like Bash goes all in on being terse at the expense of telling you what the heck is going on. I've tested this locally and it seems to get the value of whatever comes after the last . in PART. If that value is zst, then decompress the file and re-assign PART so that .zst is removed from the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup this looks good, I had to do something similar, although I couldn't tell you where 🤔
@@ -45,7 +45,8 @@ class VepStage(implicit context: Context) extends Stage { | |||
new BootstrapScript(clusterBootstrap), | |||
new BootstrapScript(installScript) | |||
), | |||
stepConcurrency = 3 | |||
stepConcurrency = 3, | |||
releaseLabel = ReleaseLabel("emr-6.7.0") // Need emr 6.1+ to read zstd files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think this is actually necessary, but I don't think it hurts either?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1) It won't hurt, and (2) in general I would like to slowly migrate things over so this is a good idea. It'll end up being a good marker for indicating roughly which bits have been switched to compression.
Once you test this let me know. There will be a second step at some point which is to build a temp compression job that will chew through the bottom-line data and save compressed data to a different bucket. Once we check timings and a few of the results we'll then swap out the data and land this guy et voila, all data from that point forward is compressed. |
I've tested the changes in listVariants and in runVEP.sh and they seem to work as needed.