Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zst inputs for vep #78

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

zst inputs for vep #78

wants to merge 3 commits into from

Conversation

sagehen03
Copy link
Contributor

@sagehen03 sagehen03 commented Feb 28, 2023

I've tested the changes in listVariants and in runVEP.sh and they seem to work as needed.

@@ -75,6 +75,7 @@ def main():
# output the variants as CSV part files
df.write \
.mode('overwrite') \
.option("compression", "org.apache.hadoop.io.compress.ZStandardCodec") \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the stage that provides inputs to VEP, we want it to leave zst files in s3.

Comment on lines +17 to +23
extension="${PART##*.}"
if [ "$extension" = "zst" ]
then
unzstd "$PART"
rm "$PART"
PART="${PART%.zst}"
fi
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like Bash goes all in on being terse at the expense of telling you what the heck is going on. I've tested this locally and it seems to get the value of whatever comes after the last . in PART. If that value is zst, then decompress the file and re-assign PART so that .zst is removed from the end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup this looks good, I had to do something similar, although I couldn't tell you where 🤔

@@ -45,7 +45,8 @@ class VepStage(implicit context: Context) extends Stage {
new BootstrapScript(clusterBootstrap),
new BootstrapScript(installScript)
),
stepConcurrency = 3
stepConcurrency = 3,
releaseLabel = ReleaseLabel("emr-6.7.0") // Need emr 6.1+ to read zstd files
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think this is actually necessary, but I don't think it hurts either?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) It won't hurt, and (2) in general I would like to slowly migrate things over so this is a good idea. It'll end up being a good marker for indicating roughly which bits have been switched to compression.

@psmadbec
Copy link
Contributor

psmadbec commented Mar 1, 2023

Once you test this let me know. There will be a second step at some point which is to build a temp compression job that will chew through the bottom-line data and save compressed data to a different bucket. Once we check timings and a few of the results we'll then swap out the data and land this guy et voila, all data from that point forward is compressed.

@sagehen03 sagehen03 marked this pull request as ready for review March 6, 2023 15:24
@sagehen03 sagehen03 requested a review from psmadbec March 6, 2023 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants