Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use md5sum to check download integrity with vdb-validate as fallback #5024

Merged
merged 21 commits into from
Apr 7, 2024
Merged

use md5sum to check download integrity with vdb-validate as fallback #5024

merged 21 commits into from
Apr 7, 2024

Conversation

suhrig
Copy link
Contributor

@suhrig suhrig commented Mar 1, 2024

PR checklist

As explained in ncbi/sra-tools#896, vdb-validate does not detect file corruption if the prefetched files do not contain MD5 checksums. It has happened to me many times that downloaded files turn out to be corrupt, if I use the option force_sratools_download (now --download_method sratools). What is worse is that extracting the files using fasterq-dump does not always result in an error even if the file is corrupt. It is even conceivable that the extracted FastQ file looks perfectly intact with only some bases or quality values being changed. As such, the error may go completely unnoticed.

This PR fixes this by (1) fetching the md5sum from the SRA Data Locator API and (2) performing a manual md5sum check. The current method, vdb-validate is only used anymore if the md5sum cannot be obtained from the API.

@suhrig
Copy link
Contributor Author

suhrig commented Mar 26, 2024

@Midnighter: Do you need any further input from my side before you can start with the review? The pull request is ready for review IMO.

Copy link
Contributor

@Midnighter Midnighter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for uncovering these problems and suggesting these changes. I see two main issues with the PR:

  1. We also need container images that provide the right dependencies.
  2. As you rightfully point out in your comment vdb-validate does not detect file corruption ncbi/sra-tools#896 (comment), it would be ideal to be able to check if md5sums are present before making another network request.

@suhrig
Copy link
Contributor Author

suhrig commented Mar 26, 2024

Thanks for all the useful feedback!

it would be ideal to be able to check if md5sums are present before making another network request.

I will nudge the developers once more about this. I agree this would be ideal.

@suhrig
Copy link
Contributor Author

suhrig commented Apr 1, 2024

Hi @Midnighter,

I have implemented your requested changes.

We also need container images that provide the right dependencies.

I no longer need jq, since I use grep to search for the valid checksum, which results in much simpler code than using jq would. So the only additional tool is curl, which is part of the sratools container already. Do we still need an additional container?

As you rightfully point out in your comment ncbi/sra-tools#896 (comment), it would be ideal to be able to check if md5sums are present before making another network request.

It turns out the maintainers had added feature to confirm that the output of vdb-validate is reliable. I am making use of this in my new code submission. curl is only used as fallback now.

Copy link
Contributor

@Midnighter Midnighter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I no longer need jq, since I use grep to search for the valid checksum, which results in much simpler code than using jq would. So the only additional tool is curl, which is part of the sratools container already. Do we still need an additional container?

No, that's fine then.

Changes look good to me! Really happy that you managed to avoid the curl call when it's not necessary.

My only nitpick is, that I prefer long form options in scripts, e.g., grep --fixed-strings rather than grep -F because it's typically much more clear what the intention is then. I tend to forget myself, what the short form means, so it might also help future you 😉. It's certainly fine to merge like this, too, though!

@suhrig
Copy link
Contributor Author

suhrig commented Apr 4, 2024

long form options in scripts

Sure, I will adapt the script accordingly!

@suhrig
Copy link
Contributor Author

suhrig commented Apr 4, 2024

I have implemented the suggestions. I also added an error message when the manual MD5 sum check failed, because I realized it may be confusing when the process just fails silently.

Do you know why the automated checks won't launch? They have been stuck in "awaiting approval" state from the start. I cannot merge the changes otherwise.

Copy link
Contributor

@Midnighter Midnighter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

auto-merge was automatically disabled April 5, 2024 17:26

Head branch was pushed to by a user without write access

@suhrig
Copy link
Contributor Author

suhrig commented Apr 5, 2024

I have satisfied the requirements of the failed editorconfig check. However, as a result of this the auto-merge feature was disabled and the PR is stuck in awaiting approval again. Can you kindly explain what you did to fix this so I can help myself in the future?

@Midnighter
Copy link
Contributor

You need to have certain permissions on the repository. I'm not exactly sure which role. Probably at least write permissions.

auto-merge was automatically disabled April 6, 2024 10:54

Head branch was pushed to by a user without write access

@suhrig suhrig requested a review from drpatelh as a code owner April 6, 2024 10:54
@suhrig
Copy link
Contributor Author

suhrig commented Apr 6, 2024

Hi @Midnighter. Sorry to bother you again. There was a hidden failed test which I did not notice, because all the ticks in the overview were green. A few MD5 sums of files which nf-test checked for integrity needed updating.

As a result of my new commit, the checks are in pending state again. :-/ I don't know why this happens or how to fix this. So I have to ask you (hopefully only) once more to work your magic? Thanks!

Also, GitHub added a new reviewer automatically. I did not do this.

@Midnighter
Copy link
Contributor

No worries, not your fault.

auto-merge was automatically disabled April 6, 2024 16:54

Head branch was pushed to by a user without write access

@suhrig
Copy link
Contributor Author

suhrig commented Apr 6, 2024

One more push please? I used long options for grep as you suggested, but not all implementations of grep support them, which is why a test failed. I had to revert to short options for grep, therefore.

@Midnighter
Copy link
Contributor

One more push please? I used long options for grep as you suggested, but not all implementations of grep support them, which is why a test failed. I had to revert to short options for grep, therefore.

It's so annoying when the accepted options for these standard tools are different between Mac's Unix and Linux. Or was it the conda version?

@suhrig
Copy link
Contributor Author

suhrig commented Apr 7, 2024

options for these standard tools are different between Mac's Unix and Linux. Or was it the conda version?

It looks like the busybox implementation is used for the tests.

The tests have completed successfully. 🥳 Do you have permission to merge the changes or shall we wait for the review from @drpatelh? I think GitHub added him as a reviewer when I modified the MD5 sums of the nf-test files, presumably because he is the owner of those files.

@Midnighter Midnighter added this pull request to the merge queue Apr 7, 2024
@Midnighter
Copy link
Contributor

Strange, I had set it to auto merge already... not sure why that didn't happen.

Merged via the queue into nf-core:master with commit 1fc29f9 Apr 7, 2024
15 checks passed
@suhrig
Copy link
Contributor Author

suhrig commented Apr 7, 2024

Many thanks for all your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants