Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of instance-level download #113

Merged

Conversation

DanielaSchacherer
Copy link
Contributor

  • added support of instance-level download as described in Add support for download of single instances instead of whole series #97 (comment)
  • Open problem: we can not estimate / inform the user about download size as we only know the whole series' sizes
  • Open problem: especially the IDCClient.download_from_selection() now contains a lot of if-statements and performs a lot of tasks, which makes the code uneasy to read. But this might also be tackled later.
  • Added printing configurations for codespell hook.

@fedorov
Copy link
Member

fedorov commented Aug 15, 2024

Open problem: we can not estimate / inform the user about download size as we only know the whole series' sizes

We do have instance size in sm_instance_index. I do appreciate how it will be a lot of work to propagate that through the various download-related functionality and support progress reporting, sync, etc. Maybe this is the opportunity to revisit and improve/simplify those functions.

@DanielaSchacherer
Copy link
Contributor Author

Finally ready for a next review @fedorov. As you mentioned the bug you described in #125 was the reason for the main error that I received.

@@ -1314,6 +1343,17 @@ def _format_size(size_MB):
return f"{round(size_GB, 2)} GB"
return f"{round(size_MB, 2)} MB"

@staticmethod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's combine this with the previous function!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did that.

@DanielaSchacherer DanielaSchacherer marked this pull request as ready for review September 19, 2024 15:30
@DanielaSchacherer DanielaSchacherer marked this pull request as draft September 19, 2024 15:30
fedorov and others added 8 commits September 19, 2024 17:40
In the future, we can add convenience conversion util function that
could serve same purpose without being attached to a specific class
variable

parent aa021d5
author ds_93 <[email protected]> 1723195062 +0200
committer ds_93 <[email protected]> 1726760294 +0200

parent aa021d5
author ds_93 <[email protected]> 1723195062 +0200
committer ds_93 <[email protected]> 1726760288 +0200

BUG: fixed codespell complaint

BUG: fixed pylint errors

BUG: fixed pylint errors

BUG: fixed test

ENH: added printing configurations for codespell

added download size calculation for single instance download. Changed tqdm bar data to be displayed in GB/MB instead of bytes

ENH: enable downloading data in manifests from previous idc versions

ENH: add description for the previous versions index

ENH: fix error messages for items not identified in the current version

ENH: add clinical_index

also added checks for existence of the URLs containing remote indices

BUG: use trim to remove any extraneous spaces while parsing s3 url in manifest

ENH: simplify s3_url extraction

update to simplify to use clinical_index from idc-index

clarify wording of a section

BUG: remove notebook comitted by accident from Colab

BUG: fix viewer series selection parameter

SeriesInstanceUID changed to SeriesInstanceUIDs, see
https://docs.ohif.org/migration-guide/from-3p7-to-3p8#studyinstanceuid-in-the-url-param

ENH: simplify download and create destination directory if needed

Automatic creation of the destination directory mimics the behavior
of s5cmd and simplifies usage

DOC: fix explanation of the dirTemplate parameter

fixed bug

fixed bug second try

wip

ENH: update to IDC v19

ENH: upgrade idc-index-data and use query-based clinical_index

removed hardcoded idc-index-data version
This reverts commit eda4f82.
This reverts commit 111c01c01ac37ed9943ea5ecf762ebbcffb96a74.
Copy link
Member

@fedorov fedorov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just a couple of questions. Let's discuss when you are back next week!

with open(filepath, mode="wb") as file:
file.write(response.content)
setattr(self.__class__, index, pd.read_parquet(filepath))

# Join new index with main index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that here we just need the series_aws_url column from index, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would also need all the attributes necessary for building the hierarchy, but I fixed this in the respective SQL query and do not merge anything when adding a new index.
I am wondering why checks passed for commit bbef810, where they shouldn't in my opinion and also did not when I executed them manually.

idc_index/index.py Show resolved Hide resolved
@DanielaSchacherer
Copy link
Contributor Author

@fedorov It's working again now.

@DanielaSchacherer
Copy link
Contributor Author

Hello @vkt1414 :)
could you maybe briefly help me by letting me know what your intention was when you introduced the variable

s5cmd_sync_helper_df (df): helper df obtained after validation of manifest or filtering of selection, containing a minimum of "index_crdc_series_uuid", "s5cmd_cmd", "series_size_MB", "path" columns (source)

@vkt1414
Copy link
Collaborator

vkt1414 commented Oct 1, 2024

Hello @vkt1414 :) could you maybe briefly help me by letting me know what your intention was when you introduced the variable

s5cmd_sync_helper_df (df): helper df obtained after validation of manifest or filtering of selection, containing a minimum of "index_crdc_series_uuid", "s5cmd_cmd", "series_size_MB", "path" columns (source)

Hi @DanielaSchacherer !
while trying to support downloading data from a manifest from any previous idc versions, I realized that there were a lot of redundant steps to support s5cmd sync. So I saved the results of validation function into s5cmd_sync_helper_df with absolutely necessary columns to support s5cmd sync. Let me know what step is hindering you about sync, and I can try to help.

@DanielaSchacherer
Copy link
Contributor Author

Hi @DanielaSchacherer !
while trying to support downloading data from a manifest from any previous idc versions, I realized that there were a lot of redundant steps to support s5cmd sync. So I saved the results of validation function into s5cmd_sync_helper_df with absolutely necessary columns to support s5cmd sync. Let me know what step is hindering you about sync, and I can try to help.

Hi @vkt1414,
thank you for answering so quickly :) I just added instance-level download for pathology slides and as s5cmd_sync_helper_df is mandatory, I create it, though at least series_size_MB is not meaningful in this case.
I think I can savely replace this parameter with the instance_size_MB value, or do you disagree?

@vkt1414
Copy link
Collaborator

vkt1414 commented Oct 2, 2024

Hi @DanielaSchacherer !
while trying to support downloading data from a manifest from any previous idc versions, I realized that there were a lot of redundant steps to support s5cmd sync. So I saved the results of validation function into s5cmd_sync_helper_df with absolutely necessary columns to support s5cmd sync. Let me know what step is hindering you about sync, and I can try to help.

Hi @vkt1414, thank you for answering so quickly :) I just added instance-level download for pathology slides and as s5cmd_sync_helper_df is mandatory, I create it, though at least series_size_MB is not meaningful in this case. I think I can savely replace this parameter with the instance_size_MB value, or do you disagree?

I'll take a look at how you are implementing this feature, and get back to you!
Question..are you supporting instance download from previous IDC versions data as well?

@DanielaSchacherer
Copy link
Contributor Author

Hi @DanielaSchacherer !
while trying to support downloading data from a manifest from any previous idc versions, I realized that there were a lot of redundant steps to support s5cmd sync. So I saved the results of validation function into s5cmd_sync_helper_df with absolutely necessary columns to support s5cmd sync. Let me know what step is hindering you about sync, and I can try to help.

Hi @vkt1414, thank you for answering so quickly :) I just added instance-level download for pathology slides and as s5cmd_sync_helper_df is mandatory, I create it, though at least series_size_MB is not meaningful in this case. I think I can savely replace this parameter with the instance_size_MB value, or do you disagree?

I'll take a look at how you are implementing this feature, and get back to you! Question..are you supporting instance download from previous IDC versions data as well?

Hi @vkt1414, right now, it only supports what is in the sm_instance_index (which is only the current version).

We do not currently support instance-level manifests as input
to the command-line download tool. We also do not provide any mechanism
to generate instance-level manifests from the portal. Sync operation is
using series level size for estimating progress. Implementing support
for syncing instance-level manifests will take work, which would not be
justified: for now, I think it is safe to assume instance-level download
will only be invoked by passing SOPInstanceUID to the functions/download
tool, and with just a few instances/files, syncing those instead of copy
won't bring much benefit.
@fedorov fedorov marked this pull request as ready for review October 8, 2024 16:57
@fedorov fedorov merged commit 3fcc6b4 into ImagingDataCommons:main Oct 8, 2024
10 checks passed
@DanielaSchacherer DanielaSchacherer deleted the single_instance_download branch October 14, 2024 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants