
LORIS crawlers used to crawl several of the CONP datasets #102

Conversation

cmadjar

@cmadjar cmadjar commented Apr 27, 2021

This PR adds the pipelines used to generate several of the CONP datasets hosted in LORIS:

@codecov

codecov bot commented Apr 27, 2021

Codecov Report

Merging #102 (5fb0adf) into master (d4d94c3) will decrease coverage by 1.57%.
The diff coverage is 29.57%.


@@            Coverage Diff             @@
##           master     #102      +/-   ##
==========================================
- Coverage   81.37%   79.79%   -1.58%     
==========================================
  Files          57       60       +3     
  Lines        4644     4792     +148     
==========================================
+ Hits         3779     3824      +45     
- Misses        865      968     +103     
Impacted Files Coverage Δ
datalad_crawler/pipelines/loris_bids_export.py 25.00% <25.00%> (ø)
datalad_crawler/pipelines/loris_data_releases.py 31.11% <31.11%> (ø)
datalad_crawler/pipelines/loris.py 34.14% <34.14%> (ø)
datalad_crawler/pipelines/gh.py 12.82% <0.00%> (+0.32%) ⬆️
datalad_crawler/pipelines/tests/test_openfmri.py 89.28% <0.00%> (+0.79%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4d94c3...5fb0adf.

@cmadjar
Author

cmadjar commented Apr 27, 2021

Oooops. Very sorry, I meant to open this PR against the CONP-PCNO fork of datalad-crawler...

I was thinking of sending you an email to ask you if you would be interested in adding those crawlers to your code. Well, now you know my evil plan, haha ;).

Anyway, let me know if you would be interested. If not, no worries, we'll keep those in the CONP-PCNO fork.

Thank you!

@cmadjar cmadjar changed the title LORIS crawlers used to crawl PREVENT-AD and multicenter-phantom LORIS crawlers used to crawl several of the CONP datasets Apr 27, 2021
Member

@yarikoptic yarikoptic left a comment


Thanks for sharing! Great to hear that the crawler serves you (well?). In general, I would not mind having this PR merged to enrich the collection of pipelines -- it might be easier for others to find later on when crafting something similar.
But I wonder if you would consider improving upon it?

BTW -- is there some "public" LORIS instance on which at least some of those pipelines could be tested on some tiny dataset?

"exclude=README.md and exclude=DATS.json and exclude=logo.png"
" and exclude=.datalad/providers/loris.cfg"
" and exclude=.datalad/crawl/crawl.cfg"
" and exclude=*scans.json"
Member

Note that those might be leaking scanning dates, which are considered sensitive data.

Author

These are the descriptions of the scans.tsv files, so in theory there is no data in them. I had to track the scans.json files in git because they all had the same hash, which led to tons of URLs for a given key and slowed down the download process of those tiny files. (Related to datalad/datalad#5429)

Member

ha ha -- I mixed them up again -- indeed it is a good fit for plain git.

If you are involved with producing those BIDS datasets, there could be a single scans.json at the top level, which would then be inherited by each subject/session, thus avoiding the need to duplicate them. Unfortunately, the bids-specification does not explicitly mention that, so I filed bids-standard/bids-specification#789 to possibly improve upon that.
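
For illustration, the inherited layout being suggested would look roughly like this (subject/session names are hypothetical, and whether the specification actually sanctions this is exactly what #789 is about):

    ds/
      scans.json                        <- single top-level sidecar, inherited below
      sub-01/
        ses-01/
          sub-01_ses-01_scans.tsv
      sub-02/
        ses-01/
          sub-02_ses-01_scans.tsv

instead of a *_scans.json sitting next to every *_scans.tsv.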

def __init__(self, apibase=None, annex=None):
    self.apibase = apibase
    self.meta = {}
    self.repo = annex.repo
Member

with the common code for finalize, the constructor, and the majority of the pipeline, I think you could have avoided a good amount of code duplication by establishing some BaseLorisExtractor to reside e.g. in loris.py and serve as a base class. Then derived classes would provide only the critical difference (the configuration for annex; these days that would better be done via config procedures, but I guess it is ok as is for here/the antique crawler).
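
A minimal sketch of that suggestion, assuming the constructor and finalize() bodies are shared verbatim across the three modules (the derived class name and its body are hypothetical):

    # in loris.py
    class BaseLorisExtractor(object):
        """Common constructor and finalize logic for the LORIS extractors."""
        def __init__(self, apibase=None, annex=None):
            self.apibase = apibase
            self.meta = {}
            self.repo = annex.repo

        def finalize(self):
            # the finalize() body currently duplicated across the
            # three pipeline modules would move here unchanged
            pass

    # e.g. in loris_bids_export.py
    class LorisBIDSExportExtractor(BaseLorisExtractor):
        # override only what actually differs, e.g. how the API
        # response is mapped onto files to crawl
        pass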

Author

Very true! I did them one at a time when I needed them, and without much time to do proper programming ;). I'll look into it.

@yarikoptic
Member

yarikoptic commented Apr 27, 2021

Seeing the code and the kaggle logger being used -- is this work incorporating the earlier #67 and #13? (sorry, I missed it originally)

@cmadjar
Author

cmadjar commented Apr 27, 2021

@yarikoptic Thank you! I am definitely happy to improve the code :-).

The code here is indeed based on #13. I did not see the #67 PR. I was told the code in #13 was used for PREVENT-AD, so I reused it for the other LORIS instances of CONP datasets. I will check with @mathdugre to see the difference between the crawlers.

The datasets listed in the descriptions are all open (except the PREVENT-AD registered ones, which are open only to PIs). However, none of the datasets are small, unfortunately... Were you thinking of manual testing or automated testing? I could ask around to see what is possible.

Thank you!

@yarikoptic
Member

> Were you thinking of manual testing or automated testing? I could ask around to see what is possible.

Automated would be the ultimate goal. If no suitable smallish dataset is out there, I guess there could be some include or exclude option to limit to some subset of files to make the test run succinct.

@cmadjar
Author

cmadjar commented Apr 28, 2021

For another LORIS study, I have to write another crawler that would crawl multiple LORIS API endpoints (so multiple URLs). Instead of creating one crawler for each API endpoint, I think I could modify the loris.py crawler so that users can provide a comma-separated list of endpoints to crawl from the LORIS instance.

I am a little unfamiliar with the return statement of a pipeline, so I don't know how I should code that return statement.

Let's say I have:

apibase     # base URL for the LORIS API
endpoint_1  # endpoint number 1 of the API (to have the complete URL, would need to join apibase and endpoint_1)
endpoint_2  # endpoint number 2 of the API (to have the complete URL, would need to join apibase and endpoint_2)
lorisapi_1  # LorisAPIExtractor(apibase, annex, endpoint_1) -- extractor for the first URL
lorisapi_2  # LorisAPIExtractor(apibase, annex, endpoint_2) -- extractor for the second URL

*the provided endpoints return a dictionary that includes a list of files to crawl, which will be extracted by the LorisAPIExtractor function

After looking at other templates, my first instinct would be something like this:

return [
        [
            crawl_url(join(apibase, endpoint_1)),
            lorisapi_1,
            annex,
            [
                crawl_url(join(apibase, endpoint_2)),
                lorisapi_2,
                annex,
            ],
        ],
        annex.finalize(),
        lorisapi_1.finalize(),
        lorisapi_2.finalize()
    ]

But I don't trust my first instinct ;)
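
For reference, a minimal sketch of the sibling-sub-pipeline arrangement, assuming datalad-crawler's convention that a nested list is run as a sub-pipeline and the crawl_url node import used by the other pipelines in this repo (the comma-separated endpoints parameter and its parsing are hypothetical):

    from os.path import join
    from datalad_crawler.nodes.crawl_url import crawl_url

    def pipeline(apibase=None, annex=None, endpoints=None):
        endpoint_list = endpoints.split(",")
        extractors = [LorisAPIExtractor(apibase, annex, ep) for ep in endpoint_list]
        # one sibling sub-pipeline per endpoint, rather than nesting
        # the second crawl inside the first
        subpipes = [
            [crawl_url(join(apibase, ep)), extractor, annex]
            for ep, extractor in zip(endpoint_list, extractors)
        ]
        return subpipes + [annex.finalize()] + [e.finalize() for e in extractors]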

@cmadjar
Author

cmadjar commented Apr 30, 2021

Just a little note to tell you to discard my last comment. I figured it out :).

Ultimately, I think all LORIS PRs will be closed and I will send a new one with improved crawlers. Might just take some time though. Will definitely keep you posted.

@yarikoptic
Member

My slowness was rewarded! ;-) No rush on my end, but I might come in handy reviewing an earlier version of the RF.

@cmadjar
Author

cmadjar commented Apr 30, 2021

To be continued on #103
