
LORIS crawlers used to crawl several of the CONP datasets #102

Conversation

cmadjar

@cmadjar cmadjar commented Apr 27, 2021

This PR adds the pipelines used to generate several of the CONP datasets hosted in LORIS:

@codecov

codecov bot commented Apr 27, 2021

Codecov Report

Merging #102 (5fb0adf) into master (d4d94c3) will decrease coverage by 1.57%.
The diff coverage is 29.57%.


@@            Coverage Diff             @@
##           master     #102      +/-   ##
==========================================
- Coverage   81.37%   79.79%   -1.58%     
==========================================
  Files          57       60       +3     
  Lines        4644     4792     +148     
==========================================
+ Hits         3779     3824      +45     
- Misses        865      968     +103     
Impacted Files Coverage Δ
datalad_crawler/pipelines/loris_bids_export.py 25.00% <25.00%> (ø)
datalad_crawler/pipelines/loris_data_releases.py 31.11% <31.11%> (ø)
datalad_crawler/pipelines/loris.py 34.14% <34.14%> (ø)
datalad_crawler/pipelines/gh.py 12.82% <0.00%> (+0.32%) ⬆️
datalad_crawler/pipelines/tests/test_openfmri.py 89.28% <0.00%> (+0.79%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4d94c3...5fb0adf.

@cmadjar
Author

cmadjar commented Apr 27, 2021

Oooops. Very sorry, I meant to open this PR against the CONP-PCNO fork of datalad-crawler...

I was thinking of sending you an email to ask you if you would be interested in adding those crawlers to your code. Well, now you know my evil plan, haha ;).

Anyway, let me know if you would be interested. If not, no worries, we'll keep those in the CONP-PCNO fork.

Thank you!

@cmadjar cmadjar changed the title LORIS crawlers used to crawl PREVENT-AD and multicenter-phantom LORIS crawlers used to crawl several of the CONP datasets Apr 27, 2021
Member

@yarikoptic yarikoptic left a comment


Thanks for sharing! Great to hear that the crawler serves you (well?). In general, I would not mind having this PR merged to enrich the collection of pipelines -- it might be easier for others to find later on when crafting something similar.
But I wonder if you would consider improving upon it?

BTW -- is there some "public" LORIS instance on which at least some of those pipelines could be tested on some tiny dataset?

"exclude=README.md and exclude=DATS.json and exclude=logo.png"
" and exclude=.datalad/providers/loris.cfg"
" and exclude=.datalad/crawl/crawl.cfg"
" and exclude=*scans.json"
Member

Note that those might be leaking scanning dates, which are considered sensitive data.

Author

These are the descriptions of the scans.tsv files, so in theory there is no data in them. I had to track the scans.json files in git because they all had the same hash, which led to tons of URLs for a given key and slowed down the download process of those tiny files. (Related to datalad/datalad#5429)

Member

ha ha -- I mixed them up again -- indeed it is a good fit for plain git.

If you are involved with producing those BIDS datasets, there could be a single scans.json at the top level, which would then be inherited by each subject/session, thus avoiding the need to duplicate them. Unfortunately, the bids-specification does not explicitly mention that, so I filed bids-standard/bids-specification#789 to possibly improve upon that.
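
For illustration, the inherited layout being suggested would look roughly like this (subject/session names are hypothetical, and whether the specification actually sanctions this is exactly what #789 is about):

    ds/
      scans.json                        <- single top-level sidecar, inherited below
      sub-01/
        ses-01/
          sub-01_ses-01_scans.tsv
      sub-02/
        ses-01/
          sub-02_ses-01_scans.tsv

instead of a *_scans.json sitting next to every *_scans.tsv.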

def __init__(self, apibase=None, annex=None):
    self.apibase = apibase
    self.meta = {}
    self.repo = annex.repo
Member

with the common code for finalize, the constructor, and the majority of the pipeline, I think you could have avoided a good amount of code duplication by establishing some BaseLorisExtractor to reside e.g. in loris.py and serve as a base class. Then derived classes would provide only the critical difference (the configuration for annex; these days that would better be done via config procedures, but I guess it is ok as is for here/the antique crawler).
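
A minimal sketch of that suggestion, assuming the constructor and finalize() bodies are shared verbatim across the three modules (the derived class name and its body are hypothetical):

    # in loris.py
    class BaseLorisExtractor(object):
        """Common constructor and finalize logic for the LORIS extractors."""
        def __init__(self, apibase=None, annex=None):
            self.apibase = apibase
            self.meta = {}
            self.repo = annex.repo

        def finalize(self):
            # the finalize() body currently duplicated across the
            # three pipeline modules would move here unchanged
            pass

    # e.g. in loris_bids_export.py
    class LorisBIDSExportExtractor(BaseLorisExtractor):
        # override only what actually differs, e.g. how the API
        # response is mapped onto files to crawl
        pass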

Author

Very true! I did them one at a time when I needed them, and without much time to do proper programming ;). I'll look into it.

@yarikoptic
Member

yarikoptic commented Apr 27, 2021

Seeing the code and the kaggle logger being used -- is this work incorporating the earlier #67 and #13? (sorry, I missed it originally)

@cmadjar
Author

cmadjar commented Apr 27, 2021

@yarikoptic Thank you! I am definitely happy to improve the code :-).

The code here is indeed based on #13. I did not see the #67 PR. I was told the code in #13 was used for PREVENT-AD, so I reused it for the other LORIS instances of CONP datasets. I will check with @mathdugre to see the difference between the crawlers.

The datasets listed in the descriptions are all open (except the PREVENT-AD registered ones, which are open only to PIs). However, none of the datasets are small, unfortunately... Were you thinking of manual testing or automated testing? I could ask around to see what is possible.

Thank you!

@yarikoptic
Member

> Were you thinking of manual testing or automated testing? I could ask around to see what is possible.

Automated would be the ultimate goal. If no suitable smallish dataset is out there, I guess there could be some include or exclude option to limit to some subset of files to make the test run succinct.

@cmadjar
Author

cmadjar commented Apr 28, 2021

For another LORIS study, I have to write another crawler that would crawl multiple LORIS API endpoints (so multiple URLs). Instead of creating one crawler for each API endpoint, I think I could modify the loris.py crawler so that users can provide a comma-separated list of endpoints to crawl from the LORIS instance.

I am a little unfamiliar with the return statement of a pipeline, so I don't know how I should code that return statement.

Let's say I have:

apibase     # base URL for the LORIS API
endpoint_1  # endpoint number 1 of the API (to have the complete URL, would need to join apibase and endpoint_1)
endpoint_2  # endpoint number 2 of the API (to have the complete URL, would need to join apibase and endpoint_2)
lorisapi_1  # LorisAPIExtractor(apibase, annex, endpoint_1) -- extractor for the first URL
lorisapi_2  # LorisAPIExtractor(apibase, annex, endpoint_2) -- extractor for the second URL

*the provided endpoints return a dictionary that includes a list of files to crawl, which will be extracted by the LorisAPIExtractor function

After looking at other templates, my first instinct would be something like this:

return [
        [
            crawl_url(join(apibase, endpoint_1)),
            lorisapi_1,
            annex,
            [
                crawl_url(join(apibase, endpoint_2)),
                lorisapi_2,
                annex,
            ],
        ],
        annex.finalize(),
        lorisapi_1.finalize(),
        lorisapi_2.finalize()
    ]

But I don't trust my first instinct ;)
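
For reference, a minimal sketch of the sibling-sub-pipeline arrangement, assuming datalad-crawler's convention that a nested list is run as a sub-pipeline and the crawl_url node import used by the other pipelines in this repo (the comma-separated endpoints parameter and its parsing are hypothetical):

    from os.path import join
    from datalad_crawler.nodes.crawl_url import crawl_url

    def pipeline(apibase=None, annex=None, endpoints=None):
        endpoint_list = endpoints.split(",")
        extractors = [LorisAPIExtractor(apibase, annex, ep) for ep in endpoint_list]
        # one sibling sub-pipeline per endpoint, rather than nesting
        # the second crawl inside the first
        subpipes = [
            [crawl_url(join(apibase, ep)), extractor, annex]
            for ep, extractor in zip(endpoint_list, extractors)
        ]
        return subpipes + [annex.finalize()] + [e.finalize() for e in extractors]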

@cmadjar
Author

cmadjar commented Apr 30, 2021

Just a little note to tell you to discard my last comment. I figured it out :).

Ultimately, I think all LORIS PRs will be closed and I will send a new one with improved crawlers. Might just take some time though. Will definitely keep you posted.

@yarikoptic
Member

My slowness was rewarded! ;-) No rush on my end, but I might come in handy reviewing an earlier version of the RF.

@cmadjar
Author

cmadjar commented Apr 30, 2021

To be continued on #103
