LORIS crawlers used to crawl several of the CONP datasets #102
Conversation
… *bval files from git annex
Codecov Report
@@ Coverage Diff @@
## master #102 +/- ##
==========================================
- Coverage 81.37% 79.79% -1.58%
==========================================
Files 57 60 +3
Lines 4644 4792 +148
==========================================
+ Hits 3779 3824 +45
- Misses 865 968 +103
Continue to review full report at Codecov.
Oooops. Very sorry, I meant to open this PR to the CONP-PCNO fork of datalad-crawler... I was thinking of sending you an email to ask you if you would be interested in adding those crawlers to your code. Well, now you know my evil plan, haha ;). Anyway, let me know if you would be interested. If not, no worries, we'll keep those in the CONP-PCNO fork. Thank you!
Thanks for sharing! Great to hear that the crawler serves you (well?). In general, I would not mind having this PR merged to enrich the collection of pipelines -- it might be easier for others to find later on when crafting something similar.
But I wonder if you would consider improving upon it?
BTW -- is there some "public" loris instance on which at least some of those pipelines could be tested on some tiny dataset?
"exclude=README.md and exclude=DATS.json and exclude=logo.png" | ||
" and exclude=.datalad/providers/loris.cfg" | ||
" and exclude=.datalad/crawl/crawl.cfg" | ||
" and exclude=*scans.json" |
note that those might be leaking scanning dates, which are considered sensitive data
these are the descriptions of the scans.tsv files, so in theory there is no data. I had to track the scans.json files in git as they all had the same hash, which led to tons of URLs for a given annex key and slowed down the download process of those tiny files. (Related to datalad/datalad#5429)
ha ha -- I mixed them up again -- indeed it is a good fit for plain git.
If you are involved with producing those BIDS datasets, there could be a singular scans.json at the top level, which would then be inherited for each subject/session, thus avoiding the need to duplicate them. Unfortunately the bids-specification does not explicitly mention that, so I filed bids-standard/bids-specification#789 to possibly improve upon that.
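To picture the suggestion, a sketch of such a layout (assuming the BIDS inheritance principle is extended to cover scans.json; subject/session names are made up):

```
dataset/
├── scans.json                      # single top-level sidecar, inherited below
├── sub-01/
│   └── ses-01/
│       └── sub-01_ses-01_scans.tsv
└── sub-02/
    └── ses-01/
        └── sub-02_ses-01_scans.tsv
```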
def __init__(self, apibase=None, annex=None):
    self.apibase = apibase
    self.meta = {}
    self.repo = annex.repo
with the common code for finalize, the constructor, and the majority of the pipeline, I think you could have avoided a good amount of code duplication by establishing some BaseLorisExtractor to reside e.g. in loris.py and serve as a base class. Then derived classes would provide only the critical differences (configuration for annex; these days that would also better be done via config procedures, but I guess it is ok as-is for here/the antique crawler).
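A minimal sketch of the suggested refactoring; BaseLorisExtractor and its home in loris.py are the names proposed above, while the derived class and its method are illustrative assumptions rather than the PR's actual code:

```python
# loris.py -- hypothetical sketch of the shared base class
class BaseLorisExtractor(object):
    """Constructor and finalize logic common to all LORIS extractors."""

    def __init__(self, apibase=None, annex=None):
        self.apibase = apibase
        self.meta = {}
        self.repo = annex.repo

    def finalize(self):
        # the finalize code currently duplicated across the three
        # pipelines would move here
        pass


# loris_bids_export.py -- a derived class carries only what differs
class LorisBIDSExportExtractor(BaseLorisExtractor):

    def get_annex_options(self):
        # hypothetical hook: per-pipeline annex configuration
        return ["-c", "annex.largefiles=(exclude=README.md)"]
```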
very true! I did them one at a time when I needed them and with not much time to do proper programming ;). I'll look into it.
@yarikoptic Thank you! I am definitely happy to improve the code :-). The code here is indeed based on #13. I did not see the #67 PR. I was told the code from #13 was used for PREVENT-AD, so I reused it for the other LORIS instances of CONP datasets. I will check with @mathdugre to see the difference between the crawlers. The datasets listed in the descriptions are all open (except the PREVENT-AD registered ones, which are open only to PIs). However, none of the datasets are small, unfortunately... Were you thinking of manual testing or automated testing? I could ask around to see what is possible. Thank you!
automated would be the ultimate goal. If no suitable smallish dataset is out there, I guess there could be some …
For another LORIS study, I have to write another crawler that would crawl multiple LORIS API endpoints (so multiple URLs). I think that instead of creating one crawler for each API endpoint, I could modify the existing pipeline to take a list of endpoints. I am a little bit unfamiliar with the return statement of a pipeline, so I don't know how I should code that return statement. Let's say I have a list of endpoints*:
*the provided endpoints will return a dictionary that includes a list of files to crawl, which will be extracted by the LorisAPIExtractor function
After looking at other templates, my first instinct would be something like that: … But I don't trust my first instinct ;)
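For the record, a minimal sketch of what a multi-endpoint pipeline could look like in datalad-crawler terms; LorisAPIExtractor is the extractor named above (its import path and signature are assumed here), and the comma-separated `endpoints` parameter is an illustrative convention:

```python
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.annex import Annexificator
from datalad_crawler.pipelines.loris import LorisAPIExtractor  # assumed location


def pipeline(apibase=None, endpoints=None):
    """Crawl several LORIS API endpoints within a single pipeline.

    `endpoints` is assumed to arrive as a comma-separated string, since
    crawler templates receive their parameters as strings from crawl.cfg.
    """
    annex = Annexificator(create=False)
    pipe = []
    for endpoint in endpoints.split(','):
        # one sub-pipeline per endpoint: fetch the API response, extract
        # the list of files it advertises, and hand them to the annex
        pipe.append([
            crawl_url(apibase.rstrip('/') + '/' + endpoint.strip()),
            LorisAPIExtractor(apibase=apibase, annex=annex),  # assumed signature
            annex,
        ])
    # a single finalize at the end, shared by all sub-pipelines
    pipe.append(annex.finalize())
    return pipe
```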
Just a little note to tell you to discard my last comment. I figured it out :). Ultimately, I think all LORIS PRs will be closed and I will send a new one with improved crawlers. Might just take some time though. Will definitely keep you posted.
My slowness was rewarded! ;-) No rush on my end, but I might come in handy to review an earlier version of the RF (refactoring).
To be continued on #103
This pulls the pipelines used to generate several of the CONP datasets hosted in LORIS:
- the datalad_crawler/pipelines/loris.py pipeline was used to crawl: …
- the datalad_crawler/pipelines/loris_bids_export.py pipeline was used to crawl: …
- the datalad_crawler/pipelines/loris_data_releases.py pipeline was used to crawl: …