Security issues in crawling #12

Open
innovationchef opened this issue May 28, 2018 · 5 comments

@innovationchef
Member

Thoughts -

  1. Faithful crawling - the input website may not contain relevant Bioschemas data
  2. Massive JSON-LD or web pages
  3. Filter frontend inputs
  4. Denial of service attacks
@innovationchef
Member Author

innovationchef commented Jun 1, 2018

  1. How do I check whether the JSON-LD received contains the relevant life sciences data? (See the sketch after this list.)
  2. I don't think Scrapy takes care of massive web pages (I will update this if I find something in the documentation). So how do we check whether a web page is massive, and what limit should be put on its size?
  3. This will be taken up later when we start the frontend part.
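
For (1), a minimal sketch of what such a check could look like in the spider, assuming we only treat a page as relevant if its JSON-LD declares one of an agreed set of Bioschemas/schema.org types. The `RELEVANT_TYPES` set and the helper name below are illustrative placeholders, not anything we have agreed on:

```python
import json

# Illustrative placeholder: the Bioschemas/schema.org types we would treat as
# "life sciences" relevant. This set is a guess, not an agreed whitelist.
RELEVANT_TYPES = {"Dataset", "DataCatalog", "Sample", "Protein", "Gene", "Taxon"}

def has_relevant_jsonld(response):
    """Return True if any JSON-LD block in a Scrapy response declares a relevant @type."""
    for raw in response.xpath('//script[@type="application/ld+json"]/text()').extract():
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip malformed JSON-LD instead of failing the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            declared = item.get("@type", [])
            declared = [declared] if isinstance(declared, str) else declared
            if RELEVANT_TYPES.intersection(declared):
                return True
    return False
```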

@justinccdev
Member

  1. Good question. I think this is an argument for explicitly selecting the sites indexed rather than doing a general crawl. A very good source of sites may be https://fairsharing.org, run by @Drosophilic.
  2. Of course I believe you, but I'm surprised it doesn't. Yeah, I'm not sure about the size limit - what's the size of the average Biosamples page?

@innovationchef
Member Author

It took me some time to find this out, but Scrapy has a safety valve around the downloader. If the aggregated size of the Responses in progress is larger than 5 MB, it stops the flow of further Requests into the downloader.
See here
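
In addition to that aggregated 5 MB valve, Scrapy also has per-response size settings (DOWNLOAD_MAXSIZE and DOWNLOAD_WARNSIZE), which may be enough for the massive-page concern in point 2. A settings sketch - the values are placeholders for discussion, not recommendations:

```python
# settings.py sketch: per-response size limits (placeholder values).
DOWNLOAD_MAXSIZE = 10 * 1024 * 1024   # responses larger than this are aborted
DOWNLOAD_WARNSIZE = 2 * 1024 * 1024   # responses larger than this only log a warning
```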

@innovationchef
Member Author

@justinccdev
There are currently 791 life sciences databases listed on the fairsharing.org website; ebi.co.uk/biosamples is one of them. So, should I check whether the website that the user gives as input is listed in that registry before crawling?
Link here
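
If we did go that way, the check itself could be as simple as comparing the host of the user-supplied URL against a set of fairsharing-listed domains. A sketch, where FAIRSHARING_DOMAINS is a hypothetical, manually maintained set (in practice it would be populated from the fairsharing.org listings):

```python
from urllib.parse import urlparse

# Hypothetical allowlist; in practice this would be built from the
# fairsharing.org life sciences listings rather than hard-coded.
FAIRSHARING_DOMAINS = {
    "ebi.ac.uk",   # illustrative entry only
}

def is_allowed(start_url):
    """Return True if the user-supplied URL's host matches an allowlisted domain."""
    host = urlparse(start_url).netloc.lower()
    # accept the domain itself and any of its subdomains (e.g. www.ebi.ac.uk)
    return any(host == d or host.endswith("." + d) for d in FAIRSHARING_DOMAINS)
```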

@justinccdev
Member

Whilst we may well want to use fairsharing.org information and/or Bioschemas live deploys in the future as sources of default sites to crawl, I don't think we want to restrict what the user can crawl.
