Link crawling might be necessary after all #18

Open
justinccdev opened this issue Jun 26, 2018 · 0 comments


justinccdev commented Jun 26, 2018

Originally, I hoped we could require sites to link to all marked-up pages directly from their sitemap.xml, as Biosamples does. This may be another position that needs revision, though I wouldn't count it out just yet. The alternative is to also crawl via webpage links, though I expect this to be slower (I could be wrong; the difference might not be that significant).
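To make the two discovery strategies concrete, here is a rough sketch using only the Python standard library. It is not the project's crawler code: the names `urls_from_sitemap`, `links_from_html` and `crawl_by_links` are made up for illustration, and `fetch` is an assumed caller-supplied function that returns a page's HTML (or `None` on failure). The sitemap route needs one parse to list every page, whereas the link route has to fetch each page before it can learn about the next ones, which is where the slowdown would likely come from.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return every <loc> value found in a sitemap (or sitemap index)."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


class _LinkParser(HTMLParser):
    """Collects the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_from_html(html_text, base_url):
    """Return absolute, same-host, de-duplicated links from a page."""
    parser = _LinkParser()
    parser.feed(html_text)
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for href in parser.hrefs:
        url, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(url).netloc == host and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def crawl_by_links(start_url, fetch, max_pages=100):
    """Breadth-first crawl; `fetch` is a hypothetical url -> HTML callable."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # fetch failed; skip this page
            continue
        visited.append(url)
        for link in links_from_html(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Note that neither function can discover pages that are only reachable by submitting a search form, since no `<loc>` entry or `<a href>` points at them.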

PDBe is our example for this, as another large website that's not dissimilar to Biosamples. PDBe do not appear to link pages via their sitemap. In fact, even with link crawling there's no obvious way to actually reach all their data, as it's behind a search interface. @ricardoaat is going to investigate this and see if there is a way of crawling that site. If not, the best case is that they make their sitemap.xml link to all their entries (though this may just push the problem back until we encounter a site that will not do this, or that is marked up but has little technical capacity to respond to requests). Another possibility is that PDBe start providing links to all their entries, but not through their sitemap.xml.

What we want to avoid, if at all possible, is having custom code to crawl certain sites (e.g. by entering *.* in the search form at PDBe). This will not scale when we try to crawl many different sites and is very sensitive to changes in the target site.
