Link crawling might be necessary after all #18

justinccdev · 2018-06-26T16:46:19Z

Originally, I hoped we could require sites to list all marked-up pages directly in their sitemap.xml, as Biosamples does. This may be another position that needs revision, though I wouldn't rule it out just yet. The alternative is to also crawl via webpage links, though I expect this to be slower (I could be wrong; the difference might not be significant).
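To make the two discovery strategies concrete, here is a rough Python sketch. It is not taken from our crawler: the function names `urls_from_sitemap` and `same_site_links` are hypothetical, and it assumes the sitemap follows the standard sitemaps.org 0.9 schema. Fetching is left out, so both functions work on text that has already been downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org 0.9 documents (assumed for this sketch).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Sitemap strategy: return every <loc> URL listed in the document.

    Note: a sitemap index file also uses <loc>, so its entries would be
    returned here as well and would need a second fetch.
    """
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url: str, html: str) -> list[str]:
    """Link-crawling strategy: absolute, de-duplicated links on the same host."""
    collector = _LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative hrefs
        if urlparse(absolute).netloc == site and absolute not in links:
            links.append(absolute)
    return links
```

With a sitemap, one request yields the full URL list. With link crawling, `same_site_links` has to be applied to every fetched page and the results queued, which is where the extra cost would come from.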

PDBe is our example for this, as another large website that's not dissimilar to Biosamples. PDBe do not appear to link pages via their sitemap. In fact, even with link crawling there's no obvious way to actually reach all their data, as it's behind a search interface. @ricardoaat is going to investigate this and see if there is a way of crawling that site. If not, the best case is that they make their sitemap.xml link to all their entries (though this may just push the problem back until we encounter a site that will not do this, or that is marked up but has little technical capacity to respond to requests). Another possibility is that PDBe start providing links to all their entries, but not through their sitemap.xml.

What we want to avoid, if at all possible, is having custom code to crawl certain sites (e.g. by entering `*.*` in the search form at PDBe). This will not scale when we try to crawl many different sites, and it is very sensitive to changes in the target site.
