Originally, I hoped we could require sites to link to all marked-up pages directly from their sitemap.xml, as done by Biosamples. This may be another position that needs revision, though I wouldn't count it out just yet. The alternative is to also crawl via webpage links, though my expectation is that this will be slower (I could be wrong; the difference might not be that significant).
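To make the sitemap-only option concrete, here is a minimal sketch of how a crawler could pull page URLs out of a sitemap.xml, assuming the site follows the standard sitemaps.org schema. The function name `parse_sitemap` and the example.org URLs are made up for illustration; this is not our crawler's actual code, and fetching the documents is left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (page_urls, child_sitemap_urls) for one sitemap document.

    Hypothetical helper: a <urlset> yields page URLs, while a
    <sitemapindex> yields further sitemap files still to be fetched.
    """
    root = ET.fromstring(xml_text)
    pages, children = [], []
    if root.tag == SITEMAP_NS + "sitemapindex":
        for entry in root.findall(SITEMAP_NS + "sitemap"):
            loc = entry.find(SITEMAP_NS + "loc")
            if loc is not None and loc.text:
                children.append(loc.text.strip())
    elif root.tag == SITEMAP_NS + "urlset":
        for entry in root.findall(SITEMAP_NS + "url"):
            loc = entry.find(SITEMAP_NS + "loc")
            if loc is not None and loc.text:
                pages.append(loc.text.strip())
    return pages, children


# Made-up sitemap content, used only to show the call.
example = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.org/entry/1</loc></url>"
    "<url><loc>https://example.org/entry/2</loc></url>"
    "</urlset>"
)
pages, children = parse_sitemap(example)
```

With this approach the crawler never has to follow in-page links: it only walks sitemap index files down to the page URLs, which is why it only works for sites that list every marked-up page there.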
PDBe is our example for this, as another large website that's not dissimilar to Biosamples. PDBe do not appear to link pages via their sitemap. In fact, even with link crawling there's no obvious way to actually reach all their data, as it's behind a search interface. @ricardoaat is going to investigate this and see if there is a way of crawling that site. If not, the best case is that they have their sitemap.xml link to all their entries (though this may just push the problem back until we encounter a site that will not do this, or that is marked up but has little technical capacity to respond to requests). Another case is that PDBe start providing links to all their entries, but not through their sitemap.xml.
What we want to avoid, if at all possible, is having custom code to crawl certain sites (e.g. by entering *.* in the search form at PDBe). This will not scale when we try to crawl many different sites and is very sensitive to changes in the target site.