Completely disable indexation on subdomains with unusual cc-lc pairs (ex: fr-it, uk-es, ...) #8779
Comments
Good ideas.
That means we will have world-es and world-fr pages indexed instead of es.openfoodfacts.org and fr.openfoodfacts.org, but why not, especially if we somehow detect the user's country and offer them their country-specific site.
all non product pages, or all product pages?
The language-specific "marque" was intended for SEO purposes, but it's probably not worth all the issues it causes, so I'm fine with moving to all-English facet names in URLs.
So we would index 7.07M product pages (2.9M on world + 3.07M (other cc) + 1.1M (6M - 4.9M, remaining cc-lc combinations)).
All non-product pages. If we allow indexation of / (or of any facet), crawlers will be able to reach 530M pages by following every link. We must still allow product pages to be indexed, as the world-{lc} subdomain will be used as the canonical URL.
Oh, it was for SEO purposes? That's very good news if we can drop it; it makes everything harder (metrics on Matomo, robots.txt directives, indexing, ...). Would it be okay if it's not backward compatible (ex: https://fr.openfoodfacts.org/code-emballeur would return HTTP 404)?
As an additional reason to do so, I just managed to get access to Google Search Console on *.openfoodfacts.org, and I found something interesting: of the last 1k pages indexed, 745 are from world-{lang_code} subdomains, including 506 for (salvador)...
ex: fr-es.openfoodfacts.org shouldn't be indexable by web crawlers. See #8779 for more context.
It's not hard, though, to make this URL redirect to the English version of the URL.
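A minimal sketch of such a redirect, assuming a hypothetical lookup table from localized facet names to English ones. Only `marque` → `brand` comes from this thread; the `code-emballeur` entry, the table, and the function name are illustrative assumptions, not the server's actual code.

```python
# Hypothetical mapping from localized facet path segments to English ones.
# Only "marque" -> "brand" is taken from this issue; the rest is assumed.
LOCALIZED_FACETS = {
    "marque": "brand",
    "code-emballeur": "packager-code",  # assumed English name
}


def english_redirect(path):
    """Return the English-facet path to redirect to, or None if no redirect applies."""
    segments = path.split("/")
    # For an absolute path such as "/marque/foo", segments[0] is "".
    if len(segments) > 1 and segments[1] in LOCALIZED_FACETS:
        segments[1] = LOCALIZED_FACETS[segments[1]]
        return "/".join(segments)
    return None
```

With something like this in place, a request for a localized facet URL could be answered with a permanent redirect to the returned path instead of an HTTP 404.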
What
Web crawlers can currently index every possible country code/language code pair.
For example, fr-es.openfoodfacts.org currently has 120k webpages indexed: URL
There are 242 country codes (+ the world subdomain) and 183 language codes: (242 + 1) × 183 = 44,469 subdomains to crawl.
Indexing everything is clearly not the right approach.
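To make the cc-lc naming concrete, here is a small sketch that splits a hostname into its country and language parts and counts the possible pairs from the figures quoted above. The function names and the `base` parameter are assumptions made for illustration.

```python
def parse_subdomain(host, base="openfoodfacts.org"):
    """Split '{cc}-{lc}.<base>' into (cc, lc); lc is None when the host has no language part."""
    suffix = "." + base
    if not host.endswith(suffix):
        raise ValueError(f"not a {base} subdomain: {host}")
    label = host[: -len(suffix)]
    # Split on the first hyphen only: "fr-es" -> ("fr", "es").
    cc, _, lc = label.partition("-")
    return cc, (lc or None)


def count_pairs(country_codes, language_codes):
    """Number of cc-lc subdomains, counting 'world' as one extra country code."""
    return (country_codes + 1) * language_codes


# Figures taken from the issue text: 242 country codes, 183 language codes.
pairs = count_pairs(242, 183)
```

For instance, `parse_subdomain("fr-es.openfoodfacts.org")` yields the country code `fr` and the language code `es`.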
Proposed solution
noindex page for web crawlers
By proceeding this way, we reduce the number of product pages to index to 7.07M: 2.9M on world + 3.07M (other cc) + 1.1M (6M - 4.9M, remaining cc-lc combinations).
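The page counts can be checked with a few lines of arithmetic. This sketch only restates the figures quoted in this thread; the variable names are illustrative.

```python
# Figures quoted in this issue, expressed as product-page counts.
world_pages = 2_900_000                   # pages on the world subdomain
other_cc_pages = 3_070_000                # pages on other country subdomains
remaining_pairs = 6_000_000 - 4_900_000   # remaining cc-lc combinations

# Product pages that would stay indexable under the proposal.
total_indexed = world_pages + other_cc_pages + remaining_pairs

# Pages reachable if every product were crawlable in every language
# (the roughly 530M figure mentioned elsewhere in this thread).
language_codes = 183
crawlable_without_limits = world_pages * language_codes
```

The first total matches the 7.07M product pages stated above, and the second is about 75 times larger.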
(Previous proposal)
On https://es.openfoodfacts.org, use https://world-es.openfoodfacts.org/product/{barcode} as the canonical URL of the product page.
Disallow (with a noindex page) all non-product pages for world-{lc}.openfoodfacts.org, to avoid allowing the indexation of 530 million webpages (2.9M products * 183 language codes). This could also be implemented in robots.txt (avoiding many unnecessary queries to the OFF server) if we only used English words in URLs in all webpage links (ex: https://fr.openfoodfacts.org/brand/{BRAND} instead of https://fr.openfoodfacts.org/marque/{BRAND}).
By proceeding this way, we only index a product in a language if there is at least one country (with that language as a supported language) where this product is available.
Edit: I've updated the proposal after feedback from stephane.
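The previous proposal can be sketched roughly as follows. This is an illustration under assumptions, not the server's implementation: the function names are hypothetical, and the pattern assumes product pages live under /product/{barcode} with a numeric barcode and an optional trailing slug.

```python
import re

# Assumption: product pages look like /product/{barcode}[/slug], barcode numeric.
PRODUCT_PATH = re.compile(r"^/product/\d+(?:/.*)?$")


def canonical_product_url(lc, barcode):
    """Canonical product URL on the world-{lc} subdomain."""
    return f"https://world-{lc}.openfoodfacts.org/product/{barcode}"


def needs_noindex(host, path):
    """True when a page on a world-{lc} host is not a product page."""
    label = host.split(".", 1)[0]
    if not label.startswith("world-"):
        return False  # other hosts are outside the scope of this sketch
    return PRODUCT_PATH.match(path) is None
```

Under these assumptions, a facet page such as /brand/{BRAND} on a world-{lc} host would be served with a noindex directive, while product pages on the same host stay indexable and act as the canonical target.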