Completely disable indexing on subdomains with unusual cc-lc pairs (e.g. fr-it, uk-es, ...) #8779

Closed · Tracked by #7583
raphael0202 opened this issue Aug 1, 2023 · 4 comments · Fixed by #8786

@raphael0202 (Contributor) commented Aug 1, 2023

What

Web crawlers can currently index every possible country code/language code pair:
for example, fr-es.openfoodfacts.org currently has 120k webpages indexed: URL

There are 242 country codes (plus the world subdomain) and 183 language codes, i.e. 243 × 183 = 44,469 subdomains to crawl.
Letting crawlers index everything across all of them is obviously not the right approach.

Proposed solution

  1. Only allow crawlers access to subdomains whose (country code, language code) pair makes sense (i.e. the language is an official language of the country). This can be done by (we can implement both approaches; see the sketch after the next paragraph):
    1. having a dynamic robots.txt generated on the fly by Product Opener (Disallow: /)
    2. returning a noindex page to web crawlers
  2. Deny bots access to all world-{lc} subdomains where {lc} != 'en', to avoid allowing the indexing of 530 million webpages (2.9M products × 183 language codes).

By proceeding this way, we reduce the number of product pages to index to 7.07M: 2.9M on world + 3.07M (other cc) + 1.1M (6M − 4.9M, remaining cc-lc combinations).
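A minimal sketch of the first approach, assuming a hypothetical OFFICIAL_LANGUAGES table and a Flask handler (Product Opener itself is written in Perl, so this only illustrates the decision logic):

```python
# Illustrative sketch only: Product Opener is written in Perl, and the
# OFFICIAL_LANGUAGES table below is a hypothetical stand-in for the real
# country -> official-languages data.
from flask import Flask, Response, request

app = Flask(__name__)

OFFICIAL_LANGUAGES = {
    "fr": {"fr"},
    "es": {"es", "ca", "eu", "gl"},
    "it": {"it"},
}

def subdomain_is_indexable(subdomain: str) -> bool:
    """'world' and plain {cc} subdomains stay indexable; world-{lc} only
    for lc == 'en'; {cc}-{lc} only when lc is official in cc."""
    cc, _, lc = subdomain.partition("-")
    if not lc:                # "world" or a plain country code
        return True
    if cc == "world":         # deny world-{lc} for any lc != 'en'
        return lc == "en"
    return lc in OFFICIAL_LANGUAGES.get(cc, set())

@app.route("/robots.txt")
def robots_txt():
    # e.g. "fr-es.openfoodfacts.org" -> "fr-es"
    subdomain = request.host.split(".", 1)[0]
    directive = "Allow: /" if subdomain_is_indexable(subdomain) else "Disallow: /"
    return Response(f"User-agent: *\n{directive}\n", mimetype="text/plain")
```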

(Previous proposal)

  • on https://es.openfoodfacts.org, have https://world-es.openfoodfacts.org/product/{barcode} as the canonical URL of the product page.
  • Disallow (with a noindex page) all non-product pages on world-{lc}.openfoodfacts.org, to avoid allowing the indexing of 530 million webpages (2.9M products × 183 language codes). This could also be implemented in robots.txt (avoiding many unnecessary queries to the OFF server) if we only used English words in the URLs of all webpage links (e.g. https://fr.openfoodfacts.org/brand/{BRAND} instead of https://fr.openfoodfacts.org/marque/{BRAND}).

By proceeding this way, we only index a product in a given language if there is at least one country (with that language as a supported language) where the product is available.

edit: I've updated the proposal after feedback from Stephane.

@stephanegigandet (Contributor)

  • Only allow crawlers access to subdomains whose (country code, language code) pair makes sense (i.e. the language is an official language of the country). This can be done by (we can implement both approaches):

    1. having a dynamic robots.txt generated on the fly by Product Opener (Disallow: /)
    2. returning a noindex page to web crawlers

Good ideas.

That means we will have world-es pages and world-fr pages indexed instead of es.openfoodfacts.org and fr.openfoodfacts.org, but why not, especially if we somehow detect the country of the user to offer them their country-specific site.

  • Disallow (with a noindex page) all non-product pages on world-{lc}.openfoodfacts.org, to avoid allowing the indexing of 530 million webpages (2.9M products × 183 language codes).

All non-product pages, or all product pages?

This could also be implemented in robots.txt (avoiding many unnecessary queries to the OFF server) if we only used English words in the URLs of all webpage links (e.g. https://fr.openfoodfacts.org/brand/{BRAND} instead of https://fr.openfoodfacts.org/marque/{BRAND}).

The language-specific "marque" was intended for SEO purposes, but it's probably not worth all the issues it causes, so I'm fine with moving to all-English facet names in URLs.

@raphael0202 (Contributor, Author) commented Aug 1, 2023

That means we will have world-es pages and world-fr pages indexed instead of es.openfoodfacts.org and fr.openfoodfacts.org, but why not, especially if we somehow detect the country of the user to offer them their country-specific site.

Indeed, the idea behind it was to avoid indexing duplicated products. We have 2.9 million products, but 6 million product pages across all countries (as some products are available in several countries).
edit: on second thought, it would be more straightforward and better to have the es.openfoodfacts.org page indexed (= canonical URL) for Spain instead of world-es.openfoodfacts.org, as users would arrive on their country-specific website (and only get products available in their country). We could still allow indexing of world.openfoodfacts.org (English only) and disable all world-{lc}.openfoodfacts.org variants, plus the irrelevant {cc}-{lc} combinations on other subdomains. It would mean allowing indexing of:

  • 2.9M products on world.openfoodfacts.org
  • 6M product pages across all {cc}.openfoodfacts.org subdomains (sum of the product counts for each cc) => 2.9M on world, 3.07M on the other cc
  • 4.9M products on {cc}-{lc} subdomains (with cc != 'world')
  • ?M facet pages

So we would index 7.07M product pages (2.9M on world + 3.07M (other cc) + 1.1M (6M − 4.9M, remaining cc-lc combinations)).
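For concreteness, a tiny sketch (hypothetical function name) of what the updated proposal means for canonical product URLs:

```python
def canonical_product_url(cc: str, barcode: str) -> str:
    """Under the updated proposal, the canonical URL of a product page is
    the country subdomain itself (e.g. es.openfoodfacts.org for Spain)
    rather than world-{lc}; 'world' remains English-only."""
    return f"https://{cc}.openfoodfacts.org/product/{barcode}"
```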

All non-product pages, or all product pages?

All non-product pages. If we allow indexing of / (or of any facet page), crawlers will be able to reach 530M pages by following all the links. We must still allow product pages to be indexed, as the world-{lc} subdomain will be used as the canonical URL.
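A sketch of that rule as it stood in this earlier proposal (the edit above later removed world-{lc} from indexing entirely); the helper is hypothetical, but X-Robots-Tag is the standard response header for this:

```python
import re

# Product pages must stay indexable on world-{lc} (they were to serve as
# canonical URLs); everything else gets a noindex hint for crawlers.
PRODUCT_PATH = re.compile(r"^/product/")

def add_noindex_if_needed(subdomain: str, path: str, headers: dict) -> None:
    """Hypothetical response hook: mark non-product pages on world-{lc}
    subdomains as noindex via the X-Robots-Tag header."""
    if subdomain.startswith("world-") and not PRODUCT_PATH.match(path):
        headers["X-Robots-Tag"] = "noindex"
```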

The language-specific "marque" was intended for SEO purposes, but it's probably not worth all the issues it causes, so I'm fine with moving to all-English facet names in URLs.

Oh, it was for SEO purposes? That's very good news if we can drop it: it makes everything harder (metrics in Matomo, robots.txt directives, indexing, ...). Would it be okay if it's not backward compatible (e.g. https://fr.openfoodfacts.org/code-emballeur would return HTTP 404)?

@raphael0202 (Contributor, Author)

As an additional reason to do so: I just managed to get access to the Google Search Console for *.openfoodfacts.org, and I found something interesting: of the last 1k pages indexed, 745 are from world-{lang_code} subdomains, including 506 for (salvador)...
So world-{lang} subdomains should definitely not be indexed.

raphael0202 added a commit that referenced this issue Aug 2, 2023
ex: fr-es.openfoodfacts.org shouldn't be indexable by web crawlers
See #8779 for more context
raphael0202 added a commit that referenced this issue Aug 8, 2023
ex: fr-es.openfoodfacts.org shouldn't be indexable by web crawlers
See #8779 for more context
@alexgarel (Member)

Would it be okay if it's not backward compatible (e.g. https://fr.openfoodfacts.org/code-emballeur would return HTTP 404)?

It's not hard, though, to make such a URL send a redirect to the English-version URL.
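A hedged sketch of that redirect; the French-to-English facet mapping below is illustrative, not the actual Product Opener route table:

```python
from flask import Flask, redirect

app = Flask(__name__)

# Illustrative excerpt: legacy localized facet name -> assumed English name.
LEGACY_FACETS = {
    "marque": "brand",
    "code-emballeur": "packager-code",
}

@app.route("/<facet>")
def legacy_facet_redirect(facet: str):
    english = LEGACY_FACETS.get(facet)
    if english is None:
        return ("Not Found", 404)
    # 301 so crawlers transfer any accumulated SEO value to the English URL
    return redirect(f"/{english}", code=301)
```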
