Completely disable indexing on subdomains with unusual cc-lc pairs (e.g. fr-it, uk-es, ...) #8779

Closed · Tracked by #7583
raphael0202 opened this issue Aug 1, 2023 · 4 comments · Fixed by #8786

@raphael0202 (Contributor) commented Aug 1, 2023

What

Web crawlers can currently index every possible country code/language code pair:
for example, fr-es.openfoodfacts.org currently has 120k webpages indexed: URL

There are 242 country codes (plus the world subdomain) and 183 language codes, i.e. 243 × 183 = 44,469 subdomains to crawl.
Letting crawlers index everything across all of them is obviously not the right approach.

Proposed solution

  1. Only allow crawlers access to subdomains whose (country code, language code) pair makes sense (i.e. the language is an official language of the country). This can be done by (we can implement both approaches; see the sketch after the next paragraph):
    1. having a dynamic robots.txt generated on the fly by Product Opener (Disallow: /)
    2. returning a noindex page to web crawlers
  2. Deny bots access to all world-{lc} subdomains where {lc} != 'en', to avoid allowing the indexing of 530 million webpages (2.9M products × 183 language codes).

By proceeding this way, we reduce the number of product pages to index to 7.07M: 2.9M on world + 3.07M (other cc) + 1.1M (6M − 4.9M, remaining cc-lc combinations).
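A minimal sketch of the first approach, assuming a hypothetical OFFICIAL_LANGUAGES table and a Flask handler (Product Opener itself is written in Perl, so this only illustrates the decision logic):

```python
# Illustrative sketch only: Product Opener is written in Perl, and the
# OFFICIAL_LANGUAGES table below is a hypothetical stand-in for the real
# country -> official-languages data.
from flask import Flask, Response, request

app = Flask(__name__)

OFFICIAL_LANGUAGES = {
    "fr": {"fr"},
    "es": {"es", "ca", "eu", "gl"},
    "it": {"it"},
}

def subdomain_is_indexable(subdomain: str) -> bool:
    """'world' and plain {cc} subdomains stay indexable; world-{lc} only
    for lc == 'en'; {cc}-{lc} only when lc is official in cc."""
    cc, _, lc = subdomain.partition("-")
    if not lc:                # "world" or a plain country code
        return True
    if cc == "world":         # deny world-{lc} for any lc != 'en'
        return lc == "en"
    return lc in OFFICIAL_LANGUAGES.get(cc, set())

@app.route("/robots.txt")
def robots_txt():
    # e.g. "fr-es.openfoodfacts.org" -> "fr-es"
    subdomain = request.host.split(".", 1)[0]
    directive = "Allow: /" if subdomain_is_indexable(subdomain) else "Disallow: /"
    return Response(f"User-agent: *\n{directive}\n", mimetype="text/plain")
```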

(Previous proposal)

  • on https://es.openfoodfacts.org, have https://world-es.openfoodfacts.org/product/{barcode} as the canonical URL of the product page.
  • Disallow (with a noindex page) all non-product pages on world-{lc}.openfoodfacts.org, to avoid allowing the indexing of 530 million webpages (2.9M products × 183 language codes). This could also be implemented in robots.txt (avoiding many unnecessary queries to the OFF server) if we only used English words in the URLs of all webpage links (e.g. https://fr.openfoodfacts.org/brand/{BRAND} instead of https://fr.openfoodfacts.org/marque/{BRAND}).

By proceeding this way, we only index a product in a given language if there is at least one country (with that language as a supported language) where the product is available.

edit: I've updated the proposal after feedback from Stephane.

@stephanegigandet (Contributor)

  • Only allow crawlers access to subdomains whose (country code, language code) pair makes sense (i.e. the language is an official language of the country). This can be done by (we can implement both approaches):

    1. having a dynamic robots.txt generated on the fly by Product Opener (Disallow: /)
    2. returning a noindex page to web crawlers

Good ideas.

That means we will have world-es pages and world-fr pages indexed instead of es.openfoodfacts.org and fr.openfoodfacts.org, but why not, especially if we somehow detect the country of the user to offer them their country-specific site.

  • Disallow (with a noindex page) all non-product pages on world-{lc}.openfoodfacts.org, to avoid allowing the indexing of 530 million webpages (2.9M products × 183 language codes).

All non-product pages, or all product pages?

This could also be implemented in robots.txt (avoiding many unnecessary queries to the OFF server) if we only used English words in the URLs of all webpage links (e.g. https://fr.openfoodfacts.org/brand/{BRAND} instead of https://fr.openfoodfacts.org/marque/{BRAND}).

The language-specific "marque" was intended for SEO purposes, but it's probably not worth all the issues it causes, so I'm fine with moving to all-English facet names in URLs.

@raphael0202 (Contributor, Author) commented Aug 1, 2023

That means we will have world-es pages and world-fr pages indexed instead of es.openfoodfacts.org and fr.openfoodfacts.org, but why not, especially if we somehow detect the country of the user to offer them their country-specific site.

Indeed, the idea behind it was to avoid indexing duplicated products. We have 2.9 million products, but 6 million product pages across all countries (as some products are available in several countries).
edit: on second thought, it would be more straightforward and better to have the es.openfoodfacts.org page indexed (= canonical URL) for Spain instead of world-es.openfoodfacts.org, as users would arrive on their country-specific website (and only get products available in their country). We could still allow indexing of world.openfoodfacts.org (English only) and disable all world-{lc}.openfoodfacts.org variants, plus the irrelevant {cc}-{lc} combinations on other subdomains. It would mean allowing indexing of:

  • 2.9M products on world.openfoodfacts.org
  • 6M product pages across all {cc}.openfoodfacts.org subdomains (sum of the product counts for each cc) => 2.9M on world, 3.07M on the other cc
  • 4.9M products on {cc}-{lc} subdomains (with cc != 'world')
  • ?M facet pages

So we would index 7.07M product pages (2.9M on world + 3.07M (other cc) + 1.1M (6M − 4.9M, remaining cc-lc combinations)).
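For concreteness, a tiny sketch (hypothetical function name) of what the updated proposal means for canonical product URLs:

```python
def canonical_product_url(cc: str, barcode: str) -> str:
    """Under the updated proposal, the canonical URL of a product page is
    the country subdomain itself (e.g. es.openfoodfacts.org for Spain)
    rather than world-{lc}; 'world' remains English-only."""
    return f"https://{cc}.openfoodfacts.org/product/{barcode}"
```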

All non-product pages, or all product pages?

All non-product pages. If we allow indexing of / (or of any facet page), crawlers will be able to reach 530M pages by following all the links. We must still allow product pages to be indexed, as the world-{lc} subdomain will be used as the canonical URL.
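A sketch of that rule as it stood in this earlier proposal (the edit above later removed world-{lc} from indexing entirely); the helper is hypothetical, but X-Robots-Tag is the standard response header for this:

```python
import re

# Product pages must stay indexable on world-{lc} (they were to serve as
# canonical URLs); everything else gets a noindex hint for crawlers.
PRODUCT_PATH = re.compile(r"^/product/")

def add_noindex_if_needed(subdomain: str, path: str, headers: dict) -> None:
    """Hypothetical response hook: mark non-product pages on world-{lc}
    subdomains as noindex via the X-Robots-Tag header."""
    if subdomain.startswith("world-") and not PRODUCT_PATH.match(path):
        headers["X-Robots-Tag"] = "noindex"
```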

The language-specific "marque" was intended for SEO purposes, but it's probably not worth all the issues it causes, so I'm fine with moving to all-English facet names in URLs.

Oh, it was for SEO purposes? That's very good news if we can drop it: it makes everything harder (metrics in Matomo, robots.txt directives, indexing, ...). Would it be okay if it's not backward compatible (e.g. https://fr.openfoodfacts.org/code-emballeur would return HTTP 404)?

@raphael0202 (Contributor, Author)

As an additional reason to do so: I just managed to get access to the Google Search Console for *.openfoodfacts.org, and I found something interesting: of the last 1k pages indexed, 745 are from world-{lang_code} subdomains, including 506 for (salvador)...
So world-{lang} subdomains should definitely not be indexed.

raphael0202 added a commit that referenced this issue Aug 2, 2023
ex: fr-es.openfoodfacts.org shouldn't be indexable by web crawlers
See #8779 for more context
raphael0202 added a commit that referenced this issue Aug 8, 2023
ex: fr-es.openfoodfacts.org shouldn't be indexable by web crawlers
See #8779 for more context
@alexgarel (Member)

Would it be okay if it's not backward compatible (e.g. https://fr.openfoodfacts.org/code-emballeur would return HTTP 404)?

It's not hard, though, to make such a URL send a redirect to the English-version URL.
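A hedged sketch of that redirect; the French-to-English facet mapping below is illustrative, not the actual Product Opener route table:

```python
from flask import Flask, redirect

app = Flask(__name__)

# Illustrative excerpt: legacy localized facet name -> assumed English name.
LEGACY_FACETS = {
    "marque": "brand",
    "code-emballeur": "packager-code",
}

@app.route("/<facet>")
def legacy_facet_redirect(facet: str):
    english = LEGACY_FACETS.get(facet)
    if english is None:
        return ("Not Found", 404)
    # 301 so crawlers transfer any accumulated SEO value to the English URL
    return redirect(f"/{english}", code=301)
```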
