Update robots.txt.sample #2307

Merged
merged 3 commits into from
Jul 12, 2022

Conversation

addison74
Contributor

This is an update to the robots.txt.sample file that was recently added to OpenMage (PR #1024).

A brief summary of the changes:

  • added the Crawl-delay option for cases where a well-behaved crawler could cause traffic issues.
  • added a few URL paths that I found on frontend pages, such as checkout and sales.
  • added the option to disallow indexing of media for known bots.

The file can be used in production without any issues. I have tested both the initial version and this updated version for a long time. Keep in mind that only a good bot will take this file into account. For bad bots, those that skip the robots.txt content and crawl everything, I will post a new PR in the next few days.
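As a rough illustration of what "taking the file into account" means for a cooperative crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt excerpt, the 10-second delay, the un-prefixed paths, and the `ExampleBot` name are all assumptions made for the example, not the content of the sample file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt excerpt; the delay value and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /checkout/
Disallow: /sales/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A cooperative crawler would wait this many seconds between requests.
delay = parser.crawl_delay("ExampleBot")

# It would also skip disallowed paths and fetch everything else.
checkout_ok = parser.can_fetch("ExampleBot", "/checkout/cart/")
home_ok = parser.can_fetch("ExampleBot", "/")

print(delay, checkout_ok, home_ok)
```

Nothing here is enforced by the server: a crawler that never reads robots.txt is unaffected by any of these directives.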

@addison74 addison74 changed the title Addison74 robots update Update robots.txt.sample Jul 10, 2022
@fballiano
Contributor

mmmm I'm not sure about these ones

[Screenshot 2022-07-11 at 00.33.39]

I think I preferred how it was before

@addison74
Contributor Author

Disallow: */catalogsearch/ matches the next four green lines. Let's analyze each line and decide if we keep any of them.

  • advanced and result - we definitely do not want good bots to index these pages. I have deactivated advanced search in my stores.
  • term/popular - I have deactivated this section in all my stores too, but others use the page and want it indexed.
  • seo_sitemap has two sections, categories and products. In my case I disabled it since a sitemap is already set in robots.txt, but there could be advantages to having these sections indexed.

If I keep only the red line, I have no control over the others. I would keep the red line and the last two green ones. Anyone who wants bots to access the term/popular and seo_sitemap pages can then uncomment them. The proposed variant is the following:

Disallow: */catalogsearch/
#Allow: */catalogsearch/seo_sitemap
#Allow: */catalogsearch/term/popular
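Whether uncommenting the Allow lines has the intended effect depends on how a crawler resolves conflicting rules. Under the longest-match rule described in RFC 9309 and in Google's robots.txt documentation, the most specific (longest) matching pattern wins and Allow wins a tie. The sketch below is a simplified matcher written for this illustration, not any crawler's actual code; `pattern_to_regex` and `is_allowed` are hypothetical helper names.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional '$' anchor)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(path, rules):
    """Longest matching pattern wins; on equal length, Allow beats Disallow."""
    best = None  # (pattern length, allow flag)
    for allow, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), allow)
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

# The proposed variant with both Allow lines uncommented.
RULES = [
    (False, "*/catalogsearch/"),
    (True, "*/catalogsearch/seo_sitemap"),
    (True, "*/catalogsearch/term/popular"),
]

for path in ("/catalogsearch/term/popular",
             "/catalogsearch/result/?q=shoes",
             "/catalogsearch/advanced"):
    print(path, is_allowed(path, RULES))
```

Crawlers that do not support wildcards or that apply the first matching rule instead of the longest one may behave differently, so this only models the cooperative, standards-following case.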

@fballiano
Contributor

This PR also puts "catalogsearch/seo_sitemap" under "Disallow", which seems to be SEO related. But anyway, I'd prefer to keep all of catalogsearch/ out of the search engines.

Disallow: */catalogsearch/
Disallow: */catalogsearch/advanced
Disallow: */catalogsearch/result
Disallow: */catalogsearch/seo_sitemap
Contributor

This file is needed for Google (and others) to easily find crawlable URLs. Preventing them from crawling it defeats the whole purpose of this endpoint.

Contributor

Tested on 2 of my stores and that endpoint returns a 404... I'm not sure it exists.

Contributor

which brings up the question, what links to it :/

Contributor

Exactly, that's why it should stay Disallow: */catalogsearch/ instead of the 3 split lines.

Contributor Author

Making the change by removing those 4 lines is no problem. But keep in mind that until now OM has not provided a robots.txt file. Anyone who has frontend links to /catalogsearch/seo_sitemap and /catalogsearch/term/popular and uses this file in its previous form, with `Disallow: */catalogsearch/` as the only option, will remove these two pages from indexing. Obviously we are talking about bots that take the existence and content of this file into account.

For information, I analyzed all the bots that visited my websites over a month, and I can say that very few of them request robots.txt. Googlebot ignores Crawl-delay.
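One way to run this kind of analysis is sketched below. It is not the method actually used here. It assumes the web server writes the common "combined" access-log format, and the `LOG_LINE` pattern, the `robots_txt_fetchers` helper, the bot names, and the sample lines are all made up for illustration.

```python
import re
from collections import Counter

# Assumed "combined" log format: ... "METHOD path proto" status size "referer" "agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def robots_txt_fetchers(lines):
    """Map each user agent to (request count, whether it ever fetched /robots.txt)."""
    seen = Counter()
    fetched = set()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        agent = match["agent"]
        seen[agent] += 1
        if match["path"].split("?")[0] == "/robots.txt":
            fetched.add(agent)
    return {agent: (count, agent in fetched) for agent, count in seen.items()}

# Fabricated sample lines, for demonstration only.
SAMPLE = [
    '203.0.113.5 - - [11/Jul/2022:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "GoodBot/1.0"',
    '203.0.113.5 - - [11/Jul/2022:10:00:10 +0000] "GET /some-page.html HTTP/1.1" 200 2048 "-" "GoodBot/1.0"',
    '198.51.100.7 - - [11/Jul/2022:10:00:11 +0000] "GET /catalogsearch/result/?q=a HTTP/1.1" 200 4096 "-" "BadBot/2.0"',
]

report = robots_txt_fetchers(SAMPLE)
for agent, (count, read_robots) in report.items():
    print(agent, count, read_robots)
```

User-agent strings can be spoofed, so a report like this is only a first approximation of which bots actually read the file.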

# Disallow: */skin/

# User-agent: Googlebot-Image
# Disallow: /
Contributor

some people might rely on their product and other images being available via image search

Contributor Author

@addison74 addison74 Jul 11, 2022

This is the main reason those lines have been commented out from the very beginning. But there are situations where you have a website and don't want your images indexed, even by Googlebot-Image. In that case, you uncomment those lines.

However, between us, this is a basic solution. It aims to make the administrator aware that this possibility exists. I use other solutions to block or allow full or limited access for bots. I proposed a similar variant in another PR about the new content of the .htaccess file in root.
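To illustrate what uncommenting the Googlebot-Image group would do for a crawler that honours robots.txt, here is a small sketch using Python's standard `urllib.robotparser`. The excerpt is a simplified, hypothetical version of the file, and the image path and `ExampleBot` name are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt with the Googlebot-Image group uncommented.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/

User-agent: Googlebot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

image_url = "/media/catalog/product/example.jpg"  # illustrative path

# The dedicated group blocks the image crawler from the whole site...
image_bot_ok = parser.can_fetch("Googlebot-Image", image_url)

# ...while other agents fall back to the generic "*" group.
other_bot_ok = parser.can_fetch("ExampleBot", image_url)

print(image_bot_ok, other_bot_ok)
```

This only models a cooperative crawler; as noted in the thread, stricter blocking needs server-side measures.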

Flyingmana
Flyingmana previously approved these changes Jul 11, 2022
Contributor

@Flyingmana Flyingmana left a comment

Approving, because in the end it's a sample file, and none of the changes is clearly wrong.

@addison74
Contributor Author

addison74 commented Jul 11, 2022

I made the change by removing two lines because it is obvious that we do not want to index them.

The other two lines remain under discussion. The scenario is as follows: I want to let bots crawl content in /catalogsearch/seo_sitemap and /catalogsearch/term/popular, but I don't want them to do the same for other paths that start with /catalogsearch, like /advanced and /result.

@fballiano fballiano merged commit a11d247 into OpenMage:1.9.4.x Jul 12, 2022
@github-actions
Contributor

Unit Test Results

1 files  ±0  1 suites  ±0   0s ⏱️ ±0s
0 tests ±0  0 ✔️ ±0  0 💤 ±0  0 ❌ ±0 
7 runs  ±0  5 ✔️ ±0  2 💤 ±0  0 ❌ ±0 

Results for commit a11d247. ± Comparison against base commit e06fe2d.

@addison74 addison74 deleted the ADDISON74-robots-update branch July 12, 2022 14:47
elidrissidev pushed a commit to elidrissidev/magento-lts that referenced this pull request Jul 22, 2022