Update robots.txt.sample #2307

addison74 · 2022-07-10T22:12:29Z

This is an update to the robots.txt.sample file that was recently added to OpenMage (PR #1024).

A short summary of the changes:

Added the Crawl-delay option for cases where even a good crawler could cause traffic issues.
Added a few URL paths that I found in frontend pages, such as checkout and sales.
Added the option to disallow indexing of media for known bots.

The file can be used in production without any issues. I have tested both the initial version and this updated version for a long time. Keep in mind that only a good bot will take this file into account. For bad bots, those that skip the robots.txt content and crawl everything, I will post a new PR in the coming days.
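To illustrate what "taking the file into account" looks like for a well-behaved crawler, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE` are illustrative assumptions (the delay value, the `/checkout/` path and the `ExampleBot` name are not copied from robots.txt.sample). This parser does plain prefix matching and does not understand `*` wildcards, so the sketch uses a plain path.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (assumed values, not the contents of robots.txt.sample).
SAMPLE = """\
User-agent: *
Crawl-delay: 10
Disallow: /checkout/
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# A polite crawler would wait this many seconds between requests.
delay = parser.crawl_delay("ExampleBot")

# It would also skip any path the rules disallow.
may_fetch_cart = parser.can_fetch("ExampleBot", "/checkout/cart/")
may_fetch_home = parser.can_fetch("ExampleBot", "/")

print(delay, may_fetch_cart, may_fetch_home)
```

Nothing in the file enforces these values; a crawler that never runs this kind of check is unaffected by it.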

fballiano · 2022-07-10T22:34:29Z

mmmm I'm not sure about these ones

I think I preferred how it was before

addison74 · 2022-07-11T08:37:52Z

Disallow: */catalogsearch/ matches the next four green lines. Let's analyze each line and decide if we keep any of them.

advanced and result - we definitely do not want good bots to index these pages. I deactivated advanced search in my stores.
term/popular - although I have deactivated this section in all my stores too, others use the page and want it to be indexed.
seo_sitemap has two sections, categories and products. In my case I disabled it since a sitemap is already set in robots.txt, but there could be advantages to having the sections indexed.

If I keep only the red line I have no control over the others. I would keep the red line and the last two green ones. If I want bots to access the term/popular and seo_sitemap pages, I can uncomment the corresponding lines. The proposed variant is the following:

Disallow: */catalogsearch/
#Allow: */catalogsearch/seo_sitemap
#Allow: */catalogsearch/term/popular
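How the Allow lines would interact with the broader Disallow can be sketched in Python. This is a minimal sketch, assuming the crawler applies the longest-match precedence described in RFC 9309 (the most specific rule wins, and Allow wins a tie) and treats `*` as a wildcard; individual bots may behave differently. The helper names `rule_to_regex` and `is_allowed` are hypothetical, and the rule list shows the variant with both Allow lines uncommented.

```python
import re

def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, pattern) pairs from one user-agent group.
    The longest matching pattern wins; on a tie, Allow beats Disallow.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if pattern and rule_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

# The proposed variant, with both Allow lines uncommented (assumption).
RULES = [
    ("Disallow", "*/catalogsearch/"),
    ("Allow", "*/catalogsearch/seo_sitemap"),
    ("Allow", "*/catalogsearch/term/popular"),
]

for path in ("/catalogsearch/result/", "/catalogsearch/advanced/",
             "/catalogsearch/seo_sitemap/category/", "/catalogsearch/term/popular/"):
    print(path, is_allowed(path, RULES))
```

Under these assumptions the result and advanced pages stay blocked while the two Allow paths become crawlable. With the Allow lines left commented, everything under catalogsearch/ is blocked.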

fballiano · 2022-07-11T11:33:30Z

This PR also puts "catalogsearch/seo_sitemap" under "Disallow", which seems to be SEO related. But anyway, I'd prefer to keep all of catalogsearch/ out of the search engines.

Flyingmana · 2022-07-11T11:40:36Z

robots.txt.sample

-Disallow: */catalogsearch/
+Disallow: */catalogsearch/advanced
+Disallow: */catalogsearch/result
+Disallow: */catalogsearch/seo_sitemap


This file is needed for Google (and others) to easily find crawlable URLs. Preventing them from crawling it defeats the whole purpose of this endpoint.

fballiano · 2022-07-11

Tested on 2 of my stores and that endpoint returns a 404... I'm not sure it exists.

Flyingmana · 2022-07-11

Which brings up the question: what links to it? :/

fballiano · 2022-07-11

Exactly, that's why it should stay Disallow: */catalogsearch/ instead of the 3 split lines.

addison74 · 2022-07-11

Removing those 4 lines is no problem. But keep in mind that until now OM has not provided a robots.txt file. Anyone who has frontend links to /catalogsearch/seo_sitemap and /catalogsearch/term/popular and uses this file in its previous form, with the single rule `Disallow: */catalogsearch/`, will have these two pages removed from indexing. Obviously, we are talking about bots that take the existence and content of this file into account.

For information, I analyzed all the bots that went through my websites for a month, and I can say that very few of them ever read robots.txt. Googlebot ignores Crawl-delay.

Flyingmana · 2022-07-11T11:42:29Z

robots.txt.sample

+# Disallow: */skin/
+
+# User-agent: Googlebot-Image
+# Disallow: /


Some people might rely on their product and other images being available via image search.

addison74 · 2022-07-11

This is the main reason those lines have been commented out from the very beginning. But there are situations where you have a website and don't want your images to be indexed, even by Googlebot-Image. In that case you uncomment those lines.

However, between us, this is a basic solution. It aims to make the administrator aware that this possibility exists. I use other solutions to block or allow full or limited access for bots. I proposed a similar variant in another PR about the new content of the .htaccess file in root.
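As a rough check of what uncommenting the Googlebot-Image group would do, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The image path is a hypothetical example, and the sketch only shows how a parser that honors user-agent groups reads the rule, not how Google's crawlers actually behave.

```python
from urllib.robotparser import RobotFileParser

# The group from the diff, with the comment markers removed.
SAMPLE = """\
User-agent: Googlebot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# Hypothetical image URL path, used only for illustration.
image_path = "/media/catalog/product/example.jpg"

# The group applies only to the named user agent.
image_bot_allowed = parser.can_fetch("Googlebot-Image", image_path)
other_bot_allowed = parser.can_fetch("Googlebot", image_path)

print(image_bot_allowed, other_bot_allowed)
```

Only the image crawler named in the group is blocked here; other user agents fall through to whatever rules apply to them elsewhere in the file.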

Flyingmana

Approving, because in the end it's a sample file and nothing in the changes is clearly wrong.

addison74 · 2022-07-11T12:15:26Z

I made the change by removing two lines because it is obvious that we do not want to index them.

The other two lines remain under discussion. The scenario is as follows: I want to let bots crawl content under /catalogsearch/seo_sitemap and /catalogsearch/term/popular, but I don't want them to do the same for other paths that start with /catalogsearch, such as /advanced and /result.

github-actions · 2022-07-12T07:18:40Z

Unit Test Results

1 files ±0 1 suites ±0 0s ⏱️ ±0s
0 tests ±0 0 ✔️ ±0 0 💤 ±0 0 ❌ ±0
7 runs ±0 5 ✔️ ±0 2 💤 ±0 0 ❌ ±0

Results for commit a11d247. ± Comparison against base commit e06fe2d.

addison74 added 2 commits July 11, 2022 01:11

Small changes in robots.txt.sample

e264f65

Merge branch 'OpenMage:1.9.4.x' into ADDISON74-robots-update

67040db

addison74 changed the title ~~Addison74 robots update~~ Update robots.txt.sample Jul 10, 2022

Flyingmana reviewed Jul 11, 2022

Flyingmana previously approved these changes Jul 11, 2022

Update robots.txt.sample

129b812

addison74 dismissed Flyingmana’s stale review via 129b812 July 11, 2022 12:14

fballiano approved these changes Jul 11, 2022

Flyingmana approved these changes Jul 12, 2022

fballiano merged commit a11d247 into OpenMage:1.9.4.x Jul 12, 2022

addison74 deleted the ADDISON74-robots-update branch July 12, 2022 14:47

elidrissidev pushed a commit to elidrissidev/magento-lts that referenced this pull request Jul 22, 2022

Update robots.txt.sample (OpenMage#2307)

c44648c
