Update robots.txt.sample #2307
If I keep only the red line, I have no control over the others. I would keep the red line and the last two green ones; if I want to allow access to the term/popular and seo_sitemap pages, I can uncomment them. The proposed variant is the following:
Disallow: */catalogsearch/
#Allow: */catalogsearch/seo_sitemap
#Allow: */catalogsearch/term/popular
This PR also puts "catalogsearch/seo_sitemap" under "Disallow", although it seems to be SEO-related. Anyway, I'd prefer to keep all of catalogsearch/ out of the search engines.
robots.txt.sample
Outdated
Disallow: */catalogsearch/
Disallow: */catalogsearch/advanced
Disallow: */catalogsearch/result
Disallow: */catalogsearch/seo_sitemap
This is needed for Google (and others) to easily find crawlable URLs. Preventing them from crawling it defeats the whole purpose of this endpoint.
tested on 2 of my stores and that endpoint goes to 404... i'm not sure it exists
which brings up the question, what links to it :/
Exactly, that's why it should stay Disallow: */catalogsearch/ instead of the 3 split lines.
Making the change by removing those 4 lines is no problem. But keep in mind that so far OM has not provided a robots.txt file. For anyone who has frontend links to /catalogsearch/seo_sitemap and /catalogsearch/term/popular, using this file in its previous form, with `Disallow: /*/catalogsearch/` as the only option, will remove these two pages from indexing. Obviously, we are talking about bots that take the existence and content of this file into account.
For information, I analyzed all the bots that went through my websites for a month, and I can say that very few of them consult robots.txt. Googlebot ignores Crawl-delay.
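The kind of check described in this comment could be approximated from a web server access log. The following is only a rough sketch, not the commenter's actual method: it assumes logs in Combined Log Format, and the function name `robots_txt_usage` and the sample entries (`GoodBot`, `BadBot`) are hypothetical.

```python
import re
from collections import defaultdict

# Assumed Combined Log Format:
# ... "METHOD path HTTP/x" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def robots_txt_usage(log_lines):
    """Map each user agent to True if it requested /robots.txt at least once."""
    fetched = defaultdict(bool)
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that do not look like the assumed format
        path = match.group("path").split("?")[0]
        fetched[match.group("agent")] |= path == "/robots.txt"
    return dict(fetched)

# Hypothetical sample entries, for illustration only.
SAMPLE = [
    '203.0.113.5 - - [01/Jan/2022:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "GoodBot/1.0"',
    '203.0.113.5 - - [01/Jan/2022:00:00:02 +0000] "GET /catalogsearch/result/?q=a HTTP/1.1" 200 1024 "-" "GoodBot/1.0"',
    '198.51.100.7 - - [01/Jan/2022:00:00:03 +0000] "GET /catalogsearch/result/?q=b HTTP/1.1" 200 1024 "-" "BadBot/2.0"',
]

usage = robots_txt_usage(SAMPLE)
```

Agents mapped to `False` never fetched robots.txt in the analyzed period, so nothing in that file can influence them.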
# Disallow: */skin/

# User-agent: Googlebot-Image
# Disallow: /
Some people might rely on their product and other images being available via image search.
This is the main reason those lines have been commented out from the very beginning. But there are situations where you have a website and you don't want your images to be indexed, even by Googlebot-Image. In that case, you uncomment those lines.
That said, what is proposed here is a basic solution. It aims to make the administrator aware that this possibility also exists. I use other solutions to block or allow full or limited access for bots. I proposed a similar variant in another PR about the new content of the .htaccess file in the root.
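To illustrate what uncommenting those two lines does, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content below is a reduced stand-in rather than the sample file from this PR, and the image path is hypothetical. The stdlib parser does plain prefix matching and does not understand `*` wildcards inside paths, so the second group uses a wildcard-free rule.

```python
from urllib.robotparser import RobotFileParser

# Reduced stand-in for a robots.txt with the Googlebot-Image group uncommented.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /catalogsearch/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

IMAGE_PATH = "/media/catalog/product/a.jpg"  # hypothetical image URL path

# The image crawler is blocked from everything, including product images.
image_bot_blocked = not parser.can_fetch("Googlebot-Image", IMAGE_PATH)

# Other crawlers fall through to the "*" group and can still fetch the image.
other_bot_allowed = parser.can_fetch("Googlebot", IMAGE_PATH)
```

A well-behaved image crawler would then skip every path on the site, which is why the lines stay commented out for stores that rely on image search.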
Approving, because in the end it's a sample file and nothing has been changed that is clearly wrong.
I made the change by removing two lines, because it is obvious that we do not want them indexed. The other two lines remain under discussion. The scenario is as follows: I want to let bots crawl content under /catalogsearch/seo_sitemap and /catalogsearch/term/popular, but I don't want them to do the same for other paths that start with /catalogsearch, such as /advanced or /result.
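The scenario can be modelled with a small sketch of how crawlers that follow the longest-match convention (the most specific matching rule wins, and `Allow` wins a tie) would evaluate the discussed rules with the two `Allow` lines uncommented. This is a simplified model, not any crawler's actual implementation: the helper names `pattern_to_regex` and `is_allowed` are hypothetical, user-agent groups are ignored, and empty patterns are not handled.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern ('*' wildcard, optional trailing '$')."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Return True if `path` may be crawled under `rules`.

    The longest matching pattern wins; on a tie, Allow wins.
    A path that matches no rule is allowed.
    """
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if not pattern_to_regex(pattern).match(path):
            continue
        is_allow = directive.lower() == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed

# The variant discussed in this thread, with both Allow lines uncommented.
RULES = [
    ("Disallow", "*/catalogsearch/"),
    ("Allow", "*/catalogsearch/seo_sitemap"),
    ("Allow", "*/catalogsearch/term/popular"),
]

sitemap_ok = is_allowed("/catalogsearch/seo_sitemap/category/", RULES)
result_ok = is_allowed("/catalogsearch/result/?q=shirt", RULES)
```

Under this model the two `Allow` patterns are longer than the `Disallow` pattern, so they take precedence for the sitemap and popular-terms pages, while /advanced and /result stay blocked. Crawlers that apply rules in file order instead may behave differently.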
This is an update to the robots.txt.sample file that was recently added to OpenMage (PR #1024).
A short brief of the changes:
The file can be used in production without any issues. I have tested both the initial version and this updated version for a long time. Keep in mind that only a well-behaved bot will take this file into account. For bad bots, those that ignore the content of robots.txt and crawl everything, I will post a new PR in the next few days.