Block Bad Website Bots and Spiders #4860

Closed
wants to merge 3 commits

Conversation

akhilleusuggo
Contributor

A piped way to add bad crawlers to the blocked list; they are the ones slowing the site and consuming most of the traffic.
The other way would be rDNS.

@akhilleusuggo
Contributor Author

Sorry, I'm terrible with git.
I tried something new and I'm not sure if I've done it right.

@DanielnetoDotCom
Member

Hi

Instead of adding it to .htaccess, it would be better to add it here:

function isBot() {

This function is used by the cache plugin, which checks whether the client is a bot. If it is, it will only serve cached pages; if there is no cache for that page, it serves a blank page.

For example, I have customers that use Semrush; if I add that, they will have problems for sure.
At least with the cache plugin they can choose whether to stop bots or not.

@Maikuolan
Contributor

Just as an idea: If we wanted to make blocking such bots optional (e.g., adding the ability to configure which bots to block through an admin page or similar), something like this could be done through .htaccess:

php_value auto_prepend_file "/path/to/file.php"

...to ensure that a specific file is executed for every request, no matter which PHP file is actually requested. Such a file could run whatever blocking routines are necessary, pulling data from the database, set via the admin page, and so on.

Possibly some slightly increased overhead when dealing with a small list of bots, but possibly also decreased overhead if the list of bots becomes much larger (since .htaccess might not be particularly fast or efficient once the list grows, and the checks could be optimised either through PHP, or even directly through the database engine itself).
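
For illustration only, a minimal sketch of what such a prepended file could look like (the file path from the php_value line and the bot names below are just placeholders, not a tested implementation):

<?php
// Hypothetical contents of the file named in auto_prepend_file.
// It runs before whichever script was requested, so it can refuse the
// request before the session is started or the database is touched.
// The bot names here are examples only.
$blockedAgents = array('MJ12bot', 'DotBot', 'SemrushBot');
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
foreach ($blockedAgents as $agent) {
    if (stripos($userAgent, $agent) !== false) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
}

The hard-coded list here could of course instead come from the database or the admin page mentioned above.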

Anyway, just ideas, and would need further exploration regardless. :-)

@akhilleusuggo
Contributor Author

Sorry for the late reply.
@DanielnetoDotCom From my testing, this function blocks bots only from non-cached pages, if that is enabled in the plugin.
@Maikuolan Your idea is brilliant, but beyond my skill.

@DanielnetoDotCom
Member

Hi

Do you think it is worth blocking bots at all?
Because serving cached pages should not add much overhead to the system.

@Maikuolan I didn't know we could do that, but this may fail for PHP-FPM (I guess).

Maybe a separate script that does not start the session or connect to the database, just to block the undesired connections.

@akhilleusuggo
Contributor Author

akhilleusuggo commented Apr 8, 2021

@DanielnetoDotCom
1- It does not only serve cached videos. In fact, the video pages ARE NOT cached. I've tried it many times, and the video pages never really get cached.
2- They create a ton of useless cache, especially now with the new cache system that you added: millions of folders, for millions of ISPs, with millions of folders for different languages and locations (if the User_Location plugin is enabled). The option Block Bots From Non Cached Pages may harm legitimate crawlers.
3- The ones I have added to the list above do nothing but overload the database. Sometimes you can't even debug with the logs, because of the number of pages being loaded and how fast the logs grow. Semrush is the only one that could be useful, if you're using it for SEO analytics and things like that.

.... /videos/cache$ find . -type f -print | wc -l
2541822 
... /videos/cache$ du -h
............
8.9M	./getAllVideosAsync
31.5G	.

@DanielnetoDotCom
Member

1- It does not only serve cached videos. In fact, the video pages ARE NOT cached. I've tried it many times, and the video pages never really get cached.

The video metadata is cached; that is required, otherwise the site may slow down a lot.

2- They create a ton of useless cache, especially now with the new cache system that you added: millions of folders, for millions of ISPs, with millions of folders for different languages and locations (if the User_Location plugin is enabled). The option Block Bots From Non Cached Pages may harm legitimate crawlers.

Agree

3- The ones I have added to the list above do nothing but overload the database. Sometimes you can't even debug with the logs, because of the number of pages being loaded and how fast the logs grow. Semrush is the only one that could be useful, if you're using it for SEO analytics and things like that.

Agree 100%. I just need to think of an option so you can choose what to block; sometimes you may want to allow some bots to access your page.

@DanielnetoDotCom
Member

DanielnetoDotCom commented Apr 12, 2021

What about this update?

So if you add this to your configuration.php, you will stop bots before connecting to the database or opening the session:

$global['stopBotsList'] = array('bot','spider','rouwler','Nuclei','MegaIndex','NetSystemsResearch','CensysInspect');
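
A rough sketch of that ordering inside configuration.php (illustrative only, not the actual implementation; only stopBotsList comes from the update above):

<?php
// Illustrative ordering only; the real configuration.php differs.
// $global['stopBotsList'] is the array from the line above.

// 1) Check the user agent against the list first...
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
foreach ($global['stopBotsList'] as $needle) {
    if (stripos($ua, $needle) !== false) {
        header('HTTP/1.1 403 Forbidden');
        exit("Bot detected: {$needle}");
    }
}

// 2) ...and only then open the session and connect to the database.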

DanielnetoDotCom added a commit that referenced this pull request Apr 12, 2021
DanielnetoDotCom added a commit that referenced this pull request Apr 12, 2021
Letting know what bot was found
@DanielnetoDotCom
Member

A more complete list maybe

$global['stopBotsList'] = array('bot','spider','rouwler','Nuclei','MegaIndex','NetSystemsResearch','CensysInspect','slurp','crawler','curl','fetch','loader');

@Maikuolan
Contributor

A more complete list maybe

$global['stopBotsList'] = array('bot',' ...

Would that match exactly (e.g., like /^bot$/), or loosely (e.g., like /^.*bot.*$/i)? If loosely, I would be careful about "bot", since it could also block things like "Googlebot", "Bingbot", etc.
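
For illustration, with a hypothetical Googlebot user-agent string, loose substring matching also hits the legitimate crawler while an exact match does not:

<?php
// Hypothetical user-agent string, for illustration only.
$ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
var_dump(stripos($ua, 'bot') !== false);     // bool(true): loose matching also catches Googlebot
var_dump((bool) preg_match('/^bot$/', $ua)); // bool(false): an exact match does not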

@Maikuolan
Contributor

@Maikuolan I didn't know we could do that, but this may fail for PHP-FPM (I guess).

We definitely can do that, it is definitely possible, but good point. I'm also not sure how it would play with fpm. Another possible problem is that .htaccess is an Apache thing, and any solution which relies on .htaccess would be useless for Nginx (Nginx uses a nginx.conf file, which has a completely different format/syntax/structure than .htaccess), or for any other non-Apache servers which implement their own access control solutions.

@DanielnetoDotCom
Member

A more complete list maybe
$global['stopBotsList'] = array('bot',' ...

Would that match exactly (e.g., like /^bot$/), or loosely (e.g., like /^.*bot.*$/i)? If loosely, I would be careful about "bot", since it could also block things like "Googlebot", "Bingbot", etc.

Correct, this is just a general sample; you can be more specific in your stopBotsList.

@DanielnetoDotCom
Member

@Maikuolan I didn't know we could do that, but this may fail for PHP-FPM (I guess).

We definitely can do that, it is definitely possible, but good point. I'm also not sure how it would play with fpm. Another possible problem is that .htaccess is an Apache thing, and any solution which relies on .htaccess would be useless for Nginx (Nginx uses a nginx.conf file, which has a completely different format/syntax/structure than .htaccess), or for any other non-Apache servers which implement their own access control solutions.

The way it is implemented now, we do not need to modify the .htaccess.

@akhilleusuggo
Contributor Author

@DanielnetoDotCom Thank you for the effort. I'm testing it out. I removed 'bot' and 'crawler', since googlebot contains 'bot' and yandexCrawler contains 'crawler'; I wrote the names of the bots to block directly instead.

Legitimate bots do respect the robots.txt files. This function is useful only against those that ignore it.
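
For reference, the well-behaved crawlers can already be asked to stay away with plain robots.txt entries (the bot names here are only examples); that does nothing against the ones that ignore it, which is what this option is for:

User-agent: SemrushBot
Disallow: /

User-agent: MJ12bot
Disallow: /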

Thank you! I'm testing it out.

@DanielnetoDotCom
Member

Great, feel free to suggest new bot names.

DanielnetoDotCom added a commit that referenced this pull request Apr 13, 2021
Allow create a whitelist to not block some bots
@DanielnetoDotCom
Member

OK, now you can create a whitelist so that some bots are not stopped:

$global['stopBotsWhiteList'] = array('google','bing','yahoo','yandex');
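
A sketch of how the whitelist could take precedence over the stop list (illustrative only; isStoppedBot and the variable names are placeholders, not the project's actual code):

<?php
// Illustrative only; the real implementation may differ.
// Idea: whitelist entries win over stop-list entries.
function isStoppedBot($userAgent, array $stopList, array $whiteList) {
    foreach ($whiteList as $allowed) {
        if (stripos($userAgent, $allowed) !== false) {
            return false; // whitelisted crawlers are never stopped
        }
    }
    foreach ($stopList as $blocked) {
        if (stripos($userAgent, $blocked) !== false) {
            return true;
        }
    }
    return false;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (isStoppedBot($ua, $global['stopBotsList'], $global['stopBotsWhiteList'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

So, for example, Googlebot would match 'bot' in the stop list but still be let through by the 'google' whitelist entry.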

DanielnetoDotCom added a commit that referenced this pull request Apr 13, 2021
Allow create a whitelist to not block some bots
@DanielnetoDotCom
Member

I am just wondering whether this should be the default configuration (enabled by default) for new installations.

@akhilleusuggo
Contributor Author

Oh, this is great!

@JoshWho
Contributor

JoshWho commented Apr 15, 2021

I want it if it is working well. I need to stop all the bot traffic I can; it is getting bad now.
