Any method that relies on the crawler's good behaviour may fail, so the best option is to use the strongest authority available, which in this case is the web server itself. If you have access to the main web server configuration, or at least to the .htaccess file, you should use a method based on those elements.

The best way is to use an HTTP password (basic authentication), but if you really don't want to do that, you still have other options.
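
For reference, a minimal sketch of the HTTP password option in .htaccess, assuming you have already created a credentials file with the htpasswd tool (the /path/to/.htpasswd path is only a placeholder):

AuthType Basic
AuthName "Clients only"
# Hypothetical location of the credentials file created with htpasswd
AuthUserFile /path/to/.htpasswd
Require valid-user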

If you know the IPs of your clients, you can restrict access in your .htaccess with a simple access control block like this:

Order deny,allow
Deny from all
Allow from x.x.x.x
Allow from y.y.y.y

The IPs can be written in the form x.x.x instead of x.x.x.x, which means you will be allowing the whole block covered by the missing part.
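
Note that Order, Deny and Allow are the Apache 2.2 directives; on Apache 2.4 they only keep working through mod_access_compat. A sketch of the equivalent allowlist in 2.4 syntax, with x.x.x.x and y.y.y.y standing in for your clients' addresses, would be:

# Apache 2.4: only the listed addresses get in, everyone else receives 403
Require ip x.x.x.x
Require ip y.y.y.y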

You can combine that with HTTP status codes. A 403 tells the bot not to go there; crawlers usually retry a few times just in case, but it should take effect quickly when combined with the deny directive, which already answers blocked requests with 403.

You can use the HTTP response code even if you don't know your clients' IPs.

Another option is to redirect the request to the home page using, for instance, a 301 HTTP code, although I wouldn't recommend this method. Even though it will work, you are not telling the truth about the resource and what happened to it, so it's not a precise approach.
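
If you decided to use it anyway, a sketch with mod_rewrite could look like this, assuming the pages you want to hide live under a hypothetical /private/ path:

RewriteEngine On
# Send everything under /private/ to the home page as a permanent redirect
RewriteRule ^private/ / [R=301,L]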

Update considering your comment

You can use a list of user agent strings from known crawlers to block them in your .htaccess; a simple rule like this would do what you want:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|yahoo|yandex) [NC]
RewriteRule .* - [R=403,L]

Just add the most common ones, or the ones that have been visiting your site.
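
As a side note, mod_rewrite has a dedicated flag for this case: [F] returns 403 Forbidden on its own, so the last rule can equivalently be written as:

RewriteRule .* - [F,L]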
