Webcrawlers

This article is about webcrawlers or bots visiting websites and consuming server resources along the way.

The problem

Bots can have a major impact on site performance: it's not uncommon for the majority of visits to a webserver to come from bots. Even webshops that generate little revenue can still take a heavy toll on a webserver, simply because of bot visits.

Managing bot visits

We already use Cloudflare for protection against DDoS attacks and hackers. It can also be used for mitigating bots, which sounds interesting.

How to get a grip on them?

robots.txt

The traditional method to handle bots is the robots.txt file, which instructs well-behaved bots not to crawl or index your websites. While this won't stop malicious bots, it can help manage legitimate ones.
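
As a sketch, such a file could block a couple of aggressive crawlers while leaving everything else alone. The bot names below are just examples; check your access logs for the ones that actually visit your sites:

# Block some aggressive crawlers entirely
User-agent: MJ12bot
Disallow: /

User-agent: AhrefsBot
Disallow: /

# Everything else may crawl the whole site
User-agent: *
Disallow: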

Drawbacks:

  • Not all bots honour the robots.txt file
  • You have to repeat this file for each domain on your webserver. Perhaps keep a single copy and symlink it from the other domains?
  • You need to update this file occasionally, to include new bots and new IP addresses.

Firewall

Use your server's firewall to block bots by filtering for their IP addresses.
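
For example, with UFW or plain iptables (the addresses below are from a documentation range and stand in for real bot networks):

# Block a single address with UFW
sudo ufw deny from 203.0.113.42

# Block a whole range with iptables
sudo iptables -A INPUT -s 203.0.113.0/24 -j DROP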

Drawback: You need to update these rules occasionally, to include new bots and new IP addresses.

.htaccess

Use .htaccess files for each domain to keep bots out, by filtering for their IP addresses.
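
A sketch in Apache 2.4 syntax, again with documentation addresses as placeholders:

# Allow everyone except known bot addresses
<RequireAll>
    Require all granted
    Require not ip 203.0.113.42
    Require not ip 198.51.100.0/24
</RequireAll>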

Drawbacks:

  • This needs to be done for each domain
  • You need to update these files occasionally, to include new bots and new IP addresses

VHDF

Similar to using .htaccess files, use Apache Virtual Host Definition Files to filter out bots.
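
The same Apache 2.4 directives can be placed inside the vhost definition, e.g. (a hypothetical vhost for example.com):

<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/example.com

    <Location />
        <RequireAll>
            Require all granted
            Require not ip 203.0.113.42
        </RequireAll>
    </Location>
</VirtualHost>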

Drawbacks:

  • Updates: you still need to keep the list of bots and IP addresses current
  • This may not be an intuitive location, hampering maintenance.

Apache Configuration Files

Similar to using .htaccess or VHDFs, use a default Apache configuration file to manage bots at the server level.
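
A sketch, assuming a Debian-style Apache where such a file lives in /etc/apache2/conf-available/ and is enabled with a2enconf. Here bots are matched on their User-Agent string rather than their IP address:

# /etc/apache2/conf-available/block-bots.conf (hypothetical)
# Flag requests whose User-Agent matches known bots
BrowserMatchNoCase "MJ12bot|AhrefsBot" bad_bot

<Directory /var/www/>
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Directory>

Enable and reload:

sudo a2enconf block-bots
sudo systemctl reload apache2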

Drawbacks:

  • Updates: you still need to keep the list of bots and IP addresses current
  • This may not be an intuitive location, hampering maintenance.

Web Application Firewall (WAF)

A Web Application Firewall, like Cloudflare or AWS WAF, can help protect all your websites from unwanted bot traffic by filtering it at the network level. You can set up rules to block known bot IP addresses and customize your security settings centrally.
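
As a sketch: in Cloudflare this could be a custom firewall rule with action "Block" and an expression in Cloudflare's rule language, for instance matching a user agent while exempting verified bots such as search engine crawlers (field names as documented by Cloudflare; the bot name is an example):

(http.user_agent contains "MJ12bot") and not cf.client.bot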

Advantages:

  • Doesn't take server resources
  • Updates (like new IP addresses) are probably handled by the WAF
  • We already have an account with them.

Drawbacks:

  • It requires quite some trust in Cloudflare
  • The performance of our infrastructure depends on their performance
  • It costs money.

Use a Reverse Proxy

Implement a reverse proxy server like Nginx or HAProxy in front of your web server. These servers can handle bot traffic filtering and load balancing, reducing the load on your web server.
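
A minimal Nginx sketch that rejects a few example user agents before requests reach the backend (the bot names and the backend address are placeholders):

# In the http block: flag suspicious user agents
map $http_user_agent $is_bad_bot {
    default      0;
    ~*mj12bot    1;
    ~*ahrefsbot  1;
}

server {
    listen 80;
    server_name example.com;

    location / {
        if ($is_bad_bot) {
            return 403;
        }
        proxy_pass http://127.0.0.1:8080;
    }
}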

Use a Bot Management Service

Consider subscribing to a bot management service like Distil Networks, PerimeterX, or Imperva. These services specialize in identifying and blocking malicious bots, reducing the load on your web server.

Regularly Review Server Logs

Continuously monitor your server logs to identify and block bots that automated solutions may have missed, and adjust your rules and settings as needed. Remember that not all bots are malicious; some, like search engine crawlers, perform legitimate functions. Be cautious when blocking bots and ensure you don't accidentally block useful traffic.

I found the real-time option of tail really useful to get an impression of bot activity:

sudo tail -f /var/log/apache2/access.log
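
To see which user agents hit the server most often, a one-liner like this summarizes the access log, assuming Apache's default combined log format, where the user agent is the sixth quoted field:

awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20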

Inventory of crawlers

Webcrawlers - Inventory

See also