Webcrawlers
This article is about webcrawlers: bots that visit websites and consume server resources along the way.
The problem
Bots can have a major impact on site performance: it's not uncommon for the majority of visits to a web server to come from bots. Even webshops that don't generate much revenue can still take a heavy toll on a web server, simply because of bot visits.
Managing bot visits
How to get a grip on them?
robots.txt
The traditional method of handling bots is the robots.txt file, which instructs well-behaved bots not to crawl or index your websites. While this won't stop malicious bots, it can help manage legitimate ones.
Drawbacks:
- Not all bots honour the robots.txt file
- You have to provide this file for each domain on your web server. Maybe keep one copy of the file and link to it from the other domains?
- You need to update this file occasionally, to cover newly appearing bots
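As an illustration, a minimal robots.txt could look like this (the bot name and paths are just examples, not a recommendation):

```
# Keep one specific crawler out entirely
User-agent: BadBot
Disallow: /

# Ask all other bots to stay out of the search pages
User-agent: *
Disallow: /search/
```

Well-behaved crawlers fetch this file from the root of each domain (e.g. https://example.com/robots.txt) before crawling.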
Firewall
Use your server's firewall to block bots by filtering for their IP addresses.
Drawback: You need to update these rules occasionally, to include new bots and new IP addresses.
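As a sketch, blocking a bot's address with iptables might look like this (the address is a placeholder from the documentation range; your setup may use ufw, nftables or firewalld instead):

```shell
# Drop all traffic from a known bot address
sudo iptables -A INPUT -s 192.0.2.10 -j DROP

# The same with ufw
sudo ufw deny from 192.0.2.10
```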
.htaccess
Use .htaccess files for each domain to keep bots out, by filtering on their IP addresses.
Drawbacks:
- This needs to be done for each domain
- The rules need occasional updates (new bots, new IP addresses)
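A sketch of such an .htaccess file, using Apache 2.4's Require directives (the addresses are placeholders):

```
# Allow everyone except two known bot sources
<RequireAll>
    Require all granted
    Require not ip 192.0.2.10
    Require not ip 198.51.100.0/24
</RequireAll>
```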
VHDF
Similar to using .htaccess files, use Apache Virtual Host Definition Files to filter out bots.
Drawbacks:
- The rules need occasional updates (new bots, new IP addresses)
- This may not be an intuitive location, hampering maintenance.
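For illustration, the same IP filter could live inside a virtual host definition (domain, paths and address are placeholders):

```
<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/example

    # Keep a known bot address out of this site only
    <Location "/">
        <RequireAll>
            Require all granted
            Require not ip 192.0.2.10
        </RequireAll>
    </Location>
</VirtualHost>
```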
Apache Configuration Files
Similar to using .htaccess files or VHDFs, use a default Apache configuration file to manage bots at the server level.
Drawbacks:
- The rules need occasional updates (new bots, new IP addresses)
- This may not be an intuitive location, hampering maintenance.
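At the server level you can also filter on the User-Agent header instead of IP addresses. A sketch (the file path is a Debian-style assumption, the bot names are examples):

```
# e.g. /etc/apache2/conf-available/block-bots.conf
# Mark requests whose User-Agent matches known bots...
BrowserMatchNoCase "badbot|examplecrawler" bad_bot

# ...and refuse them everywhere
<Location "/">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Location>
```

On Debian-based systems, enabling it once (a2enconf block-bots) applies it to every virtual host.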
Web Application Firewall (WAF)
A Web Application Firewall, like Cloudflare or AWS WAF, can help protect all your websites from unwanted bot traffic by filtering it at the network level. You can set up rules to block known bot IP addresses and customize your security settings centrally.
Advantages:
- Doesn't take server resources
- Updates are probably done by the WAF (like new IP addresses)
- We already have an account with them.
Drawbacks:
- It requires quite some trust in Cloudflare
- The performance of our infrastructure becomes dependent on their performance
- It costs money.
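For a flavour of what this looks like: in Cloudflare, a custom rule consists of an expression plus an action, roughly like this (the syntax should be verified against their Rules language documentation; addresses and bot name are placeholders):

```
(ip.src in {192.0.2.10 198.51.100.0/24}) or (http.user_agent contains "BadBot")
```

with the action set to Block.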
Use a Reverse Proxy
Implement a reverse proxy server like Nginx or HAProxy in front of your web server. These servers can handle bot traffic filtering and load balancing, reducing the load on your web server.
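A sketch of the Nginx variant, mapping known bot User-Agents to a 403 before the request ever reaches the backend (bot names, port and domain are assumptions):

```
# http context: classify the client by User-Agent
map $http_user_agent $is_bot {
    default            0;
    "~*badbot|scrapy"  1;
}

server {
    listen 80;
    server_name example.com;

    location / {
        # Refuse bots, proxy everyone else to the real web server
        if ($is_bot) { return 403; }
        proxy_pass http://127.0.0.1:8080;
    }
}
```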
Use a Bot Management Service
Consider subscribing to a bot management service like Distil Networks, PerimeterX, or Imperva. These services specialize in identifying and blocking malicious bots, reducing the load on your web server.
Regularly Review Server Logs
Continuously monitor your server logs to identify and block bots that may have been missed by automated solutions. Adjust your rules and settings as needed. Remember that not all bots are malicious; some perform legitimate functions like search engine crawlers. Be cautious when blocking bots and ensure you don't accidentally block useful traffic.
I found the real-time option of tail really useful to get an impression of the activity of bots:
sudo tail -f /var/log/apache2/access.log
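For a more aggregated view, a short pipeline over the same log shows which client addresses are the busiest (the field position assumes Apache's common/combined log format, where the client IP is the first field):

```shell
# Count requests per client IP and list the ten busiest ones
awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -10
```

Suspiciously high counts from a single address are often a bot worth looking up.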