Webcrawlers - Inventory

An overview of the webcrawlers (also called crawlers, bots or wanderers) that we saw on our servers between 2021 and 2023, especially in September 2023:

AhrefsBot

Ahrefs is an SEO company, similar to Semrush and Majestic, and this is probably their 'general' bot
Why allow?

AhrefsSiteAudit

This is probably the bot that visits our site since one of our marketing partners makes use of Ahrefs SEO tools
User agent string: "Mozilla/5.0 (compatible; AhrefsSiteAudit/6.1; +http://ahrefs.com/robot/site-audit)"
Allow.

Amazonbot

User agent string: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)"
Allow?

Amazonbot is a web crawler operated by Amazon. Its exact purpose is not clear to us; possible explanations are:

Amazon Web Services (AWS): AWS offers search and indexing services such as Amazon CloudSearch. Amazonbot could be part of such a service, although we have no evidence for this
Alexa Internet: Amazon owned Alexa Internet, which crawled websites to rank them and to report on web traffic, site engagement and other metrics. That service was retired in May 2022, so it cannot explain visits after that date
Amazon Affiliate Program: If you are (or websites linking to you are) part of the Amazon Affiliate program, Amazon might crawl your website to verify links, check compliance, or gather information related to the program
Amazon's Retail Website: Amazon primarily gets product data through other mechanisms (like direct feeds from sellers), but it may also crawl the web for product data, pricing, or other retail information.

Possible benefits of Amazonbot visiting your webshops:

Alexa Rankings: Until Alexa Internet was retired in May 2022, frequent crawling could help keep a site's Alexa ranking current, a metric that advertisers and partners used for website popularity. This benefit no longer applies.
Up-to-date in AWS Services: If you use any AWS services that rely on web crawling, Amazonbot may help keep your content indexed accurately and promptly.
Compliance with Amazon Affiliate Program: If you're an Amazon affiliate, the bot can check that you comply with the program's terms.

AppleBot

User agent string: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)"
Seems to honour robots.txt
Allow?

AppleBot is the name of Apple's web crawler, which is used by Apple to index web content for its various services, including Siri, Spotlight Suggestions, and Apple News. AppleBot operates similarly to other web crawlers like GoogleBot and BingBot but is specific to Apple's ecosystem:

Purpose: AppleBot is designed to index and collect information from websites to improve the search and discovery experience for Apple users. It helps power features like search results in Safari, Siri web searches, and the Apple News app
User-Agent: AppleBot identifies itself in the User-Agent string of its HTTP requests. The User-Agent typically includes "AppleBot" followed by a version number
Respect for Robots.txt: Like other responsible web crawlers, AppleBot adheres to the rules specified in a website's robots.txt file. Webmasters can use robots.txt to control and influence what parts of their websites AppleBot is allowed to crawl and index.
Webmaster Tools: Apple provides a "Webmaster Tools" platform where website owners can submit their sitemaps and monitor how their content appears in Apple's search results. This platform also provides information about how AppleBot interacts with your website.
Crawl Frequency: AppleBot's crawl frequency may vary depending on your website's content and how frequently it updates. Websites with frequently changing or time-sensitive content may see more frequent crawls from AppleBot.
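The robots.txt behaviour described above can be checked offline before a rule goes live. The following is a minimal sketch using Python's standard urllib.robotparser module. The sample rules, the example.com URLs and the helper name is_allowed are illustrative assumptions, not our production configuration, and the parser's matching may differ in detail from what Applebot itself does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not our real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /checkout/

User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Applebot", "https://example.com/checkout/cart"),
        ("Applebot", "https://example.com/products/1"),
        ("bingbot", "https://example.com/admin/"),
    ]:
        print(agent, url, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

Note that a bot with its own group (Applebot here) is matched against that group only, so the rules under "User-agent: *" do not apply to it.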

Sources

Barkrowler

Barkrowler is the bot of online marketing company Babbar
User agent string: "Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)"
Disallow.

Sources

Billigerbot

User agent string: "billigerbot/1.0"
Details unknown.

bingbot

Bot for Microsoft's search engine Bing
User agent string: "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36"
Allow.

BLEXbot

Crawler from the online marketing company WebMeUp
User agent string: "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
Why allow?

BLEXBot is a web crawler associated with the WebMeUp backlink tool, which is an SEO tool focused on discovering and monitoring backlinks pointing to specific websites. BLEXBot's primary role is to crawl the web to index website content and find backlinks for the WebMeUp backlink tool. This helps SEO professionals, webmasters, and digital marketers analyze and understand the backlink profiles of their sites or competitors' sites.

Sources:

Bytespider

User agent string: "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.0.0 Safari/537.36"

Bytespider is a web crawler operated by ByteDance, the owner of TikTok:

Purpose: The exact purpose of Bytespider has not been officially detailed by ByteDance, but like other web crawlers, its primary role is to gather, index, and analyze web data. Given its association with ByteDance, it's possible that the spider is used for various functions, including indexing content for search capabilities within platforms like TikTok
TikTok Videos: Bytespider might crawl websites to understand the content and context of links shared within TikTok videos or to generate previews when URLs are shared. It can also be used to ensure shared content complies with TikTok's terms of service
Resource Consumption: Some webmasters have raised concerns that Bytespider can be aggressive in terms of request rates, leading to increased server loads. This behavior can be problematic for websites with limited resources or hosting packages with strict limits.
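Whether a bot is aggressive on our own servers can be estimated by counting its requests in the Apache access log. The following is a rough sketch that assumes the combined log format, in which the user agent is the last quoted field. The list KNOWN_BOTS, the function names and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Bot tokens taken from this inventory; extend as needed.
KNOWN_BOTS = ["AhrefsBot", "Amazonbot", "Applebot", "bingbot",
              "Bytespider", "MJ12bot", "PetalBot"]

# In the combined log format the user agent is the last double-quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Made-up sample lines (documentation IP addresses, invented paths).
SAMPLE_LINES = [
    '203.0.113.7 - - [12/Sep/2023:10:00:01 +0200] "GET /a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"',
    '203.0.113.7 - - [12/Sep/2023:10:00:02 +0200] "GET /b HTTP/1.1" 200 734 "-" '
    '"Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"',
    '198.51.100.4 - - [12/Sep/2023:10:00:05 +0200] "GET / HTTP/1.1" 200 901 "-" '
    '"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"',
    '192.0.2.10 - - [12/Sep/2023:10:00:09 +0200] "GET / HTTP/1.1" 200 901 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64) Firefox/117.0"',
]


def bot_name(log_line):
    """Return the first known bot token found in the line's user agent, else None."""
    match = UA_PATTERN.search(log_line)
    if not match:
        return None
    user_agent = match.group(1).lower()
    for bot in KNOWN_BOTS:
        if bot.lower() in user_agent:
            return bot
    return None


def count_bot_requests(lines):
    """Count requests per known bot; lines from unknown clients are ignored."""
    return Counter(name for name in map(bot_name, lines) if name)


if __name__ == "__main__":
    for bot, hits in count_bot_requests(SAMPLE_LINES).most_common():
        print(f"{bot}: {hits}")
```

Because user agent strings can be spoofed, these counts show what clients claim to be, not who they really are.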

CCbot

Common Crawl is a non-profit organisation that builds an open copy of the web for use by researchers and others.

User agent string: "CCBot/2.0 (https://commoncrawl.org/faq/)"
Allow

Coccocbot

Coc Coc is a Vietnamese search engine.
User agent string: "Mozilla/5.0 (compatible; coccocbot-image/1.0; +http://help.coccoc.com/searchengine)"
Allow

Dotbot

Crawler of the SEO company Moz, not of a domain registrar
Seems to honour robots.txt
I see no reason for allowing it.

DotBot is the web crawler that Moz uses to build the link index behind its SEO tools; its user agent string refers to opensiteexplorer.org/dotbot. It is therefore comparable to AhrefsBot and MJ12bot rather than to a search-engine crawler such as Googlebot or bingbot.

Webmasters and website owners may encounter DotBot in their server logs when it accesses their websites. If you have specific concerns or questions about its behaviour, refer to Moz's documentation for more information.

GeedoBot

E-commerce search engine?
User agent string: "Mozilla/5.0 (compatible; GeedoBot; +http://www.geedo.com/bot.html)"
Allow

geedo.com/bot/: A program used to scan webpages and in particular online stores to find products for the whole world to buy.

Sources:

https://geedo.com/bot/
https://udger.com/resources/ua-list/bot-detail?bot=GeedoBot

Googlebot

Google operates multiple bots, starting with this one:

User agent string: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
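Googlebot honours the allow and disallow rules in robots.txt, but it ignores the Crawl-delay directive. To influence its crawl rate, use Google Search Console (formerly Google Webmaster Tools) instead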
Allow

Googlebot-Image/1.0

User agent string: "Googlebot-Image/1.0"

Mail.RU_Bot

Mail.RU_Bot is developed by the Russian company Mail.Ru Group. Their bots are used for various purposes
It seems to honour robots.txt
I see no reason for allowing it.

Sources:

https://user-agents.net/bots/mail-ru-bot

MJ12bot

SEO-tool
User agent string: "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
Why allow?

MJ12bot is the web crawler for Majestic, a search engine optimization (SEO) tool that specializes in backlink research and analysis:

Purpose: MJ12bot is used by Majestic to discover and index web pages to build its vast link database. Majestic provides data to webmasters, marketers, and SEO professionals about the backlink profiles of websites, which can be used to assess a site's link-based authority, competition, and other link-related metrics
Behavior: Historically, there have been some complaints from webmasters regarding the behavior of MJ12bot—particularly that it can sometimes be too aggressive in its crawling frequency, leading to increased server loads.

NLUX_IAHarvester

This is the bot of the national library of Luxembourg
User agent string: "Mozilla/5.0 (compatible; NLUX_IAHarvester/3.3.0 +http://crawl.bnl.lu/)"
Seems fine to allow.

oBot

From IBM
oBot was already listed in the robots.txt for nl_nl before this article was written. It does not occur in the Apache log files, so it seems to follow robots.txt
No further info, so why allow?

Owler

Part of Open Web Search - https://openwebsearch.eu/ - Seems ok
User agent string: "Owler@ows.eu/1"
Allow.

PetalBot

Bot from Huawei's search engine Petal Search
User agent string: "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
Allow. It is always good to be included in search engines, including Chinese ones, and my customers often get sales from unexpected corners of the world, so let's not exclude it

Description

PetalBot is a web crawler associated with Petal Search, which is a search service offered by Huawei. As Huawei expanded its ecosystem, especially in the context of challenges with U.S.-based tech companies, they developed their search engine and, accordingly, their web crawler to index the web.

Purpose: PetalBot crawls the web to index and update website content for Petal Search. This allows users of Petal Search to find relevant search results from across the internet
Identification: Typically, the user-agent for PetalBot includes the term "PetalBot".

Sources

https://user-agents.net/bots/petalbot

RestSharp

RestSharp is an HTTP client library for .NET, so this user agent belongs to some custom-built client or bot rather than to one specific crawler
User agent string: "RestSharp/106.12.0.0"
Block.

Sources

RSiteAuditor

User agent string: "Mozilla/5.0 (compatible; RSiteAuditor)"
Probably the site auditor of the online analytics company AgencyAnalytics (see Sources)
Why allow?

Sources:

https://help.agencyanalytics.com/en/articles/3469586-allow-our-site-auditor-via-robots-txt-and-firewall

Semrush & SemrushBot

Semrush was already listed in the robots.txt for nl_nl before this article was written. It does not occur in the Apache log files, so it probably follows robots.txt

Dilemma: Allow bots that belong to SEO tools or not?
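One way to keep this dilemma manageable is to record the allow/disallow decisions in a single table and generate robots.txt from it, so that a policy change for SEO bots is a one-line edit. The following is a minimal sketch. The function name render_robots_txt and the DECISIONS table are hypothetical, with the example decisions copied from entries in this inventory.

```python
# True = allow, False = disallow. Example decisions copied from this inventory.
DECISIONS = {
    "Barkrowler": False,
    "bingbot": True,
    "SeznamBot": True,
    "TinyTestBot": False,
}


def render_robots_txt(decisions, default_allow=True):
    """Build robots.txt content with one group per bot plus a catch-all group."""
    lines = []
    for bot, allowed in sorted(decisions.items(), key=lambda item: item[0].lower()):
        lines.append(f"User-agent: {bot}")
        # An empty Disallow value permits everything; "Disallow: /" blocks the whole site.
        lines.append("Disallow:" if allowed else "Disallow: /")
        lines.append("")
    lines.append("User-agent: *")
    lines.append("Disallow:" if default_allow else "Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_robots_txt(DECISIONS), end="")
```

This only affects bots that honour robots.txt. Bots that ignore it, as TinyTestBot seems to, have to be blocked at the web server or firewall.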

SeznamBot

Czech search engine
Allow.

User agent string: "Mozilla/5.0 (compatible; SeznamBot/4.0; +http://napoveda.seznam.cz/seznambot-intro/)"

SeznamBot is the web crawler for Seznam.cz, which is the most popular search engine in the Czech Republic

Sources:

https://user-agents.net/bots/seznambot

SiteExplorer

Probably the name of a generic crawler
SiteExplorer was already listed in the robots.txt for nl_nl before this article was written. It does not occur in the Apache log files, so it probably follows robots.txt
Disallow.

Sources:

https://user-agents.net/bots/siteexplorer

TinyTestBot

This is the name of a generic library for creating bots
Doesn't seem to follow robots.txt
User agent string: "TinyTestBot"
Disallow

Seems that this is a PHP library for constructing bots. So probably, various bots with this name exist.

Sources:

https://udger.com/resources/ua-list/bot-detail?bot=TinyTestBot

TurnItInBot

Anti-plagiarism bot
Haven't seen this for a long time
Disallow

Sources:

Vagabondo

Generic bot from Wise-guys.nl
Vagabondo was already listed in the robots.txt for nl_nl before this article was written. It does not occur in the Apache log files. Does it respect robots.txt?
Block.

Sources:

Yandex

Yandex was already listed in the robots.txt for nl_nl before this article was written. It does not occur in the Apache log files. Does it respect robots.txt?

ZoominfoBot

Bot from marketing company Zoominfo
User agent string: "ZoominfoBot (zoominfobot at zoominfo dot com)"
It tried to fetch the standard WordPress XML-RPC file (xmlrpc.php), and I don't trust that.
Why allow?
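Requests like the one described above can be spotted in the Apache access log by looking at the requested path. The following is a small sketch that assumes the common or combined log format. The names SUSPICIOUS_PATHS and is_probe are hypothetical, and the list contains only the one path mentioned in this entry.

```python
import re

# Matches the quoted request field, e.g. "POST /xmlrpc.php HTTP/1.1".
REQUEST_PATTERN = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

# Only the path mentioned in this entry; add others as they show up.
SUSPICIOUS_PATHS = ("/xmlrpc.php",)


def is_probe(log_line):
    """Return True if the line requests one of the suspicious paths."""
    match = REQUEST_PATTERN.search(log_line)
    if not match:
        return False
    # Ignore any query string when comparing the path.
    path = match.group(1).split("?", 1)[0]
    return path.endswith(SUSPICIOUS_PATHS)


if __name__ == "__main__":
    # Made-up log line for illustration.
    line = ('198.51.100.9 - - [14/Sep/2023:08:15:00 +0200] '
            '"POST /xmlrpc.php HTTP/1.1" 404 196 "-" '
            '"ZoominfoBot (zoominfobot at zoominfo dot com)"')
    print(is_probe(line))
```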

Sources

https://www.zoominfo.com/

Sources

https://user-agents.net/bots/
