Webcrawlers - Inventory

From De Vliegende Brigade

An overview of webcrawlers, crawlers, bots or wanderers that we have seen on our servers around 2021-2023, especially in Sep. 2023:
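An inventory like this can be compiled from the web server's access logs. The sketch below is a minimal illustration, assuming the Apache "combined" log format (where the user agent is the last quoted field). The function name `count_bots` and the `KNOWN_BOTS` list are hypothetical; the tokens are simply taken from the bot names in this article.

```python
import re
from collections import Counter

# In Apache's "combined" log format a line ends with: "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

# Tokens taken from the bot names in this inventory (assumption: a
# case-insensitive substring match is good enough for a first tally).
KNOWN_BOTS = [
    "AhrefsBot", "AhrefsSiteAudit", "Amazonbot", "Applebot", "bingbot",
    "BLEXBot", "Bytespider", "Googlebot", "MJ12bot", "PetalBot", "SeznamBot",
]


def count_bots(log_lines):
    """Count requests per known bot token in an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # not a combined-format line
        user_agent = match.group("ua").lower()
        for bot in KNOWN_BOTS:
            if bot.lower() in user_agent:
                counts[bot] += 1
                break  # count each request once
    return counts
```

It could be used as `count_bots(open("access.log", encoding="utf-8", errors="replace"))`, where the file name is an assumption. Note that "Googlebot-Image/1.0" is counted under "Googlebot" with this substring approach.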

AhrefsBot

  • Ahrefs is an SEO company, similar to Semrush and Majestic, and this is probably their 'general' bot
  • Why allow?

AhrefsSiteAudit

  • This is probably the bot that visits our site since one of our marketing partners makes use of Ahrefs SEO tools
  • User agent string: "Mozilla/5.0 (compatible; AhrefsSiteAudit/6.1; +http://ahrefs.com/robot/site-audit)"
  • Allow.

Amazonbot

Amazonbot is a web crawler operated by Amazon. Its exact purpose is not clear to us; possible explanations are:

  • Amazon Web Services (AWS): AWS offers a service called Amazon CloudSearch which can crawl a site and index its content. AmazonBot could be a part of these services to assist in crawling and indexing content
  • Alexa Internet: Amazon owned Alexa Internet, which provided web traffic analysis and site rankings and used crawlers to gather data about websites. Amazon retired the Alexa.com service in May 2022, so this explanation is unlikely for visits after that date
  • Amazon Affiliate Program: If you are (or websites linking to you are) a part of the Amazon Affiliate program, Amazon might crawl your website to verify links, ensure compliance, or gather information related to the program
  • Amazon's Retail Website: While Amazon primarily uses other mechanisms to get product data (like direct feeds from sellers), it's not outside the realm of possibility for them to crawl the web for product data, pricing, or other relevant retail information.

Benefits of AmazonBot visiting your webshops:

  • Improved Alexa Rankings: Historically, crawling by Alexa fed the Alexa rank, which many advertisers and partners used as a metric for website popularity. Since Alexa.com was retired in May 2022, this benefit no longer applies.
  • Up-to-date in AWS Services: If you use any AWS services that rely on web crawling (like CloudSearch), Amazonbot might help keep your content indexed accurately and promptly.
  • Compliance with Amazon Affiliate Program: If you're an Amazon affiliate, the bot can ensure you're compliant with their terms.

Applebot

  • User agent string: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)"
  • Seems to honour robots.txt
  • Allow?

Applebot is Apple's web crawler. Apple uses it to index web content for services including Siri, Spotlight Suggestions, and Apple News. It operates like other search crawlers such as Googlebot and bingbot, but serves Apple's own ecosystem:

  • Purpose: AppleBot is designed to index and collect information from websites to improve the search and discovery experience for Apple users. It helps power features like search results in Safari, Siri web searches, and the Apple News app
  • User-Agent: Applebot identifies itself in the User-Agent string of its HTTP requests. The User-Agent includes "Applebot" followed by a version number, for example "Applebot/0.1"
  • Respect for Robots.txt: Like other responsible web crawlers, AppleBot adheres to the rules specified in a website's robots.txt file. Webmasters can use robots.txt to control and influence what parts of their websites AppleBot is allowed to crawl and index.
  • Webmaster Tools: As far as we know, Apple does not offer a webmaster console comparable to Google Search Console. The page referenced in the user agent string (http://www.apple.com/go/applebot) is the main source of information about how Applebot interacts with your website.
  • Crawl Frequency: AppleBot's crawl frequency may vary depending on your website's content and how frequently it updates. Websites with frequently changing or time-sensitive content may see more frequent crawls from AppleBot.
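Whether a given robots.txt actually lets Applebot (or any other bot from this list) reach a URL can be checked offline with Python's standard `urllib.robotparser`. This is a minimal sketch: the `ROBOTS_TXT` content, the example URLs and the helper name `allowed` are made up for illustration and are not our real configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Applebot out of /checkout/, allow everything else.
ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /checkout/

User-agent: *
Disallow:
"""


def allowed(robots_txt, bot_token, url):
    """Return True if robots_txt permits bot_token to fetch url.

    Pass the bot's short token (e.g. "Applebot"), not the full
    user agent string: the parser only looks at the part before
    the first "/".
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot_token, url)


if __name__ == "__main__":
    for path in ("/checkout/cart", "/products/1"):
        url = "https://example.com" + path
        print(path, allowed(ROBOTS_TXT, "Applebot", url))
```

This only shows what a well-behaved crawler should do; whether a bot really follows these rules still has to be verified in the log files.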

Barkrowler

Billigerbot

  • User agent string: "billigerbot/1.0"
  • Details unknown.

bingbot

  • Bot for Microsoft's search engine Bing
  • User agent string: "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36"
  • Allow.

BLEXBot

  • Crawler from the online marketing company WebMeUp
  • User agent string: "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
  • Why allow?

BLEXBot is a web crawler associated with the WebMeUp backlink tool, which is an SEO tool focused on discovering and monitoring backlinks pointing to specific websites. BLEXBot's primary role is to crawl the web to index website content and find backlinks for the WebMeUp backlink tool. This helps SEO professionals, webmasters, and digital marketers analyze and understand the backlink profiles of their sites or competitors' sites.

Bytespider

  • User agent string: "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.0.0 Safari/537.36"

Bytespider is a web crawler typically associated with the TikTok platform, which is owned by ByteDance:

  • Purpose: The exact purpose of Bytespider has not been officially detailed by ByteDance, but like other web crawlers, its primary role is to gather, index, and analyze web data. Given its association with ByteDance, it's possible that the spider is used for various functions, including indexing content for search capabilities within platforms like TikTok
  • TikTok Videos: Bytespider might crawl websites to understand the content and context of links shared within TikTok videos or to generate previews when URLs are shared. It can also be used to ensure shared content complies with TikTok's terms of service
  • Resource Consumption: Some webmasters have raised concerns that Bytespider can be aggressive in terms of request rates, leading to increased server loads. This behavior can be problematic for websites with limited resources or hosting packages with strict limits.
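Whether a crawler such as Bytespider is really aggressive on a given server can be estimated from the access log. The sketch below counts requests per minute for one bot and reports the busiest minute. It assumes Apache-style timestamps such as `[10/Sep/2023:10:00:00 +0200]` and an English/C locale for the month abbreviation; the function name `peak_requests_per_minute` is hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Apache writes the request time between square brackets.
TIME_PATTERN = re.compile(r"\[(?P<ts>[^\]]+)\]")


def peak_requests_per_minute(log_lines, bot_token):
    """Return the highest number of requests bot_token made in one minute."""
    per_minute = Counter()
    for line in log_lines:
        if bot_token.lower() not in line.lower():
            continue  # request from someone else
        match = TIME_PATTERN.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        per_minute[stamp.replace(second=0)] += 1  # bucket by minute
    return max(per_minute.values(), default=0)
```

What counts as "too much" depends on the hosting package; the number is only a starting point for deciding between a Crawl-delay rule and a full block.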

CCBot

CCBot is the crawler of Common Crawl, a non-profit organisation that builds an open copy of the web for use by researchers and others.

Coccocbot

Dotbot

  • Crawler of the SEO company Moz, not of a domain registrar as the name might suggest
  • Seems to honour robots.txt
  • I see no reason for allowing it.

DotBot is the web crawler that Moz uses to build the link index behind its SEO tools, which puts it in the same category as AhrefsBot, BLEXBot and MJ12bot. It is less widely known than crawlers such as Googlebot or bingbot. If DotBot shows up in your server logs and you have concerns about its behaviour, refer to Moz's documentation for the crawler.

GeedoBot

geedo.com/bot/: A program used to scan webpages and in particular online stores to find products for the whole world to buy.

Googlebot

Google operates multiple bots, starting with this one:

  • User agent string: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
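  • Googlebot honours the Allow and Disallow rules in robots.txt, but it ignores the Crawl-delay directive. Use Google Search Console (formerly Google Webmaster Tools) to manage how Google crawls the site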
  • Allow

Googlebot-Image/1.0

  • User agent string: "Googlebot-Image/1.0"

Mail.RU_Bot

  • Mail.RU_Bot is developed by the Russian company Mail.Ru Group. Their bots are used for various purposes
  • It seems to honour robots.txt
  • I see no reason for allowing it.

MJ12bot

  • Crawler of the SEO tool Majestic
  • User agent string: "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
  • Why allow?

MJ12bot is the web crawler for Majestic, a search engine optimization (SEO) tool that specializes in backlink research and analysis:

  • Purpose: MJ12bot is used by Majestic to discover and index web pages to build its vast link database. Majestic provides data to webmasters, marketers, and SEO professionals about the backlink profiles of websites, which can be used to assess a site's link-based authority, competition, and other link-related metrics
  • Behavior: Historically, webmasters have complained that MJ12bot can crawl too aggressively, leading to increased server loads.

NLUX_IAHarvester

  • This is the bot of the National Library of Luxembourg
  • User agent string: "Mozilla/5.0 (compatible; NLUX_IAHarvester/3.3.0 +http://crawl.bnl.lu/)"
  • Seems fine to allow.

oBot

  • From IBM
  • oBot was already listed in the robots.txt for nl_nl before this article was written. It does not occur in the Apache log files, so it seems to follow robots.txt
  • No further info → Why allow?

Owler

PetalBot

  • Bot from Huawei's search engine Petal Search
  • User agent string: "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
  • Why not? It is always good to be included in search engines, even Chinese ones, and my customers often get sales from unexpected corners of the world, so let's not exclude them

Description

PetalBot is a web crawler associated with Petal Search, which is a search service offered by Huawei. As Huawei expanded its ecosystem, especially in the context of challenges with U.S.-based tech companies, they developed their search engine and, accordingly, their web crawler to index the web.

  • Purpose: PetalBot crawls the web to index and update website content for Petal Search. This allows users of Petal Search to find relevant search results from across the internet
  • Identification: Typically, the user-agent for PetalBot includes the term "PetalBot".

RestSharp

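  • RestSharp is a general-purpose HTTP client library for .NET, so this user agent points to a script built with it rather than to one specific bot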
  • User agent string: "RestSharp/106.12.0.0"
  • Block.

RSiteAuditor

Semrush & SemrushBot

Semrush was already listed in the robots.txt for nl_nl before this article was written. It does not occur in the Apache log files, so it probably follows robots.txt

Dilemma: Allow bots that belong to SEO tools or not?

SeznamBot

  • Czech search engine
  • User agent string: "Mozilla/5.0 (compatible; SeznamBot/4.0; +http://napoveda.seznam.cz/seznambot-intro/)"
  • Allow.

SeznamBot is the web crawler for Seznam.cz, a major search engine in the Czech Republic

SiteExplorer

  • Probably the name of a generic crawler
  • SiteExplorer was already listed in the robots.txt for nl_nl before this article was written. It does not occur in the Apache log files, so it probably follows robots.txt
  • Disallow.

TinyTestBot

  • This is the name of a generic library for creating bots
  • Doesn't seem to follow robots.txt
  • User agent string: "TinyTestBot"
  • Disallow

Seems that this is a PHP library for constructing bots. So probably, various bots with this name exist.
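For bots that ignore robots.txt, a Disallow rule has no effect and the block has to happen on the server. As a rough illustration of the idea, the sketch below rejects requests by user agent in a small Python WSGI middleware. This is an assumption-laden example, not how our Apache servers are configured: the names `BLOCKED_TOKENS`, `is_blocked` and `block_bots` are hypothetical, and the token list just reflects the "Block"/"Disallow" decisions for RestSharp and TinyTestBot in this article.

```python
# Lower-case tokens of user agents we decided to block at the server.
BLOCKED_TOKENS = ("tinytestbot", "restsharp")


def is_blocked(user_agent):
    """Return True if the user agent contains one of the blocked tokens."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_TOKENS)


def block_bots(app):
    """Wrap a WSGI app so that blocked user agents get a 403 response."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

A user agent string is trivial to fake, so this only stops bots that identify themselves honestly.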

TurnItInBot

  • Anti-plagiarism bot
  • Haven't seen this for a long time
  • Disallow

Vagabondo

  • Generic bot from Wise-guys.nl
  • Vagabondo was already listed in the robots.txt for nl_nl before this article was written. It does not occur in the Apache log files, so it presumably respects robots.txt
  • Block.

Yandex

Yandex was already listed in the robots.txt for nl_nl before this article was written. It does not occur in the Apache log files, so it presumably respects robots.txt
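Several entries in this article rest on the same reasoning: a bot is disallowed in robots.txt and never shows up in the Apache logs, so it presumably obeys robots.txt. That check can be sketched as follows. The function names are hypothetical, and the robots.txt parsing is deliberately simplified: it only recognises groups that contain a plain `Disallow: /`. Tokens are matched on word boundaries, because a short name like "oBot" would otherwise also match the word "robot" in other user agents.

```python
import re


def fully_disallowed_agents(robots_txt):
    """Return user-agent tokens whose group contains 'Disallow: /'."""
    blocked = set()
    group = []               # user-agent tokens of the group being read
    reading_agents = False   # True while consecutive User-agent lines continue
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            if not reading_agents:
                group = []   # a new group starts here
            group.append(value)
            reading_agents = True
        else:
            reading_agents = False
            if field.lower() == "disallow" and value == "/":
                blocked.update(group)
    return blocked


def silent_agents(robots_txt, log_lines):
    """Return fully disallowed agents that never occur in the log lines."""
    log_text = "\n".join(log_lines).lower()
    silent = []
    for agent in fully_disallowed_agents(robots_txt):
        if agent == "*":
            continue
        pattern = r"(?<![a-z0-9])" + re.escape(agent.lower()) + r"(?![a-z0-9])"
        if not re.search(pattern, log_text):
            silent.append(agent)
    return sorted(silent)
```

Absence from the logs is only circumstantial evidence: a bot that never visited the site at all looks exactly the same as one that obeys robots.txt.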

ZoominfoBot

  • Bot from marketing company Zoominfo
  • User agent string: "ZoominfoBot (zoominfobot at zoominfo dot com)"
  • It tried to fetch the standard WordPress XML-RPC file (xmlrpc.php), and I don't trust that.
  • Why allow?
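Requests for the WordPress XML-RPC endpoint are easy to spot in the access log. The sketch below lists which user agents asked for `xmlrpc.php` and how often, again assuming Apache's "combined" log format; the function name `xmlrpc_probers` is hypothetical.

```python
import re
from collections import Counter

# "METHOD path protocol" status size "referer" "user-agent"
LINE_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def xmlrpc_probers(log_lines):
    """Count requests for xmlrpc.php per user agent string."""
    counts = Counter()
    for line in log_lines:
        match = LINE_PATTERN.search(line)
        if not match:
            continue
        path = match.group("path").split("?")[0]  # ignore the query string
        if path.endswith("/xmlrpc.php"):
            counts[match.group("ua")] += 1
    return counts
```

Calling `xmlrpc_probers(lines).most_common(10)` would show the ten user agents that requested the file most often.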
