AI-crawler user agents: the complete list for 2026

Which AI-crawlers visit your website without you knowing?

Your website receives daily visits from bots that never show up in Google Analytics. GPTBot, ClaudeBot and PerplexityBot crawl your pages to train AI models and generate answers, and the Google-Extended token governs whether Google may use your content for Gemini. The difference from traditional search engine crawlers? These bots influence whether your brand is mentioned in AI-generated answers.

Without insight into which AI-crawlers are active, you lose control over your AI visibility.

In this reference article, you'll find the complete list of all relevant AI-crawler user agents for 2026, including their function and how to manage them via your robots.txt and llms.txt.

The complete list of AI-crawler user agents for 2026

The table below lists the known AI-crawler user agents that matter most for your AI visibility. Use this list as a reference when configuring your technical GEO setup.

User agent | Owner | Primary function | Active since
GPTBot | OpenAI | Training OpenAI models | 2023
OAI-SearchBot | OpenAI | ChatGPT Search results | 2024
ChatGPT-User | OpenAI | Real-time browsing by ChatGPT | 2023
ClaudeBot | Anthropic | Training Claude models | 2023
PerplexityBot | Perplexity AI | Real-time Perplexity search results | 2023
Google-Extended | Google | Controls use of content for Gemini training (robots.txt token) | 2023
Googlebot | Google | Indexing and AI Overviews | 2004
Bytespider | ByteDance | AI model training (TikTok) | 2022
CCBot | Common Crawl | Open dataset for AI training | 2011
Applebot-Extended | Apple | Apple Intelligence features | 2024
Meta-ExternalAgent | Meta | Training Meta AI models | 2024
Amazonbot | Amazon | Alexa and Amazon AI services | 2022
cohere-ai | Cohere | Training enterprise AI models | 2024

This list is a snapshot. New crawlers appear regularly. Want to automatically check which bots can reach your site? The GrowthScope Quickscan validates this within 2 to 5 minutes, without an account or API keys.
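
For a quick manual check, you can also test a robots.txt file yourself. The Python sketch below uses the standard library's urllib.robotparser to report which AI-crawlers may fetch a path. The function name check_access, the AI_AGENTS selection and the sample rules are illustrative assumptions. The standard-library parser matches user agents more loosely than some crawlers do, so treat the result as an indication rather than a guarantee.

```python
from typing import Dict, Iterable
from urllib.robotparser import RobotFileParser

# Illustrative selection of user agent tokens from the table above.
AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "PerplexityBot", "Google-Extended",
]


def check_access(robots_txt: str,
                 agents: Iterable[str] = AI_AGENTS,
                 path: str = "/") -> Dict[str, bool]:
    """Return, per user agent, whether robots_txt allows fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


if __name__ == "__main__":
    # Hypothetical robots.txt: block everything, then re-allow one bot.
    sample = "User-agent: *\nDisallow: /\n\nUser-agent: OAI-SearchBot\nAllow: /\n"
    for agent, allowed in check_access(sample).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

To test a live site, download its robots.txt first and pass the text to check_access.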

What does each AI-crawler do exactly?

Not all AI-crawlers are the same. The distinction lies in the difference between training and retrieval.

Training crawlers

GPTBot, ClaudeBot, Bytespider and CCBot collect content to train AI models. Your text becomes part of the training data that shapes what the model knows. Blocking these crawlers keeps your content out of future model versions, but has no direct effect on current answers.

Retrieval crawlers

OAI-SearchBot, ChatGPT-User and PerplexityBot retrieve real-time information to generate current answers. If you block these crawlers, the platforms can no longer fetch your pages, so your brand can drop out of their live answers and search results right away. This is the most impactful distinction for your AI visibility.

Hybrid crawlers

Google-Extended and Googlebot operate at the intersection. Googlebot is essential for regular indexing and simultaneously feeds AI Overviews. Google-Extended is a robots.txt control token rather than a separate crawler: it governs whether Google may use your content for Gemini training. Never block Googlebot unless you also want to disappear from regular search results.
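
The distinction above can be captured in a small lookup table. The following Python sketch maps the user agent tokens from this article to their type and classifies a raw User-Agent header. The names CRAWLER_TYPES and classify_user_agent and the sample header are assumptions for illustration, and the substring matching is deliberately simple.

```python
from typing import Optional, Tuple

# Crawler types as described in this article.
# Google-Extended is a robots.txt token and will not appear in request
# headers; it is listed only to keep the mapping complete.
CRAWLER_TYPES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Bytespider": "training",
    "CCBot": "training",
    "OAI-SearchBot": "retrieval",
    "ChatGPT-User": "retrieval",
    "PerplexityBot": "retrieval",
    "Googlebot": "hybrid",
    "Google-Extended": "hybrid",
}


def classify_user_agent(header: str) -> Optional[Tuple[str, str]]:
    """Return (token, type) for a known AI-crawler, or None otherwise."""
    lowered = header.lower()
    for token, kind in CRAWLER_TYPES.items():
        if token.lower() in lowered:
            return token, kind
    return None


if __name__ == "__main__":
    # Illustrative header; real crawlers send longer strings.
    print(classify_user_agent("Mozilla/5.0 (compatible; PerplexityBot/1.0)"))
```

A user agent header can be spoofed, so a match here shows what a request claims to be, not who actually sent it.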

How do you configure your robots.txt for AI-crawlers?

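Your robots.txt file is your first line of defense. Keep in mind that compliance is voluntary: well-behaved crawlers follow it, but it cannot stop a bot that ignores it. Below you'll find two reference configurations that you can implement immediately.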

Example: allow all AI-crawlers (recommended for maximum visibility)

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: OAI-SearchBot
Allow: /

Example: block training-crawlers, allow retrieval

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Allow: /
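
If you manage several sites, you can generate this kind of configuration from one policy instead of editing it by hand. A minimal Python sketch, assuming a hypothetical render_robots_txt helper and a policy dictionary that maps each user agent to allow (True) or block (False):

```python
from typing import Dict


def render_robots_txt(policy: Dict[str, bool]) -> str:
    """Build one robots.txt group per user agent from an allow/block policy."""
    groups = []
    for agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Same policy as the example above: block training, allow retrieval.
    policy = {
        "GPTBot": False,
        "OAI-SearchBot": True,
        "ChatGPT-User": True,
        "ClaudeBot": False,
        "PerplexityBot": True,
    }
    print(render_robots_txt(policy), end="")
```

The sketch only handles site-wide rules. Path-specific rules would need a richer policy structure.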

Be aware: a robots.txt error can disable your entire AI presence. The GrowthScope audit automatically validates your configuration and generates an llms.txt template as a supplement to your robots.txt.

Why llms.txt is a supplement to robots.txt

The robots.txt file tells crawlers what they may and may not access. The llms.txt file, a proposed convention rather than a formal standard, goes one step further. It describes what your organization does, which pages contain the most value, and how your brand should be described.

Think of it this way:

  • robots.txt controls access (yes or no)
  • llms.txt controls context (who you are, what you offer)

Both files together form the technical foundation of your Generative Engine Optimization.

Without llms.txt, AI systems have less explicit context for citing your brand correctly in answers, although support for the file still varies by provider.
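
To make this concrete, here is a minimal Python sketch that builds an llms.txt file. It assumes the layout from the public llms.txt proposal (a title, a short summary in a blockquote, then sections with link lists). The helper render_llms_txt, the company name and the example.com URLs are placeholders, not real data.

```python
from typing import Dict, List, Tuple

# Each link is (title, url, short description).
Link = Tuple[str, str, str]


def render_llms_txt(name: str, summary: str,
                    sections: Dict[str, List[Link]]) -> str:
    """Render a minimal llms.txt document as Markdown text."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder content for illustration only.
    print(render_llms_txt(
        "Example Co",
        "Example Co sells example widgets to small businesses.",
        {"Key pages": [
            ("Pricing", "https://example.com/pricing", "Plans and prices"),
            ("About", "https://example.com/about", "Who we are"),
        ]},
    ), end="")
```

Save the output as llms.txt in the root of your site, next to robots.txt.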

Common mistakes in managing AI-crawlers

We regularly see the following configuration errors in GrowthScope audits:

  • Wildcard blocking: A Disallow: / under User-agent: * also blocks every AI-crawler that has no group of its own, making your brand invisible to those AI engines.
  • Confusing Google-Extended with Googlebot: Blocking Googlebot removes you from all Google results, not just AI Overviews.
  • Blocking retrieval-bots: Disallowing OAI-SearchBot and ChatGPT-User together with GPTBot, or blocking everything with a wildcard rule without explicitly allowing them. Result: no real-time visibility in ChatGPT.
  • No monitoring: AI-crawlers are regularly updated or renamed. Without periodic validation, your configuration becomes outdated.
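
Periodic validation can start with your own server logs. The Python sketch below counts requests per AI-crawler, assuming access-log lines in the common combined format where the User-Agent is the last quoted field. The names AI_TOKENS and count_ai_hits and the sample log lines are illustrative assumptions.

```python
import re
from collections import Counter
from typing import Iterable

# Subset of tokens from this article that appear in request headers.
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "Bytespider", "CCBot", "Meta-ExternalAgent",
    "Amazonbot",
]

# Assumption: the User-Agent is the last double-quoted field on the line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines: Iterable[str]) -> Counter:
    """Count log lines per AI-crawler token found in the User-Agent field."""
    hits = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # Line does not end with a quoted field; skip it.
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


if __name__ == "__main__":
    # Made-up sample lines using documentation IP addresses.
    sample = [
        '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
        '203.0.113.9 - - [10/Jan/2026:10:01:00 +0000] "GET /blog HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    ]
    for token, count in count_ai_hits(sample).items():
        print(f"{token}: {count}")
```

A token that stops appearing in your logs can signal that a crawler was renamed or that a rule now blocks it.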

Want to avoid these mistakes? Discover the 5 biggest GEO mistakes, or start a scan directly at growthscope.io.

Next step: validate your AI-crawler configuration

You now have the complete reference list of AI-crawler user agents for 2026. The question is not whether these bots visit your site. The question is whether they find the right content and cite your brand correctly.

Start your GEO audit and receive a complete report within 10 minutes with your GEO Readiness Score, robots.txt validation and a ready-to-use llms.txt template.