This tool fetches your site's robots.txt and tells you whether 13 of the most important AI crawlers - GPTBot, ClaudeBot, PerplexityBot, Google-Extended and 9 others - are allowed to fetch any URL you supply. It applies RFC 9309 longest-match rules, surfaces the exact directive that allowed or blocked each bot, and is free with no signup.

What is an AI crawler, exactly?
An AI crawler is an automated agent that fetches webpages on behalf of a large language model. Some crawlers harvest text to train a model (GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider). Others fetch live during a conversation to answer a user question with up-to-date data (ChatGPT-User, Perplexity-User, Claude-Web). Both types matter, and they obey the same robots.txt rules - if you block them, you disappear from the model.
Why this check matters for AI visibility
Sites accidentally block AI crawlers all the time. A blanket Disallow: / left over from an archive cleanup, a Cloudflare bot-fight rule, an over-eager security plugin - any of these can silently strip your content from the next-generation answer engines that route an increasing share of buying-intent traffic.
This tool reads your live robots.txt, applies the longest-match rule per user-agent - the evaluation RFC 9309 defines - and tells you whether each of the 13 crawlers we track is allowed to fetch the URL you supplied. If a crawler is blocked, the rule that did it is shown so you can fix it in seconds.
The two most common AI visibility regressions we see are not strategic - they're accidental. A WordPress security plugin gets aggressive, a CDN bot-fight rule gets enabled, or a junior dev pastes a generic 'block all bots' robots.txt off Stack Overflow. Three months later the team wonders why ChatGPT suddenly stopped recommending them.
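If you want a rough, do-it-yourself version of this check, Python's standard library ships a robots.txt parser. The sketch below is an approximation rather than this tool's implementation: urllib.robotparser applies rules in file order (first match wins) instead of strict RFC 9309 longest-match, so edge cases can differ, and the domain and path shown are placeholders.

```python
# Rough DIY check: ask the standard-library parser whether each AI bot may fetch a URL.
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
           "ChatGPT-User", "Claude-Web", "Perplexity-User", "OAI-SearchBot"]
URL = "https://yourdomain.com/pricing"  # placeholder: the page you care about

parser = RobotFileParser("https://yourdomain.com/robots.txt")
parser.read()  # fetches and parses the live file

for bot in AI_BOTS:
    verdict = "allowed" if parser.can_fetch(bot, URL) else "BLOCKED"
    print(f"{bot:20} {verdict}")
```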
The 13 crawlers this tool checks
Each crawler below has its own user-agent string and its own purpose. Robots.txt rules target them by name, so you can allow some and block others. We grade each independently.
| Crawler | Vendor | Purpose |
|---|---|---|
| GPTBot | OpenAI | Trains ChatGPT and GPT-series models. Most consequential bot to allow. |
| OAI-SearchBot | OpenAI | Powers ChatGPT Search results - a separate index from the training corpus. |
| ChatGPT-User | OpenAI | Live-fetches pages during a conversation when ChatGPT browses on a user's behalf. |
| ClaudeBot | Anthropic | Trains Claude models. Trusted, well-documented, respects all standard directives. |
| Claude-Web | Anthropic | Fetches pages live when Claude needs to cite a source. |
| PerplexityBot | Perplexity | Indexes pages for Perplexity answers. Citation-heavy product. |
| Perplexity-User | Perplexity | Live-fetches pages cited in answers. Blocking it kills your Perplexity citations. |
| Google-Extended | Google | Opt-out for Gemini training and Google Vertex AI. Does NOT affect Google search. |
| GoogleOther | Google | Catch-all for Google R&D crawls outside core search. |
| CCBot | Common Crawl | Builds the public corpus most open-weight models train on. |
| Bytespider | ByteDance | Trains ByteDance / TikTok AI models. Known to be aggressive. |
| Meta-ExternalAgent | Meta | Trains Llama models. Respects robots.txt as of 2024. |
| Applebot-Extended | Apple | Trains Apple Intelligence models. Recently introduced opt-out. |
Training bots vs user bots
The single most useful framing when deciding what to block: training bots harvest content once to feed model pre-training; user bots fetch a page live during a conversation, on behalf of a real user, to ground the answer. Blocking the training bots removes you from future model versions but does not affect today's recommendations. Blocking the user bots removes you from every live citation the moment the rule deploys.
Live-fetch user bots that you almost never want to block
ChatGPT-User, Claude-Web, Perplexity-User, OAI-SearchBot
These four bots are the difference between "ChatGPT can recommend me" and "ChatGPT cannot see me". Treat them as you would Googlebot.
How to fix a blocked URL
- Open your site's `robots.txt` at `https://yourdomain.com/robots.txt`.
- Find the user-agent group that matches the blocked bot - or the wildcard `User-agent: *` group.
- Replace `Disallow: /` with `Allow: /` (or remove the disallow line entirely) for the paths you want indexed.
- If you want to block training but allow live fetches, target only the training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) and leave the *-User bots open.
- Re-run this checker to confirm the change.
A copy-paste robots.txt for "allow everything reasonable"
The simplest stance: allow every documented AI crawler, block only the obviously-aggressive ones, and rely on edge rules for emergencies.
```
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

# Block aggressive scrapers
User-agent: Bytespider
Disallow: /
```
A robots.txt for "allow live citation, opt out of training"
The middle-ground stance: get cited live by ChatGPT, Claude and Perplexity, but opt out of being absorbed into the next pre-training run.
```
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

# Opt out of training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Keep live-fetch bots open
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: OAI-SearchBot
Allow: /
```
Robots.txt patterns we apply
We follow the RFC 9309 longest-match rule. If two directives match the path, the one with the longer pattern wins. Allow: /blog/ beats Disallow: /; Disallow: /blog/draft/ beats Allow: /blog/. Wildcards (*) and end-of-string anchors ($) are honoured. If no group matches the user-agent, we fall back to the * group, then to default-allow.
A few subtleties worth knowing. An empty Disallow: directive (no path) is interpreted as "allow everything". Disallow: /search? targets only paths that include the query string. $ at the end of a pattern anchors the match to the end of the URL. Crawl-delay and sitemap directives are honoured by some bots but ignored by others.
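To make the rule concrete, here is a minimal sketch of longest-match resolution for a single user-agent group. It illustrates the RFC 9309 logic described above rather than reproducing this tool's actual code, and it skips details such as percent-encoding normalisation.

```python
# Minimal longest-match resolution for one user-agent group (RFC 9309 style).
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a robots.txt path pattern into a regex matched from the path start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: (directive, pattern) pairs, e.g. ("disallow", "/blog/draft/").
    Longest matching pattern wins; on a tie, allow wins; an empty Disallow
    and the no-match case both mean the path is allowed."""
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if pattern == "":
            continue  # "Disallow:" with no path allows everything
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            if length > best_len or (length == best_len and directive == "allow"):
                best_len, allowed = length, directive == "allow"
    return allowed

# The examples from the text:
rules = [("disallow", "/"), ("allow", "/blog/"), ("disallow", "/blog/draft/")]
print(is_allowed("/blog/post", rules))        # True  - Allow: /blog/ beats Disallow: /
print(is_allowed("/blog/draft/post", rules))  # False - Disallow: /blog/draft/ wins
print(is_allowed("/pricing", rules))          # False - only Disallow: / matches
```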
Beyond robots.txt: edge-level blocks
Robots.txt is the polite layer. The forceful layer lives at the edge - Cloudflare, AWS WAF, Fastly, Akamai. Several products now bundle one-click "block AI bots" toggles, and many security plugins (Wordfence, Sucuri) ship aggressive defaults. None of those rules are visible from a robots.txt fetch.
If this checker says all crawlers are allowed but your ChatGPT mention checker shows zero mentions, suspect the edge. Specifically: check Cloudflare's "Block AI Bots" setting, Bot Fight Mode, and any custom firewall rule that filters by user-agent. Also check whether your origin is returning a 403 for the test bot - a JA3 fingerprint mismatch is the common culprit.
A 30-second edge-block diagnostic
- Open a terminal and run: `curl -A "GPTBot/1.0" -I https://yourdomain.com/your-page`
- Look at the status. `200` = your origin and edge let GPTBot through. `403` or `429` = blocked at the edge.
- Repeat with `ClaudeBot`, `PerplexityBot`, `ChatGPT-User` - or script the loop, as in the sketch after this list.
- If only some return 403, you have a per-bot edge rule. If all return 403, you have a category-level "block AI" toggle.
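The same diagnostic as a short Python script, assuming your pages are served over HTTPS; the URL and the exact user-agent strings are placeholders you should adapt. Keep in mind that some edge rules also fingerprint the TLS client, so a clean result here does not guarantee the real bots get through.

```python
# Send a HEAD request per bot user-agent and report the status code.
import urllib.request
from urllib.error import HTTPError

URL = "https://yourdomain.com/your-page"  # placeholder: use a real landing page
BOTS = ["GPTBot/1.0", "ClaudeBot", "PerplexityBot", "ChatGPT-User"]

for agent in BOTS:
    req = urllib.request.Request(URL, method="HEAD", headers={"User-Agent": agent})
    try:
        with urllib.request.urlopen(req) as resp:
            print(f"{agent:15} {resp.status}")   # 200 = allowed through
    except HTTPError as err:
        print(f"{agent:15} {err.code}")          # 403 / 429 = blocked at the edge
```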
Common mistakes
- Blocking `*` with `Disallow: /` and forgetting to add `Allow:` rules. The wildcard rule applies to bots that have no group - many AI bots fall through to it.
- Listing the wrong user-agent. `OpenAI-GPTBot` is not the right name. The correct user-agent is `GPTBot` (case-insensitive).
- Trusting the "Disallow: /admin" line in a wildcard group. AI bots may match a more specific group with no Disallow on /admin and slip through. Always check.
- Adding Crawl-delay to slow GPTBot. GPTBot does not respect Crawl-delay. Use rate limits at the edge instead.
- Forgetting that robots.txt is per-host. `www.example.com/robots.txt` and `example.com/robots.txt` are separate files, and both must be open - a quick comparison script follows this list.
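A minimal sketch of that per-host check, assuming both hostnames serve a robots.txt over HTTPS; the domains are placeholders.

```python
# Fetch robots.txt from the apex and www hosts and flag any difference.
import urllib.request

def fetch(url: str) -> str:
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

apex = fetch("https://example.com/robots.txt")
www = fetch("https://www.example.com/robots.txt")
print("identical" if apex == www else "DIFFERENT - audit both hosts")
```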
How often to run this check
- Once now - establish a baseline for your most important landing pages.
- After every robots.txt change - even a one-character edit.
- After every CDN / WAF rule deploy - those are the silent regressors.
- Monthly - on a calendar reminder, against your top 10 landing pages.
- Whenever an AI mention rate drops - the first thing to rule out is a crawl block.