This tool fetches your site's robots.txt and tells you whether 13 of the most important AI crawlers - GPTBot, ClaudeBot, PerplexityBot, Google-Extended and 9 others - are allowed to fetch any URL you supply. It applies RFC 9309 longest-match rules, surfaces the exact directive that allowed or blocked each bot, and is free with no signup.

What is an AI crawler, exactly?
An AI crawler is an automated agent that fetches webpages on behalf of a large language model. Some crawlers harvest text to train a model (GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider). Others fetch live during a conversation to answer a user question with up-to-date data (ChatGPT-User, Perplexity-User, Claude-Web). Both types matter, and they obey the same robots.txt rules - if you block them, you disappear from the model.
Why this check matters for AI visibility
Sites accidentally block AI crawlers all the time. A blanket Disallow: / left over from an archive cleanup, a Cloudflare bot-fight rule, an over-eager security plugin - any of these can silently strip your content from the next-generation answer engines that route an increasing share of buying-intent traffic.
This tool reads your live robots.txt, applies the longest-match rule per user-agent - the evaluation RFC 9309 defines - and tells you whether each of the 13 crawlers we track is allowed to fetch the URL you supplied. If a crawler is blocked, the rule that did it is shown so you can fix it in seconds.
The two most common AI visibility regressions we see are not strategic - they're accidental. A WordPress security plugin gets aggressive, a CDN bot-fight rule gets enabled, or a junior dev pastes a generic 'block all bots' robots.txt off Stack Overflow. Three months later the team wonders why ChatGPT suddenly stopped recommending them.
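If you want a rough, do-it-yourself version of this check, Python's standard library ships a robots.txt parser. The sketch below is an approximation rather than this tool's implementation: urllib.robotparser applies rules in file order (first match wins) instead of strict RFC 9309 longest-match, so edge cases can differ, and the domain and path shown are placeholders.

```python
# Rough DIY check: ask the standard-library parser whether each AI bot may fetch a URL.
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
           "ChatGPT-User", "Claude-Web", "Perplexity-User", "OAI-SearchBot"]
URL = "https://yourdomain.com/pricing"  # placeholder: the page you care about

parser = RobotFileParser("https://yourdomain.com/robots.txt")
parser.read()  # fetches and parses the live file

for bot in AI_BOTS:
    verdict = "allowed" if parser.can_fetch(bot, URL) else "BLOCKED"
    print(f"{bot:20} {verdict}")
```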
The 13 crawlers this tool checks
Each crawler below has its own user-agent string and its own purpose. Robots.txt rules target them by name, so you can allow some and block others. We grade each independently.
| Crawler | Vendor | Purpose |
|---|---|---|
| GPTBot | OpenAI | Trains ChatGPT and GPT-series models. Most consequential bot to allow. |
| OAI-SearchBot | OpenAI | Powers ChatGPT Search results - a separate index from the training corpus. |
| ChatGPT-User | OpenAI | Live-fetches pages during a conversation when ChatGPT browses on a user's behalf. |
| ClaudeBot | Anthropic | Trains Claude models. Trusted, well-documented, respects all standard directives. |
| Claude-Web | Anthropic | Fetches pages live when Claude needs to cite a source. |
| PerplexityBot | Perplexity | Indexes pages for Perplexity answers. Citation-heavy product. |
| Perplexity-User | Perplexity | Live-fetches pages cited in answers. Blocking it kills your Perplexity citations. |
| Google-Extended | Google | Opt-out for Gemini training and Google Vertex AI. Does NOT affect Google search. |
| GoogleOther | Google | Catch-all for Google R&D crawls outside core search. |
| CCBot | Common Crawl | Builds the public corpus most open-weight models train on. |
| Bytespider | ByteDance | Trains ByteDance / TikTok AI models. Known to be aggressive. |
| Meta-ExternalAgent | Meta | Trains Llama models. Respects robots.txt as of 2024. |
| Applebot-Extended | Apple | Trains Apple Intelligence models. Recently introduced opt-out. |
Training bots vs user bots
The single most useful framing when deciding what to block: training bots harvest content once to feed model pre-training; user bots fetch a page live during a conversation, on behalf of a real user, to ground the answer. Blocking the training bots removes you from future model versions but does not affect today's recommendations. Blocking the user bots removes you from every live citation the moment the rule deploys.
Live-fetch user bots that you almost never want to block
ChatGPT-User, Claude-Web, Perplexity-User, OAI-SearchBot
These four bots are the difference between "ChatGPT can recommend me" and "ChatGPT cannot see me". Treat them as you would Googlebot.
How to fix a blocked URL
- Open your site's `robots.txt` at `https://yourdomain.com/robots.txt`.
- Find the user-agent group that matches the blocked bot - or the wildcard `User-agent: *` group.
- Replace `Disallow: /` with `Allow: /` (or remove the disallow line entirely) for the paths you want indexed.
- If you want to block training but allow live fetches, target only the training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) and leave the *-User bots open.
- Re-run this checker to confirm the change.
A copy-paste robots.txt for "allow everything reasonable"
The simplest stance: allow every documented AI crawler, block only the obviously-aggressive ones, and rely on edge rules for emergencies.
```
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

# Block aggressive scrapers
User-agent: Bytespider
Disallow: /
```
A robots.txt for "allow live citation, opt out of training"
The middle-ground stance: get cited live by ChatGPT, Claude and Perplexity, but opt out of being absorbed into the next pre-training run.
```
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

# Opt out of training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Keep live-fetch bots open
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: OAI-SearchBot
Allow: /
```
Robots.txt patterns we apply
We follow the RFC 9309 longest-match rule. If two directives match the path, the one with the longer pattern wins. Allow: /blog/ beats Disallow: /; Disallow: /blog/draft/ beats Allow: /blog/. Wildcards (*) and end-of-string anchors ($) are honoured. If no group matches the user-agent, we fall back to the * group, then to default-allow.
A few subtleties worth knowing. An empty Disallow: directive (no path) is interpreted as "allow everything". Disallow: /search? targets only paths that include the query string. $ at the end of a pattern anchors the match to the end of the URL. Crawl-delay and sitemap directives are honoured by some bots but ignored by others.
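To make the rule concrete, here is a minimal sketch of longest-match resolution for a single user-agent group. It illustrates the RFC 9309 logic described above rather than reproducing this tool's actual code, and it skips details such as percent-encoding normalisation.

```python
# Minimal longest-match resolution for one user-agent group (RFC 9309 style).
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a robots.txt path pattern into a regex matched from the path start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: (directive, pattern) pairs, e.g. ("disallow", "/blog/draft/").
    Longest matching pattern wins; on a tie, allow wins; an empty Disallow
    and the no-match case both mean the path is allowed."""
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if pattern == "":
            continue  # "Disallow:" with no path allows everything
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            if length > best_len or (length == best_len and directive == "allow"):
                best_len, allowed = length, directive == "allow"
    return allowed

# The examples from the text:
rules = [("disallow", "/"), ("allow", "/blog/"), ("disallow", "/blog/draft/")]
print(is_allowed("/blog/post", rules))        # True  - Allow: /blog/ beats Disallow: /
print(is_allowed("/blog/draft/post", rules))  # False - Disallow: /blog/draft/ wins
print(is_allowed("/pricing", rules))          # False - only Disallow: / matches
```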
Beyond robots.txt: edge-level blocks
Robots.txt is the polite layer. The forceful layer lives at the edge - Cloudflare, AWS WAF, Fastly, Akamai. Several products now bundle one-click "block AI bots" toggles, and many security plugins (Wordfence, Sucuri) ship aggressive defaults. None of those rules are visible from a robots.txt fetch.
If this checker says all crawlers are allowed but your ChatGPT mention checker shows zero mentions, suspect the edge. Specifically: check Cloudflare's "Block AI Bots" setting, Bot Fight Mode, and any custom firewall rule that filters by user-agent. Also check whether your origin is returning a 403 for the test bot - a JA3 fingerprint mismatch is the common culprit.
A 30-second edge-block diagnostic
- Open a terminal and run: `curl -A "GPTBot/1.0" -I https://yourdomain.com/your-page`
- Look at the status. `200` = your origin and edge let GPTBot through. `403` or `429` = blocked at the edge.
- Repeat with `ClaudeBot`, `PerplexityBot`, `ChatGPT-User` - or script the loop, as in the sketch after this list.
- If only some return 403, you have a per-bot edge rule. If all return 403, you have a category-level "block AI" toggle.
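The same diagnostic as a short Python script, assuming your pages are served over HTTPS; the URL and the exact user-agent strings are placeholders you should adapt. Keep in mind that some edge rules also fingerprint the TLS client, so a clean result here does not guarantee the real bots get through.

```python
# Send a HEAD request per bot user-agent and report the status code.
import urllib.request
from urllib.error import HTTPError

URL = "https://yourdomain.com/your-page"  # placeholder: use a real landing page
BOTS = ["GPTBot/1.0", "ClaudeBot", "PerplexityBot", "ChatGPT-User"]

for agent in BOTS:
    req = urllib.request.Request(URL, method="HEAD", headers={"User-Agent": agent})
    try:
        with urllib.request.urlopen(req) as resp:
            print(f"{agent:15} {resp.status}")   # 200 = allowed through
    except HTTPError as err:
        print(f"{agent:15} {err.code}")          # 403 / 429 = blocked at the edge
```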
Common mistakes
- Blocking `*` with `Disallow: /` and forgetting to add `Allow:` rules. The wildcard rule applies to bots that have no group - many AI bots fall through to it.
- Listing the wrong user-agent. `OpenAI-GPTBot` is not the right name. The correct user-agent is `GPTBot` (case-insensitive).
- Trusting the "Disallow: /admin" line in a wildcard group. AI bots may match a more specific group with no Disallow on /admin and slip through. Always check.
- Adding Crawl-delay to slow GPTBot. GPTBot does not respect Crawl-delay. Use rate limits at the edge instead.
- Forgetting that robots.txt is per-host. `www.example.com/robots.txt` and `example.com/robots.txt` are separate files, and both must be open - a quick comparison script follows this list.
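A minimal sketch of that per-host check, assuming both hostnames serve a robots.txt over HTTPS; the domains are placeholders.

```python
# Fetch robots.txt from the apex and www hosts and flag any difference.
import urllib.request

def fetch(url: str) -> str:
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

apex = fetch("https://example.com/robots.txt")
www = fetch("https://www.example.com/robots.txt")
print("identical" if apex == www else "DIFFERENT - audit both hosts")
```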
How often to run this check
- Once now - establish a baseline for your most important landing pages.
- After every robots.txt change - even a one-character edit.
- After every CDN / WAF rule deploy - those are the silent regressors.
- Monthly - on a calendar reminder, against your top 10 landing pages.
- Whenever an AI mention rate drops - the first thing to rule out is a crawl block.